Strengths and weaknesses of typical R project workflows

Concept: Data Analysis Pipelines

  • Data goes in; answers, insights, all the magic come out.
  • ‘Pipeline’ implies a process that is a linear progression from inputs to outputs.
  • Contrast this with a process that looks more like a continuous loop, where the aim is to receive input data, react to it, and then rest waiting for the next piece of data.
    • E.g. a software application
  • The linear aspect is often reflected in how we structure our data analysis projects.

Concept: Reproducible Data Analysis

  • The ‘reproducibility’ associated with data analysis pipeline tools is not the ‘replicability’ of the replication crisis in science, though the two are related.
  • We can ‘reproduce’:
    • The same conclusion given the same input data, and following the same analysis process.
      • e.g. your colleague’s work on your computer
    • A valid analysis given different input data, and following the same analysis process.
      • Conclusion might be different
  • What are the reasons we might want to do this? What are the benefits of reproducibility?
    • To answer questions about, or make extensions to, an analysis in the future
      • ‘Boomerang effect’
    • To be able to make realistic predictions about how long a data analysis will take
    • To ensure consistent conclusions are reached on related questions
      • Requires consistent definitions for inputs and key metrics
    • In a nutshell: reliability and consistency
      • Without these you don’t have a viable data analysis capability

Reproducibility and Code

  • Code works in favour of reproducibility.
    • It’s not guaranteed, but well-written code can give a deterministic procedure for data analysis: use the same dataset with the same code and you should reproduce the same answer (see the seed sketch after this list).
  • In an ideal world every data analysis could be a single succinct script of beautifully aesthetic code, easily understood by humans and machines alike.
    • In practice this is rarely possible due to certain forces. What forces pull apart that perfect script?
      • Domain mismatch: we need to write a lot of code
      • External systems: we need to tread lightly on them
      • Expensive computations: repeatedly performing them is infeasible
      • Division of labour
      • (?)
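
On the determinism point above: the most common threat to ‘same code, same data, same answer’ is random number generation, which you can pin down by fixing the seed. A minimal sketch:

    # Without a fixed seed, e.g. a train/test split differs on every run.
    set.seed(42)
    train_idx <- sample(nrow(mtcars), 20)  # same rows every run,
                                           # given the same R version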

Classic approaches to R projects

Script per pipeline ‘stage’

The most common approach to balancing reproducibility versus other concerns is to break the pipeline up into discrete scripts that map to stages in the linear pipeline. These stages might be conceived of as something like:

  1. Acquire data
  2. Wrangle data
  3. Visualise data
  4. Model data
  5. Present findings

With variations as required by context.

A typical folder structure might look something like:

 .
├── data
│   ├── processed_data.Rds
│   └── raw_data.csv
├── doc
│   ├── exploratory_analysis.Rmd
│   └── report.Rmd
├── output
│   ├── insightful_plot.png
│   └── final_model.Rds
├── R
│   └── helpers.R
├── README.md
├── run.sh
└── scripts
    ├── 01_load_data.R
    ├── 02_wrangle_data.R
    ├── 03_visualise_data.R
    ├── 04_model_data.R
    └── 05_render_report.R

There are many variations on this idea. There might be multiple scripts per stage, e.g. 03_visualise_data.R might become one script per plot (figure), or 04_model_data.R one script per model. Using more folders also seems popular.

The key element is that the data analysis is broken down into a series of stages, each of which is captured by a single script file. Quite often these script files are numbered, with the implicit understanding that the correct way to run the pipeline is to run the scripts in numerical order¹.

If the author is diligent, the README.md will contain information about how to run the pipeline, and may provide some kind of run.R or run.sh script that acts as the ‘entry-point’ for kicking off pipeline execution. This is intended to be the thing you run to reproduce the author’s results.
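
A minimal sketch of what such an entry-point might look like, here as a run.R (stage scripts as in the tree above):

    # run.R (sketch): run each numbered stage in order, each in a fresh
    # environment, so the pipeline reproduces from scratch.
    for (script in sort(list.files("scripts", full.names = TRUE))) {
      message("Running ", script)
      source(script, local = new.env())
    }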

This can be something of a trap, because the author likely does not actually use the run.sh script as part of their own workflow. Why would this be so?

Reasons for not using the ‘run everything’ entry-point:

  • The author is likely taking advantage of R’s REPL for interactive development, which relies on operating on incomplete pipeline state.
    • Running the whole pipeline is too slow. The author can’t iterate fast enough if they have to re-run all earlier stages just to make small tweaks to a later stage, e.g. playing with plot presentation.
    • Running the whole pipeline would have undesirable side-effects, like pulling a large amount of data from an API.

So the workflow used in practice tends to be some combination of:

  • Interactively run the numbered scripts up to the one you want to work on, then manually step through code to create the prerequisite state in the global environment.
  • Shortcut the early stages of the pipeline by having them write intermediate output files that are read as inputs to later stages, so that stages can be worked on independently (sketch below).
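
A minimal sketch of the intermediate-file shortcut, using the file names from the tree above (the filtering step is hypothetical):

    # scripts/02_wrangle_data.R (sketch)
    raw <- read.csv("data/raw_data.csv")
    processed <- raw[complete.cases(raw), ]        # some wrangling
    saveRDS(processed, "data/processed_data.Rds")  # intermediate output

    # scripts/04_model_data.R (sketch)
    # Later stages re-read the intermediate file, so they can be worked on
    # without re-running everything before them:
    processed <- readRDS("data/processed_data.Rds")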

When things go stale

It’s the real, yet informal, workflow that creates problems.

If we used this project structure, and always ran the pipeline from start to finish, we would always know whether our code was in a state consistent with valid outputs².

If we work interactively, as we inevitably will³, we create opportunities for the pipeline’s code and outputs (be they intermediate or final) to be in conflict. Here are a few examples:

  1. Changes are made to 02_wrangle_data.R to support better modeling in 04_model_data.R. We were so keen to write about the improved results that we forgot to re-run 03_visualise_data.R, which also outputs some image files used in report.Rmd. When we render report.Rmd it contains images showing data that was dropped before running the new model, and our boss is confused as to why we didn’t remove them yet. We look silly.

  2. In the midst of running 04_model_data.R interactively, we forget that a |> chain of data.table transformations can modify the dataset at the head of the chain in-place (see the sketch below). We tweak a chain before running it again, leading to some columns being transformed twice. When we run the modeling code to completion we get some unexpectedly good results, and save that model for later use in report.Rmd. We write the report around these results, only to have everything sour at the last minute when we try to run the entire pipeline as one with run.sh and a completely different set of results appears.
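
A minimal sketch of the in-place modification hazard (assumes R ≥ 4.3 for the _ placeholder in extraction calls):

    library(data.table)

    dt <- data.table(x = 1:3)

    # Looks like a pure functional chain, but `:=` updates dt by reference:
    res <- dt |>
      _[, x := x * 2] |>
      _[x > 2]

    dt$x
    # [1] 2 4 6  <- dt itself has changed. Re-run the chain while
    # iterating interactively and x doubles again: stale state,
    # now living inside dt.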

Both of these examples are different aspects of the same core problem. Working interactively with code that can accumulate data (i.e. files on disk, data.frames in the global environment, rows in a database, etc.) creates the opportunity for the code and the accumulated data to be in an inconsistent state. Sometimes this accumulated data is itself referred to as ‘global state’ or simply ‘state’, and people might say our problem was caused by ‘stale state’; that is: we are working with data that is no longer representative of what our program would output if we ran it from scratch.

Cycles vs Lines

  • The project structure is strongly linear: Each script is assumed to be fully dependent on those prior, just as each line of code is on the one before.
  • Our work pattern is strongly cyclic as we iteratively refine our reasoning, statistical methods, data visualisations, etc. This involves making smaller, targeted changes to code at all stages of the pipeline.

The forces we have discussed exert pressure to avoid running the pipeline in a linear fashion. This creates space for issues:

  • The pipeline either fails to run, or gives unexpected results, when finally run in a complete pass. Reproducibility fail.
  • Concepts drift between script files as they are worked on piecemeal, e.g. the same dataset is loaded in multiple script files but referred to by different names. Coherency fail.

As we will see, {targets} removes the pressure to run the entire pipeline end to end, and allows us to work iteratively without the risk of these problems, perhaps faster than ever before.
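
As a preview, a {targets} pipeline replaces the numbered scripts with a dependency graph declared in _targets.R. A minimal sketch, where wrangle() and fit_model() are hypothetical helpers defined in R/:

    # _targets.R (sketch)
    library(targets)
    tar_source()  # load helper functions from R/

    list(
      tar_target(raw, read.csv("data/raw_data.csv")),
      tar_target(processed, wrangle(raw)),
      tar_target(model, fit_model(processed))
    )

    # tar_make() then runs only the targets whose code or upstream data
    # has changed, so nothing silently goes stale.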

Rmd / Quarto Monolith

The idea of keeping code and output synchronised is often introduced to motivate the use of literate programming tools like Rmarkdown or Quarto. They definitely have something to contribute here, and they work very well for educational material (like this workshop!), but the format does not scale well to large and complex data science projects.

There are two main difficulties:

  1. Fundamentally, the format is geared toward producing a single output, which is a text of some kind. Complex data science projects often have a myriad of other outputs, including models, datasets, and other documents. Having your model run binned because pandoc balked at your markdown syntax is not sensible.

    Rmarkdown and Quarto offer a caching feature to try to mitigate this, but it involves manual cache management, and it does not give you control over serialisation formats, which means certain objects will be unable to be restored from cache correctly. It’s up to you to discover which.
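
    For reference, the cache is opt-in per chunk via knitr’s cache option (a minimal sketch; the chunk is re-run only when its code changes):

        ```{r slow-model, cache=TRUE}
        # Result is serialised to disk and reloaded on subsequent renders.
        model <- lm(mpg ~ wt, data = mtcars)
        ```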

  2. In projects that involve complex data wrangling or modeling a tension can develop between the text and the code, where the code needs to be complex, but the text is pitched at a different (usually higher) conceptual level. The two fight for the narrative thread, and make for a disjointed / confusing reading experience. I call this illiterate programming.

My advice: definitely do use Rmarkdown or Quarto, but avoid shoehorning an entire pipeline into the document. Have a pipeline produce the intermediate outputs separately, and have the document-generation step read them in and apply only superficial coding treatment, e.g. light wrangling into presentation-layer plots or tables.
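
In practice that means report.Rmd chunks that only read and present finished pipeline outputs, something like:

    # inside report.Rmd (sketch, using output names from the tree above)
    model <- readRDS("output/final_model.Rds")
    knitr::include_graphics("output/insightful_plot.png")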

Introducing our project V1: ‘Classic R Project’

In this workshop we’re going to refactor this project into a {targets} pipeline.

Nominally the project is about fetching some species distribution data from an API, merging that with some weather data, training a species classification model, and producing a report.

  • Take a few minutes to poke around
  • Read script 01 and 02.
    • Is it clear why things in 02 are the way they are?
      • Is it clear why we do that H3 hex index thing?
      • Has anyone looked at real estate listings recently? Which gives you more context: clicking through all the beautiful images, or one look at the floor plan?
      • Or imagine reading a textbook that had no table of contents.

Footnotes

  1. Occasionally this presents refactoring challenges, where a new stage needs to be added late in development and the author might roll with a 02b_ prefix to save updating too many paths.↩︎

  2. Or would we? How do we decide what valid outputs are? More on this later.↩︎

  3. Probably one of the reasons you’re using R is the gloriously fluent conversation you can have with your data via the REPL. Rapid feedback, when you need to simultaneously learn about data and program around it, just feels way too good compared to the alternative, where you have to wait for a heavy process to spool up and run each time you have a simple question.↩︎