Automation

Note: This is an optional topic.

From today’s class, students are expected to be able to:

  • Use make
    • to record which files are inputs vs. intermediates vs. outputs
    • to capture how scripts and commands convert inputs to outputs
    • to re-run parts of an analysis that are out-of-date
  • Write a Makefile.
  • Interact with make in RStudio.
  • Use make from the shell.

Other tools besides make (we won’t be covering these):

  • ProjectTemplate
  • remake for R

Resources

No video lecture for this optional topic. Written material:

Test Your Understanding

Use these questions to check your understanding of the material.

  1. True or false: You’ve opened Terminal, and are now about to run Rscript for a second time. Because you haven’t restarted the Terminal, the code will continue to run in the same R session as before.
  2. You have information in script1.R that you’d like to pass to script2.R. True or false: the best way to pass the info to script2.R is by saving the final workspace from script1.R in an .RData or .rds file, and loading it into script2.R.
  3. True or false: It’s almost always better to write an .Rmd file than an .R file, because you’re able to interleave markdown with your code.
  4. True or false: running make on a phony target will always run its rules, whereas running make on a target file will only run its rules if the output needs updating.
  5. True or false: If a dependency file is not present on your computer, you can still call make error-free if the dependency is the target of another rule. This is true even if the dependency never gets made.
  6. True or false: The rules to make a target file will be run if either the target file or the dependencies are modified.
  7. True or false: Dependencies in make are accessed “lazily”, so that if the dependencies are never actually used when executing a rule, they don’t actually have any impact on the Makefile.

Why Automation

It often makes sense to break up a task (e.g. “analyze data and turn it into publication-ready figures and tables”) into smaller chunks, e.g. “data cleaning” vs. “summarizing and plotting” vs. “model fitting”. This leads to a pipeline: a system where the code for some tasks (e.g. summarizing and plotting) depends on the output of others (e.g. data cleaning).

One of the major advantages of this paradigm: you no longer have to re-run all of the code every time you make a change. You only need to run the parts downstream from what you changed.

But how do we keep track of what needs to be re-run when we make changes in this system? We could do it by hand, but this is likely to cause human error (recall the reproducibility principle!). It’s much safer to automate. We will be learning how to use Makefiles for this purpose.
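To give a flavour of what’s coming, here is a minimal sketch of such a Makefile. All file and script names here are hypothetical, just to illustrate how rules record inputs, intermediates, and outputs:

```make
# Hypothetical pipeline: raw data -> cleaned data -> figure.
# File and script names are illustrative placeholders.

all: figure.png

# Intermediate: the cleaned data depends on the raw input and the cleaning script.
cleaned.csv: raw.csv clean.R
	Rscript clean.R

# Output: the figure depends on the cleaned data and the plotting script.
figure.png: cleaned.csv plot.R
	Rscript plot.R

# Phony target: always runs, since no file named "clean" is produced.
.PHONY: all clean
clean:
	rm -f cleaned.csv figure.png
```

With this in place, editing plot.R and running make would re-make only figure.png, while editing raw.csv would re-make both cleaned.csv and figure.png. (Note that recipe lines must begin with a tab character, not spaces.)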

This will be challenging! But the payoff is huge for larger projects. Shaun Jackman gives an example of a Bioinformatics paper that is generated with a single Makefile that:

  • Downloads the data
  • Runs command-line programs
  • Performs the statistical analyses using R
  • Generates TSV tables
  • Renders figures using ggplot2
  • Renders supplementary material using RMarkdown
  • Renders the manuscript using Pandoc

And critically, knows which parts need to be run and which parts do not. Amazing, right?
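The steps above can be sketched as one Makefile. Every file name, URL, and script below is a hypothetical placeholder, not taken from the actual paper:

```make
# Sketch of a paper-building pipeline like the one described above.
# All names and the URL are illustrative placeholders.

all: paper.pdf supplement.html

# Download the data (placeholder URL)
data.tsv:
	curl -o data.tsv https://example.com/data.tsv

# Run the statistical analysis in R, producing a TSV table
results.tsv: data.tsv analysis.R
	Rscript analysis.R

# Render a figure with ggplot2
figure1.png: results.tsv plot.R
	Rscript plot.R

# Render supplementary material with R Markdown
supplement.html: supplement.Rmd results.tsv
	Rscript -e 'rmarkdown::render("supplement.Rmd")'

# Render the manuscript with Pandoc
paper.pdf: paper.md figure1.png
	pandoc paper.md -o paper.pdf
```

Running make walks this dependency graph and runs only the recipes whose outputs are missing or older than their dependencies.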

Agenda

We will first work through stat545.com Chapter 35 to make sure that we all have make installed and that we can access it.

Once we get there, we’ll work through the activity in stat545.com Chapter 36 together.

Attribution

Written by Vincenzo Coia, with inspiration from Tiffany Timbers for the explanation of Makefiles, as well as the make activity from Shaun Jackman and Jenny Bryan created for this course prior to 2017.