# Try [`remakeGenerator`](https://github.com/wlandau/remakeGenerator)

[`remakeGenerator`](https://github.com/wlandau/remakeGenerator) is the successor to `workflowHelper`.
It is internally cleaner, more flexible, and more extensible than `workflowHelper`, and it is better suited to adapt to future updates to [`remake`](https://github.com/richfitz/remake). [`remakeGenerator`](https://github.com/wlandau/remakeGenerator) is tested and available for use.

# `workflowHelper`

This package helps to analyze multiple datasets in multiple ways. Your workflow will be

- **Reproducible**. Reproduce any analysis with one call to `plan_workflow()` and another to `make`.
- **Development-friendly**. Thanks to [`remake`](https://github.com/richfitz/remake), whenever you change your code, your next job will only recompute the affected tasks. This minimizes headache when your workflow is under heavy development and unexpected changes happen frequently.
- **Quick to set up**. Just provide the commands to generate datasets, analyze an arbitrary dataset, etc., and `workflowHelper` will arrange these commands in a workflow and manage your output.
- **Parallelizable**. Easily distribute your workflow over multiple parallel processes.

# Prerequisites

Before using this package, you should first learn about [`remake`](https://github.com/richfitz/remake). [GNU make](https://www.gnu.org/software/make/) is recommended but not strictly necessary.

# Installation

Ensure that [R](https://www.r-project.org/) and [GNU make](https://www.gnu.org/software/make/) are installed, as well as the dependencies in the [`DESCRIPTION`](https://github.com/wlandau/workflowHelper/blob/master/DESCRIPTION). Open an R session and run

```
library(devtools)
install_github("wlandau/workflowHelper")
```

Alternatively, you can build the package from source and install it by hand. First, ensure that [git](https://git-scm.com/) is installed. Next, open a [command line program](http://linuxcommand.org/) such as [Terminal](https://en.wikipedia.org/wiki/Terminal_%28OS_X%29) and enter the following commands.

```
git clone git@github.com:wlandau/workflowHelper.git
R CMD build workflowHelper
R CMD INSTALL ...
```

where `...` is replaced by the name of the tarball produced by `R CMD build`.

## Windows users need [`Rtools`](https://github.com/stan-dev/rstan/wiki/Install-Rtools-for-Windows).

The example and tests sometimes use `system("make")` and similar commands, so if you are running Windows, you will need to install [`Rtools`](https://github.com/stan-dev/rstan/wiki/Install-Rtools-for-Windows) to make `make` and related command-line utilities available.

# Example

You can run this example from start to finish with the `run_example_workflowHelper()` function. Alternatively, you can set up earlier stages with `write_example_workflowHelper()` or `setup_example_workflowHelper()` and then run the output manually with [`remake::make()`](https://github.com/richfitz/remake) or [`make`](https://www.gnu.org/software/make/). Then, optionally, use the `clean_example_workflowHelper()` function to remove all the files generated by `run_example_workflowHelper()`. The details of the example are below.
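
In an R session, the whole cycle looks like this (using only the functions named above):

```{r}
library(workflowHelper)
run_example_workflowHelper()   # run the example from start to finish
clean_example_workflowHelper() # then remove all the generated files
```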

Suppose I want to

1. Generate some datasets.
2. Analyze each dataset with multiple methods of analysis.
3. Compute summary statistics of each analysis of each dataset (model coefficients and mean squared error) and aggregate the summaries together.
4. Generate some tables, figures, and reports using those aggregated summaries.

I keep the functions to generate data, analyze data, etc. in [`code.R`](https://github.com/wlandau/workflowHelper/blob/master/inst/example/code.R), and the script to organize and set up the workflow is [`workflow.R`](https://github.com/wlandau/workflowHelper/blob/master/inst/example/workflow.R). There are also [`knitr`](http://yihui.name/knitr/) reports [`latex.Rnw`](https://github.com/wlandau/workflowHelper/blob/master/inst/example/latex.Rnw) and [`markdown.Rmd`](https://github.com/wlandau/workflowHelper/blob/master/inst/example/markdown.Rmd). You can generate these files with the `write_example_workflowHelper()` function. Typically, in your own workflows, you will write these files by hand.

## A walk through `workflow.R`

First, I list the R scripts containing my code and the package dependencies.

```{r}
library(workflowHelper)
sources = strings(code.R)
packages = strings(MASS)
# packages = strings(MASS, rmarkdown, tools) # Uncomment before building pdf/html
```

The `strings` function converts R expressions into character strings, so I could have simply written `sources = "code.R"`.
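
Under the hood, `strings` presumably just deparses its unevaluated arguments. Here is a minimal sketch of that idea; `strings_sketch` is a hypothetical stand-in, not the package's actual implementation.

```{r}
# Hypothetical sketch: capture the arguments unevaluated and deparse them.
strings_sketch = function(...){
  vapply(as.list(substitute(list(...)))[-1], deparse, character(1))
}
strings_sketch(code.R, MASS) # c("code.R", "MASS")
```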

Next, I list the commands to generate the datasets.

```{r}
datasets = commands(
  normal16 = normal_dataset(n = 16),
  poisson32 = poisson_dataset(n = 32),
  poisson64 = poisson_dataset(n = 64)
)
```

Be sure to give a unique name to each command (for example, `poisson_dataset(n = 32)` has the unique name `poisson32`). The `commands` function checks for names and returns a named character vector, so I could have simply written `datasets = c(normal16 = "normal_dataset(n = 16)", poisson32 = "poisson_dataset(n = 32)", poisson64 = "poisson_dataset(n = 64)")`. To generate 4 replicates of each kind of dataset, write `datasets = reps(datasets, 4)`.
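
For reference, the dataset functions in `code.R` look roughly like the following sketch, based on the downsize example at the end of this README (see [`code.R`](https://github.com/wlandau/workflowHelper/blob/master/inst/example/code.R) for the real definitions).

```{r}
# Sketch of the dataset functions (real versions in inst/example/code.R).
normal_dataset = function(n = 16){
  data.frame(x = rnorm(n, 1), y = rnorm(n, 5))
}

poisson_dataset = function(n = 16){
  data.frame(x = rpois(n, 1), y = rpois(n, 5))
}
```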

Similarly, I specify the commands to analyze each dataset.

```{r}
analyses = commands(
  linear = linear_analysis(..dataset..),
  quadratic = quadratic_analysis(..dataset..)
)
```

The `..dataset..` wildcard stands for the current dataset being analyzed, which in this case is an object returned by `normal_dataset` or `poisson_dataset`. Wildcards are case-insensitive, so `..DATASET..` and `..dAtAsEt..` will also work.
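
At build time, the object substituted for the wildcard arrives as an ordinary function argument, so an analysis function can be written like this hypothetical sketch (the real definitions are in [`code.R`](https://github.com/wlandau/workflowHelper/blob/master/inst/example/code.R)).

```{r}
# Hypothetical sketch of an analysis function: ..dataset.. is replaced
# by the actual dataset object before this function is called.
linear_analysis = function(dataset){
  lm(y ~ x, data = dataset)
}
```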

For summaries of the analyses, there is an additional `..analysis..` wildcard that stands for the current object returned by `linear_analysis` or `quadratic_analysis`. Like `..dataset..`, `..analysis..` is case-insensitive, so `..ANALYSIS..` will also work.

```{r}
summaries = commands(
  mse = mse_summary(..dataset.., ..analysis..),
  coef = coefficients_summary(..analysis..)
)
```
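
A summary function can then accept both the dataset and the analysis as arguments. Another hypothetical sketch, consistent with the `recall("mse")` output shown later in this README, where each MSE is a single number:

```{r}
# Hypothetical sketch of a summary function taking both wildcards.
mse_summary = function(dataset, analysis){
  mean((dataset$y - predict(analysis))^2)
}
```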

Next, I specify how to produce general output from the summaries, etc. Since `coef.csv` has a file extension, it will automatically be treated as a file target.

```{r}
output = commands(
  coef_table = do.call(I("rbind"), coef),
  coef.csv = write.csv(coef_table, target_name),
  mse_vector = unlist(mse)
)
```
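
Here, `target_name` is [`remake`](https://github.com/richfitz/remake)'s special symbol for the name of the target currently being built, so inside the `coef.csv` command it evaluates to `"coef.csv"` and the command behaves like:

```{r}
write.csv(coef_table, "coef.csv")
```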

Now, we're ready to specify plots. (Here, a `plot: TRUE` line is automatically added to [`remake.yml`](https://github.com/richfitz/remake).)

```{r}
plots = commands(
  mse.pdf = hist(mse_vector, col = I("black"))
)
```

Finally, we can generate some reports.

```{r}
reports = commands(
  markdown.md = list("poisson32", "coef_table", "coef.csv"), # dependencies
  latex.tex = TRUE # no dependencies here
  # markdown.html = render("markdown.md", quiet = TRUE, clean = FALSE),
  # latex.pdf = texi2pdf("latex.tex", clean = FALSE)
)
```

Since `markdown.md` has a `.md` extension, [`remake`](https://github.com/richfitz/remake) will automatically look for `markdown.Rmd` and knit it to `markdown.md` with the [`knitr`](http://yihui.name/knitr/) package. Similarly, [`remake`](https://github.com/richfitz/remake) will try to build `latex.tex` from `latex.Rnw`. In each case, the command is replaced with a character vector or list of character strings denoting the dependencies of the report. These could be external files or cached intermediate [`remake`](https://github.com/richfitz/remake) objects such as datasets or analyses. In the latter case, objects are automatically exported for use inside R code chunks, as described [here](https://github.com/richfitz/remake/blob/master/doc/format.md).

If you want to render `markdown.md` to `markdown.html`, be sure to include `rmarkdown` in your packages. Similarly, to compile `latex.tex` to `latex.pdf`, include the `tools` package. I commented out the lines to build `markdown.html` and `latex.pdf` to increase portability, but you may uncomment them if [LaTeX](https://www.latex-project.org/) and [Pandoc](http://pandoc.org/) are installed and accessible from R.

Optionally, I can prepend some lines to the overarching [Makefile](https://www.gnu.org/software/make/) for the workflow.

```{r}
begin = c("# This is my Makefile", "# Variables...")
```

The stages and elements of my workflow are now planned. To put them all together, I use `plan_workflow`, which calls `parallelRemake::write_makefile()`.

```{r}
plan_workflow(sources, packages, datasets, analyses, summaries, output, plots, reports, begin)
```

Optionally, I can pass additional arguments to `remake::make` using the `remake_args` argument to `plan_workflow`. For example, `plan_workflow(..., remake_args = list(verbose = FALSE))` is equivalent to calling `remake::make(..., verbose = FALSE)` for each target. I cannot set `target_names` or `remake_file` this way. Also, if I want to suppress the writing of the Makefile, I can call `plan_workflow(..., makefile = NULL)`.
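
For instance, a quieter version of the call above might look like the following sketch, combining the arguments just described.

```{r}
# Same workflow as before, with remake's verbose output suppressed.
plan_workflow(sources, packages, datasets, analyses, summaries, output,
  plots, reports, begin, remake_args = list(verbose = FALSE))
```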

## Running the workflow

After running the `workflow.R` script above, I have a [`remake`](https://github.com/richfitz/remake) file (`remake.yml`, in [YAML](http://yaml.org/) format) in my current working directory. To run the whole workflow with no parallel computing, simply open an R session and enter the following.

```{r}
library(remake)
make(remake_file = "remake.yml")
```

Thanks to [`remake`](https://github.com/richfitz/remake), if I change functions in `code.R` and then run `make` again, only the outdated parts of the workflow will be rebuilt.

Running `workflow.R` also produces a [Makefile](https://www.gnu.org/software/make/) in the current working directory. Using this master [Makefile](https://www.gnu.org/software/make/) and a [command line program](http://linuxcommand.org/), I have several options for running the workflow with parallel computing. Here are some examples.

- `make` runs the full workflow, only building results that are out of date or missing.
- `make -j [N]` is the same as above, with the workflow distributed over `[N]` parallel processes. Similarly, you can append `-j [N]` to any of the commands below to activate parallelism.
- `make datasets` just makes the datasets.
- `make analyses` just runs the analyses of all the datasets after ensuring that the datasets are up to date.
- `make summaries` computes individual summaries of each analysis of each dataset.
- `make aggregates` aggregates the summaries together.
- `make output` makes the final output of the workflow after ensuring all the previous results are up to date.
- `make clean` removes the files generated by `make`. If some of your files are produced by side effects, `make clean` might not remove them. In that case, updates to dependencies may not trigger the desired rebuilds, so you should read the next section.
- `make reset` runs `make clean` and then removes the [Makefile](https://www.gnu.org/software/make/) and all its constituent [YAML](http://yaml.org/) files.

# Manual access to intermediate objects for debugging and testing

Intermediate objects such as datasets, analyses, and summaries are maintained in [`remake`](https://github.com/richfitz/remake)'s hidden [`storr`](https://github.com/richfitz/storr) cache. At any point in the workflow, you can reload them using `recall` and check the available ones using `recallable`. Let's go back to the example. First, I check to see the names of the objects I can reload.

```{r}
> recallable()
[1] "coef" "coef_table"
[3] "mse" "mse_vector"
[5] "normal16" "normal16_linear"
[7] "normal16_linear_coef" "normal16_linear_mse"
[9] "normal16_quadratic" "normal16_quadratic_coef"
[11] "normal16_quadratic_mse" "poisson32"
[13] "poisson32_linear" "poisson32_linear_coef"
[15] "poisson32_linear_mse" "poisson32_quadratic"
[17] "poisson32_quadratic_coef" "poisson32_quadratic_mse"
[19] "poisson64" "poisson64_linear"
[21] "poisson64_linear_coef" "poisson64_linear_mse"
[23] "poisson64_quadratic" "poisson64_quadratic_coef"
[25] "poisson64_quadratic_mse"
>
```

Then if I want to load `mse`, the list of summaries generated by `mse_summary` in `code.R`, I simply use `recall`.

```{r}
> recall("mse")
$normal16_linear
[1] 0.6394384

$normal16_quadratic
[1] 0.6394384

$poisson32_linear
[1] 4.991832

$poisson32_quadratic
[1] 4.991832

$poisson64_linear
[1] 3.613922

$poisson64_quadratic
[1] 3.613922

>
```

**Important: use `recall()` and `recallable()` for debugging and testing only, and do not manually access the files inside `.remake/objects` in serious jobs. Access outside of [`remake`](https://github.com/richfitz/remake) is not tracked and thus not reproducible.**

# High-performance computing

If you want to run `make -j` to distribute tasks over multiple nodes of a [Slurm](http://slurm.schedmd.com/) cluster, refer to the Makefile in [this post](http://plindenbaum.blogspot.com/2014/09/parallelizing-gnu-make-4-in-slurm.html) and write

```{r}
write_makefile(...,
  begin = c(
    "SHELL=srun",
    ".SHELLFLAGS=[ARGS] bash -c"))
```

in an R session, where `[ARGS]` stands for additional arguments to `srun`. Then, once the [Makefile](https://www.gnu.org/software/make/) is generated, you can run the workflow with
`nohup make -j [N] &` in the command line, where `[N]` is the number of simultaneous tasks.
For other task managers such as [PBS](https://en.wikipedia.org/wiki/Portable_Batch_System), such an approach may not be possible. Regardless of the system, be sure that all nodes point to the same working directory so that they share the same `.remake` [storr](https://github.com/richfitz/storr) cache.

# Use with the [downsize](https://github.com/wlandau/downsize) package

You may want to use the [downsize](https://github.com/wlandau/downsize) package within your custom R source code. That way, you can run a quick scaled-down version of your workflow for debugging and testing before running the full-scale version. In the example, just include `downsize` in `packages` inside `workflow.R` and replace the top few lines of `code.R` with the following.

```{r}
library(downsize)
scale_down()

normal_dataset = function(n = 16){
  ds(data.frame(x = rnorm(n, 1), y = rnorm(n, 5)), nrow = 4)
}

poisson_dataset = function(n = 16){
  ds(data.frame(x = rpois(n, 1), y = rpois(n, 5)), nrow = 4)
}
```

The call `scale_down()` sets the `downsize` option to `TRUE`, which is a signal to the `ds` function. The command `ds(A, ...)` says "Downsize A to a smaller object when `getOption("downsize")` is `TRUE`". For the full scaled-up workflow, just delete the first two lines or replace `scale_down()` with `scale_up()`. Unfortunately, [`remake`](https://github.com/richfitz/remake) does not rebuild things when options are changed, so you'll have to run `make clean` whenever you change the `downsize` option.
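
To see the effect in isolation, here is a small illustration, assuming only the behavior described above: `ds()` truncates a data frame via `nrow` when the `downsize` option is `TRUE` and returns it unchanged otherwise.

```{r}
library(downsize)
big = data.frame(x = rnorm(16), y = rnorm(16))
scale_down()            # options(downsize = TRUE)
nrow(ds(big, nrow = 4)) # 4: truncated
scale_up()              # options(downsize = FALSE)
nrow(ds(big, nrow = 4)) # 16: returned unchanged
```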