{"id":14068280,"url":"https://github.com/numeract/rflow","last_synced_at":"2026-05-28T19:30:40.574Z","repository":{"id":93562418,"uuid":"117606410","full_name":"numeract/rflow","owner":"numeract","description":"Flexible R Pipelines with Caching","archived":false,"fork":false,"pushed_at":"2018-09-01T16:05:15.000Z","size":623,"stargazers_count":12,"open_issues_count":5,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-07-30T03:55:10.319Z","etag":null,"topics":["cache","data-science","pipeline","r","rflow"],"latest_commit_sha":null,"homepage":"https://numeract.github.io/rflow/","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/numeract.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-15T23:26:56.000Z","updated_at":"2022-07-06T14:31:02.000Z","dependencies_parsed_at":"2023-08-26T15:45:40.713Z","dependency_job_id":null,"html_url":"https://github.com/numeract/rflow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/numeract/rflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/numeract%2Frflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/numeract%2Frflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/numeract%2Frflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/numeract%2Frflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/numeract","download_url"
:"https://codeload.github.com/numeract/rflow/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/numeract%2Frflow/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33624202,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cache","data-science","pipeline","r","rflow"],"created_at":"2024-08-13T07:06:04.265Z","updated_at":"2026-05-28T19:30:40.556Z","avatar_url":"https://github.com/numeract.png","language":"R","funding_links":[],"categories":["R"],"sub_categories":[],"readme":"# rflow - Flexible R Pipelines with Caching\n\n[![Travis build status](https://travis-ci.org/numeract/rflow.svg?branch=master)](https://travis-ci.org/numeract/rflow)\n[![Coverage status](https://codecov.io/gh/numeract/rflow/branch/master/graph/badge.svg)](https://codecov.io/github/numeract/rflow?branch=master)\n[![CRAN status](https://www.r-pkg.org/badges/version/rflow)](https://cran.r-project.org/package=rflow)\n[![lifecycle](https://img.shields.io/badge/lifecycle-maturing-blue.svg)](https://www.tidyverse.org/lifecycle/#maturing)\n \n**The package is currently under active development, please expect major \nchanges while the API stabilizes.**\n\n\n## Motivation\n\nA common problem when processing data as part of a 
pipeline is avoiding \nunnecessary calculations. For example, if a function is called over and\nover with the same arguments, it should not recalculate the result each time\nbut should instead return the cached (pre-computed) result.\n\nWhile caching of the function output resolves the first problem, a second\nissue occurs when large data sets are being processed. In this case, hashing\nof the input arguments each time might take too long. This issue can be solved\nby hashing the data only once (as output) and then by noticing changes \nin the hash received by the downstream function. In other words, it is not \nthe data that flows through the pipeline (as is the case with standard functions),\nbut hashes of the data.\n\nA third issue is output sub-setting. When working with a pipeline it is\noften the case (e.g. ETL, Machine Learning) that we need to pass the whole\ndata frame but the function is going to use only a subset (e.g. a CV fold).\nSince the main data frame has changed, caching of the result is no longer\nefficient. 
The solution involves hashing of the subset of interest, which\ncan be done by introducing additional intermediate functions in the pipeline.\nHowever, there is a loss of efficiency due to excessive rehashing as the \nmain data frame passes through many functions.\n\nThe package `rflow` addresses these inefficiencies and makes pipelines as easy\nto use as in the tidyverse.\n\n\n## Installation\n\n```\n# install.packages(\"devtools\")\ndevtools::install_github(\"numeract/rflow\")\n```\n\n\n## Use\n\n\n### Simple Example\n\n```\nx1 \u003c- 10\nx2 \u003c- 0.5\nx3 \u003c- 2\n\nf1 \u003c- function(a, b, c = 1) {a * b + c}\nf2 \u003c- function(d, e) {d / e}\n\n# passing the results downstream using functions\n(o1 \u003c- f1(x1, x2))  # 6\n(o2 \u003c- f2(o1, x3))  # 3\n\n\n# variant 1: declaring flows for each function using default options\nff1 \u003c- make_flow_fn(f1)\nff2 \u003c- make_flow_fn(f2)\n\n# passing to the downstream flow and collecting the results\nr1 \u003c- ff1(x1, x2)   # does not trigger re-calc\nr2 \u003c- ff2(r1, x3)   # does not trigger re-calc; first arg. is a flow arg.\ncollect(r1)         # 6\ncollect(r2)         # 3\n\n\n# variant 2: arguments and functions within one call\nlibrary(dplyr)                          # makes life easier \nflow_fn(x1, x2, fn = f1) %\u003e%            # reuses cache created by ff1\n  flow_fn(x3, fn = f2) %\u003e%              # reuses cache created by ff2\n  collect()                             # 3, no actual re-calc takes place\n```\n\n\n### Pipelines\n\n1. Create your function, e.g. `f \u003c- function(...) {...}`\n- `rflow` works best with pure functions, i.e. functions\nthat depend only on their inputs (and not on variables outside the function \nframe) and do not produce any side effects (e.g. printing, modifying variables \nin the global environment).\n\n2. \"flow\" the function: `ff \u003c- make_flow_fn(f)`\n\n3. 
When pipelining `ff` into another `rflow` function, simply supply `ff()`\nas an argument, for example: `ff(x) %\u003e% ff2(y) %\u003e% ff3(z)`\n\n4. At the end of the `rflow` pipeline, you must use `collect()` to collect\nthe actual data (and not just the cached structure). Alternatively,\nuse `flow_ns_sink()` to dump the data into an environment or a \n`shiny::reactiveValues` namespace.\n\n\n### Shiny\n\nShiny from RStudio uses reactive values to know what changes took place and \nwhat to recompute. It is thus possible to use a series of reactive elements \nin Shiny to prevent expensive re-computations from taking place. Example:\n\n```\nrv1 \u003c- reactive({ \n    ... input$x ... \n})\n\nrv2 \u003c- reactive({ \n    ... rv1() .... input$y ... \n})\n\nrv3 \u003c- reactive({ \n    ... rv2() .... input$z ... \n})\n```\n\nThe downside is that we need one reactive element for each function in the \npipeline, which makes data processing dependent on UI / Shiny. Using `rflow`, \nwe can separate the UI from the data processing, maintaining the caching\nnot only for the current state but for all previously computed states.\n\n```\nrv \u003c- reactive({ \n    rf1(input$x, ...) %\u003e%\n    rf2(input$y, ...) %\u003e%\n    rf3(input$z, ...) 
%\u003e%\n    collect()\n})\n```\n\nWhile a similar workflow can be achieved with package `memoise`, it suffers from\nseveral disadvantages (see Memoise below).\n\n\n### Output Subset \n\n(to be updated)\n\n\n## Other frameworks\n\n\n### Memoise\n\nPackage [memoise](https://github.com/r-lib/memoise) \nby Hadley Wickham, Jim Hester and others was the main source of inspiration.\nMemoise is elegant, fast, and simple to use, but it suffers from certain limitations \nthat we hope to overcome in this package:\n\n- excessive [rehashing of inputs](https://github.com/r-lib/memoise/issues/31)\n- only one cache layer (although its cache framework is extensible)\n- no input/output sub-setting, it uses the complete set of arguments provided\n- no reactivity (yet to be implemented in `rflow`)\n\n\n### Drake\n\nPackage [drake](https://github.com/ropensci/drake) by Will Landau and others \nprovides a complete framework for large data sets, including using\nfiles as inputs and outputs. The downside is that it requires additional \noverhead to get started and its focus is on the pipeline as a whole. If your\nwork requires many hours of computations (which increases the value of each \nresult), the overhead due to the setup has a relatively lower cost - in this\nscenario `drake` is an excellent choice.\n\nPackage `rflow` is somewhere between `memoise` and `drake`:\n\n- one can start using `rflow` right away, with minimal overhead\n- allows focusing on the data processing (e.g., EDA) and not on the framework\n\n\n## TODO list\n\n- reactivity \n- multi-layer cache (with file locking)\n- file sinks\n- parallel processing\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnumeract%2Frflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnumeract%2Frflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnumeract%2Frflow/lists"}