Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/apuapaquola/rf
rf
Last synced: 3 months ago
- Host: GitHub
- URL: https://github.com/apuapaquola/rf
- Owner: apuapaquola
- License: gpl-3.0
- Created: 2015-12-02T17:32:33.000Z (almost 9 years ago)
- Default Branch: master
- Last Pushed: 2024-06-07T14:46:07.000Z (5 months ago)
- Last Synced: 2024-06-21T12:27:45.216Z (5 months ago)
- Language: Python
- Homepage:
- Size: 78.1 KB
- Stars: 10
- Watchers: 4
- Forks: 5
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: COPYING
Awesome Lists containing this project
- awesome-starred - apuapaquola/rf - rf (others)
README
## rf - A minimalist framework for reproducible computation
```
pip install git+https://github.com/apuapaquola/rf.git
```

## Preprint
http://biorxiv.org/content/early/2015/12/09/033654
## Overview

We propose a simple and intuitive way to organize computational analyses using a directory structure built on three simple principles.
Consider the following directory structure:

    .
    └── nodeA
        ├── _h
        ├── _m
        ├── nodeB
        │   ├── _h
        │   └── _m
        └── nodeC
            ├── _h
            └── _m

In this tree, nodeA has two children: nodeB and nodeC.
We can think of these nodes as steps in a computational pipeline, in which nodeB and nodeC depend on the results of computation performed in nodeA.
This is principle 1: use of a directory structure to represent dependencies between analysis steps.
Each node has two special subdirectories: `_h` and `_m` with distinct purposes. We put documentation, code and other human-generated data that describe this analysis step in directory `_h`. For this reason, we call `_h` the "human" directory. Similarly, we use directory `_m` to store the results of computation of this analysis step. For this reason, we call `_m` the "machine" directory.
This is principle 2: separation of user-generated data from program-generated data.
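Principles 1 and 2 can be sketched in a few lines of Python. This is only an illustration of the layout, not part of `rf` itself; the helper names `make_node` and `child_steps` are hypothetical.

```python
import tempfile
from pathlib import Path

SPECIAL = ("_h", "_m")  # "human" and "machine" subdirectories


def make_node(parent, name):
    """Create an analysis step with empty _h and _m subdirectories (hypothetical helper)."""
    node = Path(parent) / name
    for sub in SPECIAL:
        (node / sub).mkdir(parents=True, exist_ok=True)
    return node


def child_steps(node):
    """Return the names of the steps nested under this node, i.e. its dependents."""
    return sorted(
        p.name for p in Path(node).iterdir()
        if p.is_dir() and p.name not in SPECIAL
    )


# Rebuild the example tree in a temporary directory.
with tempfile.TemporaryDirectory() as root:
    node_a = make_node(root, "nodeA")
    make_node(node_a, "nodeB")
    make_node(node_a, "nodeC")
    print(child_steps(node_a))  # ['nodeB', 'nodeC']
```

Any subdirectory of a node other than `_h` and `_m` is treated here as a dependent step, which mirrors how nodeB and nodeC sit under nodeA.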
In the "human" directory we put a file named `run`. `run` is a script that is meant to be run without arguments from the "machine" directory. This script is responsible for calling the programs that do the computation in the analysis step and generate the contents of `_m`.
This is principle 3: use of driver scripts. [doi: 10.1371/journal.pcbi.1000424]
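As a rough sketch of principle 3, the function below runs a node's driver script with no arguments, using `_m` as the working directory. It assumes a POSIX system and an executable `_h/run`; the name `run_node` is hypothetical, and creating `_m` when it is missing is an assumption of this sketch, not a description of how the `rf` command behaves.

```python
import subprocess
from pathlib import Path


def run_node(node):
    """Run node/_h/run without arguments from inside node/_m (hypothetical helper)."""
    node = Path(node).resolve()
    driver = node / "_h" / "run"
    machine = node / "_m"
    # Assumption: create the machine directory if it does not exist yet.
    machine.mkdir(exist_ok=True)
    # No arguments are passed, so every parameter must live in the script itself.
    result = subprocess.run([str(driver)], cwd=machine)
    return result.returncode
```

Because the driver is started inside `_m`, any file it writes to a relative path ends up in the "machine" directory.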
These three principles help keep analyses organized, reproducible, and easier to understand.
A directory structure is an intuitive way to represent data dependencies. Suppose we are in some `_m` directory looking at output files and wonder how they were generated. Running `pwd` displays the full path to that directory, which spells out the sequence of analysis steps involved in generating these files.
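To make this concrete, the sketch below recovers the chain of analysis steps from the path of an `_m` directory. The function name `analysis_steps` and the `/project` root are made up for illustration.

```python
from pathlib import PurePosixPath


def analysis_steps(machine_dir, root):
    """List the analysis steps encoded in the path of an _m directory (hypothetical helper)."""
    relative = PurePosixPath(machine_dir).relative_to(root)
    # Drop the special directories so that only step names remain.
    return [part for part in relative.parts if part not in ("_h", "_m")]


print(analysis_steps("/project/nodeA/nodeB/_m", "/project"))  # ['nodeA', 'nodeB']
```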
Separating computer-generated data from human-generated data is also useful. It helps ensure that users don't edit output files by hand, and it makes clear which files are program-generated and therefore safe to delete, because they can be computed again.
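Because every program-generated file lives under some `_m` directory, finding what can be safely regenerated reduces to a directory search. The sketch below only lists those directories and deletes nothing; `machine_dirs` is a hypothetical helper, not an `rf` command.

```python
from pathlib import Path


def machine_dirs(root):
    """Return every _m directory below root, sorted by path (hypothetical helper)."""
    return sorted(p for p in Path(root).rglob("_m") if p.is_dir())
```

For the example tree, calling `machine_dirs(".")` would return the `_m` directories of nodeA, nodeB, and nodeC.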
Running driver scripts without arguments is a way to make sure computation doesn't depend on manually specified parameters, which are easy to forget.
## Version control
The separation of computer-generated data from human-generated data makes it easy to use version control systems for an analysis tree.
In the current implementation we use git for `_h` and git-annex for `_m`. For some operations that involve more than one call to git or git annex, we provide a wrapper command `rf`.
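The split between the two tools can be illustrated by the commands one might issue for a single node. This sketch only builds the command lines and does not execute them. `git add` and `git annex add` are standard commands, but whether the `rf` wrapper calls exactly these is an assumption, and `version_control_commands` is a hypothetical name.

```python
from pathlib import PurePosixPath


def version_control_commands(node):
    """Build (but do not run) commands that track _h with git and _m with git-annex."""
    node = PurePosixPath(node)
    return [
        ["git", "add", str(node / "_h")],           # human-written files go to git
        ["git", "annex", "add", str(node / "_m")],  # machine-generated files go to git-annex
    ]


for command in version_control_commands("nodeA/nodeB"):
    print(" ".join(command))
```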
Using git, users can collaborate on and share analysis trees in much the same way as they do with code.
## Status
This framework is in an early stage of development, and contributors are very welcome.
Current work includes the following, some of which will be available soon:
* Apptainer and Docker support.
* More and better documentation.
* Concrete examples of analysis, mostly focusing on Bioinformatics.
* Use cases.
* Improvement of the manuscript.
* Support for git lfs.

## Contributing
Please do!
## License
GNU GPLv3