https://github.com/cphyc/dotry
Python workflow for efficient and reproducible science
- Host: GitHub
- URL: https://github.com/cphyc/dotry
- Owner: cphyc
- License: gpl-2.0
- Created: 2017-05-06T13:55:08.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2017-05-07T13:49:28.000Z (about 9 years ago)
- Last Synced: 2025-02-23T19:15:16.126Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 47.9 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
- License: LICENSE
# IMPORTANT
This program is not stable enough for daily use! Use at your own risk.
Moreover, the following guide may not be up to date.
# Dotry
Have you ever had to rerun a whole set of simulations because you couldn't remember whether a given file was up to date with a function? Dotry stands for *Do*n't *Tr*ust *Y*ourself but your computer…
This tool aims to prevent such issues by introducing the concept of a `TaskManager`. Once you have created a task manager, you can register functions with it. These functions are automagically parsed to find their input and output files, which are then registered in the manager. Each time you run one of the registered functions, the task manager first runs all the functions required to provide it with up-to-date inputs.
Moreover, the task manager stores all the files in a `data` directory (which you can customize).
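The idea can be illustrated with a toy task manager. The sketch below is not Dotry's implementation: `MiniTaskManager`, its `path` and `run` methods, and the explicit `inputs=`/`outputs=` arguments are all hypothetical, and files are declared by hand instead of being discovered by parsing the functions. It also treats a task as stale only when one of its outputs is missing or an upstream task has just run, and it ignores changes to function definitions.

```python
import os
import tempfile


class MiniTaskManager:
    """Toy illustration only (hypothetical; NOT Dotry's actual code)."""

    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.tasks = {}      # task name -> (function, inputs, outputs)
        self.producer = {}   # output file name -> name of the task writing it
        self.log = []        # names of the tasks that actually ran, in order

    def register(self, inputs=(), outputs=()):
        """Decorator that records which files a function reads and writes."""
        def decorator(func):
            self.tasks[func.__name__] = (func, tuple(inputs), tuple(outputs))
            for name in outputs:
                self.producer[name] = func.__name__
            return func
        return decorator

    def path(self, name):
        """Map a bare file name to a path inside the data directory."""
        return os.path.join(self.data_dir, name)

    def run(self, task):
        """Run upstream tasks first, then `task` if needed.

        Returns True if `task` itself was executed.
        """
        func, inputs, outputs = self.tasks[task]
        upstream_ran = False
        for name in inputs:
            if name in self.producer:
                upstream_ran = self.run(self.producer[name]) or upstream_ran
        missing = any(not os.path.exists(self.path(o)) for o in outputs)
        if upstream_ran or missing:
            func()
            self.log.append(task)
            return True
        return False


tm = MiniTaskManager(tempfile.mkdtemp())


@tm.register(outputs=["a.txt"])
def make_a():
    with open(tm.path("a.txt"), "w") as fh:
        fh.write("1 2 3")


@tm.register(inputs=["a.txt"], outputs=["b.txt"])
def make_b():
    with open(tm.path("a.txt")) as src, open(tm.path("b.txt"), "w") as dst:
        dst.write(src.read().replace(" ", ","))


tm.run("make_b")   # a.txt is missing, so make_a runs first, then make_b
tm.run("make_b")   # both outputs exist now, so nothing is rerun
print(tm.log)      # ['make_a', 'make_b']
```

Asking for `make_b` is enough to trigger `make_a`, because the manager knows which task produces `a.txt`. A real tool would additionally need to compare file timestamps or content hashes and track function definitions.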
## How to use?
Imagine you have two functions:

    import numpy as np

    def A():
        data_in = np.random.rand(10, 10)
        data_out = data_in**2
        np.save('output_A.dat', data_out)

    def B():
        # Read the file written by A
        data_in = np.load('output_A.dat')
        data_out = data_in - 10
        np.save('output_B.dat', data_out)

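One NumPy detail matters when running the two plain functions above outside a task manager: `np.save` appends `.npy` to any file name that does not already end with it. The short check below (the temporary directory and variable names are illustrative only) shows that an array saved as `output_A.dat` ends up on disk as `output_A.dat.npy`, which is the name `np.load` then needs. Whether `tm.din` and `tm.dout` compensate for this is not stated in this guide.

```python
import os
import tempfile

import numpy as np

tmp = tempfile.mkdtemp()                    # scratch directory for the demo
target = os.path.join(tmp, "output_A.dat")

np.save(target, np.arange(4))               # file name does not end in .npy
written = sorted(os.listdir(tmp))
print(written)                              # ['output_A.dat.npy']

loaded = np.load(target + ".npy")           # load using the name actually written
print(loaded)
```

Choosing file names that already end in `.npy` avoids the mismatch altogether.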
For much longer functions, it is easy to lose track of which files are up to date with which function. To solve this issue, you can wrap your functions using a task manager, so that all the output files stay up to date with all the inputs and with the function definitions. First, create a task manager instance:

    import numpy as np
    import scyframework as sf

    tm = sf.TaskManager()

Then modify your functions so that the task manager can handle them:

    @tm.register
    def A():
        data_in = np.random.rand(10, 10)
        data_out = data_in**2
        np.save('output_A.dat', data_out)

    @tm.register
    def B():
        data_in = np.load('output_A.dat')
        data_out = data_in - 10
        np.save('output_B.dat', data_out)

We're almost there. You then need to wrap all the input/output statements using `tm.din` (inputs) and `tm.dout` (outputs), which finally gives:

    import numpy as np
    import scyframework as sf

    tm = sf.TaskManager()

    @tm.register
    def A():
        # Fixed seed so that A always produces the same data
        np.random.seed(1234)
        data_in = np.random.rand(10, 10)
        data_out = data_in**2
        np.save(tm.dout('output_A.dat'), data_out)

    @tm.register
    def B():
        data_in = np.load(tm.din('output_A.dat'))
        data_out = data_in - 10
        np.save(tm.dout('output_B.dat'), data_out)

The `tm.din` and `tm.dout` statements have two roles. First, they convert the file name to a full path so that your data doesn't clutter your working folder.
For example:

    > print(tm.din('output_B.dat'))
    /path/to/current/folder/data/output_B

Second, they help the task manager discover your inputs and outputs, so that it knows which function provides which data. Now if you run the file