https://github.com/bionode/bionode-watermill-tutorial
This is a tutorial for bionode-watermill
- Host: GitHub
- URL: https://github.com/bionode/bionode-watermill-tutorial
- Owner: bionode
- Created: 2017-06-09T09:38:20.000Z (about 9 years ago)
- Default Branch: master
- Last Pushed: 2018-04-26T18:17:44.000Z (about 8 years ago)
- Last Synced: 2024-12-27T18:11:50.652Z (over 1 year ago)
- Language: JavaScript
- Size: 25.4 KB
- Stars: 1
- Watchers: 3
- Forks: 3
- Open Issues: 1
# bionode-watermill for dummies!
* [Objective](#objective)
* [First things first](#first-things-first)
* [Defining a task](#defining-a-task)
* [Input/output](#inputoutput)
* [Using orchestrators](#using-orchestrators)
* [join](#join)
* [junction](#junction)
* [fork](#fork)
* [Useful links](#useful-links)
## Objective
This tutorial is intended for those who are assembling a bioinformatics
pipeline with bionode-watermill for the first time.
## First things first
This tutorial assumes that you have installed `npm`, `git` and `node`. The full tutorial requires Node.js version 7 or higher.
To set up and test the scripts in this tutorial, follow these simple steps:
* `git clone https://github.com/bionode/bionode-watermill-tutorial.git`
* `cd bionode-watermill-tutorial`
* `npm install bionode-watermill`
## Defining a task
Watermill is a tool that lets you orchestrate tasks. So, let's first
understand how to define a **task**.
To define a **task** we first need to require bionode-watermill:
```javascript
const watermill = require('bionode-watermill')
const task = watermill.task /* task has to be selected explicitly because the
watermill object exposes other properties as well */
```
Next, we can use the `task` variable to define a given task:
* Using standard javascript style:
```javascript
// this is a simple (KISS) example of how tasks work with the shell
const simpleTask = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the
                           // task function
  name: 'This is the task name' // defines the name of the task
}, function(resolvedProps) {
  const params = resolvedProps.params
  return 'touch ' + params
}
)
```
* Or you can also do something like the following in ES6 syntax, using arrow
functions:
```javascript
// this is a simple (KISS) example of how tasks work with the shell
const simpleTask = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt', /* defines parameters to be passed to the
  task function */
  name: 'This is the task name' // defines the name of the task
}, ({ params }) => `touch ${params}`
)
```
Note: [Template literals](https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Template_literals)
are very useful since they let you embed placeholders (`${ }`) within
strings. Template literals are enclosed in back-ticks (\` \`), as in the example
above.
After the task is defined, it can be executed like this:
```javascript
// runs the task and returns a promise, and can also return a callback
simpleTask()
```
This task will create a new (empty) file inside a directory named
`data/<uid>/`.
You may also notice that a lot of text was printed to the terminal; it
can be useful for debugging your pipelines.
The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_task.js).
You can test it by running: `node simple_task.js`
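To see what the task function hands to the shell, it can be called on its own, without bionode-watermill. This is only a sketch: `buildCommand` is a name made up for this illustration, and the object passed to it merely imitates the resolved props that the task function receives.

```javascript
// Hypothetical stand-alone copy of the function passed to task() above
const buildCommand = ({ params }) => `touch ${params}`

// Imitates the resolved props; only `params` is used by the function
const command = buildCommand({ params: 'test_file.txt' })

console.log(command) // prints: touch test_file.txt
```

The returned string is the shell command that the task runs.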
### Input/output
Although already discussed [elsewhere](https://github.com/bionode/bionode-watermill/blob/master/docs/Task.md#input-and-output)
within bionode-watermill documentation, in this tutorial I intend to explain
how input/output are managed by bionode-watermill.
First, you can either hardcode the input to something like:
```javascript
{ input: 'ERR1229296.sra' }
```
or you can specify glob patterns instead, which are better explained
[here](https://github.com/bionode/bionode-watermill/blob/master/docs/Task.md#input-and-output).
Basically, what you need to know is that you can set the input to
something like:
```javascript
{ input: '*.sra' }
```
This tells bionode-watermill to crawl the `data` directory in search
of the first hit that matches this pattern. So, pay attention when specifying
these glob patterns if you have multiple `.sra` files within this folder, or
files generated by tasks other than your target task (the last one that
generated a `.sra` file in this example). To circumvent this, you can use
file names that are easy to tell apart. For instance, if you have one file
named `ERR1229296.sra` and another one named `ERR1229297.sra` and you want just
the first one, you can pass the input as follows:
```javascript
{ input: '*6.sra' }
```
or of course hardcode it.
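The effect of the two patterns can be checked with a small stand-alone sketch. The `globToRegExp` helper below is hypothetical and supports only `*`; it is not the matcher that bionode-watermill uses internally, which may behave differently in other cases.

```javascript
// Hypothetical helper: turns a simple glob (only `*` is supported) into a RegExp
const globToRegExp = (glob) =>
  new RegExp('^' + glob
    .split('*')
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex characters
    .join('.*') + '$')

const files = ['ERR1229296.sra', 'ERR1229297.sra']
const matches = (glob) => files.filter(file => globToRegExp(glob).test(file))

console.log(matches('*.sra'))  // both files match
console.log(matches('*6.sra')) // only ERR1229296.sra matches
```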
Output works in a very similar way; however, there are a few specifics
that the user must be aware of:
- The `output` property is not the output filename; it is only used to match the file
extension against the expected result of the task. It is nevertheless necessary for
the task to resolve properly.
```javascript
// this won't work!!!
{ output: 'myfile.txt' }
// rather you should provide this as follows:
{
  output: '*.txt',
  params: { output: 'myfile.txt' }
}
```
Remember, `task.output` is used to match the output file pattern. If you
want to give the output a specific filename, use the `task.params.output`
object instead, where you can freely specify the output file name.
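As a minimal sketch of how the two properties work together, the props and the task function can be written separately and the function called by hand. The names `props` and `operation` and the task name are assumptions made for this illustration; with bionode-watermill installed, the two would be passed to `task(props, operation)` as in the examples above.

```javascript
// Props as they would be passed to task(): `output` is the pattern,
// `params.output` holds the actual file name
const props = {
  output: '*.txt',
  params: { output: 'myfile.txt' },
  name: 'Create myfile.txt' // hypothetical task name
}

// Function that would be the second argument of task()
const operation = ({ params }) => `touch ${params.output}`

console.log(operation(props)) // prints: touch myfile.txt
```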
## Using orchestrators
[What are orchestrators?](https://github.com/bionode/bionode-watermill#what-are-orchestrators)
* ### Join
**Join** is an operator that lets you run a number of tasks in a given order.
For instance, we may be interested in creating a file and then writing to it
as two separate tasks. But let's first define a new task so we can
run it after the task that we called `simpleTask`:
```javascript
const writeToFile = task({
  input: '*.txt', // specifies the pattern of the expected input
  output: '*.txt', // checks if output file matches the specified pattern
  name: 'Write to file' // defines the name of the task
}, ({ input }) => `echo "some string" >> ${input}`
)
```
So, task `writeToFile` writes "some string" to the file that we have just
created in task `simpleTask`. However, to do so, the file needs to be
created first, and only then can something be written to it.
To achieve this we use `join`.
Before building the pipeline, we first need to require **join**:
```javascript
// === WATERMILL ===
const {
  task,
  join
} = require('bionode-watermill')
```
And then,
```javascript
// this is a simple (KISS) example of how join works
const pipeline = join(simpleTask, writeToFile)
// executes the join itself
pipeline()
```
This operation will generate two directories inside the `data` folder: one
for the first task (`simpleTask`), which creates a new
file called `test_file.txt`, and one for the second task (`writeToFile`), which
creates a symlink to `test_file.txt` and writes to it, since we have indicated that
we would like to write to the same file as the input. Note that, once again,
the files will be inside a directory named `data/<uid>/` (but in this case you
will have two directories with distinct uids).
The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_join.js).
You can test the above example by running: `node simple_join.js`
* ### Junction
Unlike **join**, **junction** lets you run multiple tasks in parallel.
However, we will have to create a new task: if we simply replace
**join** with **junction** in the previous pipeline, we will end up with a file
named `test_file.txt` with nothing written inside, because if you create the
file and write to it at the same time, the write won't work, but the file will be
created.
But first, don't forget to:
```javascript
// === WATERMILL ===
const {
  task,
  join,
  junction
} = require('bionode-watermill')
```
And only then:
```javascript
// this will not produce the file with text in it!
const pipeline = junction(simpleTask, writeToFile)
```
So, we will define a new simple task:
```javascript
const writeAnotherFile = task({
  output: '*.file', // checks if output file matches the specified pattern
  params: 'another_test_file.file', /* defines the file name to be passed to
  the task function */
  name: 'Yet another task' // defines the name of the task
}, ({ params }) => `touch ${params} | echo "some new string" >> ${params}`
)
```
And then execute the new pipeline:
```javascript
// this is a simple (KISS) example of how junction works
const pipeline = junction(
  join(simpleTask, writeToFile), /* these joined tasks will be executed at the
  same time as the task below */
  writeAnotherFile
)
// executes the pipeline itself
pipeline()
```
This new pipeline creates two files and writes text to them. Note
that the `writeAnotherFile` task uses a shell pipe ("|") along with
the shell commands `touch` and `echo`. That is a
feature that bionode-watermill also supports. Of course, these are simple
tasks that can be performed with shell commands alone (they are merely
illustrative). Instead of a shell command, you can also use javascript **callback**
functions or **promises** as the final return of a **task**.
Nevertheless, if you browse to the `data` folder, you should have three folders
(because you have three tasks): one with the text file generated in the first
task, another with a symlink to the file from the first task (which was used to write
to this file), and a third one with the file
created and written in the third task (named `another_test_file.file`).
The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_junction.js).
You can test the above example by running: `node simple_junction.js`
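The difference in execution order between the two orchestrators can be pictured with plain promises. The sketch below is only an analogy and does not use bionode-watermill: `makeTask`, `inOrder` and `inParallel` are hypothetical stand-ins, and timers replace the real shell commands.

```javascript
// Hypothetical stand-ins: each "task" is a function that returns a promise
const log = []
const makeTask = (name, ms) => () =>
  new Promise(resolve => {
    log.push(`start ${name}`)
    setTimeout(() => { log.push(`end ${name}`); resolve(name) }, ms)
  })

// join-like: a task starts only after the previous one has resolved
const inOrder = (...tasks) => () =>
  tasks.reduce((previous, next) => previous.then(next), Promise.resolve())

// junction-like: all tasks start at once and the result waits for all of them
const inParallel = (...tasks) => () => Promise.all(tasks.map(t => t()))

const pipeline = inParallel(
  inOrder(makeTask('create', 20), makeTask('write', 20)),
  makeTask('another', 10)
)

// `done` resolves with the recorded order of events
const done = pipeline().then(() => log)
done.then(events => console.log(events.join('\n')))
```

In the recorded events, `write` starts only after `create` has ended, while `another` starts without waiting for either of them.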
* ### Fork
While **junction** runs two or more tasks at the same time, **fork**
passes the output of each of two or more tasks to the next task.
Imagine you have two different files being generated in two different tasks
and want to process them using the same task in the next step. In this case
bionode-watermill uses **fork** to split the pipeline into two distinct
branches that are then processed independently.
If you have something like:
```javascript
join(
  taskA,
  fork(taskB, taskC),
  taskD
)
```
This will result in something like this: ```taskA -> taskB -> taskD'``` and
```taskA -> taskC -> taskD''```, with two distinct final outputs for the
pipeline. This is quite useful for benchmarking programs, or if you are
interested in running multiple programs that do the same type of analysis
and comparing their results.
Importantly, the same type of pipeline with **junction** instead of **fork**,
```javascript
join(
  taskA,
  junction(taskB, taskC),
  taskD
)
```
would result in the following workflow: ```taskA -> taskB, taskC -> taskD```,
where taskD has only one final result.
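The two workflows can be made concrete with a small sketch that expands a pipeline description into the task sequences it produces. This is plain JavaScript, not the bionode-watermill API: the `expand` helper and the `{ fork: [...] }` and `{ junction: [...] }` objects are hypothetical stand-ins used only for illustration.

```javascript
// Hypothetical helper: expands a list of steps into the task sequences it yields
const expand = (steps) =>
  steps.reduce((paths, step) => {
    if (step.fork) {
      // every branch of a fork continues as its own copy of the pipeline
      return [].concat(...paths.map(path => step.fork.map(branch => [...path, branch])))
    }
    if (step.junction) {
      // tasks in a junction run in parallel but stay on the same path
      return paths.map(path => [...path, step.junction.join(', ')])
    }
    return paths.map(path => [...path, step])
  }, [[]])

const withFork = expand(['taskA', { fork: ['taskB', 'taskC'] }, 'taskD'])
const withJunction = expand(['taskA', { junction: ['taskB', 'taskC'] }, 'taskD'])

withFork.forEach(path => console.log(path.join(' -> ')))
// taskA -> taskB -> taskD
// taskA -> taskC -> taskD
withJunction.forEach(path => console.log(path.join(' -> ')))
// taskA -> taskB, taskC -> taskD
```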
But enough talk, let's get to work!
First:
```javascript
// === WATERMILL ===
const {
  task,
  join,
  fork
} = require('bionode-watermill')
```
For the fork tutorial, two tasks will be defined. Each of these tasks
creates a file and writes to it:
```javascript
const simpleTask1 = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the
                           // task function
  name: 'task1: creating file 1' // defines the name of the task
}, ({ params }) => `touch ${params} | echo "this is a string from first file" >> ${params}`
)
const simpleTask2 = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'another_test_file.txt', /* defines parameters to be passed to the
  task function */
  name: 'task 2: creating file 2' // defines the name of the task
}, ({ params }) => `touch ${params} | echo "this is a string from second file" >> ${params}`
)
```
Then, a task to be performed after the fork, which will add the same text to
these files:
```javascript
const appendFiles = task({
  input: '*.txt', // specifies the pattern of the expected input
  output: '*.txt', // checks if output file matches the specified pattern
  name: 'Write to files' // defines the name of the task
}, ({ input }) => `echo "after fork string" >> ${input}`
)
```
And finally our pipeline execution:
```javascript
// this is a simple (KISS) example of how fork works
const pipeline = join(
  fork(simpleTask1, simpleTask2),
  appendFiles
)
// executes the pipeline itself
pipeline()
```
This should result in four output directories in our `data` folder. Notice
that, in contrast to **junction**, where three tasks would render three output
directories, with **fork** our pipeline produces four output
directories, because the outputs from `simpleTask1` and `simpleTask2` were
both processed by task `appendFiles`.
The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_fork.js).
You can test the above example by running: `node simple_fork.js`
## Useful links
* [How to require bionode-watermill inside my project?](https://github.com/bionode/GSoC17/blob/master/notes/running_watermill.md)
* [Prefer javascript standard syntax? Then use the following URL](https://github.com/bionode/bionode-watermill-tutorial/tree/master/js_standard_tutorial)
* [Is this not challenging enough? Then try our other example pipelines](https://github.com/bionode/bionode-watermill/tree/master/examples/pipelines)
* [A pipeline to perform mapping with bowtie and bwa in parallel](https://github.com/bionode/bionode-watermill/tree/master/examples/pipelines/two-mappers)