https://github.com/bionode/bionode-watermill-tutorial

This is a tutorial for bionode-watermill
Host: GitHub
URL: https://github.com/bionode/bionode-watermill-tutorial
Owner: bionode
Created: 2017-06-09T09:38:20.000Z (about 9 years ago)
Default Branch: master
Last Pushed: 2018-04-26T18:17:44.000Z (about 8 years ago)
Last Synced: 2024-12-27T18:11:50.652Z (over 1 year ago)
Language: JavaScript
Size: 25.4 KB
Stars: 1
Watchers: 3
Forks: 3
Open Issues: 1
Metadata Files:
- Readme: README.md

# bionode-watermill for dummies!

* [Objective](#objective)

* [First things first](#first-things-first)

* [Defining a task](#defining-a-task)

    * [Input/output](#inputoutput)

* [Using orchestrators](#using-orchestrators)

    * [join](#join)

    * [junction](#junction)

    * [fork](#fork)

* [Useful links](#useful-links)

## Objective

This tutorial is intended for those assembling a bioinformatics pipeline with bionode-watermill for the first time.

## First things first

This tutorial assumes that you have installed `npm`, `git` and `node`. The full tutorial requires Node.js version 7 or higher.

To set up and test the scripts within this tutorial, follow these simple steps:

* `git clone https://github.com/bionode/bionode-watermill-tutorial.git`

* `cd bionode-watermill-tutorial`

* `npm install bionode-watermill`

## Defining a task

Watermill is a tool that lets you orchestrate tasks. So, let's first understand how to define a **task**.

To define a **task** we first need to require bionode-watermill:

```javascript

const watermill = require('bionode-watermill')

// the watermill object exports more than one property, so pick `task` explicitly
const task = watermill.task

```

Next, we can use the `task` variable to define a given task:

* Using standard JavaScript style:

```javascript

// this is a KISS (keep it simple) example of how tasks work with shell
const simpleTask = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the task function
  name: 'This is the task name' // defines the name of the task
}, function (resolvedProps) {
  const params = resolvedProps.params
  return 'touch ' + params
})

```

* Or you can do the same in ES6 syntax, using arrow functions:

```javascript

// this is a KISS (keep it simple) example of how tasks work with shell
const simpleTask = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the task function
  name: 'This is the task name' // defines the name of the task
}, ({ params }) => `touch ${params}`
)

```

Note: [Template literals](https://developer.mozilla.org/en/docs/Web/JavaScript/Reference/Template_literals) are very useful since they allow you to include placeholders (`${ }`) within strings. Template literals are enclosed by back-ticks (\` \`), as exemplified above.
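To see that the two styles build the same shell command, the functions passed to `task` can be tried on their own, without bionode-watermill. This is only a minimal sketch: `resolvedProps` is a hand-made stand-in for the object the task function receives, and the names `classicCreator` and `arrowCreator` are made up for illustration.

```javascript
// Hand-made stand-in for the props passed to the task function
// (assumption: only `params` is modelled here).
const resolvedProps = { params: 'test_file.txt' }

// Standard function style with string concatenation
function classicCreator (props) {
  return 'touch ' + props.params
}

// ES6 style with destructuring and a template literal
const arrowCreator = ({ params }) => `touch ${params}`

console.log(classicCreator(resolvedProps)) // touch test_file.txt
console.log(arrowCreator(resolvedProps)) // touch test_file.txt
```

Both functions return a plain string, which is the shell command the task runs.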

Then after defining the task, it may be executed like this:

```javascript

// runs the task and returns a promise, and can also return a callback

simpleTask()

```

This task will create a new (empty) file inside a directory named `data/<uid>/`.

You may also notice that a lot of text is printed to the terminal. This output can be useful for debugging your pipelines.

The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_task.js).

You can test it by running: `node simple_task.js`

### Input/output

Although input and output are already discussed [elsewhere](https://github.com/bionode/bionode-watermill/blob/master/docs/Task.md#input-and-output) within the bionode-watermill documentation, in this tutorial I intend to explain how they are managed by bionode-watermill.

First, you can hardcode the input to something like:

```javascript

{ input: 'ERR1229296.sra' }

```

or instead you can specify glob patterns, which are better explained [here](https://github.com/bionode/bionode-watermill/blob/master/docs/Task.md#input-and-output). But, basically, what you need to know is that you can specify the input as something like:

```javascript

{ input: '*.sra' }

```

This tells bionode-watermill to crawl the `data` directory in search of the first hit that matches this pattern. So, pay attention when specifying these glob patterns if there are multiple `.sra` files within this folder, or `.sra` files generated by tasks other than your target task (in this example, the last one that generated a `.sra` file). To circumvent this, you can provide file names that you can easily manage. For instance, if you have one file named `ERR1229296.sra` and another named `ERR1229297.sra` and you want just the first one, you can pass the input as follows:

  

```javascript

{ input: '*6.sra' }

```

or of course hardcode it. 
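To get a feeling for which files a pattern selects, here is a small stand-alone sketch. It is a simplified model that only understands `*`, and it is not the matching code bionode-watermill itself uses; `globToRegExp` and `matches` are hypothetical helpers written for this example.

```javascript
// Simplified model: turn a glob that only contains `*` into a RegExp.
const globToRegExp = (glob) => {
  const escaped = glob
    .split('*')
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&')) // escape regex characters
    .join('.*') // each `*` matches any run of characters
  return new RegExp(`^${escaped}$`)
}

const files = ['ERR1229296.sra', 'ERR1229297.sra']

// Return every file name that the glob matches
const matches = (glob) => files.filter((file) => globToRegExp(glob).test(file))

console.log(matches('*.sra')) // both files match
console.log(matches('*6.sra')) // only ERR1229296.sra matches
```

With `*.sra` there are two candidates, so which one counts as the "first hit" depends on what is in the `data` folder, while `*6.sra` leaves a single candidate.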

Output works in a very similar way, but there are a few specifics that the user must be aware of:

- The `output` property is not the output filename. It is used only to match the expected result of the task against a pattern (for example, a file extension). It is nevertheless necessary for the task to resolve properly.

```javascript

// this won't work!!!

{ output: 'myfile.txt' }

// rather you should provide this as follows:

{ 

  output: '*.txt',

  params: { output: 'myfile.txt' }

}

```

Remember, `task.output` is used to match the output file pattern. If you want to give the output a specific filename, use `task.params.output` instead, where you can freely specify the output file name.
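As a minimal sketch of how these two properties work together, the props and the task function can be written separately and checked without running watermill. The names `props`, `createNamedFile` and the task name string are made up for this example.

```javascript
// `output` is the pattern used for matching, `params.output` the real file name
const props = {
  output: '*.txt',
  params: { output: 'myfile.txt' },
  name: 'Create a named file' // hypothetical task name
}

// Task function: builds the shell command from the params
const createNamedFile = ({ params }) => `touch ${params.output}`

console.log(createNamedFile(props)) // touch myfile.txt

// With bionode-watermill required, the two parts would be combined as:
// const namedFileTask = task(props, createNamedFile)
```

The file created by the command (`myfile.txt`) matches the `*.txt` pattern, which is what lets the task resolve.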

## Using orchestrators

[What are orchestrators?](https://github.com/bionode/bionode-watermill#what-are-orchestrators)

* ### Join

**Join** is an operator that lets you run a number of tasks in a given order. For instance, we may want to create a file and then write to it in two separate steps. But let's first define a new task so we can run it after the task that we called `simpleTask`:

```javascript

const writeToFile = task({

  input: '*.txt', // specifies the pattern of the expected input

  output: '*.txt', // checks if output file matches the specified pattern

  name: 'Write to file' //defines the name of the task

}, ({ input }) => `echo "some string" >> ${input}`

)

```

So, the task `writeToFile` writes "some string" to the file that we have just created in the task `simpleTask`. However, to do so, the file must be created first, and only then can we write something to it.

To achieve this we use `join`. Before building the pipeline, we first need to require **join**:

```javascript

// === WATERMILL ===

const {

  task,

  join

} = require('bionode-watermill')

```

And then,

```javascript

// this is a kiss example of how join works

const pipeline = join(simpleTask, writeToFile)

//executes the join itself

pipeline()

```

This operation will generate two directories inside the `data` folder: one for the first task (`simpleTask`), which creates a new file called `test_file.txt`, and one for the second task (`writeToFile`), which creates a symlink to `test_file.txt` and writes to it, since we have indicated that we want to write to the same file as the input. Note that, once again, the files will be inside directories named `data/<uid>/` (but in this case you will have two directories with distinct uids).

The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_join.js).

You can test the above example by running: `node simple_join.js`
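The ordering that **join** gives you can be pictured with plain promises. The sketch below is only a toy model and says nothing about how bionode-watermill is implemented: `fakeTask` and `runInOrder` are hypothetical helpers, and the fake tasks just record their name instead of running shell commands.

```javascript
// Records the order in which the fake tasks finish
const log = []

// Toy stand-in for a task: returns a function that resolves after a short delay
const fakeTask = (name) => () =>
  new Promise((resolve) => setTimeout(() => {
    log.push(name)
    resolve(name)
  }, 10))

// Hypothetical helper: start each task only after the previous one has finished
const runInOrder = (...tasks) => () =>
  tasks.reduce((previous, next) => previous.then(() => next()), Promise.resolve())

const done = runInOrder(fakeTask('simpleTask'), fakeTask('writeToFile'))()
  .then(() => {
    console.log(log) // [ 'simpleTask', 'writeToFile' ]
    return log
  })
```

The second fake task never starts before the first one has resolved, which is the guarantee `writeToFile` needs from `simpleTask`.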

* ### Junction

Unlike **join**, **junction** lets you run multiple tasks in parallel. However, we will have to create a new task: if we simply replace **join** with **junction** in the previous pipeline, we will end up with a file named `test_file.txt` with nothing written inside. If you create the file and write to it at the same time, the write won't work, although the file will be created.

But first, don't forget to:

 ```javascript

 // === WATERMILL ===

 const {

   task,

   join,

   junction

 } = require('bionode-watermill')

 ```

 And only then:

 ```javascript

 // this will not produce the file with text in it!

const pipeline = junction(simpleTask, writeToFile)

```

So, we will define a new simple task:

```javascript

const writeAnotherFile = task({
  output: '*.file', // checks if output file matches the specified pattern
  params: 'another_test_file.file', // defines parameters to be passed to the task function
  name: 'Yet another task' // defines the name of the task
}, ({ params }) => `touch ${params} | echo "some new string" >> ${params}`
)

```

And then execute the new pipeline:

```javascript

// this is a kiss example of how junction works

const pipeline = junction(

  join(simpleTask, writeToFile), /* these joined tasks will be executed at the
  same time as the task below */

  writeAnotherFile

)

//executes the pipeline itself

pipeline()

```

This new pipeline consists of creating two files and writing text to them. Note that the `writeAnotherFile` task uses a shell pipe (`|`) along with the shell commands `touch` and `echo`. That is a feature that bionode-watermill also supports. Of course, these are simple tasks that can be performed with shell commands alone (they are merely illustrative). Alternatively, you can use JavaScript **callback** functions or **promises** as the final return of a **task**.

 

Nevertheless, if you browse to the `data` folder, you should find three folders (because you have three tasks): one with the text file generated in the first task, another with a symlink to the file from the first task (which was used to write to this file), and a third with the file generated and written in the third task (named `another_test_file.file`).

The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_junction.js).

You can test the above example by running: `node simple_junction.js`
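In contrast to the ordered behaviour of **join**, running tasks "at the same time" can be pictured with `Promise.all`. Again, this is a toy model under stated assumptions and not bionode-watermill's implementation: `fakeTask` and `runTogether` are hypothetical helpers, and the delays (30 ms and 10 ms) are arbitrary.

```javascript
// Records when each fake task starts and ends
const events = []

// Toy stand-in for a task that takes `ms` milliseconds to finish
const fakeTask = (name, ms) => () => {
  events.push(`start ${name}`)
  return new Promise((resolve) => setTimeout(() => {
    events.push(`end ${name}`)
    resolve(name)
  }, ms))
}

// Hypothetical helper: start every task immediately and wait for all of them
const runTogether = (...tasks) => () => Promise.all(tasks.map((task) => task()))

const finished = runTogether(
  fakeTask('join(simpleTask, writeToFile)', 30),
  fakeTask('writeAnotherFile', 10)
)().then((results) => {
  console.log(events)
  return results
})
```

Both "start" events are recorded before any "end" event, so neither branch waits for the other. This is also why a task that depends on another task's file should not sit next to it in a **junction**.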

* ### Fork

While **junction** handles two or more tasks at the same time, **fork** lets you pass the output of two or more different tasks to the next task. Imagine you have two different files being generated in two different tasks and want to process them using the same task in the next step. In this case bionode-watermill uses **fork** to split the pipeline into two distinct branches that are then processed independently.

If you have something like:

 ```javascript 

 join(

   taskA,

   fork(taskB, taskC),

   taskD

 )

 ```

This will result in something like `taskA -> taskB -> taskD'` and `taskA -> taskC -> taskD''`, with two distinct final outputs for the pipeline. This is quite a useful feature for benchmarking programs, or for running multiple programs that perform the same type of analysis and comparing their results.

Importantly, the same type of pipeline with **junction** instead of **fork**,

   ```javascript 

   join(

     taskA,

     junction(taskB, taskC),

     taskD

   )

   ```

would result in the following workflow: `taskA -> taskB, taskC -> taskD`, where taskD has only one final result.
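The way a **fork** multiplies the downstream tasks can be sketched with plain arrays. This is a simplified, hypothetical model written for this tutorial (tasks are just strings, and `fork` and `expandBranches` here are local helpers, not the ones exported by bionode-watermill):

```javascript
// Simplified model: a fork step is an object listing alternative tasks
const fork = (...alternatives) => ({ fork: alternatives })

// Expand a joined list of steps into one linear branch per fork alternative
const expandBranches = (steps) =>
  steps.reduce(
    (branches, step) =>
      step && step.fork
        ? branches.flatMap((branch) => step.fork.map((alt) => [...branch, alt]))
        : branches.map((branch) => [...branch, step]),
    [[]]
  )

const branches = expandBranches(['taskA', fork('taskB', 'taskC'), 'taskD'])
branches.forEach((branch) => console.log(branch.join(' -> ')))
// taskA -> taskB -> taskD
// taskA -> taskC -> taskD
```

Each branch carries its own copy of `taskD`, which corresponds to the two distinct final outputs described above.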

    

But enough talk, let's get to work!

First:

  

  ```javascript

  // === WATERMILL ===

  const {

    task,

    join,

    fork

  } = require('bionode-watermill')

  ```

 

For the fork tutorial, two tasks will be defined. Each of them creates a file and writes to it:

 

 ```javascript

const simpleTask1 = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'test_file.txt', // defines parameters to be passed to the task function
  name: 'task1: creating file 1' // defines the name of the task
}, ({ params }) => `touch ${params} | echo "this is a string from first file" >> ${params}`
)

const simpleTask2 = task({
  output: '*.txt', // checks if output file matches the specified pattern
  params: 'another_test_file.txt', // defines parameters to be passed to the task function
  name: 'task 2: creating file 2' // defines the name of the task
}, ({ params }) => `touch ${params} | echo "this is a string from second file" >> ${params}`
)

```

Then, a task to be performed after the fork, which will add the same text to 

these files:

```javascript

const appendFiles = task({

    input: '*.txt', // specifies the pattern of the expected input

    output: '*.txt', // checks if output file matches the specified pattern

    name: 'Write to files' //defines the name of the task

  }, ({ input }) => `echo "after fork string" >> ${input}`

)

```

And finally our pipeline execution:

```javascript

// this is a kiss example of how fork works

const pipeline = join(

  fork(simpleTask1, simpleTask2),

  appendFiles

)

//executes the pipeline itself

pipeline()

```

This should result in four output directories in our `data` folder. Notice that, unlike **junction**, where three tasks produce three output directories, with **fork** our pipeline produces four output directories, because the outputs from `simpleTask1` and `simpleTask2` were both processed by the task `appendFiles`.

The above example is available [here](https://github.com/bionode/bionode-watermill-tutorial/blob/master/simple_fork.js).

You can test the above example by running: `node simple_fork.js`

 

## Useful links

* [How to require bionode-watermill inside my project?](https://github.com/bionode/GSoC17/blob/master/notes/running_watermill.md)

* [Prefer javascript standard syntax? Then use the following URL](https://github.com/bionode/bionode-watermill-tutorial/tree/master/js_standard_tutorial)

* [Is this not challenging enough? Then try our other example pipelines](https://github.com/bionode/bionode-watermill/tree/master/examples/pipelines)

    * [A pipeline to perform mapping with bowtie and bwa in parallel](https://github.com/bionode/bionode-watermill/tree/master/examples/pipelines/two-mappers)