{"id":13424934,"url":"https://github.com/grailbio/reflow","last_synced_at":"2025-03-15T18:36:10.244Z","repository":{"id":46277635,"uuid":"107628620","full_name":"grailbio/reflow","owner":"grailbio","description":"A language and runtime for distributed, incremental data processing in the cloud","archived":false,"fork":false,"pushed_at":"2023-10-18T23:58:09.000Z","size":7300,"stargazers_count":965,"open_issues_count":28,"forks_count":52,"subscribers_count":47,"default_branch":"master","last_synced_at":"2024-10-26T23:55:56.652Z","etag":null,"topics":["analysis-pipeline","aws","bioinformatics-pipeline","cloud-computing","data-science","golang","language","runtime","scientific-computing"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/grailbio.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-10-20T03:37:20.000Z","updated_at":"2024-09-22T22:54:44.000Z","dependencies_parsed_at":"2023-10-20T20:05:00.135Z","dependency_job_id":null,"html_url":"https://github.com/grailbio/reflow","commit_stats":{"total_commits":1128,"total_committers":44,"mean_commits":"25.636363636363637","dds":0.625,"last_synced_commit":"90deddd72f8f1b489cab0812e2827c299f77dd19"},"previous_names":[],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Freflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Freflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Freflow/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/grailbio%2Freflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/grailbio","download_url":"https://codeload.github.com/grailbio/reflow/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243775957,"owners_count":20346298,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis-pipeline","aws","bioinformatics-pipeline","cloud-computing","data-science","golang","language","runtime","scientific-computing"],"created_at":"2024-07-31T00:01:00.967Z","updated_at":"2025-03-15T18:36:05.231Z","avatar_url":"https://github.com/grailbio.png","language":"Go","readme":"![Reflow](reflow.svg)\n\n[![Gitter](https://badges.gitter.im/Join%20Chat.svg)](https://gitter.im/grailbio/reflow?utm_source=badge\u0026utm_medium=badge\u0026utm_campaign=pr-badge\u0026utm_content=badge) [![Build Status](https://travis-ci.org/grailbio/reflow.svg?branch=master)](https://travis-ci.org/grailbio/reflow)\n\nReflow is a system for incremental data processing in the cloud.\nReflow enables scientists and engineers to compose existing tools\n(packaged in Docker images) using ordinary programming constructs.\nReflow then evaluates these programs in a cloud environment,\ntransparently parallelizing work and memoizing results. 
Reflow was\ncreated at [GRAIL](http://grail.com/) to manage our NGS (next\ngeneration sequencing) bioinformatics workloads on\n[AWS](https://aws.amazon.com), but has also been used for many other\napplications, including model training and ad-hoc data analyses.\n\nReflow comprises:\n\n- a functional, lazy, type-safe domain-specific language for writing workflow programs;\n- a runtime for evaluating Reflow programs [incrementally](https://en.wikipedia.org/wiki/Incremental_computing), coordinating cluster execution, and performing transparent memoization;\n- a cluster scheduler to dynamically provision and tear down resources from a cloud provider (AWS currently supported).\n\nReflow thus allows scientists and engineers to write straightforward\nprograms and then have them transparently executed in a cloud\nenvironment. Programs are automatically parallelized and distributed\nacross multiple machines, and redundant computations (even across\nruns and users) are eliminated by its memoization cache. Reflow\nevaluates its programs\n[incrementally](https://en.wikipedia.org/wiki/Incremental_computing):\nwhenever the input data or program changes, only those outputs that\ndepend on the changed data or code are recomputed.\n\nIn addition to the default cluster computing mode, Reflow programs\ncan also be run locally, making use of the local machine's Docker\ndaemon (including Docker for Mac).\n\nReflow was designed to support sophisticated, large-scale\nbioinformatics workflows, but should be widely applicable to\nscientific and engineering computing workloads. 
It was built \nusing [Go](https://golang.org).\n\nReflow joins a [long\nlist](https://github.com/pditommaso/awesome-pipeline) of systems\ndesigned to tackle bioinformatics workloads, but differs from these in\nimportant ways:\n\n- it is a vertically integrated system with a minimal set of external dependencies; this allows Reflow to be \"plug-and-play\": bring your cloud credentials, and you're off to the races;\n- it defines a strict data model which is used for transparent memoization and other optimizations;\n- it takes workflow software seriously: the Reflow DSL provides type checking, modularity, and other constructs that are commonplace in general purpose programming languages;\n- because of its high level data model and use of caching, Reflow computes [incrementally](https://en.wikipedia.org/wiki/Incremental_computing): it is always able to compute the smallest set of operations given what has been computed previously.\n\n## Table of Contents\n\n- [Quickstart - AWS](#quickstart---aws)\n- [Simple bioinformatics workflow](#simple-bioinformatics-workflow)\n- [1000align](#1000align)\n- [Documentation](#documentation)\n- [Developing and building Reflow](#developing-and-building-reflow)\n- [Debugging Reflow runs](#debugging-reflow-runs)\n- [A note on Reflow's EC2 cluster manager](#a-note-on-reflows-ec2-cluster-manager)\n- [Setting up a TaskDB](#setting-up-a-taskdb)\n- [Support and community](#support-and-community)\n\n\n## Getting Reflow\n\nYou can get binaries (macOS/amd64, Linux/amd64) for the latest\nrelease at the [GitHub release\npage](https://github.com/grailbio/reflow/releases).\n\nIf you are developing Reflow,\nor would like to build it yourself,\nplease follow the instructions in the section\n\"[Developing and building Reflow](#developing-and-building-reflow).\"\n\n## Quickstart - AWS\n\nReflow is distributed with an EC2 cluster manager, and a memoization\ncache implementation based on S3. These must be configured before\nuse. 
Reflow maintains a configuration file in `$HOME/.reflow/config.yaml`\nby default (this can be overridden with the `-config` option). Reflow's\nsetup commands modify this file directly. After each step, the current\nconfiguration can be examined by running `reflow config`.\n\nNote that Reflow must have access to AWS credentials and configuration in the\nenvironment (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`) while\nrunning these commands.\n\n\t% reflow setup-ec2\n\t% reflow config\n\tcluster: ec2cluster\n\tec2cluster:\n\t  securitygroup: \u003ca newly created security group here\u003e\n\t  maxpendinginstances: 5\n\t  maxhourlycostusd: 10\n\t  disktype: gp3\n\t  instancesizetodiskspace:\n\t    2xlarge: 300\n\t    3xlarge: 300\n\t    ...\n\t  diskslices: 0\n\t  ami: \"\"\n\t  sshkey: []\n\t  keyname: \"\"\n\t  cloudconfig: {}\n\t  instancetypes:\n\t  - c3.2xlarge\n\t  - c3.4xlarge\n\t  ...\n\treflow: reflowversion,version=\u003chash\u003e\n\ttls: tls,file=$HOME/.reflow/reflow.pem\n\nAfter running `reflow setup-ec2`, we see that Reflow created a new\nsecurity group (associated with the account's default VPC), and\nconfigured the cluster to use some default settings. Feel free to\nedit the configuration file (`$HOME/.reflow/config.yaml`) to your\ntaste. If you want to use spot instances, add a new key under `ec2cluster`:\n`spot: true`.\n\nReflow only configures one security group per account: Reflow will reuse\na previously created security group if `reflow setup-ec2` is run anew.\nSee `reflow setup-ec2 -help` for more details.\n\nNext, we'll set up a cache. This isn't strictly necessary, but we'll\nneed it in order to use many of Reflow's sophisticated caching and\nincremental computation features. On AWS, Reflow implements a cache\nbased on S3 and DynamoDB. 
A new S3-based cache is provisioned by\n`reflow setup-s3-repository` and `reflow setup-dynamodb-assoc`, each\nof which takes one argument naming the S3 bucket and DynamoDB table\nname to be used, respectively. The S3 bucket is used to store file\nobjects while the DynamoDB table is used to store associations\nbetween logically named computations and their concrete output. Note\nthat S3 bucket names are global, so pick a name that's likely to be\nunique.\n\n\t% reflow setup-s3-repository reflow-quickstart-cache\n\treflow: creating s3 bucket reflow-quickstart-cache\n\treflow: created s3 bucket reflow-quickstart-cache\n\t% reflow setup-dynamodb-assoc reflow-quickstart\n\treflow: creating DynamoDB table reflow-quickstart\n\treflow: created DynamoDB table reflow-quickstart\n\t% reflow config\n\tassoc: dynamodb,table=reflow-quickstart\n\trepository: s3,bucket=reflow-quickstart-cache\n\n\t\u003crest is same as before\u003e\n\nThe setup commands created the S3 bucket and DynamoDB table as\nneeded, and modified the configuration accordingly.\n\nAdvanced users can also optionally [setup a taskdb](#setting-up-a-taskdb).\n\nWe're now ready to run our first \"hello world\" program!\n\nCreate a file called \"hello.rf\" with the following contents:\n\n\tval Main = exec(image := \"ubuntu\", mem := GiB) (out file) {\"\n\t\techo hello world \u003e\u003e{{out}}\n\t\"}\n\nand run it:\n\n\t% reflow run hello.rf\n\treflow: run ID: 6da656d1\n\t\tec2cluster: 0 instances:  (\u003c=$0.0/hr), total{}, waiting{mem:1.0GiB cpu:1 disk:1.0GiB\n\treflow: total n=1 time=0s\n\t\t\tident      n   ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)\n\t\t\thello.Main 1   1      0B\n\n\ta948904f\n\nHere, Reflow started a new `t2.small` instance (Reflow matches the workload with\navailable instance types), ran `echo hello world` inside of an Ubuntu container,\nplaced the output in a file, and returned its SHA256 digest. 
(Reflow represents\nfile contents using their SHA256 digest.)\n\nWe're now ready to explore Reflow more fully.\n\n## Simple bioinformatics workflow\n\nLet's explore some of Reflow's features through a simple task:\naligning NGS read data from the 1000genomes project. Create\na file called \"align.rf\" with the following. The code is commented\ninline for clarity.\n\n\t// In order to align raw NGS data, we first need to construct an index\n\t// against which to perform the alignment. We're going to be using\n\t// the BWA aligner, and so we'll need to retrieve a reference sequence\n\t// and create an index that's usable from BWA.\n\n\t// g1kv37 is a human reference FASTA sequence. (All\n\t// chromosomes.) Reflow has a static type system, but most type\n\t// annotations can be omitted: they are inferred by Reflow. In this\n\t// case, we're creating a file: a reference to the contents of the\n\t// named URL. We're retrieving data from the public 1000genomes S3\n\t// bucket.\n\tval g1kv37 = file(\"s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz\")\n\t\n\t// Here we create an indexed version of the g1kv37 reference. It is\n\t// created using the \"bwa index\" command with the raw FASTA data as\n\t// input. Here we encounter another way to produce data in reflow:\n\t// the exec. An exec runs a (Bash) script inside of a Docker image,\n\t// placing the output in files or directories (or both: execs can\n\t// return multiple values). In this case, we're returning a\n\t// directory since BWA stores multiple index files alongside the raw\n\t// reference. We also declare that the image to be used is\n\t// \"biocontainers/bwa\" (the BWA image maintained by the\n\t// biocontainers project).\n\t//\n\t// Inside of an exec template (delimited by {\" and \"}) we refer to\n\t// (interpolate) values in our environment by placing expressions\n\t// inside of the {{ and }} delimiters. 
In this case we're referring\n\t// to the file g1kv37 declared above, and our output, named out.\n\t//\n\t// Many types of expressions can be interpolated inside of an exec,\n\t// for example strings, integers, files, and directories. Strings\n\t// and integers are rendered using their normal representation,\n\t// files and directories are materialized to a local path before\n\t// starting execution. Thus, in this case, {{g1kv37}} is replaced at\n\t// runtime by a path on disk with a file with the contents of the\n\t// file g1kv37 (i.e.,\n\t// s3://1000genomes/technical/reference/human_g1k_v37.fasta.gz)\n\tval reference = exec(image := \"biocontainers/bwa:v0.7.15_cv3\", mem := 6*GiB, cpu := 1) (out dir) {\"\n\t\t# Ignore failures here. The file from 1000genomes has a trailer\n\t\t# that isn't recognized by gunzip. (This is not recommended practice!)\n\t\tgunzip -c {{g1kv37}} \u003e {{out}}/g1k_v37.fa || true\n\t\tcd {{out}}\n\t\tbwa index -a bwtsw g1k_v37.fa\n\t\"}\n\t\n\t// Now that we have defined a reference, we can define a function to\n\t// align a pair of reads against the reference, producing an output\n\t// SAM-formatted file. Functions compute expressions over a set of\n\t// abstract parameters, in this case, a pair of read files. Unlike almost\n\t// everywhere else in Reflow, function parameters must be explicitly\n\t// typed.\n\t//\n\t// (Note that we're using a syntactic short-hand here: parameter lists can \n\t// be abbreviated. 
\"r1, r2 file\" is equivalent to \"r1 file, r2 file\".)\n\t//\n\t// The implementation of align is a straightforward invocation of \"bwa mem\".\n\t// Note that \"r1\" and \"r2\" inside of the exec refer to the function arguments,\n\t// thus align can be invoked for any set of r1, r2.\n\tfunc align(r1, r2 file) = \n\t\texec(image := \"biocontainers/bwa:v0.7.15_cv3\", mem := 20*GiB, cpu := 16) (out file) {\"\n\t\t\tbwa mem -M -t 16 {{reference}}/g1k_v37.fa {{r1}} {{r2}} \u003e {{out}}\n\t\t\"}\n\n\t// We're ready to test our workflow now. We pick an arbitrary read\n\t// pair from the 1000genomes data set, and invoke align. There are a\n\t// few things of note here. First is the identifier \"Main\". This\n\t// names the expression that's evaluated by `reflow run` -- the\n\t// entry point of the computation. Second, we've defined Main to be\n\t// a block. A block is an expression that contains one or more\n\t// definitions followed by an expression. The value of a block is the\n\t// final expression. Finally, Main contains a @requires annotation.\n\t// This instructs Reflow how many resources to reserve for the work\n\t// being done. Note that, because Reflow is able to distribute work,\n\t// if a single instance is too small to execute fully in parallel,\n\t// Reflow will provision additional compute instances to help along.\n\t// @requires thus denotes the smallest possible instance\n\t// configuration that's required for the program.\n\t@requires(cpu := 16, mem := 24*GiB, disk := 50*GiB)\t\n\tval Main = {\n\t\tr1 := file(\"s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_1.filt.fastq.gz\")\n\t\tr2 := file(\"s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_2.filt.fastq.gz\")\n\t\talign(r1, r2)\n\t}\n\nNow we're ready to run our module. First, let's run `reflow doc`.\nThis does two things. First, it typechecks the module (and any\ndependent modules), and second, it prints documentation for the\npublic declarations in the module. 
Identifiers that begin with an\nuppercase letter are public (and may be used from other modules);\nothers are not.\n\n\t% reflow doc align.rf\n\tDeclarations\n\t\n\tval Main (out file)\n\t    We're ready to test our workflow now. We pick an arbitrary read pair from the\n\t    1000genomes data set, and invoke align. There are a few things of note here.\n\t    First is the identifier \"Main\". This names the expression that's evaluated by\n\t    `reflow run` -- the entry point of the computation. Second, we've defined Main\n\t    to be a block. A block is an expression that contains one or more definitions\n\t    followed by an expression. The value of a block is the final expression. Finally,\n\t    Main contains a @requires annotation. This instructs Reflow how many resources\n\t    to reserve for the work being done. Note that, because Reflow is able to\n\t    distribute work, if a single instance is too small to execute fully in parallel,\n\t    Reflow will provision additional compute instances to help along. @requires thus\n\t    denotes the smallest possible instance configuration that's required for the\n\t    program.\n\nThen let's run it:\n\n\t% reflow run align.rf\n\treflow: run ID: 82e63a7a\n\tec2cluster: 1 instances: c5.4xlarge:1 (\u003c=$0.7/hr), total{mem:29.8GiB cpu:16 disk:250.0GiB intel_avx512:16}, waiting{}, pending{}\n\t82e63a7a: elapsed: 2m30s, executing:1, completed: 3/5\n\t  align.reference:  exec ..101f9a082e1679c16d23787c532a0107537c9c # Ignore failures here. The f..bwa index -a bwtsw g1k_v37.fa  2m4s\n\nReflow launched a new instance: the previously launched instance (a\n`t2.small`) was not big enough to fit the requirements of align.rf.\nNote also that Reflow assigns a run name for each `reflow run`\ninvocation. This can be used to look up run details with the `reflow\ninfo` command. 
In this case:\n\n\t% reflow info 82e63a7a\n\t82e63a7aee201d137f8ade3d584c234b856dc6bdeba00d5d6efc9627bd988a68 (run)\n\t    time:      Wed Dec 12 10:45:04 2018\n\t    program:   /Users/you/align.rf\n\t    phase:     Eval\n\t    alloc:     ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1\n\t    resources: {mem:28.9GiB cpu:16 disk:245.1GiB intel_avx:16 intel_avx2:16 intel_avx512:16}\n\t    log:       /Users/you/.reflow/runs/82e63a7aee201d137f8ade3d584c234b856dc6bdeba00d5d6efc9627bd988a68.execlog\n\nHere we see that the run is currently being performed on the alloc named\n`ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1`.\nAn alloc is a resource reservation on a single machine. A run can\nmake use of multiple allocs to distribute work across multiple\nmachines. The alloc is a URI, and the first component is the real \nhostname. You can ssh into the host in order to inspect what's going on.\nReflow launched the instance with your public SSH key (as long as it was\nset up by `reflow setup-ec2`, and `$HOME/.ssh/id_rsa.pub` existed at that time).\n\n\t% ssh core@ec2-34-213-42-76.us-west-2.compute.amazonaws.com\n\t...\n\nAs the run progresses, Reflow prints execution status of each task on the\nconsole.\n\n\t...\n\talign.Main.r2:    intern s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_2.filt.fastq.gz                         23s\n\talign.Main.r1:    intern done 1.8GiB                                                                                          23s\n\talign.g1kv37:     intern done 851.0MiB                                                                                        23s\n\talign.reference:  exec ..101f9a082e1679c16d23787c532a0107537c9c # Ignore failures here. The f..bwa index -a bwtsw g1k_v37.fa  6s\n\nHere, Reflow started downloading r1 and r2 in parallel with creating the reference.\nCreating the reference is an expensive operation. 
We can examine it while it's running\nwith `reflow ps`:\n\n\t% reflow ps \n\t3674721e align.reference 10:46AM 0:00 running 4.4GiB 1.0 6.5GiB bwa\n\nThis tells us that the only task that's currently running is bwa to compute the reference.\nIt's currently using 4.4GiB of memory, 1 core, and 6.5GiB of disk space. By passing the -l\noption, reflow ps also prints the task's exec URI.\n\n\t% reflow ps -l\n\t3674721e align.reference 10:46AM 0:00 running 4.4GiB 1.0 6.5GiB bwa ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e\n\nAn exec URI is a handle to the actual task being executed. It\nglobally identifies all tasks, and can be examined with `reflow info`:\n\n\t% reflow info ec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e\n\tec2-34-213-42-76.us-west-2.compute.amazonaws.com:9000/5a0adaf6c879efb1/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e (exec)\n\t    state: running\n\t    type:  exec\n\t    ident: align.reference\n\t    image: index.docker.io/biocontainers/bwa@sha256:0529e39005e35618c4e52f8f56101f9a082e1679c16d23787c532a0107537c9c\n\t    cmd:   \"\\n\\t# Ignore failures here. The file from 1000genomes has a trailer\\n\\t# that isn't recognized by gunzip. 
(This is not recommended practice!)\\n\\tgunzip -c {{arg[0]}} \u003e {{arg[1]}}/g1k_v37.fa || true\\n\\tcd {{arg[2]}}\\n\\tbwa index -a bwtsw g1k_v37.fa\\n\"\n\t      arg[0]:\n\t        .: sha256:8b6c538abf0dd92d3f3020f36cc1dd67ce004ffa421c2781205f1eb690bdb442 (851.0MiB)\n\t      arg[1]: output 0\n\t      arg[2]: output 0\n\t    top:\n\t         bwa index -a bwtsw g1k_v37.fa\n\nHere, Reflow tells us that the currently running process is \"bwa\nindex...\", its template command, and the SHA256 digest of its inputs.\nPrograms often print helpful output to standard error while working;\nthis output can be examined with `reflow logs`:\n\n\t% reflow logs ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e\n\t\n\tgzip: /arg/0/0: decompression OK, trailing garbage ignored\n\t[bwa_index] Pack FASTA... 18.87 sec\n\t[bwa_index] Construct BWT for the packed sequence...\n\t[BWTIncCreate] textLength=6203609478, availableWord=448508744\n\t[BWTIncConstructFromPacked] 10 iterations done. 99999990 characters processed.\n\t[BWTIncConstructFromPacked] 20 iterations done. 199999990 characters processed.\n\t[BWTIncConstructFromPacked] 30 iterations done. 299999990 characters processed.\n\t[BWTIncConstructFromPacked] 40 iterations done. 399999990 characters processed.\n\t[BWTIncConstructFromPacked] 50 iterations done. 499999990 characters processed.\n\t[BWTIncConstructFromPacked] 60 iterations done. 599999990 characters processed.\n\t[BWTIncConstructFromPacked] 70 iterations done. 699999990 characters processed.\n\t[BWTIncConstructFromPacked] 80 iterations done. 799999990 characters processed.\n\t[BWTIncConstructFromPacked] 90 iterations done. 899999990 characters processed.\n\t[BWTIncConstructFromPacked] 100 iterations done. 999999990 characters processed.\n\t[BWTIncConstructFromPacked] 110 iterations done. 1099999990 characters processed.\n\t[BWTIncConstructFromPacked] 120 iterations done. 
1199999990 characters processed.\n\t[BWTIncConstructFromPacked] 130 iterations done. 1299999990 characters processed.\n\t[BWTIncConstructFromPacked] 140 iterations done. 1399999990 characters processed.\n\t[BWTIncConstructFromPacked] 150 iterations done. 1499999990 characters processed.\n\t[BWTIncConstructFromPacked] 160 iterations done. 1599999990 characters processed.\n\t[BWTIncConstructFromPacked] 170 iterations done. 1699999990 characters processed.\n\t[BWTIncConstructFromPacked] 180 iterations done. 1799999990 characters processed.\n\t[BWTIncConstructFromPacked] 190 iterations done. 1899999990 characters processed.\n\t[BWTIncConstructFromPacked] 200 iterations done. 1999999990 characters processed.\n\t[BWTIncConstructFromPacked] 210 iterations done. 2099999990 characters processed.\n\t[BWTIncConstructFromPacked] 220 iterations done. 2199999990 characters processed.\n\t[BWTIncConstructFromPacked] 230 iterations done. 2299999990 characters processed.\n\t[BWTIncConstructFromPacked] 240 iterations done. 2399999990 characters processed.\n\t[BWTIncConstructFromPacked] 250 iterations done. 2499999990 characters processed.\n\t[BWTIncConstructFromPacked] 260 iterations done. 2599999990 characters processed.\n\t[BWTIncConstructFromPacked] 270 iterations done. 2699999990 characters processed.\n\t[BWTIncConstructFromPacked] 280 iterations done. 2799999990 characters processed.\n\t[BWTIncConstructFromPacked] 290 iterations done. 2899999990 characters processed.\n\t[BWTIncConstructFromPacked] 300 iterations done. 2999999990 characters processed.\n\t[BWTIncConstructFromPacked] 310 iterations done. 3099999990 characters processed.\n\t[BWTIncConstructFromPacked] 320 iterations done. 3199999990 characters processed.\n\t[BWTIncConstructFromPacked] 330 iterations done. 3299999990 characters processed.\n\t[BWTIncConstructFromPacked] 340 iterations done. 3399999990 characters processed.\n\t[BWTIncConstructFromPacked] 350 iterations done. 
3499999990 characters processed.\n\t[BWTIncConstructFromPacked] 360 iterations done. 3599999990 characters processed.\n\t[BWTIncConstructFromPacked] 370 iterations done. 3699999990 characters processed.\n\t[BWTIncConstructFromPacked] 380 iterations done. 3799999990 characters processed.\n\t[BWTIncConstructFromPacked] 390 iterations done. 3899999990 characters processed.\n\t[BWTIncConstructFromPacked] 400 iterations done. 3999999990 characters processed.\n\t[BWTIncConstructFromPacked] 410 iterations done. 4099999990 characters processed.\n\t[BWTIncConstructFromPacked] 420 iterations done. 4199999990 characters processed.\n\t[BWTIncConstructFromPacked] 430 iterations done. 4299999990 characters processed.\n\t[BWTIncConstructFromPacked] 440 iterations done. 4399999990 characters processed.\n\t[BWTIncConstructFromPacked] 450 iterations done. 4499999990 characters processed.\n\nAt this point, it looks like everything is running as expected.\nThere's not much more to do than wait. Note that, while creating an\nindex takes a long time, Reflow only has to compute it once. When\nit's done, Reflow memoizes the result, uploading the resulting data\ndirectly to the configured S3 cache bucket. The next time the\nreference expression is encountered, Reflow will use the previously\ncomputed result. If the input file changes (e.g., we decide to use\nanother reference sequence), Reflow will recompute the index again.\nThe same will happen if the command (or Docker image) that's used to\ncompute the index changes. Reflow keeps track of all the dependencies\nfor a particular sub computation, and recomputes them only when\ndependencies have changed. This way, we always know what is being\ncomputed is correct (the result is the same as if we had computed the\nresult from scratch), but avoid paying the cost of redundant\ncomputation.\n\nAfter a little while, the reference will have finished generating,\nand Reflow begins alignment. 
Here, Reflow reports that the reference\ntook 52 minutes to compute, and produced 8 GiB of output.\n\n      align.reference:  exec done 8.0GiB                                                                                            52m37s\n      align.align:      exec ..101f9a082e1679c16d23787c532a0107537c9c bwa mem -M -t 16 {{reference}..37.fa {{r1}} {{r2}} \u003e {{out}}  4s\n\nIf we query (\"info\") the reference exec again, Reflow reports precisely what\nwas produced:\n\n\t% reflow info ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e\n\tec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/3674721e2d9e80b325934b08973fb3b1d3028b2df34514c9238be466112eb86e (exec)\n\t    state: complete\n\t    type:  exec\n\t    ident: align.reference\n\t    image: index.docker.io/biocontainers/bwa@sha256:0529e39005e35618c4e52f8f56101f9a082e1679c16d23787c532a0107537c9c\n\t    cmd:   \"\\n\\t# Ignore failures here. The file from 1000genomes has a trailer\\n\\t# that isn't recognized by gunzip. 
(This is not recommended practice!)\\n\\tgunzip -c {{arg[0]}} \u003e {{arg[1]}}/g1k_v37.fa || true\\n\\tcd {{arg[2]}}\\n\\tbwa index -a bwtsw g1k_v37.fa\\n\"\n\t      arg[0]:\n\t        .: sha256:8b6c538abf0dd92d3f3020f36cc1dd67ce004ffa421c2781205f1eb690bdb442 (851.0MiB)\n\t      arg[1]: output 0\n\t      arg[2]: output 0\n\t    result:\n\t      list[0]:\n\t        g1k_v37.fa:     sha256:2f9cd9e853a9284c53884e6a551b1c7284795dd053f255d630aeeb114d1fa81f (2.9GiB)\n\t        g1k_v37.fa.amb: sha256:dd51a07041a470925c1ebba45c2f534af91d829f104ade8fc321095f65e7e206 (6.4KiB)\n\t        g1k_v37.fa.ann: sha256:68928e712ef48af64c5b6a443f2d2b8517e392ae58b6a4ab7191ef7da3f7930e (6.7KiB)\n\t        g1k_v37.fa.bwt: sha256:2aec938930b8a2681eb0dfbe4f865360b98b2b6212c1fb9f7991bc74f72d79d8 (2.9GiB)\n\t        g1k_v37.fa.pac: sha256:d62039666da85d859a29ea24af55b3c8ffc61ddf02287af4d51b0647f863b94c (739.5MiB)\n\t        g1k_v37.fa.sa:  sha256:99eb6ff6b54fba663c25e2642bb2a6c82921c931338a7144327c1e3ee99a4447 (1.4GiB)\n\nIn this case, \"bwa index\" produced a number of auxiliary index\nfiles. These are the contents of the \"reference\" directory.\n\nWe can again query Reflow for running execs, and examine the\nalignment. We see now that the reference is passed in (argument 0),\nalongside the read pairs (arguments 1 and 2). 
\n\n\t% reflow ps -l\n\t6a6c36f5 align.align 5:12PM 0:00 running 5.9GiB 12.3 0B  bwa ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/6a6c36f5da6ee387510b0b61d788d7e4c94244d61e6bc621b43f59a73443a755\n\t% reflow info ec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/6a6c36f5da6ee387510b0b61d788d7e4c94244d61e6bc621b43f59a73443a755\n\tec2-34-221-0-157.us-west-2.compute.amazonaws.com:9000/0061a20f88f57386/6a6c36f5da6ee387510b0b61d788d7e4c94244d61e6bc621b43f59a73443a755 (exec)\n\t    state: running\n\t    type:  exec\n\t    ident: align.align\n\t    image: index.docker.io/biocontainers/bwa@sha256:0529e39005e35618c4e52f8f56101f9a082e1679c16d23787c532a0107537c9c\n\t    cmd:   \"\\n\\t\\tbwa mem -M -t 16 {{arg[0]}}/g1k_v37.fa {{arg[1]}} {{arg[2]}} \u003e {{arg[3]}}\\n\\t\"\n\t      arg[0]:\n\t        g1k_v37.fa:     sha256:2f9cd9e853a9284c53884e6a551b1c7284795dd053f255d630aeeb114d1fa81f (2.9GiB)\n\t        g1k_v37.fa.amb: sha256:dd51a07041a470925c1ebba45c2f534af91d829f104ade8fc321095f65e7e206 (6.4KiB)\n\t        g1k_v37.fa.ann: sha256:68928e712ef48af64c5b6a443f2d2b8517e392ae58b6a4ab7191ef7da3f7930e (6.7KiB)\n\t        g1k_v37.fa.bwt: sha256:2aec938930b8a2681eb0dfbe4f865360b98b2b6212c1fb9f7991bc74f72d79d8 (2.9GiB)\n\t        g1k_v37.fa.pac: sha256:d62039666da85d859a29ea24af55b3c8ffc61ddf02287af4d51b0647f863b94c (739.5MiB)\n\t        g1k_v37.fa.sa:  sha256:99eb6ff6b54fba663c25e2642bb2a6c82921c931338a7144327c1e3ee99a4447 (1.4GiB)\n\t      arg[1]:\n\t        .: sha256:0c1f85aa9470b24d46d9fc67ba074ca9695d53a0dee580ec8de8ed46ef347a85 (1.8GiB)\n\t      arg[2]:\n\t        .: sha256:47f5e749123d8dda92b82d5df8e32de85273989516f8e575d9838adca271f630 (1.7GiB)\n\t      arg[3]: output 0\n\t    top:\n\t         /bin/bash -e -l -o pipefail -c ..bwa mem -M -t 16 /arg/0/0/g1k_v37.fa /arg/1/0 /arg/2/0 \u003e /return/0 .\n\t         bwa mem -M -t 16 /arg/0/0/g1k_v37.fa /arg/1/0 /arg/2/0\n\nNote that the read pairs are files. 
Files in Reflow do not have\nnames; they are just blobs of data. When Reflow runs a process that\nrequires input files, those anonymous files are materialized on disk,\nbut the filenames are not meaningful. In this case, we can see from\nthe \"top\" output (these are the actual running processes, as reported\nby the OS), that the r1 ended up being called \"/arg/1/0\" and r2\n\"/arg/2/0\". The output is a file named \"/return/0\".\n\nFinally, alignment is complete. Aligning a single read pair took\naround 19m, and produced 13.2 GiB of output. Upon completion, Reflow\nprints runtime statistics and the result.\n\n\treflow: total n=5 time=1h9m57s\n\t        ident           n   ncache transfer runtime(m) cpu            mem(GiB)    disk(GiB)      tmp(GiB)\n\t        align.align     1   0      0B       17/17/17   15.6/15.6/15.6 7.8/7.8/7.8 12.9/12.9/12.9 0.0/0.0/0.0\n\t        align.Main.r2   1   0      0B\n\t        align.Main.r1   1   0      0B\n\t        align.reference 1   0      0B       51/51/51   1.0/1.0/1.0    4.4/4.4/4.4 6.5/6.5/6.5    0.0/0.0/0.0\n\t        align.g1kv37    1   0      0B\n\t\n\tbecb0485\n\nReflow represents file values by the SHA256 digest of the file's\ncontent. In this case, that's not very useful: you want the file,\nnot its digest. Reflow provides mechanisms to export data. In this\ncase let's copy the resulting file to an S3 bucket.\n\nWe'll make use of the \"files\" system module to copy the aligned file\nto an external S3 bucket. Modify align.rf's `Main` to the following\n(but pick an S3 bucket you own), and then run it again. 
Commentary is\ninline for clarity.\n\n\t@requires(cpu := 16, mem := 24*GiB, disk := 50*GiB)\n\tval Main = {\n\t\tr1 := file(\"s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_1.filt.fastq.gz\")\n\t\tr2 := file(\"s3://1000genomes/phase3/data/HG00103/sequence_read/SRR062640_2.filt.fastq.gz\")\n\t\t// Instantiate the system module \"files\" (system modules begin\n\t\t// with $), assigning its instance to the \"files\" identifier. To\n\t\t// view the documentation for this module, run `reflow doc\n\t\t// $/files`.\n\t\tfiles := make(\"$/files\")\n\t\t// As before.\n\t\taligned := align(r1, r2)\n\t\t// Use the files module's Copy function to copy the aligned file to\n\t\t// the provided destination.\n\t\tfiles.Copy(aligned, \"s3://marius-test-bucket/aligned.sam\")\n\t}\n\nAnd run it again:\n\n\t% reflow run align.rf\n\treflow: run ID: 9f0f3596\n\treflow: total n=2 time=1m9s\n\t        ident         n   ncache transfer runtime(m) cpu mem(GiB) disk(GiB) tmp(GiB)\n\t        align_2.align 1   1      0B\n\t        align_2.Main  1   0      13.2GiB\n\t\n\tval\u003c.=becb0485 13.2GiB\u003e\n\n\nHere we see that Reflow did not need to recompute the aligned file;\nit was instead retrieved from cache. The reference index generation was\nskipped altogether. Status lines that indicate \"xfer\" (instead of\n\"run\") mean that Reflow is performing a cache transfer in place of\nrunning the computation. Reflow claims to have transferred a 13.2 GiB\nfile to `s3://marius-test-bucket/aligned.sam`. Indeed it did:\n\n\t% aws s3 ls s3://marius-test-bucket/aligned.sam\n\t2018-12-13 16:29:49 14196491221 aligned.sam\n\n## 1000align\n\nThis code was modularized and generalized in\n[1000align](https://github.com/grailbio/reflow/tree/master/doc/1000align). Here,\nfastq, bam, and alignment utilities are split into their own\nparameterized modules. The top-level module, 1000align, is\ninstantiated from the command line. 
Command line invocations (`reflow\nrun`) can pass module parameters through flags (strings, booleans,\nand integers):\n\n\t% reflow run 1000align.rf -help\n\tusage of 1000align.rf:\n\t  -out string\n\t    \tout is the target of the output merged BAM file (required)\n\t  -sample string\n\t    \tsample is the name of the 1000genomes phase 3 sample (required)\n\nFor example, to align the full sample from above, we can invoke\n1000align.rf with the following arguments:\n\n\t% reflow run 1000align.rf -sample HG00103 -out s3://marius-test-bucket/HG00103.bam\n\nIn this case, if your account limits allow it, Reflow will launch\nadditional EC2 instances to further parallelize the work, since we're\naligning multiple pairs of FASTQ files.\nIn this run, we can see that Reflow is aligning 5 pairs in parallel\nacross 2 instances (four can fit on the initial m4.16xlarge instance).\n\n\t% reflow ps -l\n\te74d4311 align.align.sam 11:45AM 0:00 running 10.9GiB 31.8 6.9GiB   bwa ec2-34-210-201-193.us-west-2.compute.amazonaws.com:9000/6a7ffa00d6b0d9e1/e74d4311708f1c9c8d3894a06b59029219e8a545c69aa79c3ecfedc1eeb898f6\n\t59c561be align.align.sam 11:45AM 0:00 running 10.9GiB 32.7 6.4GiB   bwa ec2-34-210-201-193.us-west-2.compute.amazonaws.com:9000/6a7ffa00d6b0d9e1/59c561be5f627143108ce592d640126b88c23ba3d00974ad0a3c801a32b50fbe\n\tba688daf align.align.sam 11:47AM 0:00 running 8.7GiB  22.6 2.9GiB   bwa ec2-18-236-233-4.us-west-2.compute.amazonaws.com:9000/ae348d6c8a33f1c9/ba688daf5d50db514ee67972ec5f0a684f8a76faedeb9a25ce3d412e3c94c75c\n\t0caece7f align.align.sam 11:47AM 0:00 running 8.7GiB  25.9 3.4GiB   bwa ec2-18-236-233-4.us-west-2.compute.amazonaws.com:9000/ae348d6c8a33f1c9/0caece7f38dc3d451d2a7411b1fcb375afa6c86a7b0b27ba7dd1f9d43d94f2f9\n\t0b59e00c align.align.sam 11:47AM 0:00 running 10.4GiB 22.9 926.6MiB bwa ec2-18-236-233-4.us-west-2.compute.amazonaws.com:9000/ae348d6c8a33f1c9/0b59e00c848fa91e3b0871c30da3ed7e70fbc363bdc48fb09c3dfd61684c5fd9\n\nWhen it 
completes, an approximately 17GiB BAM file is deposited to s3:\n\n\t% aws s3 ls s3://marius-test-bucket/HG00103.bam\n\t2018-12-14 15:27:33 18761607096 HG00103.bam\n\n## A note on Reflow's EC2 cluster manager\n\nReflow comes with a built-in cluster manager, which is responsible\nfor elastically increasing or decreasing required compute resources.\nThe AWS EC2 cluster manager keeps track of instance type availability\nand account limits, and uses these to launch the most appropriate set\nof instances for a given job. Instances that remain idle for more than\n10 minutes terminate themselves; until then, idle instances are reused\nwhen possible.\n\nThe cluster manager may be configured under the \"ec2cluster\" key in\nReflow's configuration. Its parameters are documented by\n[godoc](https://godoc.org/github.com/grailbio/reflow/ec2cluster#Config).\n(Formal documentation is forthcoming.)\n\n## Setting up a TaskDB\n\nSetting up a TaskDB is entirely optional. The TaskDB stores a record of Reflow runs,\ntheir sub-tasks (mainly `exec`s), the EC2 instances that were instantiated, and so on.\nIt provides the following benefits:\n\n* Tools such as `reflow info` work better, especially in a multi-user environment.\n\n  That is, if you have a single AWS account shared with other users who run `reflow`, then\na TaskDB allows you to monitor and query info about all runs within the account (using `reflow ps`, `reflow info`, etc.)\n* Determine the cost of a particular run (included in the output of `reflow info`)\n* Determine the cost of the cluster (`reflow ps -p` - see documentation using `reflow ps --help`)\n\nThe following command can be used to set up a TaskDB (see its help for details):\n\n\t% reflow setup-taskdb -help\n\nNote that the same DynamoDB table and S3 bucket that were used to set up the cache (see above)\ncould optionally be reused here. 
But note that the TaskDB comes with a cost of its own (it uses DynamoDB),\nand keeping it separate from the cache allows the two costs to be managed independently.\n\nExample:\n\n\t% reflow setup-taskdb \u003ctable_name\u003e \u003cs3_bucket_name\u003e\n\treflow: attempting to create DynamoDB table ...\n\treflow: created DynamoDB table ...\n\treflow: waiting for table to become active; current status: CREATING\n\treflow: created secondary index Date-Keepalive-index\n\treflow: waiting for table to become active; current status: UPDATING\n\treflow: waiting for index Date-Keepalive-index to become active; current status: CREATING\n\t...\n\treflow: created secondary index RunID-index\n\treflow: waiting for table to become active; current status: UPDATING\n\treflow: waiting for index RunID-index to become active; current status: CREATING\n\t...\n\n\n## Documentation\n\n- [Language summary](LANGUAGE.md)\n- [Go package docs](https://godoc.org/github.com/grailbio/reflow)\n\n## Developing and building Reflow\n\nReflow is implemented in Go, and its packages are go-gettable. 
\nReflow is also a [Go module](https://github.com/golang/go/wiki/Modules)\nand uses modules to fix its dependency graph.\n\nAfter checking out the repository,\nthe usual `go` commands should work, e.g.:\n\n\t% go test ./...\n\nThe package `github.com/grailbio/reflow/cmd/reflow`\n(or subdirectory `cmd/reflow` in the repository)\ndefines the main command for Reflow.\nBecause Reflow relies on being able to\ndistribute its current build,\nthe binary must be built using the `buildreflow` tool\ninstead of the ordinary Go tooling.\nCommand `buildreflow` acts like `go build`,\nbut also cross-compiles the binary\nfor the remote target (Linux/amd64 currently supported)\nand embeds the cross-compiled binary.\n\n\t% cd $CHECKOUT/cmd/reflow\n\t% go install github.com/grailbio/reflow/cmd/buildreflow\n\t% buildreflow\n\n## Debugging Reflow runs\n\nThe `$HOME/.reflow/runs` directory contains logs, traces, and other\ninformation for each Reflow run. If the run you're looking for is\nno longer there, the `info` and `cat` tools can be used if you have\nthe run ID:\n\n\t% reflow info 2fd5a9b6\n\trunid    user       start   end    RunLog   EvalGraph Trace\n\t2fd5a9b6 username   4:41PM  4:41PM 29a4b506 90f40bfc  4ec75aac\n\n\t% reflow cat 29a4b506 \u003e /tmp/29a4b506.runlog\n\n\t# fetch the evalgraph data, pass to the dot tool to generate an svg image (viewable in your browser)\n\t% reflow cat 90f40bfc | dot -Tsvg \u003e /tmp/90f40bfc.svg\n\nFor more information about tracing, see [doc/tracing.md](doc/tracing.md).\n\n## Support and community\n\nPlease join us on [Gitter](https://gitter.im/grailbio/reflow) or\non the [mailing list](https://groups.google.com/forum/#!forum/reflowlets)\nto discuss Reflow.\n