https://github.com/databiosphere/analysis_pipeline_wdl
Collection of WDL workflows based off the University of Washington TOPMed DCC Best Practices for GWAS. The WDL structure was based upon CWLs written by the Seven Bridges development team.
- Host: GitHub
- URL: https://github.com/databiosphere/analysis_pipeline_wdl
- Owner: DataBiosphere
- Created: 2021-04-15T21:17:32.000Z (almost 5 years ago)
- Default Branch: main
- Last Pushed: 2023-04-19T21:41:02.000Z (almost 3 years ago)
- Last Synced: 2025-06-01T06:52:38.045Z (10 months ago)
- Topics: pipeline, topmed, topmed-pipeline, wdl, workflow
- Language: wdl
- Size: 47.4 MB
- Stars: 6
- Watchers: 4
- Forks: 3
- Open Issues: 15
Metadata Files:
- Readme: README.md
# TOPMed Analysis Pipeline — WDL Version
[WDL 1.0](https://github.com/openwdl/wdl/blob/main/versions/1.0/SPEC.md)
This project is a Workflow Description Language (WDL) implementation of several components of the University of Washington [TOPMed pipeline](https://github.com/UW-GAC/analysis_pipeline), purposefully done in a way that closely mimics [the CWL version of the UW Pipeline](https://github.com/UW-GAC/analysis_pipeline_cwl). In other words, this is a WDL that mimics a CWL that mimics a Python pipeline. All three pipelines use the same underlying R scripts which do most of the actual analysis, making their results directly comparable. We have also used checker workflows to verify that results are scientifically equivalent.
## Features
* This pipeline is very similar to the CWL version, and while the main differences between the two [are documented](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/_documentation_/for%20users/cwl-vs-wdl-user.md), testing indicates they are functionally equivalent -- so much so that files generated by the CWL are used as truth files for the WDL
* Because everything runs in Docker containers, there are no external dependencies beyond the usual setup required for [WDL](https://software.broadinstitute.org/wdl/documentation/quickstart)
* Contains multiple checker workflows for validating sets of known inputs and expected outputs
* Open-access sample data is provided, based upon sample data provided by UWGAC, itself based upon 1000 Genomes data
* Autoscaling of the executor's disk size based upon the size of input files, with the option for the user to add more storage on top of that
* Support for [preemptible VMs](https://cloud.google.com/compute/docs/instances/preemptible) on Google backends
* Documentation of inputs, how each workflow works, and WDL-specific workarounds
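The disk autoscaling described above amounts to scaling the total input size and adding any user-requested extra storage on top. A minimal sketch in Python (illustrative only, not the pipeline's actual WDL runtime logic; the function name, `multiplier`, and `addldisk_gb` parameter are assumptions):

```python
import math

def autoscaled_disk_gb(input_sizes_gb, addldisk_gb=0, multiplier=2):
    """Estimate executor disk: scale total input size by a safety
    multiplier, round up, then add user-requested extra storage.
    All values are in GB."""
    base = math.ceil(sum(input_sizes_gb) * multiplier)
    return base + addldisk_gb

# e.g. two 10 GB GDS files plus 50 GB of extra user storage
print(autoscaled_disk_gb([10, 10], addldisk_gb=50))  # -> 90
```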
## Usage
These workflows are tested on both [Terra](https://terra.bio/) and a local installation of [Cromwell](http://cromwell.readthedocs.io/en/develop/). Example files are provided in `test-data-and-truths` and in `gs://topmed_workflow_testing/UWGAC_WDL/`.
Essentially all workflows that take in chromosome-level files share filename requirements. For these files, the chromosome must be included in the filename in the format `chr##`, where `##` is the name of the chromosome (1-24 or X, Y). The chromosome string can appear anywhere in the filename provided it follows this format. For instance, `data_subset_chr1.gds`, `data_chr1_subset.gds`, and `chr1_data_subset.gds` are all valid names, while `data_chromosome1_subset.gds` and `data_subset_c1.gds` are not. Note that the association aggregate, LD prune, and null model workflows additionally require more than one input GDS file (i.e., input at least chr1 and chr2).
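The filename convention above can be checked with a short regular expression. This is an illustrative sketch, not the pipeline's actual validation code; the function and pattern names are assumptions:

```python
import re

# Matches chr1-chr24, chrX, or chrY anywhere in the filename; the
# negative lookahead rejects out-of-range numbers such as chr25.
CHR_RE = re.compile(r"chr(2[0-4]|1[0-9]|[1-9]|X|Y)(?![0-9])")

def has_valid_chr(filename: str) -> bool:
    """Return True if the filename follows the chr## convention."""
    return CHR_RE.search(filename) is not None

for name in ["data_subset_chr1.gds", "data_chr1_subset.gds",
             "chr1_data_subset.gds", "data_chromosome1_subset.gds",
             "data_subset_c1.gds"]:
    print(name, has_valid_chr(name))
```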
For more information on runtime attributes for specific tasks, see [the further reading section](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/README.md#further-reading). The default runtime attributes provided in these pipelines were based on the provided test data, which is probably **much** smaller than what you will be using. With that in mind, if you are using your own data, please be sure to adjust your runtime attributes appropriately.
### Running on Terra (recommended)
For Terra users, it is recommended to import via [Dockstore](https://dockstore.org/organizations/bdcatalyst/collections/UWGACAncestryRelatedness). Importing the JSON file that matches your workflow on the workflow input page will fill in test data and recommended runtime attributes for that test data. For example, load `vcf-to-gds-terra.json` for `vcf-to-gds.wdl`.
### Running on your local machine
Much preliminary testing and development of these pipelines was done by running Cromwell in "local mode," but we do not recommend this approach for doing actual analysis. Cromwell does not manage resources well on local executions. As a result, these pipelines (LD pruning especially) may get their processes killed by your OS and/or lock up Docker, even if running on downsampled data. These issues can *generally* be avoided by changing the concurrent job limit in your Cromwell configuration; with that limit set, all of the sample data in this repo should run on any Cromwell-and-Docker compatible machine. [See instructions here](https://docs.dockstore.org/en/develop/getting-started/getting-started-with-wdl.html#setting-up-the-dockstore-cli) for how to set the concurrent job limit in the Dockstore CLI.
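If you run Cromwell directly rather than through the Dockstore CLI, the concurrent job limit is set in Cromwell's HOCON configuration file. A minimal sketch, assuming the default `Local` backend (the limit value of 2 is illustrative; tune it to your machine):

```hocon
# cromwell.conf -- cap simultaneous jobs on the Local backend
include required(classpath("application"))

backend {
  providers {
    Local {
      config {
        concurrent-job-limit = 2
      }
    }
  }
}
```

Pass the file at startup, e.g. `java -Dconfig.file=cromwell.conf -jar cromwell.jar run workflow.wdl`.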
### Running on an HPC
These workflows have not been extensively tested in an HPC environment, but provided your HPC supports Cromwell and Docker, they should work as expected. You may wish to run the checker workflows before doing actual analysis to ensure everything is running smoothly.
## Further reading
**general notes**
* [documentation on checker workflows](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/_documentation_/for%20users/checker.md)
* [documentation on CWL-WDL differences for users](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/_documentation_/for%20users/cwl-vs-wdl-user.md)
* [documentation on CWL-WDL differences for advanced users/devs](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/_documentation_/for%20developers/cwl-vs-wdl-dev.md)
**workflow-specific**
* Association testing -- aggregate: [assoc-aggregate](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/assoc-aggregate/readme.md)
* Kinship: [KING IBDSEG](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/king/README.md)
* Linkage disequilibrium pruning: [ld-pruning](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/ld-pruning/README.md)
* Null model generation: [null-model](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/null-model/README.md)
* [pc-air](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/pc-air/README.md)
* [pc-relate](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/pc-relate/README.md)
* VCF to GDS file conversion: [vcf-to-gds](https://github.com/DataBiosphere/analysis_pipeline_WDL/blob/main/vcf-to-gds/README.md)
------
#### Contact
Ash O'Farrell (aofarrel@ucsc.edu)