https://github.com/bu-isciii/openebench_gmi
- Host: GitHub
- URL: https://github.com/bu-isciii/openebench_gmi
- Owner: BU-ISCIII
- License: gpl-3.0
- Created: 2019-02-19T12:38:57.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2020-01-02T13:11:23.000Z (over 6 years ago)
- Last Synced: 2025-07-29T17:46:09.218Z (11 months ago)
- Language: HTML
- Size: 1.97 MB
- Stars: 1
- Watchers: 2
- Forks: 2
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
[License: GPL v3](https://www.gnu.org/licenses/gpl-3.0) [SCI-F](https://sci-f.github.io)
# Nextflow pipeline using containers for an outbreak detection challenge on the OpenEBench platform
This repository aims to provide a Nextflow + container implementation of the OpenEBench workflow for an outbreak detection challenge.
## How to use it
```Bash
git clone https://github.com/BU-ISCIII/openebench_gmi.git
cd openebench_gmi
git submodule init
git submodule update
nextflow run main.nf -profile docker
```
Parameters available:
```Bash
nextflow run main.nf --help
```
```
Usage:
  nextflow run BU-ISCIII/openebench_gmi --tree_test {test.newick.file} --goldstandard_dir {golden.folder.path} --assess_dir {assessment.path} --public_ref_dir {path.to.info.ref.dataset} --event_id {event.id}

Mandatory arguments:
  --tree_test           Path to input data (must be surrounded with quotes).
  --goldstandard_dir    Path to reference data. Golden datasets.
  --public_ref_dir      Path where public dataset info is stored for validation.
  --assess_dir          Path where benchmark data is stored.
  --event_id            Event identifier.
  --participant_id      Participant identifier.
  --tree_format         Format tree ["nexus","newick"].

Other options:
  --outdir              The output directory where the results will be saved
```
## Datasets
First, the required datasets have been collected in the [datasets folder](datasets):
1. **Input dataset:** FASTQ input data obtained from the [GMI WGS standards and benchmarks repository](https://github.com/globalmicrobialidentifier-WG3/datasets). Download instructions are available [here](datasets/inputDataset/Readme.me).
2. **Gold standard dataset:** confirmed phylogeny for the outbreak being investigated.
3. **Input dataset ids:** input dataset ids in .txt and .json format.
4. **Test dataset:** a test tree to compare against the gold standard result. In this case it is the gold standard tree itself, so the Robinson-Foulds distance must be 0.
5. **benchmark_data**: path where benchmark results are stored.
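
The expectation that the test dataset scores a Robinson-Foulds distance of 0 can be illustrated with a small, dependency-free sketch. This is not the pipeline's own code: the function names, the toy trees and the choice of normalization (dividing by the total number of non-trivial splits in both trees) are assumptions made for illustration, and the parser only handles simple unquoted Newick strings over the same set of leaves.

```python
import re


def parse_newick(text):
    """Parse a simple Newick string into nested lists of leaf names.

    Assumption: labels are unquoted. Branch lengths and internal node
    labels are ignored because only the topology matters here.
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    stack = [[]]
    after_close = False  # True right after ")", where an internal label may follow
    for tok in tokens:
        if tok == "(":
            stack.append([])
            after_close = False
        elif tok == ")":
            node = stack.pop()
            stack[-1].append(node)
            after_close = True
        elif tok == ",":
            after_close = False
        elif tok == ";":
            break
        else:
            name = tok.split(":")[0].strip()
            if not after_close and name:
                stack[-1].append(name)
            after_close = False
    return stack[0][0]


def leaves(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def bipartitions(tree):
    """Return the non-trivial splits of a tree, treated as unrooted."""
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return
        side = leaves(node)
        # Splits with a single leaf on one side carry no topology signal.
        if 1 < len(side) < len(all_leaves) - 1:
            # Always store the side without the anchor leaf, so each split has one key.
            splits.add(side if anchor not in side else all_leaves - side)
        for child in node:
            visit(child)

    visit(tree)
    return splits


def normalized_rf(newick_a, newick_b):
    """Symmetric difference of splits divided by the total number of splits."""
    splits_a = bipartitions(parse_newick(newick_a))
    splits_b = bipartitions(parse_newick(newick_b))
    max_distance = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / max_distance if max_distance else 0.0


gold = "((A,B),(C,D),E);"  # hypothetical gold standard tree
print(normalized_rf(gold, gold))                # 0.0
print(normalized_rf(gold, "((A,C),(B,D),E);"))  # 1.0
```

Comparing the gold standard tree with itself yields 0, while a tree that shares no internal split with it yields the maximum value of 1.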
## Nextflow pipeline and containers
Second, a pipeline has been developed. It is split into three steps according to the OpenEBench specifications, using this [repo](https://github.com/inab/opeb-submission) as an example:
### Nextflow processes
1. **Validation and data preprocessing:**
   1. *Check results format:*
      - Tree input: the user's input tree format is validated. Nexus and Newick formats are allowed, with Newick as the canonical format. If the format is valid, the tree is output in the canonical format (.nwk).
      - VCF input:
   2. *Get query ids:*
      - Tree input: ids are extracted from the user's input tree in Newick or Nexus format and written to queryids.json.
   3. *Validate query ids:*
      - Tree input: query ids are validated against the reference input ids.
2. **Metrics:**
   1. *Precision/Recall calculation:* common (TP), source (FP) and ref (FN) edges are calculated by comparing the ref and test tree topologies. Recall and precision are calculated from these values and stored in a JSON file called {participant_id}_snprecision.json.
   2. *Robinson-Foulds metric calculation:* a normalized Robinson-Foulds test is performed between the user tree and every participant tree already analyzed and stored in the benchmark_data folder, in order to compare their topologies. The resulting value is written to the participant_matrix.json file.
3. **Data visualization and consolidation:**
   1. A Precision/Recall graph is created, classifying each participant into a quartile.
   2. An all-participants vs. all-participants heatmap is created using the normalized Robinson-Foulds matrix.
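
The precision/recall step described above can be sketched as follows. This is a minimal illustration, not the pipeline's implementation: the function names, the representation of an edge as a `frozenset` of leaf names, the toy edge sets and the JSON key names are all assumptions. Only the output file name pattern `{participant_id}_snprecision.json` comes from the description above.

```python
import json


def precision_recall(ref_edges, test_edges):
    """Compute precision and recall from two sets of tree edges.

    Edges found in both trees count as TP, edges only in the test
    (source) tree as FP, and edges only in the reference tree as FN.
    """
    tp = len(ref_edges & test_edges)
    fp = len(test_edges - ref_edges)
    fn = len(ref_edges - test_edges)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


def metrics_filename(participant_id):
    """Build the output file name used for the precision/recall metrics."""
    return f"{participant_id}_snprecision.json"


# Hypothetical edge sets: each edge is identified by the leaves on one side of it.
ref_edges = {frozenset("AB"), frozenset("CD"), frozenset("EF")}
test_edges = {frozenset("AB"), frozenset("CD"), frozenset("CE")}

metrics = precision_recall(ref_edges, test_edges)
print(metrics_filename("participant_1"))  # participant_1_snprecision.json
print(json.dumps(metrics, sort_keys=True))
```

Here two edges are shared, one is found only in the test tree and one only in the reference tree, so precision and recall are both 2/3.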
### Containers info
Each step runs in its own container. Containers are built from a Dockerfile recipe that uses [SCI-F](https://sci-f.github.io/) recipes for software installation. All SCI-F recipes are available in the [scif_app_recipes repository](https://github.com/BU-ISCIII/scif_app_recipes). Singularity recipes are also provided (not yet adapted in the Nextflow pipeline).