https://github.com/seandavi/curatedmetagenomicsnextflow
Curated Metagenomics Data Nextflow workflows
- Host: GitHub
- URL: https://github.com/seandavi/curatedmetagenomicsnextflow
- Owner: seandavi
- Created: 2020-06-23T16:28:03.000Z (over 5 years ago)
- Default Branch: main
- Last Pushed: 2025-03-21T21:33:07.000Z (7 months ago)
- Last Synced: 2025-03-21T22:27:47.741Z (7 months ago)
- Topics: bioinformatics, metagenomics, nextflow, r01ca230551
- Language: Nextflow
- Size: 93.8 KB
- Stars: 7
- Watchers: 6
- Forks: 6
- Open Issues: 9
Metadata Files:
- Readme: README.md
# Curated Metagenomics Nextflow Pipeline

A Nextflow pipeline for processing metagenomics data, implementing the curatedMetagenomics workflow.
## Overview
This pipeline processes raw sequencing data through multiple steps:
1. FASTQ extraction with `fasterq-dump`
2. Quality control with `KneadData`
3. Taxonomic profiling with `MetaPhlAn`
4. Functional profiling with `HUMAnN` (optional)

## Usage
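Concretely, the four processing steps above correspond to one tool invocation each. The sketch below bundles them into a single shell function for orientation only; the flags, database names, and file paths are illustrative, and the pipeline's actual process scripts are authoritative.

```shell
# Illustrative per-sample tool chain for the four pipeline steps.
# Flags, database names, and output paths are representative, not copied
# from the pipeline's process scripts.
run_sample() {
  acc="$1"
  fasterq-dump "$acc" --split-files                        # 1. FASTQ from SRA
  kneaddata --input "${acc}_1.fastq" \
            --reference-db human_genome \
            --output knead_out                             # 2. quality control
  metaphlan knead_out/cleaned.fastq --input_type fastq \
            -o "${acc}_profile.tsv"                        # 3. taxonomic profiling
  humann --input knead_out/cleaned.fastq \
         --output humann_out                               # 4. functional profiling
}
```

Calling `run_sample SRR1234567` would execute the chain for one accession, assuming all four tools are on `PATH`; in the pipeline itself each step runs as a separate, containerized Nextflow process.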
Basic usage:
```bash
nextflow run main.nf --metadata_tsv samples.tsv
```

With specific parameters:
```bash
nextflow run main.nf --metadata_tsv samples.tsv --skip_humann --publish_dir results
```

## Parameters
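Every option below is an ordinary Nextflow parameter, so instead of individual `--` flags they can be collected in a params file and passed with Nextflow's standard `-params-file` option. A minimal sketch (the file name is arbitrary):

```shell
# Collect run options in a YAML params file rather than on the command line.
cat > params.yaml <<'EOF'
metadata_tsv: samples.tsv
publish_dir: results
skip_humann: true
EOF
# Then launch with:
#   nextflow run main.nf -params-file params.yaml
```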
### General Pipeline Parameters
| Parameter | Description | Default |
| -------------- | -------------------------------------- | ------------- |
| `metadata_tsv` | Path to TSV file with sample metadata | `samples.tsv` |
| `publish_dir` | Directory to publish results | `results` |
| `store_dir` | Directory to store reference databases | `databases` |
| `cmgd_version` | Curated Metagenomic Data version | `4` |

### Process Control Parameters
| Parameter | Description | Default |
| ------------- | -------------------------------- | ------- |
| `skip_humann` | Skip HUMAnN functional profiling | `false` |

### MetaPhlAn Parameters
| Parameter | Description | Default |
| ----------------- | ---------------------- | -------- |
| `metaphlan_index` | MetaPhlAn index to use | `latest` |

### HUMAnN Parameters
| Parameter | Description | Default |
| ------------ | --------------------------- | ------------------ |
| `chocophlan` | ChocoPhlAn database version | `full` |
| `uniref`     | UniRef database version     | `uniref90_diamond` |

## Input Format
The `metadata_tsv` file should be a tab-separated values file with at least the following columns:
- `sample_id`: Unique sample identifier
- `NCBI_accession`: SRA accession number(s), separated by semicolons for multiple files

Example:
```
sample_id NCBI_accession
sample1 SRR1234567
sample2 SRR2345678;SRR2345679
```

## Output
Results will be organized by sample in the `publish_dir` directory:
```
results/
├── sample1/
│ ├── fasterq_dump/
│ ├── kneaddata/
│ ├── metaphlan_lists/
│ ├── metaphlan_markers/
│ ├── strainphlan_markers/
│ └── humann/
├── sample2/
│ └── ...
```

## Profiles
The pipeline comes with several execution profiles:
- `local`: For local execution
- `google`: For execution on Google Cloud Batch
- `anvil`: For execution on AnVIL
- `alpine`: For execution on Alpine HPC
- `unitn`: For execution on UNITN PBS Pro

Example:
```bash
nextflow run main.nf -profile google --metadata_tsv samples.tsv
```

## Dependencies
This pipeline requires:
- Nextflow 22.10.0 or later
- Container support (Docker, Singularity, etc.)
- AWS CLI (for data retrieval from SRA)
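As a convenience (not part of the pipeline), the dependencies above can be checked before the first run with a short shell snippet; the `ver_ge` helper is hypothetical and simply compares version strings with `sort -V`:

```shell
# Report which required tools are on PATH and whether the installed
# Nextflow meets the documented minimum version (22.10.0).
ver_ge() {  # ver_ge A B: true if version A >= version B
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

for tool in nextflow docker singularity aws; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found:   $tool"
  else
    echo "missing: $tool"
  fi
done

if command -v nextflow >/dev/null 2>&1; then
  nf_ver=$(nextflow -v 2>/dev/null | awk '{print $NF}')
  if ver_ge "$nf_ver" "22.10.0"; then
    echo "Nextflow $nf_ver satisfies the 22.10.0 minimum"
  else
    echo "Nextflow $nf_ver is older than 22.10.0"
  fi
fi
```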