{"id":26572311,"url":"https://github.com/shandley/dada2-workflow","last_synced_at":"2026-02-16T01:36:13.943Z","repository":{"id":279883502,"uuid":"940205081","full_name":"shandley/dada2-workflow","owner":"shandley","description":"Generalized dada2 workflow. Refer to the wiki for help and troubleshooting.","archived":false,"fork":false,"pushed_at":"2025-03-10T11:19:55.000Z","size":14809,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-23T00:34:28.105Z","etag":null,"topics":["16s","16s-rrna","asv","dada2","microbiome","phyloseq"],"latest_commit_sha":null,"homepage":"","language":"R","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shandley.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-02-27T19:32:01.000Z","updated_at":"2025-03-10T11:19:59.000Z","dependencies_parsed_at":"2025-02-28T09:23:50.793Z","dependency_job_id":"e82846c6-83f5-4e3d-928c-d7f5c00c22a7","html_url":"https://github.com/shandley/dada2-workflow","commit_stats":null,"previous_names":["shandley/dada2-workflow"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/shandley/dada2-workflow","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shandley%2Fdada2-workflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shandley%2Fdada2-workflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shandley%2Fdada2-workflow/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/r
epositories/shandley%2Fdada2-workflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/shandley","download_url":"https://codeload.github.com/shandley/dada2-workflow/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shandley%2Fdada2-workflow/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264837928,"owners_count":23671119,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["16s","16s-rrna","asv","dada2","microbiome","phyloseq"],"created_at":"2025-03-23T00:33:14.420Z","updated_at":"2026-02-16T01:36:08.901Z","avatar_url":"https://github.com/shandley.png","language":"R","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DADA2 Optimized Workflow for 16S rRNA Sequencing Analysis\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nAn optimized workflow for processing 16S rRNA gene amplicon data using the [DADA2](https://benjjneb.github.io/dada2/) package in R. This workflow identifies exact amplicon sequence variants (ASVs) with higher resolution than traditional OTU-based methods while implementing advanced optimizations for improved performance, reliability, and insights.\n\n## Overview\n\nThis repository contains two main components:\n\n1. **dada2_workflow_optimize.Rmd** - An enhanced, performance-optimized RMarkdown workflow that implements the complete DADA2 pipeline\n2. 
**dashboard.R** - An interactive Shiny dashboard for visualizing and exploring results\n\n*Note: A basic DADA2 workflow (dada2_workflow.Rmd) is also included for users who want the simplest possible implementation.*\n\n## Key Features of the Optimized Workflow\n\n- **Automatic sequencing platform detection** - Identifies platform based on read length and quality patterns\n- **Parameter optimization** - Tunes filtering parameters to your specific sequence data characteristics\n- **Adaptive truncation lengths** - Dynamically adjusts to optimize read quality and overlap\n- **Expected error threshold optimization** - Balances quality control and read retention\n- **Primer detection** - Automatically identifies primer sequences for amplicon size calculations\n- **Memory-optimized processing** - Efficient batched operations with automatic memory management\n- **Enhanced checkpointing system** - Robust recovery from interruptions with comprehensive tracking\n- **Parallelized execution** - Automatic multi-core utilization with adaptive worker allocation\n- **Reference-based taxonomy confidence scoring** - Bootstrap confidence values for all taxonomic assignments\n- **Multi-method taxonomy assignment** - Combines results from multiple classifiers for improved accuracy\n- **Phylogenetic tree construction** - Integrates phylogenetic information using optimized alignment methods\n- **Rarefaction analysis** - Depth optimization with saturation detection for proper diversity comparisons\n- **Detailed quality visualization** - Enhanced plots with quality interpretation zones\n- **Multi-run support** - Process and integrate data from multiple sequencing runs with batch effect analysis\n- **Comprehensive reporting** - Generate detailed HTML/PDF reports with code hiding option\n\n## Dashboard Features\n\nThe interactive dashboard (`dashboard.R`) provides advanced visualization and analysis capabilities:\n\n- **Overview Panel** - Sample metrics, read statistics, and ASV summaries\n- 
**Quality Control** - Filtering performance, read tracking, and quality metric distributions\n- **Alpha Diversity** - Multiple indices with statistical comparisons between groups\n- **Beta Diversity** - Multiple ordination methods (PCoA, NMDS, t-SNE, UMAP) with statistical tests\n- **Taxonomy Explorer** - Interactive hierarchical visualization of taxonomic composition\n- **ASV Browser** - Searchable ASV table with sequence information and abundance patterns\n- **Differential Abundance** - Multiple testing methods for identifying biomarkers between groups\n- **Batch Effect Analysis** - For multi-run studies, quantifies and visualizes run effects\n- **Normalization Methods** - Compare various count normalization approaches\n- **Export Options** - Download plots, tables, and processed data in multiple formats\n\n## Requirements\n\n- R ≥ 4.0.0\n- Required R packages:\n  - dada2\n  - ggplot2\n  - phyloseq\n  - Biostrings\n  - ShortRead\n  - tidyverse\n  - argparse (for the command-line scripts)\n  - future and future.apply (for parallelization)\n  - DECIPHER (for improved taxonomy \u0026 phylogeny)\n  - vegan (for diversity analyses)\n\n## Getting Started\n\n1. Clone this repository:\n   ```bash\n   git clone https://github.com/shandley/dada2-workflow.git\n   cd dada2-workflow\n   ```\n\n2. Install required R packages:\n   ```r\n   install.packages(c(\"ggplot2\", \"tidyverse\", \"argparse\", \"future\", \"future.apply\", \"vegan\"))\n   if (!requireNamespace(\"BiocManager\", quietly = TRUE))\n       install.packages(\"BiocManager\")\n   BiocManager::install(c(\"dada2\", \"phyloseq\", \"Biostrings\", \"ShortRead\", \"DECIPHER\"))\n   ```\n\n3. 
Run the optimized workflow using one of the following methods:\n\n   **Method 1: RStudio**\n   - Open `dada2_workflow_optimize.Rmd` in RStudio \n   - Update parameters in the YAML header if needed\n   - Execute by running all code chunks\n\n   **Method 2: Command Line**\n   - For a single sequencing run:\n     ```bash\n     Rscript run_dada2_workflow.R\n     ```\n   \n   - For multi-run analysis with batch effect correction:\n     ```bash\n     Rscript run_dada2_workflow.R --multi-run --run-dir path/to/run_directory\n     ```\n\n4. View results in the interactive dashboard:\n   ```bash\n   Rscript run_dashboard.R\n   ```\n   Or for advanced options:\n   ```bash\n   Rscript run_dashboard.R --optimize --cores 4 --multi-run\n   ```\n\n## Directory Structure\n\n### Single Run Mode\nPlace your fastq files in the `data/` directory:\n```\ndata/\n  ├── sample1_R1.fastq.gz\n  ├── sample1_R2.fastq.gz\n  ├── sample2_R1.fastq.gz\n  ├── sample2_R2.fastq.gz\n  └── ...\n```\n\n### Multi-Run Mode\nOrganize your data with each run in a separate subdirectory:\n```\ndata/\n  ├── run1/\n  │   ├── sample1_R1.fastq.gz\n  │   ├── sample1_R2.fastq.gz\n  │   └── ...\n  ├── run2/\n  │   ├── sampleA_R1.fastq.gz\n  │   ├── sampleA_R2.fastq.gz\n  │   └── ...\n  └── run3/\n      ├── sampleX_R1.fastq.gz\n      ├── sampleX_R2.fastq.gz\n      └── ...\n```\n\n## Command-Line Options\n\nThe `run_dada2_workflow.R` script provides the following options:\n\n```\nusage: run_dada2_workflow.R [-h] [-m] [-d RUN_DIR] [-b] [-r] [-f FORMAT]\n                            [-o OUTPUT_DIR] [-n OUTPUT_FILE] [--cores CORES]\n                            [--optimize]\n\nRun optimized DADA2 workflow for 16S rRNA amplicon sequence processing\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -m, --multi-run       Enable multi-run processing mode\n  -d RUN_DIR, --run-dir RUN_DIR\n                        Directory containing run subdirectories (for multi-run\n                        mode)\n 
 -b, --big-data        Enable big data mode with optimized memory management\n  -r, --report          Generate HTML report\n  -f FORMAT, --format FORMAT\n                        Output format for report (e.g., html_document,\n                        pdf_document)\n  -o OUTPUT_DIR, --output-dir OUTPUT_DIR\n                        Output directory for reports\n  -n OUTPUT_FILE, --output-file OUTPUT_FILE\n                        Base output filename for reports\n  --cores CORES         Number of CPU cores to use for parallelization\n  --optimize            Enable additional performance optimizations\n```\n\n## Multi-Run Processing and Batch Effect Analysis\n\nFor studies with samples across multiple sequencing runs, the workflow:\n\n1. Processes each run separately through the sample inference step with run-specific error models\n2. Merges sequence tables from all runs while preserving run information\n3. Performs chimera removal and taxonomic assignment on the combined data\n4. Provides batch effect detection and correction methods:\n   - PERMANOVA to test for significant run effects\n   - Beta dispersion analysis to check homogeneity across runs\n   - Batch effect visualization with ordination methods\n   - Optional normalization methods specifically for batch correction\n\nThis approach gives you:\n- More accurate error models specific to each sequencing run\n- Detection of potential batch effects that could bias results\n- Methods to correct or account for batch effects in downstream analyses\n- Better integration of data from different sequencing platforms or centers\n\n## Output Files\n\nThe workflow produces comprehensive output files in the `results/` directory:\n\n- `seqtab_nochim.csv`: ASV count table\n- `taxonomy.csv`: Taxonomic assignments for each ASV\n- `taxonomy_with_confidence.csv`: Taxonomy with bootstrap confidence scores\n- `phyloseq_object.rds`: R object for downstream analysis\n- `ASVs.fasta`: FASTA file containing ASV sequences\n- 
`filter_summary.csv`: Quality filtering statistics\n- `chimera_summary.csv`: Chimera detection statistics\n- `read_tracking_detailed.csv`: Read counts through each pipeline step\n- `rarefaction_curves.rds`: Data for rarefaction analysis\n- `workflow_summary.rds`: Complete statistics about the analysis run\n\n## Contributing\n\nContributions to improve this workflow are welcome. Please feel free to submit a pull request.\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## References\n\n- Callahan BJ, McMurdie PJ, Rosen MJ, Han AW, Johnson AJA, Holmes SP (2016). \"DADA2: High-resolution sample inference from Illumina amplicon data.\" Nature Methods, 13, 581-583. doi: 10.1038/nmeth.3869\n- McMurdie PJ, Holmes S (2013). \"phyloseq: An R package for reproducible interactive analysis and graphics of microbiome census data.\" PLoS ONE, 8(4):e61217\n- Murali A, Bhargava A, Wright ES (2018). \"IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences.\" Microbiome, 6, 140. doi: 10.1186/s40168-018-0521-5\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshandley%2Fdada2-workflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshandley%2Fdada2-workflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshandley%2Fdada2-workflow/lists"}