{"id":50922578,"url":"https://github.com/aomlomics/tourmaline","last_synced_at":"2026-06-16T20:01:07.220Z","repository":{"id":43108737,"uuid":"125841708","full_name":"aomlomics/tourmaline","owner":"aomlomics","description":"Amplicon sequence processing workflow using QIIME 2 and Snakemake","archived":false,"fork":false,"pushed_at":"2025-12-18T17:00:41.000Z","size":59636,"stargazers_count":46,"open_issues_count":32,"forks_count":22,"subscribers_count":5,"default_branch":"V2","last_synced_at":"2025-12-21T20:40:09.966Z","etag":null,"topics":["amplicon-sequence-variants","dada2","deblur","noaa-omics-software","snakemake"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aomlomics.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2018-03-19T10:36:47.000Z","updated_at":"2025-12-18T17:00:45.000Z","dependencies_parsed_at":"2023-01-20T02:22:41.543Z","dependency_job_id":"fcee4e88-e128-4de2-bd66-4a92671dd9cf","html_url":"https://github.com/aomlomics/tourmaline","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/aomlomics/tourmaline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aomlomics%2Ftourmaline","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aomlomics%2Ftourmaline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aomlomics%2F
tourmaline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aomlomics%2Ftourmaline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aomlomics","download_url":"https://codeload.github.com/aomlomics/tourmaline/tar.gz/refs/heads/V2","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aomlomics%2Ftourmaline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34421326,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amplicon-sequence-variants","dada2","deblur","noaa-omics-software","snakemake"],"created_at":"2026-06-16T20:01:06.540Z","updated_at":"2026-06-16T20:01:07.213Z","avatar_url":"https://github.com/aomlomics.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cimg src=\"png/tourmaline_banner.png\" alt=\"png/tourmaline_banner\" width=\"100%\"/\u003e\n\n[![DOI](https://zenodo.org/badge/125841708.svg)](https://zenodo.org/badge/latestdoi/125841708)\n\n# Tourmaline 2\n\nTourmaline 2 is an amplicon sequence processing workflow for Illumina sequence data that uses [QIIME 2](https://qiime2.org) and the software packages it wraps. 
Tourmaline 2 manages commands, inputs, and outputs using the [Snakemake](https://snakemake.readthedocs.io/en/stable/) workflow management system.\n\n## Major changes in v2 vs. v1\n\n**To use the Legacy v1 version of Tourmaline**, check out the [V1 branch](https://github.com/aomlomics/tourmaline/tree/V1) of this repository!\n\n### Run via tourmaline.sh script\n\nInstead of interacting with Snakemake rules directly, the main way to run Tourmaline 2 is through the `tourmaline.sh` script. This script allows you to run one or more of the workflow steps at a time, choose which config files to use, and set the maximum number of cores. You must run it from within the tourmaline directory; however, you can set the output file destinations to any location.\n\nUsage:\n\n```bash\nconda activate snakemake-tour2\n./tourmaline.sh --step [qaqc,repseqs,taxonomy] --configfile [config1,config2,config3] --cores N\n```\n\nYou can still run individual Snakemake rules as before. Each of the three steps (explained in more detail below) has its own Snakefile, so you must specify the correct Snakefile when running an individual rule.\n\n### Providing externally-generated data\n\nUnlike in Tourmaline 1, you can start any of the three workflow steps with data from an external program, so long as it is formatted correctly. For example, if you already have ASV sequences and just want to assign taxonomy with Tourmaline, you can format them for QIIME 2 (code to help with this is provided below) and just provide the file path in your config file.\n\n## Overview\n\nTourmaline 2 is a modular Snakemake pipeline for processing DNA metabarcoding data. The pipeline consists of three main steps, plus an optional fourth step:\n\n### Step 1. 
Sequence quality assurance and quality control\n\n* Called \"qaqc\" in Tourmaline 2 code.\n* Processes raw fastq files (paired-end or single-end data).\n* Provides sequence quality plots for demultiplexed raw and/or trimmed reads.\n* Optionally trims primer sequences from raw reads.\n* Creates a QIIME 2 sequence artifact.\n\n### Step 2. Representative sequences (denoising and ASV generation)\n\n* Called \"repseqs\" in Tourmaline 2 code.\n* Generates ASVs using the specified method (DADA2 or Deblur).\n* Optional filtering based on length, abundance, and prevalence.\n* Produces feature table and representative sequences.\n\n### Step 3. Taxonomy assignment\n\n* Called \"taxonomy\" in Tourmaline 2 code.\n* Generates taxonomic assignments and visualizations.\n* Assigns taxonomy using one of four methods:\n  * [Naive Bayes classifier as implemented in QIIME 2](https://docs.qiime2.org/2024.10/plugins/available/feature-classifier/classify-sklearn/)\n  * [Consensus BLAST as implemented in QIIME 2](https://docs.qiime2.org/2024.10/plugins/available/feature-classifier/classify-consensus-blast/)\n  * [Consensus VSEARCH as implemented in QIIME 2](https://docs.qiime2.org/2024.10/plugins/available/feature-classifier/classify-consensus-vsearch/)\n  * [Anacapa's Bowtie 2 and BLCA method](https://github.com/limey-bean/Anacapa?tab=readme-ov-file#step-3-taxonomic-assignment-using-bowtie-2-and-blca)\n\n### Step 4. 
Generate bioinformatics metadata\n\n* Creates a file with metadata about the analysis using FAIR eDNA terms.\n* File can be read into the [NOAA Ocean DNA Explorer](https://www.ngi.msstate.edu/node).\n\n## Setup Requirements\n\n* [Conda (Miniconda works well)](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)\n* [QIIME 2 (2024.10) amplicon workflow](https://docs.qiime2.org/2024.10/install/)\n* [Snakemake conda environment, with extra packages installed](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html)\n\n   ```bash\n   conda create -c conda-forge -c bioconda -n snakemake-tour2 snakemake biopython yq parallel\n   ```\n\n* [V2 (default) branch of Tourmaline](https://github.com/aomlomics/tourmaline.git)\n\n   ```bash\n   git clone https://github.com/aomlomics/tourmaline.git\n   ```\n\n* bowtie2-blca conda environment (required only if running BLCA taxa assignment)\n\n    ```bash\n   conda create -c conda-forge -c bioconda -n bt2-blca biopython muscle=3.8 bowtie2\n    ```\n\n### Running Requirements\n\n* `snakemake-tour2` environment must be activated\n* Required configuration files for each step\n* Input data files (vary depending on starting step)\n* Must run from the Tourmaline directory downloaded from GitHub, which contains the `tourmaline.sh` script and Snakefiles\n\n## Configuration Files\n\nThe pipeline uses three main configuration files, one for each step. These files can have any name, and example files are provided.\n\n### 1. 
Sample/QA/QC Configuration (config_01_qaqc.yaml)\n\nKey parameters:\n\n```yaml\nrun_name: [your_run_name]              # Name for this qaqc run, will be a prefix for outputs\noutput_dir: [path]                     # Output directory path\nraw_fastq_path: [path]                 # Path to raw fastq files\npaired_end: [True/False]               # Whether data is paired-end\nto_trim: [True/False]                  # Whether to trim sequences\n\n# Trimming parameters\nfwd_primer: [sequence]                 # Forward primer sequence\nrev_primer: [sequence]                 # Reverse primer sequence\ndiscard_untrimmed: [True/False]        # Whether to discard sequences without the primer\nminimum_length: [int]                  # Minimum sequence length to keep after trimming\n```\n\n#### QA/QC Input Files\n\nThere are three options for input files in the QA/QC step. You must choose one and leave the others blank in the config file:\n\n```yaml\n# Full path to raw demultiplexed fastq files. Sample names will be the prefix of the file names.\nraw_fastq_path: [path]\n# Full path to pre-trimmed fastq files. Sample names will be the prefix of the file names.\ntrimmed_fastq_path: [path]\n# Relative path and file name of a QIIME2 manifest file. It can point to trimmed or untrimmed reads.\nsample_manifest_file: [path/filename]\n```\n\n##### Sample Manifest Format\n\nYou can provide either the current QIIME2 tab-separated file format or the legacy comma-separated format. 
The file must have the correct headers:\n\n**Tab-separated**\n\nPaired-end:\n\n```tsv\nsample-id  forward-absolute-filepath     reverse-absolute-filepath\nsample1    /path/to/sample1_R1.fastq.gz  /path/to/sample1_R2.fastq.gz\n```\n\nSingle-end:\n\n```tsv\nsample-id  absolute-filepath\nsample1    /path/to/sample1_R1.fastq.gz\n```\n\n**CSV (legacy)**\n\nPaired-end:\n\n```csv\nsample-id,absolute-filepath,direction\nsample1,/path/to/sample1_R1.fastq.gz,forward\nsample1,/path/to/sample1_R2.fastq.gz,reverse\n```\n\nSingle-end:\n\n```csv\nsample-id,absolute-filepath\nsample1,/path/to/sample1_R1.fastq.gz\n```\n\n### FASTQ Files without a manifest file\n\n* Paired-end naming: `{sample}_R1.fastq.gz` and `{sample}_R2.fastq.gz`\n* Alternative format: `{sample}_R1_001.fastq.gz` and `{sample}_R2_001.fastq.gz`\n* Single-end naming: `{sample}_R1.fastq.gz` or `{sample}_R1_001.fastq.gz`\n\n### 2. Representative sequences configuration (config_02_repseqs.yaml)\n\nKey parameters:\n\n```yaml\nrun_name: [your_run_name] # Name for this repseqs run, can be the same as or different from the qaqc step\noutput_dir: [path]        # Output directory path\nasv_method: [method]      # ASV method (dada2pe, dada2se, deblur)\n\n# DADA2 parameters (if using dada2pe/dada2se)\n\ndada2_trunc_len_f: [int]   # Forward read truncation length\ndada2pe_trunc_len_r: [int] # Reverse read truncation length (paired-end only)\ndada2_trim_left_f: [int]   # Number of bases to trim from start of forward reads\ndada2pe_trim_left_r: [int] # Number of bases to trim from start of reverse reads (paired-end only)\n\n# Filtering options\nto_filter: [True/False]        # Whether to apply filtering\nrepseq_min_length: [int]       # Minimum ASV length\nrepseq_max_length: [int]       # Maximum ASV length\nrepseq_min_abundance: [float]  # Minimum abundance threshold\nrepseq_min_prevalence: [float] # Minimum prevalence threshold\n```\n\n#### Repseqs input files\n\nYou have two options for providing files to the repseqs step:\n\n**1) Provide an 
existing Tourmaline QA/QC run**\n\n* Either use the same `run_name` and `output_dir` for both steps, or\n* Use a different `run_name` for the repseqs step, and provide the `sample_run_name` you want to use. This can be helpful if you are testing out different trimming parameters.\n\n**2) Provide an externally generated QIIME2 sequence archive (.qza)**\n\nTo generate a QIIME2 sequence archive, you need a manifest file linking sample names with the absolute file paths of the fastq.gz files (see the [TSV format above](https://github.com/aomlomics/tourmaline/blob/develop/README.md#sample-manifest-format)).\n\nActivate the `qiime2-amplicon-2024.10` environment.\n\n```bash\nconda activate qiime2-amplicon-2024.10\n```\n\nImport the reads into a QIIME2 artifact. Change the code to match your manifest file name and desired output .qza file name and path.\n\n**Paired-end data**\n\n```bash\nqiime tools import \\\n   --type 'SampleData[PairedEndSequencesWithQuality]' \\\n   --input-path my_pe.manifest \\\n   --output-path output-file_pe_fastq.qza \\\n   --input-format PairedEndFastqManifestPhred33V2\n```\n\n**Single-end data**\n\n```bash\nqiime tools import \\\n   --type 'SampleData[SequencesWithQuality]' \\\n   --input-path my_se.manifest \\\n   --output-path output-file_se_fastq.qza \\\n   --input-format SingleEndFastqManifestPhred33V2\n```\n\n### 3. 
Taxonomy configuration (config_03_taxonomy.yaml)\n\nKey parameters:\n\n```yaml\nrun_name: [your_run_name] # Name for this pipeline run\noutput_dir: [path]        # Output directory path\nclassify_method: [method] # Classification method (naive-bayes, consensus-blast, consensus-vsearch, bt2-blca)\ncollapse_taxalevel: [int] # Creates an additional table where ASV counts are collapsed to the provided taxonomic level\nclassify_threads: [int]   # Number of threads for classification\n```\n\n#### Taxonomy Input Files\n\nYou have two options for providing files to the taxonomy step:\n\n**1) Provide an existing Tourmaline repseqs run**\n\n* Either use the same `run_name` and `output_dir` for both steps, or\n* Use a different `run_name` for the taxonomy step, and provide the `repseqs_run_name` you want to use. This can be helpful if you are testing out different ASV parameters.\n\n**2) Provide externally generated QIIME2 sequence archive and table (.qza)**\n\nYou must provide paths for both `repseqs_qza_file` and `table_qza_file`.\n\n**ASV sequences**\n\nIf you have a fasta file of ASV/OTU sequences, you can use the following code to generate a QIIME 2 repseqs archive.\n\nActivate the `qiime2-amplicon-2024.10` environment.\n\n```bash\nconda activate qiime2-amplicon-2024.10\n```\n\nImport the sequences into a QIIME 2 artifact. Change the code to match your fasta file name and desired output .qza file name and path.\n\n```bash\nqiime tools import \\\n   --type 'FeatureData[Sequence]' \\\n   --input-path my-asvs.fasta \\\n   --output-path output-asvs.qza\n```\n\n**Read count table**\n\nIf you have a BIOM-formatted table, you can [follow the QIIME2 guidance and check the format prior to importing](https://docs.qiime2.org/2024.10/tutorials/importing/#feature-table-data). 
Example for a BIOM v1.0.0 formatted file:\n\n```bash\nconda activate qiime2-amplicon-2024.10\n\nqiime tools import \\\n  --input-path feature-table-v100.biom \\\n  --type 'FeatureTable[Frequency]' \\\n  --input-format BIOMV100Format \\\n  --output-path feature-table.qza\n```\n\nIf you have a .tsv file with rows as unique sequences and columns as sample read counts, you can first [convert it to BIOM](https://biom-format.org/documentation/biom_conversion.html) and then import it as a .qza. Example:\n\n```bash\nconda activate qiime2-amplicon-2024.10\n\nbiom convert -i otu_table.txt -o new_otu_table.biom --to-hdf5 --table-type=\"OTU table\"\n\nqiime tools import \\\n  --input-path new_otu_table.biom \\\n  --type 'FeatureTable[Frequency]' \\\n  --input-format BIOMV210Format \\\n  --output-path feature-table.qza\n```\n\nKey parameters for reference database:\n\n```yaml\ndatabase_name: [name]\n# Reference database name, just used for metadata\nrefseqs_file: [path]\n# Reference sequences file\ntaxa_file: [path]\n# Reference taxonomy file\nclassify_method: [method]\n# Classification method (naive-bayes, consensus-blast, consensus-vsearch)\ntaxa_ranks: [comma-separated list of ranks]\n# Taxonomy rank levels that match the reference database\npretrained_classifier: [full path]\n# Optional for naive-bayes method, if provided will ignore refseqs_file and taxa_file\nbowtie_database: [path] # optional for bt2-blca, folder with bowtie index database, refseqs and taxa files also required\n```\n\nMethod-specific parameters:\n\n```yaml\n# naive-bayes\nskl_confidence: 0.7\n# Confidence threshold for limiting taxonomic depth\n# SEQ SIMILARITY (consensus-blast or consensus-vsearch)\nperc_identity: 0.8\n# Percent identity threshold for matches\nquery_cov: 0.8\n# Query alignment coverage threshold for matches\nmin_consensus: 0.51\n# Minimum fraction of assignments that must match the top hit to be accepted as the consensus assignment\n# bt2-blca\nconfidence_thres: 0.8\n# Bootstrap confidence threshold for 
limiting taxonomic depth\n```\n\n## Running the workflow\n\nThe workflow can be run using the `tourmaline.sh` script. You can run all steps at once or run them modularly.\n\n### Clone Tourmaline develop branch (first time only)\n\nIf this is your first time running Tourmaline, you'll need to set up your directory.\n\nStart by cloning the Tourmaline directory and files of the **develop** branch:\n\n```bash\ngit clone --branch develop https://github.com/aomlomics/tourmaline.git\n```\n\n### Activate Snakemake Conda environment\n\n```\nconda activate snakemake-tour2\n```\n\nAlso make sure you have the ```qiime2-amplicon-2024.10``` environment installed, with that name. You do not need to install anything else in that environment.\n\n### Basic usage\n\nNavigate to the Tourmaline directory downloaded from GitHub as your working directory, then run:\n\n```bash\n./tourmaline.sh --step/-s [step] --configfile/-c [config_file] --cores/-n [num_cores]\n```\n\n#### Examples\n\nRun a single step (taxonomy):\n\n```bash\n./tourmaline.sh -s taxonomy -c config_03_taxonomy.yaml -n 6\n```\n\nRun all steps with one command:\n\n```bash\n./tourmaline.sh -s qaqc,repseqs,taxonomy -c config_01_sample.yaml,config_02_repseqs.yaml,config_03_taxonomy.yaml -n 6\n```\n\n#### Important notes\n\n* The number of steps must match the number of config files provided.\n* Each step corresponds to its respective config file.\n* Config files must be provided in the same order as the steps.\n\n### Generate bioinformatics metadata\n\nTo generate a report file with metadata on the bioinformatics, provide your three config files to the ```scripts/format_analysisMetadata.py``` script, along with a `project_id`. Optionally, you can provide an `analysis_run_name` and `assay_name`; otherwise, the values from the sample step config file are used. If you are running the script outside of the tourmaline folder, you must also provide the path to the tourmaline metadata file. 
\n\nExample:  \n```bash\npython scripts/format_analysisMetadata.py -s config_01_sample.yaml -r config_02_repseqs.yaml -t config_03_taxonomy.yaml -p my_project -o my-tourmaline-metadata.tsv\n```\n\nFull documentation:  \n```\nusage: format_analysisMetadata.py [-h] -s SAMPLES_CONFIG -r REPSEQS_CONFIG -t TAXONOMY_CONFIG -p PROJECT_ID [-a ASSAY_NAME]\n                                  [-A ANALYSIS_RUN_NAME] [-T TOURMALINE_METADATA] -o OUTPUT\n\nGenerate a single TSV file from multiple YAML files.\n\noptions:\n  -h, --help            show this help message and exit\n  -s SAMPLES_CONFIG, --samples_config SAMPLES_CONFIG\n                        Path to the samples config file\n  -r REPSEQS_CONFIG, --repseqs_config REPSEQS_CONFIG\n                        Path to the repseqs config file\n  -t TAXONOMY_CONFIG, --taxonomy_config TAXONOMY_CONFIG\n                        Path to the taxonomy config file\n  -p PROJECT_ID, --project_id PROJECT_ID\n                        Value for project_id\n  -a ASSAY_NAME, --assay_name ASSAY_NAME\n                        Value for assay_name, otherwise use value in samples config\n  -A ANALYSIS_RUN_NAME, --analysis_run_name ANALYSIS_RUN_NAME\n                        Value for analysis_run_name, otherwise use value in samples config\n  -T TOURMALINE_METADATA, --tourmaline_metadata TOURMALINE_METADATA\n                        Path to tourmaline metadata\n  -o OUTPUT, --output OUTPUT\n                        Path to the output folder\n            \n```\n\n## Directory structure\n\nThe pipeline creates the following directory structure for outputs:\n\n```\noutput_dir/\n├── [run_name]-samples/    # QA/QC outputs\n├── [run_name]-repseqs/    # Representative sequences outputs\n└── [run_name]-taxonomy/   # Taxonomy assignment outputs\n```\n\nEach directory contains the relevant outputs for that step of the pipeline.\n\n## Disclaimer\n\nThis repository is a scientific product and is not official communication of the National Oceanic and Atmospheric 
Administration, or the United States Department of Commerce. All NOAA GitHub project code is provided on an 'as is' basis and the user assumes responsibility for its use. Any claims against the Department of Commerce or Department of Commerce bureaus stemming from the use of this GitHub project will be governed by all applicable Federal law. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by the Department of Commerce. The Department of Commerce seal and logo, or the seal and logo of a DOC bureau, shall not be used in any manner to imply endorsement of any commercial product or activity by DOC or the United States Government.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faomlomics%2Ftourmaline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faomlomics%2Ftourmaline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faomlomics%2Ftourmaline/lists"}