{"id":24961666,"url":"https://github.com/esteinig/scrubby","last_synced_at":"2025-04-10T21:37:27.126Z","repository":{"id":65956505,"uuid":"600264938","full_name":"esteinig/scrubby","owner":"esteinig","description":"Host depletion optimised for clinical metagenomic sequencing applications :panda_face:","archived":false,"fork":false,"pushed_at":"2024-11-18T21:19:44.000Z","size":595,"stargazers_count":16,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T14:48:31.765Z","etag":null,"topics":["alignment","background","bioinformatics","depletion","extraction","host","kraken","metagenomics","rust","taxonomy"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/esteinig.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-11T00:46:50.000Z","updated_at":"2025-03-29T21:57:15.000Z","dependencies_parsed_at":"2024-10-28T04:18:54.136Z","dependency_job_id":"1e8f962f-9e04-4f4b-a744-9648762d88de","html_url":"https://github.com/esteinig/scrubby","commit_stats":{"total_commits":67,"total_committers":3,"mean_commits":"22.333333333333332","dds":"0.34328358208955223","last_synced_commit":"d88b2f9890d68b6e3653bb6fa2d9a782382f28dc"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/esteinig%2Fscrubby","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/esteinig%2Fscrubby/tags","releases_url":"https://repos.ecos
yste.ms/api/v1/hosts/GitHub/repositories/esteinig%2Fscrubby/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/esteinig%2Fscrubby/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/esteinig","download_url":"https://codeload.github.com/esteinig/scrubby/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248304939,"owners_count":21081551,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","background","bioinformatics","depletion","extraction","host","kraken","metagenomics","rust","taxonomy"],"created_at":"2025-02-03T08:55:47.301Z","updated_at":"2025-04-10T21:37:27.098Z","avatar_url":"https://github.com/esteinig.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# scrubby \u003ca href='https://github.com/esteinig'\u003e\u003cimg src='docs/scrubby.png' align=\"right\" height=\"200\" /\u003e\u003c/a\u003e\n\n[![build](https://github.com/esteinig/nanoq/actions/workflows/rust-ci.yaml/badge.svg?branch=master)](https://github.com/esteinig/scrubby/actions/workflows/rust-ci.yaml)\n![](https://img.shields.io/badge/version-1.0.0-black.svg)\n\nHost background depletion for metagenomic diagnostics with benchmarks and optimisation for clinical sequencing protocols and application scenarios.\n\n## Overview\n\n- [Purpose](#purpose)\n- [Install](#install)\n- [Command-line interface](#command-line-interface)\n- [Commands and options](#commands-and-options)\n- [Benchmarks and optimisation](#benchmarks-and-optimisation)\n- [Rust 
library](#rust-library)\n- [Dependencies](#dependencies)\n\n## Purpose\n\n...\n\n## Install\n\n### Development release\n\n```\nmamba install -c conda-forge -c bioconda -c esteinig scrubby\n```\n\n### Release version\n\n\u003e[!NOTE]\n\u003eNot yet available - use development release which tracks main branch (above)\n\nScrubby is available as a statically compiled binary release for Linux and macOS (`x86_64` and `aarch64`). \n\n### Source\n\n```shell\ngit clone https://github.com/esteinig/scrubby \u0026\u0026 cd scrubby\n```\n\nCompile the default version, which requires a classifier or aligner (and `samtools`) as dependencies:\n\n```shell\ncargo build --release\n```\n\nCompile no-dependency built-in `minimap2-rs` version using the `mm2` feature (experimental). Note that `minimap2-rs` is \n[only tested](https://github.com/jguhlin/minimap2-rs?tab=readme-ov-file#building-for-musl) for `x86_64` \n(Linux/macOS) and will not compile for `aarch64` (Linux/macOS).\n\n```shell\ncargo build --release --features mm2\n```\n\nCompile with `htslib` feature for using `scrubby alignment` command with `{sam, bam, cram}` format:\n\n```shell\ncargo build --release --features htslib\n```\n\n### Binaries\n\nYOLO pre-compiled binaries and execute :skull:\n\n```\ncurl -L https://github.com/esteinig/scrubby/release/scrubby-v1.0.0-mm2-linux-x86-64.tar.gz | tar -xvz ./scrubby\n```\n\n\n## Command-line interface\n\n- Reads should be quality- and adapter-trimmed before applying `Scrubby`.\n- Single or paired-end reads are supported with optional `gz` input/output compression. 
\n- Paired-end reads are always depleted/extracted as a pair (no unpaired read output).\n- Default `minimap2` presets are `sr` for paired-end reads and `map-ont` for single reads.\n- Multiple values can be specified consecutively or using multiple arguments (`-T Metazoa -T Bacteria`)\n\n### Reference indices\n\nList pre-built index names:\n\n```shell\nscrubby download --list\n```\n\nDownload pre-built index by name for aligners and classifiers:\n\n```shell\nscrubby download --name chm13v2 --aligner bowtie2 minimap2 --classifier kraken2\n```\n\nMore options for aligners and classifier index download:\n\n```shell\nscrubby download --help\n```\n\n### Read depletion or extraction\n\nRead depletion pipeline with `Bowtie2` aligner (default for paired-end reads):\n\n```shell\nscrubby reads -i r1.fq r2.fq -o c1.fq c2.fq -I chm13v2\n```\n\nUse built-in `minimap2-rs` if compiled with `mm2` feature (default for paired-end and long reads):\n\n```shell\nscrubby reads -i r1.fq r2.fq -o c1.fq c2.fq -I chm13v2.fa.gz\n```\n\nLong reads with non-default preset and `minimap2` aligner (default for long reads):\n\n```shell\nscrubby reads -i r.fq -o c.fq -I chm13v2.fa.gz --preset lr-hq\n```\n\nSingle-end short reads require an explicit aligner and preset for `minimap2`:\n\n```shell\nscrubby reads -i r1.fq -o c1.fq -I chm13v2.fa.gz --aligner minimap2 --preset sr\n```\n\nUse classifier `Kraken2` or `Metabuli` instead of aligner:\n\n```shell\nscrubby reads -i r1.fq r2.fq -o c1.fq c2.fq -T Chordata -D 9606 -I chm13v2_k2/ -c kraken2\n```\n\nUse different aligner `strobealign` or `minimap2`:\n\n```shell\nscrubby reads -i r1.fq r2.fq -o c1.fq c2.fq -I chm13v2.fa.gz -a strobealign\n```\n\nWith report output and depleted read identifiers:\n\n```shell\nscrubby reads -i r1.fq r2.fq -o c1.fq c2.fq -I chm13v2 -j report.json -r reads.tsv\n```\n\nInput and output compressed reads, increase threads and set working directory:\n\n```shell\nscrubby reads -i r1.fq.gz r2.fq.gz -o c1.fq.gz c2.fq.gz -I chm13v2 
-w /tmp -t 16\n```\n\nRead extraction instead of depletion:\n\n```shell\nscrubby reads -i r1.fq.gz r2.fq.gz -o c1.fq.gz c2.fq.gz -I chm13v2 -e\n```\n\n\n### Read depletion or extraction from outputs\n\nClassifier output cleaning for Kraken-style reports and read classification outputs (Kraken2, Metabuli):\n\n```shell\nscrubby classifier \\\n  --input r1.fq r2.fq \\\n  --output c1.fq c2.fq\\\n  --report kraken2.report \\\n  --reads kraken2.reads \\\n  --taxa Chordata \\\n  --taxa-direct 9606\n```\n\nAlignment output cleaning (.sam|.bam|.cram|.paf|.gaf) or read identifier list (.txt). Alignment format is recognized from file extension or can be explicitly set with `--format`. Alignment can be '-' for reading from `stdin` with explicit format argument. PAF and TXT formats can be compressed (.gz|.xz|.bz) unless reading\nfrom `stdin`.\n\n```shell\nscrubby alignment  \\\n  --input r1.fq r2.fq \\\n  --output c1.fq c2.fq\\\n  --alignment alignment.paf \\\n  --min-len 50 \\\n  --min-cov 0.5 \\\n  --min-mapq 50\n\nminimap2 -x map-ont ref.fa r.fq | scrubby alignment -a - -f paf -i r.fq -o c.fq\n```\n\n\n### Other options and utilities\n\nAdd the `--extract` (`-e`) flag to any of the above tasks to reverse read depletion for read extraction:\n\n```shell\nscrubby reads --extract ...\n```\n\nDifference between input and output reads with optional counts and read identifier summaries:\n\n```shell\nscrubby diff -i r1.fq r2.fq -o c1.fq c2.fq -j counts.json -r reads.tsv\n```\n\n\n### Report output format\n\n```json\n{\n  \"version\": \"0.7.0\",\n  \"date\": \"2024-07-30T06:50:15Z\",\n  \"command\": \"scrubby reads -i smoke_R1.fastq.gz -i smoke_R2.fastq.gz -o test_R1.fq.gz -o test_R2.fq.gz --index /data/opt/scrubby_indices/chm13v2 --threads 16 --workdir /tmp/test --json test.json\",\n  \"input\": [\n    \"smoke_R1.fastq.gz\",\n    \"smoke_R2.fastq.gz\"\n  ],\n  \"output\": [\n    \"test_R1.fq.gz\",\n    \"test_R2.fq.gz\"\n  ],\n  \"reads_in\": 6678,\n  \"reads_out\": 3346,\n  
\"reads_removed\": 3332,\n  \"reads_extracted\": 0,\n  \"settings\": {\n    \"aligner\": \"bowtie2\",\n    \"classifier\": null,\n    \"index\": \"/data/opt/scrubby_indices/chm13v2\",\n    \"alignment\": null,\n    \"reads\": null,\n    \"report\": null,\n    \"taxa\": [],\n    \"taxa_direct\": [],\n    \"classifier_args\": null,\n    \"aligner_args\": null,\n    \"min_len\": 0,\n    \"min_cov\": 0.0,\n    \"min_mapq\": 0,\n    \"extract\": false,\n    \"preset\": null\n  }\n}\n```\n\nIn this example, the `settings.aligner` is `null` if a `--classifier` is set.\n\n## Commands and options\n\n### Global options and commands\n\n```shell\nscrubby 0.7.0 \nEike Steinig (@esteinig)\n\nTaxonomic read depletion for clinical metagenomic diagnostics\n\nUsage: scrubby [OPTIONS] \u003cCOMMAND\u003e\n\nCommands:\n  reads       Deplete or extract reads using aligners or classifiers\n  classifier  Deplete or extract reads from classifier outputs (Kraken2/Metabuli)\n  alignment   Deplete or extract reads from aligner output with additional filters (SAM/BAM/PAF/TXT)\n  download    List available indices and download files for aligners and classfiers\n  diff        Get read counts and identifiers of the difference between input and output read files\n  help        Print this message or the help of the given subcommand(s)\n\nOptions:\n  -l, --log-file \u003cLOG_FILE\u003e  Output logs to file instead of terminal\n  -h, --help                 Print help (see more with '--help')\n  -V, --version              Print version\n```\n\n### Pre-built reference downloads\n\n```shell\nList available indices and download files for aligners and classfiers\n\nUsage: scrubby download [OPTIONS] --name [\u003cNAME\u003e...]\n\nOptions:\n  -n, --name [\u003cNAME\u003e...]            Index name to download [possible values: chm13v2]\n  -o, --outdir \u003cOUTDIR\u003e             Output directory for index download [default: .]\n  -a, --aligner [\u003cALIGNER\u003e...]      
Download index for one or more aligners [possible values: bowtie2, minimap2, strobealign]\n  -c, --classfier [\u003cCLASSFIER\u003e...]  Download index for one or more classifiers [possible values: kraken2, metabuli]\n  -l, --list                        List available index names and exit\n  -t, --timeout \u003cTIMEOUT\u003e           Download timeout in minutes - increase for large files and slow connections [default: 360]\n  -h, --help                        Print help (see more with '--help')\n```\n\n\n### Read depletion or extraction\n\n```shell\nDeplete or extract reads using aligners or classifiers\n\nUsage: scrubby reads [OPTIONS] --index \u003cINDEX\u003e\n\nOptions:\n  -i, --input [\u003cINPUT\u003e...]              Input read files (optional .gz)\n  -o, --output [\u003cOUTPUT\u003e...]            Output read files (optional .gz)\n  -I, --index \u003cINDEX\u003e                   Reference index for aligner or classifier\n  -e, --extract                         Read extraction instead of depletion\n  -a, --aligner \u003cALIGNER\u003e               Aligner to use, default is 'bowtie2' or 'minimap2' [possible values: bowtie2, minimap2, strobealign, minimap2-rs]\n  -c, --classifier \u003cCLASSIFIER\u003e         Classifier to use [possible values: kraken2, metabuli]\n  -T, --taxa [\u003cTAXA\u003e...]                Taxa and all sub-taxa to deplete using classifiers\n  -D, --taxa-direct [\u003cTAXA_DIRECT\u003e...]  
Taxa to deplete directly using classifiers\n  -t, --threads \u003cTHREADS\u003e               Number of threads to use for aligner and classifier [default: 4]\n  -j, --json \u003cJSON\u003e                     Summary output file (.json)\n  -w, --workdir \u003cWORKDIR\u003e               Optional working directory\n  -r, --read-ids \u003cREAD_IDS\u003e             Read identifier file (.tsv)\n  -h, --help                            Print help (see more with '--help')\n```\n\n### Classifier outputs\n\n\n```shell\nDeplete or extract reads from classifier outputs (Kraken2, Metabuli)\n\nUsage: scrubby classifier [OPTIONS] --report \u003cREPORT\u003e --reads \u003cREADS\u003e --classifier \u003cCLASSIFIER\u003e\n\nOptions:\n  -i, --input [\u003cINPUT\u003e...]              Input read files (optional .gz)\n  -o, --output [\u003cOUTPUT\u003e...]            Output read files (optional .gz)\n  -e, --extract                         Read extraction instead of depletion\n  -k, --report \u003cREPORT\u003e                 Kraken-style report output from classifier\n  -j, --reads \u003cREADS\u003e                   Kraken-style read classification output\n  -c, --classifier \u003cCLASSIFIER\u003e         Classifier output style [possible values: kraken2, metabuli]\n  -T, --taxa [\u003cTAXA\u003e...]                Taxa and all sub-taxa to deplete using classifiers\n  -D, --taxa-direct [\u003cTAXA_DIRECT\u003e...]  
Taxa to deplete directly using classifiers\n  -j, --json \u003cJSON\u003e                     Summary output file (.json)\n  -w, --workdir \u003cWORKDIR\u003e               Optional working directory\n  -r, --read-ids \u003cREAD_IDS\u003e             Read identifier file (.tsv)\n  -h, --help                            Print help (see more with '--help')\n```\n\n\n### Alignment outputs\n\n\n```shell\nDeplete or extract reads from aligner output with additional filters (SAM/BAM/PAF)\n\nUsage: scrubby alignment [OPTIONS] --alignment \u003cALIGNMENT\u003e\n\nOptions:\n  -i, --input [\u003cINPUT\u003e...]     Input read files (can be compressed with .gz)\n  -o, --output [\u003cOUTPUT\u003e...]   Output read files (can be compressed with .gz)\n  -e, --extract                Read extraction instead of depletion\n  -a, --alignment \u003cALIGNMENT\u003e  Alignment file in SAM/BAM/PAF/TXT format\n  -f, --format \u003cFORMAT\u003e        Explicit alignment format [possible values: sam, bam, cram, paf, txt]\n  -l, --min-len \u003cMIN_LEN\u003e      Minimum query alignment length filter [default: 0]\n  -c, --min-cov \u003cMIN_COV\u003e      Minimum query alignment coverage filter [default: 0]\n  -q, --min-mapq \u003cMIN_MAPQ\u003e    Minimum mapping quality filter [default: 0]\n  -j, --json \u003cJSON\u003e            Summary output file (.json)\n  -w, --workdir \u003cWORKDIR\u003e      Optional working directory\n  -r, --read-ids \u003cREAD_IDS\u003e    Read identifier file (.tsv)\n  -h, --help                   Print help (see more with '--help')\n```\n\n### Read difference\n\n```shell\nGet read counts and identifiers of the difference between input and output read files\n\nUsage: scrubby diff [OPTIONS]\n\nOptions:\n  -i, --input [\u003cINPUT\u003e...]    Input read files (.gz | .xz | .bz)\n  -o, --output [\u003cOUTPUT\u003e...]  
Output read files (.gz | .xz | .bz)\n  -j, --json \u003cJSON\u003e           Summary output file (.json)\n  -r, --read-ids \u003cREAD_IDS\u003e   Read identifier file (.tsv)\n  -h, --help                  Print help (see more with '--help')\n```\n\n## Rust library\n\nYou can use Scrubby with the builder structs from the prelude:\n\n```rust\nuse scrubby::prelude::*;\n\n// Example running Minimap2 on long reads\n\nlet scrubby_mm2_ont = Scrubby::builder(\n  \"/path/to/reads_in.fastq\", \n  \"/path/to/reads_out.fastq\"\n)\n  .json(\"/path/to/report.json\")\n  .extract(false)\n  .threads(16)\n  .index(\"/path/to/reference.fasta\")\n  .aligner(Aligner::Minimap2)\n  .preset(Preset::MapOnt)\n  .build();\n\nscrubby_mm2_ont.clean();\n\n// Example running Minimap2 on paired-end reads\n\nlet scrubby_mm2_sr = Scrubby::builder(\n  vec![\"/path/to/reads_in_R1.fastq\", \"/path/to/reads_in_R2.fastq\"],\n  vec![\"/path/to/reads_out_R1.fastq\", \"/path/to/reads_out_R2.fastq\"]\n)\n  .json(\"/path/to/report.json\")\n  .extract(false)\n  .threads(16)\n  .index(\"/path/to/reference.fasta\")\n  .aligner(Aligner::Minimap2)\n  .preset(Preset::Sr)\n  .build();\n\nscrubby_mm2_sr.clean();\n\n// Example running Kraken2, depleting Metazoa \n\nlet scrubby_kraken2_metazoa = Scrubby::builder(\n  \"/path/to/reads_in.fastq\", \n  \"/path/to/reads_out.fastq\"\n)\n  .json(\"/path/to/report.json\")\n  .extract(false)\n  .threads(16)\n  .index(\"/path/to/kraken/index\")\n  .classifier(Classifier::Kraken2)\n  .taxa(vec![\"Metazoa\"])\n  .build();\n\nscrubby_kraken2_metazoa.clean();\n\n// Example from Kraken2 outputs, depleting Metazoa \n\nlet scrubby_kraken2_output_metazoa = Scrubby::builder(\n  \"/path/to/reads_in.fastq\", \n  \"/path/to/reads_out.fastq\"\n)\n  .json(\"/path/to/report.json\")\n  .extract(false)\n  .classifier(Classifier::Kraken2)\n  .report(\"/path/to/kraken/report\")\n  .reads(\"/path/to/kraken/read/classifications\")\n  .taxa(vec![\"Metazoa\"])\n  
.build();\n\nscrubby_kraken2_output_metazoa.clean();\n\n// Example from alignment output file with filters\n\nlet scrubby_paf_output_filters = Scrubby::builder(\n  \"/path/to/reads_in.fastq\", \n  \"/path/to/reads_out.fastq\"\n)\n  .json(\"/path/to/report.json\")\n  .extract(false)\n  .alignment(\"/path/to/aln.paf\")\n  .min_query_length(50)\n  .min_query_coverage(0.5)\n  .min_mapq(50)\n  .build();\n\nscrubby_paf_output_filters.clean();\n\n// Downloader example\n\nlet scrubby_dl = ScrubbyDownloader::builder(\n  \"/path/to/download/directory\", \n  vec![ScrubbyIndex::Chm13v2]\n)\n  .timeout(180)\n  .aligner(vec![\n    Aligner::Minimap2, Aligner::Bowtie2\n  ])\n  .classifier(vec![\n    Classifier::Kraken2, Classifier::Metabuli\n  ])\n  .build();\n\nscrubby_dl.list();\nscrubby_dl.download_index();\n```\n\n## Dependencies\n\nRust libraries:\n\n* [`niffler`](https://github.com/luizirber/niffler)\n* [`needletail`](https://github.com/onecodex/needletail)\n* [`rust-htslib`](https://github.com/rust-bio/rust-htslib)\n* [`minimap2-rs`](https://github.com/jguhlin/minimap2-rs)\n\nAligners and classifiers:\n\n* [`samtools`](https://github.com/samtools/samtools)\n* [`minimap2`](https://github.com/lh3/minimap2)\n* [`strobealign`](https://github.com/ksahlin/strobealign)\n* [`Bowtie2`](https://github.com/BenLangmead/bowtie2)\n* [`Kraken2`](https://github.com/DerrickWood/kraken2)\n* [`Metabuli`](https://github.com/steineggerlab/Metabuli)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Festeinig%2Fscrubby","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Festeinig%2Fscrubby","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Festeinig%2Fscrubby/lists"}