{"id":16849205,"url":"https://github.com/mbhall88/classification_benchmark","last_synced_at":"2025-04-11T06:40:53.917Z","repository":{"id":194495389,"uuid":"643761252","full_name":"mbhall88/classification_benchmark","owner":"mbhall88","description":"Benchmarking different ways of doing read (taxonomic) classification, with a focus on removal of contamination and MTB classification","archived":false,"fork":false,"pushed_at":"2024-04-08T00:42:39.000Z","size":1205,"stargazers_count":12,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-04-06T15:53:18.137Z","etag":null,"topics":["bioinformatics","contamination","sequence-analysis","taxonomic-classification","tuberculosis"],"latest_commit_sha":null,"homepage":"https://doi.org/10.1093/gigascience/giae010","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mbhall88.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-05-22T05:25:56.000Z","updated_at":"2025-02-11T08:46:05.000Z","dependencies_parsed_at":"2024-04-08T01:39:04.799Z","dependency_job_id":"502b8abf-2654-4384-8c3a-d700e98e7cf6","html_url":"https://github.com/mbhall88/classification_benchmark","commit_stats":null,"previous_names":["mbhall88/classification_benchmark"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbhall88%2Fclassification_benchmark","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbhall88%2Fclassification_benchmark/tags","releases_url":"https://repos.ecosyste.ms/api/v
1/hosts/GitHub/repositories/mbhall88%2Fclassification_benchmark/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mbhall88%2Fclassification_benchmark/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mbhall88","download_url":"https://codeload.github.com/mbhall88/classification_benchmark/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248358549,"owners_count":21090401,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","contamination","sequence-analysis","taxonomic-classification","tuberculosis"],"created_at":"2024-10-13T13:14:36.375Z","updated_at":"2025-04-11T06:40:53.897Z","avatar_url":"https://github.com/mbhall88.png","language":"Python","readme":"# Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data\n\n\u003e Hall, Michael B., and Lachlan J. M. Coin. “Pangenome databases improve host removal and mycobacteria classification from clinical metagenomic data” GigaScience, April 4, 2024. \u003chttps://doi.org/10.1093/gigascience/giae010\u003e\n\nBenchmarking different ways of doing read (taxonomic) classification, with a focus on\nremoval of contamination and classification of _M. 
tuberculosis_ reads.\n\nThis repository contains the code and snakemake pipeline to build/download the\ndatabases, obtain all results from [the paper][doi], along with accompanying configuration\nfiles.\n\nCustom databases have all been uploaded to Zenodo, along with the simulated reads:\n\n- Nanopore simulated metagenomic reads - \u003chttps://doi.org/10.5281/zenodo.8339788\u003e\n- Illumina simulated metagenomic reads - \u003chttps://doi.org/10.5281/zenodo.8339790\u003e\n- Nanopore and Illumina artificial real reads - \u003chttps://doi.org/10.5281/zenodo.10472796\u003e\n- Kraken2 database built from the Human Pangenome Reference Consortium\n  genomes - \u003chttps://doi.org/10.5281/zenodo.8339731\u003e\n- Kraken2 database built from the kraken2 Human\n  library - \u003chttps://doi.org/10.5281/zenodo.8339699\u003e\n- Kraken2 database built from a *Mycobacterium* representative set of\n  genomes - \u003chttps://doi.org/10.5281/zenodo.8339821\u003e\n- A (fasta) database of representative genomes from the *Mycobacterium*\n  genus - \u003chttps://doi.org/10.5281/zenodo.8339940\u003e\n- A (fasta) database of *M. 
tuberculosis* genomes from a variety of\n  lineages - \u003chttps://doi.org/10.5281/zenodo.8339947\u003e\n- The fasta file built from the [Clockwork](https://github.com/iqbal-lab-org/clockwork)\n  decontamination pipeline - \u003chttps://doi.org/10.5281/zenodo.8339802\u003e\n\n## Example usage\n\nWe provide some usage examples showing how to download the databases and then use them\non your reads.\n\n### Human read removal\n\nThe method we found to give the best balance of runtime, memory usage, and precision and\nrecall was kraken2 with a database built from the Human Pangenome Reference Consortium\ngenomes.\n\nThis example has been wrapped into a standalone tool called [`nohuman`](https://github.com/mbhall88/nohuman/), which takes a fastq as input and returns a fastq with human reads removed.\n\n#### Download human database\n\n```\nmkdir HPRC_db/\ncd HPRC_db\nURL=\"https://zenodo.org/record/8339732/files/k2_HPRC_20230810.tar.gz\"\nwget \"$URL\"\ntar -xzf k2_HPRC_20230810.tar.gz\nrm k2_HPRC_20230810.tar.gz\n```\n\n#### Run kraken2 with HPRC database\n\nYou'll need [kraken2](https://github.com/DerrickWood/kraken2) installed for this step.\n\n```\nkraken2 --threads 4 --db HPRC_db/ --output classifications.tsv reads.fq\n```\n\nIf you are using Illumina reads, a slight adjustment is needed:\n\n```\nkraken2 --paired --threads 4 --db HPRC_db/ --output classifications.tsv reads_1.fq reads_2.fq\n```\n\n#### Extract non-human reads\n\nYou'll need [seqkit](https://github.com/shenwei356/seqkit) installed for this step.\n\nFor Nanopore data:\n\n```\nawk -F'\\t' '$1==\"U\" {print $2}' classifications.tsv | \\\n  seqkit grep -f - -o reads.depleted.fq reads.fq\n```\n\nFor Illumina data:\n\n```\nawk -F'\\t' '$1==\"U\" {print $2}' classifications.tsv \u003e ids.txt\nseqkit grep --id-regexp '^(\\S+)/[12]' -f ids.txt -o reads_1.depleted.fq reads_1.fq\nseqkit grep --id-regexp '^(\\S+)/[12]' -f ids.txt -o reads_2.depleted.fq reads_2.fq\n```\n\n### *M. 
tuberculosis* classification/enrichment\n\nFor this step we recommend either [minimap2](https://github.com/lh3/minimap2) or kraken2\nwith a *Mycobacterium* genus database. We leave it to the user to decide which approach\nthey prefer based on the results in our manuscript.\n\n#### Download databases\n\n```\nmkdir Mycobacterium_db\ncd Mycobacterium_db\n# download database for use with minimap2\nURL=\"https://zenodo.org/record/8339941/files/Mycobacterium.rep.fna.gz\"\nwget \"$URL\"\nIDS_URL=\"https://zenodo.org/record/8343322/files/mtb.ids\"\nwget \"$IDS_URL\"\n# download kraken database\nURL=\"https://zenodo.org/record/8339822/files/k2_Mycobacterium_20230817.tar.gz\"\nwget \"$URL\"\ntar -xzf k2_Mycobacterium_20230817.tar.gz\nrm k2_Mycobacterium_20230817.tar.gz\n```\n\n#### Classify reads\n\n**minimap2**\n\n```\n# nanopore\nminimap2 --secondary=no -c -t 4 -x map-ont -o reads.aln.paf Mycobacterium_db/Mycobacterium.rep.fna.gz reads.depleted.fq\n# illumina\nminimap2 --secondary=no -c -t 4 -x sr -o reads.aln.paf Mycobacterium_db/Mycobacterium.rep.fna.gz reads_1.depleted.fq reads_2.depleted.fq\n```\n\n**kraken2**\n\n```\n# nanopore\nkraken2 --db Mycobacterium_db --threads 4 --report myco.kreport --output classifications.myco.tsv reads.depleted.fq\n# illumina\nkraken2 --db Mycobacterium_db --paired --threads 4 --report myco.kreport --output classifications.myco.tsv reads_1.depleted.fq reads_2.depleted.fq\n```\n\n#### Extract *M. 
tuberculosis* reads\n\n**minimap2**\n\n```\n# nanopore\ngrep -Ff Mycobacterium_db/mtb.ids reads.aln.paf | cut -f1 | \\\n  seqkit grep -f - -o reads.enriched.fq reads.depleted.fq\n# illumina\ngrep -Ff Mycobacterium_db/mtb.ids reads.aln.paf | cut -f1 \u003e keep.ids\nseqkit grep -f keep.ids -o reads_1.enriched.fq reads_1.depleted.fq\nseqkit grep -f keep.ids -o reads_2.enriched.fq reads_2.depleted.fq\n```\n\n**kraken2**\n\nWe'll use\nthe [`extract_kraken_reads.py` script](https://github.com/jenniferlu717/KrakenTools#extract_kraken_readspy)\nfor this.\n\n```\n# nanopore\npython extract_kraken_reads.py -k classifications.myco.tsv -1 reads.depleted.fq -o reads.enriched.fq -t 1773 -r myco.kreport --include-children\n# illumina\npython extract_kraken_reads.py -k classifications.myco.tsv -1 reads_1.depleted.fq -2 reads_2.depleted.fq -o reads_1.enriched.fq -o2 reads_2.enriched.fq -t 1773 -r myco.kreport --include-children\n```\n\n[doi]: https://doi.org/10.1093/gigascience/giae010 \n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbhall88%2Fclassification_benchmark","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmbhall88%2Fclassification_benchmark","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmbhall88%2Fclassification_benchmark/lists"}