{"id":15489784,"url":"https://github.com/oschwengers/referenceseeker","last_synced_at":"2025-08-20T10:31:25.538Z","repository":{"id":41285482,"uuid":"155717045","full_name":"oschwengers/referenceseeker","owner":"oschwengers","description":"Rapid determination of appropriate reference genomes.","archived":false,"fork":false,"pushed_at":"2024-01-12T08:54:26.000Z","size":20630,"stargazers_count":89,"open_issues_count":3,"forks_count":5,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-10-19T08:59:38.117Z","etag":null,"topics":["ani","bioinformatics","mash","microbiology","reference-genomes","refseq","wgs"],"latest_commit_sha":null,"homepage":"https://doi.org/10.21105/joss.01994","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oschwengers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-11-01T13:11:49.000Z","updated_at":"2024-10-16T14:59:14.000Z","dependencies_parsed_at":"2024-01-12T13:12:36.627Z","dependency_job_id":null,"html_url":"https://github.com/oschwengers/referenceseeker","commit_stats":{"total_commits":216,"total_committers":2,"mean_commits":108.0,"dds":0.09259259259259256,"last_synced_commit":"711faf36646cc067ed0427562faa1b82be386e87"},"previous_names":[],"tags_count":16,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oschwengers%2Freferenceseeker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oschwengers%2Freferenceseeker/tags","releases_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oschwengers%2Freferenceseeker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oschwengers%2Freferenceseeker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oschwengers","download_url":"https://codeload.github.com/oschwengers/referenceseeker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230415317,"owners_count":18222158,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ani","bioinformatics","mash","microbiology","reference-genomes","refseq","wgs"],"created_at":"2024-10-02T07:07:56.892Z","updated_at":"2025-08-20T10:31:25.531Z","avatar_url":"https://github.com/oschwengers.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![DOI](https://joss.theoj.org/papers/10.21105/joss.01994/status.svg)](https://doi.org/10.21105/joss.01994)\n[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-brightgreen.svg)](https://github.com/oschwengers/referenceseeker/blob/master/LICENSE)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/referenceseeker.svg)\n![GitHub release](https://img.shields.io/github/release/oschwengers/referenceseeker.svg)\n[![PyPI](https://img.shields.io/pypi/v/referenceseeker.svg)](https://pypi.org/project/referenceseeker)\n![PyPI - 
Status](https://img.shields.io/pypi/status/referenceseeker.svg)\n[![Conda](https://img.shields.io/conda/v/bioconda/referenceseeker.svg)](http://bioconda.github.io/recipes/referenceseeker/README.html)\n![Python package](https://github.com/oschwengers/referenceseeker/workflows/Python%20package/badge.svg?branch=master)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3562004.svg)](https://doi.org/10.5281/zenodo.3562004)\n\n# ReferenceSeeker: rapid determination of appropriate reference genomes\n\n## Contents\n\n- [Description](#description)\n- [Input \u0026 Output](#input-output)\n- [Installation](#installation)\n  - [BioConda](#bioconda)\n  - [GitHub](#github)\n- [Usage](#usage)\n- [Examples](#examples)\n- [Databases](#databases)\n  - [RefSeq](#refseq-based)\n  - [Custom](#custom-database)\n- [Dependencies](#dependencies)\n- [Citation](#citation)\n- [Feedback](#feedback)\n\n## Description\n\nReferenceSeeker determines closely related reference genomes following a scalable hierarchical approach combining a fast kmer profile-based database lookup of candidate reference genomes and subsequent computation of specific average nucleotide identity (ANI) values for the rapid determination of suitable reference genomes.\n\nReferenceSeeker computes kmer-based genome distances between a query genome and potential reference genome candidates via Mash (Ondov et al. 2016). For the resulting candidates, ReferenceSeeker subsequently computes (bidirectional) ANI values, picking genomes meeting community standard thresholds by default (ANI \u003e= 95 % \u0026 conserved DNA \u003e= 69 %) (Goris, Konstantinos et al. 2007), ranked by the product of ANI and conserved DNA values to take into account both genome coverage and identity.\n\nCustom databases can be built with local genomes. 
For further convenience, we provide pre-built databases with sequences from RefSeq (\u003chttps://www.ncbi.nlm.nih.gov/refseq\u003e), GTDB and PLSDB comprising the following taxa:\n\n- bacteria\n- archaea\n- fungi\n- protozoa\n- viruses\n\nas well as *plasmids*.\n\nThe reasoning for subsequent calculations of both ANI and conserved DNA values is that Mash distance values correlate well with ANI values for closely related genomes; however, the same is not true for conserved DNA values. A kmer fingerprint-based comparison alone cannot distinguish whether a kmer is missing due to a SNP, for instance, or due to a lack of the kmer-comprising subsequence. As DNA conservation (next to DNA identity) is very important for many kinds of analyses, *e.g.* reference based SNP detections, ranking potential reference genomes based on a Mash distance alone is often not sufficient in order to select the most appropriate reference genomes. If desired, ANI and conserved DNA values can be computed bidirectionally.\n\n![Mash D vs. ANI / conDNA](mash-ani-cdna.mini.png?raw=true)\n\n## Input \u0026 Output\n\n### Input\n\nPath to a taxon database and a draft or finished genome in (zipped) fasta format:\n\n```bash\n$ referenceseeker ~/bacteria GCF_000013425.1.fna\n```\n\n### Output\n\nTab separated lines to STDOUT comprising the following columns:\n\nUnidirectionally (query -\u003e references):\n\n- RefSeq Assembly ID\n- Mash Distance\n- ANI\n- Conserved DNA\n- NCBI Taxonomy ID\n- Assembly Status\n- Organism\n\n```bash\n#ID    Mash Distance    ANI    Con. DNA    Taxonomy ID    Assembly Status    Organism\nGCF_000013425.1    0.00000    100.00    100.00    93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325\nGCF_001900185.1    0.00002    100.00    99.89     46170    complete    Staphylococcus aureus subsp. aureus HG001\nGCF_900475245.1    0.00004    100.00    99.57     93061    complete    Staphylococcus aureus subsp. 
aureus NCTC 8325 NCTC8325\nGCF_001018725.2    0.00016    100.00    99.28     1280     complete    Staphylococcus aureus FDAARGOS_10\nGCF_003595465.1    0.00185    99.86     96.81     1280     complete    Staphylococcus aureus USA300-SUR6\nGCF_003595385.1    0.00180    99.87     96.80     1280     complete    Staphylococcus aureus USA300-SUR2\nGCF_003595365.1    0.00180    99.87     96.80     1280     complete    Staphylococcus aureus USA300-SUR1\nGCF_001956815.1    0.00180    99.87     96.80     46170    complete    Staphylococcus aureus subsp. aureus USA300_SUR1\n...\n```\n\nBidirectionally (query -\u003e references [QR] \u0026 references -\u003e query [RQ]):\n\n- RefSeq Assembly ID\n- Mash Distance\n- QR ANI\n- QR Conserved DNA\n- RQ ANI\n- RQ Conserved DNA\n- NCBI Taxonomy ID\n- Assembly Status\n- Organism\n\n```bash\n#ID    Mash Distance    QR ANI    QR Con. DNA    RQ ANI    RQ Con. DNA    Taxonomy ID    Assembly Status    Organism\nGCF_000013425.1    0.00000    100.00    100.00    100.00    100.00    93061    complete    Staphylococcus aureus subsp. aureus NCTC 8325\nGCF_001900185.1    0.00002    100.00    99.89     100.00    99.89     46170    complete    Staphylococcus aureus subsp. aureus HG001\nGCF_900475245.1    0.00004    100.00    99.57     99.99     99.67     93061    complete    Staphylococcus aureus subsp. 
aureus NCTC 8325 NCTC8325\nGCF_001018725.2    0.00016    100.00    99.28     99.95     98.88     1280     complete    Staphylococcus aureus FDAARGOS_10\nGCF_001018915.2    0.00056    99.99     96.35     99.98     99.55     1280     complete    Staphylococcus aureus NRS133\nGCF_001019415.2    0.00081    99.99     94.47     99.98     99.36     1280     complete    Staphylococcus aureus NRS146\nGCF_001018735.2    0.00096    100.00    94.76     99.98     98.58     1280     complete    Staphylococcus aureus NRS137\nGCF_003354885.1    0.00103    99.93     96.63     99.93     96.66     1280     complete    Staphylococcus aureus 164\n...\n```\n\n## Installation\n\nReferenceSeeker can be installed via Conda and Git(Hub). In either case, a taxon database must be downloaded which we provide for download at Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3562004.svg)](https://doi.org/10.5281/zenodo.3562004)\nFor more information have a look at [Databases](#databases).\n\n### BioConda\n\nThe preferred way to install and run ReferenceSeeker is [Conda](https://conda.io/docs/install/quick.html) using the [Bioconda](https://bioconda.github.io/) channel:\n\n```bash\n$ conda install -c bioconda referenceseeker\n$ referenceseeker --help\n```\n\n### GitHub\n\nAlternatively, you can use this raw GitHub repository:\n\n1. install necessary Python dependencies (if necessary)\n2. clone the latest version of the repository\n3. 
install necessary 3rd party executables (Mash, MUMmer4)\n\n```bash\n$ pip3 install --user biopython xopen\n$ git clone https://github.com/oschwengers/referenceseeker.git\n$ # install Mash \u0026 MUMmer\n$ ./referenceseeker/bin/referenceseeker --help\n```\n\n### Test\n\nTo test your installation, we prepared a tiny mock database comprising 4 `Salmonella spp` genomes and a query assembly (SRA: SRR498276) in the `test` directory:\n\n```bash\n$ git clone https://github.com/oschwengers/referenceseeker.git\n\n  # GitHub installation\n$ ./referenceseeker/bin/referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta\n\n  # BioConda installation\n$ referenceseeker referenceseeker/test/db referenceseeker/test/data/Salmonella_enterica_CFSAN000189.fasta\n```\n\nExpected output:\n\n```bash\n#ID    Mash Distance    ANI    Con. DNA    Taxonomy ID    Assembly Status    Organism\nGCF_000439415.1    0.00003    100.00    99.55    1173427    complete    Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189\nGCF_900205275.1    0.01522    98.61     83.13    90370      complete    Salmonella enterica subsp. 
enterica serovar Typhi\n```\n\n## Usage\n\nUsage:\n\n```bash\nusage: referenceseeker [--crg CRG] [--ani ANI] [--conserved-dna CONSERVED_DNA]\n                       [--unfiltered] [--bidirectional] [--help] [--version]\n                       [--verbose] [--threads THREADS]\n                       \u003cdatabase\u003e \u003cgenome\u003e\n\nRapid determination of appropriate reference genomes.\n\npositional arguments:\n  \u003cdatabase\u003e            ReferenceSeeker database path\n  \u003cgenome\u003e              target draft genome in fasta format\n\nFilter options / thresholds:\n  These options control the filtering and alignment workflow.\n\n  --crg CRG, -r CRG     Max number of candidate reference genomes to pass kmer\n                        prefilter (default = 100)\n  --ani ANI, -a ANI     ANI threshold (default = 0.95)\n  --conserved-dna CONSERVED_DNA, -c CONSERVED_DNA\n                        Conserved DNA threshold (default = 0.69)\n  --unfiltered, -u      Set kmer prefilter to extremely conservative values\n                        and skip species level ANI cutoffs (ANI \u003e= 0.95 and\n                        conserved DNA \u003e= 0.69\n  --bidirectional, -b   Compute bidirectional ANI/conserved DNA values\n                        (default = False)\n\nRuntime \u0026 auxiliary options:\n  --help, -h            Show this help message and exit\n  --version, -V         show program's version number and exit\n  --verbose, -v         Print verbose information\n  --threads THREADS, -t THREADS\n                        Number of used threads (default = number of available\n                        CPU cores)\n```\n\n## Examples\n\nInstallation:\n\n```bash\n$ conda install -c bioconda referenceseeker\n$ wget https://zenodo.org/record/4415843/files/bacteria-refseq.tar.gz\n$ tar -xzf bacteria-refseq.tar.gz\n$ rm bacteria-refseq.tar.gz\n```\n\nSimple:\n\n```bash\n$ # referenceseeker \u003cREFERENCE_SEEKER_DB\u003e \u003cGENOME\u003e\n$ referenceseeker 
bacteria-refseq/ genome.fasta\n```\n\nExpert: verbose output and increased output of candidate reference genomes using a defined number of threads:\n\n```bash\n$ # referenceseeker --crg 500 --verbose --threads 8 \u003cREFERENCE_SEEKER_DB\u003e \u003cGENOME\u003e\n$ referenceseeker --crg 500 --verbose --threads 8 bacteria-refseq/ genome.fasta\n```\n\n## Databases\n\nReferenceSeeker depends on databases comprising taxonomic genome information as well as kmer hash profiles for each entry.\n\n### Pre-built\n\nWe provide pre-built databases based on public genome data hosted at Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3562004.svg)](https://doi.org/10.5281/zenodo.3562004)\n\n#### RefSeq\n\nrelease: 221 (2024-01-09)\n\n| Taxon | URL | # Genomes | Size |\n| :---: | --- | ---: | :---: |\n| bacteria | \u003chttps://zenodo.org/record/4415843/files/bacteria-refseq.tar.gz\u003e | 50,226 | 59.6 Gb |\n| archaea | \u003chttps://zenodo.org/record/4415843/files/archaea-refseq.tar.gz\u003e | 905 | 897 Mb |\n| fungi | \u003chttps://zenodo.org/record/4415843/files/fungi-refseq.tar.gz\u003e | 557 | 5.9 Gb |\n| protozoa | \u003chttps://zenodo.org/record/4415843/files/protozoa-refseq.tar.gz\u003e | 90 | 1.1 Gb |\n| viruses | \u003chttps://zenodo.org/record/4415843/files/viral-refseq.tar.gz\u003e | 14,012 | 1 Mb |\n\n#### GTDB\n\nrelease: v214 (2024-01-11)\n\n| Taxon | URL | # Genomes | Size |\n| :---: | --- | ---: | :---: |\n| bacteria | n.a. 
due to storage quota restrictions | 80,789 | 82 Gb |\n| archaea | \u003chttps://zenodo.org/record/4415843/files/archaea-gtdb.tar.gz\u003e | 4,416 | 2.8 Gb |\n\n#### Plasmids\n\nIn addition to the genome based databases, we provide the following plasmid databases based on RefSeq and PLSDB:\n\n| DB | URL | # Plasmids | Size |\n| :---: | --- | ---: | :---: |\n| RefSeq | \u003chttps://zenodo.org/record/4415843/files/plasmids-refseq.tar.gz\u003e | 81,674 | 2.6 Gb |\n| PLSDB | \u003chttps://zenodo.org/record/4415843/files/plasmids-plsdb.tar.gz\u003e | 59,882 | 2.3 Gb |\n\n### Custom database\n\nIf the above-mentioned RefSeq-based databases do not contain sufficiently closely related genomes or are just too large, ReferenceSeeker provides auxiliary commands to either create databases from scratch or expand existing ones. For this purpose, a second executable `referenceseeker_db` accepts `init` and `import` subcommands:\n\nUsage:\n\n```bash\nusage: referenceseeker_db [--help] [--version] {init,import} ...\n\nRapid determination of appropriate reference genomes.\n\npositional arguments:\n  {init,import}  sub-command help\n    init         Initialize a new database\n    import       Add a new genome to database\n\nRuntime \u0026 auxiliary options:\n  --help, -h     Show this help message and exit\n  --version, -V  show program's version number and exit\n```\n\nIf a new database should be created, use `referenceseeker_db init`:\n\n```bash\nusage: referenceseeker_db init [-h] [--output OUTPUT] --db DB\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --output OUTPUT, -o OUTPUT\n                        output directory (default = current working directory)\n  --db DB, -d DB        Name of the new ReferenceSeeker database\n```\n\nThis new database or an existing one can be used to import genomes in Fasta, GenBank or EMBL format:\n\n```bash\nusage: referenceseeker_db import [-h] --db DB --genome GENOME [--id ID]\n                                 
[--taxonomy TAXONOMY]\n                                 [--status {complete,chromosome,scaffold,contig}]\n                                 [--organism ORGANISM]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --db DB, -d DB        ReferenceSeeker database path\n  --genome GENOME, -g GENOME\n                        Genome path [Fasta, GenBank, EMBL]\n  --id ID, -i ID        Unique genome identifier (default sequence id of first\n                        record)\n  --taxonomy TAXONOMY, -t TAXONOMY\n                        Taxonomy ID (default = 12908 [unclassified sequences])\n  --status {complete,chromosome,scaffold,contig}, -s {complete,chromosome,scaffold,contig}\n                        Assembly level (default = contig)\n  --organism ORGANISM, -o ORGANISM\n                        Organism name (default = \"NA\")\n```\n\nExample:\n\nIf ReferenceSeeker is properly installed, clone this repository and change into its root directory.\n\n```\n$ git clone https://github.com/oschwengers/referenceseeker.git\n$ cd referenceseeker\n$ referenceseeker_db init --db test-db --output ./\n$ referenceseeker_db import --db ./test-db --genome test/db/GCF_000439415.1.fna.gz --id GCF_000439415.1 --taxonomy 28901 --status complete --organism \"Salmonella enterica subsp. enterica serovar Bareilly str. CFSAN000189\"\n$ referenceseeker_db import --db ./test-db --genome test/db/GCF_002211925.1.fna.gz --id GCF_002211925.1 --organism \"Salmonella bongori str. SA19983605\"\n$ referenceseeker -v ./test-db ./test/data/Salmonella_enterica_CFSAN000189.fasta\n```\n\n## Dependencies\n\nReferenceSeeker needs the following dependencies:\n\n- Python (3.8, 3.9), Biopython (\u003e=1.78), xopen (\u003e=1.1.0)\n- Mash (2.3) \u003chttps://github.com/marbl/Mash\u003e\n- MUMmer (4.0.0-beta2) \u003chttps://github.com/gmarcais/mummer\u003e\n\nReferenceSeeker has been tested against the aforementioned versions.\n\n## Citation\n\n\u003e Schwengers et al., (2020). 
ReferenceSeeker: rapid determination of appropriate reference genomes. Journal of Open Source Software, 5(46), 1994, https://doi.org/10.21105/joss.01994\n\n## Feedback\n\nWe very much welcome and appreciate feedback of all kinds!\n\nSo, if you run into any issues with ReferenceSeeker, we'd be happy to hear about it! Please start the pipeline with -v (verbose) and do not hesitate to file an issue here on GitHub including as much of the following as possible:\n\n- a detailed description of the issue\n- the ReferenceSeeker command-line output\n- a reproducible example of the issue with a small dataset that you can share (helps us identify whether the issue is specific to a particular computer, operating system, and/or dataset).\n\nThe maintenance of ReferenceSeeker is supported by [deNBI](https://www.denbi.de). If you would like to provide (non-technical) feedback, please find a service monitoring survey [here](https://www.surveymonkey.de/r/denbi-service?sc=bigi\u0026tool=referenceseeker).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foschwengers%2Freferenceseeker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foschwengers%2Freferenceseeker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foschwengers%2Freferenceseeker/lists"}