{"id":13704226,"url":"https://github.com/oschwengers/bakta","last_synced_at":"2025-05-14T23:06:30.984Z","repository":{"id":37417556,"uuid":"234191178","full_name":"oschwengers/bakta","owner":"oschwengers","description":"Rapid \u0026 standardized annotation of bacterial genomes, MAGs \u0026 plasmids","archived":false,"fork":false,"pushed_at":"2025-03-11T15:13:59.000Z","size":104125,"stargazers_count":497,"open_issues_count":19,"forks_count":60,"subscribers_count":14,"default_branch":"main","last_synced_at":"2025-04-13T19:39:16.985Z","etag":null,"topics":["annotation","bacteria","bacterial-genomes","bioinformatics","genome-annotation","mag","metagenome-assembled-genomes","microbial-genomics","plasmids"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/oschwengers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.bib","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-15T23:08:38.000Z","updated_at":"2025-04-10T02:50:02.000Z","dependencies_parsed_at":"2023-11-10T16:25:04.396Z","dependency_job_id":"08d4effb-6a1b-4811-95aa-67c5e276bc71","html_url":"https://github.com/oschwengers/bakta","commit_stats":{"total_commits":955,"total_committers":17,"mean_commits":56.1764705882353,"dds":"0.28167539267015707","last_synced_commit":"8206741fa560e4ae4f42f9995f04fce0a049f03b"},"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oschwengers%2Fbakta","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/oschwengers%2Fbakta/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oschwengers%2Fbakta/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/oschwengers%2Fbakta/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/oschwengers","download_url":"https://codeload.github.com/oschwengers/bakta/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254243360,"owners_count":22038046,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["annotation","bacteria","bacterial-genomes","bioinformatics","genome-annotation","mag","metagenome-assembled-genomes","microbial-genomics","plasmids"],"created_at":"2024-08-02T21:01:05.956Z","updated_at":"2025-05-14T23:06:25.952Z","avatar_url":"https://github.com/oschwengers.png","language":"Python","funding_links":[],"categories":["Next Generation Sequencing"],"sub_categories":["Annotation"],"readme":"[![DOI:10.1099/mgen.0.000685](https://zenodo.org/badge/DOI/10.1099/mgen.0.000685.svg)](https://doi.org/10.1099/mgen.0.000685)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4247252.svg)](https://doi.org/10.5281/zenodo.4247252)\n[![License: GPL v3](https://img.shields.io/badge/License-GPL%20v3-brightgreen.svg)](https://github.com/oschwengers/bakta/blob/master/LICENSE)\n![PyPI - Python Version](https://img.shields.io/pypi/pyversions/bakta.svg)\n![PyPI - Status](https://img.shields.io/pypi/status/bakta.svg)\n![GitHub 
release](https://img.shields.io/github/release/oschwengers/bakta.svg)\n\n[![PyPI](https://img.shields.io/pypi/v/bakta.svg)](https://pypi.org/project/bakta)\n[![Conda](https://img.shields.io/conda/vn/bioconda/bakta.svg)](https://bioconda.github.io/recipes/bakta/README.html)\n[![Docker Image Version](https://img.shields.io/docker/v/oschwengers/bakta?sort=semver\u0026label=docker)](https://hub.docker.com/r/oschwengers/bakta)\n[![Spack](https://img.shields.io/spack/v/py-bakta)](https://packages.spack.io/package.html?name=py-bakta)\n[![Galaxy Toolshed - Tool Version](https://img.shields.io/galaxytoolshed/v/bakta/iuc/bakta?label=usegalaxy.eu)](https://usegalaxy.eu/root?tool_id=bakta)\n[![Static Badge](https://img.shields.io/badge/bio.tools-v1.9.4-blue?link=https%3A%2F%2Fbio.tools%2Fbakta)](https://bio.tools/bakta)\n\n# Bakta: rapid \u0026 standardized annotation of bacterial genomes, MAGs \u0026 plasmids\n\nBakta is a tool for the rapid \u0026 standardized annotation of bacterial genomes and plasmids from both isolates and MAGs. 
It provides **dbxref**-rich, **sORF**-including and taxon-independent annotations in machine-readable `JSON` \u0026 bioinformatics standard file formats for automated downstream analysis.\n\n## Contents\n\n- [Description](#description)\n- [Installation](#installation)\n- [Examples](#examples)\n- [Input \u0026 Output](#input-and-output)\n- [Usage](#usage)\n- [Annotation Workflow](#annotation-workflow)\n- [Database](#database)\n- [Genome Submission](#genome-submission)\n- [Protein bulk annotation](#protein-bulk-annotation)\n- [Genome plots](#genome-plots)\n- [Auxiliary scripts](#auxiliary-scripts)\n- [Web version](#web-version)\n- [Citation](#citation)\n- [FAQ](#faq)\n- [Issues \u0026 Feature Requests](#issues-and-feature-requests)\n\n## Description\n\n- **Comprehensive \u0026 taxonomy-independent database**\nBakta provides a large and taxonomy-independent database using UniProt's entire [UniRef](https://www.uniprot.org/uniref/) protein sequence cluster universe. Thus, it achieves favourable annotations in terms of sensitivity and specificity along the broad continuum ranging from well-studied species to unknown genomes from MAGs.\n\n- **Protein sequence identification**\nBakta exactly identifies known identical protein sequences (**IPS**) from RefSeq and UniProt allowing the fine-grained annotation of gene alleles (`AMR`) or closely related but distinct protein families. This is achieved via an alignment-free sequence identification (**AFSI**) approach using full-length `MD5` protein sequence hash digests.\n\n- **Fast**\nThis AFSI approach substantially accelerates the annotation process by avoiding computationally expensive homology searches for identified genes. 
Thus, Bakta can annotate a typical bacterial genome in 10 \u0026plusmn;5 min on a laptop, plasmids in a couple of seconds/minutes.\n\n- **Database cross-references**\nFostering the [FAIR](https://www.go-fair.org/fair-principles) principles, Bakta exploits its AFSI approach to annotate CDS with database cross-references (**dbxref**) to RefSeq (`WP_*`), UniRef100 (`UniRef100_*`) and UniParc (`UPI*`). By doing so, IPS allow the surveillance of distinct gene alleles and streamlining comparative analysis as well as posterior (external) annotations of `putative` \u0026 `hypothetical` protein sequences which can be mapped back to existing CDS via these exact \u0026 stable identifiers (*E. coli* gene [ymiA](https://www.uniprot.org/uniprot/P0CB62) [...more](https://www.uniprot.org/help/dubious_sequences)). Currently, Bakta identifies ~350 million, ~330 million and ~290 million distinct protein sequences from UniParc, UniRef100 and RefSeq, respectively. Hence, for certain genomes, up to 99 % of all CDS can be identified this way, skipping computationally expensive sequence alignments.\n\n- **FAIR annotations**\nTo provide standardized annotations adhering to FAIR principles, Bakta utilizes a versioned custom annotation database comprising UniProt's [UniRef100 \u0026 UniRef90](https://www.uniprot.org/uniref/) protein clusters (FAIR -\u003e [DOI](http://dx.doi.org/10.1038/s41597-019-0180-9)/[DOI](https://doi.org/10.1093/nar/gkaa1100)) enriched with dbxrefs (`GO`, `COG`, `EC`) and annotated by specialized niche databases. 
For each DB version we provide a comprehensive log file of all imported sequences and annotations.\n\n- **Small proteins / short open reading frames**\nBakta detects and annotates small proteins/short open reading frames (**sORF**) which are not predicted by tools like `Prodigal`.\n\n- **Expert annotation systems**\nTo provide high quality annotations for certain proteins of higher interest, *e.g.* AMR \u0026 VF genes, Bakta includes \u0026 merges different expert annotation systems. Currently, Bakta uses NCBI's AMRFinderPlus for AMR gene annotations as well as a generalized protein sequence expert system with distinct coverage, identity and priority values for each sequence, currently comprising the [VFDB](http://www.mgc.ac.cn/VFs/main.htm) as well as NCBI's [BlastRules](https://ftp.ncbi.nih.gov/pub/blastrules/).\n\n- **Comprehensive workflow**\nBakta annotates ncRNA cis-regulatory regions, oriC/oriV/oriT and assembly gaps as well as standard feature types: tRNA, tmRNA, rRNA, ncRNA genes, CRISPR, CDS and pseudogenes.\n\n- **GFF3 \u0026 INSDC-compliant annotations**\nBakta writes GFF3 and INSDC-compliant (GenBank \u0026 EMBL) annotation files ready for submission (checked via [GenomeTools GFF3Validator](http://genometools.org/cgi-bin/gff3validator.cgi), [table2asn_GFF](https://www.ncbi.nlm.nih.gov/genbank/genomes_gff/#run) and [ENA Webin-CLI](https://github.com/enasequence/webin-cli) for GFF3 and EMBL file formats, respectively, for representative genomes of all ESKAPE species).\n\n- **Bacteria \u0026 plasmids only**\nBakta was designed to annotate bacteria (isolates \u0026 MAGs) and plasmids, only. 
This decision by design has been made in order to tweak the annotation process regarding tools, preferences \u0026 databases and to streamline further development \u0026 maintenance of the software.\n\n- **Reasoning**\nBy annotating bacterial genomes in a standardized, taxonomy-independent, high-throughput and local manner, Bakta aims at a well-balanced tradeoff between fully featured but computationally demanding pipelines like [PGAP](https://github.com/ncbi/pgap) and rapid highly customizable offline tools like [Prokka](https://github.com/tseemann/prokka). Indeed, Bakta is heavily inspired by Prokka (kudos to [Torsten Seemann](https://github.com/tseemann)) and many command line options are compatible for the sake of interoperability and user convenience. Hence, if Bakta does not fit your needs, please consider trying Prokka.\n\n## Installation\n\nBakta can be installed via BioConda, Docker, Singularity and Pip. However, we encourage using [Conda](https://conda.io/projects/conda/en/latest/user-guide/install/index.html) or [Docker](https://www.docker.com/get-started)/[Singularity](https://sylabs.io/singularity) to automatically install all required 3rd party dependencies.\n\nIn all cases a mandatory [database](#database-download) must be downloaded.\n\n### BioConda\n\n```bash\nconda install -c conda-forge -c bioconda bakta\n```\n\n### Podman (Docker)\n\nWe maintain a Docker image `oschwengers/bakta` providing an entrypoint, so that containers can be used like an executable:\n\n```bash\npodman pull oschwengers/bakta\npodman run oschwengers/bakta --help\n```\n\nInstallation instructions and get-started guides: Podman [docs](https://podman.io/docs). 
For further convenience, we provide a shell script (`bakta-podman.sh`) handling Podman related parameters (volume mounting, user IDs, etc):\n\n```bash\nbakta-podman.sh --db \u003cdb-path\u003e --output \u003coutput-path\u003e \u003cinput\u003e\n```\n\nFor experienced users and full functionality (`bakta_db` \u0026 `bakta_proteins`), an image without entrypoint might be a better option. For these cases, please use one of the [Biocontainer](https://quay.io/repository/biocontainers/bakta?tab=tags) images:\n\n```bash\nexport CONTAINER=\"quay.io/biocontainers/bakta:1.8.2--pyhdfd78af_0\"\npodman run -it --rm $CONTAINER bakta --help\npodman run -it --rm $CONTAINER bakta_db --help\n```\n\n### Pip\n\n```bash\npython3 -m pip install --user bakta\n```\n\nBakta requires the following 3rd party software tools which must be installed and executable to use the full set of features:\n\n- tRNAscan-SE (2.0.11) \u003chttps://doi.org/10.1101/614032\u003e \u003chttp://lowelab.ucsc.edu/tRNAscan-SE\u003e\n- Aragorn (1.2.41) \u003chttp://dx.doi.org/10.1093/nar/gkh152\u003e \u003chttp://130.235.244.92/ARAGORN\u003e\n- INFERNAL (1.1.4) \u003chttps://dx.doi.org/10.1093%2Fbioinformatics%2Fbtt509\u003e \u003chttp://eddylab.org/infernal\u003e\n- PILER-CR (1.06) \u003chttps://doi.org/10.1186/1471-2105-8-18\u003e \u003chttp://www.drive5.com/pilercr\u003e\n- Pyrodigal (3.5.0) \u003chttps://doi.org/10.21105/joss.04296\u003e \u003chttps://github.com/althonos/pyrodigal\u003e\n- PyHMMER (0.10.15) \u003chttps://doi.org/10.21105/joss.04296\u003e \u003chttps://github.com/althonos/pyhmmer\u003e\n- Diamond (2.1.10) \u003chttps://doi.org/10.1038/nmeth.3176\u003e \u003chttps://github.com/bbuchfink/diamond\u003e\n- Blast+ (2.14.0) \u003chttps://www.ncbi.nlm.nih.gov/pubmed/2231712\u003e \u003chttps://blast.ncbi.nlm.nih.gov\u003e\n- AMRFinderPlus (4.0.3) \u003chttps://github.com/ncbi/amr\u003e\n- pyCirclize (1.7.0) https://github.com/moshi4/pyCirclize\n\n### Database download\n\nBakta requires a mandatory 
database which is publicly hosted at Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4247252.svg)](https://doi.org/10.5281/zenodo.4247252)\nWe provide 2 types: `full` and `light`. To get best annotation results and to use all features, we recommend using the `full` type (default). If you seek maximum runtime performance or if download time/storage requirements are an issue, please try the `light` version. Further information is provided in the [database](#database) section below.\n\nList available DB versions (available as either `full` or `light`):\n\n```bash\nbakta_db list\n...\n```\n\nTo download the most recent compatible database version we recommend using the internal database download \u0026 setup tool:\n\n```bash\nbakta_db download --output \u003coutput-path\u003e --type [light|full]\n```\n\nOf course, the database can also be downloaded and installed manually:\n\n```bash\nwget https://zenodo.org/record/14916843/files/db-light.tar.xz\nbakta_db install -i db-light.tar.xz\n```\n\nIf required, or desired, the AMRFinderPlus DB can also be updated manually:\n\n```bash\namrfinder_update --force_update --database db-light/amrfinderplus-db/\n```\n\nIf you're using bakta on Docker:\n\n```bash\ndocker run -v /path/to/desired-db-path:/db --entrypoint /bin/bash oschwengers/bakta:latest -c \"bakta_db download --output /db --type [light|full]\"\n```\n\nAs an additional data repository backup, we provide the most recent database version via our institute servers: [full](https://s3.computational.bio.uni-giessen.de/bakta-db/db-v6.0.tar.xz), [light](https://s3.computational.bio.uni-giessen.de/bakta-db/db-light-v6.0.tar.xz). However, the bandwidth is limited. 
Hence, please use it with caution and only if Zenodo might be temporarily unreachable or slow.\n\nUpdate an existing database:\n\n```bash\nbakta_db update --db \u003cexisting-db-path\u003e [--tmp-dir \u003ctmp-directory\u003e]\n```\n\nUpdate using Docker:\n\n```bash\ndocker run -v /path/to/desired-db-path:/db --entrypoint /bin/bash oschwengers/bakta:latest -c \"bakta_db update --db /db/db-[light|full]\"\n```\n\nThe database path can be provided either via parameter (`--db`) or environment variable (`BAKTA_DB`):\n\n```bash\nbakta --db \u003cdb-path\u003e genome.fasta\n\nexport BAKTA_DB=\u003cdb-path\u003e\nbakta genome.fasta\n```\n\nFor system-wide setups, the database can also be copied to the Bakta base directory:\n\n```bash\ncp -r db/ \u003cbakta-installation-dir\u003e\n```\n\nAs Bakta takes advantage of AMRFinderPlus for the annotation of AMR genes, AMRFinderPlus is required to set up its own internal databases in a `\u003camrfinderplus-db\u003e` subfolder within the Bakta database `\u003cdb-path\u003e`, once via `amrfinder_update --force_update --database \u003cdb-path\u003e/amrfinderplus-db/`. To ease this process we recommend using Bakta's internal download procedure.\n\n## Examples\n\nSimple:\n\n```bash\nbakta --db \u003cdb-path\u003e genome.fasta\n```\n\nExpert: verbose output writing results to *results* directory with *ecoli123* file `prefix` and *eco634* `locus tag` using an existing prodigal training file, using additional replicon information and 8 threads:\n\n```bash\nbakta --db \u003cdb-path\u003e --verbose --output results/ --prefix ecoli123 --locus-tag eco634 --prodigal-tf eco.tf --replicons replicon.tsv --threads 8 genome.fasta\n```\n\n## Input and Output\n\n### Input\n\nBakta accepts bacterial genomes and plasmids (complete / draft assemblies) in (zipped) fasta format. 
For a full description of how further genome information can be provided and workflow customizations can be set, please have a look at the [Usage](#usage) section or this [manual](https://bakta.readthedocs.io/).\n\n#### Replicon meta data table\n\nTo fine-tune the very details of each sequence in the input fasta file, Bakta accepts a replicon meta data table provided in `csv` or `tsv` file format: `--replicons \u003cfile.tsv\u003e`. Thus, complete replicons within partially completed draft assemblies can be marked \u0026 handled as such, *e.g.* detection \u0026 annotation of features spanning sequence edges.\n\nTable format:\n\noriginal sequence id  |  new sequence id  |  type  |  topology  |  name\n----|----------------|----------------|----------------|----------------\n`old id` | `new id`, `\u003cempty\u003e` | `chromosome`, `plasmid`, `contig`, `\u003cempty\u003e` | `circular`, `linear`, `\u003cempty\u003e` | `name`, `\u003cempty\u003e`\n\nFor each input sequence recognized via the `original sequence id`, a `new sequence id`, the replicon `type` and the `topology` as well as a `name` can be explicitly set.\n\nShortcuts:\n\n- `chromosome`: `c`\n- `plasmid`: `p`\n- `circular`: `c`\n- `linear`: `l`\n\n`\u003cempty\u003e` values (`-` / ``) will be replaced by defaults. If **new sequence id** is `empty`, a new contig name will be autogenerated.\n\nDefaults:\n\n- type: `contig`\n- topology: `linear`\n\nExample:\n\noriginal sequence id  |  new sequence id  |  type  |  topology  |  name\n----|----------------|----------------|----------------|----------------\nNODE_1 | chrom | `chromosome` | `circular` | `-`\nNODE_2 | p1 | `plasmid` | `c` | `pXYZ1`\nNODE_3 | p2 | `p`  |  `c` | `pXYZ2`\nNODE_4 | special-contig-name-xyz |  `-` | `-` | `-`\nNODE_5 | `` |  `-` | `-` | `-`\n\n#### User-provided regions\n\nBakta accepts pre-annotated (*a priori*), user-provided feature regions via `--regions` in either GFF3 or GenBank format. 
These regions supersede all de novo-predicted regions, but are equally subject to the internal functional annotation process. Currently, only `CDS` are supported. A maximum overlap with *de novo*-predicted CDS of 30 bp is allowed. If you would like to provide custom functional annotations, you can provide these via `--proteins` which is described in the following section.\n\n#### User-provided protein sequences\n\nBakta accepts user-provided trusted protein sequences via `--proteins` in either GenBank (CDS features) or Fasta format which are used in the functional annotation process. Using the Fasta format, each reference sequence can be provided in a short or long format:\n\n```bash\n# short:\n\u003eid gene~~~product~~~dbxrefs\nMAQ...\n\n# long:\n\u003eid min_identity~~~min_query_cov~~~min_subject_cov~~~gene~~~product~~~dbxrefs\nMAQ...\n```\n\nAllowed values:\n\nfield  |  value(s)  |  example\n----|----------------|----------------\nmin_identity | `int`, `float` | 80, 90.3\nmin_query_cov | `int`, `float` | 80, 90.3\nmin_subject_cov | `int`, `float` | 80, 90.3\ngene | `\u003cempty\u003e`, `string` | msp\nproduct | `string` | my special protein\ndbxrefs | `\u003cempty\u003e`, `db:id`, `,` separated list  | `VFDB:VF0511`\n\nProtein sequences provided in short Fasta or GenBank format are searched with default thresholds of 90%, 80% and 80% for minimal identity, query and subject coverage, respectively.\n\n#### User-provided HMMs\n\nBakta accepts user-provided trusted HMMs via `--hmms` in HMMER's text format. If set, Bakta will adhere to the *trusted cutoff* specified in the HMM header. In addition, a max. evalue threshold of 1e-6 is applied. By default, Bakta uses the HMM description line as a product description. 
Further information can be provided via the HMM description line using the *short* format as explained above in the [User-provided protein sequences](#user-provided-protein-sequences) section.\n\n```bash\n# default\nHMMER3/f [3.1b2 | February 2015]\nNAME  id\nACC   id\nDESC  product\nLENG  435\nTC    600 600\n\n# short\nNAME  id\nACC   id\nDESC  gene~~~product~~~dbxrefs\nLENG  435\nTC    600 600\n```\n\n### Output\n\nAnnotation results are provided in standard bioinformatics file formats:\n\n- `\u003cprefix\u003e.tsv`: annotations as simple human readable TSV\n- `\u003cprefix\u003e.gff3`: annotations \u0026 sequences in GFF3 format\n- `\u003cprefix\u003e.gbff`: annotations \u0026 sequences in (multi) GenBank format\n- `\u003cprefix\u003e.embl`: annotations \u0026 sequences in (multi) EMBL format\n- `\u003cprefix\u003e.fna`: replicon/contig DNA sequences as FASTA\n- `\u003cprefix\u003e.ffn`: feature nucleotide sequences as FASTA\n- `\u003cprefix\u003e.faa`: CDS/sORF amino acid sequences as FASTA\n- `\u003cprefix\u003e.inference.tsv`: inference metrics (score, evalue, coverage, identity) for annotated accessions as TSV\n- `\u003cprefix\u003e.hypotheticals.tsv`: further information on hypothetical protein CDS as simple human readable tab separated values\n- `\u003cprefix\u003e.hypotheticals.faa`: hypothetical protein CDS amino acid sequences as FASTA\n- `\u003cprefix\u003e.txt`: summary as TXT\n- `\u003cprefix\u003e.png`: circular genome annotation plot as PNG\n- `\u003cprefix\u003e.svg`: circular genome annotation plot as SVG\n- `\u003cprefix\u003e.json`: all (internal) annotation \u0026 sequence information as JSON\n\nThe `\u003cprefix\u003e` can be set via `--prefix \u003cprefix\u003e`. 
If no prefix is set, Bakta uses the input file prefix.\n\nOf note, Bakta provides all detailed (internal) information on each annotated feature in a standardized machine-readable JSON file `\u003cprefix\u003e.json`:\n\n```json\n{\n    \"genome\": {\n        \"genus\": \"Escherichia\",\n        \"species\": \"coli\",\n        ...\n    },\n    \"stats\": {\n        \"size\": 5594605,\n        \"gc\": 0.497,\n        ...\n    },\n    \"features\": [\n        {\n            \"type\": \"cds\",\n            \"contig\": \"contig_1\",\n            \"start\": 971,\n            \"stop\": 1351,\n            \"strand\": \"-\",\n            \"gene\": \"lsoB\",\n            \"product\": \"type II toxin-antitoxin system antitoxin LsoB\",\n            ...\n        },\n        ...\n    ],\n    \"sequences\": [\n        {\n            \"id\": \"c1\",\n            \"description\": \"[organism=Escherichia coli] [completeness=complete] [topology=circular]\",\n            \"nt\": \"AGCTTT...\",\n            \"length\": 5498578,\n            \"complete\": true,\n            \"type\": \"chromosome\",\n            \"topology\": \"circular\"\n            ...\n        },\n        ...\n    ]\n}\n```\n\nBakta provides a helper function to create above mentioned output files from the (GNU-zipped) *JSON* result file, thus helping potential long-term or large-scale annotation projects to reduce overall storage requirements.\n\n```bash\nbakta_io --output \u003coutput-path\u003e --prefix \u003cprefix\u003e result.json.gz\n\nbakta_io --help\n```\n\nExemplary annotation result files for several genomes (mostly ESKAPE species) are hosted at Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4770026.svg)](https://doi.org/10.5281/zenodo.4770026)\n\n## Usage\n\n```bash\nusage: bakta [--db DB] [--min-contig-length MIN_CONTIG_LENGTH] [--prefix PREFIX] [--output OUTPUT] [--force]\n             [--genus GENUS] [--species SPECIES] [--strain STRAIN] [--plasmid PLASMID]\n             [--complete] 
[--prodigal-tf PRODIGAL_TF] [--translation-table {11,4,25}] [--gram {+,-,?}]\n             [--locus LOCUS] [--locus-tag LOCUS_TAG] [--locus-tag-increment {1,5,10}] [--keep-contig-headers] [--compliant]\n             [--replicons REPLICONS] [--regions REGIONS] [--proteins PROTEINS] [--hmms HMMS] [--meta]\n             [--skip-trna] [--skip-tmrna] [--skip-rrna] [--skip-ncrna] [--skip-ncrna-region]\n             [--skip-crispr] [--skip-cds] [--skip-pseudo] [--skip-sorf] [--skip-gap] [--skip-ori] [--skip-filter] [--skip-plot]\n             [--help] [--verbose] [--debug] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]\n             \u003cgenome\u003e\n\nRapid \u0026 standardized annotation of bacterial genomes, MAGs \u0026 plasmids\n\npositional arguments:\n  \u003cgenome\u003e              Genome sequences in (zipped) fasta format\n\nInput / Output:\n  --db DB, -d DB        Database path (default = \u003cbakta_path\u003e/db). Can also be provided as BAKTA_DB environment variable.\n  --min-contig-length MIN_CONTIG_LENGTH, -m MIN_CONTIG_LENGTH\n                        Minimum contig/sequence size (default = 1; 200 in compliant mode)\n  --prefix PREFIX, -p PREFIX\n                        Prefix for output files\n  --output OUTPUT, -o OUTPUT\n                        Output directory (default = current working directory)\n  --force, -f           Force overwriting existing output folder (except for current working directory)\n\nOrganism:\n  --genus GENUS         Genus name\n  --species SPECIES     Species name\n  --strain STRAIN       Strain name\n  --plasmid PLASMID     Plasmid name\n\nAnnotation:\n  --complete            All sequences are complete replicons (chromosome/plasmid[s])\n  --prodigal-tf PRODIGAL_TF\n                        Path to existing Prodigal training file to use for CDS prediction\n  --translation-table {11,4,25}\n                        Translation table: 11/4/25 (default = 11)\n  --gram {+,-,?}        Gram type for signal peptide predictions: +/-/? 
(default = ?)\n  --locus LOCUS         Locus prefix (default = 'contig')\n  --locus-tag LOCUS_TAG\n                        Locus tag prefix (default = autogenerated)\n  --locus-tag-increment {1,5,10}\n                        Locus tag increment: 1/5/10 (default = 1)\n\n  --keep-contig-headers\n                        Keep original contig/sequence headers\n  --compliant           Force Genbank/ENA/DDJB compliance\n  --replicons REPLICONS, -r REPLICONS\n                        Replicon information table (tsv/csv)\n  --regions REGIONS     Path to pre-annotated regions in GFF3 or Genbank format (regions only, no functional annotations).\n  --proteins PROTEINS   Fasta file of trusted protein sequences for CDS annotation\n  --hmms HMMS           HMM file of trusted hidden markov models in HMMER format for CDS annotation\n  --meta                Run in metagenome mode. This only affects CDS prediction.\n\nWorkflow:\n  --skip-trna           Skip tRNA detection \u0026 annotation\n  --skip-tmrna          Skip tmRNA detection \u0026 annotation\n  --skip-rrna           Skip rRNA detection \u0026 annotation\n  --skip-ncrna          Skip ncRNA detection \u0026 annotation\n  --skip-ncrna-region   Skip ncRNA region detection \u0026 annotation\n  --skip-crispr         Skip CRISPR array detection \u0026 annotation\n  --skip-cds            Skip CDS detection \u0026 annotation\n  --skip-pseudo         Skip pseudogene detection \u0026 annotation\n  --skip-sorf           Skip sORF detection \u0026 annotation\n  --skip-gap            Skip gap detection \u0026 annotation\n  --skip-ori            Skip oriC/oriT detection \u0026 annotation\n  --skip-filter         Skip feature overlap filters\n  --skip-plot           Skip generation of circular genome plots\n\nGeneral:\n  --help, -h            Show this help message and exit\n  --verbose, -v         Print verbose information\n  --debug               Run Bakta in debug mode. 
Temp data will not be removed.\n  --threads THREADS, -t THREADS\n                        Number of threads to use (default = number of available CPUs)\n  --tmp-dir TMP_DIR     Location for temporary files (default = system dependent auto detection)\n  --version             show program's version number and exit\n```\n\n## Annotation Workflow\n\n### RNAs\n\n1. tRNA genes: tRNAscan-SE 2.0\n2. tmRNA genes: Aragorn\n3. rRNA genes: Infernal vs. Rfam rRNA covariance models\n4. ncRNA genes: Infernal vs. Rfam ncRNA covariance models\n5. ncRNA cis-regulatory regions: Infernal vs. Rfam ncRNA covariance models\n6. CRISPR arrays: PILER-CR\n\nBakta distinguishes ncRNA genes and (cis-regulatory) regions in order to enable the distinct handling thereof during the annotation process, *i.e.* feature overlap detection.\n\nncRNA gene types:\n\n- sRNA\n- antisense\n- ribozyme\n- antitoxin\n\nncRNA (cis-regulatory) region types:\n\n- riboswitch\n- thermoregulator\n- leader\n- frameshift element\n\n### Coding sequences\n\nThe structural prediction is conducted via Pyrodigal and complemented by a custom detection of sORF \u003c 30 aa. 
In addition, superseding regions of pre-predicted CDS can be provided via `--regions`.\n\nTo rapidly identify known protein sequences with exact sequence matches and to conduct comprehensive annotations, Bakta utilizes a compact read-only SQLite database comprising protein sequence digests and pre-assigned annotations for millions of known protein sequences and clusters.\n\nConceptual terms:\n\n- **UPS**: unique protein sequences identified via length and MD5 hash digests (100% coverage \u0026 100% sequence identity)\n- **IPS**: identical protein sequences comprising seeds of UniProt's UniRef100 protein sequence clusters\n- **PSC**: protein sequence clusters comprising seeds of UniProt's UniRef90 protein sequence clusters\n- **PSCC**: protein sequence clusters of clusters comprising annotations of UniProt's UniRef50 protein sequence clusters\n\n**CDS**:\n\n1. De novo-prediction via Pyrodigal respecting sequences' completeness (distinct prediction for complete replicons and uncompleted contigs)\n2. Discard spurious CDS via AntiFam\n3. Detect translational exceptions (selenocysteines)\n4. Import of superseding user-provided CDS regions (optional)\n5. Detection of UPSs via MD5 digests and lookup of related IPS and PSC\n6. Sequence alignments of remainder via Diamond vs. PSC (query/subject coverage=0.8, identity=0.5)\n7. Assignment to UniRef90 or UniRef50 clusters if alignment hits achieve identities larger than 0.9 or 0.5, respectively\n8. Execution of expert systems:\n   - AMR: AMRFinderPlus\n   - Expert proteins: NCBI BlastRules, VFDB\n   - User proteins (optionally via `--proteins \u003cFasta/GenBank\u003e`)\n9. Prediction of signal peptides (optionally via `--gram \u003c+/-\u003e`)\n10. Detection of pseudogenes:\n   1. Search for reference PSCs using `hypothetical` CDS as seed sequences\n   2. Translated alignment (blastx) of reference PSCs against up-/downstream-elongated CDS regions\n   3. 
Analysis of translated alignments and detection of pseudogenization causes \u0026 effects\n11. Combination of IPS, PSC, PSCC and expert system information favouring more specific annotations and avoiding redundancy\n\nCDS without IPS or PSC hits as well as those without gene symbols or product descriptions different from `hypothetical` will be marked as `hypothetical`.\n\nSuch hypothetical CDS are further analyzed:\n\n1. Detection of Pfam domains, repeats \u0026 motifs\n2. Calculation of protein sequence statistics, *i.e.* molecular weight, isoelectric point\n\n**sORFs**:\n\n1. Custom sORF detection \u0026 extraction with amino acid lengths \u003c 30 aa\n2. Apply strict feature type-dependent overlap filters\n3. Discard spurious sORF via AntiFam\n4. Detection of UPS via MD5 hashes and lookup of related IPS\n5. Sequence alignments of remainder via Diamond vs. an sORF subset of PSCs (coverage=0.9, identity=0.9)\n6. Exclude sORF without sufficient annotation information\n7. Prediction of signal peptides (optionally via `--gram \u003c+/-\u003e`)\n\nsORF not identified via IPS or PSC will be discarded. Additionally, all sORF without gene symbols or product descriptions different from `hypothetical` will be discarded.\nDue to the uncertain nature of sORF prediction, only those identified via IPS / PSC hits exhibiting proper gene symbols or product descriptions different from `hypothetical` will be included in the final annotation.\n\n### Miscellaneous\n\n1. Gaps: in-mem detection \u0026 annotation of sequence gaps\n2. oriC/oriV/oriT: Blast+ (cov=0.8, id=0.8) vs. [MOB-suite](https://github.com/phac-nml/mob-suite) oriT \u0026 [DoriC](http://tubic.org/doric/public/index.php) oriC/oriV sequences. Annotations of ori regions take into account overlapping Blast+ hits and are conducted based on a majority vote heuristic. 
Region edges are fuzzy - use with caution!\n\n## Database\n\nThe Bakta database comprises a set of AA \u0026 DNA sequence databases as well as HMM \u0026 covariance models.\nAt its core Bakta utilizes a compact read-only SQLite DB storing protein sequence digests, lengths, pre-assigned annotations and dbxrefs of UPS, IPS and PSC from:\n\n- **UPS**: UniParc / UniProtKB (350,631,327)\n- **IPS**: UniProt UniRef100 (330,865,009)\n- **PSC**: UniProt UniRef90 (135,274,518)\n- **PSCC**: UniProt UniRef50 (37,008,138)\n\nThis allows the exact identification of protein sequences via MD5 digests \u0026 sequence lengths as well as the rapid subsequent lookup of related information. Protein sequence digests are checked for hash collisions during the DB creation process. IPS \u0026 PSC have been comprehensively pre-annotated integrating annotations \u0026 database *dbxrefs* from:\n\n- NCBI nonredundant proteins (UPS: 290,693,966)\n- NCBI COG DB (PSC: 3,513,643)\n- KEGG Kofams (PSC: 24,267,514)\n- SwissProt EC/GO terms (PSC: 337,264)\n- NCBI NCBIfams (PSC: 21,758,901)\n- PHROG (PSC: 11,717)\n- NCBI AMRFinderPlus (IPS: 8,382)\n- ISFinder DB (IPS: 155,449, PSC: 14,481)\n- Pfam families (PSC: 659,781)\n\nTo provide high quality annotations for distinct protein sequences of high importance (AMR, VF, *etc*) which cannot sufficiently be covered by the IPS/PSC approach, Bakta provides additional expert systems. For instance, AMR genes are annotated via NCBI's AMRFinderPlus.\nAn expandable alignment-based expert system supports the incorporation of high quality annotations from multiple sources. This currently comprises NCBI's BlastRules as well as VFDB and will be complemented with more expert annotation sources over time. 
Internally, this expert system is based on a Diamond DB comprising the following information in a standardized format:\n\n- source: *e.g.* BlastRules\n- rank: a precedence rank\n- min identity\n- min query coverage\n- min model coverage\n- gene label\n- product description\n- dbxrefs\n\nRfam covariance models:\n\n- ncRNA: 779\n- ncRNA cis-regulatory regions: 288\n\nori sequences:\n\n- oriC/V: 6,690\n- oriT: 502\n\nTo provide FAIR annotations, the database releases are SemVer versioned (w/o patch level), *i.e.* `\u003cmajor\u003e.\u003cminor\u003e`. For each version we provide a comprehensive log file tracking all imported sequences as well as annotations thereof. The DB schema is represented by the `\u003cmajor\u003e` digit and automatically checked at runtime by Bakta in order to ensure compatibility. Content updates are tracked by the `\u003cminor\u003e` digit.\n\nAs this taxonomically untargeted database is fairly demanding in terms of storage consumption, we also provide a lightweight DB type providing all non-coding feature information but only PSCC information from UniRef50 clusters for CDS. If download bandwidths or storage requirements become an issue or if shorter runtimes are favored over more specific annotations, the `light` DB will do the job.\n\nLatest database version: 6.0\nDB types:\n\n- `light`: 1.3 Gb zipped, 3.9 Gb unzipped, MD5: 4a6e059ded39e9c5537ef4137d2f5648\n- `full`: 30 Gb zipped, 84 Gb unzipped, MD5: 4c1115e40abfa2b464ae5dd988bdd88e\n\nAll database releases are hosted at Zenodo: [![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14916843.svg)](https://doi.org/10.5281/zenodo.14916843)\n\n## Genome Submission\n\nMost genomes annotated with Bakta should be ready to submit to the INSDC member databases GenBank and ENA. As a first step, please register your BioProject (e.g. 
PRJNA123456) and your locus_tag prefix (*e.g.* ESAKAI).\n\n```bash\n# annotate your genome in `--compliant` mode:\n$ bakta --db \u003cdb-path\u003e -v --genus Escherichia --species \"coli O157:H7\" --strain Sakai --complete --compliant --locus-tag ESAKAI test/data/GCF_000008865.2.fna.gz\n```\n\n### GenBank\n\nGenomes are submitted to GenBank via Fasta (`.fna`) and SQN files. The `.sqn` files can be created from Bakta's `.gff3` files with NCBI's new [table2asn](https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/) tool.\nPlease have a look at the [documentation](https://www.ncbi.nlm.nih.gov/genbank/genomes_gff) and have all additional files (template.txt) prepared:\n\n```bash\n# download table2asn for Linux\n$ wget https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/linux64.table2asn.gz\n$ gunzip linux64.table2asn.gz\n\n# or MacOS\n$ wget https://ftp.ncbi.nlm.nih.gov/asn1-converters/by_program/table2asn/mac.table2asn.gz\n$ gunzip mac.table2asn.gz\n\n# make the unzipped binary executable (use mac.table2asn on MacOS)\n$ chmod 755 linux64.table2asn\n\n# create the SQN file:\n$ ./linux64.table2asn -Z -W -M n -J -c w -t template.txt -V vbt -l paired-ends -i GCF_000008865.2.fna -f GCF_000008865.2.gff3 -o GCF_000008865.2.sqn\n```\n\n### ENA\n\nGenomes are submitted to ENA as EMBL (`.embl`) files via EBI's [Webin-CLI](https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html) tool.\nPlease have all additional files (manifest.tsv, chrom-list.tsv) prepared as described [here](https://ena-docs.readthedocs.io/en/latest/submit/fileprep/assembly.html#flat-file).\n\n```bash\n# download ENA Webin-CLI\n$ wget https://github.com/enasequence/webin-cli/releases/download/8.1.0/webin-cli-8.1.0.jar\n\n$ gzip -k GCF_000008865.2.embl\n$ gzip -k chrom-list.tsv\n$ java -jar webin-cli-8.1.0.jar -submit -userName=\u003cLOGIN\u003e -password \u003cPWD\u003e -context genome -manifest manifest.tsv\n```\n\nExemplary manifest.tsv and chrom-list.tsv files might look like:\n\n```bash\n$ cat 
manifest.tsv\nSTUDY    PRJEB44484\nSAMPLE    ERS6291240\nASSEMBLYNAME    GCF\nASSEMBLY_TYPE    isolate\nCOVERAGE    100\nPROGRAM    SPAdes\nPLATFORM    Illumina\nMOLECULETYPE    genomic DNA\nFLATFILE    GCF_000008865.2.embl.gz\nCHROMOSOME_LIST    chrom-list.tsv.gz\n\n$ cat chrom-list.tsv\ncontig_1    contig_1    circular-chromosome\ncontig_2    contig_2    circular-plasmid\ncontig_3    contig_3    circular-plasmid\n```\n\n## Protein bulk annotation\n\nFor the direct bulk annotation of protein sequences aside from the genome, Bakta provides a dedicated CLI entry point `bakta_proteins`:\n\nExamples:\n\n```bash\nbakta_proteins --db \u003cdb-path\u003e input.fasta\n\nbakta_proteins --db \u003cdb-path\u003e --prefix test --output test --proteins special.faa --threads 8 input.fasta\n```\n\n### Output\n\nAnnotation results are provided in standard bioinformatics file formats:\n\n- `\u003cprefix\u003e.tsv`: annotations as simple human-readable TSV\n- `\u003cprefix\u003e.faa`: protein sequences as FASTA\n- `\u003cprefix\u003e.hypotheticals.tsv`: further information on hypothetical proteins as simple human-readable TSV\n- `\u003cprefix\u003e.json`: all (internal) annotation \u0026 sequence information as JSON\n\nThe `\u003cprefix\u003e` can be set via `--prefix \u003cprefix\u003e`. If no prefix is set, Bakta uses the input file prefix.\n\n### Usage\n\n```bash\nusage: bakta_proteins [--db DB] [--output OUTPUT] [--prefix PREFIX] [--force]\n                      [--proteins PROTEINS]\n                      [--help] [--verbose] [--debug] [--threads THREADS] [--tmp-dir TMP_DIR] [--version]\n                      \u003cinput\u003e\n\nRapid \u0026 standardized annotation of bacterial genomes, MAGs \u0026 plasmids\n\npositional arguments:\n  \u003cinput\u003e               Protein sequences in (zipped) fasta format\n\nInput / Output:\n  --db DB, -d DB        Database path (default = \u003cbakta_path\u003e/db). 
Can also be provided as BAKTA_DB environment variable.\n  --output OUTPUT, -o OUTPUT\n                        Output directory (default = current working directory)\n  --prefix PREFIX, -p PREFIX\n                        Prefix for output files\n  --force, -f           Force overwriting existing output folder\n\nAnnotation:\n  --proteins PROTEINS   Fasta file of trusted protein sequences for annotation\n\nGeneral:\n  --help, -h            Show this help message and exit\n  --verbose, -v         Print verbose information\n  --debug               Run Bakta in debug mode. Temp data will not be removed.\n  --threads THREADS, -t THREADS\n                        Number of threads to use (default = number of available CPUs)\n  --tmp-dir TMP_DIR     Location for temporary files (default = system dependent auto detection)\n  --version, -V         show program's version number and exit\n```\n\n## Genome plots\n\nBakta allows the creation of circular genome plots via [pyCirclize](https://github.com/moshi4/pyCirclize). Plots are generated as part of the default workflow and saved as `PNG` and `SVG` files. In addition to the default workflow, Bakta provides a dedicated CLI entry point `bakta_plot`:\n\nExamples:\n\n```bash\nbakta_plot input.json\n\nbakta_plot --output test --prefix test --config config.yaml --sequences 1,2 input.json\n```\n\nIt accepts the results of a former annotation process in JSON format and allows the selection of distinct sequences, either denoted by their `FASTA` identifiers or by their sequential numbers starting at 1. Colors for each feature type can be adapted via a simple configuration file in `YAML` format, *e.g.* [config.yaml](config.yaml). Currently, two default plot types are supported, *i.e.* `features` and `cog`. 
Examples for chromosomes and plasmids are provided [here](examples/).\n\n### Usage\n\n```bash\nusage: bakta_plot [--config CONFIG] [--output OUTPUT] [--prefix PREFIX]\n                  [--sequences SEQUENCES] [--type {features,cog}] [--label LABEL] [--size {4,8,16}] [--dpi {150,300,600}]\n                  [--help] [--verbose] [--debug] [--tmp-dir TMP_DIR] [--version]\n                  \u003cinput\u003e\n\nRapid \u0026 standardized annotation of bacterial genomes, MAGs \u0026 plasmids\n\npositional arguments:\n  \u003cinput\u003e               Bakta annotations in (zipped) JSON format\n\nInput / Output:\n  --config CONFIG, -c CONFIG\n                        Plotting configuration in YAML format\n  --output OUTPUT, -o OUTPUT\n                        Output directory (default = current working directory)\n  --prefix PREFIX, -p PREFIX\n                        Prefix for output files\n\nPlotting:\n  --sequences SEQUENCES\n                        Sequences to plot: comma separated number or name (default = all, numbers one-based)\n  --type {features,cog}\n                        Plot type: feature/cog (default = features)\n  --label LABEL         Plot center label (for line breaks use '|')\n  --size {4,8,16}       Plot size in inches: 4/8/16 (default = 8)\n  --dpi {150,300,600}   Plot resolution as dots per inch: 150/300/600 (default = 300)\n\nGeneral:\n  --help, -h            Show this help message and exit\n  --verbose, -v         Print verbose information\n  --debug               Run Bakta in debug mode. Temp data will not be removed.\n  --tmp-dir TMP_DIR     Location for temporary files (default = system dependent auto detection)\n  --version             show program's version number and exit\n```\n\n### Description\n\nCurrently, there are two types of plots: `features` (the default) and `cog`. 
In default mode (`features`), all features are plotted on two rings representing the forward and reverse strand from outer to inner, respectively, using the following feature colors:\n\n- CDS: `#cccccc`\n- tRNA/tmRNA: `#b2df8a`\n- rRNA: `#fb8072`\n- ncRNA: `#fdb462`\n- ncRNA-region: `#80b1d3`\n- CRISPR: `#bebada`\n- Gap: `#000000`\n- Misc: `#666666`\n\nIn the `cog` mode, all protein-coding genes (CDS) are colored according to their assigned COG functional categories. To better distinguish non-coding genes, these are plotted on an additional 3rd ring.\n\nIn addition, both plot types share two innermost GC content and GC skew rings. The first ring represents the GC content per sliding window over the entire sequence(s) in green (`#33a02c`) and red (`#e31a1c`) representing GC above and below average, respectively. The 2nd ring represents the GC skew in orange (`#fdbf6f`) and blue (`#1f78b4`). The GC skew gives hints on a replicon's replication bubble and hence, on the completeness of the assembly. On a complete \u0026 circular bacterial chromosome, you normally see two inflection points at the origin of replication and at its opposite region -\u003e [Wikipedia](https://en.wikipedia.org/wiki/GC_skew)\n\nCustom plot labels (text in the center) can be provided via `--label`:\n\n```bash\nbakta_plot --sequences 2 --dpi 300 --size 8 --prefix plot-cog-p2 --type cog --label=\"pO157|plasmid, 92.7 kbp\" input.json\n```\n\n![Plot example of Bakta test genome.](/examples/plot-cog-p2.png)\n\n## Auxiliary scripts\n\nOften, the usage of Bakta is a necessary upfront task followed by deeper analyses implemented in custom scripts. 
In [scripts](scripts) we'd like to collect \u0026 offer a pool of scripts addressing common tasks:\n\n- `collect-annotation-stats.py`: Collect annotation stats for a cohort of genomes and print a condensed `TSV`.\n- `extract-region.py`: Extract genome features within a given genomic range and export them as `GFF3`, `Embl`, `Genbank`, `FAA` and `FFN`\n\nOf course, pull requests are welcome ;-)\n\n## Web version\n\nFor further convenience, we developed an accompanying web application available at https://bakta.computational.bio.\n\nThis web application provides an interactive genome browser, aggregated feature counts and a searchable data table with detailed information on each predicted feature as well as dbxref-linked records to public databases.\n\nOf note, this web application can also be used to visualize annotation results produced offline with the command line version. For this purpose, the web application provides an offline viewer accepting JSON result files which are parsed and visualized locally within the browser without sending any data to the server.\n\n## Citation\n\nIf you use Bakta in your research, please cite this paper:\n\u003e Schwengers O., Jelonek L., Dieckmann M. A., Beyvers S., Blom J., Goesmann A. (2021). Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification. Microbial Genomics, 7(11). https://doi.org/10.1099/mgen.0.000685\n\nBakta is *standing on the shoulders of giants* taking advantage of many great software tools and databases. 
If you find any of these useful for your research, please cite these primary sources as well.\n\n### Tools\n\n- tRNAscan-SE 2.0 \u003chttps://doi.org/10.1093/nar/gkab688\u003e\n- Aragorn \u003chttps://doi.org/10.1093/nar/gkh152\u003e\n- Infernal \u003chttps://doi.org/10.1093/bioinformatics/btt509\u003e\n- PilerCR \u003chttps://doi.org/10.1186/1471-2105-8-18\u003e\n- Pyrodigal \u003chttps://doi.org/10.21105/joss.04296\u003e Prodigal \u003chttps://doi.org/10.1186/1471-2105-11-119\u003e\n- Diamond \u003chttps://doi.org/10.1038/s41592-021-01101-x\u003e\n- BLAST+ \u003chttps://doi.org/10.1186/1471-2105-10-421\u003e\n- PyHMMER \u003chttps://doi.org/10.21105/joss.04296\u003e HMMER \u003chttps://doi.org/10.1371/journal.pcbi.1002195\u003e\n- AMRFinderPlus \u003chttps://doi.org/10.1038/s41598-021-91456-0\u003e\n- pyCirclize \u003chttps://github.com/moshi4/pyCirclize\u003e\n\n### Databases\n\n- Rfam: \u003chttps://doi.org/10.1002/cpbi.51\u003e\n- Mob-suite: \u003chttps://doi.org/10.1099/mgen.0.000206\u003e\n- DoriC: \u003chttps://doi.org/10.1093/nar/gky1014\u003e\n- AntiFam: \u003chttps://doi.org/10.1093/database/bas003\u003e\n- UniProt: \u003chttps://doi.org/10.1093/nar/gky1049\u003e\n- RefSeq: \u003chttps://doi.org/10.1093/nar/gkx1068\u003e\n- COG: \u003chttps://doi.org/10.1093/bib/bbx117\u003e\n- KEGG: \u003chttps://doi.org/10.1093/bioinformatics/btz859\u003e\n- PHROG: \u003chttps://doi.org/10.1093/nargab/lqab067\u003e\n- AMRFinder: \u003chttps://doi.org/10.1128/AAC.00483-19\u003e\n- ISFinder: \u003chttps://doi.org/10.1093/nar/gkj014\u003e\n- Pfam: \u003chttps://doi.org/10.1093/nar/gky995\u003e\n- VFDB: \u003chttps://doi.org/10.1093/nar/gky1080\u003e\n\n## FAQ\n\n- **AMRFinder fails**\nIf AMRFinder constantly crashes even on fresh setups and Bakta's database was downloaded manually, then AMRFinder needs to set up its own internal database. This is required only once: `amrfinder_update --force_update --database \u003cbakta-db\u003e/amrfinderplus-db`. 
You could also try Bakta's internal database download logic, which automatically takes care of this: `bakta_db download --output \u003cbakta-db\u003e`\n\n- **DeepSig not found in Conda environment**\nFor the prediction of signal peptides, Bakta uses DeepSig, which is currently not available for MacOS and only up to Bakta v1.9.4. Therefore, we decided to exclude DeepSig from Bakta's default Conda dependencies because otherwise it would not be installable on MacOS systems. On Linux systems it can be installed via `conda install -c conda-forge -c bioconda python=3.8 deepsig`.\n\n- **Nice, but I'm missing XYZ...**\nBakta is quite new and we're keen to constantly improve it and further expand its feature set. In case there's anything missing, please do not hesitate to open an issue and ask for it!\n\n- **Bakta is running too long without CPU load... why?**\nBakta takes advantage of an SQLite DB which results in high storage IO loads. If this DB is stored on a remote / network volume, the lookup of IPS/PSC annotations might take a long time. In these cases, please consider moving the DB to a local volume or hard drive.\n\n## Issues and Feature Requests\n\nBakta is new and, as with any software, expect some bugs lurking around. So, if you run into any issues with Bakta, we'd be happy to hear about it.\nIn that case, please execute Bakta in debug mode (`--debug`) and do not hesitate to file an issue including as much information as possible:\n\n- a detailed description of the issue\n- command line output\n- log file (`\u003cprefix\u003e.log`)\n- result file (`\u003cprefix\u003e.json`) *if possible*\n- a reproducible example of the issue with an input file that you can share *if possible*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foschwengers%2Fbakta","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foschwengers%2Fbakta","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foschwengers%2Fbakta/lists"}