{"id":47604513,"url":"https://github.com/tomdstanton/pyfgs","last_synced_at":"2026-05-27T03:06:52.279Z","repository":{"id":344737796,"uuid":"1182766417","full_name":"tomdstanton/pyfgs","owner":"tomdstanton","description":"🔗🐍⏭️  PyO3 bindings and Python interface to FragGeneScanRs, a gene prediction model for short and error-prone reads","archived":false,"fork":false,"pushed_at":"2026-04-22T03:41:31.000Z","size":1740,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-22T05:39:41.823Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://tomdstanton.github.io/pyfgs/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tomdstanton.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-15T23:48:02.000Z","updated_at":"2026-04-22T03:39:39.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tomdstanton/pyfgs","commit_stats":null,"previous_names":["tomdstanton/pyfgs"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/tomdstanton/pyfgs","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomdstanton%2Fpyfgs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomdstanton%2Fpyfgs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomdstanton%2Fpyfgs/releases","m
anifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomdstanton%2Fpyfgs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tomdstanton","download_url":"https://codeload.github.com/tomdstanton/pyfgs/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tomdstanton%2Fpyfgs/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33548311,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-27T02:00:06.184Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-01T19:03:56.366Z","updated_at":"2026-05-27T03:06:52.272Z","avatar_url":"https://github.com/tomdstanton.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 🔗🐍⏭️ `pyfgs` [![Stars](https://img.shields.io/github/stars/tomdstanton/pyfgs.svg?style=social\u0026maxAge=3600\u0026label=Star)](https://github.com/tomdstanton/pyfgs/stargazers)\n\n*PyO3 bindings and Python interface to [FragGeneScanRs](https://github.com/unipept/FragGeneScanRs),\na gene prediction model for short and error-prone 
reads.*\n\n[![Release](https://img.shields.io/github/v/release/tomdstanton/pyfgs?style=flat-square)](https://img.shields.io/github/v/release/tomdstanton/pyfgs)\n[![License](https://img.shields.io/github/license/tomdstanton/pyfgs?style=flat-square)](https://img.shields.io/github/license/tomdstanton/pyfgs)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.19059429.svg?style=flat-square)](https://doi.org/10.5281/zenodo.19059429)\n[![PyPI](https://img.shields.io/pypi/v/pyfgs.svg?style=flat-square\u0026maxAge=3600\u0026logo=PyPI)](https://pypi.org/project/pyfgs)\n[![Wheel](https://img.shields.io/pypi/wheel/pyfgs.svg?style=flat-square\u0026maxAge=3600)](https://pypi.org/project/pyfgs/#files)\n[![Python Versions](https://img.shields.io/pypi/pyversions/pyfgs.svg?style=flat-square\u0026maxAge=600\u0026logo=python)](https://pypi.org/project/pyfgs/#files)\n[![Python Implementations](https://img.shields.io/pypi/implementation/pyfgs.svg?style=flat-square\u0026maxAge=600\u0026label=impl)](https://pypi.org/project/pyfgs/#files)\n[![Source](https://img.shields.io/badge/source-GitHub-303030.svg?maxAge=2678400\u0026style=flat-square)](https://github.com/tomdstanton/pyfgs/)\n[![Issues](https://img.shields.io/github/issues/tomdstanton/pyfgs.svg?style=flat-square\u0026maxAge=600)](https://github.com/tomdstanton/pyfgs/issues)\n[![Downloads](https://img.shields.io/pypi/dm/pyfgs?style=flat-square\u0026color=303f9f\u0026maxAge=86400\u0026label=downloads)](https://pepy.tech/project/pyfgs)\n\n\n##  Why `pyfgs`?\n\n**Built for noisy data**\n\nStandard ab initio predictors (like Prodigal or Pyrodigal) are fantastic for pristine, fully assembled contigs.\nHowever, they struggle with raw metagenomic reads or error-prone assemblies because they immediately break the open\nreading frame at the first sign of an indel. 
`pyfgs` uses an error-tolerant Hidden Markov Model trained on specific\nsequencing profiles (Illumina, 454, Sanger) to power through these sequencing errors, dynamically correct the reading\nframe, and salvage the translated protein.\n\n**Native frameshift tracking**\n\nInstead of just silently stitching broken genes together, `pyfgs` exposes the exact coordinates of every hallucinated or\nskipped base directly to Python. This allows you to rigorously track structural variants, correctly annotate\nINSDC-compliant pseudogenes, or export exact frameshift coordinates for downstream quality control.\n\n**No subprocess I/O tax**\n\nRunning standard CLI bioinformatics tools from Python usually requires a heavy I/O penalty: dumping sequences to a\ntemporary FASTA file, firing a subprocess, and parsing the text outputs back into memory. `pyfgs` binds directly to the\nunderlying Rust engine. The HMM runs entirely in memory and yields native Python objects ready for immediate analysis.\n\n**True multithreading and zero-copy memory**\n\n`pyfgs` is designed to process massive datasets efficiently:\n\n- GIL-Free Inference: The Rust backend completely releases the Python Global Interpreter Lock (GIL) during the heavy\n  HMM math. You can drop the predictor into a standard ThreadPoolExecutor and achieve true parallel processing across\n  all your CPU cores.\n\n- Zero-Copy Bytes: The engine borrows raw byte slices `(\u0026[u8])` directly from Python's memory, bypassing the overhead of\n  copying strings between languages.\n\n- Lazy Translation: Translating DNA to amino acids is computationally expensive. `pyfgs` evaluates sequence strings\n  lazily, meaning you only pay the CPU and memory cost of string allocation if you explicitly request the sequence data.\n\n**A Pythonic API**\n\nBioinformatics coordinates are notoriously messy. 
`pyfgs` outputs standard 0-based, half-open intervals ([start, end)),\nallowing you to slice sequence arrays immediately without wrestling with 1-based GFF3 coordinate math. When you do need\nstandardized files, it includes heavily optimized, native-Rust context managers to stream perfectly compliant VCF, BED,\nGFF3, and FASTA files directly to disk without bloating your RAM.\n\n\n## 🔧 Installing\n\nThis project is supported on Python 3.10 and later.\n\n`pyfgs` can be installed directly from [PyPI](https://pypi.org/project/pyfgs/):\n\n```console\npip install pyfgs\n```\n\n⚡️ Power users ⚡️ can force your local machine to compile the Rust engine specifically for your own CPU by running:\n\n```console\nRUSTFLAGS=\"-C target-cpu=native\" pip install --no-binary pyfgs pyfgs\n```\n\n\n## 💻 Usage\n\n### API Usage\nFor full API usage, please refer to the [documentation](https://tomdstanton.github.io/pyfgs/api/).\n\n```python\nimport concurrent.futures\nimport pyfgs\nfrom Bio.Seq import Seq\nfrom Bio.SeqRecord import SeqRecord\nfrom Bio.SeqFeature import SeqFeature, FeatureLocation\nfrom Bio import SeqIO\n\ndef main():\n    # 1. Initialize the GeneFinder\n    # Set whole_genome=False to force the HMM to hunt for frameshifts.\n    finder = pyfgs.GeneFinder(pyfgs.Model.Complete, whole_genome=False)\n\n    # 2. Parse the genome into memory \n    # (Safe for assemblies! For massive raw read FASTQs, use an itertools chunker instead)\n    contigs = list(pyfgs.FastaReader(\"bacterial_assembly.fasta\"))\n    seqs = [seq for _, seq in contigs]\n    \n    # 3. Process concurrently\n    # The GIL is released, and map perfectly preserves our sequence order!\n    with concurrent.futures.ThreadPoolExecutor() as executor:\n        all_genes = list(executor.map(finder.find_genes, seqs))\n\n    # 4. 
Format into INSDC-compliant GenBank records\n    records = []\n    for (header_bytes, seq_bytes), genes in zip(contigs, all_genes):\n        header_str = header_bytes.decode('utf-8')\n        record = SeqRecord(\n            Seq(seq_bytes.decode('utf-8')), \n            id=header_str, \n            name=header_str, \n            description=\"Annotated by pyfgs\"\n        )\n        \n        for i, gene in enumerate(genes):\n            # Query the Rust backend for structural variants\n            mutations = gene.mutations(seq_bytes)\n            \n            # INSDC Standard: Frameshifted ORFs cannot be 'CDS', must be 'pseudogene'\n            feature_type = \"pseudogene\" if mutations else \"CDS\"\n            \n            qualifiers = {\n                \"source\": \"pyfgs\",\n                \"inference\": \"ab initio prediction:pyfgs\",\n                \"ID\": f\"{header_str}_FGS_{i+1}\"\n            }\n            \n            if mutations:\n                qualifiers[\"pseudogene\"] = [\"unknown\"]\n                qualifiers[\"note\"] = [\n                    f\"Frameshift {'insertion' if mut.mut_type == 'ins' else 'deletion'} \"\n                    f\"at pos {mut.pos} (codon {mut.codon_idx}). {mut.annotation}\"\n                    for mut in mutations\n                ]\n            else:\n                # Only strictly intact CDS features receive a translation qualifier\n                qualifiers[\"translation\"] = [gene.translation().decode('utf-8')]\n            \n            # Biopython's FeatureLocation is natively 0-based and half-open, \n            # mapping perfectly to our Gene.start and Gene.end!\n            location = FeatureLocation(gene.start, gene.end, strand=gene.strand)\n            feature = SeqFeature(location=location, type=feature_type, qualifiers=qualifiers)\n            record.features.append(feature)\n        \n        records.append(record)\n\n    # 5. 
Export to GenBank\n    SeqIO.write(records, \"annotated_genome.gbk\", \"genbank\")\n    print(f\"Successfully annotated {len(records)} contigs!\")\n\nif __name__ == \"__main__\":\n    main()\n```\n\n### CLI Usage\n\nFor CLI usage, type `pyfgs --help`\n\n```console\nusage: pyfgs \u003cseq\u003e [options]\n\n🔗🐍⏭️\tPyO3 bindings and Python interface to FragGeneScanRs,\n\ta gene prediction model for short and error-prone reads.\n\nInput options 💽:\n\n  seq                 Sequence file (or '-' for stdin)\n  -m, --model         Sequence error model (default: complete)\n                       - short1: Illumina sequencing reads with about 0.1% error rate\n                       - short5: Illumina sequencing reads with about 0.5% error rate\n                       - short10: Illumina sequencing reads with about 1% error rate\n                       - sanger5: Sanger sequencing reads with about 0.5% error rate\n                       - sanger10: Sanger sequencing reads with about 1% error rate\n                       - pyro5: 454 pyrosequencing reads with about 0.5% error rate\n                       - pyro10: 454 pyrosequencing reads with about 1% error rate\n                       - pyro30: 454 pyrosequencing reads with about 3% error rate\n                       - complete: Complete genomic sequences or short sequence reads without sequencing error\n  -r, --reads         Force FASTQ parsing (Overrides auto-detection)\n  -w, --whole-genome  Strict contiguous ORFs. 
Disables error-tolerant frameshift detection.\n\nOutput options ⚙️:\n  Provide a PATH to save to a file, or use the flag alone to print to stdout.\n\n  --faa [PATH]        Output protein FASTA\n  --fna [PATH]        Output nucleotide FASTA\n  --bed [PATH]        Output BED6+1 format\n  --gff [PATH]        Output GFF3 format\n  --vcf [PATH]        Output VCF v4.2 format\n\nOther options 🚧:\n\n  -t, --threads       Number of threads (default: optimal)\n  -v, --version       Print version and exit\n  -h, --help          Print help and exit\n```\n\n\n## Performance\n\n`pyfgs` is continuously benchmarked against NCBI RefSeq ground-truth datasets on every commit to `main` to ensure we never introduce performance regressions.\n\n`pyfgs` was benchmarked against `pyrodigal` (the excellent standard for Python-based gene prediction) to compare both raw inference speed and accuracy against NCBI RefSeq ground-truth annotations.\n\nBecause `pyfgs` is powered by pre-trained Hidden Markov Models in Rust, it does not need to perform an initial training scan over the sequence to calculate transition probabilities. This allows it to scale incredibly well on larger genomes and massive metagenomic datasets.\n\n### ⏱️ Speed: The \"No-Training\" Advantage\n\n**Test Conditions:** Pure inference time (excluding I/O) on an M-series Mac, using complete reference genomes. `pyrodigal` was run in single-genome mode (`meta=False`, requiring a training step), and `pyfgs` was run with `whole_genome=True`.\n\n| Organism | Genome Size | `pyrodigal` Time | `pyfgs` Time | Speedup |\n| :--- | :--- | :--- | :--- | :--- |\n| *S. aureus* (Low GC) | 2.8 Mb | 0.85s | **0.85s** | **1.0x** (Tie) |\n| *E. coli* (Standard) | 4.6 Mb | 2.26s | **1.42s** | **1.6x** |\n| *P. aeruginosa* (High GC)| 6.3 Mb | 3.30s | **1.37s** | **2.4x** |\n\n*Note: For complete genomes, `pyrodigal`'s dynamic programming engine must first scan the entire sequence to build a statistical model. 
`pyfgs` completely bypasses this upfront compute tax, resulting in massive time savings on larger genomes.*\n\n### 🎯 Accuracy \u0026 The `whole_genome` Trade-off\n\n`pyfgs` offers two distinct modes of operation, allowing you to choose between strict RefSeq-style conservative calling and highly sensitive frameshift-aware predictions.\n\n**1. Complete Genomes (`whole_genome=True`)**\nWhen working with pristine, high-quality assemblies, setting `whole_genome=True` forces the Viterbi algorithm to only traverse standard codon states.\n* **Result:** Blistering speed and highly conservative gene calls that closely mirror strict NCBI RefSeq annotations (~93-97% exact 3' stop codon matches).\n\n**2. Noisy Reads \u0026 Metagenomes (`whole_genome=False`)**\nWhen working with raw Oxford Nanopore reads, error-prone contigs, or complex metagenomes, setting `whole_genome=False` unlocks the true power of the `pyfgs` HMM.\n* **The Compute Tax:** The Rust backend activates \"Indel\" states, mathematically evaluating the probability of a frameshift insertion or deletion at *every single nucleotide*. This increases the compute time by ~30-50%.\n* **The Sensitivity Boost:** The algorithm becomes incredibly forgiving. It successfully rescues broken genes, pseudogenes, and fragmented ORFs that standard dynamic programming tools completely discard. This results in a higher number of overall predicted ORFs, ensuring you don't miss crucial biological signals hidden behind sequencing errors.\n\n## 🔖 Citation\n\nFor now, please cite the original\n[FragGeneScanRs paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04736-5):\n\n\u003e Van der Jeugt, F., Dawyndt, P. \u0026 Mesuere, B. FragGeneScanRs: faster gene prediction for short reads.\nBMC Bioinformatics 23, 198 (2022). https://doi.org/10.1186/s12859-022-04736-5\n\n\n## 💭 Feedback\n\n### ⚠️ Issue Tracker\n\nFound a bug ? Have an enhancement request ? 
Head over to the\n[GitHub issue tracker](https://github.com/tomdstanton/pyfgs/issues) if you need to report\nor ask something. If you are filing a bug, please include as much\ninformation as you can about the issue, and try to recreate the same bug\nin a simple, easily reproducible situation.\n\n### 🏗️ Contributing\n\nContributions are more than welcome! See\n[`CONTRIBUTING.md`](https://github.com/tomdstanton/pyfgs/blob/main/CONTRIBUTING.md)\nfor more details.\n\n## 📋 Changelog\n\nThis project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html)\nand provides a [changelog](https://github.com/tomdstanton/pyfgs/blob/main/CHANGELOG.md)\nin the [Keep a Changelog](http://keepachangelog.com/en/1.0.0/) format.\n\n\n## ⚖️ License\n\nThis library is provided under the [GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/).\nThe FragGeneScanRs code was written by [Peter Dawyndt](https://github.com/pdawyndt),\n[Bart Mesuere](https://github.com/bmesuere) and\n[Felix Van der Jeugt](https://github.com/ninewise) and is distributed under the\nterms of the GPLv3 as well. See `https://github.com/FragGeneScanRs/LICENSE` for more information.\n\n*This project is in no way affiliated, sponsored, or otherwise endorsed\nby the original FragGeneScanRs authors [Peter Dawyndt](https://github.com/pdawyndt),\n[Bart Mesuere](https://github.com/bmesuere) and\n[Felix Van der Jeugt](https://github.com/ninewise). It was developed\nby [Tom Stanton](https://github.com/tomdstanton/) during his Post-doc project\nat [Monash University](https://www.monash.edu/medicine/translational/infectious-diseases) in\nthe [Wyres Lab](https://wyreslab.com/).*\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomdstanton%2Fpyfgs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftomdstanton%2Fpyfgs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftomdstanton%2Fpyfgs/lists"}