{"id":13751787,"url":"https://github.com/lh3/miniprot","last_synced_at":"2025-05-16T08:03:27.840Z","repository":{"id":60038463,"uuid":"521097490","full_name":"lh3/miniprot","owner":"lh3","description":"Align proteins to genomes with splicing and frameshift","archived":false,"fork":false,"pushed_at":"2025-04-18T20:19:53.000Z","size":450,"stargazers_count":357,"open_issues_count":5,"forks_count":19,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-19T08:21:47.021Z","etag":null,"topics":["bioinformatics","sequence-alignment"],"latest_commit_sha":null,"homepage":"https://lh3.github.io/miniprot/","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-08-04T02:30:58.000Z","updated_at":"2025-04-18T20:19:17.000Z","dependencies_parsed_at":"2024-02-15T17:34:58.076Z","dependency_job_id":"88870de6-1aed-4dc3-8ce4-8edea3950050","html_url":"https://github.com/lh3/miniprot","commit_stats":null,"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminiprot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminiprot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminiprot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminiprot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"h
ttps://codeload.github.com/lh3/miniprot/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254493381,"owners_count":22080126,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","sequence-alignment"],"created_at":"2024-08-03T09:00:54.693Z","updated_at":"2025-05-16T08:03:27.818Z","avatar_url":"https://github.com/lh3.png","language":"C","readme":"[![Release](https://img.shields.io/github/v/release/lh3/miniprot)](https://github.com/lh3/miniprot/releases)\n[![BioConda Install](https://img.shields.io/conda/dn/bioconda/miniprot.svg?style=flag\u0026label=BioConda%20install)](https://anaconda.org/bioconda/miniprot)\n[![Build Status](https://github.com/lh3/miniprot/actions/workflows/ci.yaml/badge.svg)](https://github.com/lh3/miniprot/actions)\n## \u003ca name=\"started\"\u003e\u003c/a\u003eGetting Started\n```sh\n# download and compile\ngit clone https://github.com/lh3/miniprot\ncd miniprot \u0026\u0026 make\n\n# test file\n./miniprot test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz \u003e aln.paf        # PAF output\n./miniprot --gff test/DPP3-hs.gen.fa.gz test/DPP3-mm.pep.fa.gz \u003e aln.gff  # GFF3+PAF output\n\n# general command line: index and align in one go (-I sets max intron size based on genome size)\n./miniprot -Iut16 --gff genome.fna protein.faa \u003e aln.gff\n\n# general command line: index first and then align (recommended)\n./miniprot -t16 -d genome.mpi genome.fna\n./miniprot -Iut16 --gff genome.mpi protein.faa \u003e aln.gff\n\n# output format\nman ./miniprot.1\n```\n\n## Table of Contents\n\n- 
[Getting Started](#started)\n- [Introduction](#intro)\n- [Users' Guide](#uguide)\n  - [Installation](#install)\n  - [Usage](#usage)\n  - [Algorithm overview](#algo)\n  - [Citing miniprot](#cite)\n- [Limitations](#limit)\n\n## \u003ca name=\"intro\"\u003e\u003c/a\u003eIntroduction\n\nMiniprot aligns a protein sequence against a genome with affine gap penalty,\nsplicing and frameshift. It is primarily intended for annotating protein-coding\ngenes in a new species using known genes from other species. Miniprot is\nsimilar to [GeneWise][genewise] and [Exonerate][exonerate] in functionality but\nit can map proteins to whole genomes and is much faster at the residue\nalignment step.\n\nMiniprot is not optimized for mapping distant homologs because distant homologs\nare less informative to gene annotations. Nonetheless, it is still possible to\ntune seeding parameters to achieve higher sensitivity at the cost of\nperformance.\n\n## \u003ca name=\"uguide\"\u003e\u003c/a\u003eUsers' Guide\n\n### \u003ca name=\"install\"\u003e\u003c/a\u003eInstallation\n\nMiniprot requires SSE2 or NEON instructions and only works on x86\\_64 or ARM\nCPUs. It depends on [zlib][zlib] for parsing gzip'd input files. To compile\nminiprot, type `make` in the source code directory. This will produce a\nstandalone executable `miniprot`. This executable is all you need to invoke\nminiprot.\n\nFor some unknown reason, the default gcc-4.8.5 on CentOS 7 may compile a binary\nthat is very slow on certain sequences but gcc-10.3.0 has more stable\nperformance. If possible, use a more recent gcc to compile miniprot.\n\n### \u003ca name=\"usage\"\u003e\u003c/a\u003eUsage\n\nTo run miniprot, use\n```sh\nminiprot -t8 ref-file protein.faa \u003e output.paf\n```\nwhere `ref-file` can either be a genome in the FASTA format or a pre-built\nindex generated by\n```sh\nminiprot -t8 -d ref.mpi ref.fna\n```\nBecause miniprot indexing is slow and memory intensive, it is recommended to\npre-build the index. 
FASTA input files can be optionally compressed with gzip.\n\nMiniprot outputs alignments in the protein PAF format. Different from the more\ncommon nucleotide PAF format, miniprot uses more CIGAR operators to encode\nintrons and frameshifts. Please refer to the [manpage][manpage] for a detailed explanation.\n\nFor convenience, miniprot can also output GFF3 with option `--gff`:\n```sh\nminiprot -t8 --gff ref.mpi protein.faa \u003e out.gff\n```\nThe detailed alignment is embedded in `##PAF` lines in the GFF3 output. You can\nalso get detailed residue alignment with `--aln`.\n\nIf you are aligning proteins to a whole genome, it is recommended to add option\n`-I` to let miniprot automatically set the maximum intron size. You can also\nuse `-G` to explicitly specify the max intron size.\n\n### \u003ca name=\"algo\"\u003e\u003c/a\u003eAlgorithm overview\n\n1. Translate the reference genome to amino acids in six phases and filter out\n   ORFs shorter than 45bp. Reduce 20 amino acids to 13 distinct integers and\n   extract random open syncmers of 6aa in length. By default, miniprot selects\n   20% of 6-mers on average. For a reduced 6-mer at reference position `x`,\n   keep the 6-mer and `floor(x/256)` in a dense hash table. This concludes the\n   indexing step.\n\n2. Given a protein sequence as query, extract 6-mer syncmers on the protein,\n   look up the index for seed matches and apply minimap2-like chaining. This\n   first round of chaining is approximate as the reference positions have been\n   binned during indexing.\n\n3. For each chain in step 2, redo seeding and chaining with sliding 5-mers from\n   both the reference and the protein in the original chain. Miniprot uses all\n   reduced 5-mers for this second round of chaining.\n\n4. Choose the top 100 (see `-N`) chains. Filter out anchors around potential\n   introns or long gaps. Perform striped dynamic programming between remaining\n   anchors and also extend from the first or last anchors. 
This gives the final\n   alignment.\n\n### \u003ca name=\"cite\"\u003e\u003c/a\u003eCiting miniprot\n\nIf you use miniprot, please cite:\n\n\u003e Li, H. (2023) Protein-to-genome alignment with miniprot. *Bioinformatics*, **39**, btad014 [[PMID: 36648328]][mp-pmid].\n\nThe preprint is available at\n[arXiv:2210.08052](https://arxiv.org/abs/2210.08052), which\nadditionally shows metrics on MetaEuk. Please note that the published paper\nevaluated miniprot-0.7. The latest version may report different numbers.\n\n## \u003ca name=\"limit\"\u003e\u003c/a\u003eLimitations\n\n* The initial conditions of dynamic programming are not technically correct,\n  which may result in suboptimal residue alignment in rare cases.\n\n* Support for non-splicing alignment needs to be improved.\n\n* More manual inspection required for improved accuracy. For example, tandem\n  copies in segmental duplications could be handled more carefully.\n\n[exonerate]: https://pubmed.ncbi.nlm.nih.gov/15713233/\n[genewise]: https://pubmed.ncbi.nlm.nih.gov/15123596/\n[mp-pmid]: https://pubmed.ncbi.nlm.nih.gov/36648328/\n[zlib]: https://zlib.net\n[paftools]: https://github.com/lh3/minimap2/blob/master/misc/paftools.js\n[minimap2]: https://github.com/lh3/minimap2\n[spaln]: https://github.com/ogotoh/spaln\n[spaln2]: https://pubmed.ncbi.nlm.nih.gov/22848105/\n[manpage]: https://lh3.github.io/miniprot/miniprot.html\n","funding_links":[],"categories":["Ranked by starred repositories"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fminiprot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fminiprot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fminiprot/lists"}