{"id":13704809,"url":"https://github.com/lh3/minimap","last_synced_at":"2025-05-05T12:32:18.178Z","repository":{"id":139699901,"uuid":"41985726","full_name":"lh3/minimap","owner":"lh3","description":"This repo is DEPRECATED. Please use minimap2, the successor of minimap.","archived":true,"fork":false,"pushed_at":"2017-09-20T14:15:02.000Z","size":108,"stargazers_count":106,"open_issues_count":8,"forks_count":29,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-08-03T22:14:15.316Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://github.com/lh3/minimap2","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2015-09-06T03:37:27.000Z","updated_at":"2024-03-06T06:53:01.000Z","dependencies_parsed_at":"2024-01-12T21:17:08.132Z","dependency_job_id":"08e26185-4ee1-458b-81f3-cff75a25648f","html_url":"https://github.com/lh3/minimap","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminimap","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminimap/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminimap/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminimap/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/minimap/tar.gz/refs/heads/master","host":{"name":"GitHub","ur
l":"https://github.com","kind":"github","repositories_count":224448691,"owners_count":17313099,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T22:00:17.536Z","updated_at":"2024-11-13T12:30:57.844Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":["Ranked by starred repositories","Long-read Sequencing Tools"],"sub_categories":[],"readme":"## Introduction\n\nMinimap is an *experimental* tool to efficiently find multiple approximate\nmapping positions between two sets of long sequences, such as between reads and\nreference genomes, between genomes and between long noisy reads. By default, it\nis tuned to have high sensitivity to 2kb matches around 20% divergence but with\nlow specificity. Minimap does not generate alignments as of now and because of\nthis, it is usually tens of times faster than mainstream *aligners*. With four\nCPU cores, minimap can map 1.6Gbp PacBio reads to human in 2.5 minutes, 1Gbp\nPacBio E. coli reads to pre-indexed 9.6Gbp bacterial genomes in 3 minutes, to\npre-indexed \u003e100Gbp nt database in ~1 hour (of which ~20 minutes are spent on\nloading index from the network filesystem; peak RAM: 10GB), map 2800 bacteria\nto themselves in 1 hour, and map 1Gbp E. coli reads against themselves in a\ncouple of minutes.\n\nMinimap does not replace mainstream aligners, but it can be useful when you\nwant to quickly identify long approximate matches at moderate divergence among\na huge collection of sequences. 
For this task, it is much faster than most\nexisting tools.\n\n## Usage\n\n* Map two sets of long sequences:\n  ```sh\n  minimap target.fa.gz query.fa.gz \u003e out.mini\n  ```\n  The output is TAB-delimited with each line consisting of query name, length,\n  0-based start, end, strand, target name, length, start, end, the number of\n  matching bases, the number of co-linear minimizers in the match and the\n  fraction of matching bases.\n\n* All-vs-all PacBio read self-mapping for [miniasm][miniasm]:\n  ```sh\n  minimap -Sw5 -L100 -m0 reads.fa reads.fa | gzip -1 \u003e reads.paf.gz\n  ```\n\n* Prebuild the index and then map:\n  ```sh\n  minimap -d target.mmi target.fa.gz\n  minimap -l target.mmi query.fa.gz \u003e out.mini\n  ```\n  Minimap indexing is very fast (1 minute for human genome; 50 minutes for \u003e100Gbp\n  nt database retrieved on 2015-09-30), but for huge\n  repeatedly used databases, prebuilding the index is still preferred.\n\n* Map sequences against themselves without diagonal matches:\n  ```sh\n  minimap -S sequences.fa sequences.fa \u003e self-match.mini\n  ```\n  The output may still contain overlapping matches in repetitive regions.\n\n## Algorithm Overview\n\n1. Indexing. Collect all [(*w*,*k*)-minimizers][mini] in a batch (**-I**=4\n   billion bp) of target sequences and store them in a hash table. Mark the top\n   **-f**=0.1% of most frequent minimizers as repeats. Minimap\n   uses an [invertible hash function][invhash] to avoid taking poly-A as\n   minimizers.\n\n2. For each query, collect all (*w*,*k*)-minimizers and look up the hash table for\n   matches (*q\u003csub\u003ei\u003c/sub\u003e*,*t\u003csub\u003ei\u003c/sub\u003e*,*s\u003csub\u003ei\u003c/sub\u003e*), where\n   *q\u003csub\u003ei\u003c/sub\u003e* is the query position, *t\u003csub\u003ei\u003c/sub\u003e* the target position\n   and *s\u003csub\u003ei\u003c/sub\u003e* indicates whether the minimizer match is on the same\n   strand.\n\n3. 
For matches on the same strand, sort by {*q\u003csub\u003ei\u003c/sub\u003e*-*t\u003csub\u003ei\u003c/sub\u003e*}\n   and then cluster matches within a **-r**=500bp window. Minimap merges\n   two windows if **-m**=50% of minimizer matches overlap. For matches on different\n   strands, sort by {*q\u003csub\u003ei\u003c/sub\u003e*+*t\u003csub\u003ei\u003c/sub\u003e*} and apply a similar\n   clustering procedure. This is inspired by the [Hough transform][hough].\n\n4. For each cluster, sort (*q\u003csub\u003ei\u003c/sub\u003e*,*t\u003csub\u003ei\u003c/sub\u003e*) by *q\u003csub\u003ei\u003c/sub\u003e*\n   and solve a [longest increasing subsequence problem][lis] for *t\u003csub\u003ei\u003c/sub\u003e*. This\n   finds the longest co-linear matching chain. Break the chain whenever there\n   is a gap longer than **-g**=10000.\n\n5. Output the start and end of the chain if it contains **-c**=4 or more\n   minimizer matches and the matching length is no less than **-L**=40.\n\n6. Go to 1 and rewind to the first query record if there are more target\n   sequences; otherwise stop.\n\nTo increase sensitivity, we may decrease **-w** to index more minimizers;\nwe may also decrease **-k**, though this may greatly impact performance for\nmammalian genomes.\n\nAlso note that by default, if the total length of target sequences is less than\n4Gbp (1G=1 billion; controlled by **-I**), minimap creates one index and streams\nall the query sequences in one go. The multiple hits of a query sequence are\nadjacent to each other in the output. If the total length is greater than\n4Gbp, minimap needs to read query sequences multiple times. 
The multiple hits\nof a query may not be adjacent.\n\n[mini]: http://bioinformatics.oxfordjournals.org/content/20/18/3363.abstract\n[lis]: https://en.wikipedia.org/wiki/Longest_increasing_subsequence\n[hough]: https://en.wikipedia.org/wiki/Hough_transform\n[invhash]: https://gist.github.com/lh3/974ced188be2f90422cc\n[miniasm]: https://github.com/lh3/miniasm\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fminimap","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fminimap","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fminimap/lists"}