{"id":13575587,"url":"https://github.com/lh3/minigraph","last_synced_at":"2025-04-05T23:07:40.683Z","repository":{"id":39620439,"uuid":"169757964","full_name":"lh3/minigraph","owner":"lh3","description":"Sequence-to-graph mapper and graph generator","archived":false,"fork":false,"pushed_at":"2024-05-22T00:59:12.000Z","size":948,"stargazers_count":391,"open_issues_count":45,"forks_count":37,"subscribers_count":28,"default_branch":"master","last_synced_at":"2024-05-22T01:47:35.650Z","etag":null,"topics":["bioinformatics","genome-graph","genomics","pan-genome","sequence-alignment"],"latest_commit_sha":null,"homepage":"https://lh3.github.io/minigraph","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":"code_of_conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-02-08T15:43:54.000Z","updated_at":"2024-08-01T15:28:22.422Z","dependencies_parsed_at":"2022-07-13T09:10:28.421Z","dependency_job_id":"3bf27d06-75c0-4ec7-a66b-98eb28bf9e1a","html_url":"https://github.com/lh3/minigraph","commit_stats":null,"previous_names":[],"tags_count":22,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminigraph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminigraph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminigraph/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fminigraph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/minigraph/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247411234,"owners_count":20934653,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","genome-graph","genomics","pan-genome","sequence-alignment"],"created_at":"2024-08-01T15:01:02.380Z","updated_at":"2025-04-05T23:07:40.664Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":["C","A list of software capable of analyzing mainly **eukaryotic** genomes for pangenomics.","Ranked by starred repositories"],"sub_categories":[],"readme":"## \u003ca name=\"started\"\u003e\u003c/a\u003eGetting Started\n\n```sh\ngit clone https://github.com/lh3/minigraph\ncd minigraph \u0026\u0026 make\n# Map sequence to sequence, similar to minimap2 without base alignment\n./minigraph test/MT-human.fa test/MT-orangA.fa \u003e out.paf\n# Map sequence to graph\n./minigraph test/MT.gfa test/MT-orangA.fa \u003e out.gaf\n# Incremental graph generation (-l10k necessary for this toy example)\n./minigraph -cxggs -l10k test/MT.gfa test/MT-chimp.fa test/MT-orangA.fa \u003e out.gfa\n# Call per-sample path in each bubble/variation (-c not needed for this)\n./minigraph -xasm -l10k --call test/MT.gfa test/MT-orangA.fa \u003e orangA.call.bed\n\n# Extract localized structural variations\ngfatools bubble out.gfa \u003e SV.bed\n\n# Generate human MHC graph and call SVs jointly (~10 min)\ncurl -sL 
https://zenodo.org/record/8245267/files/mg-cookbook-v1_x64-linux.tar.bz2?download=1 | tar -jxf -\ncd mg-cookbook-v1_x64-linux \u0026\u0026 ./00run.sh\n```\n\n## Table of Contents\n\n\u003cimg align=\"right\" width=\"278\" src=\"doc/example1.png\"/\u003e\n\n- [Getting Started](#started)\n- [Introduction](#intro)\n- [Users' Guide](#uguide)\n  - [Installation](#install)\n  - [Sequence-to-graph mapping](#map)\n  - [Graph generation](#ggen)\n  - [Calling structural variations](#callsv)\n  - [SV calling showcase (human MHC)](#svexample)\n  - [Prebuilt graphs](#prebuilt)\n  - [Algorithm overview](#algo)\n- [Limitations](#limit)\n\n## \u003ca name=\"intro\"\u003e\u003c/a\u003eIntroduction\n\nMinigraph is a sequence-to-graph mapper and graph constructor. For graph\ngeneration, it aligns a query sequence against a sequence graph and\nincrementally augments an existing graph with long query subsequences diverged\nfrom the graph. The figure on the right briefly explains the procedure.\n\nMinigraph borrows ideas and code from [minimap2][minimap2]. It is fairly\nefficient and can construct a graph from 90 human assemblies in a couple of\ndays using 24 CPU cores. Older versions of minigraph were unable to produce\nbase alignment. The latest version can. **Please add option `-c` for graph\ngeneration** as it generally improves the quality of graphs.\n\n## \u003ca name=\"uguide\"\u003e\u003c/a\u003eUsers' Guide\n\n### \u003ca name=\"install\"\u003e\u003c/a\u003eInstallation\n\nTo install minigraph, type `make` in the source code directory. The only\nnon-standard dependency is [zlib][zlib]. For better performance, it is\nrecommended to compile with recent compilers.\n\n### \u003ca name=\"map\"\u003e\u003c/a\u003eSequence-to-graph mapping\n\nTo map sequences against a graph, you should prepare the graph in the [GFA\nformat][gfa1], or preferably the [rGFA format][rgfa]. 
If you don't have\na graph, you can generate a graph from multiple samples (see the [Graph\ngeneration section](#ggen) below). The typical command line for mapping is\n```sh\nminigraph -cx lr graph.gfa query.fa \u003e out.gaf\n```\nChoose the preset option `-x` that matches your input. Minigraph\noutputs mappings in the [GAF format][gaf], which is a strict superset of the\n[PAF format][paf]. The only visual difference between GAF and PAF is that the\n6th column in GAF may encode a graph path like\n`\u003eMT_human:0-4001\u003cMT_orang:3426-3927` instead of a contig/chromosome name.\n\nThe minigraph GFA parser seamlessly parses FASTA and converts it to GFA\ninternally, so you can also provide sequences in FASTA as the reference. In\nthis case, minigraph will behave like minimap2, though likely producing\ndifferent alignments due to differences between the two implementations.\n\n### \u003ca name=\"ggen\"\u003e\u003c/a\u003eGraph generation\n\nThe following command line generates a graph in rGFA:\n```sh\nminigraph -cxggs -t16 ref.fa sample1.fa sample2.fa \u003e out.gfa\n```\nwhich is equivalent to\n```sh\nminigraph -cxggs -t16 ref.fa sample1.fa \u003e sample1.gfa\nminigraph -cxggs -t16 sample1.gfa sample2.fa \u003e out.gfa\n```\nFile `ref.fa` is typically the reference genome (e.g. GRCh38 for human).\nIt can also be replaced by a graph in rGFA. Minigraph assumes `sample1.fa` to\nbe the whole-genome assembly of an individual. This is an important assumption:\nminigraph only considers 1-to-1 orthologous regions between the graph and the\nindividual FASTA. 
If you use raw reads or put multiple individual genomes in\none file, minigraph will filter out most alignments as they cover the input\ngraph multiple times.\n\nThe output rGFA can be converted to a FASTA file with [gfatools][gfatools]:\n```sh\ngfatools gfa2fa -s graph.gfa \u003e out.stable.fa\n```\nThe output `out.stable.fa` will always include the initial reference `ref.fa`\nand may additionally add new segments diverged from the initial reference.\n\n### \u003ca name=\"callsv\"\u003e\u003c/a\u003eCalling structural variations\n\nA minigraph graph is composed of chains of bubbles with the reference as the\nbackbone. Each *bubble* represents a structural variation. It can be\nmulti-allelic if there are multiple paths through the bubble. You can extract\nthese bubbles with\n```sh\ngfatools bubble graph.gfa \u003e var.bed\n```\nThe output is a BED-like file. The first three columns give the position of a\nbubble/variation and the remaining columns are:\n\n* (4) \\# GFA segments in the bubble including the source and the sink of the bubble\n* (5) \\# all possible paths through the bubble (not all paths present in input samples)\n* (6) 1 if the bubble involves an inversion; 0 otherwise\n* (7) length of the shortest path (i.e. allele) through the bubble\n* (8) length of the longest path/allele through the bubble\n* (9-11) please ignore\n* (12) list of segments in the bubble; first for the source and last for the sink\n* (13) sequence of the shortest path (`*` if zero length)\n* (14) sequence of the longest path (NB: it may not be present in the input samples)\n\nGiven an assembly, you can find the path/allele of this assembly in each bubble with\n```sh\nminigraph -cxasm --call -t16 graph.gfa sample-asm.fa \u003e sample.bed\n```\nOn each line in the BED-like output, the last colon-separated field gives the\nalignment path through the bubble, the path length in the graph, the mapping\nstrand of sample contig, the contig name, the approximate contig start and\ncontig end. 
The number of lines in the file is the same as the number of lines\nin the output of `gfatools bubble`.\n\n### \u003ca name=\"svexample\"\u003e\u003c/a\u003eSV calling showcase (human MHC)\n\nThe following example generates a graph for 61 human MHC haplotypes and calls\nSVs from them. Primary sequences are retrieved from an [AGC][agc] archive.\n```sh\n# Obtain cookbook data and precompiled binaries\ncurl -sL https://zenodo.org/record/8245267/files/mg-cookbook-v1_x64-linux.tar.bz2?download=1 | tar -jxf -\ncd mg-cookbook-v1_x64-linux\n\n# Generate graph. This takes ~7 minutes.\n./agc listset MHC-61.agc | awk '!/GRC/{a=a\" \u003c(./agc getset MHC-61.agc \"$1\")\"}END{print \"./minigraph -cxggs \u003c(./agc getset MHC-61.agc MHC-00GRCh38)\"a}' | bash \u003e MHC-61.gfa 2\u003e MHC-61.gfa.log\n\n# Call SVs per sample. This takes a couple of minutes.\n./agc listset MHC-61.agc | xargs -i echo ./minigraph -cxasm --call -t1 MHC-61.gfa '\u003c(./agc getset MHC-61.agc {})' \\\u003e {}.bed 2\\\u003e {}.bed.log | parallel -j16\n\n# Merge per-sample calls and generate VCF. `-r0` indicates the reference sample.\npaste *.bed | ./k8 mgutils.js merge -s \u003c(./agc listset MHC-61.agc) - | gzip \u003e MHC-61.sv.bed.gz\n./k8 mgutils-es6.js merge2vcf -r0 MHC-61.sv.bed.gz \u003e MHC-61.sv.vcf\n```\n\nIn this example, the GRCh38 haplotype is named \"MHC-00GRCh38\" in the AGC\narchive and is taken as the reference. The awk command line generates a command\nline that retrieves each haplotype on the fly and feeds it to minigraph.\n`misc/mgutils.js merge` combines per-sample calls and generates a merged BED\nfile. The final `misc/mgutils-es6.js merge2vcf` derives a VCF file. 
This script\nrequires the latest k8 JavaScript runtime.\n\n### \u003ca name=\"prebuilt\"\u003e\u003c/a\u003ePrebuilt graphs\n\nPrebuilt human graphs in the rGFA format can be found [at Zenodo][human-zenodo].\n\n### \u003ca name=\"algo\"\u003e\u003c/a\u003eAlgorithm overview\n\n\u003cimg align=\"right\" width=\"278\" src=\"doc/example2.png\"/\u003e\n\nIn the following, minigraph command line options have a dash ahead and are\nhighlighted in bold. The description may help to tune minigraph parameters.\n\n1. Read all reference bases, extract (**-k**,**-w**)-minimizers and index them\n   in a hash table.\n\n2. Read **-K** [=*500M*] query bases in the mapping mode, or read all query\n   bases in the graph construction mode. For each query sequence, do steps 3\n   through 5:\n\n3. Find colinear minimizer chains using the [minimap2][minimap2] algorithm,\n   assuming segments in the graph are disconnected. These are called *linear\n   chains*.\n\n4. Perform another round of chaining, taking each linear chain as an anchor.\n   For a pair of linear chains, minigraph tries to connect them with the graph\n   wavefront alignment algorithm (GWFA). If minigraph fails to find an\n   alignment within an edit distance threshold, it will find up to 15 shortest\n   paths between the two linear chains and choose the path of length closest\n   to the distance on the query sequence. Chains found at this step are called\n   *graph chains*.\n\n5. Identify primary chains and estimate mapping quality with a method similar\n   to the one used in minimap2. Perform base alignment.\n\n6. In the graph construction mode, collect all mappings longer than **-d**\n   [=*10k*] and keep their query and graph segment intervals in two lists,\n   respectively.\n\n7. For each mapping longer than **-l** [=*100k*], find poorly aligned regions.\n   A region is filtered if it overlaps two or more intervals collected at step\n   6.\n\n8. Insert the remaining poorly aligned regions into the input graph. 
This\n   constructs a new graph.\n\n## \u003ca name=\"limit\"\u003e\u003c/a\u003eLimitations\n\n* A complex minigraph subgraph is often suboptimal and may vary with the order\n  of input samples. It may not represent the evolutionary history\n  or the functional relevance at the locus. Please *do not overinterpret*\n  complex subgraphs. If you are interested in a particular subgraph, it is\n  recommended to extract the input contig subsequences involved in the subgraph\n  with the `--call` option and manually curate the results.\n\n* Minigraph needs to find strong colinear chains first. For a graph consisting\n  of many short segments (e.g. one generated from rare SNPs in large\n  populations), minigraph will fail to map query sequences.\n\n* The base alignment in the current version of minigraph is slow for species of\n  high diversity.\n\n\n[zlib]: http://zlib.net/\n[minimap2]: https://github.com/lh3/minimap2\n[rgfa]: https://github.com/lh3/gfatools/blob/master/doc/rGFA.md\n[gfa1]: https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md\n[gaf]: https://github.com/lh3/gfatools/blob/master/doc/rGFA.md#the-graph-alignment-format-gaf\n[paf]: https://github.com/lh3/miniasm/blob/master/PAF.md\n[gfatools]: https://github.com/lh3/gfatools\n[bandage]: https://rrwick.github.io/Bandage/\n[gfaviz]: https://github.com/ggonnella/gfaviz\n[human-zenodo]: https://zenodo.org/record/6983934\n[agc]: https://github.com/refresh-bio/agc\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fminigraph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fminigraph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fminigraph/lists"}