{"id":13641559,"url":"https://github.com/lh3/bwa","last_synced_at":"2025-04-10T04:49:32.186Z","repository":{"id":1309678,"uuid":"1253014","full_name":"lh3/bwa","owner":"lh3","description":"Burrow-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)","archived":false,"fork":false,"pushed_at":"2024-07-27T19:23:26.000Z","size":1715,"stargazers_count":1535,"open_issues_count":239,"forks_count":556,"subscribers_count":107,"default_branch":"master","last_synced_at":"2024-10-29T15:34:52.080Z","etag":null,"topics":["bioinformatics","fm-index","genomics","sequence-alignment"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README-alt.md","changelog":"ChangeLog","contributing":null,"funding":null,"license":"COPYING","code_of_conduct":"code_of_conduct.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2011-01-14T01:36:33.000Z","updated_at":"2024-10-29T05:05:18.000Z","dependencies_parsed_at":"2024-11-26T15:03:50.815Z","dependency_job_id":null,"html_url":"https://github.com/lh3/bwa","commit_stats":{"total_commits":844,"total_committers":37,"mean_commits":22.81081081081081,"dds":0.20260663507109,"last_synced_commit":"79b230de48c74156f9d3c26795a360fc5a2d5d3b"},"previous_names":[],"tags_count":33,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fbwa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fbwa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fbwa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitH
ub/repositories/lh3%2Fbwa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/bwa/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248161243,"owners_count":21057552,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","fm-index","genomics","sequence-alignment"],"created_at":"2024-08-02T01:01:21.771Z","updated_at":"2025-04-10T04:49:32.163Z","avatar_url":"https://github.com/lh3.png","language":"C","readme":"## For the Impatient\n\n```sh\n# Download bwakit (or from \u003chttp://sourceforge.net/projects/bio-bwa/files/bwakit/\u003e manually)\nwget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.12_x64-linux.tar.bz2/download \\\n  | bzip2 -dc | tar xf -\n# Generate the GRCh38+ALT+decoy+HLA and create the BWA index\nbwa.kit/run-gen-ref hs38DH   # download GRCh38 and write hs38DH.fa\nbwa.kit/bwa index hs38DH.fa  # create BWA index\n# mapping\nbwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh  # skip \"|sh\" to show command lines\n```\n\nThis generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA\ngenotypes on each gene and `out.hla.all` for other possible HLA genotypes.\nPlease check out [bwa/bwakit/README.md][kithelp] for details.\n\n## Background\n\nGRCh38 consists of several components: chromosomal assembly, unlocalized contigs\n(chromosome known but location unknown), unplaced contigs (chromosome unknown)\nand ALT contigs (long clustered variations). 
The combination of the first three\ncomponents is called the *primary assembly*. It is recommended to use the\ncomplete primary assembly for all analyses. Using ALT contigs in read mapping is\ntricky.\n\nGRCh38 ALT contigs total 109Mbp in length and span 60Mbp of the primary\nassembly. However, sequences that are highly diverged from the primary assembly\nonly contribute a few million bp. Most subsequences of ALT contigs are nearly\nidentical to the primary assembly. If we align sequence reads to GRCh38+ALT\nblindly, we will get many additional reads with zero mapping quality and miss\nvariants on them. It is crucial to make mappers aware of ALTs.\n\nBWA-MEM is ALT-aware. It essentially computes mapping quality across the\nnon-redundant content of the primary assembly plus the ALT contigs and is free\nof the problem above.\n\n## Methods\n\n### Sequence alignment\n\nAs of now, ALT mapping is done in two separate steps: BWA-MEM mapping and\npostprocessing. The `bwa.kit/run-bwamem` script performs the two steps when ALT\ncontigs are present. The following picture shows an example of how BWA-MEM\ninfers mapping quality and reports alignment after step 2:\n\n![](http://lh3lh3.users.sourceforge.net/images/alt-demo.png)\n\n#### Step 1: BWA-MEM mapping\n\nAt this step, BWA-MEM reads the ALT contig names from \"*idxbase*.alt\", ignoring\nthe ALT-to-ref alignment, and labels a potential hit as *ALT* or *non-ALT*,\ndepending on whether the hit lands on an ALT contig or not. BWA-MEM then reports\nalignments and assigns mapQ following these two rules:\n\n1. The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of\n   an ALT hit is computed across all hits.\n\n2. If there are no non-ALT hits, the best ALT hit is output as the primary\n   alignment. 
If there are both ALT and non-ALT hits, non-ALT hits will be\n   primary and ALT hits will be supplementary (SAM flag 0x800).\n\nIn theory, non-ALT alignments from step 1 should be identical to alignments\nagainst the reference genome without ALT contigs. In practice, the two types of\nalignments may differ in rare cases due to seeding heuristics. When an ALT hit\nis significantly better than non-ALT hits, BWA-MEM may miss seeds on the\nnon-ALT hits.\n\nIf we don't care about ALT hits, we may skip postprocessing (step 2).\nNonetheless, postprocessing is recommended as it improves mapQ and gives more\ninformation about ALT hits.\n\n#### Step 2: Postprocessing\n\nPostprocessing is done with a separate script `bwa-postalt.js`. It reads all\npotential hits reported in the XA tag, lifts ALT hits to the chromosomal\npositions using the ALT-to-ref alignment, groups them based on overlaps between\ntheir lifted positions, and then re-estimates mapQ across the best scoring hit\nin each group. Being aware of the ALT-to-ref alignment, this script can greatly\nimprove mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. It also\nwrites each hit overlapping the reported hit into a separate SAM line. This\nenables variant calling on each ALT contig independently of the others.\n\n### On the completeness of GRCh38+ALT\n\nWhile GRCh38 is much more complete than GRCh37, it is still missing some true\nhuman sequences. To make sure every piece of sequence in the reference assembly\nis correct, the [Genome Reference Consortium][grc] (GRC) requires each ALT contig\nto have enough support from multiple sources before considering it for addition to the\nreference assembly. This careful and sophisticated procedure has left out some\nsequences, one of which is [this example][novel], a 10kb contig assembled from\nCHM1 short reads and also present in NA12878. 
You can try [BLAT][blat] or\n[BLAST][blast] to see where it maps.\n\nFor a more complete reference genome, we compiled a new set of decoy sequences\nfrom GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.\nThe sequences are included in `hs38DH-extra.fa` from the [BWA binary\npackage][res].\n\nIn addition to decoy, we also put multiple alleles of HLA genes in\n`hs38DH-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],\nversion 3.18.0 and are used to collect reads sequenced from these genes.\n\n### HLA typing\n\nHLA genes are known to be associated with many autoimmune diseases, infectious\ndiseases and drug responses. They are among the most important genes but are\nrarely studied by WGS projects due to the high sequence divergence between\nHLA genes and the reference genome in these regions.\n\nBy including the HLA gene regions in the reference assembly as ALT contigs, we\nare able to effectively identify reads coming from these genes. We also provide\na pipeline, which is included in the [BWA binary package][res], to type the\nseveral classic HLA genes. The pipeline is conceptually simple. It de novo\nassembles sequence reads mapped to each gene, aligns exon sequences of each\nallele to the assembled contigs and then finds the pairs of alleles that best\nexplain the contigs. In practice, however, the completeness of IMGT/HLA and\ncopy-number changes related to these genes are not so straightforward to\nresolve. HLA typing may not always be successful. 
Users may also consider using\nother programs for typing such as [Warren et al (2012)][hla4], [Liu et al\n(2013)][hla2], [Bai et al (2014)][hla3] and [Dilthey et al (2014)][hla1], though\nmost of them are distributed under restrictive licenses.\n\n## Preliminary Evaluation\n\nTo check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878\nunitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy\n(hs38DH), and called small variants from the alignment. CHM1 is haploid.\nIdeally, heterozygous calls are false positives (FP). NA12878 is diploid. The\ntrue positive (TP) heterozygous calls from NA12878 are approximately equal\nto the difference between NA12878 and CHM1 heterozygous calls. A better assembly\nshould yield higher TP and lower FP. The following table shows the numbers for\nthese assemblies:\n\n|Assembly|hs37   |hs38   |hs38DH|CHM1_1.1|  huref|\n|:------:|------:|------:|------:|------:|------:|\n|FP      | 255706| 168068| 142516|307172 | 575634|\n|TP      |2142260|2163113|2150844|2167235|2137053|\n\nWith this measurement, hs38 is clearly better than hs37. Genome hs38DH reduces\nFP by ~25k but also reduces TP by ~12k. We manually inspected variants called\nfrom hs38 only and found the majority of them are associated with excessive read\ndepth, clustered variants or weak alignment. We believe most hs38-only calls are\nproblematic. In addition, if we compare two NA12878 replicates from HiSeq X10\nwith nearly identical library construction, the difference is ~140k, an order\nof magnitude higher than the difference between hs38 and hs38DH. ALT contigs,\ndecoy and HLA genes in hs38DH improve variant calling and enable the analyses of\nALT contigs and HLA typing at little cost.\n\n## Problems and Future Development\n\nThere are some uncertainties about ALT mappings - we are not sure whether they\nhelp biological discovery and don't know the best way to analyze them. 
Without\nclear demand from downstream analyses, it is very difficult to design the\noptimal mapping strategy. The current BWA-MEM method is just a start. If it\nturns out to be useful in research, we will probably rewrite bwa-postalt.js in C\nfor performance; if not, we may make changes. It is also possible that we might\nmake a breakthrough on the representation of multiple genomes, in which case, we\ncan even get rid of ALT contigs for good.\n\n\n\n[res]: https://sourceforge.net/projects/bio-bwa/files/bwakit\n[sb]: https://github.com/GregoryFaust/samblaster\n[grc]: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/\n[novel]: https://gist.github.com/lh3/9935148b71f04ba1a8cc\n[blat]: https://genome.ucsc.edu/cgi-bin/hgBlat\n[blast]: http://blast.st-va.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn\u0026PAGE_TYPE=BlastSearch\u0026LINK_LOC=blasthome\n[sgdp]: http://www.simonsfoundation.org/life-sciences/simons-genome-diversity-project/\n[hladb]: http://www.ebi.ac.uk/ipd/imgt/hla/\n[grcdef]: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/info/definitions.shtml\n[hla1]: http://biorxiv.org/content/early/2014/07/08/006973\n[hlalink]: http://www.hladiseaseassociations.com\n[hlatools]: https://www.biostars.org/p/93245/\n[hla2]: http://nar.oxfordjournals.org/content/41/14/e142.full.pdf+html\n[hla3]: http://www.biomedcentral.com/1471-2164/15/325\n[hla4]: http://genomemedicine.com/content/4/12/95\n[kithelp]: https://github.com/lh3/bwa/tree/master/bwakit\n","funding_links":[],"categories":["Applications","Next Generation Sequencing","Variant Callers","Ranked by starred repositories","Mapping tools","Extending ADAM"],"sub_categories":["Library OSes and SDKs","Sequence Alignment","Germline SNP/Indel 
Callers","aligner","Applications"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fbwa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fbwa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fbwa/lists"}