{"id":20810106,"url":"https://github.com/lh3/ropebwt3","last_synced_at":"2025-04-09T11:12:02.494Z","repository":{"id":243696604,"uuid":"810366452","full_name":"lh3/ropebwt3","owner":"lh3","description":"BWT construction and search","archived":false,"fork":false,"pushed_at":"2024-10-16T01:43:55.000Z","size":294,"stargazers_count":89,"open_issues_count":0,"forks_count":2,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-10-17T15:18:43.588Z","etag":null,"topics":["bioinformatics","bwt","fm-index"],"latest_commit_sha":null,"homepage":"","language":"C","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lh3.png","metadata":{"files":{"readme":"README.md","changelog":"NEWS.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-04T14:54:32.000Z","updated_at":"2024-10-16T02:46:14.000Z","dependencies_parsed_at":"2024-06-26T08:47:10.469Z","dependency_job_id":"64ec6455-4a04-41bc-a822-967b6c2be804","html_url":"https://github.com/lh3/ropebwt3","commit_stats":null,"previous_names":["lh3/ropebwt3"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fropebwt3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fropebwt3/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fropebwt3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lh3%2Fropebwt3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lh3","download_url":"https://codeload.github.com/lh3/ropebwt3/tar.gz/refs/heads/mast
er","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248027411,"owners_count":21035594,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","bwt","fm-index"],"created_at":"2024-11-17T20:19:43.978Z","updated_at":"2025-04-09T11:12:02.478Z","avatar_url":"https://github.com/lh3.png","language":"C","funding_links":[],"categories":[],"sub_categories":[],"readme":"## \u003ca name=\"start\"\u003e\u003c/a\u003eGetting Started\n```sh\n# Compile\ngit clone https://github.com/lh3/ropebwt3\ncd ropebwt3\nmake  # use \"make omp=0\" if your compiler doesn't support OpenMP\n\n# Toy examples\necho -e 'AGG\\nAGC' | ./ropebwt3 build -LR -\necho TGAACTCTACACAACATATTTTGTCACCAAG | ./ropebwt3 build -Ldo idx.fmd -\necho ACTCTACACAAgATATTTTGTCA | ./ropebwt3 mem -Ll10 idx.fmd -\n\n# Download the prebuilt FM-index for 152 M. tuberculosis genomes\nwget -O- https://zenodo.org/records/12803206/files/mtb152.tar.gz?download=1 | tar -zxf -\n\n# Count super-maximal exact matches (no contig positions)\necho ACCTACAACACCGGTGGCTACAACGTGG  | ./ropebwt3 mem -L mtb152.fmd -\n# Local alignment\necho ACCTACAACACCGGTaGGCTACAACGTGG | ./ropebwt3 sw -Lm20 mtb152.fmd -\n# Retrieve R15311, the 46th genome in the collection. 
90=(46-1)*2\n./ropebwt3 get mtb152.fmd 90 \u003e R15311.fa\n\n# Download the index of 472 human long-read assemblies (18GB download size)\nwget -O human472.fmr.gz https://zenodo.org/records/14854401/files/human472.fmr.gz\nwget -O human472.fmd.ssa.gz https://zenodo.org/records/14854401/files/human472.fmd.ssa.gz\nwget -O human472.fmd.len.gz https://zenodo.org/records/14854401/files/human472.fmd.len.gz\ngzip -d human472.fmr.gz human472.fmd.ssa.gz   # or use pigz for parallel decompression\n./ropebwt3 build -i human472.fmr -do human472.fmd   # convert to a faster format\n\n# Find C4 alleles (the query is on the exon 26 of C4A)\necho CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ./ropebwt3 sw -eN200 -Lm10 human472.fmd -\n```\n\n## Table of Contents\n\n- [Getting Started](#start)\n- [Introduction](#intro)\n- [Usage](#use)\n  - [Finding maximal exact matches](#mem)\n  - [Local alignment](#bwasw)\n  - [Haplotype diversity with end-to-end alignment](#e2e)\n  - [Indexing](#build)\n  - [Binary BWT formats](#format)\n- [Citation](#cite)\n- [Limitations](#limit)\n\n## \u003ca name=\"intro\"\u003e\u003c/a\u003eIntroduction\n\nRopebwt3 constructs the FM-index of a large DNA sequence set and searches for\nmatches against the FM-index. It is optimized for highly redundant sequence\nsets such as a pangenome or sequence reads at high coverage. Ropebwt3 can\nlosslessly compress 7.3Tb of common bacterial genomes into a 30GB run-length\nencoded BWT file and report supermaximal exact matches (SMEMs) or local\nalignments with mismatches and gaps.\n\nPrebuilt ropebwt3 indices can be downloaded [from Zenodo][zenodo].\n\n## \u003ca name=\"use\"\u003e\u003c/a\u003eUsage\n\nA full ropebwt3 index consists of three files:\n\n* `\u003cbase\u003e.fmd`: run-length encoded BWT that supports the rank operation. It is\n  generated by the `build` command. By default, the $i$-th sequence in the input\n  is the $2i$-th sequence in the BWT and its reverse complement is the\n  $(2i+1)$-th sequence. 
Some commands assume such ordering.\n\n* `\u003cbase\u003e.fmd.ssa`: sampled suffix array, generated by the `ssa` command. For\n  now, it is only needed for reporting coordinates in the PAF output of the\n  `sw` command.\n\n* `\u003cbase\u003e.fmd.len.gz`: list of sequence names and lengths. It is generated\n  with third-party tools/scripts, for example, with `seqtk comp input.fa | cut\n  -f1,2 | gzip`. This file is needed for reporting sequence names and lengths\n  in the PAF output.\n\n### \u003ca name=\"mem\"\u003e\u003c/a\u003eFinding maximal exact matches\n\nA maximal exact match (MEM) is an exact alignment between the index and a query\nthat cannot be extended in either direction. A super MEM (SMEM) is a MEM that\nis not contained in any other MEM on the query sequence. You can find the SMEMs\nwith\n```sh\nropebwt3 mem -t4 -l31 bwt.fmd query.fa \u003e matches.bed\n```\nIn the output, the first three columns give the query sequence name, start and\nend of a match and the fourth column gives the number of hits. Option `-l`\nspecifies the minimum SMEM length. A larger value helps performance.\nThis command does not output positions of SMEMs by default.\nYou can use option `-p` to get the positions of a subset of SMEMs.\nIn addition, you can use `--gap` to obtain regions not covered by long SMEMs or\n`--cov` to get the total length of regions covered by long SMEMs.\n\n### \u003ca name=\"bwasw\"\u003e\u003c/a\u003eLocal alignment\n\nRopebwt3 implements a revised [BWA-SW algorithm][bwasw] to align query\nsequences against an FM-index:\n```sh\nropebwt3 sw -t4 -N25 -k11 bwt.fmd query.fa \u003e aln.paf\n```\nOption `-N` effectively sets the bandwidth during alignment. A larger value\nimproves alignment accuracy at the cost of performance. 
Option `-k` initiates\nalignments with an exact match.\n\nGiven a complete ropebwt3 index with sampled suffix array and sequence names,\nthe `sw` command outputs standard PAF but it only outputs one hit per query\neven if there are multiple equally best hits. The number of hits in BWT is\nwritten to the `rh` tag.\n\n**Local alignment is tens of times slower than finding SMEMs.** It is not designed\nfor aligning high-throughput sequence reads.\n\n### \u003ca name=\"e2e\"\u003e\u003c/a\u003eHaplotype diversity with end-to-end alignment\n\nWith option `-e`, the `sw` command aligns the query sequence from end to end.\nIn this mode, ropebwt3 may output multiple suboptimal end-to-end hits.\nThis provides a way to retrieve similar haplotypes from the index.\n\nThe `hapdiv` command applies this algorithm to 101-mers in a query sequence and\noutputs 1) query name, 2) query start, 3) query end, 4) number of distinct\nalleles the 101-mer matches, 5) maximum edit distance observed,\n6) number of haplotypes perfectly matching the 101-mer,\n7-11) number of haplotypes with edit distance 1-5 from the 101-mer,\nand 12) number of haplotypes with edit distance 6 or higher.\n\n### \u003ca name=\"build\"\u003e\u003c/a\u003eIndexing\n\nRopebwt3 implements two algorithms for BWT construction. 
Although both\nalgorithms work for general sequences, you need to choose an algorithm based on\nthe input data types for the best performance.\n\n```sh\n# If not sure, use the general command line\nropebwt3 build -t24 -bo bwt.fmr file1.fa file2.fa filen.fa\n# You can also append another file to an existing index\nropebwt3 build -t24 -i bwt-old.fmr -bo bwt-new.fmr filex.fa\n# If each file is small, concatenate them together\ncat file1.fa file2.fa filen.fa | ropebwt3 build -t24 -m2g -bo bwt.fmr -\n# For short reads, use the old ropebwt2 algorithm and optionally apply RCLO (option -r)\nropebwt3 build -r -bo bwt.fmr reads.fq.gz\n# use grlBWT, which may be faster but uses working disk space\nropebwt3 fa2line genome1.fa genome2.fa genomen.fa \u003e all.txt\ngrlbwt-cli all.txt -t 32 -T . -o bwt.grl\ngrl2plain bwt.rl_bwt bwt.txt\nropebwt3 plain2fmd -o bwt.fmd bwt.txt\n```\n\nThese command lines construct a BWT for both strands of the input sequences.\nYou can skip the reverse strand by adding option `-R`.\nIf you provide multiple files on a `build` command line, ropebwt3 internally\nwill run `build` on each input file and then incrementally merge each\nindividual BWT to the final BWT.\n\nAfter BWT construction, you will probably want to generate a sampled suffix array\nwith:\n```sh\nropebwt3 ssa -o index.fmd.ssa -s8 -t32 index.fmd\n```\nThis stores one suffix array value per $`2^8`$ positions. The size of the\noutput file is roughly $`64\\cdot(n/2^s+m)`$, where $n$ is the number of symbols\nin the BWT and $m$ is the number of sequences. 
Furthermore, if you want to get\nthe contig names with `sw`, you need to prepare another file:\n```sh\ncat input*.fa.gz | seqtk comp | cut -f1,2 | gzip \u003e index.fmd.len.gz\n```\nIf the BWT is built from multiple files, make sure the order in `cat` is\nthe same as the order used for BWT construction.\n\n### \u003ca name=\"format\"\u003e\u003c/a\u003eBinary BWT file formats\n\nRopebwt3 uses two binary formats to store run-length encoded BWTs: the ropebwt2\nFMR format and the fermi FMD format. The FMR format is dynamic in that you can\nadd new sequences or merge BWTs to an existing FMR file. The same BWT does not\nnecessarily lead to the same FMR. The FMD format is simpler in structure,\nfaster to load, smaller in memory and can be memory-mapped. The two formats can\noften be used interchangeably in ropebwt3, but it is recommended to use FMR for BWT\nconstruction and FMD for sequence search. You can explicitly convert\nbetween the two formats with:\n```sh\nropebwt3 build -i in.fmd -bo out.fmr  # from static to dynamic format\nropebwt3 build -i in.fmr -do out.fmd  # from dynamic to static format\n```\n\u003c!--\n## \u003ca name=\"dev\"\u003e\u003c/a\u003eFor Developers\n\nYou can encode and decode a FMD file with [rld0.h](rld0.h) and\n[rld0.c](rld0.c). The two-file library also supports the rank() operator. 
Here\nis a small program to convert FMD to plain text:\n```c\n// compile with \"gcc -O3 rld0.c this.c\"; run with \"./a.out idx.fmd \u003e out.txt\"\n#include \u003cstdio.h\u003e\n#include \"rld0.h\"\nint main(int argc, char *argv[]) {\n  if (argc \u003c 2) return 1;\n  rld_t *e = rld_restore(argv[1]);\n  rlditr_t ei; // iterator\n  rld_itr_init(e, \u0026ei, 0);\n  int c;\n  int64_t i, l;\n  while ((l = rld_dec(e, \u0026ei, \u0026c, 0)) \u003e 0)\n    for (i = 0; i \u003c l; ++i) putchar(\"\\nACGTN\"[c]);\n  rld_destroy(e);\n  return 0;\n}\n```\nand to count a string in an FMD file:\n```c\n// compile with \"gcc -O3 rld0.c this.c\"; run with \"./a.out idx.fmd AGCATAG\"\n#include \u003cstdint.h\u003e\n#include \u003cstring.h\u003e\n#include \u003cstdio.h\u003e\n#include \"rld0.h\"\nint main(int argc, char *argv[]) {\n  if (argc \u003c 3) return 1;\n  rld_t *e = rld_restore(argv[1]);\n  uint64_t k = 0, l = e-\u003ecnt[6], ok[6], ol[6];\n  const char *s = argv[2];\n  int i, len = strlen(s);\n  for (i = len - 1; i \u003e= 0; --i) { // backward search\n    int c = s[i];\n    c = c=='A'?1:c=='C'?2:c=='G'?3:c=='T'?4:5;\n    rld_rank2a(e, k, l, ok, ol);\n    k = e-\u003ecnt[c] + ok[c];\n    l = e-\u003ecnt[c] + ol[c];\n    if (k == l) break;\n  }\n  printf(\"%ld\\n\", (long)(l - k));\n  rld_destroy(e);\n  return 0;\n}\n```\n--\u003e\n\n## \u003ca name=\"cite\"\u003e\u003c/a\u003eCitation\n\nRopebwt3 is described in\n\n\u003e Li (2024) BWT construction and search at the terabase scale, *Bioinformatics*, **40**:btae717.\n\u003e DOI:[10.1093/bioinformatics/btae717](https://doi.org/10.1093/bioinformatics/btae717)\n\n## \u003ca name=\"limit\"\u003e\u003c/a\u003eLimitations\n\n* Ropebwt3 is slow on the \"locate\" operation.\n\n[grlbwt]: https://github.com/ddiazdom/grlBWT\n[movi]: https://github.com/mohsenzakeri/Movi\n[bigbwt]: https://gitlab.com/manzai/Big-BWT\n[fm2]: https://github.com/lh3/fermi2\n[rb2]: https://github.com/lh3/ropebwt2\n[zenodo]: 
https://zenodo.org/records/11533210\n[rb2-paper]: https://academic.oup.com/bioinformatics/article/30/22/3274/2391324\n[fm-paper]: https://academic.oup.com/bioinformatics/article/28/14/1838/218887\n[atb02]: https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/\n[bwasw]: https://pubmed.ncbi.nlm.nih.gov/20080505/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fropebwt3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flh3%2Fropebwt3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flh3%2Fropebwt3/lists"}