https://github.com/lh3/ropebwt3

BWT construction and search
bioinformatics bwt fm-index

Host: GitHub
URL: https://github.com/lh3/ropebwt3
Owner: lh3
License: other
Created: 2024-06-04T14:54:32.000Z (12 months ago)
Default Branch: master
Last Pushed: 2024-10-16T01:43:55.000Z (7 months ago)
Last Synced: 2024-10-17T15:18:43.588Z (7 months ago)
Topics: bioinformatics, bwt, fm-index
Language: C
Homepage:
Size: 287 KB
Stars: 89
Watchers: 6
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: NEWS.md
- License: LICENSE.txt


## Getting Started
```sh
# Compile
git clone https://github.com/lh3/ropebwt3
cd ropebwt3
make # use "make omp=0" if your compiler doesn't support OpenMP

# Toy examples
echo -e 'AGG\nAGC' | ./ropebwt3 build -LR -
echo TGAACTCTACACAACATATTTTGTCACCAAG | ./ropebwt3 build -Ldo idx.fmd -
echo ACTCTACACAAgATATTTTGTCA | ./ropebwt3 mem -Ll10 idx.fmd -

# Download the prebuilt FM-index for 152 M. tuberculosis genomes
wget -O- https://zenodo.org/records/12803206/files/mtb152.tar.gz?download=1 | tar -zxf -

# Count super-maximal exact matches (no contig positions)
echo ACCTACAACACCGGTGGCTACAACGTGG | ./ropebwt3 mem -L mtb152.fmd -
# Local alignment
echo ACCTACAACACCGGTaGGCTACAACGTGG | ./ropebwt3 sw -Lm20 mtb152.fmd -
# Retrieve R15311, the 46th genome in the collection. 90=(46-1)*2
./ropebwt3 get mtb152.fmd 90 > R15311.fa

# Download the index of 472 human long-read assemblies (18GB download size)
wget -O human472.fmr.gz https://zenodo.org/records/14854401/files/human472.fmr.gz
wget -O human472.fmd.ssa.gz https://zenodo.org/records/14854401/files/human472.fmd.ssa.gz
wget -O human472.fmd.len.gz https://zenodo.org/records/14854401/files/human472.fmd.len.gz
gzip -d human472.fmr.gz human472.fmd.ssa.gz # or use pigz for parallel decompression
./ropebwt3 build -i human472.fmr -do human472.fmd # convert to a faster format

# Find C4 alleles (the query is on the exon 26 of C4A)
echo CCAGGACCCCTGTCCAGTGTTAGACAGGAGCATGCAG | ./ropebwt3 sw -eN200 -Lm10 human472.fmd -
```

## Table of Contents

- [Getting Started](#start)
- [Introduction](#intro)
- [Usage](#use)
  - [Finding maximal exact matches](#mem)
  - [Local alignment](#bwasw)
  - [Haplotype diversity with end-to-end alignment](#e2e)
  - [Indexing](#build)
  - [Binary BWT file formats](#format)
- [Citation](#cite)
- [Limitations](#limit)

## Introduction

Ropebwt3 constructs the FM-index of a large DNA sequence set and searches for
matches against the FM-index. It is optimized for highly redundant sequence
sets such as a pangenome or sequence reads at high coverage. Ropebwt3 can
losslessly compress 7.3Tb of common bacterial genomes into a 30GB run-length
encoded BWT file and report supermaximal exact matches (SMEMs) or local
alignments with mismatches and gaps.

Prebuilt ropebwt3 indices can be downloaded [from Zenodo][zenodo].

## Usage

A full ropebwt3 index consists of three files:

* `.fmd`: run-length encoded BWT that supports the rank operation. It is
generated by the `build` command. By default, the $i$-th sequence in the input
is the $2i$-th sequence in the BWT and its reverse complement is the
$(2i+1)$-th sequence. Some commands assume such ordering.

* `.fmd.ssa`: sampled suffix array, generated by the `ssa` command. For
now, it is only needed for reporting coordinates in the PAF output of the
`sw` command.

* `.fmd.len.gz`: list of sequence names and lengths. It is generated
with third-party tools/scripts, for example, with `seqtk comp input.fa | cut
-f1,2 | gzip`. This file is needed for reporting sequence names and lengths
in the PAF output.
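
The default sequence ordering described for `.fmd` can be turned into a small index calculation. The sketch below assumes 1-based ranks in the input, as in the "46th genome" example in Getting Started; `fwd_index` and `rev_index` are hypothetical helper names, not part of ropebwt3.

```sh
# Map a 1-based input rank to sequence indices in the BWT, assuming the
# default both-strand ordering (forward at 2i, reverse complement at 2i+1,
# where i is the 0-based rank).
fwd_index() { echo $(( ($1 - 1) * 2 )); }      # forward strand
rev_index() { echo $(( ($1 - 1) * 2 + 1 )); }  # reverse complement

fwd_index 46   # 90, the index passed to "ropebwt3 get" in Getting Started
rev_index 46   # 91
```

This mapping does not apply to an index built with `-R`, which skips the reverse strand.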

### Finding maximal exact matches

A maximal exact match (MEM) is an exact alignment between the index and a query
that cannot be extended in either direction. A super MEM (SMEM) is a MEM that
is not contained in any other MEM on the query sequence. You can find the SMEMs
with
```sh
ropebwt3 mem -t4 -l31 bwt.fmd query.fa > matches.bed
```
In the output, the first three columns give the query sequence name, start and
end of a match and the fourth column gives the number of hits. Option `-l`
specifies the minimum SMEM length. A larger value helps performance.
This command does not output positions of SMEMs by default.
You can use option `-p` to get the positions of a subset of SMEMs.
In addition, you can use `--gap` to obtain regions not covered by long SMEMs or
`--cov` to get the total length of regions covered by long SMEMs.
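
The four-column output is easy to post-process with standard tools. The sketch below sums SMEM lengths per query. It assumes tab-separated columns and BED-style half-open intervals (length = end - start); `smem_total` is a hypothetical helper and the input records are made up for illustration.

```sh
# Sum SMEM lengths per query from "ropebwt3 mem" output read on stdin.
# Columns: query name, start, end, number of hits.
# Assumption: half-open intervals, so length = end - start.
smem_total() {
  awk -F'\t' '{ len[$1] += $3 - $2 }
              END { for (q in len) print q "\t" len[q] }' | sort
}

# Made-up records for illustration only; not real ropebwt3 output.
printf 'q1\t0\t31\t2\nq1\t40\t80\t1\nq2\t5\t50\t7\n' | smem_total
```

Because SMEMs on the same query may overlap, this sum can exceed the covered length that `--cov` reports.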

### Local alignment

Ropebwt3 implements a revised [BWA-SW algorithm][bwasw] to align query
sequences against an FM-index:
```sh
ropebwt3 sw -t4 -N25 -k11 bwt.fmd query.fa > aln.paf
```
Option `-N` effectively sets the bandwidth during alignment. A larger value
improves alignment accuracy at the cost of performance. Option `-k` initiates
alignments with an exact match.

Given a complete ropebwt3 index with sampled suffix array and sequence names,
the `sw` command outputs standard PAF but it only outputs one hit per query
even if there are multiple equally best hits. The number of hits in BWT is
written to the `rh` tag.
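
Since only one hit is reported per query, the `rh` tag is the place to look for multiplicity. The sketch below pulls it out of PAF records. It assumes the tag is written as an integer in SAM-style form (`rh:i:N`) after the 12 mandatory PAF columns; `paf_rh` is a hypothetical helper and the input lines are made up.

```sh
# Print query name and the value of the rh tag for each PAF record on stdin.
# Assumption: the tag appears as "rh:i:<count>" among the optional fields.
paf_rh() {
  awk -F'\t' '{
    rh = "NA"                              # reported when no rh tag is present
    for (i = 13; i <= NF; i++)
      if ($i ~ /^rh:i:/) rh = substr($i, 6)
    print $1 "\t" rh
  }'
}

# Made-up PAF lines for illustration only; not real ropebwt3 output.
printf 'q1\t37\t0\t37\t+\tctg1\t1000\t100\t137\t37\t37\t60\trh:i:3\n' | paf_rh
printf 'q2\t37\t0\t37\t+\tctg1\t1000\t100\t137\t37\t37\t60\n' | paf_rh
```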

**Local alignment is tens of times slower than finding SMEMs.** It is not designed
for aligning high-throughput sequence reads.

### Haplotype diversity with end-to-end alignment

With option `-e`, the `sw` command aligns the query sequence from end to end.
In this mode, ropebwt3 may output multiple suboptimal end-to-end hits.
This provides a way to retrieve similar haplotypes from the index.

The `hapdiv` command applies this algorithm to 101-mers in a query sequence and
outputs 1) query name, 2) query start, 3) query end, 4) number of distinct
alleles the 101-mer matches, 5) maximum edit distance observed,
6) number of haplotypes perfectly matching the 101-mer,
7-11) number of haplotypes with edit distance 1-5 from the 101-mer,
and 12) number of haplotypes with edit distance 6 or higher.
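
As one way to read these columns, the sketch below reports the fraction of matched haplotypes that match each 101-mer perfectly, taking the sum of columns 6-12 as the total. It assumes tab-separated output; `hapdiv_perfect_frac` is a hypothetical helper and the input record is made up.

```sh
# For each hapdiv record on stdin, print query name, start, end and the
# fraction of haplotypes at edit distance 0 (column 6) among all haplotypes
# counted in columns 6-12.
hapdiv_perfect_frac() {
  awk -F'\t' '{
    total = 0
    for (i = 6; i <= 12; i++) total += $i
    if (total > 0) printf "%s\t%d\t%d\t%.3f\n", $1, $2, $3, $6 / total
    else           printf "%s\t%d\t%d\tNA\n", $1, $2, $3
  }'
}

# Made-up record for illustration only; not real ropebwt3 output.
printf 'chr6\t100\t201\t3\t2\t90\t8\t2\t0\t0\t0\t0\n' | hapdiv_perfect_frac
```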

### Indexing

Ropebwt3 implements two algorithms for BWT construction. Although both
algorithms work for general sequences, you need to choose an algorithm based on
the input data types for the best performance.

```sh
# If not sure, use the general command line
ropebwt3 build -t24 -bo bwt.fmr file1.fa file2.fa filen.fa
# You can also append another file to an existing index
ropebwt3 build -t24 -i bwt-old.fmr -bo bwt-new.fmr filex.fa
# If each file is small, concatenate them together
cat file1.fa file2.fa filen.fa | ropebwt3 build -t24 -m2g -bo bwt.fmr -
# For short reads, use the old ropebwt2 algorithm and optionally apply RCLO (option -r)
ropebwt3 build -r -bo bwt.fmr reads.fq.gz
# Use grlBWT, which may be faster but uses working disk space
ropebwt3 fa2line genome1.fa genome2.fa genomen.fa > all.txt
grlbwt-cli all.txt -t 32 -T . -o bwt.grl
grl2plain bwt.rl_bwt bwt.txt
ropebwt3 plain2fmd -o bwt.fmd bwt.txt
```

These command lines construct a BWT for both strands of the input sequences.
You can skip the reverse strand by adding option `-R`.
If you provide multiple files on a `build` command line, ropebwt3 internally
runs `build` on each input file and then incrementally merges each
individual BWT into the final BWT.

After BWT construction, you will probably want to generate a sampled suffix array
with:
```sh
ropebwt3 ssa -o index.fmd.ssa -s8 -t32 index.fmd
```
This stores one suffix array value per $`2^8`$ positions. The size of the
output file is roughly $`64\cdot(n/2^s+m)`$, where $n$ is the number of symbols
in the BWT and $m$ is the number of sequences. Furthermore, if you want to get
the contig names with `sw`, you need to prepare another file:
```sh
cat input*.fa.gz | seqtk comp | cut -f1,2 | gzip > index.fmd.len.gz
```
If the BWT is built from multiple files, make sure the order in `cat` is
the same as the order used for BWT construction.
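
The size estimate for the sampled suffix array, $`64\cdot(n/2^s+m)`$, can be evaluated with shell arithmetic before running `ssa`. The sketch below is only a calculator for that expression: `ssa_size` is a hypothetical helper, the numbers are made up, and the unit of the result is not stated above (if it is bits, divide by 8 for bytes).

```sh
# Evaluate 64*(n/2^s + m) with integer arithmetic.
# n: number of symbols in the BWT, s: value given to -s, m: number of sequences.
# Integer division truncates n/2^s, which is fine for a rough estimate.
ssa_size() {
  n=$1; s=$2; m=$3
  echo $(( 64 * (n / (1 << s) + m) ))
}

# Hypothetical index: 1e9 symbols, -s8, 100 sequences.
ssa_size 1000000000 8 100   # 250006400
```

Raising `-s` by one roughly halves the first term, so the output shrinks as the sampling gets sparser.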

### Binary BWT file formats

Ropebwt3 uses two binary formats to store run-length encoded BWTs: the ropebwt2
FMR format and the fermi FMD format. The FMR format is dynamic in that you can
add new sequences or merge BWTs to an existing FMR file. The same BWT does not
necessarily lead to the same FMR. The FMD format is simpler in structure,
faster to load, smaller in memory and can be memory-mapped. The two formats can
often be used interchangeably in ropebwt3, but it is recommended to use FMR for BWT
construction and FMD for sequence search. You can explicitly convert
between the two formats with:
```sh
ropebwt3 build -i in.fmd -bo out.fmr # from static to dynamic format
ropebwt3 build -i in.fmr -do out.fmd # from dynamic to static format
```

## Citation

Ropebwt3 is described in

> Li (2024) BWT construction and search at the terabase scale, *Bioinformatics*, **40**:btae717.
> DOI:[10.1093/bioinformatics/btae717](https://doi.org/10.1093/bioinformatics/btae717)

## Limitations

* Ropebwt3 is slow on the "locate" operation.

[grlbwt]: https://github.com/ddiazdom/grlBWT
[movi]: https://github.com/mohsenzakeri/Movi
[bigbwt]: https://gitlab.com/manzai/Big-BWT
[fm2]: https://github.com/lh3/fermi2
[rb2]: https://github.com/lh3/ropebwt2
[zenodo]: https://zenodo.org/records/11533210
[rb2-paper]: https://academic.oup.com/bioinformatics/article/30/22/3274/2391324
[fm-paper]: https://academic.oup.com/bioinformatics/article/28/14/1838/218887
[atb02]: https://ftp.ebi.ac.uk/pub/databases/AllTheBacteria/Releases/0.2/
[bwasw]: https://pubmed.ncbi.nlm.nih.gov/20080505/
