https://github.com/biojulia/fmindexes.jl

FM-index for full-text search

Host: GitHub
URL: https://github.com/biojulia/fmindexes.jl
Owner: BioJulia
License: other
Created: 2015-08-06T06:16:05.000Z (almost 11 years ago)
Default Branch: master
Last Pushed: 2021-11-20T01:02:13.000Z (over 4 years ago)
Last Synced: 2025-04-15T21:29:09.457Z (about 1 year ago)
Language: Julia
Homepage:
Size: 55.7 KB
Stars: 20
Watchers: 13
Forks: 10
Open Issues: 3
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
- Code of conduct: CODE_OF_CONDUCT.md


# FMIndexes

[![Build Status](https://travis-ci.org/BioJulia/FMIndexes.jl.svg?branch=master)](https://travis-ci.org/BioJulia/FMIndexes.jl)

An [FM-index](https://en.wikipedia.org/wiki/FM-index) is a static, compact, and fast index for full-text search.

The index type, `FMIndex{w,T}`, is able to index an arbitrary byte sequence.

`w` is the number of bits required to encode the alphabet of the sequence, and `T` is the type used to store positions in the sequence.

```julia
julia> using FMIndexes

julia> fmindex = FMIndex("abracadabra");

julia> count("abra", fmindex)  # count the number of occurrences of a query
2

julia> for loc in locate("ra", fmindex)  # iterate over the positions of a query
           println(loc)
       end
10
3

julia> locateall("ra", fmindex)  # return all positions of a query
2-element Array{Int64,1}:
 10
  3

julia> String(restore(fmindex))  # restore the byte sequence from the index
"abracadabra"
```
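To make the semantics of `count` and `locateall` concrete, here is a brute-force reference written with Base Julia only. It is a sketch for cross-checking results, not how the package works internally, and `naive_locateall` is a hypothetical helper name that is not part of FMIndexes. It assumes positions are 1-based byte offsets, which matches the output shown above.

```julia
# Brute-force reference: try every 1-based start position and compare bytes.
# `naive_locateall` is a hypothetical helper, not part of FMIndexes.
function naive_locateall(query::AbstractString, text::AbstractString)
    q = codeunits(query)
    t = codeunits(text)
    positions = Int[]
    for i in 1:(length(t) - length(q) + 1)
        # Compare the window of `t` starting at `i` with the query bytes.
        if view(t, i:(i + length(q) - 1)) == q
            push!(positions, i)
        end
    end
    return positions
end

naive_locateall("ra", "abracadabra")            # [3, 10]
length(naive_locateall("abra", "abracadabra"))  # 2
```

The index reports the same positions, though not necessarily in ascending order (the example above prints `10` before `3`), so sort the result before comparing it with a linear scan.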

## Tips for efficient indexing

The following is a general constructor:

```julia
FMIndex(seq, σ=256; r=32, program=:SuffixArrays, mmap::Bool=false, opts...)
```

`seq` is expected to be a byte sequence; `seq[i]` should return a value of type `UInt8`.

If the alphabet of a sequence can be encoded in fewer than 8 bits, [IntArrays.jl](https://github.com/bicycle1885/IntArrays.jl) can help save space.

`σ` is the size of the alphabet; for example, if the sequence is a DNA sequence, setting `σ` to 4 (four nucleotides) is the best choice in terms of efficiency.

Setting `σ` larger than necessary wastes both query time and index space.
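As a rough illustration of why a small `σ` helps, the sketch below computes how many bits a symbol needs for a given alphabet size and encodes a DNA string into byte codes. The helper names `bits_per_symbol` and `encode_dna` are hypothetical, and the mapping of nucleotides to the codes `0:3` is an assumption: the text above does not say which byte values a sequence must use when `σ` is 4.

```julia
# Bits needed to distinguish σ symbols, i.e. to represent the codes 0:(σ - 1).
# Hypothetical helper; not part of FMIndexes.
bits_per_symbol(σ::Integer) = ndigits(σ - 1, base=2)

# Assumed encoding: the four nucleotides map to the byte codes 0x00..0x03.
const DNA_CODES = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)
encode_dna(s::AbstractString) = UInt8[DNA_CODES[c] for c in s]

bits_per_symbol(256)          # 8 bits for the default byte alphabet
bits_per_symbol(4)            # 2 bits for DNA
codes = encode_dna("GATTACA")
maximum(codes) < 4            # true: every code fits an alphabet of size 4
```

Under this assumption a DNA symbol needs a quarter of the bits of a plain byte, which is where the savings in index space come from.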

The positions of the sequence are sampled every `r` elements. This value trades query time against index space: the smaller `r` is, the faster positions can be located, but the larger the index becomes.
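The trade-off can be estimated with simple arithmetic. The sketch below assumes that one position out of every `r` is stored and that each stored position takes 4 bytes; both the helper names and the 4-byte figure are illustrative assumptions, and the package's actual memory layout may differ.

```julia
# Number of stored positions when one in every `r` is kept (rounded up).
sampled_positions(n::Integer, r::Integer) = cld(n, r)

# Approximate bytes used by the samples, assuming a fixed width per position.
sample_bytes(n::Integer, r::Integer; bytes_per_position::Integer=4) =
    sampled_positions(n, r) * bytes_per_position

n = Int64(2)^32  # sequence length
for r in (8, 32, 128)
    mib = sample_bytes(n, r) / 2^20
    println("r = ", r, ": about ", mib, " MiB of sampled positions")
end
```

With these assumptions the default `r=32` costs about 512 MiB of samples for a sequence of length 2^32, and each factor of 4 in `r` changes that figure by the same factor in the opposite direction.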

`program` selects the program used to construct the suffix array of the sequence. The [SuffixArrays.jl](https://github.com/quinnj/SuffixArrays.jl) package is used by default, but for a very long sequence the [pSAscan](https://www.cs.helsinki.fi/group/pads/pSAscan.html) program is recommended.

The `mmap` flag determines whether the suffix array is stored in a memory-mapped array. This flag may be necessary for a long sequence because the temporary suffix array often consumes more memory than the index itself (for instance, the suffix array of a sequence of length 2^32 consumes 16 GiB of RAM).
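The 16 GiB figure follows from one suffix-array entry per sequence position at 4 bytes each. A minimal sketch of that calculation, where `suffix_array_bytes` is a hypothetical helper and the entry type is an assumption chosen to reproduce the figure:

```julia
# Size of a full suffix array: one entry of type `T` per sequence position.
# Hypothetical helper; not part of FMIndexes.
suffix_array_bytes(n::Integer, ::Type{T}) where {T<:Integer} = Int64(n) * sizeof(T)

gib(bytes) = bytes / 2^30

n = Int64(2)^32
gib(suffix_array_bytes(n, UInt32))  # 16.0 GiB with 4-byte entries
gib(suffix_array_bytes(n, Int64))   # 32.0 GiB with 8-byte entries
```

Either size can exceed the RAM of a typical workstation, which is the situation the memory-mapped option is meant for.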
