https://github.com/hucsmn/suffix_array
suffix array construction and searching algorithms for in-memory binary data.
https://github.com/hucsmn/suffix_array
rust-library sais suffix-array
Last synced: 3 months ago
JSON representation
suffix array construction and searching algorithms for in-memory binary data.
- Host: GitHub
- URL: https://github.com/hucsmn/suffix_array
- Owner: hucsmn
- License: mit
- Created: 2019-04-16T12:05:56.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2022-09-10T04:27:27.000Z (almost 4 years ago)
- Last Synced: 2025-12-14T03:15:45.078Z (6 months ago)
- Topics: rust-library, sais, suffix-array
- Language: Rust
- Homepage:
- Size: 72.3 KB
- Stars: 12
- Watchers: 1
- Forks: 2
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
Suffix array
============
Suffix array construction and searching algorithms for in-memory binary data.
To index plain texts, burntsushi's [suffix](https://crates.io/crates/suffix) featuring utf-8 support is a better choice.
This crate uses the Amos Wenger's [C bindings](https://crates.io/crates/cdivsufsort) to Yuta Mori's [dissufsort](https://github.com/y-256/libdivsufsort) to construct suffix array, which is the fastest known suffix array construction algorithm (SACA) running in single thread that uses merely O(1) additional workspace.
In addition, I have implemented the space-efficient parallel SACA [pSACAK](https://ieeexplore.ieee.org/document/8371211) in a separate [crate](https://crates.io/crates/psacak). For now, it has not been thoroughly benchmarked as well as optimized.
TODO
=====
* [x] Benchmark using [Pizza&Chili Corpus](http://pizzachili.dcc.uchile.cl).
* [ ] Rewrite the benchmarks.
* [ ] Speed up searching by LCP array, a.k.a. construct enhanced suffix array (ESA).
There are two major classes of LCP array construction algorithms (LACA):
The first class produces LCP array from suffix array. AFAIK, these LACAs have to allocate additional auxiliary arrays (such as [ISA](https://dl.acm.org/doi/10.5555/647820.736222), [PLCP](https://doi.org/10.1007/978-3-642-02441-2_17), and [BWT](https://dl.acm.org/doi/10.5555/2790248.2790251)) to construct the LCP array.
The second class constructs the suffix array together with its LCP array.
They are fast and space-efficient but require sophisticated modifications on kind of SACAs based on induce copying (such as [sais-lite](https://arxiv.org/abs/1101.3448), [saca-k](https://dl.acm.org/doi/10.1016/j.ipl.2016.09.010), and [divsufsort](https://arxiv.org/abs/1710.01896)).
This crate would not provide ESA support until I figure out a proper way to implement it.
* [x] Speed up searching by bucket pointers, inspired by [this paper](https://dl.acm.org/doi/10.1145/1376616.1376683) (which uses a trie as index, interleaved arrays and text fingerprints to improve locality for ESA searching on disk).
See `SuffixArray::enable_buckets()`.
* [ ] Add compressed suffix array (CSA) support.
CSA safcrificed the speed quite a lot to gain some space efficiency, which is not that necessary.
* [x] Serialization/deserialization. Enable the optional `pack` feature to use those APIs. This feature is based on Paul Masurel's [bitpacking](https://crates.io/crates/bitpacking).
* [x] Rewrite suffix array construction algorithm.
Currently, this crate uses [dissufsort](https://github.com/y-256/libdivsufsort) to construct all the suffix arrays.
* [ ] Look into other one's efforts on improving libdivsufsort and multikey quick sort: [1](http://panthema.net/2013/parallel-string-sorting/), [2](https://github.com/akamiru/sort/wiki/General-Description-of-DAware), and [3](https://github.com/jlabeit/parallel-divsufsort).