https://github.com/sophiaconsulting/fast-suffix-array
An extremely fast text search and analysis library, written in Rust. Python, JavaScript/WASM, and CLI.
- Host: GitHub
- URL: https://github.com/sophiaconsulting/fast-suffix-array
- Owner: sophiaconsulting
- License: mit
- Created: 2026-04-05T08:01:10.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-07T21:05:24.000Z (2 months ago)
- Last Synced: 2026-04-08T02:03:47.478Z (2 months ago)
- Topics: bioinformatics, duplicate-detection, fm-index, npm, python, rust, suffix-array, text-search, wasm
- Language: Rust
- Homepage: https://docs.fastsuffix.dev
- Size: 18.5 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
README
fast-suffix-array
Suffix arrays for every platform: Rust, Python, JavaScript/WASM, and the command line.
---
The suffix array is the foundational data structure behind full-text search, data compression, bioinformatics, and duplicate detection. This library provides O(n) suffix array construction in pure Rust, compiling to both native code and WebAssembly.
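To make the data structure concrete, here is a minimal pure-Python sketch of a suffix array and a binary-search lookup over it. It is an illustration only: the helper names `naive_suffix_array` and `find_all` are hypothetical, the construction is a plain comparison sort rather than the O(n) algorithm this library uses, and none of it is the library's API.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Start offsets of every suffix of `text`, in lexicographic order.

    Plain comparison sort, far slower than O(n) construction.
    Illustration only, not this library's implementation.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text: str, sa: list[int], pattern: str) -> list[int]:
    """Return every offset where `pattern` occurs, using two binary searches."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    sa = naive_suffix_array("banana")
    print(sa)                            # [5, 3, 1, 0, 4, 2]
    print(find_all("banana", sa, "ana"))  # [1, 3]
```

Because all suffixes that share a prefix sit next to each other in sorted order, every occurrence of a pattern forms one contiguous slice of the array, which is what makes arbitrary substring lookup possible.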
A single `pip install` or `npm install` gives you suffix arrays, LCP arrays, FM-index, BWT, longest common substring, duplicate phrase detection, LPF arrays, and sentence splitting. The library ships as a Rust crate (`fast-suffix-array-core`), Python package (`fast-suffix-array` via PyO3), npm/WASM package (`fast-suffix-array`), and CLI tool (`fsa`).
It is the **only WASM suffix array on npm** (21x faster than mnemonist) and the **only WASM FM-index on npm** (1,287x faster than `indexOf` for multi-pattern search). On PyPI, it sits alongside [math-hiyoko's fm-index](https://pypi.org/project/fm-index/) as one of two actively maintained FM-indexes, differentiating on breadth: SA, LCP, BWT, LCS, LPF, duplicate detection, and WASM browser support in a single package.
It is also the **only library in any language** for word-boundary-aware duplicate phrase detection -- the same O(n) suffix array algorithm Google uses to deduplicate training corpora, accessible via pip and npm.
## The substring gap
Suffix-array-based search finds arbitrary substrings that word-tokenizing search libraries miss. Every popular JavaScript search library -- [fuse.js](https://www.npmjs.com/package/fuse.js) (8.4M downloads/week), [lunr](https://www.npmjs.com/package/lunr) (5.6M), [minisearch](https://www.npmjs.com/package/minisearch) (900K) -- tokenizes text into words. They answer "which documents contain these words?" but cannot reliably find substrings that cut across word boundaries, such as `"ix arr"` or `"ATTA"` (a DNA motif inside `GATTACA`). In the comparison below, lunr and minisearch match only the two queries that contain a complete word (`fox`, `brown`); the FM-index matches all six.
| Query | fuse.js | lunr | minisearch | FM-index |
|-------|---------|------|------------|----------|
| `"rown fox"` | MISS | FOUND | FOUND | **FOUND** |
| `"ATTA"` | MISS | MISS | MISS | **FOUND** |
| `"Script dev"` | MISS | MISS | MISS | **FOUND** |
| `"ows-Whe"` | MISS | MISS | MISS | **FOUND** |
| `"ix arr"` | MISS | MISS | MISS | **FOUND** |
| `"uick brown f"` | MISS | FOUND | FOUND | **FOUND** |
| **Score** | 0/6 | 2/6 | 2/6 | **6/6** |
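The gap in the table can be reproduced with a toy model. The sketch below compares a strict whole-word inverted index with a plain substring scan over two tiny documents. It is a simplified assumption about how token-based search behaves: the real libraries add stemming, fuzzy matching, and OR semantics, so their results can differ, as the table shows. The names `token_search`, `substring_search`, and `docs` are hypothetical.

```python
import re

# Two tiny example documents (hypothetical data).
docs = {
    "doc1": "the quick brown fox",
    "doc2": "GATTACA",
}


def token_search(query: str, docs: dict[str, str]) -> list[str]:
    """Whole-word lookup: every query token must equal a document token."""
    index: dict[str, set[str]] = {}
    for doc_id, text in docs.items():
        for tok in re.findall(r"\w+", text.lower()):
            index.setdefault(tok, set()).add(doc_id)

    hits = None
    for tok in re.findall(r"\w+", query.lower()):
        matched = index.get(tok, set())
        hits = matched if hits is None else hits & matched
    return sorted(hits or [])


def substring_search(query: str, docs: dict[str, str]) -> list[str]:
    """Character-level lookup: the query may start and end anywhere."""
    return sorted(doc_id for doc_id, text in docs.items() if query in text)


if __name__ == "__main__":
    for q in ["brown fox", "rown fox", "ATTA"]:
        print(q, token_search(q, docs), substring_search(q, docs))
```

The substring scan here is linear in the text size per query. A suffix array or FM-index answers the same question from a prebuilt index instead of rescanning the text.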
## Install
```bash
pip install fast-suffix-array # Python (wheels for Linux, macOS, Windows)
npm install fast-suffix-array # JavaScript / WASM (Node.js + browsers)
cargo install fast-suffix-array-core --features cli # CLI
```
```toml
# Rust — add to Cargo.toml
[dependencies]
fast-suffix-array-core = "0.1"
```
Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).
## Quick start
### Python
```python
# FM-index: build once, query in O(p log sigma) time
from fast_suffix_array import FMIndex

# Read the file in binary mode and close it once the bytes are loaded
with open("genome.fasta", "rb") as f:
    index = FMIndex(f.read())

index.count(b"GATTACA")   # O(p log sigma), independent of text length
index.locate(b"GATTACA")  # numpy array of all positions
```
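For readers curious why `count` does not depend on the text length, here is a minimal pure-Python sketch of FM-index backward search. It is not this library's implementation. It assumes a naive suffix sort, a `"\0"` sentinel, and a full occurrence table, which gives O(p) queries at O(n * sigma) memory; a compressed occurrence structure such as a wavelet tree is the usual way to reach the O(p log sigma) bound quoted above. The function names are hypothetical.

```python
def build_fm_index(text: str):
    """Build the suffix array, C table, and Occ table for `text` (naively)."""
    assert "\0" not in text
    s = text + "\0"  # sentinel, sorts before every other character
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    bwt = "".join(s[i - 1] for i in sa)  # character preceding each sorted suffix
    alphabet = sorted(set(s))

    # C[c] = number of characters in s that are strictly smaller than c
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += s.count(c)

    # occ[c][i] = number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return sa, C, occ


def backward_search(pattern: str, sa, C, occ):
    """Return the half-open suffix-array interval [lo, hi) matching `pattern`."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):  # one step per pattern character
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


if __name__ == "__main__":
    sa, C, occ = build_fm_index("GATTACAGATTACA")
    lo, hi = backward_search("GATTACA", sa, C, occ)
    print(hi - lo)             # count: 2
    print(sorted(sa[lo:hi]))   # locate: [0, 7]
```

Each loop iteration narrows the interval using only table lookups, so the work is proportional to the pattern length rather than the text length.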
```python
# Find all repeated phrases in a text
import fast_suffix_array

results = fast_suffix_array.find_duplicates(
    "the quick brown fox jumped over the lazy dog. "
    "the quick brown fox sat on the mat.",
    min_words=3,
)
for r in results:
    print(f'{r["count"]}x ({r["number_of_words"]} words) "{r["phrase"]}"')
```
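As a rough mental model of what word-boundary duplicate detection reports, the sketch below counts repeated word n-grams by brute force and keeps only the longest ones. It is a hypothetical stand-in, not the library's algorithm: it is quadratic in the number of words, ignores sentence boundaries, and its output format is an assumption rather than what `find_duplicates` returns.

```python
import re
from collections import Counter


def naive_repeated_phrases(text: str, min_words: int = 3) -> dict[str, int]:
    """Map each maximal repeated phrase of >= min_words words to its count."""
    words = re.findall(r"\w+", text.lower())

    # Count every word n-gram of length >= min_words (brute force).
    counts = Counter()
    for n in range(min_words, len(words) + 1):
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    repeated = {p: c for p, c in counts.items() if c >= 2}

    def covered(phrase: str) -> bool:
        # A phrase is redundant if a longer repeated phrase contains it
        # on word boundaries and occurs the same number of times.
        return any(
            other != phrase
            and count == repeated[phrase]
            and f" {phrase} " in f" {other} "
            for other, count in repeated.items()
        )

    return {p: c for p, c in repeated.items() if not covered(p)}


if __name__ == "__main__":
    text = (
        "the quick brown fox jumped over the lazy dog. "
        "the quick brown fox sat on the mat."
    )
    print(naive_repeated_phrases(text, min_words=3))
    # {'the quick brown fox': 2}
```

A suffix-array approach avoids enumerating every n-gram, because repeated phrases show up as runs of adjacent suffixes sharing a long common prefix.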
```python
import fast_suffix_array

# doc_a, doc_b, and text are placeholders for your own inputs
# 466x faster than pylcs for longest common substring
lcs = fast_suffix_array.longest_common_substring(doc_a, doc_b)
# Count unique substrings in O(n)
count = fast_suffix_array.distinct_substring_count(text)
```
### JavaScript / WASM
```js
// FM-index: compressed full-text index with O(p log sigma) queries
const { WasmFMIndex } = require('fast-suffix-array');
const index = new WasmFMIndex(text, 2);
index.count('pattern'); // O(p log sigma) time
index.locate('pattern'); // Uint32Array of positions
index.heapSize(); // ~1.8x text size (vs 5-9x for suffix array)
```
```js
// Find all repeated phrases
const { process_test_string } = require('fast-suffix-array');
const results = process_test_string(
  text, // text to analyze
  text, // source text (offsets reference this)
  [],   // removed_ranges (empty = analyze source directly)
  4,    // min phrase length (words)
  9,    // min string length (chars)
  50,   // max phrase length (words)
  2,    // min words in substring
  true, // enable block detection
);
```
```js
// Drop-in replacement for mnemonist (21x faster)
const SuffixArray = require('fast-suffix-array/suffix-array');
const sa = new SuffixArray('banana');
sa.array; // [5, 3, 1, 0, 4, 2]
```
### CLI
```bash
# Find duplicate phrases in a file
fsa scan manuscript.md --top 20
# JSON output for scripting
fsa scan manuscript.md --json
# Compare two files (longest common substring)
fsa lcs chapter1.md chapter2.md
# File statistics (word count, distinct substrings)
fsa info manuscript.md
```
## Feature overview
| Capability | What it does | Complexity | Status |
|-----------|-------------|-----------|--------|
| **Suffix array** | O(n) SA-IS, 3 backends (pure Rust, divsufsort, libsais) | O(n) | Only WASM SA on npm. 21x faster than mnemonist. |
| **FM-index** | Compressed full-text search, count/locate, self-index | O(p log sigma) count | Only FM-index on npm. On PyPI alongside math-hiyoko. |
| **Duplicate detection** | Word-boundary phrases, Aho-Corasick dedup, block detection | O(n * w) | Only library for this in any language. |
| **String analysis** | LCS, distinct substrings, longest repeated, LPF | O(n+m) / O(n) | 466x faster than pylcs for LCS. |
| **BWT** | Forward/inverse Burrows-Wheeler transform | O(n) | Enables compression pipelines. |
| **Tokenization** | Sentence splitting, word boundaries, 50+ abbreviations | O(n) | CJK, Devanagari, Arabic, Ethiopic, Myanmar. |
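To illustrate the BWT row of the table, here is a minimal pure-Python sketch of the forward and inverse Burrows-Wheeler transform. It assumes a `"\0"` sentinel and a naive suffix sort, so it is far slower than the O(n) figure in the table and is not this library's implementation. The names `bwt_forward` and `bwt_inverse` are hypothetical.

```python
SENTINEL = "\0"  # assumed end marker; must sort before every text character


def bwt_forward(text: str) -> str:
    """Last column of the sorted rotations of text + sentinel."""
    assert SENTINEL not in text
    s = text + SENTINEL
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in sa)


def bwt_inverse(bwt: str) -> str:
    """Recover the original text from its BWT."""
    assert bwt.count(SENTINEL) == 1
    n = len(bwt)
    # Stable sort of last-column positions by character yields the first
    # column: first-column row j holds the character at bwt[order[j]].
    order = sorted(range(n), key=lambda i: bwt[i])

    out = []
    row = order[0]  # row 0 starts with the sentinel; its successor starts the text
    for _ in range(n - 1):
        row = order[row]      # step one character forward in the text
        out.append(bwt[row])
    return "".join(out)


if __name__ == "__main__":
    encoded = bwt_forward("banana")
    print(repr(encoded))          # 'annb\x00aa'
    print(bwt_inverse(encoded))   # banana
```

The transform groups identical characters that precede similar contexts, which is why it is a common first stage before run-length or entropy coding.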
## Documentation
Full API reference, benchmarks with methodology, architecture guide, and use cases:
**[docs.fastsuffix.dev](https://docs.fastsuffix.dev/)**
- [Benchmarks](https://docs.fastsuffix.dev/benchmarks/) -- full benchmark suite with methodology and raw data
- [API Reference](https://docs.fastsuffix.dev/api/) -- Python, JavaScript, CLI, and Rust APIs
- [Guide](https://docs.fastsuffix.dev/guide/) -- when to use what, how it works under the hood
- [Use Cases](https://docs.fastsuffix.dev/use-cases/) -- bioinformatics, browser search, data pipelines, and more
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, testing, and PR guidelines.
## License
MIT