https://github.com/sophiaconsulting/fast-suffix-array

An extremely fast text search and analysis library, written in Rust. Python, JavaScript/WASM, and CLI.

bioinformatics duplicate-detection fm-index npm python rust suffix-array text-search wasm

Host: GitHub
URL: https://github.com/sophiaconsulting/fast-suffix-array
Owner: sophiaconsulting
License: mit
Created: 2026-04-05T08:01:10.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-04-07T21:05:24.000Z (2 months ago)
Last Synced: 2026-04-08T02:03:47.478Z (2 months ago)
Topics: bioinformatics, duplicate-detection, fm-index, npm, python, rust, suffix-array, text-search, wasm
Language: Rust
Homepage: https://docs.fastsuffix.dev
Size: 18.5 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE

# fast-suffix-array

Suffix arrays for every platform: Rust, Python, JavaScript/WASM, and the command line.

---

The suffix array is the foundational data structure behind full-text search, data compression, bioinformatics, and duplicate detection. This library provides O(n) suffix array construction in pure Rust, compiling to both native code and WebAssembly.
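
To make the data structure concrete, here is a minimal sketch of what a suffix array contains. It sorts the suffixes directly, which is far slower than the O(n) construction described above and is not the library's implementation; `naive_suffix_array` is a hypothetical helper name used only for illustration.

```python
def naive_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes of `text`, in sorted order.

    Sorting whole suffixes is roughly O(n^2 log n) in the worst case, so this
    is an illustration only, not a linear-time construction.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


sa = naive_suffix_array("banana")
print(sa)  # [5, 3, 1, 0, 4, 2]
for start in sa:
    print(start, "banana"[start:])  # suffixes appear in lexicographic order
```

Because the suffixes are sorted, every occurrence of a pattern sits in one contiguous run of the array, which is what makes binary search over it possible.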

A single `pip install` or `npm install` gives you suffix arrays, LCP arrays, FM-index, BWT, longest common substring, duplicate phrase detection, LPF arrays, and sentence splitting. The library ships as a Rust crate (`fast-suffix-array-core`), Python package (`fast-suffix-array` via PyO3), npm/WASM package (`fast-suffix-array`), and CLI tool (`fsa`).

It is the **only WASM suffix array on npm** (21x faster than mnemonist) and the **only WASM FM-index on npm** (1,287x faster than `indexOf` for multi-pattern search). On PyPI, it sits alongside [math-hiyoko's fm-index](https://pypi.org/project/fm-index/) as one of two actively maintained FM-indexes, differentiating on breadth: SA, LCP, BWT, LCS, LPF, duplicate detection, and WASM browser support in a single package.

It is also the **only library in any language** for word-boundary-aware duplicate phrase detection -- the same O(n) suffix array algorithm Google uses to deduplicate training corpora, accessible via pip and npm.

## The substring gap

Suffix-array-based search finds arbitrary substrings that word-tokenizing search libraries miss. The most popular JavaScript search libraries -- [fuse.js](https://www.npmjs.com/package/fuse.js) (8.4M downloads/week), [lunr](https://www.npmjs.com/package/lunr) (5.6M), [minisearch](https://www.npmjs.com/package/minisearch) (900K) -- tokenize text into words. They answer "which documents contain these words?" but miss queries that start and end inside words, such as `"ix arr"` or `"ATTA"` (a DNA motif inside `GATTACA`). In the comparison below, lunr and minisearch report a hit for `"rown fox"` and `"uick brown f"`, most likely because each of those queries still contains a whole indexed word.

| Query | fuse.js | lunr | minisearch | FM-index |
|-------|---------|------|------------|----------|
| `"rown fox"` | MISS | FOUND | FOUND | **FOUND** |
| `"ATTA"` | MISS | MISS | MISS | **FOUND** |
| `"Script dev"` | MISS | MISS | MISS | **FOUND** |
| `"ows-Whe"` | MISS | MISS | MISS | **FOUND** |
| `"ix arr"` | MISS | MISS | MISS | **FOUND** |
| `"uick brown f"` | MISS | FOUND | FOUND | **FOUND** |
| **Score** | 0/6 | 2/6 | 2/6 | **6/6** |
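
The gap can be reproduced with a toy model. The sketch below compares a simplified word index (a hit if any whole query term is an indexed word) with a binary search over sorted suffixes. The word-index model is an assumption made for illustration and is not how fuse.js, lunr, or minisearch are implemented; `word_tokens`, `token_match`, and `suffix_match` are hypothetical helpers.

```python
def word_tokens(text: str) -> set[str]:
    """Lowercased whitespace-separated words, a toy stand-in for a word index."""
    return set(text.lower().split())


def token_match(tokens: set[str], query: str) -> bool:
    """Toy model (an assumption): hit if any whole query term is an indexed word."""
    return any(term in tokens for term in query.lower().split())


def suffix_match(text: str, sa: list[int], pattern: str) -> bool:
    """Binary-search the sorted suffixes for one that starts with `pattern`."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        # Compare only the first len(pattern) characters of the suffix.
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text.startswith(pattern, sa[lo])


text = "the quick brown fox builds a suffix array for GATTACA"
tokens = word_tokens(text)
sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction

for query in ("rown fox", "ATTA", "ix arr"):
    print(f"{query!r}: words={token_match(tokens, query)}, "
          f"suffixes={suffix_match(text, sa, query)}")
```

In this model `"rown fox"` is a word hit only because `fox` is a whole word, while `"ATTA"` and `"ix arr"` are found by the suffix search alone.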

## Install

```bash
pip install fast-suffix-array     # Python (wheels for Linux, macOS, Windows)
npm install fast-suffix-array     # JavaScript / WASM (Node.js + browsers)
cargo install fast-suffix-array-core --features cli   # CLI
```

```toml
# Rust: add to Cargo.toml
[dependencies]
fast-suffix-array-core = "0.1"
```

Platforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).

## Quick start

### Python

```python
# FM-index: build once, query in O(p log sigma) time
from fast_suffix_array import FMIndex

# Read the file as bytes; the context manager closes the handle promptly
with open("genome.fasta", "rb") as f:
    index = FMIndex(f.read())

index.count(b"GATTACA")        # O(p log sigma), independent of text length
index.locate(b"GATTACA")       # numpy array of all positions
```
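
For readers curious why `count` does not depend on the text length, the following is a minimal sketch of FM-index backward search in plain Python. It is not the library's implementation: `build_bwt_index` and `backward_count` are hypothetical names, the text is assumed not to contain the `"\0"` sentinel, and the occurrence count uses a linear scan where a real FM-index would use a rank structure to reach the stated query bound.

```python
from collections import Counter


def build_bwt_index(text: str) -> tuple[str, dict[str, int]]:
    """Return the BWT of `text` and the C table used by backward search."""
    text += "\0"  # sentinel, assumed absent from the text and smaller than all characters
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    bwt = "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the sentinel
    # c_table[ch] = number of characters in the text strictly smaller than ch
    c_table, total = {}, 0
    counts = Counter(text)
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return bwt, c_table


def backward_count(bwt: str, c_table: dict[str, int], pattern: str) -> int:
    """Count occurrences of `pattern` by narrowing a suffix-array range."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):  # one step per pattern character
        if ch not in c_table:
            return 0
        # bwt.count is a linear scan here; a rank structure would replace it.
        lo = c_table[ch] + bwt.count(ch, 0, lo)
        hi = c_table[ch] + bwt.count(ch, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, c_table = build_bwt_index("banana")
print(backward_count(bwt, c_table, "ana"))  # 2 (positions 1 and 3, overlapping)
```

The loop runs once per pattern character, so with constant-time or logarithmic-time rank queries the cost depends on the pattern length and alphabet size, not on the size of the indexed text.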

```python
# Find all repeated phrases in a text
import fast_suffix_array

results = fast_suffix_array.find_duplicates(
    "the quick brown fox jumped over the lazy dog. "
    "the quick brown fox sat on the mat.",
    min_words=3,
)

for r in results:
    print(f'{r["count"]}x  ({r["number_of_words"]} words)  "{r["phrase"]}"')
```
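
As a rough picture of what duplicate phrase detection reports, the sketch below counts word n-grams and keeps the longest repeated ones. This is a deliberately simple substitute technique (n-gram counting), not the suffix-array algorithm the library describes, and its output format is not the library's; `repeated_phrases` is a hypothetical helper.

```python
from collections import Counter


def repeated_phrases(text: str, min_words: int = 3) -> dict[str, int]:
    """Map each maximal repeated phrase of at least `min_words` words to its count."""
    words = text.lower().split()
    found: dict[str, int] = {}
    # Try the longest phrases first so shorter fragments can be skipped.
    for n in range(len(words) - 1, min_words - 1, -1):
        grams = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
        for phrase, count in grams.items():
            if count < 2:
                continue
            # Skip a phrase already covered by a longer one that repeats as often.
            # Padding with spaces keeps the containment check on word boundaries.
            if any(f" {phrase} " in f" {longer} " and c >= count
                   for longer, c in found.items()):
                continue
            found[phrase] = count
    return found


sample = (
    "the quick brown fox jumped over the lazy dog. "
    "the quick brown fox sat on the mat."
)
print(repeated_phrases(sample, min_words=3))  # {'the quick brown fox': 2}
```

This approach re-counts every n-gram length, so it scales poorly on long texts, which is the cost a suffix-array-based method is meant to avoid.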

```python
import fast_suffix_array

# doc_a, doc_b, and text are placeholders for inputs you supply

# 466x faster than pylcs for longest common substring
lcs = fast_suffix_array.longest_common_substring(doc_a, doc_b)

# Count unique substrings in O(n)
count = fast_suffix_array.distinct_substring_count(text)
```
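
The distinct-substring count follows from a standard identity: a text of length n has n(n+1)/2 substrings counting repeats, and each adjacent pair of sorted suffixes shares exactly LCP-many prefixes that would otherwise be counted twice. The sketch below applies that identity with a naive suffix array and naive LCP computation. It is an illustration under those assumptions, not the library's O(n) code, and it does not count the empty substring; `distinct_substrings_sketch` is a hypothetical name.

```python
def distinct_substrings_sketch(text: str) -> int:
    """Count distinct non-empty substrings as n(n+1)/2 minus the sum of LCP values."""
    n = len(text)
    sa = sorted(range(n), key=lambda i: text[i:])  # naive suffix array
    total = n * (n + 1) // 2  # all substrings, duplicates included
    for prev, cur in zip(sa, sa[1:]):
        # Longest common prefix of two neighbouring suffixes in sorted order.
        lcp = 0
        while (prev + lcp < n and cur + lcp < n
               and text[prev + lcp] == text[cur + lcp]):
            lcp += 1
        total -= lcp  # these prefixes were already counted for `prev`
    return total


print(distinct_substrings_sketch("banana"))  # 15
```

For `"banana"` the neighbouring LCP values are 1, 3, 0, 0, 2, so the count is 21 - 6 = 15.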

### JavaScript / WASM

```js
// FM-index: compressed full-text index with O(p log sigma) queries
const { WasmFMIndex } = require('fast-suffix-array');

const index = new WasmFMIndex(text, 2);

index.count('pattern');     // O(p log sigma) time
index.locate('pattern');    // Uint32Array of positions
index.heapSize();           // ~1.8x text size (vs 5-9x for suffix array)
```

```js
// Find all repeated phrases
const { process_test_string } = require('fast-suffix-array');

const results = process_test_string(
  text,     // text to analyze
  text,     // source text (offsets reference this)
  [],       // removed_ranges (empty = analyze source directly)
  4,        // min phrase length (words)
  9,        // min string length (chars)
  50,       // max phrase length (words)
  2,        // min words in substring
  true,     // enable block detection
);
```

```js
// Drop-in replacement for mnemonist (21x faster)
const SuffixArray = require('fast-suffix-array/suffix-array');

const sa = new SuffixArray('banana');

sa.array;   // [5, 3, 1, 0, 4, 2]
```

### CLI

```bash
# Find duplicate phrases in a file
fsa scan manuscript.md --top 20

# JSON output for scripting
fsa scan manuscript.md --json

# Compare two files (longest common substring)
fsa lcs chapter1.md chapter2.md

# File statistics (word count, distinct substrings)
fsa info manuscript.md
```

## Feature overview

| Capability | What it does | Complexity | Status |
|-----------|-------------|-----------|--------|
| **Suffix array** | O(n) SA-IS, 3 backends (pure Rust, divsufsort, libsais) | O(n) | Only WASM SA on npm. 21x faster than mnemonist. |
| **FM-index** | Compressed full-text search, count/locate, self-index | O(p log sigma) count | Only WASM FM-index on npm. On PyPI alongside math-hiyoko. |
| **Duplicate detection** | Word-boundary phrases, Aho-Corasick dedup, block detection | O(n * w) | Only library for this in any language. |
| **String analysis** | LCS, distinct substrings, longest repeated, LPF | O(n+m) / O(n) | 466x faster than pylcs for LCS. |
| **BWT** | Forward/inverse Burrows-Wheeler transform | O(n) | Enables compression pipelines. |
| **Tokenization** | Sentence splitting, word boundaries, 50+ abbreviations | O(n) | CJK, Devanagari, Arabic, Ethiopic, Myanmar. |
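
To illustrate the forward and inverse Burrows-Wheeler transform listed above, here is a minimal round-trip sketch in plain Python. It is not the library's implementation or API: `bwt_forward` and `bwt_inverse` are hypothetical names, the forward pass uses a naive suffix sort instead of an O(n) construction, and the input is assumed not to contain the `"\0"` sentinel.

```python
def bwt_forward(text: str) -> str:
    """Return the BWT of `text` with a "\\0" sentinel appended."""
    text += "\0"  # sentinel, assumed absent from the text and smaller than all characters
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive suffix array
    return "".join(text[i - 1] for i in sa)  # character preceding each sorted suffix


def bwt_inverse(bwt: str) -> str:
    """Recover the original text (without the sentinel) using the LF mapping."""
    # ranks[i] = how many times bwt[i] has already appeared in bwt[:i]
    counts: dict[str, int] = {}
    ranks = []
    for ch in bwt:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    # first[ch] = row where the block of ch begins in the sorted first column
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # Row 0 is the suffix that starts with the sentinel; walk the text backwards.
    out, row = [], 0
    for _ in range(len(bwt) - 1):
        ch = bwt[row]
        out.append(ch)
        row = first[ch] + ranks[row]
    return "".join(reversed(out))


encoded = bwt_forward("banana")
print(repr(encoded))         # 'annb\x00aa'
print(bwt_inverse(encoded))  # banana
```

The transform groups equal characters into runs (`nn`, `aa`), which is why it is useful as a preprocessing step before compression.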

## Documentation

Full API reference, benchmarks with methodology, architecture guide, and use cases:

**[docs.fastsuffix.dev](https://docs.fastsuffix.dev/)**

- [Benchmarks](https://docs.fastsuffix.dev/benchmarks/) -- full benchmark suite with methodology and raw data

- [API Reference](https://docs.fastsuffix.dev/api/) -- Python, JavaScript, CLI, and Rust APIs

- [Guide](https://docs.fastsuffix.dev/guide/) -- when to use what, how it works under the hood

- [Use Cases](https://docs.fastsuffix.dev/use-cases/) -- bioinformatics, browser search, data pipelines, and more

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for setup, testing, and PR guidelines.

## License

MIT
