{"id":48459379,"url":"https://github.com/sophiaconsulting/fast-suffix-array","last_synced_at":"2026-04-12T06:01:06.944Z","repository":{"id":349306838,"uuid":"1201824551","full_name":"sophiaconsulting/fast-suffix-array","owner":"sophiaconsulting","description":"An extremely fast text search and analysis library, written in Rust. Python, JavaScript/WASM, and CLI.","archived":false,"fork":false,"pushed_at":"2026-04-07T21:05:24.000Z","size":19422,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-08T02:03:47.478Z","etag":null,"topics":["bioinformatics","duplicate-detection","fm-index","npm","python","rust","suffix-array","text-search","wasm"],"latest_commit_sha":null,"homepage":"https://docs.fastsuffix.dev","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sophiaconsulting.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"vmasrani"}},"created_at":"2026-04-05T08:01:10.000Z","updated_at":"2026-04-07T21:05:31.000Z","dependencies_parsed_at":null,"dependency_job_id":"55143adc-dc03-4399-b02d-9a0bc55e7d79","html_url":"https://github.com/sophiaconsulting/fast-suffix-array","commit_stats":null,"previous_names":["sophiaconsulting/fast-suffix-array"],"tags_count":7,"template":false,"template_full_name":null,"purl":"pkg:github/sophiaconsulting/fast-suffix-array","repository_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/sophiaconsulting%2Ffast-suffix-array","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sophiaconsulting%2Ffast-suffix-array/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sophiaconsulting%2Ffast-suffix-array/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sophiaconsulting%2Ffast-suffix-array/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sophiaconsulting","download_url":"https://codeload.github.com/sophiaconsulting/fast-suffix-array/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sophiaconsulting%2Ffast-suffix-array/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31583290,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-08T14:31:17.711Z","status":"online","status_checked_at":"2026-04-09T02:00:06.848Z","response_time":112,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bioinformatics","duplicate-detection","fm-index","npm","python","rust","suffix-array","text-search","wasm"],"created_at":"2026-04-07T01:02:39.746Z","updated_at":"2026-04-09T03:01:06.559Z","avatar_url":"https://github.com/sophiaconsulting.png","language":"Rust","funding_links":["https://github.com/sponsors/vmasrani"],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg 
src=\"fast-suffix-array-logo.svg\" alt=\"fast-suffix-array\" width=\"120\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003efast-suffix-array\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eSuffix arrays for every platform: Rust, Python, JavaScript/WASM, and the command line.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/sophiaconsulting/fast-suffix-array/actions/workflows/ci.yml\"\u003e\u003cimg src=\"https://github.com/sophiaconsulting/fast-suffix-array/actions/workflows/ci.yml/badge.svg\" alt=\"CI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://crates.io/crates/fast-suffix-array-core\"\u003e\u003cimg src=\"https://img.shields.io/crates/v/fast-suffix-array-core.svg\" alt=\"Crates.io\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.npmjs.com/package/fast-suffix-array\"\u003e\u003cimg src=\"https://img.shields.io/npm/v/fast-suffix-array.svg\" alt=\"npm\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://pypi.org/project/fast-suffix-array\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/fast-suffix-array.svg\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-MIT-blue.svg\" alt=\"License: MIT\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/wasm-sa.svg\" alt=\"WASM suffix array vs mnemonist\" width=\"800\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/assets/fm-scaling.svg\" alt=\"FM-index query scaling in JavaScript/WASM\" width=\"800\" /\u003e\n\u003c/p\u003e\n\n---\n\nThe suffix array is the foundational data structure behind full-text search, data compression, bioinformatics, and duplicate detection. 
This library provides O(n) suffix array construction in pure Rust, compiling to both native code and WebAssembly.\n\nA single `pip install` or `npm install` gives you suffix arrays, LCP arrays, FM-index, BWT, longest common substring, duplicate phrase detection, LPF arrays, and sentence splitting. The library ships as a Rust crate (`fast-suffix-array-core`), Python package (`fast-suffix-array` via PyO3), npm/WASM package (`fast-suffix-array`), and CLI tool (`fsa`).\n\nIt is the **only WASM suffix array on npm** (21x faster than mnemonist) and the **only WASM FM-index on npm** (1,287x faster than `indexOf` for multi-pattern search). On PyPI, it sits alongside [math-hiyoko's fm-index](https://pypi.org/project/fm-index/) as one of two actively maintained FM-indexes, differentiating on breadth: SA, LCP, BWT, LCS, LPF, duplicate detection, and WASM browser support in a single package.\n\nIt is also the **only library in any language** for word-boundary-aware duplicate phrase detection -- the same O(n) suffix array algorithm Google uses to deduplicate training corpora, accessible via pip and npm.\n\n## The substring gap\n\nSuffix-array-based search finds arbitrary substrings that word-tokenizing search libraries miss. Every popular JavaScript search library -- [fuse.js](https://www.npmjs.com/package/fuse.js) (8.4M downloads/week), [lunr](https://www.npmjs.com/package/lunr) (5.6M), [minisearch](https://www.npmjs.com/package/minisearch) (900K) -- tokenizes text into words. 
They answer \"which documents contain these words?\" but cannot find substrings like `\"rown fox\"` or `\"ATTA\"` (a DNA motif inside `GATTACA`).\n\n| Query | fuse.js | lunr | minisearch | FM-index |\n|-------|---------|------|------------|----------|\n| `\"rown fox\"` | MISS | FOUND | FOUND | **FOUND** |\n| `\"ATTA\"` | MISS | MISS | MISS | **FOUND** |\n| `\"Script dev\"` | MISS | MISS | MISS | **FOUND** |\n| `\"ows-Whe\"` | MISS | MISS | MISS | **FOUND** |\n| `\"ix arr\"` | MISS | MISS | MISS | **FOUND** |\n| `\"uick brown f\"` | MISS | FOUND | FOUND | **FOUND** |\n| **Score** | 0/6 | 2/6 | 2/6 | **6/6** |\n\n## Install\n\n```bash\npip install fast-suffix-array     # Python (wheels for Linux, macOS, Windows)\nnpm install fast-suffix-array     # JavaScript / WASM (Node.js + browsers)\ncargo install fast-suffix-array-core --features cli   # CLI\n```\n\n```toml\n# Rust — add to Cargo.toml\n[dependencies]\nfast-suffix-array-core = \"0.1\"\n```\n\nPlatforms: Linux (x86_64, aarch64), macOS (x86_64, Apple Silicon), Windows (x86_64), WASM (any browser/runtime).\n\n## Quick start\n\n### Python\n\n```python\n# FM-index: build once, query in O(p log sigma) time\nfrom fast_suffix_array import FMIndex\n\nindex = FMIndex(open(\"genome.fasta\", \"rb\").read())\nindex.count(b\"GATTACA\")        # O(p log sigma) — independent of text length\nindex.locate(b\"GATTACA\")       # numpy array of all positions\n```\n\n```python\n# Find all repeated phrases in a text\nimport fast_suffix_array\n\nresults = fast_suffix_array.find_duplicates(\n    \"the quick brown fox jumped over the lazy dog. 
\"\n    \"the quick brown fox sat on the mat.\",\n    min_words=3,\n)\n\nfor r in results:\n    print(f'{r[\"count\"]}x  ({r[\"number_of_words\"]} words)  \"{r[\"phrase\"]}\"')\n```\n\n```python\n# 466x faster than pylcs for longest common substring\nlcs = fast_suffix_array.longest_common_substring(doc_a, doc_b)\n\n# Count unique substrings in O(n)\ncount = fast_suffix_array.distinct_substring_count(text)\n```\n\n### JavaScript / WASM\n\n```js\n// FM-index: compressed full-text index with O(p log sigma) queries\nconst { WasmFMIndex } = require('fast-suffix-array');\nconst index = new WasmFMIndex(text, 2);\nindex.count('pattern');     // O(p log sigma) time\nindex.locate('pattern');    // Uint32Array of positions\nindex.heapSize();            // ~1.8x text size (vs 5-9x for suffix array)\n```\n\n```js\n// Find all repeated phrases\nconst { process_test_string } = require('fast-suffix-array');\n\nconst results = process_test_string(\n  text,     // text to analyze\n  text,     // source text (offsets reference this)\n  [],       // removed_ranges (empty = analyze source directly)\n  4,        // min phrase length (words)\n  9,        // min string length (chars)\n  50,       // max phrase length (words)\n  2,        // min words in substring\n  true,     // enable block detection\n);\n```\n\n```js\n// Drop-in replacement for mnemonist (21x faster)\nconst SuffixArray = require('fast-suffix-array/suffix-array');\nconst sa = new SuffixArray('banana');\nsa.array;   // [5, 3, 1, 0, 4, 2]\n```\n\n### CLI\n\n```bash\n# Find duplicate phrases in a file\nfsa scan manuscript.md --top 20\n\n# JSON output for scripting\nfsa scan manuscript.md --json\n\n# Compare two files (longest common substring)\nfsa lcs chapter1.md chapter2.md\n\n# File statistics (word count, distinct substrings)\nfsa info manuscript.md\n```\n\n## Feature overview\n\n| Capability | What it does | Complexity | Status |\n|-----------|-------------|-----------|--------|\n| **Suffix array** | O(n) SA-IS, 3 
backends (pure Rust, divsufsort, libsais) | O(n) | Only WASM SA on npm. 21x faster than mnemonist. |\n| **FM-index** | Compressed full-text search, count/locate, self-index | O(p log sigma) count | Only WASM FM-index on npm. On PyPI alongside math-hiyoko. |\n| **Duplicate detection** | Word-boundary phrases, Aho-Corasick dedup, block detection | O(n * w) | Only library for this in any language. |\n| **String analysis** | LCS, distinct substrings, longest repeated, LPF | O(n+m) / O(n) | 466x faster than pylcs for LCS. |\n| **BWT** | Forward/inverse Burrows-Wheeler transform | O(n) | Enables compression pipelines. |\n| **Tokenization** | Sentence splitting, word boundaries, 50+ abbreviations | O(n) | CJK, Devanagari, Arabic, Ethiopic, Myanmar. |\n\n## Documentation\n\nFull API reference, benchmarks with methodology, architecture guide, and use cases:\n\n**[docs.fastsuffix.dev](https://docs.fastsuffix.dev/)**\n\n- [Benchmarks](https://docs.fastsuffix.dev/benchmarks/) -- full benchmark suite with methodology and raw data\n- [API Reference](https://docs.fastsuffix.dev/api/) -- Python, JavaScript, CLI, and Rust APIs\n- [Guide](https://docs.fastsuffix.dev/guide/) -- when to use what, how it works under the hood\n- [Use Cases](https://docs.fastsuffix.dev/use-cases/) -- bioinformatics, browser search, data pipelines, and more\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for setup, testing, and PR guidelines.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsophiaconsulting%2Ffast-suffix-array","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsophiaconsulting%2Ffast-suffix-array","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsophiaconsulting%2Ffast-suffix-array/lists"}