{"id":50288624,"url":"https://github.com/themankindproject/txtfp","last_synced_at":"2026-05-28T04:04:17.194Z","repository":{"id":354187629,"uuid":"1222429282","full_name":"themankindproject/txtfp","owner":"themankindproject","description":"Text fingerprinting: MinHash + LSH, SimHash, TLSH, ONNX semantic embeddings (BGE/E5/MiniLM), with byte-stable hash layouts and no_std + alloc default builds.","archived":false,"fork":false,"pushed_at":"2026-05-26T07:55:59.000Z","size":493,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T08:05:19.562Z","etag":null,"topics":["deduplication","embeddings","fingerprinting","locality-sensitive-hashing","lsh","minhash","near-duplicate","no-std","onnx","rust","sdk","semantic-search","simhash","text-processing","tlsh","wasm"],"latest_commit_sha":null,"homepage":"https://docs.rs/txtfp","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/themankindproject.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE-MIT","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-27T11:04:45.000Z","updated_at":"2026-05-26T07:55:54.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/themankindproject/txtfp","commit_stats":null,"previous_names":["themankindproject/txtfp"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/themankindproject/txtfp","repository_url":"https://repos.ecosyste.ms/api/v1
/hosts/GitHub/repositories/themankindproject%2Ftxtfp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/themankindproject%2Ftxtfp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/themankindproject%2Ftxtfp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/themankindproject%2Ftxtfp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/themankindproject","download_url":"https://codeload.github.com/themankindproject/txtfp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/themankindproject%2Ftxtfp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33593420,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deduplication","embeddings","fingerprinting","locality-sensitive-hashing","lsh","minhash","near-duplicate","no-std","onnx","rust","sdk","semantic-search","simhash","text-processing","tlsh","wasm"],"created_at":"2026-05-28T04:04:11.079Z","updated_at":"2026-05-28T04:04:17.186Z","avatar_url":"https://github.com/themankindproject.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 
txtfp\n\n[![Crates.io](https://img.shields.io/crates/v/txtfp.svg)](https://crates.io/crates/txtfp)\n[![Docs.rs](https://docs.rs/txtfp/badge.svg)](https://docs.rs/txtfp)\n[![License](https://img.shields.io/crates/l/txtfp)](LICENSE-MIT)\n[![Build Status](https://img.shields.io/github/actions/workflow/status/themankindproject/txtfp/ci.yml)](https://github.com/themankindproject/txtfp/actions)\n![Rust Version](https://img.shields.io/badge/rust-1.88%2B-blue)\n\nHigh-performance text fingerprinting SDK for Rust with **classical sketches** (MinHash + LSH, SimHash, TLSH), **Unicode-correct canonicalization**, and **semantic embeddings** (ONNX local).\n\n## Overview\n\n`txtfp` produces compact, deterministic, byte-stable hashes for text deduplication, near-duplicate detection, and semantic search:\n\n| Method      | Use case                          | Output         | Complexity      |\n| ----------- | --------------------------------- | -------------- | --------------- |\n| **MinHash** | Set-similarity dedup (Jaccard)    | `[u64; H]`     | O(n) sketch     |\n| **LSH**     | Sub-linear near-duplicate lookup  | bucketed index | O(1) avg query  |\n| **SimHash** | Bit-LSH near-dup (Hamming)        | `u64`          | O(n) sketch     |\n| **TLSH**    | Byte-level locality-sensitive hash | hex string     | O(n) sketch     |\n| **Embedding** | Semantic similarity (ANN)        | `Vec\u003cf32\u003e`     | model-dependent |\n\nIt is the text counterpart to [`audiofp`](https://crates.io/crates/audiofp) (audio) and [`imgfprint`](https://crates.io/crates/imgfprint) (image), and is consumed by the cross-modal `ucfp` integrator.\n\nPerfect for:\n\n- LLM training-set deduplication\n- RAG retrieval ranking\n- Content moderation\n- Plagiarism detection\n- Email / document de-dup at scale\n\n## Features\n\n- **Byte-stable hash layouts** — `MinHashSig\u003cH\u003e` and `SimHash64` are `repr(C)` `bytemuck::Pod`. 
Schema-versioned, semver-frozen, golden-byte enforced (18 fixtures).\n- **Production canonicalization** — NFKC + simple casefold + Bidi/format strip; defends against Trojan Source, ZWJ injection, NFC bombs.\n- **`no_std + alloc`-clean default features** — builds for `wasm32-unknown-unknown` out of the box.\n- **Streaming + offline fingerprinters** — every classical sketcher has both a `Fingerprinter` (whole-doc) and `StreamingFingerprinter` (chunk-fed) variant.\n- **Local embeddings** — `LocalProvider` (ONNX via `ort` + Hugging Face Hub) for semantic similarity. Cloud-hosted providers are out of scope; implement `EmbeddingProvider` against your HTTP client of choice.\n- **Markup helpers** — HTML → text, Markdown → text.\n- **Unicode security** — UTS #39 confusable skeleton behind the `security` feature.\n- **CJK tokenizer** — `jieba-rs` with `OnceLock`-lazy dictionary for Simplified Chinese.\n- **Cross-SDK parity** — `EmbeddingProvider`, `Embedding`, `semantic_similarity`, `FORMAT_VERSION` aligned with `imgfprint` / `audiofp`.\n\n## Installation\n\n```toml\n[dependencies]\ntxtfp = \"0.2\"\n```\n\n\u003e **Upgrading from 0.1.x?** v0.2.0 flipped the default hash family from\n\u003e `MurmurHash3_x64_128` to `Xxh3_64` for both MinHash and SimHash —\n\u003e signature bytes change. Pin to `0.1` or pass\n\u003e `HashFamily::MurmurHash3_x64_128` explicitly for v0.1.x / Python\n\u003e `datasketch` byte parity. v0.2.1 is API- and bytes-compatible with\n\u003e v0.2.0 (patch release).\n\n### Feature flags\n\n| Feature      | Default | Pulls                                                       |\n| ------------ | :-----: | ----------------------------------------------------------- |\n| `std`           |   ✅    | libstd. Without it, `no_std + alloc`.                          |\n| `minhash`       |   ✅    | MinHash sketcher.                                              |\n| `simhash`       |   ✅    | SimHash sketcher.                                              
|\n| `lsh`           |   ✅    | Banded LSH index over MinHash signatures.                      |\n| `markup`        |         | `html_to_text`, `markdown_to_text`.                            |\n| `cjk`           |         | `CjkTokenizer` (jieba, Simplified Chinese).                    |\n| `tlsh`          |         | `TlshFingerprinter`.                                           |\n| `security`      |         | UTS #39 confusable skeleton in the canonicalizer.              |\n| `serde`         |         | `Serialize` / `Deserialize` on signatures (incl. const-generic MinHash). |\n| `parallel`      |         | Rayon-powered batch helpers.                                   |\n| `semantic`      |         | `LocalProvider` via `ort` + Hugging Face Hub.                  |\n\nFor Japanese / Korean tokenization or PDF text extraction, implement\nthe `Tokenizer` / `Canonicalizer` upstream of this crate against\nyour preferred dedicated library (`lindera`, `vibrato`, `pdf-extract`,\n`poppler`, …). 
Cloud-hosted embedding endpoints (OpenAI, Voyage,\nCohere, …) are similarly out of scope; implement\n`EmbeddingProvider` against any HTTP client of your choice; see\n[`USAGE.md`](USAGE.md#implementing-embeddingprovider) for a\nworked example.\n\nMinimal build (no_std + alloc, MinHash + SimHash only; drops LSH):\n\n```toml\n[dependencies]\ntxtfp = { version = \"0.2\", default-features = false, features = [\"minhash\", \"simhash\"] }\n```\n\nWithout LSH (still on default `std`):\n\n```toml\n[dependencies]\ntxtfp = { version = \"0.2\", default-features = false, features = [\"std\", \"minhash\", \"simhash\"] }\n```\n\nWith local ONNX embeddings:\n\n```toml\n[dependencies]\ntxtfp = { version = \"0.2\", features = [\"semantic\"] }\n```\n\n## Quick Start\n\n```rust\nuse txtfp::{\n    Canonicalizer, Fingerprinter, MinHashFingerprinter,\n    ShingleTokenizer, WordTokenizer, jaccard,\n};\n\nfn main() -\u003e Result\u003c(), txtfp::Error\u003e {\n    let canon = Canonicalizer::default();\n    let tok = ShingleTokenizer { k: 5, inner: WordTokenizer };\n    let fp = MinHashFingerprinter::\u003c_, 128\u003e::new(canon, tok);\n\n    let a = fp.fingerprint(\"the quick brown fox jumps over the lazy dog at noon today\")?;\n    let b = fp.fingerprint(\"the quick brown fox jumps over the lazy dog at dusk today\")?;\n\n    let similarity = jaccard(\u0026a, \u0026b);\n    println!(\"Jaccard estimate: {:.2}\", similarity);\n\n    if similarity \u003e 0.6 {\n        println!(\"near-duplicate\");\n    }\n    Ok(())\n}\n```\n\n### LSH for sub-linear near-duplicate lookup\n\n```rust\n# #[cfg(feature = \"lsh\")]\n# fn demo() -\u003e Result\u003c(), txtfp::Error\u003e {\nuse txtfp::{\n    Canonicalizer, Fingerprinter, LshIndex, LshIndexBuilder,\n    MinHashFingerprinter, ShingleTokenizer, WordTokenizer,\n};\n\nlet canon = Canonicalizer::default();\nlet tok = ShingleTokenizer { k: 5, inner: WordTokenizer };\nlet fp = MinHashFingerprinter::\u003c_, 128\u003e::new(canon, tok);\n\n// Optimize 
bands/rows for a Jaccard threshold of 0.7.\nlet mut idx: LshIndex\u003c128\u003e = LshIndexBuilder::for_threshold(0.7, 128)?.build();\n\nidx.insert(0, fp.fingerprint(\"the quick brown fox jumps over the lazy dog at noon today\")?);\nidx.insert(1, fp.fingerprint(\"astronomers detect cosmic background radiation\")?);\n\nlet probe = fp.fingerprint(\"the quick brown fox jumps over the lazy dog at dusk today\")?;\nlet neighbours = idx.query_with_threshold(\u0026probe, 0.5);\nprintln!(\"near-duplicates: {neighbours:?}\");\n# Ok(()) }\n```\n\n## Documentation\n\nFor the complete API reference and worked examples, see [USAGE.md](USAGE.md).\n\n## Architecture\n\n### Pipeline\n\n```\ninput bytes\n    │\n    ▼\ncanonicalize  (NFKC + casefold + Bidi/format strip)\n    │\n    ▼\ntokenize      (Word | Grapheme | Shingle | CJK)\n    │\n    ▼\nsketch        (MinHash | SimHash | TLSH | Embedding)\n    │\n    ▼\ncompare       (jaccard | hamming | cosine_estimate | semantic_similarity)\n```\n\nEvery layer is independently swappable: pick a canonicalizer config, plug in any `Tokenizer`, choose a `HashFamily`, and the same input always produces the same byte-stable signature.\n\n### Signature byte layouts (frozen since v0.1.0)\n\n```\nMinHashSig\u003cH\u003e                       SimHash64\n├── schema: u16  (= 1)              └── 8 bytes (u64, little-endian)\n├── _pad:   [u8; 6] (zero)\n└── hashes: [u64; H], LE\n\nTotal size: 8 + 8*H bytes\n```\n\nThese layouts are enforced by 18 byte-frozen golden-test fixtures (`tests/data/golden/`). Failing a golden test is a hard breakage that requires a major-version bump.\n\n### Algorithms\n\n- **MinHash** uses double-hashing (Indyk–Motwani 1998 + Kirsch–Mitzenmacher 2008): one `xxh3_128` per shingle, then derive `H` slots as `low + (i * high)`. 
xxh3 is the default since v0.2.0; pass `HashFamily::MurmurHash3_x64_128` for `datasketch` byte parity.\n- **SimHash** is Charikar 2002: token-weighted bag, 64-lane signed accumulator, sign-extract.\n- **LSH** is banded: `bands * rows == H`. `LshIndexBuilder::for_threshold` numerically minimizes the sum of the false-positive integral over `[0, threshold]` and the false-negative integral over `[threshold, 1]` to pick the partition.\n- **TLSH** wraps `tlsh2` 128/1.\n- **Local embeddings** load HF Hub ONNX models, tokenize with `tokenizers`, run `ort` 2.0, and pool with `Pooling::{Cls, Mean, MeanNoNorm, Max}`. The pooling default is looked up per-model (BGE → Cls, E5 → Mean, etc.).\n\n## Performance\n\nSingle-thread throughput on a 2024-class x86_64 laptop, **fat-LTO release with `RUSTFLAGS=\"-C target-cpu=native\"` and mimalloc** as the benches' global allocator, measured with `cargo bench --features lsh` over the 5 KB `lorem_ipsum` (ASCII) corpus:\n\nv0.2.0+ baseline (`HashFamily::Xxh3_64` default):\n\n| Operation                    | Time        | Throughput            |\n| ---------------------------- | ----------- | --------------------- |\n| MinHash sketch (h=128)       | ~110 µs/doc | **~9K docs/sec**      |\n| MinHash sketch (h=64)        | ~76 µs/doc  | ~13K docs/sec         |\n| SimHash sketch (b=64)        | ~205 µs/doc | ~5K docs/sec¹         |\n| Canonicalize NFKC (ASCII)    | ~540 ns/doc | ~1.9M docs/sec        |\n| LSH insert (h=128)           | ~1.9 µs/sig | ~530K signatures/sec  |\n| LSH query (10K-doc index)    | ~393 µs²    | ~2.5K queries/sec     |\n| Hamming compare (`hamming`)  | ~0.5 ns     | ~2B comparisons/sec   |\n| Jaccard compare (h=128)      | ~50 ns      | ~20M comparisons/sec  |\n\n¹ SimHash 5 KB sketch time dropped ~40% from v0.1.2 (345 µs → 205 µs)\nvia the streaming `±1`-per-occurrence accumulator under `Weighting::Tf`.\n\n² LSH query is slower than v0.1.x **on adversarial bench corpora**:\nxxh3's collision profile produces 1.62× more bucket candidates than\nMurmurHash3 on a 
9/10-shared-words corpus. Per-candidate cost is\nunchanged. Pin `HashFamily::MurmurHash3_x64_128` if your workload\nmatches the bench shape and you need v0.1.x query latency. See\nCHANGELOG.md for the analysis.\n\nRun benchmarks:\n\n```bash\nRUSTFLAGS=\"-C target-cpu=native\" cargo bench --features lsh\n```\n\n### Optimization knobs\n\n- The canonicalizer takes a single-pass ASCII fast path. v0.2.0 extends it to ASCII + droppable bidi/format codepoints (BOM, ZWSP, RLO, variation selectors) — measured **17×** faster on a 5 KB corpus with one BOM and a ZWSP every 80 bytes (170 µs → 9.8 µs).\n- `Tokenizer::for_each_token` is a callback-style API that skips per-token `String` allocation; classical sketchers route through it.\n- mimalloc gives ~2× on `LSH insert` (alloc-heavy), ~6% on SimHash, marginal elsewhere.\n- The MinHash slot-update inner loop and the SimHash 64-lane accumulator are already auto-vectorized by LLVM (verified via release-build assembly: `vpcmpltuq` + AVX-512 mask blending on `ymm` registers). No hand-rolled SIMD planned.\n- `LshIndex::extend_par` (v0.2.0, `parallel` feature) shards bulk insert by band across the rayon thread pool: measured **1.74×** speedup on 8 cores for 10K-doc bench.\n\n## Stability\n\n- **Hash byte struct layouts** (`MinHashSig\u003cH\u003e`, `SimHash64`, `TlshFingerprint`): frozen since v0.1.0. Golden tests enforce on every PR.\n- **Hash byte values**: changed once at v0.2.0 with the default-hasher flip from MurmurHash3 to xxh3. The struct layout did not change. 
v0.1.x byte parity is one builder call away (`with_hasher(HashFamily::MurmurHash3_x64_128)`); golden fixtures regenerated, no further byte changes planned for v0.2.x.\n- **`EmbeddingProvider`, `Embedding`, `semantic_similarity`**: parity-compatible with `imgfprint` 0.4.x and `audiofp` 0.2.x.\n- **`FORMAT_VERSION = 1`**: mirrored across the cross-modal sibling crates so the integrator (`ucfp`) can refuse to open a database whose layout predates the running build.\n- **Cross-config comparisons** are gated by `FingerprintMetadata::config_hash`. Two fingerprints with different non-zero `config_hash` values must not be compared.\n- **SemVer enforcement**: every PR runs `cargo-semver-checks` (added in v0.2.1) against the published baseline. Accidental SemVer breaks fail CI.\n\n## Security\n\n- **OOM protection**: streaming sketchers cap their internal buffer at 16 MiB; oversized chunks are rejected at `update` time.\n- **Trojan Source / homoglyph defense**: canonicalizer strips Bidi controls and the Cf category. `security` feature adds the UTS #39 confusable skeleton so Cyrillic 'а' folds to Latin 'a'.\n- **NFC bombs bounded**: NFKC growth capped at 18× (Unicode-spec-mandated worst case).\n- **Deterministic output**: same input always produces the same byte-identical signature; no hidden RNG, no clock dependency.\n- **Cryptographic-level attacks on the hash families**: out of scope. 
MurmurHash3, xxh3, and SimHash are non-cryptographic by design.\n\n## Comparison with alternatives\n\n| Feature                      | txtfp | datasketch (py) | sourmash (py) | rapidfuzz |\n| ---------------------------- | :---: | :-------------: | :-----------: | :-------: |\n| MinHash                      |  ✓   |       ✓        |      ✓       |    —     |\n| Banded LSH                   |  ✓   |       ✓        |      ✓       |    —     |\n| SimHash                      |  ✓   |       ✓        |      —       |    —     |\n| TLSH                         |  ✓   |       —        |      —       |    —     |\n| Streaming sketches           |  ✓   |       ✓        |      ✓       |    —     |\n| Unicode canonicalization     |  ✓   |       —        |      —       |   ~      |\n| Trojan Source defense        |  ✓   |       —        |      —       |    —     |\n| Local ONNX embeddings        |  ✓   |       —        |      —       |    —     |\n| Byte-stable hash layouts     |  ✓   |       —        |      —       |    —     |\n| `no_std + alloc`             |  ✓   |       —        |      —       |    —     |\n| Pure Rust (no Python GIL)    |  ✓   |       —        |      —       |    ✓     |\n\n## Examples\n\nSee the `examples/` directory:\n\n- `dedup.rs` — MinHash + LSH end-to-end deduplication\n- `near_dup.rs` — SimHash near-duplicate detection\n- `semantic.rs` — Local ONNX embedding similarity (requires `semantic`)\n- `regen_goldens.rs` — Regenerate the byte-frozen test fixtures (do not run on a patch release; only when intentionally bumping a minor)\n\n```bash\ncargo run --example dedup --features lsh --release\ncargo run --example near_dup --release\ncargo run --example semantic --features semantic --release\n```\n\n## Contributing\n\nContributions welcome. The contract:\n\n1. Fork the repository.\n2. Branch (`git checkout -b feature/x`).\n3. 
Run the matrix locally: `cargo test --no-default-features --features \"std,minhash,simhash,lsh,tlsh,markup,security,serde,parallel\"`.\n4. Run clippy: `cargo clippy --all-targets -- -D warnings`.\n5. Run benches if the change touches a hot path: `cargo bench`.\n6. **Never regenerate golden fixtures unless you're explicitly bumping a minor version.**\n7. Open a PR. CI gates on fmt, clippy, doc, deny, audit, semver-checks, and a 60-second fuzz smoke (`canonicalize` and `minhash_streaming` targets under `fuzz/`).\n8. Releases: see [`RELEASING.md`](RELEASING.md).\n\n### Development\n\n```bash\ngit clone https://github.com/themankindproject/txtfp\ncd txtfp\n\n# Default-feature smoke\ncargo test\n\n# Full classical surface (no semantic — pulls heavy ONNX deps)\ncargo test --features \"lsh,markup,security,serde,parallel,tlsh,cjk\"\n\n# Build the docs\ncargo doc --no-deps --open\n\n# Run the fuzz harness locally (requires nightly + cargo-fuzz)\ncd fuzz \u0026\u0026 cargo +nightly fuzz run canonicalize -- -max_total_time=60\n```\n\n## License\n\nLicensed under the [MIT","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthemankindproject%2Ftxtfp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthemankindproject%2Ftxtfp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthemankindproject%2Ftxtfp/lists"}