{"id":50457549,"url":"https://github.com/clark-labs-inc/clark-hash","last_synced_at":"2026-06-01T03:05:49.945Z","repository":{"id":360647977,"uuid":"1218809758","full_name":"clark-labs-inc/clark-hash","owner":"clark-labs-inc","description":"Clark Hash, 32x smaller searchable sketches for embeddings","archived":false,"fork":false,"pushed_at":"2026-05-27T08:49:48.000Z","size":116,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-27T10:18:35.628Z","etag":null,"topics":["embedding-vectors","embeddings","embeddings-similarity","lsh","sketching-algorithm"],"latest_commit_sha":null,"homepage":"https://www.clarkchat.com","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clark-labs-inc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-23T08:34:07.000Z","updated_at":"2026-05-27T08:49:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/clark-labs-inc/clark-hash","commit_stats":null,"previous_names":["clark-labs-inc/clark-hash"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/clark-labs-inc/clark-hash","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clark-labs-inc%2Fclark-hash","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clark-labs-inc%2Fclark-hash/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clark-labs-inc%2Fclark-hash/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clark-labs-inc%2Fclark-hash/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clark-labs-inc","download_url":"https://codeload.github.com/clark-labs-inc/clark-hash/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clark-labs-inc%2Fclark-hash/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33757791,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-01T02:00:06.963Z","response_time":115,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embedding-vectors","embeddings","embeddings-similarity","lsh","sketching-algorithm"],"created_at":"2026-06-01T03:05:49.094Z","updated_at":"2026-06-01T03:05:49.931Z","avatar_url":"https://github.com/clark-labs-inc.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Clark Hash\n\nClark Hash is a Rust package for compact, searchable sketches of neural embeddings.\nIt packages a stateless sparse Johnson-Lindenstrauss projection with fixed scalar\nquantization, so each database vector can be encoded independently and searched later\nwith an asymmetric floating-point query sketch.\n\nThe core codec was originally developed under the internal name `SQuaJL`. The Rust API\nkeeps the `SQuaJL` and `SQuaJLConfig` names for compatibility, and also exports\n`ClarkHash` and `ClarkHashConfig` aliases for new code.\n\n## Links\n\n- Crate: [crates.io/crates/clark-hash](https://crates.io/crates/clark-hash)\n- API docs: [docs.rs/clark-hash](https://docs.rs/clark-hash/latest/clark_hash/)\n- Source: [github.com/clark-labs-inc/clark-hash](https://github.com/clark-labs-inc/clark-hash)\n- Paper sources: [arxiv_submission/](arxiv_submission/)\n\n## Main Use Cases\n\n- **Cheaper embedding memory:** store 384-dimensional `f32` sentence embeddings as\n  48-byte searchable sketches in the default profile.\n- **Online semantic memory:** encode vectors as they arrive, without training a\n  codebook or recalibrating on the whole corpus.\n- **Large text streams:** map documents, chunks, logs, conversations, or agent\n  traces into compact semantic tokens for cheaper storage, movement, and scan.\n- **Retrieval prefilters:** use compressed sketch scores as a low-cost first pass\n  before reranking with dense vectors, text, or a stronger retrieval model.\n- **Local and edge search:** keep more semantic state in RAM, local disk, browser\n  storage, or customer-controlled deployments where bandwidth and sync size matter.\n\n## Repository Scope\n\nThis repository is now focused on the Clark Hash embedding codec:\n\n- Stateless sparse-JL sketching and scalar quantization for dense embeddings.\n- Bit-packed database-side vectors and floating-point query sketches.\n- A simple flat compressed-scan index for evaluation and small deployments.\n- Optional `fastembed` integration for local text-embedding examples.\n- Reproducible sentence-similarity benchmarks and paper sources.\n\nModel-runtime compression experiments are intentionally outside this package. The\nlibrary surface here is the embedding sketch codec and its benchmark harnesses.\n\n## Why Use It\n\nA common 384-dimensional `f32` sentence embedding costs 1,536 bytes per vector. The\ndefault Clark Hash profile stores the same vector as a 48-byte cosine sketch:\n\n| Representation | Bytes per vector | Storage ratio |\n| --- | ---: | ---: |\n| Dense `f32`, 384 dimensions | 1,536 | 1.0000 |\n| Clark Hash, `m = 96`, `b = 4` | 48 | 0.03125 |\n\nThat is 32x smaller, or 96.875% less vector memory, for this configuration. The quality\ntradeoff depends on the embedding model, sketch dimension, bit width, hash count, and\nretrieval workload; the benchmark section below shows measured results rather than a\nuniversal guarantee.\n\nClark Hash is useful when embeddings arrive continuously and you do not want a training\nor calibration pass before storing each vector:\n\n- Encode one vector at a time with a deterministic seed.\n- Store compact bit-packed sketches for hot memory, local cache, disk, or object storage.\n- Keep query vectors in floating point for asymmetric scoring.\n- Avoid corpus-specific codebooks, centroids, rotations, or learned quantization tables.\n- Use the same codec in simple flat scans, evaluation harnesses, and larger retrieval systems.\n\n## Install\n\nFrom crates.io:\n\n```toml\n[dependencies]\nclark-hash = \"0.1\"\n```\n\nWith local text embedding support through `fastembed`:\n\n```toml\n[dependencies]\nclark-hash = { version = \"0.1\", features = [\"fastembed\"] }\n```\n\nWith serialization support for quantized codes:\n\n```toml\n[dependencies]\nclark-hash = { version = \"0.1\", features = [\"serde\"] }\n```\n\nIn Rust code, the crate is imported as `clark_hash`.\n\n## Quick Start\n\n```rust\nuse clark_hash::{ClarkHash, ClarkHashConfig, FlatIndex, SimilarityMetric};\n\nfn main() -\u003e clark_hash::Result\u003c()\u003e {\n    let codec = ClarkHash::new(\n        ClarkHashConfig::new(384)\n            .with_sketch_dim(96)\n            .with_bits(4)\n            .with_hashes_per_input(4)\n            .with_metric(SimilarityMetric::Cosine),\n    )?;\n\n    let doc_a = vec![0.1_f32; 384];\n    let doc_b = vec![0.2_f32; 384];\n    let query = vec![0.15_f32; 384];\n\n    let mut index = FlatIndex::new(codec);\n    index.add_vector(\u0026doc_a)?;\n    index.add_vector(\u0026doc_b)?;\n\n    let hits = index.search(\u0026query, 2)?;\n    println!(\"{hits:#?}\");\n\n    Ok(())\n}\n```\n\n## Text Embedding Pipeline\n\nEnable the `fastembed` feature when you want local text embeddings and immediate\nquantization in one pipeline.\n\n```rust\nuse clark_hash::{ClarkHash, ClarkHashConfig, FastEmbedQuantizer, FlatIndex};\nuse fastembed::EmbeddingModel;\n\nfn main() -\u003e clark_hash::Result\u003c()\u003e {\n    let codec = ClarkHash::new(\n        ClarkHashConfig::new(384)\n            .with_sketch_dim(96)\n            .with_bits(4)\n            .with_hashes_per_input(4),\n    )?;\n\n    let mut pipeline = FastEmbedQuantizer::new(EmbeddingModel::AllMiniLML6V2, codec)?;\n\n    let documents = vec![\n        \"passage: Rust is a systems programming language.\",\n        \"passage: Embeddings can preserve semantic similarity.\",\n        \"passage: Quantization reduces memory usage.\",\n    ];\n\n    let codes = pipeline.quantize_texts(\u0026documents, Some(32))?;\n    let query = pipeline.embed_query(\"query: semantic vector compression\")?;\n    let index = FlatIndex::from_encoded(pipeline.codec().clone(), codes)?;\n\n    println!(\"{:#?}\", index.search_prepared(\u0026query, 3)?);\n    Ok(())\n}\n```\n\nRun the example:\n\n```bash\ncargo run --release --features fastembed --example fastembed_quantize\n```\n\n## How It Works\n\nFor an input vector `x in R^d`, the codec:\n\n1. Computes the input norm.\n2. Projects the normalized vector into a lower-dimensional sparse signed JL sketch.\n3. Rescales the projected coordinates by `sqrt(sketch_dim)`.\n4. Clips and uniformly quantizes every sketch coordinate into `1..=8` bits.\n5. Optionally stores a two-byte norm channel for raw dot-product scoring.\n\nThe database side stores a `QuantizedVector`. The query side uses a floating-point\n`QuerySketch`. Scoring happens in sketch space, which is a natural fit for cosine\nsimilarity over normalized sentence embeddings.\n\nFor the compact mathematical note and paper, see:\n\n- [Clark Hash paper PDF](docs/Clark_Hash_Paper.pdf)\n- [Typst paper source](docs/CLARK_HASH_PAPER.typ)\n- [Editable paper note](docs/CLARK_HASH_PAPER.md)\n- [arXiv LaTeX source package](arxiv_submission/)\n\nRegenerate the PDF with:\n\n```bash\ntypst compile docs/CLARK_HASH_PAPER.typ docs/Clark_Hash_Paper.pdf\n```\n\n## Configuration Guide\n\nFor common 384-dimensional sentence embeddings, start here:\n\n```rust\nClarkHashConfig::new(384)\n    .with_sketch_dim(96)\n    .with_bits(4)\n    .with_hashes_per_input(4)\n    .with_metric(SimilarityMetric::Cosine)\n```\n\nUseful tuning directions:\n\n- `sketch_dim = 64` with `bits = 2` or `3` gives more aggressive compression.\n- `sketch_dim = 128` with `bits = 4` or `6` gives better quality.\n- `SimilarityMetric::Cosine` is best for normalized semantic embeddings.\n- `SimilarityMetric::Dot` stores a small norm channel and is better when raw inner product matters.\n- `seed` controls the deterministic projection, so keep it stable across indexed data.\n\n## Benchmarks\n\nRun the core encode and scan Criterion benchmark:\n\n```bash\ncargo bench --bench throughput\n```\n\nRun the local text embedding plus quantization benchmark:\n\n```bash\ncargo bench --features fastembed --bench fastembed_pipeline\n```\n\nRun the synthetic retrieval sanity check:\n\n```bash\ncargo run --release --example quality_report\n```\n\n## Hugging Face Sentence Similarity Benchmark\n\nThe real-text benchmark downloads multilingual sentence-similarity corpora from Hugging\nFace, embeds each unique sentence once, quantizes the embeddings, and compares score\ncorrelations.\n\nDefault `all-MiniLM-L6-v2` run:\n\n```bash\ncargo run --release --features fastembed --example hf_sentence_similarity\n```\n\nMultilingual model run:\n\n```bash\ncargo run --release --features fastembed --example hf_sentence_similarity -- \\\n  --model ParaphraseMLMiniLML12V2 \\\n  --report target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json\n```\n\nFast smoke run:\n\n```bash\ncargo run --release --features fastembed --example hf_sentence_similarity -- \\\n  --max-pairs-per-subset 200\n```\n\nThe benchmark currently uses:\n\n- `mteb/sts17-crosslingual-sts`\n- `mteb/sts22-crosslingual-sts`\n\nIt reports:\n\n- Dense cosine score vs. human similarity correlation.\n- Clark Hash approximate score vs. human similarity correlation.\n- Quantized score vs. dense score correlation.\n- Macro averages across language-pair subsets.\n\n## Benchmark Results\n\nThese results were produced locally on April 23, 2026 with:\n\n- `sketch_dim = 96`\n- `bits = 4`\n- `hashes_per_input = 4`\n- cosine scoring\n- 48 bytes per stored vector\n- 0.03125 compression ratio vs. dense `f32`\n\nThe full benchmark used 9,304 labeled sentence pairs across 29 multilingual subsets and\n17,000 unique sentences.\n\n| Model | Dataset | Dense Spearman | Sketch Spearman | Sketch Loss | Sketch vs Dense Pearson |\n| --- | --- | ---: | ---: | ---: | ---: |\n| `all-MiniLM-L6-v2` | `mteb/sts17-crosslingual-sts` | 0.3644 | 0.2719 | -0.0926 | 0.7242 |\n| `all-MiniLM-L6-v2` | `mteb/sts22-crosslingual-sts` | 0.4168 | 0.2876 | -0.1292 | 0.8531 |\n| `paraphrase-multilingual-MiniLM-L12-v2` | `mteb/sts17-crosslingual-sts` | 0.8144 | 0.7460 | -0.0684 | 0.9099 |\n| `paraphrase-multilingual-MiniLM-L12-v2` | `mteb/sts22-crosslingual-sts` | 0.2973 | 0.2472 | -0.0501 | 0.9460 |\n\nThe main readout is that model fit matters more than quantization in this test. The\nEnglish-centric `all-MiniLM-L6-v2` model is weak on many cross-lingual subsets. The\nmultilingual MiniLM backbone is much stronger on STS17, and the sketch preserves a large\npart of that ranking signal while storing each vector in 48 bytes.\n\nSTS22 is a harder and more mixed corpus. The multilingual model is not universally better\nthere, but the quantized sketches still track dense scores more closely than they did\nwith the English MiniLM baseline.\n\nFull JSON reports from the local run:\n\n- `target/hf-sts-report.json`\n- `target/hf-sts-report-paraphrase-multilingual-minilm-l12-v2.json`\n\n## API Overview\n\nCore types:\n\n- `ClarkHash` / `SQuaJL`: stateless codec used to encode vectors, sketch queries, and score codes.\n- `ClarkHashConfig` / `SQuaJLConfig`: sketch size, bit width, hash count, clip range, seed, and metric.\n- `QuantizedVector`: bit-packed database-side sketch.\n- `QuerySketch`: floating-point query-side sketch.\n- `FlatIndex`: reference exact scan over compressed vectors.\n- `FastEmbedQuantizer`: optional text embedding and quantization pipeline.\n\n## Limitations\n\n- Clark Hash is a quantization library, not a full approximate-nearest-neighbor engine.\n- `FlatIndex` scans compressed vectors exactly and is meant for evaluation and simple deployments.\n- Quality depends on the embedding model, sketch dimension, bit width, and workload.\n- No fixed sketch dimension can preserve every future pair in an adversarial unbounded stream.\n- This package does not claim that Johnson-Lindenstrauss transforms, feature hashing,\n  scalar quantization, or compressed retrieval are new. It documents and implements one\n  practical stateless combination for Clark's embedding and memory workloads.\n\n## Citation\n\nMLA:\n\n\u003e Clark Labs Inc., Autoresearch, and Stanislav Kirdey. \"Clark Hash: Stateless Sparse\n\u003e Johnson-Lindenstrauss Quantization for Neural Embeddings.\" Clark Labs Inc., 2026,\n\u003e GitHub, https://github.com/clark-labs-inc/clark-hash.\n\nBibTeX:\n\n```bibtex\n@misc{clark_hash_2026,\n  author = {{Clark Labs Inc.} and {Autoresearch} and {Stanislav Kirdey}},\n  title = {Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings},\n  year = {2026},\n  publisher = {Clark Labs Inc.},\n  url = {https://github.com/clark-labs-inc/clark-hash}\n}\n```\n\n## Development\n\n```bash\ncargo fmt --all -- --check\ncargo clippy --all-targets --all-features -- -D warnings\ncargo test --all-features\ncargo bench --bench throughput --no-run\n```\n\nThe `fastembed` benchmark and examples may download models on first use.\n\n## License\n\nLicensed under either of:\n\n- Apache License, Version 2.0\n- MIT license\n\nat your option.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclark-labs-inc%2Fclark-hash","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclark-labs-inc%2Fclark-hash","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclark-labs-inc%2Fclark-hash/lists"}