{"id":50798020,"url":"https://github.com/smarthi/pymuvera","last_synced_at":"2026-06-12T16:03:28.835Z","repository":{"id":353466578,"uuid":"1219519573","full_name":"smarthi/pymuvera","owner":"smarthi","description":"Python library for MUVERA multi-vector retrieval via Fixed Dimensional Encodings. ColBERT / ColQwen2 / ColQwen3.5 compatible.","archived":false,"fork":false,"pushed_at":"2026-05-19T20:44:03.000Z","size":843,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-19T23:56:28.043Z","etag":null,"topics":["approximate-nearest-neighbor","approximate-nearest-neighbor-search","colbert","colqwen2","embeddings","late-interaction","multi-vector-retrieval","muvera","rag","simhash"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/pymuvera/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/smarthi.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-24T00:45:49.000Z","updated_at":"2026-05-19T20:44:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/smarthi/pymuvera","commit_stats":null,"previous_names":["smarthi/muvera-fde","smarthi/pymuvera"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/smarthi/pymuvera","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smarthi%2Fpymuvera","tags_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/smarthi%2Fpymuvera/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smarthi%2Fpymuvera/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smarthi%2Fpymuvera/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/smarthi","download_url":"https://codeload.github.com/smarthi/pymuvera/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/smarthi%2Fpymuvera/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34251777,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-12T02:00:06.859Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["approximate-nearest-neighbor","approximate-nearest-neighbor-search","colbert","colqwen2","embeddings","late-interaction","multi-vector-retrieval","muvera","rag","simhash"],"created_at":"2026-06-12T16:03:26.857Z","updated_at":"2026-06-12T16:03:28.822Z","avatar_url":"https://github.com/smarthi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pymuvera — MUVERA + EGGROLL + Spectral SimHash: Fixed Dimensional Encodings for Multi-Vector Retrieval\n\n**Sublinear ANN retrieval for ColBERT, ColPali, ColQwen2, and 
ColQwen3.5.**\n\n[![PyPI](https://img.shields.io/pypi/v/pymuvera)](https://pypi.org/project/pymuvera/)\n[![Python](https://img.shields.io/pypi/pyversions/pymuvera)](https://pypi.org/project/pymuvera/)\n[![CI](https://github.com/smarthi/pymuvera/actions/workflows/ci.yml/badge.svg)](https://github.com/smarthi/pymuvera/actions)\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](LICENSE)\n\nA pure-Python port of Google's graph-mining MUVERA implementation, extended with\n**low-rank SimHash factorisation** (EGGROLL, Sarkar et al., 2025),\n**Subsampled Randomized Hadamard Transform** (SRHT, Woolfe, Liberty, Rokhlin \u0026 Tygert, 2008),\n**Cross-Polytope LSH** (Andoni \u0026 Razenshteyn, 2015),\n**Densifying LSH fill** (Shrivastava, 2014), and\n**Calibrated Eigenbasis SimHash** with eigenvalue-weighted partitioning\n(inspired by SpectralQuant, Vangara \u0026 Gopinath, 2026).\n\n| | Reference |\n|---|---|\n| MUVERA paper | [Dhulipala et al., 2024](https://arxiv.org/abs/2405.19504) |\n| EGGROLL paper (LOW_RANK_GAUSSIAN) | [Sarkar et al., 2025](https://eshyperscale.github.io/imgs/paper.pdf) |\n| SRHT | [Woolfe, Liberty, Rokhlin \u0026 Tygert, 2008](https://doi.org/10.1016/j.acha.2007.12.002) |\n| Cross-Polytope LSH | [Andoni \u0026 Razenshteyn, 2015](https://arxiv.org/abs/1509.02897) |\n| Densifying LSH | [Shrivastava, 2014](https://arxiv.org/abs/1401.4605) |\n| CALIBRATED_EIGENBASIS inspiration | [SpectralQuant](https://github.com/Dynamis-Labs/spectralquant), Vangara \u0026 Gopinath, 2026 |\n| Original C++ implementation | [google/graph-mining](https://github.com/google/graph-mining/tree/main/sketching/point_cloud) |\n\n---\n\n## v0.4.2 highlights\n\nv0.4.2 documents `CALIBRATED_EIGENBASIS` as an experimental SpectralQuant-inspired\nFDE/LSH adaptation, adds explicit SpectralQuant attribution, and calls out the main\nEigenbasis reconstruction-risk tradeoff: eigenvalue weighting can improve semantic\ncollisions when high-variance directions carry 
signal, but it can hurt recall when\nimportant matches live in low-variance tail directions. The reconstruction-error\nsection below includes the restored plots from the v0.4.1 docs plus a new\nEigenbasis-specific spectral-bias plot and caveats. The plot PNGs can be\nregenerated with `python docs/generate_readme_plots.py`.\n\n---\n\n## What this library adds beyond the original paper\n\nThe MUVERA paper uses a full-rank Gaussian matrix for SimHash partitioning and\nHamming nearest-neighbor fill for empty partitions. This library adds five\ncapabilities:\n\n**`LOW_RANK_GAUSSIAN`** (EGGROLL, Sarkar et al., 2025) factors the SimHash matrix\nas AB⊤ (`A ∈ ℝ^{d×r}`, `B ∈ ℝ^{k×r}`, `r ≪ k`), cutting partition cost from\n`O(N·d·k)` to `O(N·d·r + N·r·k)`. O(r⁻¹) convergence to full-rank, faster than\nthe CLT rate. At r=4, ColQwen2 (d=128, k=8): **~1.9× faster**, ~25% variance increase.\n\n**`SRHT`** (Woolfe et al., 2008) applies a structured `S·H·D` transform at\n`O(N·d·log d)` cost, independent of k. The linear projection has a JL-style\ndistance-preservation guarantee; sign partitioning remains a SimHash heuristic.\nFor ColQwen2 (d=128, k=8): 904N vs 1024N ops.\n\n**`CROSS_POLYTOPE`** (Andoni \u0026 Razenshteyn, 2015) uses `argmax(|H·D·x|)` instead\nof sign-based SimHash, producing 2·padded_dim partitions per repetition aligned with\nthe Voronoi cells of the cross-polytope — **theoretically optimal for cosine\nsimilarity** in high dimensions. For ColQwen2 (d=128): 256 partitions at O(d log d)\ncost. For ColQwen3.5 (d=320): 1024 partitions.\n\n**Densifying LSH fill** (Shrivastava, 2014) replaces the Hamming nearest-neighbor fill\n— which costs O(num_tokens × k × num_empty) and can reach 800K+ operations per document\nat k=8 with 512 tokens and 200 empty slots — with a deterministic splitmix64 hash that\nassigns each empty slot a source token in a single operation. Cost scales only with the\nnumber of empty slots, not corpus size or k. 
Automatically used for `CROSS_POLYTOPE`\n(no sketch matrix available for Hamming distances); opt-in for other modes via\n`densifying_fill=True`.\n\n**`CALIBRATED_EIGENBASIS`** rotates embeddings into the eigenbasis of the empirical\ntoken covariance before SimHash partitioning. With `use_eigenvalue_weighting=True`\n(default), the SimHash projection matrix is sampled from N(0, diag(λ)) in the\nrotated space, so bucket assignment emphasizes high-variance calibrated directions.\nThis is inspired by SpectralQuant's calibrated eigenbasis and water-filled allocation\nfor KV-cache quantization, but it is an experimental FDE/LSH adaptation: pymuvera\ndoes not implement SpectralQuant's semantic/tail split, QJL correction, Lloyd-Max\ncodebooks, or integer bit allocation. Validate this mode against exact MaxSim on\nyour own multimodal corpus.\n\n---\n\n## What is MUVERA?\n\nLate-interaction retrieval models like **ColBERT**, **ColPali**, and **ColQwen2**\nrepresent each query and document as a *variable-length set* of token embeddings\nrather than a single vector. 
Scoring two sets requires the computationally\nexpensive **MaxSim** (Chamfer Similarity) operation:\n\n```\nChamfer(Q, D) = Σ_{q ∈ Q} max_{d ∈ D} cos(q, d)\n```\n\nThis makes large-scale ANN retrieval impractical with standard indexes.\n\nMUVERA solves this by converting each multi-vector set into a **single\nfixed-dimensional vector** (FDE) such that:\n\n```\nfde_query(Q) · fde_doc(D)  ≈  Chamfer(Q, D)\n```\n\nStandard ANN libraries (FAISS, ScaNN, OpenSearch k-NN) can then index FDE\nvectors directly, restoring sublinear retrieval for late-interaction models.\nFor cosine-style MaxSim, normalize token embeddings before encoding and use the\nraw FDE inner product as the stage-1 score; normalizing the final FDE vectors\nchanges the estimator.\n\n---\n\n## Installation\n\n```bash\npip install pymuvera\n```\n\nRequires Python ≥ 3.12, NumPy ≥ 1.24, Pydantic ≥ 2.0.\n\n---\n\n## Quick start\n\n```python\nimport numpy as np\nfrom pymuvera import MUVERAEncoder\n\n# One encoder instance for both queries and documents — seed must match\nenc = MUVERAEncoder(\n  dimension=128,  # ColBERT / ColQwen2 token embedding dimension\n  num_simhash_projections=4,  # 2^4 = 16 partitions per repetition\n  num_repetitions=2,  # 2 independent repetitions\n  seed=42,\n)\n\nprint(enc)\n# MUVERAEncoder(dimension=128, num_simhash_projections=4, num_repetitions=2,\n#               projection_type=DEFAULT_IDENTITY, fde_dimension=4096)\n\nquery_tokens = np.random.randn(32, 128).astype(np.float32)  # 32 query tokens\ndoc_tokens = np.random.randn(512, 128).astype(np.float32)  # 512 document tokens\n\nq_fde = enc.encode_query(query_tokens)  # shape: (4096,)\nd_fde = enc.encode_document(doc_tokens)  # shape: (4096,)\n\n# Approximate Chamfer Similarity — drop into any ANN index as a float32 vector\nscore = float(q_fde @ d_fde)\n```\n\n---\n\n## API reference\n\n### `MUVERAEncoder`\n\nThe primary entry point. 
Initialize **once** and reuse for all queries and\ndocuments — the random partition structure (SimHash matrices, Count Sketch\nparameters) must be identical on both sides.\n\n```text\nMUVERAEncoder(\n    dimension: int = 128,\n    num_simhash_projections: int = 4,\n    num_repetitions: int = 1,\n    seed: int = 1,\n    projection_type: ProjectionType = ProjectionType.DEFAULT_IDENTITY,\n    projection_dimension: int | None = None,\n    simhash_rank: int = 1,\n    fill_empty_partitions: bool = False,\n    densifying_fill: bool = False,\n    final_projection_dimension: int | None = None,\n    use_eigenvalue_weighting: bool = True,\n    calibration: EigenbasisCalibration | None = None,\n)\n```\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `dimension` | 128 | Token embedding dimension |\n| `num_simhash_projections` | 4 | SimHash bits *k*; partitions = 2^k |\n| `num_repetitions` | 1 | Independent repetitions (more → better approximation) |\n| `seed` | 1 | Shared RNG seed — **must match** query and document sides |\n| `projection_type` | `DEFAULT_IDENTITY` | `DEFAULT_IDENTITY`, `AMS_SKETCH`, `LOW_RANK_GAUSSIAN` (EGGROLL), `SRHT`, `CROSS_POLYTOPE`, or `CALIBRATED_EIGENBASIS` |\n| `projection_dimension` | `None` | Target dim after Count Sketch; required for `AMS_SKETCH` |\n| `simhash_rank` | 1 | Rank *r* for `LOW_RANK_GAUSSIAN`; must satisfy `1 ≤ r \u003c num_simhash_projections`. r=4 is a practical sweet spot for ColQwen2 (d=128, k≥8) |\n| `fill_empty_partitions` | `False` | Document side: fill empty slots |\n| `densifying_fill` | `False` | Use O(num_empty) Densifying LSH fill (Shrivastava, 2014) instead of O(N×k) Hamming NN fill. 
When `fill_empty_partitions=True`, this path is automatic for `CROSS_POLYTOPE` |\n| `final_projection_dimension` | `None` | Post-accumulation Count Sketch compression |\n| `use_eigenvalue_weighting` | `True` | `CALIBRATED_EIGENBASIS` only: scale SimHash rows by √λ_i so high-variance eigendirections dominate bucket assignment. Set False for ablation |\n| `calibration` | `None` | `CALIBRATED_EIGENBASIS` only: pre-computed `EigenbasisCalibration`. Alternative to calling `calibrate()` post-construction |\n\n**Property:** `fde_dimension` — output vector length.\n\n---\n\n### Encoding single inputs\n\n```python\nenc = MUVERAEncoder(dimension=128, num_simhash_projections=4, num_repetitions=2)\n\n# Query: SUM aggregation — token embeddings summed into their SimHash partition\nq_fde = enc.encode_query(query_tokens)    # (num_tokens, 128) → (fde_dim,)\n\n# Document: AVERAGE aggregation — centroid of tokens per partition\nd_fde = enc.encode_document(doc_tokens)   # (num_tokens, 128) → (fde_dim,)\n\n# Both also accept flat 1-D input (num_tokens * dimension,)\nq_fde = enc.encode_query(query_tokens.flatten())\n```\n\n---\n\n### Batch encoding\n\n```python\nqueries   = [np.random.randn(32,  128).astype(np.float32) for _ in range(100)]\ndocuments = [np.random.randn(512, 128).astype(np.float32) for _ in range(1000)]\n\nQ = enc.encode_queries_batch(queries)     # shape: (100,  fde_dimension)\nD = enc.encode_documents_batch(documents) # shape: (1000, fde_dimension)\n\n# All-pairs approximate Chamfer Similarities in one matmul\nscores = Q @ D.T   # shape: (100, 1000)\ntop_k  = np.argsort(scores, axis=1)[:, ::-1][:, :10]  # top-10 per query\n```\n\n---\n\n### Reducing FDE size\n\nTwo orthogonal compression knobs:\n\n**Option A — per-partition Count Sketch** (reduces width before accumulation):\n\n```python\nfrom pymuvera import ProjectionType\n\nenc = MUVERAEncoder(\n  dimension=128,\n  num_simhash_projections=4,\n  num_repetitions=4,\n  projection_type=ProjectionType.AMS_SKETCH,\n  
projection_dimension=32,  # 128 → 32 per partition slot\n)\n# fde_dimension = 4 reps × 16 partitions × 32 = 2048  (vs 8192 without)\n```\n\n**Option B — post-accumulation Count Sketch** (compresses the final vector):\n\n```python\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=4,\n    num_repetitions=4,\n    final_projection_dimension=512,   # 8192 → 512\n)\n# fde_dimension = 512\n```\n\nBoth preserve dot products in expectation: `E[⟨sketch(x), sketch(y)⟩] = ⟨x, y⟩`.\n\n---\n\n### Projection modes\n\nSeveral projection modes are available, each trading speed, output size, and quality.\n`DEFAULT_IDENTITY`, `LOW_RANK_GAUSSIAN`, `SRHT`, and `CALIBRATED_EIGENBASIS`\nshare the same FDE shape for the same `(dimension, k, repetitions)` settings.\n`AMS_SKETCH` and `CROSS_POLYTOPE` intentionally change that shape.\n\n#### Mode 1: `DEFAULT_IDENTITY` — full-rank Gaussian (baseline)\n\nSamples a fresh `(d × k)` Gaussian matrix per repetition. This is the full-rank\nrandom-hyperplane SimHash baseline.\n\n```python\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=8,\n    num_repetitions=4,\n)\n# SimHash cost: O(N × 128 × 8) = 1024N ops/rep\n```\n\n---\n\n#### Mode 2: `LOW_RANK_GAUSSIAN` — low-rank factored SimHash (EGGROLL)\n\nFactors `W ≈ AB⊤` where `A ∈ ℝ^{d×r}`, `B ∈ ℝ^{k×r}`, replacing one large\nmatmul with two smaller ones:\n\n```python\nfrom pymuvera import ProjectionType\n\nenc = MUVERAEncoder(\n  dimension=128,\n  num_simhash_projections=8,\n  num_repetitions=4,\n  projection_type=ProjectionType.LOW_RANK_GAUSSIAN,\n  simhash_rank=4,  # r=4: O(N×128×4 + N×4×8) = 544N ops — 1.9× faster\n  seed=42,\n)\n```\n\n**Convergence** (EGGROLL, Sarkar et al. 
2025, Theorem 4): the low-rank sign\npattern converges to the full-rank Gaussian at **O(r⁻¹)** — faster than the\n**CLT rate of O(r⁻¹/²)**.\n\n**What is the CLT rate?** The Central Limit Theorem tells us that averaging *n*\nindependent random variables reduces error at O(n⁻¹/²) — the square root of the\nsample size. This is the default convergence rate for most random approximations.\nEGGROLL beats it because the low-rank matrix AB⊤ has a *symmetric* distribution:\nthe sign of each projection is equally likely to be ±1, which causes all **odd\ncumulants** (1st, 3rd, 5th order terms) in the Edgeworth expansion to cancel\nexactly. Since those odd terms are what normally contribute O(r⁻¹/²) error,\ntheir cancellation pushes the leading error down to O(r⁻¹) — the same mechanism\nthat makes symmetric random walks converge faster than asymmetric ones.\n\n| `simhash_rank` r | CLT rate O(r⁻¹/²) | EGGROLL rate O(r⁻¹) | Speedup vs baseline |\n|---|---|---|---|\n| 4 | ~50% error | **~25% error** | 1.9× |\n| 9 | ~33% error | **~11% error** | — |\n| 16 | ~25% error | **~6% error** | — |\n\nCost breakdown for ColQwen2 (d=128, k=8):\n\n| `simhash_rank` | SimHash cost | Speedup |\n|---|---|---|\n| 1 | 136N ops | 7.5× |\n| 4 | 544N ops | 1.9× |\n| 8 | 1088N ops | ~breakeven |\n\n\u003e The 1/√r normalisation is omitted — SimHash sign assignments are\n\u003e scale-invariant (`sign(αx) = sign(x)`), so it has no effect.\n\n---\n\n#### Mode 3: `SRHT` — Subsampled Randomized Hadamard Transform\n\nApplies the structured transform `S·H·D` row-wise:\n\n* **D** — random diagonal ±1 (Rademacher sign flip)\n* **H** — Walsh-Hadamard transform (O(d log d) butterfly)\n* **S** — random row subsampling to k dimensions\n\nInput is zero-padded to the next power of 2 ≥ d before applying H.\n\n```python\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=8,\n    num_repetitions=4,\n    projection_type=ProjectionType.SRHT,\n    seed=42,\n)\n# SimHash cost: O(N × 128 × log₂(128) + N × 
8) = O(N × 128 × 7 + N × 8) = 904N ops\n# Linear SRHT projection has a JL guarantee; sign partitioning remains a SimHash heuristic\n# Constraint: num_simhash_projections \u003c= next_power_of_2(dimension)\n```\n\n**Theoretical note:** SRHT is a structured Johnson-Lindenstrauss projection —\nthe linear projection preserves pairwise distances to ε with high probability\nunder the usual SRHT assumptions. In this library it feeds sign-based SimHash\npartitioning, so the JL result is motivation for projection quality rather than\na direct guarantee on bucket assignments.\nTropp (2011) provides the tightest known analysis, proving that\n`ℓ ≥ (1+ι) · k log(k)` subsampled dimensions suffice to preserve an entire\nk-dimensional subspace with optimal constants via matrix Chernoff inequalities.\nFor SimHash (sign-only) use, sign assignments are scale-invariant, so the\nembedding constants do not apply directly.\n\n---\n\n#### Mode 4: `CROSS_POLYTOPE` — theoretically optimal cosine partitioning\n\nApplies a full SRHT rotation (no subsampling), then assigns each token to its\n**dominant coordinate** — the coordinate with the largest absolute value after rotation:\n\n```text\ny = H D x_padded                    # full Walsh-Hadamard rotation\nj = argmax_i |y_i|                  # dominant coordinate\ns = int(y_j \u003e 0)                    # sign of dominant coordinate\npartition = 2*j + s                 # in [0, 2 * padded_dim)\n```\n\n```python\nfrom pymuvera import ProjectionType\n\nenc = MUVERAEncoder(\n  dimension=128,\n  num_repetitions=4,\n  projection_type=ProjectionType.CROSS_POLYTOPE,\n  fill_empty_partitions=True,  # densifying fill path selected automatically\n  seed=42,\n)\n# num_partitions = 2 * next_power_of_2(128) = 256  (NOT 2^k)\n# fde_dimension  = 4 × 256 × 128 = 131,072\n# num_simhash_projections is IGNORED for CROSS_POLYTOPE\n```\n\n**Why Cross-Polytope is theoretically superior to SimHash:** SimHash partitions space\nwith random hyperplanes — each bit is 
independent. Cross-Polytope partitions by\nfinding the Voronoi cell of the cross-polytope that contains the rotated vector. For\ncosine similarity, Cross-Polytope cells are provably more collision-efficient: two\nnearly-identical vectors are more likely to share the same dominant coordinate than\nto agree on all k sign bits (Andoni \u0026 Razenshteyn, 2015).\n\n| Model | `dimension` | `padded_dim` | `num_partitions` per rep |\n|---|---|---|---|\n| ColQwen2 | 128 | 128 | 256 |\n| ColQwen3.5 v3 | 320 | 512 | 1,024 |\n\n\u003e Because `num_partitions` grows with `dimension`, strongly consider\n\u003e `fill_empty_partitions=True` for sparse document clouds; the densifying fill path is\n\u003e selected automatically for `CROSS_POLYTOPE`.\n\n---\n\n#### Mode 5: `CALIBRATED_EIGENBASIS` — SpectralQuant-inspired calibrated SimHash\n\nRotates embeddings into the eigenbasis of the empirical token covariance before\nSimHash partitioning. Requires a one-time calibration pass on representative corpus\nembeddings.\n\n```python\nfrom pymuvera import MUVERAEncoder, ProjectionType, calibrate_from_embeddings\n\n# Step 1: calibrate on a sample of corpus token embeddings.\ncalibration = calibrate_from_embeddings(corpus_embeddings)  # shape: (N, 128)\ncalibration.save(\"colqwen2_calibration.npz\")\n\n# Step 2: pass calibration at construction.\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=8,\n    num_repetitions=8,\n    projection_type=ProjectionType.CALIBRATED_EIGENBASIS,\n    fill_empty_partitions=True,\n    seed=42,\n    calibration=calibration,\n)\n\nq_fde = enc.encode_query(query_tokens)\nd_fde = enc.encode_document(doc_tokens)\n```\n\n**How it works:** `calibrate_from_embeddings()` computes the empirical covariance\nΣ of the calibration embeddings, eigendecomposes it, and stores the eigenbasis U\n(eigenvectors sorted by descending eigenvalue λ). At encode time, each embedding is\nrotated into this basis (`z = x @ U`) before SimHash. 
The FDE partition centroids\nlive in the eigenbasis space; inner products are preserved exactly because U is\northogonal.\n\n**Eigenvalue weighting (default):** the SimHash projection matrix in the rotated\nspace is sampled from N(0, diag(λ)) rather than N(0, I). Scaling row *i* by √λ_i\nmakes bucket assignment care more about high-variance calibrated coordinates. This\nis a loose SimHash analog of SpectralQuant's water-filled allocation idea: spend\nmore representational budget where the calibrated spectrum says the variance lives.\n\n**Important caveat:** uniform Gaussian SimHash is rotation-invariant. With\n`use_eigenvalue_weighting=False`, the eigenbasis rotation alone is mostly an\nablation/control. The experimental behavior comes from the λ-weighted bucket\nassignment geometry, not from rotation by itself.\n\n**Motivation:** SpectralQuant reports that LLM key covariances can have very low\neffective rank. ColQwen-style retrieval embeddings may show similar low-effective-rank\nstructure, but that is a hypothesis for multimodal retrieval embeddings, not a\nguarantee. 
Inspect `calibration.participation_ratio` and evaluate recall against\nexact MaxSim before using this mode in production.\n\n```python\ncal = calibrate_from_embeddings(your_embeddings)\nprint(f\"deff = {cal.participation_ratio:.1f} / {cal.eigenvectors.shape[0]}\")\n# Low deff, for example \u003c 10 at d=128, is a useful signal to test this mode.\n```\n\n**Cost:** O(N·d²) for the rotation matmul over N tokens (O(d²) per token), plus O(N·d·k) for SimHash.\nCalibration cost depends mostly on how you produce the calibration embeddings; the\neigendecomposition itself is small at typical embedding dimensions.\n\n**Constraint:** `num_simhash_projections ≥ 1`; encoding before calibration raises\n`RuntimeError`.\n\n---\n\n#### Densifying LSH fill — O(num_empty) fill for all projection types\n\nBy default, `fill_empty_partitions=True` uses **Hamming nearest-neighbor fill**:\nfor each empty slot, find the token with the smallest Hamming distance in the SimHash\nsign space. This is geometrically accurate but costs O(num_tokens × k × num_empty).\n\n**Densifying LSH fill** (Shrivastava, 2014) replaces this with a deterministic hash:\n\n```\nfor each empty slot p:\n    token_idx = splitmix64(p ⊕ seed) % num_tokens\n    rep_slice[p] = projected[token_idx]\n```\n\nCost scales only with the number of empty slots — independent of num_tokens and k:\n\n\u003e **Cost: O(num_empty)**\n\u003e\n\u003e For 512 tokens, k=8, and 200 empty slots, Hamming NN fill costs up to 819,200 operations; densifying fill needs **200 operations**. 
~4,000× less work.\n\n```python\n# Explicit opt-in for sign-based modes\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=10,   # 1024 partitions — many will be empty\n    num_repetitions=4,\n    fill_empty_partitions=True,\n    densifying_fill=True,          # O(num_empty) instead of O(N*k)\n)\n\n# Automatic for CROSS_POLYTOPE when fill_empty_partitions=True\nenc = MUVERAEncoder(\n    dimension=320,\n    num_repetitions=8,\n    projection_type=ProjectionType.CROSS_POLYTOPE,\n    fill_empty_partitions=True,    # densifying fill path is selected automatically\n    final_projection_dimension=81920,\n)\n```\n\n| Fill strategy | Cost | Quality | When to use |\n|---|---|---|---|\n| Hamming NN (default) | O(num_tokens × k × num_empty) | Geometrically precise | k ≤ 8, short docs, moderate corpus |\n| Densifying LSH | O(num_empty) — scales only with empty slots | Less precise, ~4,000× faster at k=8 | k ≥ 10, large corpus, `CROSS_POLYTOPE` |\n\n---\n\n#### Projection mode comparison (ColQwen2, d=128)\n\n| Mode | SimHash cost (d=128) | vs baseline | Quality | Extra constraint |\n|---|---|---|---|---|\n| `DEFAULT_IDENTITY` | 1024N ops (k=8) | 1× | Full-rank Gaussian baseline | None |\n| `LOW_RANK_GAUSSIAN` r=4 | 544N ops (k=8) | **1.9×** | O(r⁻¹) convergence, ~25% variance ↑ | `1 ≤ r \u003c k` |\n| `LOW_RANK_GAUSSIAN` r=1 | 136N ops (k=8) | **7.5×** | ~100% variance ↑ | `1 ≤ r \u003c k` |\n| `SRHT` | 904N ops (k=8) | 1.1× | Structured JL projection feeding SimHash | `k ≤ next_pow2(d)` |\n| `CROSS_POLYTOPE` | 896N ops (all partitions) | 1.1× | Theoretically optimal cosine | `fill` recommended |\n| `CALIBRATED_EIGENBASIS` | (1024+d²)N = 17,408N ops (k=8) | ~0.06× | Experimental spectral SimHash | `calibrate()` required |\n\n#### Empty-slot fill strategies — comparison\n\nWhen `fill_empty_partitions=True`, two fill strategies are available:\n\n| Strategy | Cost | Precision | When to use |\n|---|---|---|---|\n| **Hamming NN** (default) | O(num_tokens × k × 
num_empty) | High — nearest token by SimHash distance | k ≤ 8, short docs, moderate corpus |\n| **Densifying LSH** (`densifying_fill=True`) | O(num_empty) — scales only with empty slots | Lower — hash-based, no geometry (~4,000× faster at k=8) | k ≥ 10, large corpora, `CROSS_POLYTOPE` (automatic) |\n\nDensifying LSH fill (Shrivastava, 2014) assigns each empty slot a source token\ndeterministically via a splitmix64 hash of the partition index — no distance\ncomputation, no sketch matrix required. When `fill_empty_partitions=True`, it is\n**automatically used for `CROSS_POLYTOPE`** (no sketch matrix exists for Hamming\ndistances) and opt-in for all other modes via `densifying_fill=True`.\n\n**When to use each:**\n\n* **`DEFAULT_IDENTITY`** — default choice; correctness baseline, no constraints.\n* **`LOW_RANK_GAUSSIAN`** — when speed is the priority and mild quality loss is acceptable.\n  **Requires k ≥ 16 and r/k ≤ 0.25** to make the tradeoff meaningful. r=4, k=6 (r/k=0.67)\n  is nearly full-rank — all the variance penalty, almost no speed gain. Avoid.\n* **`SRHT`** — structured JL projection at sub-quadratic cost. Good to test when\n  projection speed matters and you want to avoid the low-rank approximation.\n* **`CROSS_POLYTOPE`** — when you want theoretically optimal cosine similarity\n  partitioning without tuning k. Best for high-d models (ColQwen3.5 d=320) where\n  num_partitions = 2×512 = 1024 gives fine-grained coverage. Always pair with\n  `fill_empty_partitions=True` (densifying fill is automatic).\n* **`CALIBRATED_EIGENBASIS`** — experimental. Test it when your corpus has a stable\n  domain and calibration shows strongly non-isotropic embeddings. Run\n  `calibrate_from_embeddings()` on a representative sample; inspect\n  `calibration.participation_ratio` to confirm low effective rank (\u003c 10 for d=128\n  is a useful signal). Save the calibration object alongside your FAISS index.\n* **Densifying LSH fill** — when fill cost is a bottleneck. 
At k=8 with 512-token\n  documents, Hamming NN fill costs O(num_tokens × k × num_empty) — up to 819,200\n  operations per document for 200 empty slots. Densifying LSH reduces this to\n  O(num_empty) — 200 operations, ~4,000× faster — by assigning each empty slot a\n  source token via a single deterministic hash. Enable with `densifying_fill=True`.\n  Automatically used for `CROSS_POLYTOPE` (no sketch matrix available for Hamming).\n\n---\n\n### Filling empty partition slots\n\nWith few document tokens and many partitions (large *k*), many slots will be\nempty (all-zero). Enabling `fill_empty_partitions` copies the projection of\nthe nearest token by SimHash Hamming distance into each empty slot, improving\nrecall for short documents:\n\n```python\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=4,\n    num_repetitions=2,\n    fill_empty_partitions=True,   # document side only; queries ignore this flag\n)\n\nshort_doc_tokens = np.random.randn(8, 128).astype(np.float32)\nd_fde = enc.encode_document(short_doc_tokens)   # no all-zero partition blocks\n```\n\n---\n\n### Low-level functional API\n\nBypass the encoder class entirely when you need to manage parameters manually\n(e.g. 
distributed indexing where workers share pre-built parameters):\n\n```python\nfrom pymuvera import (\n    FDEConfig,\n    MUVERAEncoder,\n    generate_query_fde,\n    generate_document_fde,\n)\n\nconfig = FDEConfig(\n  dimension=128,\n  num_repetitions=2,\n  num_simhash_projections=4,\n  seed=42,\n)\n\nq_fde = generate_query_fde(query_tokens, config)\nd_fde = generate_document_fde(doc_tokens, config)\n\n# Pass pre-built RepParams to skip RNG sampling on every call\nenc = MUVERAEncoder(dimension=128, num_repetitions=2, num_simhash_projections=4, seed=42)\nq_fde = generate_query_fde(query_tokens, config, enc._rep_params)\n```\n\n---\n\n### `FDEConfig` serialization\n\n`FDEConfig` is a frozen Pydantic model — save it alongside your ANN index so\nthe encoder configuration is always recoverable:\n\n```python\nimport json\nfrom pymuvera import FDEConfig\n\nconfig = FDEConfig(dimension=128, num_repetitions=4, num_simhash_projections=4, seed=42)\n\n# Save. JSON mode serializes enums as their string values.\nwith open(\"fde_config.json\", \"w\") as f:\n  json.dump(config.model_dump(mode=\"json\"), f)\n\n# Load\nwith open(\"fde_config.json\") as f:\n  config2 = FDEConfig(**json.load(f))\n\nassert config == config2\n```\n\n---\n\n## Configuration guide\n\nMost users hit poor results not because of a wrong projection type but because of a\nmisconfigured `num_simhash_projections` / `num_repetitions` / `simhash_rank` combination.\nThis section explains every tradeoff in plain terms, with concrete numbers for ColQwen2\n(128-dim) and ColQwen3.5 (320-dim) — the two most common production models.\n\n---\n\n### Know your embedding dimension first\n\nDifferent models produce different per-token embedding dimensions. 
Set `dimension` to\nmatch your model exactly — this is the single most important parameter.\n\n| Model | `dimension` | Notes |\n|---|---|---|\n| ColBERT v2 | 128 | Original late-interaction baseline |\n| ColQwen2 | 128 | Most widely deployed as of 2025 |\n| ColQwen3.5 v1 | 128 | Early checkpoint |\n| ColQwen3.5 v3 | 320 | Current recommended checkpoint |\n| Ops-ColQwen3-4B | 320 | OpenSearch variant, up to 2560 via extended head |\n\n\u003e **Common mistake:** Using `dimension=128` with ColQwen3.5 v3 (which is 320-dim) silently\n\u003e truncates every token embedding to 128 dims, discarding 60% of the representation before\n\u003e MUVERA even runs. Always verify with `model.config.projection_dim` or check the model card.\n\n---\n\n### The two knobs that matter most\n\n#### `num_simhash_projections` (k) — partition granularity\n\nEach repetition divides embedding space into **2^k buckets**. Document tokens that land in the\nsame bucket are averaged into one FDE slot (query tokens are summed).\n\n| k | Partitions | Tokens/partition (512-token doc) | Recommendation |\n|---|---|---|---|\n| 4 | 16 | 32 | coarse; fast but high collision rate |\n| 6 | 64 | 8 | reasonable default |\n| 8 | 256 | 2 | good quality; use `fill_empty_partitions=True` |\n| 10 | 1,024 | 0.5 | too sparse for most docs; many empty slots |\n\n\u003e **Rule of thumb:** aim for **4–10 tokens per partition** on average.\n\u003e For a 512-token ColQwen3.5 page: k=6 (8 tokens/partition) or k=8 with fill enabled.\n\n#### `num_repetitions` — approximation quality\n\nEach repetition is an independent random partition of the same embedding space. Adding\nrepetitions directly improves recall and is the safest quality knob to increase.\n\n- More repetitions **always** improve recall.\n- Cost scales linearly: 2× repetitions = 2× FDE size = 2× encode time.\n- Diminishing returns set in around 8–16 repetitions for most corpora.\n\n\u003e **Rule of thumb:** start with `num_repetitions=8`. 
If recall is poor, double it before\n\u003e touching any other parameter.\n\n---\n\n### The budget equation\n\n```\nfde_dimension = num_repetitions × 2^k × dimension\n```\n\nFor a fixed FDE budget, spending it on **more repetitions beats larger k** for most corpora:\n\n| Config | fde_dimension (ColQwen3.5, d=320) | Notes |\n|---|---|---|\n| k=6, reps=20 | 20 × 64 × 320 = 409,600 | many repetitions, coarse partitions |\n| k=8, reps=10 | 10 × 256 × 320 = 819,200 | balanced — usually better recall |\n| k=8, reps=5 | 5 × 256 × 320 = 409,600 | same budget as first row; better quality |\n\nUse `final_projection_dimension` to compress to a target index size after choosing\nthe right k/repetitions balance:\n\n```python\nenc = MUVERAEncoder(\n    dimension=320,               # ColQwen3.5 v3\n    num_simhash_projections=8,\n    num_repetitions=10,\n    fill_empty_partitions=True,\n    final_projection_dimension=81920,  # compress to target index size\n)\n```\n\n---\n\n### When to use `fill_empty_partitions`\n\nWith k=8 (256 partitions) and a short document (\u003c 200 tokens), many partition slots\nwill be empty — all zeros in the FDE. 
Zeros contribute nothing to the dot product and\ndirectly hurt recall.\n\nEnable `fill_empty_partitions=True` whenever:\n\n```\nnum_doc_tokens / 2^k \u003c 2\n```\n\n| k | Enable fill if doc tokens \u003c |\n|---|---|\n| 6 | 128 |\n| 8 | 512 |\n| 10 | 2,048 |\n\nFor ColQwen3.5 pages at k=8: nearly always enable fill, since most document pages\nproduce fewer than 512 tokens.\n\n---\n\n### `LOW_RANK_GAUSSIAN` — when it helps and when it does not\n\nLow-rank SimHash only makes theoretical sense when **r is much smaller than k**.\nThe computational benefit comes from the ratio r/k — if that ratio is close to 1,\nyou get all the approximation error with almost no speed gain.\n\n| k | r | r/k ratio | Assessment |\n|---|---|---|---|\n| 6 | 4 | 0.67 | ❌ nearly full-rank — avoid |\n| 8 | 4 | 0.50 | ⚠️ marginal benefit |\n| 16 | 4 | 0.25 | ✅ good tradeoff (~1.9× faster, ~25% variance ↑) |\n| 16 | 2 | 0.13 | ✅ aggressive (~4× faster, ~50% variance ↑) |\n\n\u003e **The k=6, rank=4 trap:** this is a near-full-rank approximation of a 6-bit matrix.\n\u003e You pay ~25% variance penalty with only a 1.4× compute saving. 
This combination\n\u003e produces the worst results of all modes (as seen in early ColQwen3.5 benchmarks).\n\u003e **Minimum recommended config for LOW_RANK_GAUSSIAN: k ≥ 16, rank ≤ k//4.**\n\n---\n\n### Recommended starting configs\n\n#### ColQwen2 (d=128) — general purpose\n\n```python\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=8,\n    num_repetitions=8,\n    fill_empty_partitions=True,\n    seed=42,\n)\n# fde_dimension = 8 × 256 × 128 = 262,144\n# tokens/partition at 512 tokens: 2 — fill is essential\n```\n\n#### ColQwen3.5 v3 (d=320) — general purpose\n\n```python\nenc = MUVERAEncoder(\n    dimension=320,\n    num_simhash_projections=8,\n    num_repetitions=8,\n    fill_empty_partitions=True,\n    seed=42,\n)\n# fde_dimension = 8 × 256 × 320 = 655,360\n# use final_projection_dimension if index size is a constraint\n```\n\n#### ColQwen3.5 v3 — speed-optimized (SRHT)\n\n```python\nenc = MUVERAEncoder(\n    dimension=320,\n    num_simhash_projections=8,\n    num_repetitions=8,\n    projection_type=ProjectionType.SRHT,\n    fill_empty_partitions=True,\n    seed=42,\n)\n# Structured JL projection feeding SimHash, ~12% faster than DEFAULT_IDENTITY at k=8\n# Best quality/speed tradeoff in benchmarks\n```\n\n#### ColQwen3.5 v3 — Cross-Polytope (theoretically optimal)\n\n```python\nenc = MUVERAEncoder(\n    dimension=320,\n    num_repetitions=4,\n    projection_type=ProjectionType.CROSS_POLYTOPE,\n    fill_empty_partitions=True,    # densifying fill automatic\n    seed=42,\n    final_projection_dimension=81920,\n)\n# num_partitions = 2 * 512 = 1024 per repetition\n# raw fde = 4 * 1024 * 320 = 1,310,720 -\u003e compressed to 81,920\n```\n\n---\n\n#### ColQwen3.5 v3 — Cross-Polytope (theoretically optimal cosine partitioning)\n\n```python\nfrom pymuvera import ProjectionType\n\nenc = MUVERAEncoder(\n  dimension=320,\n  num_repetitions=8,\n  projection_type=ProjectionType.CROSS_POLYTOPE,\n  fill_empty_partitions=True,  # densifying fill used 
automatically — O(num_empty)\n  final_projection_dimension=81920,\n  seed=42,\n)\n# num_partitions = 2 * 512 = 1024 per repetition (next_power_of_2(320)=512)\n# fde_dimension before compression = 8 × 1024 × 320 = 2,621,440\n# Recommended for high-quality retrieval on complex document pages (tables, charts)\n```\n\n#### ColQwen3.5 v3 — low-rank (correctly configured)\n\n```python\nenc = MUVERAEncoder(\n    dimension=320,\n    num_simhash_projections=16,   # k must be large for low-rank to help\n    num_repetitions=4,\n    projection_type=ProjectionType.LOW_RANK_GAUSSIAN,\n    simhash_rank=4,               # r/k = 4/16 = 0.25 — meaningful low-rank\n    fill_empty_partitions=True,\n    seed=42,\n)\n# fde_dimension = 4 × 65536 × 320 = 83,886,080 — use final_projection_dimension\n```\n\n#### ColQwen2 (d=128) — calibrated eigenbasis (experimental, domain-specific corpora)\n\n```python\nfrom pymuvera import MUVERAEncoder, ProjectionType, calibrate_from_embeddings\n\n# One-time calibration on a representative corpus sample.\n# corpus_embeddings: (N, 128) token embeddings from your target corpus.\ncalibration = calibrate_from_embeddings(corpus_embeddings)\nprint(f\"Effective rank: {calibration.participation_ratio:.1f} / 128\")\ncalibration.save(\"colqwen2_calibration.npz\")\n\nenc = MUVERAEncoder(\n    dimension=128,\n    num_simhash_projections=8,\n    num_repetitions=8,\n    projection_type=ProjectionType.CALIBRATED_EIGENBASIS,\n    fill_empty_partitions=True,\n    seed=42,\n    calibration=calibration,\n)\n# fde_dimension = 8 × 256 × 128 = 262,144\n# Partition assignment emphasizes high-variance calibrated eigendirections.\n# Validate against exact MaxSim before using this mode in production.\n```\n\n---\n\n### Quality vs. 
exact MaxSim — setting realistic expectations\n\nMUVERA FDE retrieval is a **first-stage filter**, not a replacement for exact MaxSim.\nTypical recall gaps on a 512-token ColQwen3.5 corpus:\n\n| Stage | R@1 (typical) | Retrieval time |\n|---|---|---|\n| Exact MaxSim (multi-vector) | ~0.88 | slow, scales with corpus size |\n| MUVERA FDE + ANN (first stage) | ~0.63 | fast, sub-linear |\n| MUVERA FDE → MaxSim rerank top-100 | ~0.86 | fast + small rerank overhead |\n\nThe ~25 point R@1 gap between exact and FDE-only is normal and expected. Always pair\npymuvera with a MaxSim reranking step on the ANN shortlist for production use.\n\n---\n\n## Two-stage retrieval pipeline\n\nThe intended production pattern for ColQwen2 / ColBERT:\n\n```\nOffline:\n  doc token embeddings  →  encode_document()  →  FDE vector  →  ANN index\n\nOnline:\n  query token embeddings  →  encode_query()  →  FDE vector\n                                                     │\n                                              ANN search (fast, sub-linear)\n                                                     │\n                                            top-K candidate docs\n                                                     │\n                                       MaxSim re-rank on raw token embeddings\n                                                     │\n                                               final top-K results\n```\n\nStage 1 (ANN on FDE vectors) eliminates 99%+ of the corpus cheaply.\nStage 2 (exact MaxSim on raw token embeddings) reranks the small candidate\nset for full accuracy.\n\n### Minimal FAISS integration\n\n```python\nimport faiss\nimport numpy as np\nfrom pymuvera import MUVERAEncoder\n\nenc = MUVERAEncoder(dimension=128, num_simhash_projections=4, num_repetitions=2, seed=42)\ndim = enc.fde_dimension  # 4096\n\n# Build index\nindex = faiss.IndexFlatIP(dim)  # inner product ≈ Chamfer Similarity\n\n# Index documents (offline)\ndoc_embeddings = [...]  
# list of (num_tokens, 128) float32 arrays\nD = enc.encode_documents_batch(doc_embeddings)  # (N, 4096)\nindex.add(D)\n\n# Query (online)\nquery_tokens = np.random.randn(32, 128).astype(np.float32)\nq_fde = enc.encode_query(query_tokens).reshape(1, -1)\n\n_, candidate_ids = index.search(q_fde, k=100)  # stage 1: raw IP approximates MaxSim\n# stage 2: MaxSim re-rank candidate_ids with raw token embeddings ...\n```\n\n---\n\n## Reconstruction error — what degrades retrieval quality and how to fix it\n\nFDE retrieval approximates Chamfer Similarity — it does not compute it exactly.\nUnderstanding the error sources helps you configure pymuvera correctly and set\nrealistic expectations.\n\nAll plots in this section are illustrative diagrams and can be regenerated with:\n\n```bash\npython docs/generate_readme_plots.py\n```\n\n\u003e **The key insight:** FDE reconstruction error only affects *which* candidates enter your\n\u003e shortlist, not how accurately they are ranked once there. The MaxSim reranking step\n\u003e rescores every shortlisted document exactly, so only documents that miss the shortlist are lost.\n\n### Error source 1: SimHash partitioning error *(dominant)*\n\nTwo similar tokens may land in **different partitions** because a random hyperplane\nboundary falls between them. When this happens, their contribution to the dot product\nis zero instead of `cos(q, d)`.\n\nThe MUVERA paper proves the FDE dot product is an **unbiased estimator** of Chamfer\nSimilarity in expectation, but individual pairs have variance around that expectation.\n\n**Mitigation:** more `num_repetitions`. 
Each repetition draws an independent W matrix.\nVariance decreases as `1/num_repetitions`.\n\n![Variance vs repetitions](docs/images/plot1_variance_vs_repetitions.png)\n\n### Error source 2: Aggregation error *(centroid approximation)*\n\nEach non-empty partition slot holds the **centroid** of all tokens that landed there.\nWhen a query token's nearest document token shares a partition with many others, the\ncentroid may point in a meaningfully different direction.\n\n**Mitigation:** tune k so tokens-per-partition stays in the 4–10 range.\n\n![Tokens per partition vs k](docs/images/plot2_tokens_per_partition.png)\n\n### Error source 3: Empty partition error\n\nAn empty slot contributes zero to the dot product — as if no document token exists\nin that region. For a query token that would have matched a document token there,\nthe score is suppressed.\n\n**Mitigation:** `fill_empty_partitions=True`.\n\n### Error source 4: Count Sketch compression error *(if used)*\n\n`AMS_SKETCH` or `final_projection_dimension` add another approximation layer.\nCount Sketch is unbiased — `E[⟨sketch(x), sketch(y)⟩] = ⟨x, y⟩` — but variance\nscales as `1/projection_dimension`.\n\n**Mitigation:** keep `projection_dimension ≥ 64`; `final_projection_dimension ≥ 4×` your top-k shortlist size.\n\n### Error source 5: LOW_RANK_GAUSSIAN extra error *(if used)*\n\nFactoring W as AB⊤ adds further partitioning error on top of Source 1. At r=4 you\nadd roughly 25% more variance. The error still shrinks faster than the standard\nCLT rate of O(r⁻¹/²), because EGGROLL's O(r⁻¹) rate benefits from symmetry canceling\nall odd cumulants, but it is real additional error.\n\n**Mitigation:** require `r/k ≤ 0.25`. 
At `r=4, k=6` (r/k=0.67) you pay the full\nvariance penalty for almost no speed gain.\n\n![EGGROLL vs CLT convergence](docs/images/plot4_eggroll_vs_clt.png)\n\n### Error source 6: CALIBRATED_EIGENBASIS spectral bias error *(experimental)*\n\n`CALIBRATED_EIGENBASIS` deliberately changes the SimHash bucket-assignment geometry:\nhigh-variance calibrated eigendirections receive more partitioning influence. This can\nreduce reconstruction error when those high-variance directions carry the retrieval\nsignal, because similar semantic tokens are more likely to collide in the same FDE\nslots.\n\nIt can also add error. If important matches live in low-variance tail directions\n(rare visual details, small text marks, table structure, or domain-specific outliers),\neigenvalue-weighted partitioning may under-partition those directions and suppress\nrecall. This is the FDE analog of spending too much budget on the principal subspace.\n\nThe figure below is synthetic intuition, not benchmark data. Use it as the mental\nmodel for why weighted Eigenbasis must be evaluated against exact MaxSim on your\nown corpus.\n\n![Eigenbasis spectral bias tradeoff](docs/images/plot7_eigenbasis_spectral_bias.png)\n\n**Mitigation:** treat this mode as an ablation-backed experiment:\n\n- Compare `DEFAULT_IDENTITY` against `CALIBRATED_EIGENBASIS` with\n  `use_eigenvalue_weighting=False` and `True`.\n- Inspect `calibration.participation_ratio`; low effective rank is a signal to test,\n  not a guarantee.\n- Calibrate on the same domain you index.\n- Always measure FDE+ANN recall against exact MaxSim, especially for tail-heavy\n  corpora such as charts, forms, receipts, tables, or OCR-heavy pages.\n\n### Error source 7: Densifying LSH fill error *(if used)*\n\nDensifying LSH assigns empty slots via a deterministic hash rather than the\ngeometrically nearest token. 
The filled token may be far from the partition's\nregion of embedding space.\n\nThis is geometrically worse than Hamming NN fill, but the practical impact is small:\nany fill is better than zero, and the hash is consistent across queries and documents,\nso the error is systematic rather than random.\n\n**Cost comparison — why you'd accept this tradeoff:**\n\n\u003e **Hamming NN fill:** O(num_tokens × k × num_empty)\n\u003e Example: 200 empty slots, 512 tokens, k=8 → 200 × 512 × 8 = **819,200 operations**\n\u003e\n\u003e **Densifying LSH fill:** O(num_empty)\n\u003e Same example: 200 empty slots → **200 operations** (~4,000× faster)\n\n![Fill cost comparison](docs/images/plot3_fill_cost_comparison.png)\n\n### Error breakdown across common configs\n\n![Error breakdown by source](docs/images/plot6_error_breakdown.png)\n\nKey observations from the breakdown:\n\n- **SimHash partitioning error dominates** across all configs. Adding repetitions is the most effective quality knob.\n- **Empty slot error disappears** with `fill_empty_partitions=True` — the bar for `k=8 + fill` is much shorter.\n- **LOW_RANK_GAUSSIAN** at r=4 adds a visible extra band. Use r/k ≤ 0.25 to keep it small.\n- **SRHT** matches DEFAULT_IDENTITY in error profile — structured projection, no rank approximation.\n- **CALIBRATED_EIGENBASIS** is intentionally shown separately because its error can\n  move in either direction depending on the corpus spectrum and where retrieval\n  signal lives. See the spectral-bias plot above and evaluate it with\n  weighted/unweighted ablations.\n\n### The two-stage pipeline and error recovery\n\n![Two-stage recall](docs/images/plot5_two_stage_recall.png)\n\nFDE error shows up as the ~28-point R@1 gap between exact MaxSim (~0.89) and\nFDE-only retrieval (~0.61). 
The reranking step recovers most of this — FDE + rerank\nreaches ~0.86 R@1, within 3 points of exact.\n\nThe **irreducible error** comes from relevant documents that fall entirely outside the top-100\nANN candidates — the ones where SimHash partitioning error was severe enough to exclude\nthem from the shortlist. This is directly controlled by `num_repetitions`.\n\n\u003e ⚠️ **Common mistake:** measuring pymuvera quality by FDE-only R@1 without a\n\u003e reranking step. Always evaluate the two-stage pipeline.\n\n---\n\n## Attribution\n\nPython port of the C++ implementation in\n[Google's graph-mining project](https://github.com/google/graph-mining/tree/main/sketching/point_cloud),\nlicensed under Apache 2.0.\n\nLow-rank SimHash extension inspired by\n[EGGROLL: Evolution Strategies at the Hyperscale](https://eshyperscale.github.io/imgs/paper.pdf)\n(Sarkar et al., 2025).\n\nSubsampled Randomized Hadamard Transform (SRHT): Woolfe, Liberty, Rokhlin \u0026 Tygert, 2008.\n\nCross-Polytope LSH: Andoni \u0026 Razenshteyn, 2015 — *Optimal Data-Dependent Hashing for Approximate Near Neighbors*.\n\nDensifying LSH: Shrivastava, 2014 — *Asymmetric LSH (ALSH) for Sublinear Time Maximum Inner Product Search*.\n\nCalibrated Eigenbasis SimHash (`CALIBRATED_EIGENBASIS`) inspired by\n[SpectralQuant](https://github.com/Dynamis-Labs/spectralquant), Vangara \u0026 Gopinath, 2026.\n\nSee [NOTICE](NOTICE) for the full upstream attribution.\n\n---\n\n## License\n\nApache 2.0 — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmarthi%2Fpymuvera","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsmarthi%2Fpymuvera","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsmarthi%2Fpymuvera/lists"}