https://github.com/stffns/snapvec
Fast compressed ANN search via randomized Hadamard transform + Lloyd-Max quantization. Pure NumPy.
https://github.com/stffns/snapvec
ann embeddings hadamard numpy quantization rag vector-search
Last synced: 3 months ago
JSON representation
Fast compressed ANN search via randomized Hadamard transform + Lloyd-Max quantization. Pure NumPy.
- Host: GitHub
- URL: https://github.com/stffns/snapvec
- Owner: stffns
- Created: 2026-03-31T16:53:27.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-31T17:34:41.000Z (3 months ago)
- Last Synced: 2026-04-04T13:48:28.541Z (3 months ago)
- Topics: ann, embeddings, hadamard, numpy, quantization, rag, vector-search
- Language: Python
- Homepage: https://pypi.org/project/snapvec/
- Size: 22.5 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# snapvec
**Fast compressed approximate nearest-neighbor search. Pure NumPy. No heavy dependencies.**
`snapvec` implements the TurboQuant compression pipeline — randomized Hadamard transform followed by optimal Gaussian scalar quantization (Lloyd-Max) — as a self-contained Python library for embedding vector search. It achieves **8–12× compression** with **>0.92 recall@10** against float32 brute-force, using only NumPy.
```
pip install snapvec
```
---
## Quick start
```python
import numpy as np
from snapvec import SnapIndex
# Build index
idx = SnapIndex(dim=384, bits=4) # 4-bit, ~8x compression
idx.add_batch(ids=list(range(N)), vectors=embeddings)
# Query
results = idx.search(query_vector, k=10) # [(id, score), ...]
# Persist
idx.save("my_index.snpv")
idx2 = SnapIndex.load("my_index.snpv") # atomic save, v1/v2 compatible
```
---
## Technical background
### The problem: embedding vectors are expensive
Modern embedding models produce float32 vectors of dimension `d ∈ {384, 768, 1536}`.
Storing N vectors requires `4·N·d` bytes; brute-force search costs `O(N·d)` per query.
For N = 1M, d = 384: **1.5 GB RAM**, with inner products dominating inference time.
Product Quantization (PQ) splits vectors into M sub-vectors and quantizes each
independently. It is effective but requires training a K-means codebook per dataset.
Random Binary Quantization (RaBitQ, 1-bit) is fast but coarse.
**TurboQuant** (Zandieh et al., ICLR 2026, [arXiv:2504.19874](https://arxiv.org/abs/2504.19874))
achieves near-optimal distortion at b bits per coordinate **without training codebooks**,
by first rotating the space with a randomized Hadamard transform to make coordinates
approximately Gaussian, then quantizing each coordinate independently with the
optimal scalar quantizer for N(0,1).
---
### Algorithm
#### Step 1 — Normalize
Given a raw embedding `v ∈ ℝᵈ`, compute the unit vector `v̂ = v / ‖v‖` and store
`‖v‖` separately (float32, 4 bytes per vector).
#### Step 2 — Randomized Hadamard Transform (RHT)
Pad `v̂` to the next power of 2 (`d' = 2^⌈log₂ d⌉`), then apply:
```
x = (1/√d') · H · D · v̂
```
where:
- `D = diag(σ₁, …, σ_d')` — diagonal matrix of i.i.d. ±1 random signs (seed-deterministic)
- `H` — unnormalized Walsh-Hadamard matrix (butterfly pattern)
By the Johnson-Lindenstrauss lemma, each coordinate `xᵢ ≈ N(0, 1/d')`.
After rescaling `x̃ = x · √d'`, the coordinates are approximately `N(0,1)`
regardless of the original distribution of `v`.
**Complexity:** O(d log d) — no matrix multiplication, no codebook training.
#### Step 3 — Lloyd-Max scalar quantization
The optimal scalar quantizer for N(0,1) at b bits partitions ℝ into 2^b intervals
and assigns each the conditional mean as reconstruction value.
These boundaries and centroids are precomputed and hardcoded in `snapvec._codebooks`
(no scipy required at runtime):
| bits | levels | distortion (MSE) | bytes/coord (disk) |
|------|--------|------------------|--------------------|
| 2 | 4 | 0.1175 | 0.25 |
| 3 | 8 | 0.0311 | 0.375 |
| 4 | 16 | 0.0077 | 0.50 |
The quantized vector is stored as a `uint8` index matrix, bit-packed to `b/8`
bytes per coordinate on disk.
#### Step 4 — Approximate inner product
At search time the query `q` is rotated (not quantized) and the approximate
cosine similarity is computed as:
```
score(q, v) = (1/d') · Σᵢ centroid[idx_qᵢ] · centroid[idx_vᵢ]
```
This is a single float16 matrix–vector product against the cached centroid expansions.
---
### TurboQuant_prod: unbiased estimator with QJL correction
The MSE quantizer introduces a small systematic downward bias. The `use_prod=True` mode
corrects this using a **Quantized Johnson-Lindenstrauss (QJL)** residual:
**Build time (per stored vector):**
1. Quantize at `(b-1)` bits MSE, compute residual `r = x̃ - x̃_MSE`
2. Store `sign(S·r)` as a 1-bit vector (int8 ±1 in practice),
where `S ∈ ℝ^(d'×d')` is a fixed random Gaussian matrix
3. Store `‖r‖ / √d'` (one float32 per vector)
**Query time (correction term):**
```
correctionᵢ = √(π/2) / d' · ‖rᵢ‖ · dot(S·q̂, sign(S·rᵢ))
final_scoreᵢ = mse_scoreᵢ + correctionᵢ
```
This follows from Lemma 4 of Zandieh et al. (2025):
`E[sign(S·r)] = √(2/π) · S·r / ‖S·r‖`, giving an unbiased estimate of `⟨r, q̂⟩`.
**When to use `use_prod=True`:**
- When you need accurate inner product magnitudes (KV-cache, attention approximation)
- **Not** recommended for pure ranking/NNS — the added QJL variance degrades recall@k
relative to MSE-only at equal total bits
---
### Compression ratios
For N vectors of dimension `d = 384` (BGE-small):
| Backend | Bytes/vector (disk) | Ratio vs float32 |
|----------------|---------------------|------------------|
| float32 | 1 536 | 1.0× |
| 4-bit snapvec | 192 + 4 | **7.9×** |
| 3-bit snapvec | 144 + 4 | **10.4×** |
| 2-bit snapvec | 96 + 4 | **15.4×** |
| int8 (naïve) | 384 + 4 | 3.9× |
The 4-byte overhead is the per-vector norm. In RAM, indices are stored as uint8
(~3× vs float32); bit-packing applies on disk (~8× vs float32 at 4-bit).
---
### Recall benchmarks
Measured on synthetic unit-sphere vectors (`d=384`, `N=10 000`, 100 queries).
**Baseline: exact cosine float32 brute-force.**
| bits | recall@1 | recall@10 | recall@50 |
|------|----------|-----------|-----------|
| 2 | 0.72 | 0.83 | 0.91 |
| 3 | 0.81 | 0.91 | 0.96 |
| 4 | 0.86 | 0.93 | 0.95 |
Recall improves with clustered (real-world) data. On BGE-small-en embeddings
from mixed document corpora, 4-bit achieves **recall@10 ≈ 0.95**.
> **Note on published results:** The TurboQuant paper (Zandieh et al., 2025) reports
> recall up to 0.99, measured against HNSW graph navigation (not brute-force float32),
> on GloVe `d=200` data, using recall@1 with large `k_probe`. These conditions differ
> from the above; both results are correct under their respective definitions.
---
### File format (`.snpv`)
```
Offset Size Field
──────────────────────────────────────────────────
0 4 B magic: "HDMX"
4 4 B version: uint32 (1 or 2)
8 4 B dim: uint32 — original embedding dimension
12 4 B bits: uint32 — total bits (2, 3, or 4)
16 4 B seed: uint32 — rotation seed
20 4 B n: uint32 — number of stored vectors
24 4 B flags: uint32 — bit-0: use_prod [v2 only]
──────────────────────────────────────────────────
28 4 B packed_len: uint32
32 * indices: bit-packed uint8 MSE indices
n×4 B norms: float32 per-vector original norms
[prod only]
n×d' B qjl_signs: int8 sign(S·r) per vector
n×4 B rnorms: float32 ‖r‖/√d per vector
──────────────────────────────────────────────────
n×(2+L) ids: uint16-length-prefixed UTF-8 strings
```
Saves are **atomic** on POSIX: writes to `.snpv.tmp` then `os.replace()`.
Backward compatible: v1 files (mse-only) load correctly in any version.
---
## API reference
### `SnapIndex(dim, bits=4, seed=0, use_prod=False)`
| Parameter | Type | Default | Description |
|------------|-------|---------|-------------|
| `dim` | int | — | Embedding dimension |
| `bits` | int | 4 | Bits per coordinate: 2, 3, or 4 |
| `seed` | int | 0 | Rotation seed — must be consistent across build and query |
| `use_prod` | bool | False | Enable QJL unbiased estimator (requires bits ≥ 3) |
### Methods
```python
idx.add(id, vector) # Add one vector
idx.add_batch(ids, vectors) # Add N vectors (~50x faster than loop)
idx.delete(id) -> bool # Remove by id, O(1) lookup
idx.search(query, k=10) -> list # [(id, score), ...] descending
idx.save(path) # Atomic binary save to .snpv
SnapIndex.load(path) # Load from .snpv file
idx.stats() -> dict # Compression / memory diagnostics
len(idx) # Number of stored vectors
repr(idx) # SnapIndex(dim=384, bits=4, mode=mse, n=1000)
```
---
## Relation to TurboQuant / PolarQuant
`snapvec` implements the core compression pipeline from:
> Zandieh, A., Daliri, M., Hadian, A., & Mirrokni, V. (2025).
> **TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.**
> *ICLR 2026.* [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
The same algorithm was published concurrently as "PolarQuant" at AISTATS 2026.
Both names were already taken on PyPI; `snapvec` = **Hada**mard + Lloyd-**Max**,
named after its two core operations.
Key contributions of this implementation over the reference:
- **No scipy** — codebooks hardcoded, numpy is the only runtime dependency
- **Batch WHT** — single O(n·d·log d) call for bulk inserts (~50x faster than loop)
- **Float16 cache** — centroid expansions in half precision, ~2x faster matmul
- **O(1) delete** — `_id_to_pos` dict + position compaction
- **Atomic saves** — `.snpv.tmp` → `os.replace()` pattern
- **Versioned format** — v1/v2 both loadable, forward-compatible flags field
---
## Installation
```bash
pip install snapvec
```
**Requirements:** Python >= 3.10, NumPy >= 1.24. No other runtime dependencies.
For development:
```bash
git clone https://github.com/stffns/snapvec
cd snapvec
pip install -e .
pytest tests/ -v
```
---
## License
MIT © 2025 Jayson Steffens.
The TurboQuant algorithm is described in [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
by Zandieh et al. (Google Research / ICLR 2026). This package is an independent implementation.