https://github.com/stffns/snapvec

Fast compressed ANN search via randomized Hadamard transform + Lloyd-Max quantization. Pure NumPy.
Host: GitHub
URL: https://github.com/stffns/snapvec
Owner: stffns
Created: 2026-03-31T16:53:27.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-31T17:34:41.000Z (4 months ago)
Last Synced: 2026-04-04T13:48:28.541Z (3 months ago)
Topics: ann, embeddings, hadamard, numpy, quantization, rag, vector-search
Language: Python
Homepage: https://pypi.org/project/snapvec/
Size: 22.5 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# snapvec

**Fast compressed approximate nearest-neighbor search.  Pure NumPy.  No heavy dependencies.**

`snapvec` implements the TurboQuant compression pipeline (a randomized Hadamard transform followed by optimal Gaussian scalar quantization, Lloyd-Max) as a self-contained Python library for embedding vector search.  It achieves roughly **8–15× compression** on disk, with **recall@10 ≈ 0.93** at 4 bits against float32 brute-force, using only NumPy.

```

pip install snapvec

```

---

## Quick start

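```python
import numpy as np
from snapvec import SnapIndex

# Placeholder data; replace with real embeddings from your model
N, dim = 10_000, 384
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((N, dim)).astype(np.float32)
query_vector = embeddings[0]

# Build index
idx = SnapIndex(dim=dim, bits=4)          # 4-bit, ~8x compression
idx.add_batch(ids=list(range(N)), vectors=embeddings)

# Query
results = idx.search(query_vector, k=10)  # [(id, score), ...]

# Persist
idx.save("my_index.snpv")                 # atomic save
idx2 = SnapIndex.load("my_index.snpv")    # loads v1 and v2 files
```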

---

## Technical background

### The problem: embedding vectors are expensive

Modern embedding models produce float32 vectors of dimension `d ∈ {384, 768, 1536}`. Storing N vectors requires `4·N·d` bytes, and brute-force search costs `O(N·d)` per query. For N = 1M, d = 384, that is **1.5 GB of RAM**, with inner products dominating query time.

Product Quantization (PQ) splits vectors into M sub-vectors and quantizes each independently. It is effective but requires training a K-means codebook per dataset. Random Binary Quantization (RaBitQ, 1-bit) is fast but coarse.

**TurboQuant** (Zandieh et al., ICLR 2026, [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)) achieves near-optimal distortion at b bits per coordinate **without training codebooks**, by first rotating the space with a randomized Hadamard transform to make coordinates approximately Gaussian, then quantizing each coordinate independently with the optimal scalar quantizer for N(0,1).

---

### Algorithm

#### Step 1 — Normalize

Given a raw embedding `v ∈ ℝᵈ`, compute the unit vector `v̂ = v / ‖v‖` and store `‖v‖` separately (float32, 4 bytes per vector).

#### Step 2 — Randomized Hadamard Transform (RHT)

Pad `v̂` to the next power of 2 (`d' = 2^⌈log₂ d⌉`), then apply:

```

x = (1/√d') · H · D · v̂

```

where:

- `D = diag(σ₁, …, σ_d')` — diagonal matrix of i.i.d. ±1 random signs (seed-deterministic)

- `H` — unnormalized Walsh-Hadamard matrix (butterfly pattern)

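Each output coordinate is a randomly signed sum of all `d'` inputs, so a central-limit argument (the same concentration that fast Johnson-Lindenstrauss transforms rely on) gives `xᵢ ≈ N(0, 1/d')`. After rescaling `x̃ = x · √d'`, the coordinates are approximately `N(0,1)` regardless of the original distribution of `v`.

**Complexity:** O(d' log d') per vector, with no dense matrix multiplication and no codebook training.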
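Steps 1 and 2 can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not snapvec's internal code: the helper names `fwht` and `rht` are hypothetical, and drawing the signs from `np.random.default_rng(seed)` is one possible way to make `D` seed-deterministic.

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Unnormalized fast Walsh-Hadamard transform along the last axis."""
    x = np.array(x, dtype=np.float64)  # work on a copy
    n = x.shape[-1]
    if n < 1 or n & (n - 1):
        raise ValueError("last axis length must be a power of two")
    h = 1
    while h < n:
        # Butterfly: split into blocks of 2*h and combine the two halves.
        y = x.reshape(*x.shape[:-1], n // (2 * h), 2, h)
        a = y[..., 0, :] + y[..., 1, :]
        b = y[..., 0, :] - y[..., 1, :]
        x = np.stack([a, b], axis=-2).reshape(*x.shape[:-1], n)
        h *= 2
    return x

def rht(v: np.ndarray, seed: int = 0) -> np.ndarray:
    """Normalize, zero-pad to d' (next power of two), flip signs, rotate."""
    v = np.asarray(v, dtype=np.float64)
    d = v.shape[-1]
    d_pad = 1 << (d - 1).bit_length()
    v_hat = v / np.linalg.norm(v, axis=-1, keepdims=True)
    padded = np.zeros((*v.shape[:-1], d_pad))
    padded[..., :d] = v_hat
    signs = np.random.default_rng(seed).choice([-1.0, 1.0], size=d_pad)
    return fwht(padded * signs) / np.sqrt(d_pad)

v = np.random.default_rng(1).standard_normal(384)
x = rht(v)                        # unit-norm vector of length d' = 512
x_tilde = x * np.sqrt(x.size)     # coordinates roughly N(0, 1)
```

Because `(1/√d') · H` is orthogonal, the rotation preserves the unit norm, so inner products between rotated vectors equal those between the original unit vectors.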

#### Step 3 — Lloyd-Max scalar quantization

The optimal scalar quantizer for N(0,1) at b bits partitions ℝ into 2^b intervals and assigns each the conditional mean as reconstruction value. These boundaries and centroids are precomputed and hardcoded in `snapvec._codebooks` (no scipy required at runtime):

| bits | levels | distortion (MSE) | bytes/coord (disk) |
|------|--------|------------------|--------------------|
| 2    | 4      | 0.1175           | 0.25               |
| 3    | 8      | 0.0345           | 0.375              |
| 4    | 16     | 0.0095           | 0.50               |
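For readers who want to see where such a codebook comes from, here is a minimal sketch of the Lloyd-Max iteration for a unit Gaussian, using only the standard library and NumPy. The function name `lloyd_max_gaussian`, the starting grid, and the iteration count are assumptions for illustration; snapvec ships its codebooks precomputed.

```python
import math
import numpy as np

def _pdf(x: float) -> float:
    """Standard normal density; evaluates to 0.0 at +/- infinity."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lloyd_max_gaussian(bits: int, iters: int = 1000):
    """Return (boundaries, centroids, mse) of a 2**bits-level quantizer for N(0,1)."""
    centroids = np.linspace(-2.0, 2.0, 2 ** bits)  # arbitrary symmetric start
    for _ in range(iters):
        # Nearest-neighbor rule: boundaries sit halfway between centroids.
        mids = (centroids[:-1] + centroids[1:]) / 2.0
        edges = np.concatenate(([-np.inf], mids, [np.inf]))
        cells = list(zip(edges[:-1], edges[1:]))
        probs = np.array([_cdf(b) - _cdf(a) for a, b in cells])
        # Centroid rule: conditional mean of N(0,1) on each cell.
        centroids = np.array([_pdf(a) - _pdf(b) for a, b in cells]) / probs
    # With centroids equal to cell means, MSE = E[x^2] - sum(p_i * c_i^2).
    mse = 1.0 - float(np.sum(probs * centroids ** 2))
    return mids, centroids, mse

codebooks = {b: lloyd_max_gaussian(b) for b in (2, 3, 4)}
for b, (_, _, mse) in codebooks.items():
    print(f"{b} bits: distortion {mse:.4f}")
```

The two alternating rules (midpoint boundaries, conditional-mean centroids) are the defining conditions of a Lloyd-Max quantizer, so the fixed point of this loop is the kind of table that can be hardcoded once and reused for every dataset.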

The quantized vector is stored as a `uint8` index matrix, bit-packed to `b/8` bytes per coordinate on disk.
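One way to bit-pack `b`-bit codes is with `np.packbits`. The sketch below is an assumption about a workable layout (most significant bit first, codes concatenated without padding between them), not a description of the actual `.snpv` byte order; `pack_indices` and `unpack_indices` are hypothetical helper names.

```python
import numpy as np

def pack_indices(idx: np.ndarray, bits: int) -> np.ndarray:
    """Pack uint8 codes (each < 2**bits) into a dense byte stream."""
    idx = np.asarray(idx, dtype=np.uint8).ravel()
    # Expand each code into its `bits` low-order bits, most significant first.
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint8)
    bit_matrix = (idx[:, None] >> shifts) & 1
    return np.packbits(bit_matrix.ravel())  # zero-pads the final byte

def unpack_indices(packed: np.ndarray, bits: int, count: int) -> np.ndarray:
    """Inverse of pack_indices for `count` codes."""
    flat = np.unpackbits(packed)[: count * bits].reshape(count, bits)
    weights = (1 << np.arange(bits - 1, -1, -1)).astype(np.uint8)
    return (flat * weights).sum(axis=1).astype(np.uint8)

codes = np.random.default_rng(0).integers(0, 8, size=384, dtype=np.uint8)
packed = pack_indices(codes, bits=3)          # 384 * 3 / 8 = 144 bytes
restored = unpack_indices(packed, bits=3, count=codes.size)
```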

#### Step 4 — Approximate inner product

At search time the query `q` is normalized, rotated, and rescaled to `q̃` (but not quantized), and the approximate cosine similarity is computed as:

```
score(q, v) = (1/d') · Σᵢ q̃ᵢ · centroid[idx_vᵢ]
```

This is a single float16 matrix–vector product against the cached centroid expansions.
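A minimal sketch of this scoring step, assuming the vectors are already rotated and rescaled so that their norm is `√d'`. The 2-bit levels below are approximate Lloyd-Max values for N(0,1) typed in for illustration, not read from `snapvec._codebooks`, and the helper names are hypothetical. The sketch stores the expansion in float16 but accumulates in float32 for simplicity.

```python
import numpy as np

# Approximate 2-bit Lloyd-Max levels for N(0,1); illustrative values only.
CENTROIDS = np.array([-1.510, -0.4528, 0.4528, 1.510], dtype=np.float32)
BOUNDARIES = np.array([-0.9816, 0.0, 0.9816], dtype=np.float32)

def quantize(x_tilde: np.ndarray) -> np.ndarray:
    """Map rescaled coordinates to codebook indices (uint8)."""
    return np.searchsorted(BOUNDARIES, x_tilde).astype(np.uint8)

def approx_scores(q_tilde: np.ndarray, codes: np.ndarray) -> np.ndarray:
    """(1/d') * sum_i q_tilde[i] * centroid[code[i]] for every stored row."""
    d_pad = q_tilde.shape[-1]
    expanded = CENTROIDS[codes].astype(np.float16)  # cached centroid expansion
    return expanded.astype(np.float32) @ q_tilde.astype(np.float32) / d_pad

rng = np.random.default_rng(0)
d_pad, n = 512, 200
stored = rng.standard_normal((n, d_pad))
stored *= np.sqrt(d_pad) / np.linalg.norm(stored, axis=1, keepdims=True)
codes = quantize(stored)

query = stored[17]                      # query identical to one stored vector
exact = stored @ query / d_pad          # exact cosine similarities
approx = approx_scores(query, codes)
best = int(np.argmax(approx))           # index of the top-ranked vector
```

In this toy setup the matching vector still ranks first, but its approximate score comes out below the exact cosine of 1.0, which illustrates the shrinkage that MSE quantization introduces.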

---

### TurboQuant_prod: unbiased estimator with QJL correction

The MSE quantizer introduces a small systematic downward bias. The `use_prod=True` mode corrects this using a **Quantized Johnson-Lindenstrauss (QJL)** residual:

**Build time (per stored vector):**

1. Quantize at `(b-1)` bits MSE, compute residual `r = x̃ - x̃_MSE`

2. Store `sign(S·r)` as a 1-bit vector (int8 ±1 in practice),

   where `S ∈ ℝ^(d'×d')` is a fixed random Gaussian matrix

3. Store `‖r‖ / √d'` (one float32 per vector)

**Query time (correction term):**

```

correctionᵢ = √(π/2) / d' · ‖rᵢ‖ · dot(S·q̂, sign(S·rᵢ))

final_scoreᵢ = mse_scoreᵢ + correctionᵢ

```

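This follows from Lemma 4 of Zandieh et al. (2025): for each Gaussian row `s` of `S`, `E[⟨s, q̂⟩ · sign(⟨s, r⟩)] = √(2/π) · ⟨q̂, r⟩ / ‖r‖`, so averaging over the `d'` rows and rescaling by `√(π/2) · ‖r‖` gives an unbiased estimate of `⟨r, q̂⟩`.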
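The unbiasedness claim can be checked numerically. The sketch below is a hypothetical stand-alone experiment, not snapvec code: it redraws the Gaussian matrix on every trial purely to average out the randomness, whereas the library keeps `S` fixed. The function name `qjl_estimate` and the dimensions are assumptions.

```python
import numpy as np

def qjl_estimate(q: np.ndarray, r: np.ndarray, S: np.ndarray) -> float:
    """Estimate <q, r> from sign(S r), ||r||, and the projected query S q."""
    m = S.shape[0]
    signs = np.sign(S @ r)          # 1-bit sketch stored at build time
    r_norm = np.linalg.norm(r)      # one float stored at build time
    return float(np.sqrt(np.pi / 2) / m * r_norm * ((S @ q) @ signs))

rng = np.random.default_rng(0)
d = 64
q = rng.standard_normal(d)
q /= np.linalg.norm(q)
r = 0.1 * rng.standard_normal(d)    # stands in for a quantization residual

true_ip = float(q @ r)
estimates = [qjl_estimate(q, r, rng.standard_normal((d, d))) for _ in range(2000)]
mean_estimate = float(np.mean(estimates))
spread = float(np.std(estimates))   # per-estimate noise added by the sketch
```

The mean of the estimates tracks the true inner product closely, while `spread` shows the extra variance each individual estimate carries, which is the trade-off discussed next.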

**When to use `use_prod=True`:**

- When you need accurate inner product magnitudes (KV-cache, attention approximation)

- **Not** recommended for pure ranking/NNS: the added QJL variance degrades recall@k relative to MSE-only at equal total bits

---

### Compression ratios

For N vectors of dimension `d = 384` (BGE-small):

| Backend        | Bytes/vector (disk) | Ratio vs float32 |
|----------------|---------------------|------------------|
| float32        | 1536                | 1.0×             |
| 4-bit snapvec  | 192 + 4             | **7.8×**         |
| 3-bit snapvec  | 144 + 4             | **10.4×**        |
| 2-bit snapvec  | 96 + 4              | **15.4×**        |
| int8 (naïve)   | 384 + 4             | 4.0×             |

The 4-byte overhead is the per-vector norm. In RAM, indices are stored as uint8 (~3× vs float32); bit-packing applies on disk (~8× vs float32 at 4-bit).
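The on-disk sizes follow from simple arithmetic. As a hedged sketch, assuming `b` bits per coordinate over the original `d` coordinates plus one float32 norm (the hypothetical helper `bytes_per_vector` ignores the file header and stored ids):

```python
def bytes_per_vector(dim: int, bits: int, norm_bytes: int = 4) -> int:
    """On-disk bytes for one vector: bit-packed codes plus a float32 norm."""
    return (dim * bits + 7) // 8 + norm_bytes

def ratio_vs_float32(dim: int, bits: int) -> float:
    """Compression ratio relative to storing `dim` float32 values."""
    return 4 * dim / bytes_per_vector(dim, bits)

for bits in (4, 3, 2):
    size = bytes_per_vector(384, bits)
    print(f"{bits}-bit: {size} bytes, {ratio_vs_float32(384, bits):.1f}x")
```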

---

### Recall benchmarks

Measured on synthetic unit-sphere vectors (`d=384`, `N=10 000`, 100 queries).

**Baseline: exact cosine float32 brute-force.**

| bits | recall@1 | recall@10 | recall@50 |
|------|----------|-----------|-----------|
| 2    | 0.72     | 0.83      | 0.91      |
| 3    | 0.81     | 0.91      | 0.96      |
| 4    | 0.86     | 0.93      | 0.95      |

Recall improves with clustered (real-world) data. On BGE-small-en embeddings from mixed document corpora, 4-bit achieves **recall@10 ≈ 0.95**.
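For reference, recall@k against a brute-force baseline can be computed as below. This is a generic sketch of the metric, assuming score matrices of shape `(n_queries, n_vectors)`; `recall_at_k` is an illustrative helper, not part of the snapvec API or its benchmark script.

```python
import numpy as np

def recall_at_k(exact_scores: np.ndarray, approx_scores: np.ndarray, k: int) -> float:
    """Mean fraction of the exact top-k ids that appear in the approximate top-k."""
    exact_top = np.argsort(-exact_scores, axis=1)[:, :k]
    approx_top = np.argsort(-approx_scores, axis=1)[:, :k]
    hits = [len(set(e) & set(a)) for e, a in zip(exact_top, approx_top)]
    return float(np.mean(hits)) / k

exact = np.array([[0.9, 0.8, 0.1, 0.0]])
approx = np.array([[0.9, 0.1, 0.8, 0.0]])   # ranks id 2 above id 1
example_recall = recall_at_k(exact, approx, k=2)
```

In this toy case the exact top-2 is `{0, 1}` and the approximate top-2 is `{0, 2}`, so one of two neighbors is recovered.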

> **Note on published results:** The TurboQuant paper (Zandieh et al., 2025) reports
> recall up to 0.99, measured against HNSW graph navigation (not brute-force float32),
> on GloVe `d=200` data, using recall@1 with large `k_probe`. These conditions differ
> from the above; both results are correct under their respective definitions.

---

### File format (`.snpv`)

```

Offset  Size   Field

──────────────────────────────────────────────────

0       4 B    magic: "HDMX"

4       4 B    version: uint32 (1 or 2)

8       4 B    dim: uint32  — original embedding dimension

12      4 B    bits: uint32 — total bits (2, 3, or 4)

16      4 B    seed: uint32 — rotation seed

20      4 B    n: uint32    — number of stored vectors

24      4 B    flags: uint32 — bit-0: use_prod  [v2 only]

──────────────────────────────────────────────────

28      4 B    packed_len: uint32

32      *      indices: bit-packed uint8 MSE indices

       n×4 B   norms: float32 per-vector original norms

[prod only]

       n×d' B  qjl_signs: int8 sign(S·r) per vector

       n×4 B   rnorms: float32 ‖r‖/√d' per vector

──────────────────────────────────────────────────

       n×(2+L) ids: uint16-length-prefixed UTF-8 strings

```
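The fixed-size header can be read with the standard `struct` module. The sketch below rests on two assumptions that the layout table does not state: fields are little-endian, and in v1 files `packed_len` directly follows `n` because the flags word is absent. `parse_snpv_header` is a hypothetical helper, not a snapvec function.

```python
import struct

def parse_snpv_header(buf: bytes) -> dict:
    """Decode the leading header fields of a .snpv buffer (assumed little-endian)."""
    magic, version, dim, bits, seed, n = struct.unpack_from("<4s5I", buf, 0)
    if magic != b"HDMX":
        raise ValueError("not a .snpv file")
    offset = 24
    flags = 0
    if version >= 2:                       # flags word exists in v2 only
        (flags,) = struct.unpack_from("<I", buf, offset)
        offset += 4
    (packed_len,) = struct.unpack_from("<I", buf, offset)
    return {
        "version": version, "dim": dim, "bits": bits, "seed": seed, "n": n,
        "use_prod": bool(flags & 1),       # bit 0 of flags
        "packed_len": packed_len,
        "payload_offset": offset + 4,      # where the packed indices start
    }

# Synthetic headers built in memory for illustration.
v2_header = struct.pack("<4s7I", b"HDMX", 2, 384, 4, 0, 1000, 1, 192_000)
v1_header = struct.pack("<4s6I", b"HDMX", 1, 384, 4, 0, 1000, 192_000)
info_v2 = parse_snpv_header(v2_header)
info_v1 = parse_snpv_header(v1_header)
```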

Saves are **atomic** on POSIX: writes to `.snpv.tmp` then `os.replace()`.

Backward compatible: v1 files (mse-only) load correctly in any version.
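The write-then-rename pattern can be sketched with the standard library alone. `atomic_write` is a hypothetical helper shown for illustration; the `fsync` call is an extra durability step that this sketch adds, and the README does not say whether snapvec performs it.

```python
import os
import tempfile

def atomic_write(path: str, payload: bytes) -> None:
    """Write payload to path so readers never observe a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())    # push bytes to disk before the rename
    os.replace(tmp_path, path)   # atomic rename on POSIX filesystems

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "my_index.snpv")
    atomic_write(target, b"HDMX")
    with open(target, "rb") as fh:
        round_trip = fh.read()
    leftover_tmp = os.path.exists(target + ".tmp")
```

If the process dies before `os.replace`, the previous file at `path` is left untouched and only the `.tmp` file is incomplete.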

---

## API reference

### `SnapIndex(dim, bits=4, seed=0, use_prod=False)`

| Parameter  | Type  | Default  | Description |
|------------|-------|----------|-------------|
| `dim`      | int   | required | Embedding dimension |
| `bits`     | int   | 4        | Bits per coordinate: 2, 3, or 4 |
| `seed`     | int   | 0        | Rotation seed; must be consistent across build and query |
| `use_prod` | bool  | False    | Enable QJL unbiased estimator (requires bits ≥ 3) |

### Methods

```python

idx.add(id, vector)                    # Add one vector

idx.add_batch(ids, vectors)            # Add N vectors (~50x faster than loop)

idx.delete(id) -> bool                 # Remove by id, O(1) lookup

idx.search(query, k=10) -> list        # [(id, score), ...] descending

idx.save(path)                         # Atomic binary save to .snpv

SnapIndex.load(path)                # Load from .snpv file

idx.stats() -> dict                    # Compression / memory diagnostics

len(idx)                               # Number of stored vectors

repr(idx)                              # SnapIndex(dim=384, bits=4, mode=mse, n=1000)

```

---

## Relation to TurboQuant / PolarQuant

`snapvec` implements the core compression pipeline from:

> Zandieh, A., Daliri, M., Hadian, A., & Mirrokni, V. (2025).
> **TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate.**
> *ICLR 2026.* [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)

The same algorithm was published concurrently as "PolarQuant" at AISTATS 2026. Both names were already taken on PyPI, so the package is published as `snapvec`; its two core operations are the randomized Hadamard transform and Lloyd-Max quantization.

Key contributions of this implementation over the reference:

- **No scipy** — codebooks hardcoded, numpy is the only runtime dependency

- **Batch WHT** — single O(n·d·log d) call for bulk inserts (~50x faster than loop)

- **Float16 cache** — centroid expansions in half precision, ~2x faster matmul

- **O(1) delete** — `_id_to_pos` dict + position compaction

- **Atomic saves** — `.snpv.tmp` → `os.replace()` pattern

- **Versioned format** — v1/v2 both loadable, forward-compatible flags field
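The O(1) delete mentioned above can be illustrated with a dictionary plus swap-with-last compaction. This is a sketch of the general technique under the assumption that row order does not matter; `TinyStore` is a hypothetical class, and snapvec's actual compaction strategy may differ.

```python
import numpy as np

class TinyStore:
    """Toy id -> row store with constant-time delete."""

    def __init__(self, dim: int) -> None:
        self.codes = np.empty((0, dim), dtype=np.uint8)
        self.ids: list = []
        self._id_to_pos: dict = {}

    def add(self, id_, code: np.ndarray) -> None:
        self._id_to_pos[id_] = len(self.ids)
        self.ids.append(id_)
        # vstack copies on every add; fine for a sketch, slow for bulk loads.
        self.codes = np.vstack([self.codes, np.asarray(code, dtype=np.uint8)[None]])

    def delete(self, id_) -> bool:
        pos = self._id_to_pos.pop(id_, None)
        if pos is None:
            return False
        last = len(self.ids) - 1
        if pos != last:
            # Move the last row into the hole and update its position.
            self.codes[pos] = self.codes[last]
            moved = self.ids[last]
            self.ids[pos] = moved
            self._id_to_pos[moved] = pos
        self.ids.pop()
        self.codes = self.codes[:last]   # a view, so no copy is made
        return True

store = TinyStore(dim=4)
for name, value in (("a", 1), ("b", 2), ("c", 3)):
    store.add(name, np.full(4, value))
first_delete = store.delete("a")    # "c" moves into row 0
second_delete = store.delete("a")   # already gone
```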

---

## Installation

```bash

pip install snapvec

```

**Requirements:** Python >= 3.10, NumPy >= 1.24.  No other runtime dependencies.

For development:

```bash

git clone https://github.com/stffns/snapvec

cd snapvec

pip install -e .

pytest tests/ -v

```

---

## License

MIT © 2025 Jayson Steffens.

The TurboQuant algorithm is described in [arXiv:2504.19874](https://arxiv.org/abs/2504.19874) by Zandieh et al. (Google Research / ICLR 2026). This package is an independent implementation.