https://github.com/ilyajob05/turboquant-space

SIMD-accelerated 4/8-bit vector quantization for approximate nearest neighbor search, based on TurboQuant (ICLR 2026). Standalone C++17 library with Python bindings
https://github.com/ilyajob05/turboquant-space

avx cpp17 header-only python quantization simd turbo-quant turboquant vector-quantization

Last synced: 2 months ago
JSON representation

SIMD-accelerated 4/8-bit vector quantization for approximate nearest neighbor search, based on TurboQuant (ICLR 2026). Standalone C++17 library with Python bindings

Host: GitHub
URL: https://github.com/ilyajob05/turboquant-space
Owner: ilyajob05
License: mit
Created: 2026-04-11T21:38:56.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-15T22:50:47.000Z (3 months ago)
Last Synced: 2026-04-17T00:39:37.244Z (2 months ago)
Topics: avx, cpp17, header-only, python, quantization, simd, turbo-quant, turboquant, vector-quantization
Language: C++
Homepage:
Size: 253 KB
Stars: 6
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # turboquant-space

![License](https://img.shields.io/pypi/l/turboquant-space)

![Build](https://img.shields.io/github/actions/workflow/status/ilyajob05/turboquant-space/publish.yml)

![Python](https://img.shields.io/pypi/pyversions/turboquant-space)

![PyPI](https://img.shields.io/pypi/v/turboquant-space)

This library was inspired by the article https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/. The library is optimized for efficient data allocation in memory for 3+1 and 7+1 bit quantization schemes.

SIMD-accelerated 4/8-bit vector quantization for approximate nearest neighbor

search, based on **TurboQuant** (ICLR 2026). Standalone C++17 library with

Python bindings.

```bash

pip install turboquant-space

```

```python

from turboquant import TurboQuantSpace

import numpy as np

space = TurboQuantSpace(dim=128, bits_per_coord=8, num_threads=4)

X = np.random.randn(100_000, 128).astype(np.float32)

q = np.random.randn(128).astype(np.float32)

codes = space.encode_batch(X)              # (100_000, code_size) uint8

dists = space.distance_1_to_n(q, codes)    # (100_000,) float32

```

That is the whole mental model: `encode` once, then `distance_*` against the

codes. No index to build, no state to persist beyond `codes`.

### Torch Example

```python

from turboquant import TurboQuantSpace

import numpy as np

torch.manual_seed(42)

n, dim = 10000, 768

x = torch.randn(n, dim)

x = torch.nn.functional.normalize(x, dim=1)

tq = TurboQuantSpace(dim=dim, bits_per_coord=8)

x_np = x.detach().cpu().numpy().astype(np.float32, copy=False)

x_np = np.ascontiguousarray(x_np)

codes = tq.encode_batch(x_np)

raw_bytes = x.numel() * 4

comp_bytes = codes.nbytes

print(f"raw   : {raw_bytes / 1e6:.2f} MB")

print(f"codes : {comp_bytes / 1e6:.2f} MB  ({raw_bytes / comp_bytes:.1f}x)")

print(f"code_size_bytes = {tq.code_size_bytes()}")

q = x_np[0]

d = tq.distance_1_to_n(q, codes)

print("top-5 nearest:", np.argsort(d)[:5])

```

---

## What it does, briefly

TurboQuant encodes each float32 vector into a compact code of

`bits_per_coord` bits per coordinate using a randomized Walsh–Hadamard

rotation followed by Lloyd–Max scalar quantization, plus one QJL sign bit per

coordinate for an unbiased residual correction. Distances between a raw query

and a packed code (asymmetric) or between two packed codes (symmetric) are

computed directly on the quantized representation with hand-written NEON /

SSE / AVX kernels.

**Concretely you get:**

| bits_per_coord | layout               | bytes / vec (dim=128) | compression vs fp32 |

|----------------|----------------------|-----------------------|---------------------|

| 4              | nibble-packed        | 76                    | 6.7×                |

| 8              | one byte per coord   | 140                   | 3.7×                |

(Plus 12 bytes of metadata — norm, γ, σ — per code.)

**What it is not:** not a graph index, not an IVF, not a drop-in replacement

for FAISS. It is the *distance* layer. Plug it into your own index, or use

`distance_1_to_n` as brute-force search on batches up to a few million.

---

## Install

```bash

pip install turboquant-space

```

Prebuilt wheels are published for CPython 3.11–3.13 on Linux (x86\_64,

aarch64), macOS (x86\_64, arm64), and Windows (AMD64). They target a

conservative CPU baseline — **x86-64-v3** (AVX2 + FMA + BMI2) on x64 and

**armv8-a** (NEON) on arm64 — so a single wheel runs on anything produced in

the last ~8 years. A C++ compiler is **not** required for this path.

### Build from source for maximum performance

The prebuilt wheels trade a few percent for portability. If you have a C++

compiler and want the binary tuned to *your* CPU (AVX-512 on Zen4 / Ice Lake,

SVE on Graviton, etc.), force pip to skip the wheel and compile from sdist:

```bash

pip install turboquant-space --no-binary turboquant-space

```

This invokes CMake with `-march=native`, so every available instruction set

on the build machine is enabled. Requires CMake ≥ 3.18 and a C++17 compiler;

on macOS also `brew install libomp` for multi-threaded batch ops.

### From a git checkout

```bash

git clone https://github.com/ilyajob05/turboquant-space

cd turboquant-space

uv sync                       # or: pip install -e .

```

Same story: local builds use `-march=native` by default. Pass

`-DTURBOQUANT_PORTABLE=ON` to CMake if you need a portable baseline instead.

---

## API

Everything lives on a single class, `TurboQuantSpace`. All numpy arrays are

`float32`, C-contiguous; all codes are `uint8`.

```python

TurboQuantSpace(

    dim: int,                    # input dimensionality (any positive integer)

    bits_per_coord: int = 4,     # 2..9 — nibble-packed for bits<=4

    rot_seed: int = 42,          # Hadamard rotation seed

    qjl_seed: int = 137,         # QJL sign seed

    num_threads: int = 0,        # 0 = use OMP_NUM_THREADS / all cores

)

```

| method                                    | shape in                     | shape out                   |

|-------------------------------------------|------------------------------|-----------------------------|

| `encode(x)`                               | `(dim,)`                     | `(code_size_bytes,)` uint8  |

| `encode_batch(X)`                         | `(n, dim)`                   | `(n, code_size_bytes)` uint8|

| `encode_into(x, out)` / `encode_batch_into` | in-place into caller buffer | —                           |

| `distance(query, code)`                   | `(dim,)`, `(code_size,)`     | `float`                     |

| `distance_symmetric(code_a, code_b)`      | `(code_size,)` ×2            | `float`                     |

| `distance_1_to_n(q, codes)`               | `(dim,)`, `(n, code_size)`   | `(n,)` float32              |

| `distance_m_to_n(Q, codes)`               | `(m, dim)`, `(n, code_size)` | `(m, n)` float32            |

| `distance_m_to_n_symmetric(codes_a, b)`   | `(m, cs)`, `(n, cs)`         | `(m, n)` float32            |

Accessors: `dim()`, `padded_dim()`, `padded()`, `num_threads()`,

`code_size_bytes()`, `bits_per_coord()`.

### Dimensionality padding

Internally every operation works in a power-of-two dimension (a requirement

of the Walsh–Hadamard transform). If you pass `dim=100`, the space rounds up

to 128 and zero-pads on the fly; a one-time warning is printed, and

`space.padded_dim()` reports the internal size. Correctness is preserved —

zero-padding in ℝᵈ does not change L2 distances — but encode/query cost is

determined by `padded_dim()`, not `dim()`.

### Threading

All batch methods (`encode_batch`, `distance_1_to_n`, `distance_m_to_n`,

`distance_m_to_n_symmetric`) parallelize the outer loop with OpenMP,

`schedule(static)`, so each thread owns a contiguous range of codes —

prefetcher-friendly, no false sharing on output rows. Set `num_threads` in

the constructor, or leave it `0` to respect `OMP_NUM_THREADS`. For small

batches (≤ 64) execution stays single-threaded to avoid fork/join overhead.

Observed scaling on Apple M-series, dim=512, 50k codes × 128 queries, bits=8:

**1→2 = 1.94×, 1→4 = 3.49×, 1→8 = 4.50×** — see `python/benchmarks/` for the

full reproduction.

---

## Benchmarks

```bash

uv run python python/benchmarks/run_benchmark.py

```

On first run this downloads SIFT1M (~170 MB) to

`~/.cache/turboquant/sift/`; subsequent runs reuse the cache. The script

sweeps `bits_per_coord × num_threads` on SIFT1M (with recall@{1,10,100}

against the shipped ground truth) and on synthetic Gaussian data across

several dimensions, writes

`python/benchmarks/results/results_.csv`, and produces seaborn

plots under `results/plots/`:

- `threading_scaling.png` — M-to-N throughput vs `num_threads`, faceted by dim.

- `sift_recall.png` — recall@{1,10,100} vs bits on SIFT1M.

- `synthetic_throughput.png` — encode / 1-to-N / M-to-N vs dim.

Useful flags: `--skip-sift`, `--skip-synthetic`, `--threads 1,4,8`,

`--bits 4,8`, `--no-show` (for headless CI).

Measured numbers from real hardware (Apple M3 and more as they come in)

live in [`docs/benchmarks.md`](docs/benchmarks.md). Headline from M3,

`dim=128, batch=10000, bits=8`: **~88M symmetric M-to-N ops/sec** and

**~2.8M encode/sec** on a single laptop.

---

## Layout and build

```

include/turboquant/

  turbo_quant.h          # Hadamard, Lloyd–Max, TurboQuantCode

  space_turbo_quant.h    # TurboQuantSpace + SIMD distance kernels

python/turboquant/

  bindings.cpp           # pybind11 bindings

  __init__.py

python/tests/            # pytest suite

python/benchmarks/       # run_benchmark.py (CSV + seaborn plots)

CMakeLists.txt           # scikit-build-core entry point

pyproject.toml

```

The library is header-only in spirit — all algorithmic code is in

`include/turboquant/`. Only the Python module (`bindings.cpp`) is compiled as

a shared object. A C++ consumer can depend on the headers alone and call the

same API directly.

Build flags worth knowing:

- `-DTURBOQUANT_HAVE_OPENMP` — set by CMake when OpenMP is detected; enables

  all `#pragma omp` blocks. Absent → sequential fallback, same API.

- Release build uses `-O3 -ffast-math -fno-finite-math-only`. The

  `fno-finite-math-only` is intentional: it keeps `inf`/`nan` handling sane

  while preserving vectorization.

### Recall Benchmark

![recall](python/benchmarks/results/plots/recall.png)

### Tests

```bash

uv run pytest python/tests/ -v

```

Covers asymmetric/symmetric distances across `bits ∈ {4, 8}` and

`dim ∈ {32..4096}`, batch variants, zero-copy torch interop, and padding

correctness.

---

## Roadmap

The immediate priorities, in order:

1. **Publish wheels to PyPI** (cibuildwheel workflow in place; awaiting first tagged release)

Contributions welcome. The codebase is small (two headers, one bindings

file, ~2k lines) and deliberately kept that way — if a change makes it

harder to read, that is a reason to push back on it.

## Citation

If you use this library in academic work, please cite the original TurboQuant

paper (ICLR 2026) in addition to this repository.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/ilyajob05/turboquant-space

Awesome Lists containing this project

README