https://github.com/better-with-models/tinyquant

TinyQuant is a CPU-only vector quantization codec that compresses high-dimensional embedding vectors to low-bit representations while preserving cosine similarity rankings.
embedding-models embedding-vectors embeddings embeddings-similarity pgvector
Host: GitHub
URL: https://github.com/better-with-models/tinyquant
Owner: better-with-models
License: apache-2.0
Created: 2026-04-08T17:39:59.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-04-18T17:15:24.000Z (3 months ago)
Last Synced: 2026-04-18T17:37:49.726Z (3 months ago)
Topics: embedding-models, embedding-vectors, embeddings, embeddings-similarity, pgvector
Language: Rust
Homepage: https://www.alisonaquinas.com/projects/tinyquant
Size: 3.46 MB
Stars: 5
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: .github/README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Roadmap: docs/roadmap.md
- Agents: AGENTS.md

# TinyQuant

*Rust-native vector quantization codec for embedding compression — CPU SIMD, optional GPU acceleration, and Python/TypeScript bindings.*

[![PyPI](https://img.shields.io/pypi/v/tinyquant-cpu.svg)](https://pypi.org/project/tinyquant-cpu/)
[![CI](https://github.com/better-with-models/TinyQuant/actions/workflows/ci.yml/badge.svg)](https://github.com/better-with-models/TinyQuant/actions/workflows/ci.yml)
[![Python 3.12+](https://img.shields.io/badge/python-3.12+-blue.svg)](https://www.python.org/downloads/)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Coverage](https://img.shields.io/badge/coverage-90.95%25-brightgreen.svg)](https://github.com/better-with-models/TinyQuant/actions/workflows/ci.yml)

> [!NOTE]
> **TinyQuant** is a Rust-native vector quantization codec that compresses
> high-dimensional embedding vectors to low-bit representations while
> preserving cosine similarity rankings. It combines random orthogonal
> preconditioning with two-stage scalar quantization and optional FP16
> residual correction to hit **8× compression at 4-bit** with Pearson
> ρ ≈ 0.998 and **95% top-5 recall** on real OpenAI embeddings.
>
> - **What it is:** a Rust library (with Python and TypeScript bindings) that
>   squeezes embedding vectors into 4-bit (or 2-bit) representations with
>   little loss in retrieval quality, with optional wgpu GPU acceleration for
>   batch workloads above 512 vectors.
> - **Who it's for:** teams running cosine-similarity search on embeddings and
>   paying for RAM or disk by the gigabyte.
> - **Headline number:** 8× compression at 95% top-5 recall on 1536-dim
>   OpenAI embeddings. 1 M vectors go from **5.7 GiB to 732 MiB**.

---

## At a glance

On a benchmark of **335 real embeddings** from OpenAI's
`text-embedding-3-small` (1536 dimensions), TinyQuant 4-bit achieves
**8× compression** with Pearson ρ = 0.998 and 95% top-5 recall,
reducing a 6 KB embedding to 768 bytes while preserving the similarity
rankings that drive retrieval quality.

| Method                      |  Bytes/vec | Compression |  Pearson ρ | Top-5 Recall |
| :-------------------------- | ---------: | ----------: | ---------: | -----------: |
| FP32 (baseline)             |      6,144 |          1× |     1.0000 |         100% |
| FP16                        |      3,072 |          2× |     1.0000 |         100% |
| uint8 scalar                |      1,544 |          4× |     1.0000 |         100% |
| **TinyQuant 4-bit**         |    **768** |      **8×** | **0.9981** |      **95%** |
| **TinyQuant 2-bit**         |    **384** |     **16×** | **0.9643** |      **85%** |
| TinyQuant 4-bit + residual  |      3,840 |        1.6× |     1.0000 |         100% |

For a corpus of 1 million 1536-dim vectors, TinyQuant 4-bit reduces raw
vector storage from **5.7 GiB to 732 MiB** (6.14 GB to 768 MB in decimal
units) with only a small loss in retrieval quality (95% top-5 recall on the
benchmark above).
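The byte counts in the table follow directly from the dimension and bit width. The sketch below reproduces that arithmetic in plain Python; it counts only the packed indices (plus the FP16 residual where noted) and ignores any serialization header or per-vector metadata, and the helper name `packed_bytes` is illustrative rather than part of the TinyQuant API.

```python
def packed_bytes(dimension: int, bit_width: int) -> int:
    """Bytes needed for `dimension` indices of `bit_width` bits each, rounded up."""
    return (dimension * bit_width + 7) // 8


DIM = 1536
N_VECTORS = 1_000_000

fp32_bytes = DIM * 4                      # 6,144 bytes per vector
q4_bytes = packed_bytes(DIM, 4)           # 768 bytes per vector
q2_bytes = packed_bytes(DIM, 2)           # 384 bytes per vector
q4_residual_bytes = q4_bytes + DIM * 2    # 4-bit indices + FP16 residual = 3,840

print(f"4-bit ratio:            {fp32_bytes / q4_bytes:.1f}x")
print(f"2-bit ratio:            {fp32_bytes / q2_bytes:.1f}x")
print(f"4-bit + residual ratio: {fp32_bytes / q4_residual_bytes:.1f}x")

# Corpus totals in binary units (1 GiB = 2**30 bytes, 1 MiB = 2**20 bytes).
GIB, MIB = 2**30, 2**20
print(f"FP32 corpus:  {N_VECTORS * fp32_bytes / GIB:.1f} GiB")
print(f"4-bit corpus: {N_VECTORS * q4_bytes / MIB:.0f} MiB")
```

The 5.7 and 732 figures come out only when the totals are expressed in binary units.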

![Compression vs. Fidelity](../experiments/quantization-benchmark/results/plots/compression_vs_fidelity.png)

See the [full benchmark report](../experiments/quantization-benchmark/REPORT.md)
for methodology, all 9 methods compared, throughput measurements, and
publication-quality plots.
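For readers unfamiliar with the two quality columns, here is a minimal sketch of how such metrics are commonly computed. It assumes Pearson ρ is taken over cosine similarities before and after compression, and that top-k recall is the overlap between the exact and the reconstructed neighbor sets. The benchmark report is the authority on the definitions actually used; every function name here is illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def top_k(query_index, vectors, k):
    """Indices of the k nearest neighbors of one vector, excluding itself."""
    scores = [
        (cosine(vectors[query_index], v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    ]
    scores.sort(reverse=True)
    return {i for _, i in scores[:k]}


def mean_recall_at_k(originals, reconstructed, k):
    """Average fraction of true top-k neighbors that survive reconstruction."""
    hits = 0
    for q in range(len(originals)):
        hits += len(top_k(q, originals, k) & top_k(q, reconstructed, k))
    return hits / (k * len(originals))


# Toy data: the "reconstruction" snaps coordinates to multiples of 0.25.
originals = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.9, 0.3]]
reconstructed = [[round(x * 4) / 4 for x in v] for v in originals]

pairs = [(i, j) for i in range(len(originals)) for j in range(i + 1, len(originals))]
sims_before = [cosine(originals[i], originals[j]) for i, j in pairs]
sims_after = [cosine(reconstructed[i], reconstructed[j]) for i, j in pairs]

print(f"Pearson rho: {pearson(sims_before, sims_after):.4f}")
print(f"recall@1:    {mean_recall_at_k(originals, reconstructed, 1):.2f}")
```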

---

Contents

- [Installation](#installation)
- [Language bindings](#language-bindings)
- [Quickstart](#quickstart)
- [How it works](#how-it-works)
- [Recipes](#recipes)
- [Key properties](#key-properties)
- [Research lineage](#research-lineage)
- [Repository layout](#repository-layout)
- [Development](#development)
- [Reproducing the benchmark](#reproducing-the-benchmark)
- [Contributing](#contributing)
- [License](#license)
- [Related documentation](#related-documentation)

---

## Installation

TinyQuant is published on PyPI as `tinyquant-cpu` and imports as
`tinyquant_cpu`. The current release is a Rust-backed fat wheel with no
pure-Python fallback.

| Target                               | Install command                                   |
| :----------------------------------- | :------------------------------------------------ |
| Python (Rust-backed, current)        | `pip install tinyquant-cpu`                       |
| Python + PostgreSQL/pgvector support | `pip install "tinyquant-cpu[pgvector]"`           |
| Rust native crate                    | `cargo add tinyquant-core`                        |
| TypeScript / Node / Bun              | `npm install @tinyquant/core`                     |
| Work on this repository              | see the [Development](#development) section below |

> [!TIP]
> The `[pgvector]` extra pulls in `psycopg[binary]>=3.1` for talking to a
> live PostgreSQL database. Python **3.12+** is required; the Rust workspace
> MSRV is **1.81**, with the optional `tinyquant-gpu-wgpu` crate carved out
> at **1.87** in its own CI lane.

---

## Language bindings

TinyQuant ships the same codec / corpus / backend surface across three
languages, versioned in lockstep via the `workspace.package.version` field
in `rust/Cargo.toml`. All bindings delegate math to the shared
`tinyquant-core` Rust crate; there is no per-language reimplementation.

| Language    | Package                                                 | Install                          | Since |
| :---------- | :------------------------------------------------------ | :------------------------------- | :---- |
| Python      | [`tinyquant-cpu`](https://pypi.org/project/tinyquant-cpu/) ([![PyPI](https://img.shields.io/pypi/v/tinyquant-cpu.svg)](https://pypi.org/project/tinyquant-cpu/)) | `pip install tinyquant-cpu`      | Phase 24 |
| Rust        | [`tinyquant-core`](https://crates.io/crates/tinyquant-core) ([![crates.io](https://img.shields.io/crates/v/tinyquant-core.svg)](https://crates.io/crates/tinyquant-core)) | `cargo add tinyquant-core`       | Phase 22 |
| TypeScript  | [`@tinyquant/core`](https://www.npmjs.com/package/@tinyquant/core) ([![npm](https://img.shields.io/npm/v/@tinyquant/core.svg)](https://www.npmjs.com/package/@tinyquant/core)) | `npm install @tinyquant/core`    | Phase 25 |

All three packages guarantee byte-identical output on `config_hash`,
`Codebook::to_bytes`, and `CompressedVector::to_bytes`. See
[`COMPATIBILITY.md`](../COMPATIBILITY.md) for the supported cross-package
version pairs.

---

## Quickstart

```python
import numpy as np

from tinyquant_cpu.codec import Codec, CodecConfig
from tinyquant_cpu.corpus import Corpus, CompressionPolicy
from tinyquant_cpu.backend import BruteForceBackend

# 1. Configure the codec: 4-bit quantization for 1536-dim vectors
config = CodecConfig(bit_width=4, dimension=1536, seed=42)
codec = Codec()

# 2. Train a codebook from representative vectors
training_vectors = np.random.default_rng(0).standard_normal((1000, 1536)).astype(np.float32)
codebook = codec.build_codebook(training_vectors, config)

# 3. Create a corpus that compresses on insert
corpus = Corpus("my-vectors", config, codebook, CompressionPolicy.COMPRESS)
for i, vec in enumerate(training_vectors):
    corpus.insert(f"vec-{i}", vec)

# 4. Decompress and search
backend = BruteForceBackend()
backend.ingest(corpus.decompress_all())
results = backend.search(training_vectors[42], top_k=5)
for r in results:
    print(f"{r.vector_id}: {r.score:.4f}")
```

What just happened?

1. **Configure:** `CodecConfig(bit_width=4, dimension=1536, seed=42)`
   sets the bit width (`4` → 8× compression), the vector dimension, and
   the RNG seed that controls the random rotation matrix. The seed makes
   the codec **deterministic**: the same inputs always produce byte-identical
   output across all language bindings.
2. **Train:** `codec.build_codebook(training_vectors, config)` fits a
   small codebook on a representative sample of your data.
3. **Insert:** `Corpus(..., CompressionPolicy.COMPRESS)` creates a domain
   aggregate that compresses every vector on insert and tracks vector IDs.
4. **Decompress:** `corpus.decompress_all()` produces `(vector_id,
   fp32_vector)` pairs. The Rust core runs these in parallel via Rayon.
5. **Search:** `BruteForceBackend` performs exact cosine search and returns
   `SearchResult` objects with IDs and scores. Swap for `PgvectorAdapter`
   in production, or use the GPU path for large corpora.

---

## How it works

**The problem.** Naive scalar quantization crushes real embedding data because
coordinate distributions are skewed: a handful of dimensions carry most of
the signal and get mapped to the same bucket as noise.

**The trick.** Pre-multiplying each vector by a **random orthogonal matrix**
(derived via QR decomposition of a Gaussian matrix) evens out the coordinate
distribution without changing vector norms, inner products, or pairwise
distances. After rotation, a single shared scalar quantizer works well across
**all** dimensions. This is the core insight from [TurboQuant][] and
[PolarQuant][].
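To make the rotation step concrete, the following standard-library sketch builds a small random orthogonal matrix and checks that it preserves inner products while spreading out a vector whose energy sits in one coordinate. It is an illustration only: it uses modified Gram-Schmidt (which yields the Q factor of a QR decomposition up to sign) and Python's `random` module, neither of which is claimed to match TinyQuant's actual matrix construction or RNG.

```python
import math
import random


def random_orthogonal(dim, seed):
    """Orthonormalize Gaussian rows with modified Gram-Schmidt."""
    rng = random.Random(seed)
    rows = []
    for _ in range(dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along every row accepted so far.
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    """Matrix-vector product: one output coordinate per matrix row."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


dim = 16
rotation = random_orthogonal(dim, seed=42)

# A "spiky" vector: one coordinate carries almost all of the energy.
spiky = [10.0] + [0.1] * (dim - 1)
other = [0.5 * ((-1) ** i) for i in range(dim)]

r_spiky, r_other = rotate(rotation, spiky), rotate(rotation, other)

print(f"dot before: {dot(spiky, other):.6f}")
print(f"dot after:  {dot(r_spiky, r_other):.6f}")
print(f"largest |coordinate| before: {max(abs(x) for x in spiky):.2f}")
print(f"largest |coordinate| after:  {max(abs(x) for x in r_spiky):.2f}")
```

The two dot products agree up to floating-point error, and the largest coordinate after rotation is typically much smaller than before, which is what lets one shared quantizer serve every dimension.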

**Two-stage refinement.** An optional **FP16 residual** on top of the 4-bit
coarse codebook gives you a separate point on the rate-distortion curve:
8× compression and ρ ≈ 0.998 without the residual; 1.6× compression and
ρ = 1.000 with it enabled, which is useful for reranking stages.
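The residual idea can be sketched with the standard library alone: quantize coarsely, store the leftover error in half precision, and add it back on decode. The sketch assumes a uniform 4-bit grid on a fixed range purely for illustration, whereas TinyQuant trains a codebook; the helper names are hypothetical and this is not the library's implementation.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def quantize(x, lo, hi, bit_width):
    """Index of the nearest of 2**bit_width evenly spaced levels on [lo, hi]."""
    top = (1 << bit_width) - 1
    clipped = min(max(x, lo), hi)
    return round((clipped - lo) / (hi - lo) * top)


def dequantize(index, lo, hi, bit_width):
    top = (1 << bit_width) - 1
    return lo + index / top * (hi - lo)


LO, HI, BITS = -1.0, 1.0, 4
vector = [0.8131, -0.2467, 0.0312, -0.9954, 0.5]

# Stage 1: coarse 4-bit indices.
indices = [quantize(x, LO, HI, BITS) for x in vector]
coarse = [dequantize(i, LO, HI, BITS) for i in indices]

# Stage 2: keep the quantization error as an FP16 residual (2 bytes per value).
residual = [to_fp16(x - c) for x, c in zip(vector, coarse)]
refined = [c + r for c, r in zip(coarse, residual)]

coarse_err = max(abs(x - c) for x, c in zip(vector, coarse))
refined_err = max(abs(x - r) for x, r in zip(vector, refined))
print(f"max error, 4-bit only:       {coarse_err:.6f}")
print(f"max error, 4-bit + residual: {refined_err:.6f}")
```

The residual costs two extra bytes per dimension, which is why the 4-bit + residual row in the tables lands at 768 + 3,072 = 3,840 bytes for a 1536-dim vector.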

**Rust core with CPU and GPU paths.** The codec runs through
`tinyquant-core`, which dispatches SIMD kernels at runtime (AVX2+FMA on
x86_64, NEON on aarch64) and parallelizes batch compression with Rayon.
For workloads exceeding the **512-vector threshold**, the optional
`tinyquant-gpu-wgpu` crate offloads rotate/quantize/dequantize/residual
and corpus cosine search to WGSL compute shaders via wgpu, with lazy
pipeline caching to avoid per-call recompilation.

**Backend-agnostic.** The codec produces `CompressedVector` bytes; search
lives in a separate `SearchBackend` layer (`BruteForceBackend` for in-memory
exact search, `PgvectorAdapter` for PostgreSQL + pgvector, `WgpuBackend` for
GPU-accelerated corpus search), so you can plug TinyQuant into any retrieval
store without coupling storage to search.

[TurboQuant]: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

[PolarQuant]: https://arxiv.org/abs/2503.20024

---

## Recipes

Pick the config that matches your rate-distortion target:

| Config                                            | Bytes/vec | Compression |     ρ | Top-5 | When to use                |
| :------------------------------------------------ | --------: | ----------: | ----: | ----: | :------------------------- |
| `CodecConfig(bit_width=4)`                        |       768 |          8× | 0.998 |   95% | **Default** balance        |
| `CodecConfig(bit_width=2)`                        |       384 |         16× | 0.964 |   85% | Aggressive, needs rerank   |
| `CodecConfig(bit_width=4, residual_enabled=True)` |     3,840 |        1.6× | 1.000 |  100% | Reranking / exact-match    |
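The "needs rerank" note on the 2-bit row refers to a common two-stage pattern: over-fetch candidates using the lossy vectors, then reorder that shortlist with higher-fidelity vectors. The sketch below shows the pattern on toy 2-D data in plain Python. The crude rounding step stands in for a lossy codec, and `top_ids` and `search_with_rerank` are illustrative helpers, not TinyQuant functions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def top_ids(query, vectors, k):
    """IDs of the k vectors most similar to the query, best first."""
    return sorted(vectors, key=lambda vid: cosine(query, vectors[vid]), reverse=True)[:k]


def search_with_rerank(query, coarse, exact, k, oversample=2):
    """Shortlist k * oversample candidates on coarse vectors, rerank on exact ones."""
    candidates = top_ids(query, coarse, k * oversample)
    shortlist = {vid: exact[vid] for vid in candidates}
    return top_ids(query, shortlist, k)


exact = {"a": [1.0, 0.2], "b": [1.0, 0.3], "c": [-1.0, 0.0]}
# Stand-in for a lossy codec: snap every coordinate to the nearest multiple of 0.5.
coarse = {vid: [round(x * 2) / 2 for x in vec] for vid, vec in exact.items()}
query = [1.0, 0.24]

print(top_ids(query, exact, 1))                     # ground truth: ['a']
print(top_ids(query, coarse, 1))                    # coarse-only search: ['b']
print(search_with_rerank(query, coarse, exact, 1))  # shortlist + rerank: ['a']
```

Here the coarse search alone picks the wrong neighbor, while a shortlist of two candidates is enough for the rerank stage to recover the true one. The `oversample` factor sets how much extra work the second stage does.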

Single-vector compression

```python
import numpy as np

from tinyquant_cpu.codec import Codec, CodecConfig

config = CodecConfig(bit_width=4, dimension=768, seed=42)
codec = Codec()

training_data = np.random.default_rng(0).standard_normal((1000, 768)).astype(np.float32)
codebook = codec.build_codebook(training_data, config)

vector = training_data[0]
compressed = codec.compress(vector, config, codebook)

print(f"Original:   {vector.nbytes} bytes")
print(f"Compressed: {compressed.size_bytes} bytes")
print(f"Ratio:      {vector.nbytes / compressed.size_bytes:.1f}x")

restored = codec.decompress(compressed, config, codebook)
```

Batch compression (Rayon-parallel)

```python
# Reuses `np`, `codec`, `config`, and `codebook` from the single-vector example.
# Parallelized via Rayon in the Rust core; byte-identical to serial output.
vectors = np.random.default_rng(0).standard_normal((10_000, 768)).astype(np.float32)

compressed_batch = codec.compress_batch(vectors, config, codebook)
restored_batch = codec.decompress_batch(compressed_batch, config, codebook)
```

Tuning the rate–distortion tradeoff

```python
from tinyquant_cpu.codec import CodecConfig

# Maximum compression: 16x at 2-bit
config_2bit = CodecConfig(bit_width=2, dimension=768, seed=42, residual_enabled=False)

# Practical sweet spot: 8x at 4-bit (rho ≈ 0.998)
config_4bit = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=False)

# Near-perfect fidelity: 4-bit + FP16 residual correction (1.6x, rho = 1.000)
config_4bit_res = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=True)
```

> [!WARNING]
> 2-bit compression drops top-5 recall to ~85%. Only use it when a
> reranking stage (FP16 residual, cross-encoder, exact search) sits
> downstream to recover the missing signal.

Compression policies

A `Corpus` can store vectors in three modes:

```python
from tinyquant_cpu.corpus import Corpus, CompressionPolicy

# `config` and `codebook` come from the examples above.
corpus_compressed = Corpus("c", config, codebook, CompressionPolicy.COMPRESS)
corpus_full       = Corpus("p", config, codebook, CompressionPolicy.PASSTHROUGH)
corpus_fp16       = Corpus("h", config, codebook, CompressionPolicy.FP16)
```

Policies let one corpus mix hot data (PASSTHROUGH), cold data (COMPRESS),
and middle-tier data (FP16) without rebuilding the codec.

Binary serialization (TQCV format)

`CompressedVector` instances serialize to the TQCV versioned binary format
(70-byte header + LSB-first packed indices + optional FP16 residual),
suitable for disk, network, or database storage. Mmap corpus files are
available via the Rust `tinyquant-io` crate for zero-copy access.

```python
from tinyquant_cpu.codec import CompressedVector

# `compressed` is the CompressedVector from the single-vector example.
raw_bytes = compressed.to_bytes()
restored  = CompressedVector.from_bytes(raw_bytes)
```
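As an illustration of what "LSB-first packed indices" usually means, the sketch below packs small integers into bytes so that the first index occupies the low bits of the first byte. It is a generic bit-packing example under that assumption. It does not reproduce the TQCV header or claim to match the on-disk layout byte for byte, and `pack_indices` / `unpack_indices` are hypothetical names.

```python
def pack_indices(indices, bit_width):
    """Pack small integers into bytes, filling each byte from its least significant bit."""
    buffer = 0   # bits waiting to be written out
    n_bits = 0   # number of valid bits currently held in `buffer`
    out = bytearray()
    for index in indices:
        if not 0 <= index < (1 << bit_width):
            raise ValueError(f"index {index} does not fit in {bit_width} bits")
        buffer |= index << n_bits
        n_bits += bit_width
        while n_bits >= 8:
            out.append(buffer & 0xFF)
            buffer >>= 8
            n_bits -= 8
    if n_bits:
        out.append(buffer & 0xFF)  # final partial byte, upper bits left as zero
    return bytes(out)


def unpack_indices(data, bit_width, count):
    """Inverse of pack_indices: read `count` values of `bit_width` bits each."""
    mask = (1 << bit_width) - 1
    buffer = int.from_bytes(data, "little")
    return [(buffer >> (i * bit_width)) & mask for i in range(count)]


indices = [3, 15, 0, 9, 7]
packed = pack_indices(indices, bit_width=4)
print(packed.hex())  # f39007
print(unpack_indices(packed, 4, len(indices)))
```

Two 4-bit indices share each byte, so 1536 indices occupy 768 bytes, matching the bytes-per-vector figure quoted for the 4-bit configuration.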

PostgreSQL + pgvector backend

```python
import psycopg

from tinyquant_cpu.backend.adapters.pgvector import PgvectorAdapter

adapter = PgvectorAdapter(
    connection_factory=lambda: psycopg.connect("postgresql://user:pass@localhost/mydb"),
    table_name="embeddings",
)

# `corpus` comes from the Quickstart; `query_vector` is your query embedding.
adapter.ingest(corpus.decompress_all())
results = adapter.search(query_vector, top_k=10)
```

> [!IMPORTANT]
> Requires PostgreSQL with the `pgvector` extension installed. CI runs these
> tests against a live `pgvector/pgvector:pg17` container via testcontainers.

GPU acceleration (Rust only — wgpu)

The `tinyquant-gpu-wgpu` crate provides a `WgpuBackend` that offloads
batch compress/decompress and corpus cosine search to WGSL compute shaders.
It is workspace-internal (`publish = false`) and selected automatically
when a batch exceeds `GPU_BATCH_THRESHOLD` (512 vectors).

```rust
use tinyquant_gpu_wgpu::{WgpuBackend, BackendPreference};

// Default adapter (auto-select highest-performance GPU)
let backend = WgpuBackend::new().await?;

// Or select a specific backend:
let backend = WgpuBackend::new_with_preference(BackendPreference::Vulkan).await?;

// Warm up pipeline cache explicitly (optional; lazy otherwise)
backend.load_pipelines().await;

// GPU corpus search
let state = backend.prepare_corpus_for_device(&corpus_vecs).await?;
let results = backend.cosine_topk(&state, &query_vec, top_k).await?;
```

Available `BackendPreference` variants: `Auto`, `Vulkan`, `Metal`, `Dx12`,
`HighPerformance`, `LowPower`, `Software`.

---

## Key properties

- **8× compression** at 4-bit without residuals (ρ = 0.998, 95% recall)
- **16× compression** at 2-bit (ρ = 0.964, 85% recall)
- **Near-perfect fidelity** with optional FP16 residual correction (ρ = 1.000)
- **Deterministic:** same inputs produce byte-identical output across all language bindings and CPU architectures
- **Rust-native core:** `tinyquant-core`; CPU SIMD dispatch (AVX2+FMA / NEON) via `is_x86_feature_detected!` / ARMv8 base-ISA guarantee; Rayon parallel batch with determinism contract
- **Optional GPU acceleration:** `tinyquant-gpu-wgpu`; WGSL rotate/quantize/dequantize/residual and cosine-topk kernels; lazy `CachedPipelines`; `BackendPreference` adapter selection; auto-routes at ≥ 512 vectors
- **Multi-language:** Python fat wheel (`tinyquant-cpu`), TypeScript/Node (`@tinyquant/core`), Rust native (`tinyquant-core`), C ABI (`tinyquant-sys`)
- **Pluggable backends:** `BruteForceBackend` for in-process exact search; `PgvectorAdapter` for PostgreSQL + pgvector; `WgpuBackend` for GPU corpus search
- **Three compression policies:** COMPRESS, PASSTHROUGH, FP16, mixable within a corpus
- **TQCV serialization:** versioned 70-byte header + LSB-first bit-pack + optional FP16 residual; mmap corpus files via `tinyquant-io`
- **Calibration gates:** Pearson ρ and mean recall-at-k measured against OpenAI calibration fixtures; Criterion benchmarks with 10% regression budget
- **Fully typed:** `py.typed` marker, `mypy --strict` clean, TypeScript strict mode
- **Apache-2.0 licensed**

---

## Research lineage

TinyQuant adapts ideas from published research into a clean-room
implementation:

| Source              | Year | Key contribution                                                  |
| :------------------ | :--: | :---------------------------------------------------------------- |
| [**TurboQuant**][1] | 2025 | Random rotation + scalar quantization, no per-block norms         |
| [**PolarQuant**][2] | 2025 | QR-derived orthogonal preconditioning for coordinate uniformity   |
| [**QJL**][3]        | 2024 | Inner-product preservation bounds under aggressive quantization   |

[1]: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

[2]: https://arxiv.org/abs/2503.20024

[3]: https://arxiv.org/abs/2406.03482

---

## Repository layout

| Path                                          | Purpose                                                                     |
| :-------------------------------------------- | :-------------------------------------------------------------------------- |
| `rust/crates/tinyquant-core/`                 | Codec, corpus, backend trait, SIMD dispatch, Rayon parallel batch           |
| `rust/crates/tinyquant-io/`                   | TQCV serialization format and mmap corpus files                             |
| `rust/crates/tinyquant-gpu-wgpu/`             | Optional wgpu/WGSL GPU accelerator (`publish = false`, workspace-internal)  |
| `rust/crates/tinyquant-py/`                   | pyo3 Python extension: the engine behind `tinyquant-cpu`                    |
| `rust/crates/tinyquant-sys/`                  | C ABI via cbindgen                                                          |
| `rust/crates/tinyquant-cli/`                  | Standalone CLI binary                                                       |
| `rust/crates/tinyquant-js/`                   | napi-rs TypeScript/Node bindings (`@tinyquant/core`)                        |
| `rust/crates/tinyquant-bruteforce/`           | `BruteForceBackend` reference implementation                                |
| `rust/crates/tinyquant-pgvector/`             | PostgreSQL + pgvector ACL adapter                                           |
| `rust/crates/tinyquant-bench/`                | Criterion benchmarks + calibration quality gates                            |
| `tests/reference/tinyquant_py_reference/`     | Pure-Python frozen oracle: differential test reference (not shipped)        |
| `tests/parity/`                               | Cross-implementation parity suite (`pytest -m parity`)                      |
| `tests/`                                      | Python unit, integration, E2E, architecture, and calibration suites         |
| `experiments/`                                | Benchmarks and empirical evaluations                                        |
| `docs/`                                       | Obsidian wiki: design docs, research, SDLC plans, CI/CD specs               |

---

## Development

```bash
git clone https://github.com/better-with-models/TinyQuant.git
cd TinyQuant

# Python dev dependencies
pip install pytest pytest-cov hypothesis numpy ruff mypy build

# Lint and format
ruff check . && ruff format --check .

# Strict type check
mypy --strict .

# Run the full Python suite
pytest --cov=tinyquant_py_reference

# Cross-impl parity (Python ↔ Rust)
pytest -m parity -v

# Rust: lint and test
cd rust
cargo clippy --workspace -- -D warnings
cargo test --workspace
```

The Python test suite includes **289 tests** covering unit, integration,
end-to-end, calibration, parity (cross-impl Python ↔ Rust), and
architecture-enforcement scenarios. Coverage is held above **90%** by CI
(**94%** for the codec subpackage). Live PostgreSQL + pgvector tests run
against a Docker container in CI via `testcontainers`.

> [!TIP]
> CI enforces three strict gates: `ruff check` / `ruff format --check`,
> `mypy --strict`, and `markdownlint-cli2` for all markdown outside `docs/`.
> The `docs/` vault uses Obsidian-flavored markdown under its own rules;
> see [`AGENTS.md`](../AGENTS.md) for the policy.

---

## Reproducing the benchmark

```bash
export OPENAI_API_KEY="your-key-here"

python experiments/quantization-benchmark/generate_embeddings.py
python experiments/quantization-benchmark/run_benchmark.py
python experiments/quantization-benchmark/generate_plots.py
```

This fetches 335 embeddings via the OpenAI API, benchmarks 9 quantization
methods, and produces plots and JSON results in
`experiments/quantization-benchmark/results/`.

---

## Contributing

Contributions are welcome. The short version:

1. **Issues and design discussions:** open a GitHub issue before starting
   non-trivial work so we can agree on scope.
2. **Follow the repo SDLC:** architecture decisions, coding standards, and
   pre-commit expectations live in [`AGENTS.md`](../AGENTS.md) and the
   `docs/design/` vault. Read [`CLAUDE.md`](../CLAUDE.md) if you're driving
   Claude Code or another LLM agent against this repo.
3. **Run the full gate locally** before pushing:
   `ruff check . && ruff format --check . && mypy --strict . && pytest --cov=tinyquant_cpu`
4. **Keep prose aligned:** edits to the project tagline, elevator pitch, or
   headline benchmark numbers must land in `README.md`, `.github/README.md`,
   `AGENTS.md`, and `CLAUDE.md` in the same commit.

---

## License

Apache-2.0. See [LICENSE](../LICENSE).

---

## Related documentation

- [Benchmark Report](../experiments/quantization-benchmark/REPORT.md):
  full empirical evaluation in CS-paper format
- [CHANGELOG](../CHANGELOG.md): release notes
- [Design: Storage Codec Architecture](../docs/design/storage-codec-architecture.md)
- [Design: GPU Acceleration](../docs/design/rust/gpu-acceleration.md)
- [Research: Vector Quantization Paper Synthesis](../docs/research/vector-quantization-paper-synthesis.md)
- [QA: Validation Plan](../docs/qa/validation-plan/README.md)
- [CI Plan](../docs/CI-plan/README.md) and [CD Plan](../docs/CD-plan/README.md)