https://github.com/better-with-models/tinyquant
TinyQuant is a CPU-only vector quantization codec that compresses high-dimensional embedding vectors to low-bit representations while preserving cosine similarity rankings.
- Host: GitHub
- URL: https://github.com/better-with-models/tinyquant
- Owner: better-with-models
- License: apache-2.0
- Created: 2026-04-08T17:39:59.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-04-18T17:15:24.000Z (about 1 month ago)
- Last Synced: 2026-04-18T17:37:49.726Z (about 1 month ago)
- Topics: embedding-models, embedding-vectors, embeddings, embeddings-similarity, pgvector
- Language: Rust
- Homepage: https://www.alisonaquinas.com/projects/tinyquant
- Size: 3.46 MB
- Stars: 5
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: .github/README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Roadmap: docs/roadmap.md
- Agents: AGENTS.md

# TinyQuant
*Rust-native vector quantization codec for embedding compression — CPU SIMD, optional GPU acceleration, and Python/TypeScript bindings.*
> [!NOTE]
> **TinyQuant** is a Rust-native vector quantization codec that compresses
> high-dimensional embedding vectors to low-bit representations while
> preserving cosine similarity rankings. It combines random orthogonal
> preconditioning with two-stage scalar quantization and optional FP16
> residual correction to hit **8× compression at 4-bit** with Pearson
> ρ ≈ 0.998 and **95% top-5 recall** on real OpenAI embeddings.
>
> - **What it is:** a Rust library (with Python and TypeScript bindings) that
> squeezes embedding vectors into 4-bit (or 2-bit) representations without
> losing retrieval quality, with optional wgpu GPU acceleration for batch
> workloads above 512 vectors.
> - **Who it's for:** teams running cosine-similarity search on embeddings and
> paying for RAM or disk by the gigabyte.
> - **Headline number:** 8× compression at 95% top-5 recall on 1536-dim
> OpenAI embeddings. 1 M vectors go from **5.7 GB to 732 MB**.
---
## At a glance
On a benchmark of **335 real embeddings** from OpenAI's
`text-embedding-3-small` (1536 dimensions), TinyQuant 4-bit achieves
**8× compression** with Pearson ρ = 0.998 and 95% top-5 recall —
reducing a 6 KB embedding to 768 bytes while preserving the similarity
rankings that drive retrieval quality.
| Method | Bytes/vec | Compression | Pearson ρ | Top-5 Recall |
| :-------------------------- | ---------: | ----------: | --------: | -----------: |
| FP32 (baseline) | 6,144 | 1× | 1.0000 | 100% |
| FP16 | 3,072 | 2× | 1.0000 | 100% |
| uint8 scalar | 1,544 | 4× | 1.0000 | 100% |
| **TinyQuant 4-bit** | **768** | **8×** |**0.9981** | **95%** |
| **TinyQuant 2-bit** | **384** | **16×** |**0.9643** | **85%** |
| TinyQuant 4-bit + residual | 3,840 | 1.6× | 1.0000 | 100% |

For a corpus of 1 million 1536-dim vectors, TinyQuant 4-bit reduces
storage from **5.7 GB to 732 MB** with negligible loss in retrieval
quality.
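
The storage figures follow directly from the bit width. The check below is a back-of-the-envelope sketch that assumes payload bytes only (no per-vector header or codebook overhead) and binary units (GiB, MiB), which is what the quoted 5.7 GB and 732 MB correspond to. The helper name `payload_bytes` is illustrative and not part of the TinyQuant API.

```python
def payload_bytes(dimension: int, bit_width: int) -> int:
    """Bytes needed to store `dimension` codes of `bit_width` bits each."""
    return (dimension * bit_width + 7) // 8  # round up to a whole byte


DIM = 1536
N_VECTORS = 1_000_000

fp32_bytes = DIM * 4                # 6,144 bytes per FP32 vector
q4_bytes = payload_bytes(DIM, 4)    # 768 bytes at 4-bit
q2_bytes = payload_bytes(DIM, 2)    # 384 bytes at 2-bit

fp32_gib = N_VECTORS * fp32_bytes / 2**30
q4_mib = N_VECTORS * q4_bytes / 2**20

print(f"4-bit: {q4_bytes} B/vec, {fp32_bytes / q4_bytes:.0f}x")
print(f"2-bit: {q2_bytes} B/vec, {fp32_bytes / q2_bytes:.0f}x")
print(f"1M vectors: {fp32_gib:.1f} GiB -> {q4_mib:.0f} MiB")
```
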

See the [full benchmark report](../experiments/quantization-benchmark/REPORT.md)
for methodology, all 9 methods compared, throughput measurements, and
publication-quality plots.

---
Contents
- [Installation](#installation)
- [Language bindings](#language-bindings)
- [Quickstart](#quickstart)
- [How it works](#how-it-works)
- [Recipes](#recipes)
- [Key properties](#key-properties)
- [Research lineage](#research-lineage)
- [Repository layout](#repository-layout)
- [Development](#development)
- [Reproducing the benchmark](#reproducing-the-benchmark)
- [Contributing](#contributing)
- [License](#license)
- [Related documentation](#related-documentation)
---
## Installation
TinyQuant is published on PyPI as `tinyquant-cpu` and imports as
`tinyquant_cpu`. The current release is a Rust-backed fat wheel — no
pure-Python fallback.
| I want to... | Install command |
| :--------------------------------------------- | :------------------------------------------------- |
| Python (Rust-backed, current) | `pip install tinyquant-cpu` |
| Python + PostgreSQL/pgvector support | `pip install "tinyquant-cpu[pgvector]"` |
| Rust native crate | `cargo add tinyquant-core` |
| TypeScript / Node / Bun | `npm install @tinyquant/core` |
| Work on this repository | see the [Development](#development) section below |
> [!TIP]
> The `[pgvector]` extra pulls in `psycopg[binary]>=3.1` for talking to a
> live PostgreSQL database. Python **3.12+** is required; the Rust workspace
> MSRV is **1.81**, with the optional `tinyquant-gpu-wgpu` crate carved out
> at **1.87** in its own CI lane.
---
## Language bindings
TinyQuant ships the same codec / corpus / backend surface across three
languages, versioned in lockstep via `rust/Cargo.toml`
`workspace.package.version`. All bindings delegate math to the shared
`tinyquant-core` Rust crate — there is no per-language reimplementation.
| Language | Package | Install | Since |
| :---------- | :------------------------------------------------------ | :------------------------------- | :---- |
| Python | [`tinyquant-cpu`](https://pypi.org/project/tinyquant-cpu/) | `pip install tinyquant-cpu` | Phase 24 |
| Rust | [`tinyquant-core`](https://crates.io/crates/tinyquant-core) | `cargo add tinyquant-core` | Phase 22 |
| TypeScript | [`@tinyquant/core`](https://www.npmjs.com/package/@tinyquant/core) | `npm install @tinyquant/core` | Phase 25 |

All three packages guarantee byte-identical output on `config_hash`,
`Codebook::to_bytes`, and `CompressedVector::to_bytes`. See
[`COMPATIBILITY.md`](../COMPATIBILITY.md) for the supported cross-package
version pairs.

---
## Quickstart
```python
import numpy as np
from tinyquant_cpu.codec import Codec, CodecConfig
from tinyquant_cpu.corpus import Corpus, CompressionPolicy
from tinyquant_cpu.backend import BruteForceBackend
# 1. Configure the codec: 4-bit quantization for 1536-dim vectors
config = CodecConfig(bit_width=4, dimension=1536, seed=42)
codec = Codec()
# 2. Train a codebook from representative vectors
training_vectors = np.random.default_rng(0).standard_normal((1000, 1536)).astype(np.float32)
codebook = codec.build_codebook(training_vectors, config)
# 3. Create a corpus that compresses on insert
corpus = Corpus("my-vectors", config, codebook, CompressionPolicy.COMPRESS)
for i, vec in enumerate(training_vectors):
    corpus.insert(f"vec-{i}", vec)
# 4. Decompress and search
backend = BruteForceBackend()
backend.ingest(corpus.decompress_all())
results = backend.search(training_vectors[42], top_k=5)
for r in results:
    print(f"{r.vector_id}: {r.score:.4f}")
```
What just happened?
1. **Configure** — `CodecConfig(bit_width=4, dimension=1536, seed=42)`
sets the bit width (`4` → 8× compression), the vector dimension, and
the RNG seed that controls the random rotation matrix. The seed makes
the codec **deterministic** — same inputs always produce byte-identical
output across all language bindings.
2. **Train** — `codec.build_codebook(training_vectors, config)` fits a
small codebook on a representative sample of your data.
3. **Insert** — `Corpus(..., CompressionPolicy.COMPRESS)` creates a domain
aggregate that compresses every vector on insert and tracks vector IDs.
4. **Decompress** — `corpus.decompress_all()` produces `(vector_id,
fp32_vector)` pairs. The Rust core runs these in parallel via Rayon.
5. **Search** — `BruteForceBackend` performs exact cosine search and returns
`SearchResult` objects with IDs and scores. Swap for `PgvectorAdapter`
in production, or use the GPU path for large corpora.
---
## How it works
**The problem.** Naive scalar quantization crushes real embedding data because
coordinate distributions are skewed: a handful of dimensions carry most of
the signal and get mapped to the same bucket as noise.
**The trick.** Pre-multiplying each vector by a **random orthogonal matrix**
(derived via QR decomposition of a Gaussian matrix) evens out the coordinate
distribution. Because an orthogonal transform preserves inner products,
pairwise distances and cosine similarities are unchanged. After rotation, a
single shared scalar quantizer works well across **all** dimensions. This is
the core insight from [TurboQuant][] and [PolarQuant][].
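
A minimal, dependency-free sketch of the rotation idea follows. It is not TinyQuant's implementation: a pure-Python Gram-Schmidt pass stands in for the QR decomposition, and the function names, the 16-dimensional toy vectors, and the way the seed is consumed are all assumptions made for illustration.

```python
import math
import random


def random_orthogonal(dim: int, seed: int) -> list[list[float]]:
    """Orthonormalize the rows of a seeded Gaussian matrix (Gram-Schmidt)."""
    rng = random.Random(seed)
    rows: list[list[float]] = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        # Remove the components along the rows accepted so far.
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:  # skip a (vanishingly unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows


def rotate(matrix, vec):
    """Matrix-vector product: one dot product per row."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


dim = 16
rot = random_orthogonal(dim, seed=42)

# A "spiky" pair: almost all of the energy sits in the first coordinate.
a = [10.0] + [0.1] * (dim - 1)
b = [9.0] + [-0.1] * (dim - 1)

before = cosine(a, b)
after = cosine(rotate(rot, a), rotate(rot, b))
print(f"cosine before: {before:.6f}, after: {after:.6f}")
print("largest |coordinate| before:", max(abs(x) for x in a))
print("largest |coordinate| after: ", round(max(abs(x) for x in rotate(rot, a)), 3))
```

The cosine similarity is the same before and after rotation (up to floating-point error), while the energy that sat in one coordinate is spread across many, which is what makes a single shared quantization grid reasonable.
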
**Two-stage refinement.** An optional **FP16 residual** on top of the 4-bit
coarse codebook gives you a separate point on the rate-distortion curve:
8× compression and ρ ≈ 0.998 without the residual; 1.6× compression and
ρ = 1.000 with it enabled — useful for reranking stages.
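
To make the two stages concrete, here is a small sketch under simplifying assumptions: a uniform 16-level grid on [-1, 1] stands in for TinyQuant's trained codebook, and `quantize_uniform` and `to_fp16` are hypothetical helpers, not library functions. It shows the coarse 4-bit error, the much smaller error once an FP16 residual is added back, and where the 3,840-byte figure comes from.

```python
import struct


def quantize_uniform(vec, bit_width, lo=-1.0, hi=1.0):
    """Map each value to the nearest of 2**bit_width evenly spaced levels."""
    levels = 2 ** bit_width
    step = (hi - lo) / (levels - 1)
    indices = [min(levels - 1, max(0, round((x - lo) / step))) for x in vec]
    coarse = [lo + i * step for i in indices]
    return indices, coarse


def to_fp16(x: float) -> float:
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


vec = [0.83, -0.41, 0.07, -0.96, 0.52, 0.0, -0.13, 0.29]

indices, coarse = quantize_uniform(vec, bit_width=4)       # stage 1: 4-bit codes
residual = [to_fp16(x - c) for x, c in zip(vec, coarse)]   # stage 2: FP16 correction
restored = [c + r for c, r in zip(coarse, residual)]

err_coarse = max(abs(x - c) for x, c in zip(vec, coarse))
err_resid = max(abs(x - r) for x, r in zip(vec, restored))
print(f"max error, 4-bit only:       {err_coarse:.5f}")
print(f"max error, 4-bit + residual: {err_resid:.7f}")

# Per-vector cost at 1536 dimensions: packed 4-bit codes plus 2 bytes of
# residual per coordinate.
dim = 1536
total_bytes = dim * 4 // 8 + dim * 2
print(total_bytes, f"{dim * 4 / total_bytes:.1f}x")  # 3840 bytes, 1.6x
```
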
**Rust core with CPU and GPU paths.** The codec runs through
`tinyquant-core`, which dispatches SIMD kernels at runtime (AVX2+FMA on
x86_64, NEON on aarch64) and parallelizes batch compression with Rayon.
For workloads exceeding the **512-vector threshold**, the optional
`tinyquant-gpu-wgpu` crate offloads rotate/quantize/dequantize/residual
and corpus cosine search to WGSL compute shaders via wgpu, with lazy
pipeline caching to avoid per-call recompilation.
**Backend-agnostic.** The codec produces `CompressedVector` bytes; search
lives in a separate `SearchBackend` layer (`BruteForceBackend` for in-memory
exact search, `PgvectorAdapter` for PostgreSQL + pgvector, `WgpuBackend` for
GPU-accelerated corpus search), so you can plug TinyQuant into any retrieval
store without coupling storage to search.
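
The separation between codec and search can be pictured as a small interface. The sketch below is illustrative only: the `ingest` and `search` method names mirror the Quickstart, but the `Protocol` definition, the tuple-based result type, and `ToyBruteForce` are assumptions and do not reproduce the library's actual `SearchBackend` or `SearchResult` types.

```python
import math
from typing import Protocol, Sequence


class SearchBackend(Protocol):
    """Anything that can ingest (id, vector) pairs and answer top-k queries."""

    def ingest(self, items: Sequence[tuple[str, Sequence[float]]]) -> None: ...
    def search(self, query: Sequence[float], top_k: int) -> list[tuple[str, float]]: ...


class ToyBruteForce:
    """Exact cosine search over an in-memory list."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def ingest(self, items):
        self._items.extend((vid, list(vec)) for vid, vec in items)

    def search(self, query, top_k):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b))

        scored = [(vid, cosine(query, vec)) for vid, vec in self._items]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]


def top_ids(backend: SearchBackend, query, k):
    """Caller code depends only on the interface, not on the concrete backend."""
    return [vid for vid, _ in backend.search(query, k)]


backend = ToyBruteForce()
backend.ingest([("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])])
print(top_ids(backend, [1.0, 0.1], 2))  # ['a', 'c']
```
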
[TurboQuant]: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
[PolarQuant]: https://arxiv.org/abs/2503.20024
---
## Recipes
Pick the config that matches your rate-distortion target:
| Config | Bytes/vec | Compression | ρ | Top-5 | When to use |
| :------------------------------------------------ | --------: | ----------: | ----: | ----: | :------------------------- |
| `CodecConfig(bit_width=4)` | 768 | 8× | 0.998 | 95% | **Default** balance |
| `CodecConfig(bit_width=2)` | 384 | 16× | 0.964 | 85% | Aggressive, needs rerank |
| `CodecConfig(bit_width=4, residual_enabled=True)` | 3,840 | 1.6× | 1.000 | 100% | Reranking / exact-match |

Single-vector compression
```python
import numpy as np
from tinyquant_cpu.codec import Codec, CodecConfig
config = CodecConfig(bit_width=4, dimension=768, seed=42)
codec = Codec()
training_data = np.random.default_rng(0).standard_normal((1000, 768)).astype(np.float32)
codebook = codec.build_codebook(training_data, config)
vector = training_data[0]
compressed = codec.compress(vector, config, codebook)
print(f"Original: {vector.nbytes} bytes")
print(f"Compressed: {compressed.size_bytes} bytes")
print(f"Ratio: {vector.nbytes / compressed.size_bytes:.1f}x")
restored = codec.decompress(compressed, config, codebook)
```
Batch compression (Rayon-parallel)
```python
# Parallelized via Rayon in the Rust core — byte-identical to serial output
vectors = np.random.default_rng(0).standard_normal((10_000, 768)).astype(np.float32)
compressed_batch = codec.compress_batch(vectors, config, codebook)
restored_batch = codec.decompress_batch(compressed_batch, config, codebook)
```
Tuning the rate–distortion tradeoff
```python
# Maximum compression: 16x at 2-bit
config_2bit = CodecConfig(bit_width=2, dimension=768, seed=42, residual_enabled=False)
# Practical sweet spot: 8x at 4-bit (rho ≈ 0.998 on the benchmark)
config_4bit = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=False)
# Near-perfect fidelity: 4-bit + FP16 residual correction (1.6x, rho = 1.000)
config_4bit_res = CodecConfig(bit_width=4, dimension=768, seed=42, residual_enabled=True)
```
> [!WARNING]
> 2-bit compression drops top-5 recall to ~85%. Only use it when a
> reranking stage (FP16 residual, cross-encoder, exact search) sits
> downstream to recover the missing signal.
Compression policies

A `Corpus` can store vectors in three modes:
```python
from tinyquant_cpu.corpus import Corpus, CompressionPolicy
corpus_compressed = Corpus("c", config, codebook, CompressionPolicy.COMPRESS)
corpus_full = Corpus("p", config, codebook, CompressionPolicy.PASSTHROUGH)
corpus_fp16 = Corpus("h", config, codebook, CompressionPolicy.FP16)
```
Policies let one corpus mix hot data (PASSTHROUGH), cold data (COMPRESS),
and middle-tier data (FP16) without rebuilding the codec.
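
As a rough way to size a tiered setup, the sketch below estimates the payload footprint per policy. `ToyPolicy` is a hypothetical stand-in for `CompressionPolicy`, the tier sizes are made up, and header or index overhead is ignored.

```python
from enum import Enum


class ToyPolicy(Enum):  # hypothetical stand-in for CompressionPolicy
    COMPRESS = "compress"
    PASSTHROUGH = "passthrough"
    FP16 = "fp16"


def bytes_per_vector(policy: ToyPolicy, dimension: int, bit_width: int = 4) -> int:
    """Payload bytes for one vector under the given policy."""
    if policy is ToyPolicy.PASSTHROUGH:
        return dimension * 4                 # FP32 kept as-is
    if policy is ToyPolicy.FP16:
        return dimension * 2                 # half precision
    return (dimension * bit_width + 7) // 8  # packed low-bit codes


# Made-up tiering of 1M vectors at 1536 dimensions.
tiers = {
    ToyPolicy.PASSTHROUGH: 50_000,   # hot
    ToyPolicy.FP16: 150_000,         # middle tier
    ToyPolicy.COMPRESS: 800_000,     # cold
}
total = sum(n * bytes_per_vector(p, 1536) for p, n in tiers.items())
print(f"{total / 2**20:.0f} MiB for {sum(tiers.values()):,} vectors")
```
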
Binary serialization (TQCV format)

`CompressedVector` instances serialize to the TQCV versioned binary format
(70-byte header + LSB-first packed indices + optional FP16 residual),
suitable for disk, network, or database storage. Mmap corpus files are
available via the Rust `tinyquant-io` crate for zero-copy access.
```python
from tinyquant_cpu.codec import CompressedVector
raw_bytes = compressed.to_bytes()
restored = CompressedVector.from_bytes(raw_bytes)
```
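
The "LSB-first packed indices" part of the format description can be illustrated with a generic bit-packer. This is a sketch of one common reading of LSB-first (each code fills a byte starting from bit 0), not the TQCV reference implementation, and the function names are invented for the example.

```python
def pack_lsb_first(indices, bit_width):
    """Pack small integers into bytes, filling each byte from bit 0 upward."""
    acc = 0      # bit accumulator
    n_bits = 0   # number of valid bits currently held in `acc`
    out = bytearray()
    for idx in indices:
        if not 0 <= idx < (1 << bit_width):
            raise ValueError(f"index {idx} does not fit in {bit_width} bits")
        acc |= idx << n_bits
        n_bits += bit_width
        while n_bits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            n_bits -= 8
    if n_bits:
        out.append(acc & 0xFF)  # zero-padded final byte
    return bytes(out)


def unpack_lsb_first(data, bit_width, count):
    """Inverse of pack_lsb_first: read `count` codes back out."""
    acc = 0
    n_bits = 0
    mask = (1 << bit_width) - 1
    out = []
    it = iter(data)
    while len(out) < count:
        while n_bits < bit_width:
            acc |= next(it) << n_bits
            n_bits += 8
        out.append(acc & mask)
        acc >>= bit_width
        n_bits -= bit_width
    return out


codes = [0x1, 0xF, 0x3, 0xA]
packed = pack_lsb_first(codes, 4)
print(packed.hex())  # f1a3: the first code of each pair sits in the low nibble
assert unpack_lsb_first(packed, 4, len(codes)) == codes
```
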
PostgreSQL + pgvector backend
```python
import psycopg
from tinyquant_cpu.backend.adapters.pgvector import PgvectorAdapter
adapter = PgvectorAdapter(
    connection_factory=lambda: psycopg.connect("postgresql://user:pass@localhost/mydb"),
    table_name="embeddings",
)
adapter.ingest(corpus.decompress_all())
results = adapter.search(query_vector, top_k=10)
```
> [!IMPORTANT]
> Requires PostgreSQL with the `pgvector` extension installed. CI runs these
> tests against a live `pgvector/pgvector:pg17` container via testcontainers.
GPU acceleration (Rust only, wgpu)

The `tinyquant-gpu-wgpu` crate provides a `WgpuBackend` that offloads
batch compress/decompress and corpus cosine search to WGSL compute shaders.
It is workspace-internal (`publish = false`) and selected automatically
when a batch exceeds `GPU_BATCH_THRESHOLD` (512 vectors).
```rust
use tinyquant_gpu_wgpu::{WgpuBackend, BackendPreference};
// Default adapter (auto-select highest-performance GPU)
let backend = WgpuBackend::new().await?;
// Or select a specific backend:
let backend = WgpuBackend::new_with_preference(BackendPreference::Vulkan).await?;
// Warm up pipeline cache explicitly (optional — lazy otherwise)
backend.load_pipelines().await;
// GPU corpus search
let state = backend.prepare_corpus_for_device(&corpus_vecs).await?;
let results = backend.cosine_topk(&state, &query_vec, top_k).await?;
```
Available `BackendPreference` variants: `Auto`, `Vulkan`, `Metal`, `Dx12`,
`HighPerformance`, `LowPower`, `Software`.

---
## Key properties
- **8× compression** at 4-bit without residuals (ρ = 0.998, 95% top-5 recall)
- **16× compression** at 2-bit (ρ = 0.964, 85% top-5 recall)
- **Near-perfect fidelity** with optional FP16 residual correction (ρ = 1.000, 100% top-5 recall on the benchmark)
- **Deterministic** — same inputs produce byte-identical output across all language bindings and CPU architectures
- **Rust-native core** — `tinyquant-core`; CPU SIMD dispatch (AVX2+FMA / NEON) via `is_x86_feature_detected!` / ARMv8 base-ISA guarantee; Rayon parallel batch with determinism contract
- **Optional GPU acceleration** — `tinyquant-gpu-wgpu`; WGSL rotate/quantize/dequantize/residual and cosine-topk kernels; lazy `CachedPipelines`; `BackendPreference` adapter selection; auto-routes at ≥ 512 vectors
- **Multi-language** — Python fat wheel (`tinyquant-cpu`), TypeScript/Node (`@tinyquant/core`), Rust native (`tinyquant-core`), C ABI (`tinyquant-sys`)
- **Pluggable backends** — `BruteForceBackend` for in-process exact search; `PgvectorAdapter` for PostgreSQL + pgvector; `WgpuBackend` for GPU corpus search
- **Three compression policies** — COMPRESS, PASSTHROUGH, FP16, mixable within a corpus
- **TQCV serialization** — versioned 70-byte header + LSB-first bit-pack + optional FP16 residual; mmap corpus files via `tinyquant-io`
- **Calibration gates** — Pearson ρ and mean recall-at-k measured against OpenAI calibration fixtures; Criterion benchmarks with 10% regression budget
- **Fully typed** — `py.typed` marker, `mypy --strict` clean, TypeScript strict mode
- **Apache-2.0 licensed**
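
The two quality metrics named in the calibration-gates item can be computed as sketched below. This uses the textbook definitions of Pearson correlation and top-k overlap on made-up scores; the exact methodology behind the published numbers is in the benchmark report and may differ in detail.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def recall_at_k(true_scores, approx_scores, k):
    """Fraction of the true top-k ids that also appear in the approximate top-k."""
    def top(scores):
        return set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(top(true_scores) & top(approx_scores)) / k


# Made-up similarity scores of one query against six stored vectors:
# exact (FP32) scores versus scores computed from decompressed vectors.
exact = {"a": 0.91, "b": 0.85, "c": 0.80, "d": 0.42, "e": 0.40, "f": 0.10}
approx = {"a": 0.90, "b": 0.86, "c": 0.41, "d": 0.43, "e": 0.39, "f": 0.12}

ids = sorted(exact)
rho = pearson([exact[i] for i in ids], [approx[i] for i in ids])
recall = recall_at_k(exact, approx, 3)
print(f"Pearson rho = {rho:.3f}")
print(f"recall@3    = {recall:.2f}")
```
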
---
## Research lineage
TinyQuant adapts ideas from published research into a clean-room
implementation:
| Source | Year | Key contribution |
| :------------------ | :--: | :---------------------------------------------------------------- |
| [**TurboQuant**][1] | 2025 | Random rotation + scalar quantization, no per-block norms |
| [**PolarQuant**][2] | 2025 | QR-derived orthogonal preconditioning for coordinate uniformity |
| [**QJL**][3] | 2024 | Inner-product preservation bounds under aggressive quantization |
[1]: https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/
[2]: https://arxiv.org/abs/2503.20024
[3]: https://arxiv.org/abs/2406.03482
---
## Repository layout
| Path | Purpose |
| :-------------------------------------------- | :-------------------------------------------------------------------------- |
| `rust/crates/tinyquant-core/` | Codec, corpus, backend trait, SIMD dispatch, Rayon parallel batch |
| `rust/crates/tinyquant-io/` | TQCV serialization format and mmap corpus files |
| `rust/crates/tinyquant-gpu-wgpu/` | Optional wgpu/WGSL GPU accelerator (`publish = false`, workspace-internal) |
| `rust/crates/tinyquant-py/` | pyo3 Python extension — the engine behind `tinyquant-cpu` |
| `rust/crates/tinyquant-sys/` | C ABI via cbindgen |
| `rust/crates/tinyquant-cli/` | Standalone CLI binary |
| `rust/crates/tinyquant-js/` | napi-rs TypeScript/Node bindings (`@tinyquant/core`) |
| `rust/crates/tinyquant-bruteforce/` | `BruteForceBackend` reference implementation |
| `rust/crates/tinyquant-pgvector/` | PostgreSQL + pgvector ACL adapter |
| `rust/crates/tinyquant-bench/` | Criterion benchmarks + calibration quality gates |
| `tests/reference/tinyquant_py_reference/` | Pure-Python frozen oracle — differential test reference (not shipped) |
| `tests/parity/` | Cross-implementation parity suite (`pytest -m parity`) |
| `tests/` | Python unit, integration, E2E, architecture, and calibration suites |
| `experiments/` | Benchmarks and empirical evaluations |
| `docs/` | Obsidian wiki: design docs, research, SDLC plans, CI/CD specs |
---
## Development
```bash
git clone https://github.com/better-with-models/TinyQuant.git
cd TinyQuant
# Python dev dependencies
pip install pytest pytest-cov hypothesis numpy ruff mypy build
# Lint and format
ruff check . && ruff format --check .
# Strict type check
mypy --strict .
# Run the full Python suite
pytest --cov=tinyquant_py_reference
# Cross-impl parity (Python ↔ Rust)
pytest -m parity -v
# Rust: lint and test
cd rust
cargo clippy --workspace -- -D warnings
cargo test --workspace
```
The Python test suite includes **289 tests** covering unit, integration,
end-to-end, calibration, parity (cross-impl Python ↔ Rust), and
architecture-enforcement scenarios. Coverage is held above **90%** by CI
(**94%** for the codec subpackage). Live PostgreSQL + pgvector tests run
against a Docker container in CI via `testcontainers`.
> [!TIP]
> CI enforces three strict gates: `ruff check` / `ruff format --check`,
> `mypy --strict`, and `markdownlint-cli2` for all markdown outside `docs/`.
> The `docs/` vault uses Obsidian-flavored markdown under its own rules —
> see [`AGENTS.md`](../AGENTS.md) for the policy.
---
## Reproducing the benchmark
```bash
export OPENAI_API_KEY="your-key-here"
python experiments/quantization-benchmark/generate_embeddings.py
python experiments/quantization-benchmark/run_benchmark.py
python experiments/quantization-benchmark/generate_plots.py
```
This fetches 335 embeddings via the OpenAI API, benchmarks 9 quantization
methods, and produces plots and JSON results in
`experiments/quantization-benchmark/results/`.

---
## Contributing
Contributions are welcome. The short version:
1. **Issues and design discussions** — open a GitHub issue before starting
non-trivial work so we can agree on scope.
2. **Follow the repo SDLC** — architecture decisions, coding standards, and
pre-commit expectations live in [`AGENTS.md`](../AGENTS.md) and the
`docs/design/` vault. Read [`CLAUDE.md`](../CLAUDE.md) if you're driving
Claude Code or another LLM agent against this repo.
3. **Run the full gate locally** before pushing:
`ruff check . && ruff format --check . && mypy --strict . && pytest --cov=tinyquant_cpu`
4. **Keep prose aligned** — edits to the project tagline, elevator pitch, or
headline benchmark numbers must land in `README.md`, `.github/README.md`,
`AGENTS.md`, and `CLAUDE.md` in the same commit.
---
## License
Apache-2.0. See [LICENSE](../LICENSE).

---
## Related documentation
- [Benchmark Report](../experiments/quantization-benchmark/REPORT.md) —
full empirical evaluation in CS-paper format
- [CHANGELOG](../CHANGELOG.md) — release notes
- [Design: Storage Codec Architecture](../docs/design/storage-codec-architecture.md)
- [Design: GPU Acceleration](../docs/design/rust/gpu-acceleration.md)
- [Research: Vector Quantization Paper Synthesis](../docs/research/vector-quantization-paper-synthesis.md)
- [QA: Validation Plan](../docs/qa/validation-plan/README.md)
- [CI Plan](../docs/CI-plan/README.md) and [CD Plan](../docs/CD-plan/README.md)