https://github.com/back2matching/turboquant
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
https://github.com/back2matching/turboquant
compression gpu huggingface inference kv-cache llm machine-learning pytorch quantization transformers turboquant vram
Last synced: about 1 month ago
JSON representation
First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.
- Host: GitHub
- URL: https://github.com/back2matching/turboquant
- Owner: back2matching
- License: other
- Created: 2026-03-25T06:56:34.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-26T03:21:32.000Z (3 months ago)
- Last Synced: 2026-03-26T17:40:57.392Z (3 months ago)
- Topics: compression, gpu, huggingface, inference, kv-cache, llm, machine-learning, pytorch, quantization, transformers, turboquant, vram
- Language: Python
- Homepage: https://pypi.org/project/turboquant/
- Size: 86.9 KB
- Stars: 3
- Watchers: 0
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Roadmap: docs/ROADMAP.md
Awesome Lists containing this project
README
# TurboQuant
**Your LLM runs 2x faster at long context.** When the KV cache fills your VRAM, everything grinds. TurboQuant compresses it — and the speed comes back.
| | FP16 (baseline) | TurboQuant 4-bit |
|---|---|---|
| **Qwen 3B @ 4K context** | 2.5 tok/s (thrashing) | **7.4 tok/s** |
| **VRAM saved** | — | **1 GB** |
| **Qwen 7B @ 2K context** | 1.0 tok/s (OOM) | **1.4 tok/s** |
Drop-in for any HuggingFace model:
```python
from turboquant import TurboQuantCache
# Symmetric: 4-bit keys + 4-bit values
cache = TurboQuantCache(bits=4)
# Asymmetric: 4-bit keys + 2-bit values (better quality, less memory)
cache = TurboQuantCache(key_bits=4, value_bits=2)
# Protect sensitive layers at full FP16 precision
cache = TurboQuantCache(key_bits=4, value_bits=2, protected_layers=[0, 1, -1, -2])
outputs = model(**inputs, past_key_values=cache, use_cache=True)
```
```bash
pip install turboquant
```
## Why this matters
When LLMs generate text, they store key-value pairs for every token. This KV cache grows with context length and eats your VRAM. On a 16 GB GPU running a 3B model, the KV cache alone hits 1.2 GB at 4K tokens — and FP16 starts thrashing.
TurboQuant compresses the cache to 4 bits (from 16) using Google's TurboQuant algorithm (ICLR 2026). No training data, no calibration, works with any model. The result: your GPU has room to breathe, and inference stays fast where it used to choke.
## Install
```bash
pip install turboquant
```
Or from source:
```bash
git clone https://github.com/back2matching/turboquant
cd turboquant
pip install -e .
```
## Quick Start
### Drop into any HuggingFace model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
import torch
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
# Create compressed cache
cache = TurboQuantCache(bits=4)
# Use it like normal
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model(**inputs, past_key_values=cache, use_cache=True)
```
### Run the inference server
TurboQuant ships with an OpenAI-compatible inference server. Point any OpenAI client at it.
```bash
turboquant-server --model Qwen/Qwen2.5-3B-Instruct --bits 4 --port 8000
```
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
```
### Use the core algorithms directly
```python
from turboquant import TurboQuantMSE
# Quantize any vectors (KV cache heads, embeddings, etc.)
tq = TurboQuantMSE(dim=128, bits=4, device='cuda')
# Quantize
indices, norms = tq.quantize(vectors) # vectors: (N, 128)
# Dequantize
vectors_hat = tq.dequantize(indices, norms)
```
## Benchmarks (RTX 4080 16GB)
Independent benchmarks on NVIDIA RTX 4080 (16 GB VRAM), PyTorch 2.5.1, CUDA 12.1. 45 data points across 4 models.
**Reproduce:**
```bash
python benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-3B-Instruct --context "512,1024,2048,4096"
python benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-7B-Instruct --quick # fast sanity check
```
Results are saved per-model (`benchmarks/results_*.json`) and combined (`benchmarks/benchmark_results.json`).
### Qwen2.5-7B-Instruct (14.5 GB model weights)
| Context | KV Mode | Peak VRAM | VRAM Saved | Speed (tok/s) | Output Quality |
|---------|---------|-----------|------------|---------------|----------------|
| 460 | FP16 | 14,833 MB | -- | 17.7 | Coherent |
| 460 | **TQ 4-bit** | **14,758 MB** | **75 MB** | **23.8** | Coherent |
| 460 | TQ 3-bit | 14,758 MB | 75 MB | 20.6 | Minor artifacts |
| 1860 | FP16 | 16,659 MB | -- | 1.0 | Coherent |
| 1860 | **TQ 4-bit** | **16,215 MB** | **444 MB** | **1.4** | Coherent |
| 1860 | TQ 3-bit | 16,217 MB | 442 MB | 1.4 | Coherent |
At 7B with 1.8K context, FP16 exceeds physical VRAM (16,659 > 16,376 MB) and drops to 1 tok/s from swapping. TQ-4bit saves 444 MB and runs **40% faster** in this regime.
### Qwen2.5-3B-Instruct — Context Length Sweep (5.9 GB model weights)
| Context | KV Mode | Peak VRAM | VRAM Saved | Speed (tok/s) |
|---------|---------|-----------|------------|---------------|
| 460 | FP16 | 6,126 MB | -- | 14.6 |
| 460 | TQ 4-bit | 6,075 MB | 51 MB | 7.8 |
| 930 | FP16 | 6,451 MB | -- | 14.1 |
| 930 | TQ 4-bit | 6,260 MB | 191 MB | 7.4 |
| 1860 | FP16 | 7,359 MB | -- | 15.4 |
| 1860 | TQ 4-bit | 6,835 MB | **524 MB** | 15.5 |
| 3720 | FP16 | 10,222 MB | -- | 2.5 |
| 3720 | TQ 4-bit | 9,174 MB | **1,048 MB** | **7.4** |
VRAM savings scale with context length: 51 MB at 512 tokens up to **1,048 MB at 4K tokens**. At 4K context, FP16 hits memory pressure (2.5 tok/s) while TQ-4bit with nibble packing runs at **7.4 tok/s — 196% faster**.
### Qwen2.5-0.5B-Instruct — Long Context (942 MB model weights)
| Context | FP16 Peak | TQ 4-bit Peak | VRAM Saved | FP16 Speed | TQ 4-bit Speed |
|---------|-----------|---------------|------------|------------|----------------|
| 460 | 1,144 MB | 1,104 MB | 40 MB | 44.3 | 30.5 |
| 930 | 1,417 MB | 1,262 MB | 155 MB | 46.1 | 30.3 |
| 1860 | 2,189 MB | 1,669 MB | 520 MB | 41.7 | 29.1 |
| 3720 | 4,654 MB | 3,621 MB | **1,033 MB** | 31.9 | 26.5 |
| 7440 | 13,265 MB | 11,195 MB | **2,070 MB** | 17.8 | **19.8** |
At 8K context, TQ-4bit saves **2 GB of VRAM** and is **11% faster** than FP16. 16K OOM'd for all modes on 16 GB.
### StableLM-2-1.6B — Cross-Architecture (3.1 GB model weights)
| Context | FP16 Peak | TQ 4-bit Peak | VRAM Diff | FP16 Speed | TQ 4-bit Speed |
|---------|-----------|---------------|-----------|------------|----------------|
| 460 | 3,433 MB | 3,488 MB | +55 MB | 68.9 | 36.7 |
| 930 | 3,724 MB | 3,894 MB | +170 MB | 68.2 | 34.8 |
| 1860 | 4,302 MB | 4,700 MB | +398 MB | 61.4 | 34.7 |
| 3720 | 5,459 MB | 6,318 MB | +859 MB | 56.1 | 33.1 |
On StableLM, TQ uses **more** VRAM than FP16 at every context length. The StableLM results were collected with v0.1.0 (dequantized storage). v0.2.0 stores compressed indices and may show different results on StableLM.
### Key Takeaways
- **VRAM savings scale linearly with context length.** At short contexts (<512 tokens), savings are minimal. At 4K tokens, savings exceed **1 GB**. At 8K, savings reach **2 GB**.
- **Under memory pressure, TQ is significantly faster than FP16.** At 4K context on 3B, FP16 drops to 3.5 tok/s while TQ-4bit runs at 6.1 tok/s (74% faster). At 8K on 0.5B, TQ is 11% faster.
- **v0.2.0 stores compressed indices.** Cache uses uint8 indices + float32 norms instead of dequantized FP16. Real compression with on-the-fly dequantization.
- **Output quality is good at 4-bit on 3B+ models.** Qwen 3B and 7B produce coherent code. On 0.5B, TQ output sometimes degrades to filler repetition — small models are more sensitive to quantization noise.
### Algorithm Verification
| Bits | MSE | Theoretical Bound | Compression |
|------|-----|-------------------|-------------|
| 1 | 0.362 | 0.680 | 12.8x |
| 2 | 0.129 | 0.170 | 7.1x |
| 3 | 0.049 | 0.043 | 4.9x |
| 4 | 0.020 | 0.011 | 3.8x |
## How It Works
TurboQuant uses three ideas from the paper, plus community-validated optimizations:
1. **Random rotation**: Multiply each KV vector by a random orthogonal matrix. This spreads the information evenly across all coordinates, making them nearly independent.
2. **Optimal codebook**: Each coordinate now follows a predictable Beta distribution. We compute the mathematically optimal quantization levels for this distribution. No training data needed.
3. **Residual window**: The most recent 128 tokens stay in full FP16 precision. Only older tokens get compressed. This preserves quality for the tokens attention focuses on most.
**v0.3.0 additions** (adopted from community findings across 11 TurboQuant implementations):
4. **Asymmetric K/V allocation**: Keys need more bits than values — K/V norm disparity can exceed 1000x. Default: 4-bit keys + 2-bit values for the best quality/memory tradeoff.
5. **Layer-adaptive precision**: First and last transformer layers are most sensitive. `protected_layers=[0, 1, -1, -2]` keeps them at full FP16 while compressing middle layers.
6. **MSE-only quantization**: Six independent teams confirmed QJL (Algorithm 2 from the paper) hurts attention quality. We use MSE-optimal quantization only (Algorithm 1). TurboQuantIP is deprecated.
The rotation is computed once (not per-token) and the codebook is derived analytically. No calibration, no fine-tuning, works with any model out of the box.
## When to Use This
**Good fit:**
- You're running long contexts (8K+ tokens) on a VRAM-constrained GPU
- You're serving multiple users and need to fit more KV caches in memory
- You want to run a bigger model by freeing VRAM from KV cache
- Standard transformer models (Llama, Mistral, Qwen2.5)
**Not a good fit:**
- Very short contexts (< 1K tokens) where KV cache is tiny anyway
- Hybrid architectures with recurrent layers (Qwen3.5, Mamba) that already have small KV caches
- Tasks requiring exact bit-level precision (use FP16)
- 3-bit on models smaller than 8B (quality degrades noticeably)
## Comparison with Alternatives
| Method | Where It Runs | Bits | Setup |
|--------|---------------|------|-------|
| **TurboQuant** | Any HuggingFace model | 3-4 | `pip install turboquant` |
| Ollama q8_0 KV | Ollama only | 8 | `OLLAMA_KV_CACHE_TYPE=q8_0` |
| Ollama q4_0 KV | Ollama only | 4 | `OLLAMA_KV_CACHE_TYPE=q4_0` |
| vLLM FP8 KV | vLLM only | 8 | `kv_cache_dtype="fp8"` |
| KIVI | Research code | 2 | Not pip-installable |
TurboQuant is the only pip-installable sub-8-bit KV cache compression that works with any HuggingFace model.
## llama.cpp Integration
A TQ4_0 KV cache type was proposed for llama.cpp:
- **PR:** [ggml-org/llama.cpp#20995](https://github.com/ggml-org/llama.cpp/pull/20995) (closed — premature, multiple competing implementations in progress)
- **Usage (if built from branch):** `--cache-type-k tq4_0 --cache-type-v f16 --no-kv-offload`
- **Status:** Multiple community implementations in progress. Google's official code expected Q2 2026.
## Paper
This implements the algorithm from:
**TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate**
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
ICLR 2026 | [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)
This is an independent implementation, not affiliated with Google Research.
## License
Apache 2.0