{"id":49463045,"url":"https://github.com/back2matching/turboquant","last_synced_at":"2026-04-30T11:01:28.656Z","repository":{"id":346819734,"uuid":"1191360511","full_name":"back2matching/turboquant","owner":"back2matching","description":"First open-source TurboQuant KV cache compression for LLM inference. Drop-in for HuggingFace. pip install turboquant.","archived":false,"fork":false,"pushed_at":"2026-03-26T03:21:32.000Z","size":89,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-26T17:40:57.392Z","etag":null,"topics":["compression","gpu","huggingface","inference","kv-cache","llm","machine-learning","pytorch","quantization","transformers","turboquant","vram"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/turboquant/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/back2matching.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-25T06:56:34.000Z","updated_at":"2026-03-26T10:30:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/back2matching/turboquant","commit_stats":null,"previous_names":["back2matching/turboquant"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/back2matching/turboquant","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/back2matching%2Fturboquant","tags_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/back2matching%2Fturboquant/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/back2matching%2Fturboquant/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/back2matching%2Fturboquant/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/back2matching","download_url":"https://codeload.github.com/back2matching/turboquant/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/back2matching%2Fturboquant/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32462304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T22:27:22.272Z","status":"online","status_checked_at":"2026-04-30T02:00:05.929Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["compression","gpu","huggingface","inference","kv-cache","llm","machine-learning","pytorch","quantization","transformers","turboquant","vram"],"created_at":"2026-04-30T11:01:25.911Z","updated_at":"2026-04-30T11:01:28.648Z","avatar_url":"https://github.com/back2matching.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TurboQuant\n\n**Your LLM runs 2x faster at long context.** When the KV cache fills your VRAM, everything grinds. 
TurboQuant compresses it — and the speed comes back.\n\n| | FP16 (baseline) | TurboQuant 4-bit |\n|---|---|---|\n| **Qwen 3B @ 4K context** | 2.5 tok/s (thrashing) | **7.4 tok/s** |\n| **VRAM saved** | — | **1 GB** |\n| **Qwen 7B @ 2K context** | 1.0 tok/s (OOM) | **1.4 tok/s** |\n\nDrop-in for any HuggingFace model:\n\n```python\nfrom turboquant import TurboQuantCache\n\n# Symmetric: 4-bit keys + 4-bit values\ncache = TurboQuantCache(bits=4)\n\n# Asymmetric: 4-bit keys + 2-bit values (better quality, less memory)\ncache = TurboQuantCache(key_bits=4, value_bits=2)\n\n# Protect sensitive layers at full FP16 precision\ncache = TurboQuantCache(key_bits=4, value_bits=2, protected_layers=[0, 1, -1, -2])\n\noutputs = model(**inputs, past_key_values=cache, use_cache=True)\n```\n\n```bash\npip install turboquant\n```\n\n## Why this matters\n\nWhen LLMs generate text, they store key-value pairs for every token. This KV cache grows with context length and eats your VRAM. On a 16 GB GPU running a 3B model, the KV cache alone hits 1.2 GB at 4K tokens — and FP16 starts thrashing.\n\nTurboQuant compresses the cache to 4 bits (from 16) using Google's TurboQuant algorithm (ICLR 2026). No training data, no calibration, works with any model. 
The result: your GPU has room to breathe, and inference stays fast where it used to choke.\n\n## Install\n\n```bash\npip install turboquant\n```\n\nOr from source:\n\n```bash\ngit clone https://github.com/back2matching/turboquant\ncd turboquant\npip install -e .\n```\n\n## Quick Start\n\n### Drop into any HuggingFace model\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom turboquant import TurboQuantCache\nimport torch\n\nmodel = AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen2.5-3B-Instruct\", dtype=torch.float16, device_map=\"auto\")\ntokenizer = AutoTokenizer.from_pretrained(\"Qwen/Qwen2.5-3B-Instruct\")\n\n# Create compressed cache\ncache = TurboQuantCache(bits=4)\n\n# Use it like normal\ninputs = tokenizer(\"Hello, how are you?\", return_tensors=\"pt\").to(model.device)\noutputs = model(**inputs, past_key_values=cache, use_cache=True)\n```\n\n### Run the inference server\n\nTurboQuant ships with an OpenAI-compatible inference server. Point any OpenAI client at it.\n\n```bash\nturboquant-server --model Qwen/Qwen2.5-3B-Instruct --bits 4 --port 8000\n```\n\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":100}'\n```\n\n### Use the core algorithms directly\n\n```python\nfrom turboquant import TurboQuantMSE\n\n# Quantize any vectors (KV cache heads, embeddings, etc.)\ntq = TurboQuantMSE(dim=128, bits=4, device='cuda')\n\n# Quantize\nindices, norms = tq.quantize(vectors)  # vectors: (N, 128)\n\n# Dequantize\nvectors_hat = tq.dequantize(indices, norms)\n```\n\n## Benchmarks (RTX 4080 16GB)\n\nIndependent benchmarks on NVIDIA RTX 4080 (16 GB VRAM), PyTorch 2.5.1, CUDA 12.1. 
45 data points across 4 models.\n\n**Reproduce:**\n```bash\npython benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-3B-Instruct --context \"512,1024,2048,4096\"\npython benchmarks/benchmark_kv.py --model Qwen/Qwen2.5-7B-Instruct --quick  # fast sanity check\n```\n\nResults are saved per-model (`benchmarks/results_*.json`) and combined (`benchmarks/benchmark_results.json`).\n\n### Qwen2.5-7B-Instruct (14.5 GB model weights)\n\n| Context | KV Mode | Peak VRAM | VRAM Saved | Speed (tok/s) | Output Quality |\n|---------|---------|-----------|------------|---------------|----------------|\n| 460 | FP16 | 14,833 MB | -- | 17.7 | Coherent |\n| 460 | **TQ 4-bit** | **14,758 MB** | **75 MB** | **23.8** | Coherent |\n| 460 | TQ 3-bit | 14,758 MB | 75 MB | 20.6 | Minor artifacts |\n| 1860 | FP16 | 16,659 MB | -- | 1.0 | Coherent |\n| 1860 | **TQ 4-bit** | **16,215 MB** | **444 MB** | **1.4** | Coherent |\n| 1860 | TQ 3-bit | 16,217 MB | 442 MB | 1.4 | Coherent |\n\nAt 7B with 1.8K context, FP16 exceeds physical VRAM (16,659 \u003e 16,376 MB) and drops to 1 tok/s from swapping. TQ-4bit saves 444 MB and runs **40% faster** in this regime.\n\n### Qwen2.5-3B-Instruct — Context Length Sweep (5.9 GB model weights)\n\n| Context | KV Mode | Peak VRAM | VRAM Saved | Speed (tok/s) |\n|---------|---------|-----------|------------|---------------|\n| 460 | FP16 | 6,126 MB | -- | 14.6 |\n| 460 | TQ 4-bit | 6,075 MB | 51 MB | 7.8 |\n| 930 | FP16 | 6,451 MB | -- | 14.1 |\n| 930 | TQ 4-bit | 6,260 MB | 191 MB | 7.4 |\n| 1860 | FP16 | 7,359 MB | -- | 15.4 |\n| 1860 | TQ 4-bit | 6,835 MB | **524 MB** | 15.5 |\n| 3720 | FP16 | 10,222 MB | -- | 2.5 |\n| 3720 | TQ 4-bit | 9,174 MB | **1,048 MB** | **7.4** |\n\nVRAM savings scale with context length: 51 MB at 512 tokens up to **1,048 MB at 4K tokens**. 
At 4K context, FP16 hits memory pressure (2.5 tok/s) while TQ-4bit with nibble packing runs at **7.4 tok/s — 196% faster**.\n\n### Qwen2.5-0.5B-Instruct — Long Context (942 MB model weights)\n\n| Context | FP16 Peak | TQ 4-bit Peak | VRAM Saved | FP16 Speed | TQ 4-bit Speed |\n|---------|-----------|---------------|------------|------------|----------------|\n| 460 | 1,144 MB | 1,104 MB | 40 MB | 44.3 | 30.5 |\n| 930 | 1,417 MB | 1,262 MB | 155 MB | 46.1 | 30.3 |\n| 1860 | 2,189 MB | 1,669 MB | 520 MB | 41.7 | 29.1 |\n| 3720 | 4,654 MB | 3,621 MB | **1,033 MB** | 31.9 | 26.5 |\n| 7440 | 13,265 MB | 11,195 MB | **2,070 MB** | 17.8 | **19.8** |\n\nAt 8K context, TQ-4bit saves **2 GB of VRAM** and is **11% faster** than FP16. 16K OOM'd for all modes on 16 GB.\n\n### StableLM-2-1.6B — Cross-Architecture (3.1 GB model weights)\n\n| Context | FP16 Peak | TQ 4-bit Peak | VRAM Diff | FP16 Speed | TQ 4-bit Speed |\n|---------|-----------|---------------|-----------|------------|----------------|\n| 460 | 3,433 MB | 3,488 MB | +55 MB | 68.9 | 36.7 |\n| 930 | 3,724 MB | 3,894 MB | +170 MB | 68.2 | 34.8 |\n| 1860 | 4,302 MB | 4,700 MB | +398 MB | 61.4 | 34.7 |\n| 3720 | 5,459 MB | 6,318 MB | +859 MB | 56.1 | 33.1 |\n\nOn StableLM, TQ uses **more** VRAM than FP16 at every context length. The StableLM results were collected with v0.1.0 (dequantized storage). v0.2.0 stores compressed indices and may show different results on StableLM.\n\n### Key Takeaways\n\n- **VRAM savings scale linearly with context length.** At short contexts (\u003c512 tokens), savings are minimal. At 4K tokens, savings exceed **1 GB**. At 8K, savings reach **2 GB**.\n- **Under memory pressure, TQ is significantly faster than FP16.** At 4K context on 3B, FP16 drops to 2.5 tok/s while TQ-4bit runs at 7.4 tok/s (196% faster). At 8K on 0.5B, TQ is 11% faster.\n- **v0.2.0 stores compressed indices.** Cache uses uint8 indices + float32 norms instead of dequantized FP16. 
Real compression with on-the-fly dequantization.\n- **Output quality is good at 4-bit on 3B+ models.** Qwen 3B and 7B produce coherent code. On 0.5B, TQ output sometimes degrades to filler repetition — small models are more sensitive to quantization noise.\n\n### Algorithm Verification\n\n| Bits | MSE | Theoretical Bound | Compression |\n|------|-----|-------------------|-------------|\n| 1 | 0.362 | 0.680 | 12.8x |\n| 2 | 0.129 | 0.170 | 7.1x |\n| 3 | 0.049 | 0.043 | 4.9x |\n| 4 | 0.020 | 0.011 | 3.8x |\n\n## How It Works\n\nTurboQuant uses three ideas from the paper, plus community-validated optimizations:\n\n1. **Random rotation**: Multiply each KV vector by a random orthogonal matrix. This spreads the information evenly across all coordinates, making them nearly independent.\n\n2. **Optimal codebook**: Each coordinate now follows a predictable Beta distribution. We compute the mathematically optimal quantization levels for this distribution. No training data needed.\n\n3. **Residual window**: The most recent 128 tokens stay in full FP16 precision. Only older tokens get compressed. This preserves quality for the tokens attention focuses on most.\n\n**v0.3.0 additions** (adopted from community findings across 11 TurboQuant implementations):\n\n4. **Asymmetric K/V allocation**: Keys need more bits than values — K/V norm disparity can exceed 1000x. Default: 4-bit keys + 2-bit values for the best quality/memory tradeoff.\n\n5. **Layer-adaptive precision**: First and last transformer layers are most sensitive. `protected_layers=[0, 1, -1, -2]` keeps them at full FP16 while compressing middle layers.\n\n6. **MSE-only quantization**: Six independent teams confirmed QJL (Algorithm 2 from the paper) hurts attention quality. We use MSE-optimal quantization only (Algorithm 1). TurboQuantIP is deprecated.\n\nThe rotation is computed once (not per-token) and the codebook is derived analytically. 
No calibration, no fine-tuning, works with any model out of the box.\n\n## When to Use This\n\n**Good fit:**\n- You're running long contexts (8K+ tokens) on a VRAM-constrained GPU\n- You're serving multiple users and need to fit more KV caches in memory\n- You want to run a bigger model by freeing VRAM from KV cache\n- Standard transformer models (Llama, Mistral, Qwen2.5)\n\n**Not a good fit:**\n- Very short contexts (\u003c 1K tokens) where KV cache is tiny anyway\n- Hybrid architectures with recurrent layers (Qwen3.5, Mamba) that already have small KV caches\n- Tasks requiring exact bit-level precision (use FP16)\n- 3-bit on models smaller than 8B (quality degrades noticeably)\n\n## Comparison with Alternatives\n\n| Method | Where It Runs | Bits | Setup |\n|--------|---------------|------|-------|\n| **TurboQuant** | Any HuggingFace model | 3-4 | `pip install turboquant` |\n| Ollama q8_0 KV | Ollama only | 8 | `OLLAMA_KV_CACHE_TYPE=q8_0` |\n| Ollama q4_0 KV | Ollama only | 4 | `OLLAMA_KV_CACHE_TYPE=q4_0` |\n| vLLM FP8 KV | vLLM only | 8 | `kv_cache_dtype=\"fp8\"` |\n| KIVI | Research code | 2 | Not pip-installable |\n\nTurboQuant is the only pip-installable sub-8-bit KV cache compression that works with any HuggingFace model.\n\n## llama.cpp Integration\n\nA TQ4_0 KV cache type was proposed for llama.cpp:\n- **PR:** [ggml-org/llama.cpp#20995](https://github.com/ggml-org/llama.cpp/pull/20995) (closed — premature, multiple competing implementations in progress)\n- **Usage (if built from branch):** `--cache-type-k tq4_0 --cache-type-v f16 --no-kv-offload`\n- **Status:** Multiple community implementations in progress. 
Google's official code expected Q2 2026.\n\n## Paper\n\nThis implements the algorithm from:\n\n**TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate**\nAmir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni\nICLR 2026 | [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)\n\nThis is an independent implementation, not affiliated with Google Research.\n\n## License\n\nApache 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fback2matching%2Fturboquant","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fback2matching%2Fturboquant","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fback2matching%2Fturboquant/lists"}