# NexusQuant
Compress your LLM's KV cache 10-33x. Training-free. One line of code.
---
> **Early-stage research project.** Results validated on Mistral-7B and Phi-3-mini only. NIAH testing shows factual recall degrades under compression (40% at 35% eviction). Not production-ready. Contributions and feedback welcome.
Token eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.
## Install
```bash
pip install nexusquant-kv
pip install "nexusquant-kv[hf]" # with HuggingFace transformers
```
## Quickstart
```python
from nexusquant import nexusquant_evict
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)
```
## Three-tier compression
| Tier | What it does | Compression | PPL impact | NIAH recall | Use case |
|---|---|---|---|---|---|
| Quant-only | E8 lattice VQ, no eviction | ~4x | ~0% (near-lossless) | 100% | Quality-critical apps |
| Light eviction | E8 VQ + 25% eviction (real scorer) | ~5.3x | +0.20% | 100% | Balanced quality + compression |
| Aggressive eviction | E8 VQ + 35-80% eviction | 8-33x | +0.3-5% | 0% | Memory-critical ("fits vs doesn't fit") |
The NIAH cliff is sharp: factual recall is 100% at 25% eviction and drops to 0% at 35%. Light eviction with the real attention scorer is the sweet spot for most deployments.
```python
from nexusquant import compress_kv_cache
# Near-lossless ~5x compression, NIAH preserved
compressed_kv = compress_kv_cache(past_key_values, mode="quant_only", rope_base=10000.0)
```
## Why
| Without NexusQuant | With NexusQuant |
|---|---|
| 128K context on 70B = ~42 GB KV cache (GQA) | Same context = ~2.5 GB KV cache (17x) |
| KV cache competes with model weights for VRAM | KV cache fits comfortably alongside weights |
| Long context needs multi-GPU or offloading | Single GPU, single machine |
| Deploy a fine-tuned retrieval model | One `with` block, no code changes |
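A back-of-envelope check on the 70B figures above, as a sketch assuming Llama-3-70B-style GQA shapes (80 layers, 8 KV heads, head dimension 128, FP16 cache; all values approximate):
```python
# Assumed GQA shapes for a 70B-class model; not read from any config.
layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K plane + V plane
full_cache = per_token * 128_000                           # 128K-token context
print(full_cache / 1e9)        # ~41.9 GB uncompressed
print(full_cache / 1e9 / 17)   # ~2.5 GB at the 17x "balanced" ratio
```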
## Quality presets
Measured on Mistral-7B, Phi-3-mini, Qwen2.5-7B. Compression ratios include all overhead.
| Preset | Compression | PPL degradation | Config |
|---|---|---|---|
| `high` | ~9x | <0.5% | K3V2 + real scorer + 35% evict |
| `asym` | ~14x | ~1% | K3V2 + 60% evict |
| `balanced` | ~17x | ~1.3% | K2V2 + 60% evict |
| `max` | ~33x | +0.66% | K2V2 + real scorer + 80% evict |
**Important: PPL alone does not tell the full quality story.** Eviction modes show 40% NIAH recall at K3V2-35% despite only +0.82% PPL. Quant-only mode (~5x) preserves NIAH recall fully. Use quant-only when factual accuracy matters more than compression ratio. See Limitations below.
Attention sharpening (scaling keys by sqrt(1.05) after quantization) gives a free +0.05pp quality improvement at zero compression cost. Enabled by default.
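As a minimal sketch of the sharpening step (the function name is illustrative, not the library API): scaling keys by sqrt(1.05) multiplies every query-key logit by sqrt(1.05) ≈ 1.025, mildly re-sharpening the softmax that quantization noise tends to flatten.
```python
import torch

def sharpen_keys(k: torch.Tensor, gain: float = 1.05) -> torch.Tensor:
    # Illustrative only: scale dequantized keys so each q.k logit grows
    # by sqrt(gain), slightly sharpening the attention distribution.
    return k * gain ** 0.5
```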
### Cross-architecture results (Cerebrium A10)
| Model | KV Heads | K3V2 35% | K2V2 35% | Notes |
|---|---|---|---|---|
| Gemma-2-2b-it | 4 | +0.05% | +0.35% | Best result. Large head_dim (256) helps. |
| Mistral-7B | 8 | +0.82% | +0.91% | Main benchmark. |
| Qwen2.5-1.5B | 2 | +5.04% | +28.7% | 2 KV heads = danger zone. boundary=1 mandatory. |
| Gemma 4-E2B | 1 | +54% | +54% | 1 KV head, heterogeneous head_dim (256/512). |
> **Pattern:** More KV heads = more compression-tolerant. Models with 1-2 KV heads need `protect_boundary=1` at minimum. Models with 1 KV head are compression-hostile.
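A usage sketch for low-KV-head models, assuming `protect_boundary` is passed through the same context manager as the other options shown in this README:
```python
# Assumed kwarg plumbing: protect_boundary keeps the first/last N layers
# at FP16, which the table above suggests is mandatory for 2-KV-head models.
with nexusquant_evict(model, quality="high", protect_boundary=1):
    output = model.generate(input_ids, max_new_tokens=200)
```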
## How it works
1. **Importance scoring** - rank tokens by attention weight. Two options: key-key proxy (fast, no extra pass) or **real attention scorer** (uses `attn_implementation='eager'`, zero quality loss at 35% eviction)
2. **Token eviction** - drop lowest-scoring tokens; always keep BOS and a recent sliding window
3. **RoPE removal** - undo rotary embeddings on keys so they share a common subspace
4. **Hadamard rotation** - spread energy uniformly across dimensions (handles non-power-of-2 head dims via zero-padding)
5. **E8 lattice quantization** - quantize 8-float groups onto the E8 root lattice (see the sketch after this list). **Asymmetric:** 3-bit keys + 2-bit values (keys need more precision due to softmax amplification)
6. **Boundary protection** - optionally keep first/last N layers at FP16 (mandatory for Qwen-family)
7. **Delta coding + zstd** - consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x
Token eviction reduces *count* (2.5x at 60% eviction). E8 quantization reduces *precision* (~7x after entropy coding). Combined: 17x.
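To make step 5 concrete, here is the textbook nearest-point decoder for E8 (Conway-Sloane: E8 is the union of D8 and D8 shifted by one half). This sketches only the lattice rounding; the library's actual quantizer additionally handles scaling and index coding.
```python
import torch

def nearest_d8(x: torch.Tensor) -> torch.Tensor:
    """Nearest point in D8 = {z in Z^8 : sum(z) even}, per 8-float group."""
    y = torch.round(x)
    err = x - y
    # If the coordinate sum is odd, step the worst-rounded coordinate
    # one unit toward x to restore even parity.
    idx = err.abs().argmax(dim=-1, keepdim=True)
    step = torch.where(torch.gather(err, -1, idx) >= 0, 1.0, -1.0)
    odd = (y.sum(dim=-1, keepdim=True) % 2) != 0
    return torch.where(odd, y.scatter_add(-1, idx, step), y)

def nearest_e8(x: torch.Tensor) -> torch.Tensor:
    """Nearest E8 point: decode both cosets D8 and D8 + 1/2, keep the closer."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    closer_a = ((x - a) ** 2).sum(-1, keepdim=True) <= ((x - b) ** 2).sum(-1, keepdim=True)
    return torch.where(closer_a, a, b)
```
And a sketch of step 7, assuming lattice indices arrive as small integers (`zstandard` is a third-party package, assumed installed):
```python
import numpy as np
import zstandard

def pack_indices(codes: np.ndarray) -> bytes:
    # Neighboring tokens map to similar lattice indices, so row-wise deltas
    # are small and repetitive; zstd exploits that redundancy for ~2-3x more.
    deltas = np.diff(codes, axis=0, prepend=np.zeros_like(codes[:1]))
    return zstandard.ZstdCompressor(level=3).compress(deltas.astype(np.int16).tobytes())
```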
## Compared to
| Method | Compression | PPL degradation | Training required | Notes |
|---|---|---|---|---|
| **NexusQuant (K3V2+scorer)** | **9-33x** | **+0.0-0.66%** | **No** | Includes eviction |
| **NexusQuant (K2V2)** | **10-33x** | **+0.4-2.6%** | **No** | Includes eviction |
| TurboQuant+ | 3.8-6.4x | ~0-1% | No | Quant-only, no eviction |
| KVTC (NVIDIA) | up to 20x | <1% | Yes (calibration) | |
| CommVQ (Apple) | ~8x | ~0% | Yes (retraining) | |
| Palu | 11x | ~25% rel | Yes (calibration) | |
NexusQuant ratios include token eviction (10-80% of tokens removed). TurboQuant+ ratios are pure quantization without eviction - not directly comparable. Competitor numbers from their papers.
## Supported models
Any HuggingFace causal LM using split-half RoPE (the standard since Llama-2):
- Llama family (Llama-2, Llama-3, Llama-3.1)
- Mistral / Mixtral
- Qwen
- Phi
- Gemma
Not yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).
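The split-half requirement comes from step 3 of the pipeline: RoPE removal inverts the rotation that HuggingFace applies to Llama-style keys. A sketch of that inverse, assuming keys shaped `(..., seq, head_dim)` and 1-D token `positions`:
```python
import torch

def remove_rope(k: torch.Tensor, positions: torch.Tensor, rope_base: float = 10000.0) -> torch.Tensor:
    # Inverse of split-half RoPE (HF Llama convention): rotate each
    # (first-half, second-half) coordinate pair back by -pos * theta_i
    # so keys from every position land in one shared subspace.
    d = k.shape[-1]
    inv_freq = 1.0 / rope_base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = positions[:, None].float() * inv_freq[None, :]   # (seq, d/2)
    cos, sin = ang.cos(), ang.sin()
    y1, y2 = k[..., : d // 2], k[..., d // 2 :]
    return torch.cat([y1 * cos + y2 * sin, y2 * cos - y1 * sin], dim=-1)
```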
## Advanced options
**Graduated layer bit profile** - gives boundary layers (first/last 15%) higher precision (3-bit K+V) while middle layers use standard asymmetric (K3V2). Small but consistent quality win (~0.02pp on Mistral-7B). GPU-validated.
```python
with nexusquant_evict(model, quality="high", layer_bit_profile="graduated"):
    output = model.generate(input_ids, max_new_tokens=200)
```
**Hybrid model compression** - for models like Gemma4 with sliding-window + global attention layers, compresses only the global layers, whose KV cache scales with context; sliding-window layers have a fixed memory cost.
```python
with nexusquant_evict(model, compress_layers="global_only"):
    output = model.generate(input_ids, max_new_tokens=200)
```
**Soft eviction (experimental, not recommended)** - quantizes evicted tokens at 1-bit instead of removing them. In testing, this performed worse than hard eviction (+2.24% vs +0.82% PPL at 35% eviction on Mistral-7B). The 1-bit tokens corrupt attention patterns more than masking them out. Kept for research purposes.
```python
with nexusquant_evict(model, soft_eviction=True):  # not recommended
    output = model.generate(input_ids, max_new_tokens=200)
```
## Limitations
- **Quality is text-dependent.** Creative/narrative text degrades more than structured/technical text. Test on your actual workload.
- **Short prefixes hurt.** Prefixes under 500 tokens see more degradation. The scorer needs enough tokens to distinguish signal from noise.
- **Architecture-dependent boundary protection.** Qwen-family models catastrophically fail without `protect_boundary=2`. Mistral and Phi-3 work without it. Always test your specific model.
- **E8 quantization is CPU-bound.** Triton GPU kernel is written (`nexusquant/kernels/e8_triton.py`) but not yet benchmarked for latency. Physical KV truncation (`truncate=True`) is implemented for actual VRAM savings.
- **Eviction hurts factual recall.** NIAH benchmark: baseline 100%, K3V2-35% eviction 40% recall, K3V2-60% eviction 53% recall (Mistral-7B-Instruct, ctx=1024-3072). PPL (+0.82%) hides this damage. Multi-window scoring improves recall to 80% at 35% eviction. If your task requires precise fact retrieval, test with NIAH before deploying.
- **PPL is not a sufficient quality metric.** Always validate with NIAH or downstream accuracy benchmarks. PPL averages over all positions and masks the loss of specific tokens.
- **Results on 7B-class models only.** 70B validation pending. Mistral-7B quantizes "exceptionally well" (ikawrakow) and is not representative of harder models.
- **Batch size > 1 is partially broken.** `NexusQuantSimple` only compresses batch index 0; the other batch elements are silently replaced by the first element's compressed result. `NexusQuantEvictTruncate` computes one keep-mask from batch element 0 and applies it to all sequences, which is incorrect when sequences differ in importance. Validate batch inference results carefully.
- **Multi-turn chat (persistent KV cache) is not supported.** The hook compresses on every incoming prefill (seq > 1). If the same cache is reused across conversation turns, the second turn's user message triggers re-compression of an already-quantized cache. Use a fresh context manager per turn, or call `model.generate` with `past_key_values=None` to reset the cache between turns (see the sketch after this list).
- **Speculative decoding is not supported.** Speculative decoding writes multiple draft tokens to the KV cache during the decode phase. Because the hook triggers on any batch of >1 new tokens, it will incorrectly fire on draft verification steps, compressing decode-phase tokens.
- **KV cache offloading is not supported.** `OffloadedCache` (used by HuggingFace's `accelerate` `max_memory` offloading) does not inherit from `DynamicLayer`, so the NexusQuant hooks do not intercept it. Compression silently does nothing when offloading is active.
- **Encoder-decoder models (T5, BART, Whisper) are not supported.** These models use cross-attention whose KV cache stores encoder representations rather than decoder tokens. RoPE removal in the pipeline assumes decoder self-attention with split-half rotary embeddings, which does not apply to T5-style relative position biases. Applying NexusQuant to encoder-decoder models will produce incorrect results.
- **Vision-language models (LLaVA, Qwen-VL, LLaVA-Next) are untested.** Model config detection handles nested `text_config`, but image tokens are scored for importance and evicted by the same heuristic as text tokens. High-information image tokens may be evicted. Results on VLMs have not been measured.
- **GGUF models are not supported.** GGUF format is typically run via llama.cpp or ctransformers, which do not use HuggingFace `DynamicCache`. The integration hooks have no effect. Only GPTQ/AWQ models loaded through `AutoModelForCausalLM` with HuggingFace are compatible.
- **rope_scaling (extended context) is not accounted for.** NexusQuant reads `rope_theta` but ignores the `rope_scaling` config, so for models using linear or NTK RoPE scaling (e.g., Llama-3.1 beyond 8K context), the RoPE removal introduces a small frequency mismatch at contexts past the original training length. Impact is unmeasured.
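A sketch of the per-turn workaround for the multi-turn limitation above (`conversation` and the generation settings are illustrative; any prompt construction works, as long as each turn starts from an empty cache):
```python
# Each turn re-tokenizes the full history and generates from an empty KV
# cache, so the hook only ever compresses a fresh, unquantized prefill.
history = []
for user_msg in conversation:  # conversation: iterable of user strings (illustrative)
    history.append({"role": "user", "content": user_msg})
    input_ids = tokenizer.apply_chat_template(
        history, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with nexusquant_evict(model, quality="balanced"):
        out = model.generate(input_ids, max_new_tokens=256, past_key_values=None)
    reply = tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
    history.append({"role": "assistant", "content": reply})
```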
## Citation
```bibtex
@software{nexusquant2026,
author = {Marques, Jo\~{a}o Andr\'{e} Gomes},
title = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization and Attention-Aware Token Eviction},
year = {2026},
url = {https://github.com/jagmarques/nexusquant},
license = {Apache-2.0},
}
```
## License
Apache 2.0. See [LICENSE](LICENSE).