# NexusQuant
Compress your LLM's KV cache 10-33x. Training-free. One line of code.
---
> **Early-stage research project.** Results validated on Mistral-7B and Phi-3-mini only. NIAH testing shows factual recall degrades under compression (40% at 35% eviction). Not production-ready. Contributions and feedback welcome.
Token eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.
## Install
```bash
pip install nexusquant-kv
pip install "nexusquant-kv[hf]" # with HuggingFace transformers
```
## Quickstart
```python
from nexusquant import nexusquant_evict
with nexusquant_evict(model, quality="balanced"):
    output = model.generate(input_ids, max_new_tokens=512)
```
## Three-tier compression
| Tier | What it does | Compression | PPL impact | NIAH recall | Use case |
|---|---|---|---|---|---|
| Quant-only | E8 lattice VQ, no eviction | ~4x | ~0% (near-lossless) | 100% | Quality-critical apps |
| Light eviction | E8 VQ + 25% eviction (real scorer) | ~5.3x | +0.20% | 100% | Balanced quality + compression |
| Aggressive eviction | E8 VQ + 35-80% eviction | 8-33x | +0.3-5% | 0% | Memory-critical ("fits vs doesn't fit") |
The NIAH cliff is sharp: factual recall is 100% at 25% eviction and drops to 0% at 35%. Light eviction with the real attention scorer is the sweet spot for most deployments.
```python
from nexusquant import compress_kv_cache
# Near-lossless ~5x compression, NIAH preserved
compressed_kv = compress_kv_cache(past_key_values, mode="quant_only", rope_base=10000.0)
```
## Why
| Without NexusQuant | With NexusQuant |
|---|---|
| 128K context on 70B = ~42 GB KV cache (GQA) | Same context = ~2.5 GB KV cache (17x) |
| KV cache competes with model weights for VRAM | KV cache fits comfortably alongside weights |
| Long context needs multi-GPU or offloading | Single GPU, single machine |
| Deploy a fine-tuned retrieval model | One `with` block, no code changes |
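A back-of-envelope check on the 70B figures above, as a sketch assuming Llama-3-70B-style GQA shapes (80 layers, 8 KV heads, head dimension 128, FP16 cache; all values approximate):
```python
# Assumed GQA shapes for a 70B-class model; not read from any config.
layers, kv_heads, head_dim, fp16_bytes = 80, 8, 128, 2
per_token = 2 * layers * kv_heads * head_dim * fp16_bytes  # K plane + V plane
full_cache = per_token * 128_000                           # 128K-token context
print(full_cache / 1e9)        # ~41.9 GB uncompressed
print(full_cache / 1e9 / 17)   # ~2.5 GB at the 17x "balanced" ratio
```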
## Quality presets
Measured on Mistral-7B, Phi-3-mini, Qwen2.5-7B. Compression ratios include all overhead.
| Preset | Compression | PPL degradation | Config |
|---|---|---|---|
| `high` | ~9x | <0.5% | K3V2 + real scorer + 35% evict |
| `asym` | ~14x | ~1% | K3V2 + 60% evict |
| `balanced` | ~17x | ~1.3% | K2V2 + 60% evict |
| `max` | ~33x | +0.66% | K2V2 + real scorer + 80% evict |
**Important: PPL alone does not tell the full quality story.** Eviction modes show 40% NIAH recall at K3V2-35% despite only +0.82% PPL. Quant-only mode (~5x) preserves NIAH recall fully. Use quant-only when factual accuracy matters more than compression ratio. See Limitations below.
Attention sharpening (scaling keys by sqrt(1.05) after quantization) gives a free +0.05pp quality improvement at zero compression cost. Enabled by default.
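As a minimal sketch of the sharpening step (the function name is illustrative, not the library API): scaling keys by sqrt(1.05) multiplies every query-key logit by sqrt(1.05) ≈ 1.025, mildly re-sharpening the softmax that quantization noise tends to flatten.
```python
import torch

def sharpen_keys(k: torch.Tensor, gain: float = 1.05) -> torch.Tensor:
    # Illustrative only: scale dequantized keys so each q.k logit grows
    # by sqrt(gain), slightly sharpening the attention distribution.
    return k * gain ** 0.5
```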
### Cross-architecture results (Cerebrium A10)
| Model | KV Heads | K3V2 35% | K2V2 35% | Notes |
|---|---|---|---|---|
| Gemma-2-2b-it | 4 | +0.05% | +0.35% | Best result. Large head_dim (256) helps. |
| Mistral-7B | 8 | +0.82% | +0.91% | Main benchmark. |
| Qwen2.5-1.5B | 2 | +5.04% | +28.7% | 2 KV heads = danger zone. boundary=1 mandatory. |
| Gemma 4-E2B | 1 | +54% | +54% | 1 KV head, heterogeneous head_dim (256/512). |
> **Pattern:** More KV heads = more compression-tolerant. Models with 1-2 KV heads need `protect_boundary=1` at minimum. Models with 1 KV head are compression-hostile.
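A usage sketch for low-KV-head models, assuming `protect_boundary` is passed through the same context manager as the other options shown in this README:
```python
# Assumed kwarg plumbing: protect_boundary keeps the first/last N layers
# at FP16, which the table above suggests is mandatory for 2-KV-head models.
with nexusquant_evict(model, quality="high", protect_boundary=1):
    output = model.generate(input_ids, max_new_tokens=200)
```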
## How it works
1. **Importance scoring** - rank tokens by attention weight. Two options: key-key proxy (fast, no extra pass) or **real attention scorer** (uses `attn_implementation='eager'`, zero quality loss at 35% eviction)
2. **Token eviction** - drop lowest-scoring tokens; always keep BOS and a recent sliding window
3. **RoPE removal** - undo rotary embeddings on keys so they share a common subspace
4. **Hadamard rotation** - spread energy uniformly across dimensions (handles non-power-of-2 head dims via zero-padding)
5. **E8 lattice quantization** - quantize 8-float groups onto the E8 root lattice (see the sketch after this list). **Asymmetric:** 3-bit keys + 2-bit values (keys need more precision due to softmax amplification)
6. **Boundary protection** - optionally keep first/last N layers at FP16 (mandatory for Qwen-family)
7. **Delta coding + zstd** - consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x
Token eviction reduces *count* (2.5x at 60% eviction). E8 quantization reduces *precision* (~7x after entropy coding). Combined: 17x.
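To make step 5 concrete, here is the textbook nearest-point decoder for E8 (Conway-Sloane: E8 is the union of D8 and D8 shifted by one half). This sketches only the lattice rounding; the library's actual quantizer additionally handles scaling and index coding.
```python
import torch

def nearest_d8(x: torch.Tensor) -> torch.Tensor:
    """Nearest point in D8 = {z in Z^8 : sum(z) even}, per 8-float group."""
    y = torch.round(x)
    err = x - y
    # If the coordinate sum is odd, step the worst-rounded coordinate
    # one unit toward x to restore even parity.
    idx = err.abs().argmax(dim=-1, keepdim=True)
    step = torch.where(torch.gather(err, -1, idx) >= 0, 1.0, -1.0)
    odd = (y.sum(dim=-1, keepdim=True) % 2) != 0
    return torch.where(odd, y.scatter_add(-1, idx, step), y)

def nearest_e8(x: torch.Tensor) -> torch.Tensor:
    """Nearest E8 point: decode both cosets D8 and D8 + 1/2, keep the closer."""
    a = nearest_d8(x)
    b = nearest_d8(x - 0.5) + 0.5
    closer_a = ((x - a) ** 2).sum(-1, keepdim=True) <= ((x - b) ** 2).sum(-1, keepdim=True)
    return torch.where(closer_a, a, b)
```
And a sketch of step 7, assuming lattice indices arrive as small integers (`zstandard` is a third-party package, assumed installed):
```python
import numpy as np
import zstandard

def pack_indices(codes: np.ndarray) -> bytes:
    # Neighboring tokens map to similar lattice indices, so row-wise deltas
    # are small and repetitive; zstd exploits that redundancy for ~2-3x more.
    deltas = np.diff(codes, axis=0, prepend=np.zeros_like(codes[:1]))
    return zstandard.ZstdCompressor(level=3).compress(deltas.astype(np.int16).tobytes())
```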
## Compared to
| Method | Compression | PPL degradation | Training required | Notes |
|---|---|---|---|---|
| **NexusQuant (K3V2+scorer)** | **9-33x** | **+0.0-0.66%** | **No** | Includes eviction |
| **NexusQuant (K2V2)** | **10-33x** | **+0.4-2.6%** | **No** | Includes eviction |
| TurboQuant+ | 3.8-6.4x | ~0-1% | No | Quant-only, no eviction |
| KVTC (NVIDIA) | up to 20x | <1% | Yes (calibration) | |
| CommVQ (Apple) | ~8x | ~0% | Yes (retraining) | |
| Palu | 11x | ~25% rel | Yes (calibration) | |
NexusQuant ratios include token eviction (10-80% of tokens removed). TurboQuant+ ratios are pure quantization without eviction - not directly comparable. Competitor numbers from their papers.
## Supported models
Any HuggingFace causal LM using split-half RoPE (the standard since Llama-2):
- Llama family (Llama-2, Llama-3, Llama-3.1)
- Mistral / Mixtral
- Qwen
- Phi
- Gemma
Not yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).
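The split-half requirement comes from step 3 of the pipeline: RoPE removal inverts the rotation that HuggingFace applies to Llama-style keys. A sketch of that inverse, assuming keys shaped `(..., seq, head_dim)` and 1-D token `positions`:
```python
import torch

def remove_rope(k: torch.Tensor, positions: torch.Tensor, rope_base: float = 10000.0) -> torch.Tensor:
    # Inverse of split-half RoPE (HF Llama convention): rotate each
    # (first-half, second-half) coordinate pair back by -pos * theta_i
    # so keys from every position land in one shared subspace.
    d = k.shape[-1]
    inv_freq = 1.0 / rope_base ** (torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = positions[:, None].float() * inv_freq[None, :]   # (seq, d/2)
    cos, sin = ang.cos(), ang.sin()
    y1, y2 = k[..., : d // 2], k[..., d // 2 :]
    return torch.cat([y1 * cos + y2 * sin, y2 * cos - y1 * sin], dim=-1)
```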
## Advanced options
**Graduated layer bit profile** - gives boundary layers (first/last 15%) higher precision (3-bit K+V) while middle layers use standard asymmetric (K3V2). Small but consistent quality win (~0.02pp on Mistral-7B). GPU-validated.
```python
with nexusquant_evict(model, quality="high", layer_bit_profile="graduated"):
    output = model.generate(input_ids, max_new_tokens=200)
```
**Hybrid model compression** - for models like Gemma4 with sliding-window + global attention layers, compresses only the global layers, whose KV cache scales with context; sliding-window layers have a fixed memory cost.
```python
with nexusquant_evict(model, compress_layers="global_only"):
    output = model.generate(input_ids, max_new_tokens=200)
```
**Soft eviction (experimental, not recommended)** - quantizes evicted tokens at 1-bit instead of removing them. In testing, this performed worse than hard eviction (+2.24% vs +0.82% PPL at 35% eviction on Mistral-7B). The 1-bit tokens corrupt attention patterns more than masking them out. Kept for research purposes.
```python
with nexusquant_evict(model, soft_eviction=True):  # not recommended
    output = model.generate(input_ids, max_new_tokens=200)
```
## Limitations
- **Quality is text-dependent.** Creative/narrative text degrades more than structured/technical text. Test on your actual workload.
- **Short prefixes hurt.** Prefixes under 500 tokens see more degradation. The scorer needs enough tokens to distinguish signal from noise.
- **Architecture-dependent boundary protection.** Qwen-family models catastrophically fail without `protect_boundary=2`. Mistral and Phi-3 work without it. Always test your specific model.
- **E8 quantization is CPU-bound.** Triton GPU kernel is written (`nexusquant/kernels/e8_triton.py`) but not yet benchmarked for latency. Physical KV truncation (`truncate=True`) is implemented for actual VRAM savings.
- **Eviction hurts factual recall.** NIAH benchmark: baseline 100%, K3V2-35% eviction 40% recall, K3V2-60% eviction 53% recall (Mistral-7B-Instruct, ctx=1024-3072). PPL (+0.82%) hides this damage. Multi-window scoring improves recall to 80% at 35% eviction. If your task requires precise fact retrieval, test with NIAH before deploying.
- **PPL is not a sufficient quality metric.** Always validate with NIAH or downstream accuracy benchmarks. PPL averages over all positions and masks the loss of specific tokens.
- **Results on 7B-class models only.** 70B validation pending. Mistral-7B quantizes "exceptionally well" (ikawrakow) and is not representative of harder models.
- **Batch size > 1 is partially broken.** `NexusQuantSimple` only compresses batch index 0; the other batch elements are silently replaced by the first element's compressed result. `NexusQuantEvictTruncate` computes one keep-mask from batch element 0 and applies it to all sequences, which is incorrect when sequences differ in importance. Validate batch inference results carefully.
- **Multi-turn chat (persistent KV cache) is not supported.** The hook compresses on every incoming prefill (seq > 1). If the same cache is reused across conversation turns, the second turn's user message triggers re-compression of an already-quantized cache. Use a fresh context manager per turn, or call `model.generate` with `past_key_values=None` to reset the cache between turns (see the sketch after this list).
- **Speculative decoding is not supported.** Speculative decoding writes multiple draft tokens to the KV cache during the decode phase. Because the hook triggers on any batch of >1 new tokens, it will incorrectly fire on draft verification steps, compressing decode-phase tokens.
- **KV cache offloading is not supported.** `OffloadedCache` (used by HuggingFace's `accelerate` `max_memory` offloading) does not inherit from `DynamicLayer`, so the NexusQuant hooks do not intercept it. Compression silently does nothing when offloading is active.
- **Encoder-decoder models (T5, BART, Whisper) are not supported.** These models use cross-attention whose KV cache stores encoder representations rather than decoder tokens. RoPE removal in the pipeline assumes decoder self-attention with split-half rotary embeddings, which does not apply to T5-style relative position biases. Applying NexusQuant to encoder-decoder models will produce incorrect results.
- **Vision-language models (LLaVA, Qwen-VL, LLaVA-Next) are untested.** Model config detection handles nested `text_config`, but image tokens are scored for importance and evicted by the same heuristic as text tokens. High-information image tokens may be evicted. Results on VLMs have not been measured.
- **GGUF models are not supported.** GGUF format is typically run via llama.cpp or ctransformers, which do not use HuggingFace `DynamicCache`. The integration hooks have no effect. Only GPTQ/AWQ models loaded through `AutoModelForCausalLM` with HuggingFace are compatible.
- **rope_scaling (extended context) is not accounted for.** NexusQuant reads `rope_theta` but ignores the `rope_scaling` config, so for models using linear or NTK RoPE scaling (e.g., Llama-3.1 beyond 8K context), the RoPE removal introduces a small frequency mismatch at contexts past the original training length. Impact is unmeasured.
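A sketch of the per-turn workaround for the multi-turn limitation above (`conversation` and the generation settings are illustrative; any prompt construction works, as long as each turn starts from an empty cache):
```python
# Each turn re-tokenizes the full history and generates from an empty KV
# cache, so the hook only ever compresses a fresh, unquantized prefill.
history = []
for user_msg in conversation:  # conversation: iterable of user strings (illustrative)
    history.append({"role": "user", "content": user_msg})
    input_ids = tokenizer.apply_chat_template(
        history, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with nexusquant_evict(model, quality="balanced"):
        out = model.generate(input_ids, max_new_tokens=256, past_key_values=None)
    reply = tokenizer.decode(out[0, input_ids.shape[1]:], skip_special_tokens=True)
    history.append({"role": "assistant", "content": reply})
```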
## Citation
```bibtex
@software{nexusquant2026,
author = {Marques, Jo\~{a}o Andr\'{e} Gomes},
title = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization and Attention-Aware Token Eviction},
year = {2026},
url = {https://github.com/jagmarques/nexusquant},
license = {Apache-2.0},
}
```
## License
Apache 2.0. See [LICENSE](LICENSE).