{"id":48861446,"url":"https://github.com/jagmarques/nexusquant","last_synced_at":"2026-05-01T16:01:23.599Z","repository":{"id":349730771,"uuid":"1203619705","full_name":"jagmarques/nexusquant","owner":"jagmarques","description":"Training-free KV cache compression for LLMs. 10-33x compression via E8 lattice quantization + attention-aware token eviction. One line of code.","archived":false,"fork":false,"pushed_at":"2026-04-15T20:46:16.000Z","size":1959,"stargazers_count":11,"open_issues_count":5,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-15T22:27:55.018Z","etag":null,"topics":["attention","compression","e8-lattice","inference","kv-cache","llama","llm","long-context","memory-efficient","mistral","pytorch","quantization","token-eviction","transformers","vector-quantization"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jagmarques.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-07T07:53:52.000Z","updated_at":"2026-04-15T20:46:20.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jagmarques/nexusquant","commit_stats":null,"previous_names":["jagmarques/nexusquant"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jagmarques/nexusquant","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jagmarques%2Fnexusquant","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jagmarques%2Fnexusquant/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jagmarques%2Fnexusquant/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jagmarques%2Fnexusquant/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jagmarques","download_url":"https://codeload.github.com/jagmarques/nexusquant/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jagmarques%2Fnexusquant/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32503204,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-30T13:12:12.517Z","status":"online","status_checked_at":"2026-05-01T02:00:05.856Z","response_time":64,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention","compression","e8-lattice","inference","kv-cache","llama","llm","long-context","memory-efficient","mistral","pytorch","quantization","token-eviction","transformers","vector-quantization"],"created_at":"2026-04-15T16:00:38.533Z","updated_at":"2026-05-01T16:01:23.571Z","avatar_url":"https://github.com/jagmarques.png","language":"Python","funding_links":[],"categories":["📊 Evaluation"],"sub_categories":["8️⃣ Non-Autoregressive"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eNexusQuant\u003c/strong\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  Compress your LLM's KV cache 10-33x. Training-free. One line of code.\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/nexusquant-kv/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/nexusquant-kv?style=flat-square\u0026logo=pypi\u0026logoColor=white\" alt=\"PyPI\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/jagmarques/nexusquant/blob/main/LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/License-Apache_2.0-blue.svg?style=flat-square\" alt=\"License\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://www.python.org/downloads/\"\u003e\u003cimg src=\"https://img.shields.io/badge/python-3.9+-blue?style=flat-square\u0026logo=python\u0026logoColor=white\" alt=\"Python\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/jagmarques/nexusquant\"\u003e\u003cimg src=\"https://img.shields.io/github/stars/jagmarques/nexusquant?style=social\" alt=\"Stars\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n\u003e **Early-stage research project.** Results validated on Mistral-7B and Phi-3-mini only. NIAH testing shows factual recall degrades under compression (40% at 35% eviction). Not production-ready. Contributions and feedback welcome.\n\nToken eviction + E8 lattice quantization, applied once after prefill. No training, no calibration data, no model modifications.\n\n## Install\n\n```bash\npip install nexusquant-kv\npip install \"nexusquant-kv[hf]\"  # with HuggingFace transformers\n```\n\n## Quickstart\n\n```python\nfrom nexusquant import nexusquant_evict\n\nwith nexusquant_evict(model, quality=\"balanced\"):\n    output = model.generate(input_ids, max_new_tokens=512)\n```\n\n## Three-tier compression\n\n| Tier | What it does | Compression | PPL impact | NIAH recall | Use case |\n|---|---|---|---|---|---|\n| Quant-only | E8 lattice VQ, no eviction | ~4x | ~0% (lossless) | 100% | Quality-critical apps |\n| Light eviction | E8 VQ + 25% eviction (real scorer) | ~5.3x | +0.20% | 100% | Balanced quality + compression |\n| Aggressive eviction | E8 VQ + 35-80% eviction | 8-33x | +0.3-5% | 0% | Memory-critical (\"fits vs doesn't fit\") |\n\nThe NIAH cliff is sharp: factual recall is 100% at 25% eviction and drops to 0% at 35%. Light eviction with the real attention scorer is the sweet spot for most deployments.\n\n```python\nfrom nexusquant import compress_kv_cache\n\n# Lossless ~5x compression, NIAH preserved\ncompressed_kv = compress_kv_cache(past_key_values, mode=\"quant_only\", rope_base=10000.0)\n```\n\n## Why\n\n| Without NexusQuant | With NexusQuant |\n|---|---|\n| 128K context on 70B = ~42 GB KV cache (GQA) | Same context = ~2.5 GB KV cache (17x) |\n| KV cache competes with model weights for VRAM | KV cache fits comfortably alongside weights |\n| Long context needs multi-GPU or offloading | Single GPU, single machine |\n| Deploy a fine-tuned retrieval model | One `with` block, no code changes |\n\n## Quality presets\n\nMeasured on Mistral-7B, Phi-3-mini, Qwen2.5-7B. Compression ratios include all overhead.\n\n| Preset | Compression | PPL degradation | Config |\n|---|---|---|---|\n| `high` | ~9x | \u003c0.5% | K3V2 + real scorer + 35% evict |\n| `asym` | ~14x | ~1% | K3V2 + 60% evict |\n| `balanced` | ~17x | ~1.3% | K2V2 + 60% evict |\n| `max` | ~33x | +0.66% | K2V2 + real scorer + 80% evict |\n\n**Important: PPL alone does not tell the full quality story.** Eviction modes show 40% NIAH recall at K3V2-35% despite only +0.82% PPL. Quant-only mode (~5x) preserves NIAH recall fully. Use quant-only when factual accuracy matters more than compression ratio. See Limitations below.\n\nAttention sharpening (scaling keys by sqrt(1.05) after quantization) gives a free +0.05pp quality improvement at zero compression cost. Enabled by default.\n\n### Cross-architecture results (Cerebrium A10)\n\n| Model | KV Heads | K3V2 35% | K2V2 35% | Notes |\n|---|---|---|---|---|\n| Gemma-2-2b-it | 4 | +0.05% | +0.35% | Best result. Large head_dim (256) helps. |\n| Mistral-7B | 8 | +0.82% | +0.91% | Main benchmark. |\n| Qwen2.5-1.5B | 2 | +5.04% | +28.7% | 2 KV heads = danger zone. boundary=1 mandatory. |\n| Gemma 4-E2B | 1 | +54% | +54% | 1 KV head, heterogeneous head_dim (256/512). |\n\n\u003e **Pattern:** More KV heads = more compression-tolerant. Models with 1-2 KV heads need `protect_boundary=1` at minimum. Models with 1 KV head are compression-hostile.\n\n## How it works\n\n1. **Importance scoring**  - rank tokens by attention weight. Two options: key-key proxy (fast, no extra pass) or **real attention scorer** (uses `attn_implementation='eager'`, zero quality loss at 35% eviction)\n2. **Token eviction**  - drop lowest-scoring tokens; always keep BOS and a recent sliding window\n3. **RoPE removal**  - undo rotary embeddings on keys so they share a common subspace\n4. **Hadamard rotation**  - spread energy uniformly across dimensions (handles non-power-of-2 head dims via zero-padding)\n5. **E8 lattice quantization**  - quantize 8-float groups onto the E8 root lattice. **Asymmetric:** 3-bit keys + 2-bit values (keys need more precision due to softmax amplification)\n6. **Boundary protection**  - optionally keep first/last N layers at FP16 (mandatory for Qwen-family)\n7. **Delta coding + zstd**  - consecutive tokens produce similar lattice indices; storing deltas then compressing with zstd yields another 2-3x\n\nToken eviction reduces *count* (2.5x at 60% eviction). E8 quantization reduces *precision* (~7x after entropy coding). Combined: 17x.\n\n## Compared to\n\n| Method | Compression | PPL degradation | Training required | Notes |\n|---|---|---|---|---|\n| **NexusQuant (K3V2+scorer)** | **9-33x** | **+0.0-0.66%** | **No** | Includes eviction |\n| **NexusQuant (K2V2)** | **10-33x** | **+0.4-2.6%** | **No** | Includes eviction |\n| TurboQuant+ | 3.8-6.4x | ~0-1% | No | Quant-only, no eviction |\n| KVTC (NVIDIA) | up to 20x | \u003c1% | Yes (calibration) | |\n| CommVQ (Apple) | ~8x | ~0% | Yes (retraining) | |\n| Palu | 11x | ~25% rel | Yes (calibration) | |\n\nNexusQuant ratios include token eviction (10-80% of tokens removed). TurboQuant+ ratios are pure quantization without eviction  - not directly comparable. Competitor numbers from their papers.\n\n## Supported models\n\nAny HuggingFace causal LM using split-half RoPE (the standard since Llama-2):\n\n- Llama family (Llama-2, Llama-3, Llama-3.1)\n- Mistral / Mixtral\n- Qwen\n- Phi\n- Gemma\n\nNot yet supported: models with interleaved RoPE (GPT-NeoX, GPT-J).\n\n## Advanced options\n\n**Graduated layer bit profile** - gives boundary layers (first/last 15%) higher precision (3-bit K+V) while middle layers use standard asymmetric (K3V2). Small but consistent quality win (~0.02pp on Mistral-7B). GPU-validated.\n\n```python\nwith nexusquant_evict(model, quality=\"high\", layer_bit_profile=\"graduated\"):\n    output = model.generate(input_ids, max_new_tokens=200)\n```\n\n**Hybrid model compression** - for models like Gemma4 with sliding-window + global attention layers, only compress the global layers (which scale with context). SWA layers have fixed memory cost.\n\n```python\nwith nexusquant_evict(model, compress_layers=\"global_only\"):\n    output = model.generate(input_ids, max_new_tokens=200)\n```\n\n**Soft eviction (experimental, not recommended)** - quantizes evicted tokens at 1-bit instead of removing them. In testing, this performed worse than hard eviction (+2.24% vs +0.82% PPL at 35% eviction on Mistral-7B). The 1-bit tokens corrupt attention patterns more than masking them out. Kept for research purposes.\n\n```python\nwith nexusquant_evict(model, soft_eviction=True):  # not recommended\n    output = model.generate(input_ids, max_new_tokens=200)\n```\n\n## Limitations\n\n- **Quality is text-dependent.** Creative/narrative text degrades more than structured/technical text. Test on your actual workload.\n- **Short prefixes hurt.** Prefixes under 500 tokens see more degradation. The scorer needs enough tokens to distinguish signal from noise.\n- **Architecture-dependent boundary protection.** Qwen-family models catastrophically fail without `protect_boundary=2`. Mistral and Phi-3 work without it. Always test your specific model.\n- **E8 quantization is CPU-bound.** Triton GPU kernel is written (`nexusquant/kernels/e8_triton.py`) but not yet benchmarked for latency. Physical KV truncation (`truncate=True`) is implemented for actual VRAM savings.\n- **Eviction hurts factual recall.** NIAH benchmark: baseline 100%, K3V2-35% eviction 40% recall, K3V2-60% eviction 53% recall (Mistral-7B-Instruct, ctx=1024-3072). PPL (+0.82%) hides this damage. Multi-window scoring improves recall to 80% at 35% eviction. If your task requires precise fact retrieval, test with NIAH before deploying.\n- **PPL is not a sufficient quality metric.** Always validate with NIAH or downstream accuracy benchmarks. PPL averages over all positions and masks the loss of specific tokens.\n- **Results on 7B-class models only.** 70B validation pending. Mistral-7B quantizes \"exceptionally well\" (ikawrakow) and is not representative of harder models.\n- **Batch size \u003e 1 is partially broken.** `NexusQuantSimple` only compresses batch index 0; other batch elements are silently dropped to the first element's compressed result. `NexusQuantEvictTruncate` computes one keep-mask from batch element 0 and applies it to all sequences  - incorrect when sequences differ in importance. Validate batch inference results carefully.\n- **Multi-turn chat (persistent KV cache) is not supported.** The hook compresses on every incoming prefill (seq \u003e 1). If the same cache is reused across conversation turns, the second turn's user message triggers re-compression of an already-quantized cache. Use a fresh context manager per turn, or call `model.generate` with `past_key_values=None` to reset the cache between turns.\n- **Speculative decoding is not supported.** Speculative decoding writes multiple draft tokens to the KV cache during the decode phase. Because the hook triggers on any batch of \u003e1 new tokens, it will incorrectly fire on draft verification steps, compressing decode-phase tokens.\n- **KV cache offloading is not supported.** `OffloadedCache` (used by HuggingFace's `accelerate` `max_memory` offloading) does not inherit from `DynamicLayer`, so the NexusQuant hooks do not intercept it. Compression silently does nothing when offloading is active.\n- **Encoder-decoder models (T5, BART, Whisper) are not supported.** These models use cross-attention whose KV cache stores encoder representations rather than decoder tokens. RoPE removal in the pipeline assumes decoder self-attention with split-half rotary embeddings, which does not apply to T5-style relative position biases. Applying NexusQuant to encoder-decoder models will produce incorrect results.\n- **Vision-language models (LLaVA, Qwen-VL, LLaVA-Next) are untested.** Model config detection handles nested `text_config`, but image tokens are scored for importance and evicted by the same heuristic as text tokens. High-information image tokens may be evicted. Results on VLMs have not been measured.\n- **GGUF models are not supported.** GGUF format is typically run via llama.cpp or ctransformers, which do not use HuggingFace `DynamicCache`. The integration hooks have no effect. Only GPTQ/AWQ models loaded through `AutoModelForCausalLM` with HuggingFace are compatible.\n- **rope_scaling (extended context) is not accounted for.** Models using linear or NTK rope scaling (e.g., Llama-3.1 at \u003e8K context) read `rope_theta` but ignore `rope_scaling` config. At contexts beyond the original training length, the RoPE removal introduces a small frequency mismatch. Impact is unmeasured.\n\n## Citation\n\n```bibtex\n@software{nexusquant2026,\n  author  = {Marques, Jo\\~{a}o Andr\\'{e} Gomes},\n  title   = {{NexusQuant}: Training-Free {KV} Cache Compression via {E8} Lattice Quantization and Attention-Aware Token Eviction},\n  year    = {2026},\n  url     = {https://github.com/jagmarques/nexusquant},\n  license = {Apache-2.0},\n}\n```\n\n## License\n\nApache 2.0. See [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjagmarques%2Fnexusquant","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjagmarques%2Fnexusquant","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjagmarques%2Fnexusquant/lists"}