{"id":21427878,"url":"https://github.com/nvidia/kvpress","last_synced_at":"2026-04-09T11:25:11.655Z","repository":{"id":263924873,"uuid":"884452470","full_name":"NVIDIA/kvpress","owner":"NVIDIA","description":"LLM KV cache compression made easy","archived":false,"fork":false,"pushed_at":"2025-03-19T15:59:43.000Z","size":5798,"stargazers_count":442,"open_issues_count":2,"forks_count":31,"subscribers_count":13,"default_branch":"main","last_synced_at":"2025-03-25T11:01:34.678Z","etag":null,"topics":["inference","kv-cache","kv-cache-compression","large-language-models","llm","long-context","python","pytorch","transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-06T19:23:20.000Z","updated_at":"2025-03-23T09:17:08.000Z","dependencies_parsed_at":"2025-01-17T16:10:01.649Z","dependency_job_id":"f4eed990-f44c-4afc-a648-deeb111edcbc","html_url":"https://github.com/NVIDIA/kvpress","commit_stats":null,"previous_names":["nvidia/kvpress"],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fkvpress","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fkvpress/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fkvpress/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fkvpress/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/kvpress/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246635959,"owners_count":20809332,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inference","kv-cache","kv-cache-compression","large-language-models","llm","long-context","python","pytorch","transformers"],"created_at":"2024-11-22T22:07:51.269Z","updated_at":"2026-04-09T11:25:11.635Z","avatar_url":"https://github.com/NVIDIA.png","language":"Python","readme":"[![PyPI version](https://badge.fury.io/py/kvpress.svg)](https://badge.fury.io/py/kvpress)\n[![License](https://img.shields.io/badge/License-Apache%202.0-green.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Colab example notebook](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP?usp=drive_link)\n[![Hugging Face Space](https://img.shields.io/badge/🤗%20Hugging%20Face-Space-blue)](https://huggingface.co/spaces/nvidia/kvpress)\n[![Blog post](https://img.shields.io/badge/🤗%20Hugging%20Face-Blog-blue)](https://huggingface.co/blog/nvidia/kvpress)\n[![Hugging Face Leaderboard](https://img.shields.io/badge/🤗%20HuggingFace-Leaderboard-orange)](https://huggingface.co/spaces/nvidia/kvpress-leaderboard)\n[![arXiv](https://img.shields.io/badge/arXiv-2510.00636-b31b1b.svg)](https://arxiv.org/abs/2510.00636v1)\n\n\n![kvpress](kvpress.jpg)\n\n\nDeploying long-context LLMs is costly due to the linear growth of 
the key-value (KV) cache in transformer models. For example, handling 1M tokens with Llama 3.1-70B in float16 requires up to 330GB of memory. kvpress implements multiple KV cache compression methods and benchmarks using 🤗 transformers, aiming to simplify the development of new methods for researchers and developers in this field.\n\n## Installation\n\n```bash\npip install kvpress\n```\n\nFor a local installation, use [uv](https://docs.astral.sh/uv/):\n\n```bash\ngit clone https://github.com/NVIDIA/kvpress.git\ncd kvpress\nuv sync\n```\n\nTo install with all optional dependencies, run:\n\n```bash\ngit clone https://github.com/NVIDIA/kvpress.git\ncd kvpress\nuv sync --extra eval --extra flash-attn\n```\n\n## Usage\n\nKVPress provides a set of \"presses\" that compress the KV cache during the prefilling phase. Each press is associated with a `compression_ratio` attribute that measures the compression of the cache. The easiest way to use a press is through our custom `KVPressTextGenerationPipeline`. It is automatically registered as a transformers pipeline with the name \"kv-press-text-generation\" when kvpress is imported and handles chat templates and tokenization for you:\n\n```python\nfrom transformers import pipeline\nfrom kvpress import ExpectedAttentionPress\n\nmodel = \"Qwen/Qwen3-8B\"\npipe = pipeline(\"kv-press-text-generation\", model=model, device_map=\"auto\", dtype=\"auto\")\n\ncontext = \"A very long text you want to compress once and for all\"\nquestion = \"\\nA question about the compressed context\"  # optional\n\npress = ExpectedAttentionPress(compression_ratio=0.5)\nanswer = pipe(context, question=question, press=press)[\"answer\"]\n```\n\nIn the snippet above, the compression is only applied to the context tokens so that you can evaluate the compression for different questions. 
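As a quick sanity check of the ~330GB figure above, here is a self-contained back-of-the-envelope sketch (the Llama 3.1-70B configuration values below are assumptions stated in the comments, not read from kvpress):

```python
# KV cache size per token = 2 (keys + values) * layers * KV heads * head dim * bytes per value.
# Assumed Llama 3.1-70B config: 80 layers, 8 KV heads (GQA), head dim 128, float16 (2 bytes).
def kv_cache_bytes(n_tokens, n_layers=80, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

gb = kv_cache_bytes(1_000_000) / 1e9
print(f"{gb:.2f} GB")  # 327.68 GB, consistent with the "up to 330GB" figure
```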
Check the [Wikipedia notebook demo](notebooks/wikipedia_demo.ipynb) for a more detailed example (also available on Colab [here](https://colab.research.google.com/drive/1JNvaTKuuAHrl49dYB9-mdEH_y52Ib-NP)).\n\n\u003cdetails\u003e\u003csummary\u003e\nDecoding Compression\n\u003c/summary\u003e\nBy default, KVPress applies compression during the prefilling phase. As a new (experimental) feature, we now support decoding compression via the `DecodingPress` wrapper. `DecodingPress` compresses the KV cache periodically during token generation, optionally maintaining a buffer of recent hidden states. `DecodingPress` supports the following parameters:\n\n- `base_press`: any `ScorerPress` (e.g., `KnormPress`, `CriticalKVPress`)\n- `compression_interval`: steps between compressions (default: 10)\n- `target_size`: target size of the cache after compression (default: 1024)\n- `hidden_states_buffer_size`: number of hidden states to buffer before compression (default: 128). Some presses don't need buffered hidden states and can set this to 0.\n\nInstead of a compression ratio, `DecodingPress` uses a `target_size` to compress the cache. 
This means that the cache is compressed every `compression_interval` steps, and the compression ratio is automatically computed such that the size of the cache after compression equals `target_size`.\n\nAn example of decoding compression:\n\n```python\nfrom transformers import pipeline\nfrom kvpress import DecodingPress, KnormPress\n\n# Initialize the pipeline\ndevice = \"cuda:0\"\nmodel = \"meta-llama/Llama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# Create a decoding press that compresses the cache down to 512 tokens every 10 steps\ndecoding_press = DecodingPress(\n    base_press=KnormPress(),\n    compression_interval=10,\n    target_size=512\n)\n\n# Use with pipeline\ncontext = \"A very long text you want to compress during generation\"\nquestion = \"Tell me a long story about this context\"\nresponse = pipe(context, question=question, press=decoding_press)[\"answer\"]\n```\n\n\u003e Not all existing presses are fully compatible with `DecodingPress` due to fundamental differences in how compression works during decoding versus prefilling. In particular, only `ScorerPress` subclasses are supported as base presses.\n\n\u003c/details\u003e\n\n## Available presses\n\nAll current presses are training-free and inherit from `BasePress` ([source](kvpress/presses/base_press.py)). 
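Most of the presses listed below score each KV pair and evict the lowest-scoring ones. As a framework-free illustration of that score-and-prune step, here is a minimal sketch mimicking the key-norm scoring of `KnormPress` on dummy tensors (not actual kvpress code):

```python
import numpy as np

def prune_kv(keys, values, compression_ratio):
    """Keep the top-scoring (1 - compression_ratio) fraction of KV pairs per head."""
    n_kept = int(keys.shape[1] * (1 - compression_ratio))
    scores = -np.linalg.norm(keys, axis=-1)          # (heads, seq): low key norm -> high score
    idx = np.argsort(scores, axis=-1)[:, -n_kept:]   # positions with the highest scores
    idx = np.sort(idx, axis=-1)                      # keep surviving tokens in original order
    keys = np.take_along_axis(keys, idx[..., None], axis=1)
    values = np.take_along_axis(values, idx[..., None], axis=1)
    return keys, values

k = np.random.randn(8, 100, 128)   # (kv_heads, seq_len, head_dim)
v = np.random.randn(8, 100, 128)
k2, v2 = prune_kv(k, v, compression_ratio=0.5)
print(k2.shape)  # (8, 50, 128)
```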
\n\nSeveral presses inherit from `ScorerPress` ([source](kvpress/presses/scorer_press.py)) and rely on a score to prune the KV pairs with lowest importance:\n\n- `RandomPress` ([source](kvpress/presses/random_press.py)): random score\n- `KnormPress` ([source](kvpress/presses/knorm_press.py), [paper](https://arxiv.org/abs/2406.11430)): inverse norm of the key\n- `SnapKVPress` ([source](kvpress/presses/snapkv_press.py), [paper](https://arxiv.org/abs/2404.14469)): average attention weight of the last queries\n- `ExpectedAttentionPress` ([source](kvpress/presses/expected_attention_press.py), [notebook](notebooks/expected_attention.ipynb)): expected attention weight during the generation phase \n- `StreamingLLMPress` ([source](kvpress/presses/streaming_llm_press.py), [paper](https://arxiv.org/abs/2309.17453)): keep only the initial and recent tokens \n- `TOVAPress` ([source](kvpress/presses/tova_press.py), [paper](https://arxiv.org/abs/2401.06104)): attention weight of the last query averaged across heads \n- `ObservedAttentionPress` ([source](kvpress/presses/observed_attention_press.py), [paper](https://arxiv.org/abs/2306.14048)): average attention weight observed during in prefilling phase\n- `QFilterPress` ([source](kvpress/presses/qfilter_press.py), [paper](https://arxiv.org/abs/2503.02812)): project the Key representations on the main SVD component of the Query vectors to approximate the attention scores.\n- `PyramidKVPress` ([source](kvpress/presses/pyramidkv_press.py), [paper](https://arxiv.org/abs/2406.02069)): maintain pyramid-like cache sizes, allocating more cache budget to lower layers and less to higher layers\n- `LagKVPress` ([source](kvpress/presses/lagkv_press.py), [paper](https://arxiv.org/abs/2504.04704)): leverage on the KV lag-relative information to compress. 
It's query-free, attention-weight-free, and flash-attention compatible.\n- `KeyDiffPress` ([source](kvpress/presses/keydiff_press.py), [paper](https://arxiv.org/abs/2504.15364)): evict tokens based solely on key similarity.\n- `NonCausalAttnPress` ([source](kvpress/presses/non_causal_attention_press.py), [paper](https://arxiv.org/abs/2507.08143)): evict tokens based on non-causal chunked attention scores.\n- `LeverageScorePress` ([source](kvpress/presses/leverage_press.py), [paper](https://arxiv.org/abs/2507.08143)): evict tokens based on approximate statistical leverage (i.e., we preserve outliers in the key space).\n- `CompactorPress` ([source](kvpress/presses/compactor_press.py), [paper](https://arxiv.org/abs/2507.08143)): blend `NonCausalAttnPress` and `LeverageScorePress` based on the `compression_ratio`.\n- `CURPress` ([source](kvpress/presses/cur_press.py), [paper](https://arxiv.org/abs/2509.15038)): prune keys and values based on the CUR decomposition using approximate leverage scores.\n- `KVzapPress` ([source](kvpress/presses/kvzap/kvzap_press.py), [paper](https://arxiv.org/abs/2601.07891), [training](kvzap)): approximate KVzip+ using a fast surrogate model. 
To be used in conjunction with the `DMSPress`.\n- `FastKVzipPress` ([source](kvpress/presses/fastkvzip_press.py), [paper](https://arxiv.org/abs/2601.17668)): approximate KVzip through a lightweight gating mechanism.\n\nSome presses rely on different logic:\n- `ThinKPress` ([source](kvpress/presses/think_press.py), [paper](https://arxiv.org/abs/2407.21018)): compress the dimensions of the keys based on the channel attention score of the last queries\n- `SimLayerKVPress` ([source](kvpress/presses/simlayerkv_press.py), [paper](https://arxiv.org/abs/2410.13846)): identify \"lazy\" layers and apply the StreamingLLM approach to them\n- `DuoAttentionPress` ([source](kvpress/presses/duo_attention_press.py), [paper](https://arxiv.org/abs/2410.10819)): split heads into retrieval heads (no compression) and streaming heads (StreamingLLM approach)\n- `FinchPress` ([source](kvpress/presses/finch_press.py), [paper](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00716/125280)): similar to SnapKV with a dynamic window size and key-value re-rotation\n- `KVzipPress` ([source](kvpress/presses/kvzip_press.py), [paper](https://arxiv.org/abs/2505.23416)): identify redundant KV pairs through context reconstruction. 
It achieves near-lossless compression at the cost of multiple forward passes.\n- `KVComposePress` ([source](kvpress/presses/kvcompose_press.py), [paper](https://arxiv.org/abs/2509.05165)): attention-guided eviction, aligning per-head selections into composite tokens to preserve cache structure.\n\n\u003e [!NOTE]  \n\u003e `KVComposePress` performs an extra pass over the full context, temporarily creating a KV cache of ~2x the context length, which adds memory overhead during prefill.\n\nFinally, we provide wrapper presses that can be combined with other presses:\n- `AdaKVPress` ([source](kvpress/presses/adakv_press.py), [paper](https://arxiv.org/abs/2407.11550)): prune the bottom scores of any `ScorerPress` across all heads, achieving head-wise compression\n- `PerLayerCompressionPress` ([source](kvpress/presses/per_layer_compression_press.py)): compress each layer with a different compression ratio (experimental)\n- `ComposedPress` ([source](kvpress/presses/composed_press.py)): compose multiple presses together by chaining their forward hooks\n- `KeyRerotationPress` ([source](kvpress/presses/key_rerotation_press.py)): rerotate the keys remaining after pruning so that they have continuous RoPE embeddings\n- `ChunkKVPress` ([source](kvpress/presses/chunkkv_press.py), [paper](https://arxiv.org/abs/2502.00299)): compress by selecting important chunks, preserving semantic coherence\n- `ChunkPress` ([source](kvpress/presses/chunk_press.py), [paper](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00716/125280)): compress the KV cache on each sequence chunk separately. 
This can yield more uniform compression across long sequences.\n- `CriticalKVPress` and `CriticalAdaKVPress` ([source](kvpress/presses/criticalkv_press.py), [paper](https://arxiv.org/abs/2502.03805)): refine the scores using the L1 norm of Wo @ values, coupled with a two-stage selection.\n- `BlockPress` ([source](kvpress/presses/block_press.py), [paper](https://arxiv.org/abs/2504.15364)): segment the input sequence into non-overlapping blocks and compress iteratively (⚠️ not a true chunked-prefill implementation)\n- `DecodingPress` ([source](kvpress/presses/decoding_press.py)): allows compression during decoding; see the decoding section in this README.\n- `PrefillDecodingPress` ([source](kvpress/presses/prefill_decoding_press.py)): allows compression during both prefilling and decoding.\n- `DMSPress` ([source](kvpress/presses/dms_press.py), [paper](https://arxiv.org/abs/2506.05345)): evict keys and values whose `ScorerPress` scores fall below a given threshold instead of relying on top-k scores. Supports both prefilling and decoding (if `decoding=True`), but only dense prefill, not sparse prefill.\n\nFor a detailed list of existing KV cache compression methods, check [Awesome-KV-Cache-Compression](https://github.com/October2001/Awesome-KV-Cache-Compression) or [Awesome-LLM-Compression](https://github.com/HuangOwen/Awesome-LLM-Compression?tab=readme-ov-file#kv-cache-compression).\n\n\n## Evaluation\nWe provide a simple CLI to evaluate the performance of different presses on several long-context datasets.\n\n- Accuracy: test your method on popular benchmarks directly using our CLI.\n- Speed and Memory: the [speed_and_memory](notebooks/speed_and_memory.ipynb) notebook can help you measure peak memory usage and total time gain.\n\nPlease refer to the [evaluation](evaluation/README.md) directory in this repo for more details and results. 
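As a first-order mental model to complement these measurements (an approximation that ignores model weights, activations, and runtime overheads), the memory freed is roughly the compression ratio times the KV cache size:

```python
# First-order estimate of the memory freed by KV cache compression
# (approximation: ignores weights, activations, and allocator overheads).
def estimated_savings_gb(kv_cache_gb, compression_ratio):
    return kv_cache_gb * compression_ratio

# e.g., a 100GB cache compressed with compression_ratio=0.5 frees roughly 50GB
print(estimated_savings_gb(100.0, 0.5))  # 50.0
```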
\n\nBelow we report the average performance on the RULER dataset with 4k context length for different presses, from our [![Hugging Face Leaderboard](https://img.shields.io/badge/🤗%20HuggingFace-Leaderboard-orange)](https://huggingface.co/spaces/nvidia/kvpress-leaderboard).\n\n## Quantization\n\nWe support KV cache quantization through the transformers `QuantizedCache` class (see [HF blog post](https://huggingface.co/blog/kv-cache-quantization#how-to-use-quantized-kv-cache-in-%F0%9F%A4%97-transformers)). To use it, simply pass a cache object to your pipeline:\n\n```python\nfrom transformers import QuantizedCache\n\ncache = QuantizedCache(backend=\"quanto\", nbits=4)\n\npipe(..., cache=cache)\n```\n\nBy default, the `DynamicCache` is used (no quantization).\n\n\u003e [!IMPORTANT]  \n\u003e To use the `QuantizedCache`, you need to install additional dependencies (_e.g._ `pip install optimum-quanto`).\n\n## Contributing\n\nWe welcome contributions! To add a new press, simply open an issue or submit a pull request. Check the [new_press.ipynb](notebooks/new_press.ipynb) notebook for a step-by-step guide.\n\n## Citation\n\nIf you use KVPress in your research, please cite our paper:\n\n```bibtex\n@article{devoto2025expectedattention,\n  title={Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution},\n  author={Devoto, Alessio and Jeblick, Maximilian and J{\\'e}gou, Simon},\n  journal={arXiv preprint arXiv:2510.00636},\n  year={2025},\n  url={https://arxiv.org/abs/2510.00636}\n}\n```\n\n## FAQ\n\n\u003cdetails\u003e\u003csummary\u003e \n\n### Which models are supported?\n\u003c/summary\u003e\n\nSome presses depend on the model architecture (_e.g._ `ExpectedAttentionPress` or `SnapKVPress`), and hence might not work with all models. 
We tested support for `LlamaForCausalLM`, `MistralForCausalLM`, `Phi3ForCausalLM`, `Qwen2ForCausalLM`, `Qwen3ForCausalLM`, and `Gemma3ForCausalLM`, but many other models might be supported out of the box because their implementations in transformers are often similar.\n\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003e \n\n### How to run inference on multiple GPUs?\n\u003c/summary\u003e\n\nkvpress supports multi-GPU inference through [accelerate](https://huggingface.co/docs/accelerate/en/index):\n\n```python\npipe = pipeline(\"kv-press-text-generation\", model=model, device_map=\"auto\")\n```\n\n\u003c/details\u003e\n\n\n\u003cdetails\u003e \u003csummary\u003e \n\n### What are the memory and throughput gains?\n\u003c/summary\u003e\n\nMemory usage should be reduced by around `compression_ratio * kv_cache_size`. As the KV cache is smaller, decoding should also be faster. You can measure peak memory usage gain and total time gain using [this notebook](notebooks/speed_and_memory.ipynb).\n\u003c/details\u003e\n\n\n\u003cdetails\u003e \u003csummary\u003e \n\n### How does a press work? \u003c/summary\u003e\n\nA press registers a forward hook (`press.forward_hook` method) to each attention layer during the prefilling phase. 
The hook is registered by using the press as a context manager (`press.__call__` method):\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM\nfrom kvpress import KnormPress\n\ndevice = \"cuda:0\"\nckpt = \"meta-llama/Meta-Llama-3.1-8B-Instruct\"\nmodel = AutoModelForCausalLM.from_pretrained(ckpt).to(device)\npress = KnormPress(compression_ratio=0.4)\n\ninputs = model.dummy_inputs[\"input_ids\"].to(device)\n\nwith torch.no_grad():\n    print(model(inputs).past_key_values[0][0].shape)\n    # torch.Size([3, 8, 5, 128])\n    \nwith torch.no_grad(), press(model):\n    print(model(inputs).past_key_values[0][0].shape)\n    # torch.Size([3, 8, 3, 128])\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\u003csummary\u003e \n\n### Why not use `model.generate`?\n\u003c/summary\u003e\n\nYou can in fact use `model.generate` with a press by using the press as a context manager:\n\n```python\nwith press(model):\n    outputs = model.generate(inputs)\n```\n\nHowever, the `generate` method does not allow excluding the question from the compression, which would artificially favor methods such as SnapKV. Ideally, we want a compression method that works whatever comes after the context (_e.g._ for use cases such as chat or document question answering). Finally, the `generate` method does not support generating answers for multiple questions at once.\n\n\u003c/details\u003e\n\n\n\n\u003cdetails\u003e\u003csummary\u003e \n\n### Can I combine compression during prefilling and decoding?
\n\u003c/summary\u003e\n\nYes. `PrefillDecodingPress` combines separate presses for the prefilling and decoding phases.\n\n**Parameters:**\n- `prefilling_press`: press used during the prefilling phase\n- `decoding_press`: press used during the decoding phase\n\n## Usage Examples\n\n### Basic Decoding Compression\n\n```python\nfrom transformers import pipeline\nfrom kvpress import DecodingPress, KnormPress\n\n# Initialize the pipeline\ndevice = \"cuda:0\"\nmodel = \"meta-llama/Llama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# Create a decoding press that compresses the cache down to 512 tokens every 10 steps\ndecoding_press = DecodingPress(\n    base_press=KnormPress(),\n    compression_interval=10,\n    target_size=512\n)\n\n# Use with pipeline\ncontext = \"A very long text you want to compress during generation\"\nquestion = \"Tell me a long story about this context\"\nresponse = pipe(context, question=question, press=decoding_press)[\"answer\"]\n```\n\n### Combined Prefill + Decoding Compression\n\n```python\nfrom transformers import pipeline\nfrom kvpress import CriticalKVPress, KnormPress\nfrom kvpress import DecodingPress, PrefillDecodingPress\n\n# Initialize the pipeline\ndevice = \"cuda:0\"\nmodel = \"meta-llama/Llama-3.1-8B-Instruct\"\nmodel_kwargs = {\"attn_implementation\": \"flash_attention_2\"}\npipe = pipeline(\"kv-press-text-generation\", model=model, device=device, model_kwargs=model_kwargs)\n\n# Different strategies for prefill vs decoding\nprefill_press = CriticalKVPress(KnormPress())\ndecoding_press = DecodingPress(\n    base_press=KnormPress(compression_ratio=0.2),\n    compression_interval=5,\n    target_size=256\n)\n\n# Combine them\ncombined_press = PrefillDecodingPress(\n    prefilling_press=prefill_press,\n    decoding_press=decoding_press\n)\n\ncontext = \"A very long context that will be compressed during prefill\"\nquestion 
= \"Generate a detailed analysis that will be compressed during decoding\"\nresponse = pipe(context, question=question, press=combined_press)[\"answer\"]\n```\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fkvpress","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnvidia%2Fkvpress","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnvidia%2Fkvpress/lists"}