# Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
https://github.com/October2001/Awesome-KV-Cache-Compression
## 🔍 Method
### 1️⃣ Pruning / Evicting / Sparse
- **Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs.** [code](https://github.com/princeton-pli/PruLong)
- **Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores.**
- **LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models.** [code](https://github.com/GATECH-EIC/LaCache)
- **MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference.** [code](https://github.com/AIoT-MLSys-Lab/MEDA)
- **KVCrush: Key value cache size-reduction using similarity in head-behaviour.**
- **RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval.**
- **Sirius: Contextual Sparsity with Correction for Efficient LLMs.** [code](https://github.com/infini-ai-lab/sirius)
- **Training-Free Activation Sparsity in Large Language Models.**
- **CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs.**
- **LoCoCo: Dropping In Convolutions for Long Context Compression.** [code](https://github.com/VITA-Group/LoCoCo)
- **SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.** [code](https://github.com/sail-sg/SimLayerKV)
- **In-context KV-Cache Eviction for LLMs via Attention-Gate.**
- **[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.** [code](https://github.com/Theia-4869/FasterVLM)
- **Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models.** [code](https://github.com/ywh187/FitPrune)
- **Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers.**
- **MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention.** NeurIPS 2024. [code](https://github.com/microsoft/MInference)
- **SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator.**
- **HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing.** Arxiv 2024.
- **Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction.** Arxiv 2024.
- **CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving.**
- **MagicPIG: LSH Sampling for Efficient LLM Generation.** [code](https://github.com/Infini-AI-Lab/MagicPIG)
- **ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference.**
- **FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation.** Arxiv 2025. [code](https://github.com/dongwonjo/FastKV)
- **Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference.**
- **ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching.**
- **Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference.** [code](https://github.com/d-matrix-ai/keyformer-llm)
- **Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference.**
- **SnapKV: LLM Knows What You are Looking for Before Generation.**
- **H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.**
- **Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs.**
- **PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference.**
- **Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time.**
- **A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression.**
- **Retrieval Head Mechanistically Explains Long-Context Factuality.**
- **Efficient Sparse Attention needs Adaptive Token Release.**
- **Loki: Low-Rank Keys for Efficient Sparse Attention.**
- **Efficient Streaming Language Models with Attention Sinks.** [code](https://github.com/mit-han-lab/streaming-llm)
- **RazorAttention: Efficient KV Cache Compression Through Retrieval Heads.**
- **CORM: Cache Optimization with Recent Message for Large Language Model Inference.**
- **A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder.** [code](https://github.com/Dirac-Notation/A2SF)
- **Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference.** [code](https://github.com/mit-han-lab/Quest)
- **LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference.**
- **Post-Training Sparse Attention with Double Sparsity.** [code](https://github.com/andy-yang-1/DoubleSparse)
- **Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope.**
- **Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference.**
- **NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time.**
- **SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching.**
- **KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head.** [code](https://github.com/IsaacRe/vllm-kvcompress)
- **InfiniPot: Infinite Context Processing on Memory-Constrained LLMs.**
- **Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads.**
- **Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU.** [code](https://github.com/infly-ai/INF-MLLM)
- **KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models.**
- **TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention.**
- **ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.** Arxiv 2024. [code](https://github.com/bytedance/ShadowKV)
- **LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference.**
- **Unifying KV Cache Compression for Large Language Models with LeanKV.**
- **DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs.**
- **SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation.** [code](https://github.com/Linking-ai/SCOPE)
- **ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression.**
- **LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention.** [code](https://github.com/mit-han-lab/omniserve)
- **EvolKV: Evolutionary KV Cache Compression for LLM Inference.**
- **LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation.** Arxiv 2025. [code](https://github.com/MGDDestiny/Lava)
- **BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference.** [code](https://github.com/JunqiZhao888/buzz-llm)
- **Recycled Attention: Efficient inference for long-context language models.**
- **VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration.**
- **Squeezed Attention: Accelerating Long Context Length LLM Inference.**
- **ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction.** [code](https://github.com/pku-liang/ArkVale)
- **SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs.** Arxiv 2025.
- **KV-Distill: Nearly Lossless Learnable Context Compression for LLMs.** [code](https://github.com/vnchari/kv-distill)
- **MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache.**
- **Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification.**
- **DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance.**
- **RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression.** Arxiv 2025.
- **Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs.**
- **The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs.**
- **Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference.**
- **CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences.**
- **R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration.** Arxiv 2025. [code](https://github.com/Zefan-Cai/R-KV)
- **KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction.** Arxiv 2025. [code](https://github.com/snu-mllab/KVzip)
- **Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs.**
- **FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension.**
- **PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling.** [code](https://github.com/Zefan-Cai/PyramidKV)
- **Transformers are Multi-State RNNs.** [code](https://github.com/schwartz-lab-NLP/TOVA)
- **SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference.** Arxiv 2024. [code](https://github.com/Gumpest/SparseVLMs)
- **DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.** [code](https://github.com/mit-han-lab/duo-attention)
- **CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling.** EMNLP 2024. [code](https://github.com/ybai-nlp/CItruS)
- **TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection.**
- **Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning.**
- **AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models.**
- **Inference-Time Hyper-Scaling with KV Cache Compression.**
- **InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding.**
- **On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference.**
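Most entries above share a common skeleton: score each cached token, then keep a fixed budget made up of attention sinks, a recent window, and the highest-scoring remaining tokens. The snippet below is a rough PyTorch sketch of that general recipe (in the spirit of H2O / SnapKV-style heavy-hitter eviction), not the implementation of any listed paper; the `evict_kv` helper, the shapes, and the budgets are all hypothetical.

```python
# Minimal sketch of score-based KV eviction (heavy-hitter style idea).
# Shapes and budgets are illustrative, not taken from any specific paper.
import torch

def evict_kv(keys, values, attn_weights, budget=256, n_sink=4, n_recent=64):
    """keys, values: [seq_len, head_dim]; attn_weights: [num_queries, seq_len]."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Accumulated attention mass each cached token has received so far.
    scores = attn_weights.sum(dim=0)  # [seq_len]

    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[:n_sink] = True          # attention sinks (first tokens)
    keep[-n_recent:] = True       # recent window

    # Fill the remaining budget with the highest-scoring middle tokens.
    remaining = budget - int(keep.sum())
    if remaining > 0:
        scores = scores.masked_fill(keep, float("-inf"))
        topk = torch.topk(scores, k=remaining).indices
        keep[topk] = True

    return keys[keep], values[keep]

# Toy usage
k, v = torch.randn(1024, 128), torch.randn(1024, 128)
attn = torch.softmax(torch.randn(32, 1024), dim=-1)
k2, v2 = evict_kv(k, v, attn)
print(k2.shape)  # torch.Size([256, 128])
```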
### 4️⃣ Low-Rank
- **Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache.**
- **Fast Transformer Decoding: One Write-Head is All You Need.**
- **GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.** EMNLP 2023.
- **LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy.**
- **ThinK: Thinner Key Cache by Query-Driven Pruning.**
- **DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.** *DeepSeek-AI.* Arxiv 2024. [code](https://github.com/deepseek-ai/DeepSeek-V2)
- **Effectively Compress KV Heads for LLM.**
- **Palu: Compressing KV-Cache with Low-Rank Projection.** Arxiv 2024. [code](https://github.com/shadowpa0327/Palu)
- **Tensor Product Attention Is All You Need.** Arxiv 2025. [code](https://github.com/tensorgi/T6)
- **OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule.** Arxiv 2025.
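The common thread in this category is storing the cached keys and/or values in a factorized form. Below is a generic illustration of the idea, assuming a simple truncated-SVD factorization; it is a sketch of the principle, not Palu's or any other listed method's actual pipeline, and the helper names are hypothetical. Real systems typically calibrate or learn the projections offline and fuse them into the attention computation instead of running an SVD per request.

```python
# Minimal sketch of low-rank KV compression: store a rank-r factorization of the
# cached keys instead of the full matrix (illustrative only).
import torch

def compress_low_rank(kv, rank=32):
    """kv: [seq_len, head_dim] -> (A: [seq_len, rank], B: [rank, head_dim])."""
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # scale the kept left singular vectors
    B = Vh[:rank, :]
    return A, B

def decompress(A, B):
    return A @ B                 # approximate reconstruction

k = torch.randn(2048, 128)
A, B = compress_low_rank(k, rank=32)
k_hat = decompress(A, B)
print(A.numel() + B.numel(), "floats instead of", k.numel())
print("relative error:", (torch.norm(k - k_hat) / torch.norm(k)).item())
```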
### 6️⃣ Prompt Compression
- **TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning.**
- **LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.** EMNLP 2023. [code](https://github.com/microsoft/LLMLingua)
- **LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression.** ACL 2024. [code](https://github.com/microsoft/LLMLingua)
- **Better Prompt Compression Without Multi-Layer Perceptrons.**
- **LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression.** ACL 2024. [code](https://github.com/microsoft/LLMLingua)
- **ICPC: In-context Prompt Compression with Faster Inference.**
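These methods shrink the prompt itself before it ever reaches the KV cache, typically by dropping tokens that a small language model finds predictable. The toy sketch below illustrates that LLMLingua-style recipe in spirit; `token_logprobs` stands in for a real small LM's per-token log-probabilities, and the helper and data are hypothetical.

```python
# Minimal sketch of token-level prompt compression by self-information.
import torch

def compress_prompt(tokens, token_logprobs, keep_ratio=0.5):
    """tokens: list[str]; token_logprobs: [len(tokens)] log p(token | prefix)."""
    self_info = -token_logprobs                 # surprisal per token
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = torch.topk(self_info, k=k).indices
    keep_idx, _ = torch.sort(keep_idx)          # preserve original token order
    return [tokens[i] for i in keep_idx.tolist()]

toy_tokens = ["the", "report", "was", "filed", "on", "March", "3"]
toy_logprobs = torch.tensor([-0.1, -2.3, -0.2, -1.8, -0.3, -3.0, -2.7])
print(compress_prompt(toy_tokens, toy_logprobs, keep_ratio=0.5))
# -> keeps the rarer, more informative tokens: ['report', 'March', '3']
```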
### 2️⃣ Merging
- **Token Merging: Your ViT But Faster.** ICLR 2023. [code](https://github.com/facebookresearch/ToMe)
- **CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion.**
- **D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models.** [code](https://github.com/AIoT-MLSys-Lab/d2o)
- **Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks.**
- **CaM: Cache Merging for Memory-efficient LLMs Inference.**
- **Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs.** ICLR 2024. [code](https://github.com/alinlab/HOMER)
- **Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention.**
- **Compressed Context Memory for Online Language Model Interaction.** ICLR 2024. [code](https://github.com/snu-mllab/Context-Memory)
- **LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference.** [code](https://github.com/SUSTechBruce/LOOK-M)
- **AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning.** [code](https://github.com/LaVi-Lab/AIM)
- **KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference.**
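Rather than discarding evicted entries outright, merging methods fold them into the retained cache. The sketch below shows one generic variant: each evicted key/value pair is averaged into its most similar retained slot with a fixed weight. The listed papers differ in how they pick pairs and weights, so treat `merge_evicted` and `alpha` as illustrative assumptions rather than any paper's algorithm.

```python
# Minimal sketch of KV merging: fold evicted entries into similar retained ones.
import torch
import torch.nn.functional as F

def merge_evicted(keys, values, keep_mask, alpha=0.9):
    """keys, values: [seq_len, head_dim]; keep_mask: [seq_len] bool."""
    kept_k, kept_v = keys[keep_mask].clone(), values[keep_mask].clone()
    evict_k, evict_v = keys[~keep_mask], values[~keep_mask]
    if evict_k.shape[0] == 0:
        return kept_k, kept_v

    # Cosine similarity between each evicted key and every retained key.
    sim = F.cosine_similarity(evict_k.unsqueeze(1), kept_k.unsqueeze(0), dim=-1)
    target = sim.argmax(dim=1)          # nearest retained slot per evicted token

    # Weighted merge: retained entry dominates, evicted entry contributes (1 - alpha).
    for i, t in enumerate(target.tolist()):
        kept_k[t] = alpha * kept_k[t] + (1 - alpha) * evict_k[i]
        kept_v[t] = alpha * kept_v[t] + (1 - alpha) * evict_v[i]
    return kept_k, kept_v

k, v = torch.randn(512, 128), torch.randn(512, 128)
mask = torch.zeros(512, dtype=torch.bool); mask[-128:] = True  # keep recent 128
print(merge_evicted(k, v, mask)[0].shape)  # torch.Size([128, 128])
```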
### 3️⃣ Cross-Layer
- **A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference.**
- **Layer-Condensed KV Cache for Efficient Inference of Large Language Models.**
- **MiniCache: KV Cache Compression in Depth Dimension for Large Language Models.**
- **Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.**
- **You Only Cache Once: Decoder-Decoder Architectures for Language Models.**
- **KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing.**
- **Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity.**
- **SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation.**
- **MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding.** [code](https://github.com/zaydzuhri/pythia-mlkv)
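Cross-layer methods cut memory by letting several layers share one KV cache. Below is a minimal sketch of the bookkeeping, assuming adjacent layers are grouped and only the first layer of each group writes its KV; this is a generic illustration of the sharing idea, and the `SharedKVCache` class and group size are hypothetical.

```python
# Minimal sketch of cross-layer KV sharing: layers in a group reuse one cache.
from typing import Dict, Tuple
import torch

class SharedKVCache:
    def __init__(self, num_layers: int, group_size: int = 2):
        self.num_layers = num_layers
        self.group_size = group_size
        self.store: Dict[int, Tuple[torch.Tensor, torch.Tensor]] = {}

    def owner(self, layer: int) -> int:
        return (layer // self.group_size) * self.group_size

    def write(self, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
        if layer == self.owner(layer):        # only the group's first layer writes
            self.store[layer] = (k, v)

    def read(self, layer: int) -> Tuple[torch.Tensor, torch.Tensor]:
        return self.store[self.owner(layer)]  # later layers reuse the group's cache

cache = SharedKVCache(num_layers=4, group_size=2)
cache.write(0, torch.randn(16, 128), torch.randn(16, 128))
cache.write(1, torch.randn(16, 128), torch.randn(16, 128))  # ignored: shares layer 0
assert cache.read(1)[0].shape == (16, 128)
```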
### 5️⃣ Quantization
- **GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM.** [code](https://github.com/opengear-project/GEAR)
- **ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification.**
- **No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization.**
- **PQCache: Product Quantization-based KVCache for Long Context LLM Inference.**
- **Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression.** Arxiv 2024.
- **SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models.**
- **QAQ: Quality Adaptive Quantization for LLM KV Cache.** [code](https://github.com/ClubieDong/QAQ-KVCacheQuantization)
- **WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More.**
- **KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.**
- **KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference.** ICML 2025. [code](https://github.com/cmd2001/KVTuner)
- **TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization.**
- **TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering.**
- **KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.** [code](https://github.com/jy-yuan/KIVI)
- **Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models.** [code](https://github.com/KD-TAO/VidKV)
- **BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache.** [code](https://github.com/DD-DuDa/BitDecoding)
- **CommVQ: Commutative Vector Quantization for KV Cache Compression.**
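Quantization approaches keep every cached token but store keys and values at low bit width. The sketch below shows plain asymmetric uniform quantization along a chosen axis (for example per-channel for keys and per-token for values, as in KIVI-style schemes); the bit width, grouping, and helper names are illustrative assumptions, not any listed system's kernel.

```python
# Minimal sketch of low-bit KV quantization with asymmetric (zero-point) scaling.
import torch

def quantize(x: torch.Tensor, bits: int = 2, dim: int = 0):
    """Asymmetric uniform quantization of x along `dim`; returns codes + metadata."""
    qmax = 2 ** bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    codes = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, xmin

def dequantize(codes, scale, xmin):
    return codes.to(scale.dtype) * scale + xmin

keys = torch.randn(1024, 128)
codes, scale, zero = quantize(keys, bits=2, dim=0)   # per-channel for keys
k_hat = dequantize(codes, scale, zero)
print("max abs error:", (keys - k_hat).abs().max().item())
```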
### 7️⃣ Reuse
- **KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.** [code](https://github.com/UCSB-NLP-Chang/KVLink)
### 8️⃣ Non-Autoregressive
- **dKV-Cache: The Cache for Diffusion Language Models.** [code](https://github.com/horseee/dKV-Cache)
- **dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching.** [code](https://github.com/maomaocun/dLLM-cache)
- **Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding.** [code](https://github.com/NVlabs/Fast-dLLM)
## 📷 Survey
- **Prompt Compression for Large Language Models: A Survey.**
- **A Survey on Large Language Model Acceleration based on KV Cache Management.** [code](https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management)
- **Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption.** [code](https://github.com/zcli-charlie/Awesome-KV-Cache)
## 📊 Evaluation
- **KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches.** EMNLP 2024. [code](https://github.com/henryzhongsc/longctx_bench)
- **Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving.** [code](https://github.com/LLMkvsys/rethink-kv-compression)
- **SCBench: A KV Cache-Centric Analysis of Long-Context Methods.**
- **More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression.**
## ⚙️ Project
- **kvpress.**
- **KVCache-Factory.** [code](https://github.com/Zefan-Cai/KVCache-Factory)