# Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
https://github.com/October2001/Awesome-KV-Cache-Compression
## 🔍 Method
### 1️⃣ Pruning / Evicting / Sparse
- **Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs.** [code](https://github.com/princeton-pli/PruLong)
- **Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores.**
- **LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models.** [code](https://github.com/GATECH-EIC/LaCache)
- **MEDA: Dynamic KV Cache Allocation for Efficient Multimodal Long-Context Inference.** [code](https://github.com/AIoT-MLSys-Lab/MEDA)
- **KVCrush: Key value cache size-reduction using similarity in head-behaviour.**
- **RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval.**
- **Sirius: Contextual Sparsity with Correction for Efficient LLMs.** [code](https://github.com/infini-ai-lab/sirius)
- **Training-Free Activation Sparsity in Large Language Models.**
- **CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs.**
- **LoCoCo: Dropping In Convolutions for Long Context Compression.** [code](https://github.com/VITA-Group/LoCoCo)
- **SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.** [code](https://github.com/sail-sg/SimLayerKV)
- **In-context KV-Cache Eviction for LLMs via Attention-Gate.**
- **[CLS] Attention is All You Need for Training-Free Visual Token Pruning: Make VLM Inference Faster.** [code](https://github.com/Theia-4869/FasterVLM)
- **Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models.** [code](https://github.com/ywh187/FitPrune)
- **Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers.**
- **MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention.** NeurIPS 2024. [code](https://github.com/microsoft/MInference)
- **SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator.**
- **HashEvict: A Pre-Attention KV Cache Eviction Strategy using Locality-Sensitive Hashing.** Arxiv 2024.
- **Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction.** Arxiv 2024.
- **CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving.**
- **MagicPIG: LSH Sampling for Efficient LLM Generation.** [code](https://github.com/Infini-AI-Lab/MagicPIG)
- **ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference.**
- **FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation.** Arxiv 2025. [code](https://github.com/dongwonjo/FastKV)
- **Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference.**
- **ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching.**
- **Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference.** [code](https://github.com/d-matrix-ai/keyformer-llm)
- **Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference.**
- **SnapKV: LLM Knows What You are Looking for Before Generation.**
- **H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.**
- **Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs.**
- **PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference.**
- **Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time.**
- **A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression.**
- **Retrieval Head Mechanistically Explains Long-Context Factuality.**
- **Efficient Sparse Attention needs Adaptive Token Release.**
- **Loki: Low-Rank Keys for Efficient Sparse Attention.**
- **Efficient Streaming Language Models with Attention Sinks.** [code](https://github.com/mit-han-lab/streaming-llm)
- **RazorAttention: Efficient KV Cache Compression Through Retrieval Heads.**
- **CORM: Cache Optimization with Recent Message for Large Language Model Inference.**
- **A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder.** [code](https://github.com/Dirac-Notation/A2SF)
- **Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference.** [code](https://github.com/mit-han-lab/Quest)
- **LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference.**
- **Post-Training Sparse Attention with Double Sparsity.** [code](https://github.com/andy-yang-1/DoubleSparse)
- **Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope.**
- **Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference.**
- **NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time.**
- **SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching.**
- **KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head.** [code](https://github.com/IsaacRe/vllm-kvcompress)
- **InfiniPot: Infinite Context Processing on Memory-Constrained LLMs.**
- **Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads.**
- **Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU.** [code](https://github.com/infly-ai/INF-MLLM)
- **KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models.**
- **TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention.**
- **ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference.** Arxiv 2024. [code](https://github.com/bytedance/ShadowKV)
- **LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference.**
- **Unifying KV Cache Compression for Large Language Models with LeanKV.**
- **DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs.**
- **SCOPE: Optimizing Key-Value Cache Compression in Long-context Generation.** [code](https://github.com/Linking-ai/SCOPE)
- **ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression.**
- **LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention.** [code](https://github.com/mit-han-lab/omniserve)
- **EvolKV: Evolutionary KV Cache Compression for LLM Inference.**
- **LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation.** Arxiv 2025. [code](https://github.com/MGDDestiny/Lava)
- **BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference.** [code](https://github.com/JunqiZhao888/buzz-llm)
- **Recycled Attention: Efficient inference for long-context language models.**
- **VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration.**
- **Squeezed Attention: Accelerating Long Context Length LLM Inference.**
- **ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction.** [code](https://github.com/pku-liang/ArkVale)
- **SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs.** Arxiv 2025.
- **KV-Distill: Nearly Lossless Learnable Context Compression for LLMs.** [code](https://github.com/vnchari/kv-distill)
- **MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache.**
- **Dynamic-LLaVA: Efficient Multimodal Large Language Models via Dynamic Vision-language Context Sparsification.**
- **DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance.**
- **RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression.** Arxiv 2025.
- **Dialogue Without Limits: Constant-Sized KV Caches for Extended Responses in LLMs.**
- **The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs.**
- **Mustafar: Promoting Unstructured Sparsity for KV Cache Pruning in LLM Inference.**
- **CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences.**
- **R-KV: Redundancy-aware KV Cache Compression for Training-Free Reasoning Models Acceleration.** Arxiv 2025. [code](https://github.com/Zefan-Cai/R-KV)
- **KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction.** Arxiv 2025. [code](https://github.com/snu-mllab/KVzip)
- **Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs.**
- **FreqKV: Frequency Domain Key-Value Compression for Efficient Context Window Extension.**
- **PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling.** [code](https://github.com/Zefan-Cai/PyramidKV)
- **Transformers are Multi-State RNNs.** [code](https://github.com/schwartz-lab-NLP/TOVA)
- **SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference.** Arxiv 2024. [code](https://github.com/Gumpest/SparseVLMs)
- **DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads.** [code](https://github.com/mit-han-lab/duo-attention)
- **CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling.** EMNLP 2024. [code](https://github.com/ybai-nlp/CItruS)
- **TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection.**
- **Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning.**
- **AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models.**
- **Inference-Time Hyper-Scaling with KV Cache Compression.**
- **InfiniPot-V: Memory-Constrained KV Cache Compression for Streaming Video Understanding.**
- **On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference.**
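Most entries above share a common skeleton: score each cached token, then keep a fixed budget made up of attention sinks, a recent window, and the highest-scoring remaining tokens. The snippet below is a rough PyTorch sketch of that general recipe (in the spirit of H2O / SnapKV-style heavy-hitter eviction), not the implementation of any listed paper; the `evict_kv` helper, the shapes, and the budgets are all hypothetical.

```python
# Minimal sketch of score-based KV eviction (heavy-hitter style idea).
# Shapes and budgets are illustrative, not taken from any specific paper.
import torch

def evict_kv(keys, values, attn_weights, budget=256, n_sink=4, n_recent=64):
    """keys, values: [seq_len, head_dim]; attn_weights: [num_queries, seq_len]."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values

    # Accumulated attention mass each cached token has received so far.
    scores = attn_weights.sum(dim=0)  # [seq_len]

    keep = torch.zeros(seq_len, dtype=torch.bool)
    keep[:n_sink] = True          # attention sinks (first tokens)
    keep[-n_recent:] = True       # recent window

    # Fill the remaining budget with the highest-scoring middle tokens.
    remaining = budget - int(keep.sum())
    if remaining > 0:
        scores = scores.masked_fill(keep, float("-inf"))
        topk = torch.topk(scores, k=remaining).indices
        keep[topk] = True

    return keys[keep], values[keep]

# Toy usage
k, v = torch.randn(1024, 128), torch.randn(1024, 128)
attn = torch.softmax(torch.randn(32, 1024), dim=-1)
k2, v2 = evict_kv(k, v, attn)
print(k2.shape)  # torch.Size([256, 128])
```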
### 4️⃣ Low-Rank
- **Beyond Homogeneous Attention: Memory-Efficient LLMs via Fourier-Approximated KV Cache.**
- **Fast Transformer Decoding: One Write-Head is All You Need.**
- **GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.** EMNLP 2023.
- **LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy.**
- **ThinK: Thinner Key Cache by Query-Driven Pruning.**
- **DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.** *DeepSeek-AI.* Arxiv 2024. [code](https://github.com/deepseek-ai/DeepSeek-V2)
- **Effectively Compress KV Heads for LLM.**
- **Palu: Compressing KV-Cache with Low-Rank Projection.** Arxiv 2024. [code](https://github.com/shadowpa0327/Palu)
- **Tensor Product Attention Is All You Need.** Arxiv 2025. [code](https://github.com/tensorgi/T6)
- **OjaKV: Context-Aware Online Low-Rank KV Cache Compression with Oja's Rule.** Arxiv 2025.
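The common thread in this category is storing the cached keys and/or values in a factorized form. Below is a generic illustration of the idea, assuming a simple truncated-SVD factorization; it is a sketch of the principle, not Palu's or any other listed method's actual pipeline, and the helper names are hypothetical. Real systems typically calibrate or learn the projections offline and fuse them into the attention computation instead of running an SVD per request.

```python
# Minimal sketch of low-rank KV compression: store a rank-r factorization of the
# cached keys instead of the full matrix (illustrative only).
import torch

def compress_low_rank(kv, rank=32):
    """kv: [seq_len, head_dim] -> (A: [seq_len, rank], B: [rank, head_dim])."""
    U, S, Vh = torch.linalg.svd(kv, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # scale the kept left singular vectors
    B = Vh[:rank, :]
    return A, B

def decompress(A, B):
    return A @ B                 # approximate reconstruction

k = torch.randn(2048, 128)
A, B = compress_low_rank(k, rank=32)
k_hat = decompress(A, B)
print(A.numel() + B.numel(), "floats instead of", k.numel())
print("relative error:", (torch.norm(k - k_hat) / torch.norm(k)).item())
```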
### 6️⃣ Prompt Compression
- **TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning.**
- **LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.** EMNLP 2023. [code](https://github.com/microsoft/LLMLingua)
- **LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression.** ACL 2024. [code](https://github.com/microsoft/LLMLingua)
- **Better Prompt Compression Without Multi-Layer Perceptrons.**
- **LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression.** ACL 2024. [code](https://github.com/microsoft/LLMLingua)
- **ICPC: In-context Prompt Compression with Faster Inference.**
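These methods shrink the prompt itself before it ever reaches the KV cache, typically by dropping tokens that a small language model finds predictable. The toy sketch below illustrates that LLMLingua-style recipe in spirit; `token_logprobs` stands in for a real small LM's per-token log-probabilities, and the helper and data are hypothetical.

```python
# Minimal sketch of token-level prompt compression by self-information.
import torch

def compress_prompt(tokens, token_logprobs, keep_ratio=0.5):
    """tokens: list[str]; token_logprobs: [len(tokens)] log p(token | prefix)."""
    self_info = -token_logprobs                 # surprisal per token
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = torch.topk(self_info, k=k).indices
    keep_idx, _ = torch.sort(keep_idx)          # preserve original token order
    return [tokens[i] for i in keep_idx.tolist()]

toy_tokens = ["the", "report", "was", "filed", "on", "March", "3"]
toy_logprobs = torch.tensor([-0.1, -2.3, -0.2, -1.8, -0.3, -3.0, -2.7])
print(compress_prompt(toy_tokens, toy_logprobs, keep_ratio=0.5))
# -> keeps the rarer, more informative tokens: ['report', 'March', '3']
```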
### 2️⃣ Merging
- **Token Merging: Your ViT But Faster.** ICLR 2023. [code](https://github.com/facebookresearch/ToMe)
- **CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion.**
- **D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models.** [code](https://github.com/AIoT-MLSys-Lab/d2o)
- **Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks.**
- **CaM: Cache Merging for Memory-efficient LLMs Inference.**
- **Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs.** ICLR 2024. [code](https://github.com/alinlab/HOMER)
- **Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention.**
- **Compressed Context Memory for Online Language Model Interaction.** ICLR 2024. [code](https://github.com/snu-mllab/Context-Memory)
- **LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference.** [code](https://github.com/SUSTechBruce/LOOK-M)
- **AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning.** [code](https://github.com/LaVi-Lab/AIM)
- **KeepKV: Eliminating Output Perturbation in KV Cache Compression for Efficient LLMs Inference.**
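Rather than discarding evicted entries outright, merging methods fold them into the retained cache. The sketch below shows one generic variant: each evicted key/value pair is averaged into its most similar retained slot with a fixed weight. The listed papers differ in how they pick pairs and weights, so treat `merge_evicted` and `alpha` as illustrative assumptions rather than any paper's algorithm.

```python
# Minimal sketch of KV merging: fold evicted entries into similar retained ones.
import torch
import torch.nn.functional as F

def merge_evicted(keys, values, keep_mask, alpha=0.9):
    """keys, values: [seq_len, head_dim]; keep_mask: [seq_len] bool."""
    kept_k, kept_v = keys[keep_mask].clone(), values[keep_mask].clone()
    evict_k, evict_v = keys[~keep_mask], values[~keep_mask]
    if evict_k.shape[0] == 0:
        return kept_k, kept_v

    # Cosine similarity between each evicted key and every retained key.
    sim = F.cosine_similarity(evict_k.unsqueeze(1), kept_k.unsqueeze(0), dim=-1)
    target = sim.argmax(dim=1)          # nearest retained slot per evicted token

    # Weighted merge: retained entry dominates, evicted entry contributes (1 - alpha).
    for i, t in enumerate(target.tolist()):
        kept_k[t] = alpha * kept_k[t] + (1 - alpha) * evict_k[i]
        kept_v[t] = alpha * kept_v[t] + (1 - alpha) * evict_v[i]
    return kept_k, kept_v

k, v = torch.randn(512, 128), torch.randn(512, 128)
mask = torch.zeros(512, dtype=torch.bool); mask[-128:] = True  # keep recent 128
print(merge_evicted(k, v, mask)[0].shape)  # torch.Size([128, 128])
```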
### 3️⃣ Cross-Layer
- **A Systematic Study of Cross-Layer KV Sharing for Efficient LLM Inference.**
- **Layer-Condensed KV Cache for Efficient Inference of Large Language Models.**
- **MiniCache: KV Cache Compression in Depth Dimension for Large Language Models.**
- **Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.**
- **You Only Cache Once: Decoder-Decoder Architectures for Language Models.**
- **KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing.**
- **Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity.**
- **SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation.**
- **MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding.** [code](https://github.com/zaydzuhri/pythia-mlkv)
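Cross-layer methods cut memory by letting several layers share one KV cache. Below is a minimal sketch of the bookkeeping, assuming adjacent layers are grouped and only the first layer of each group writes its KV; this is a generic illustration of the sharing idea, and the `SharedKVCache` class and group size are hypothetical.

```python
# Minimal sketch of cross-layer KV sharing: layers in a group reuse one cache.
from typing import Dict, Tuple
import torch

class SharedKVCache:
    def __init__(self, num_layers: int, group_size: int = 2):
        self.num_layers = num_layers
        self.group_size = group_size
        self.store: Dict[int, Tuple[torch.Tensor, torch.Tensor]] = {}

    def owner(self, layer: int) -> int:
        return (layer // self.group_size) * self.group_size

    def write(self, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
        if layer == self.owner(layer):        # only the group's first layer writes
            self.store[layer] = (k, v)

    def read(self, layer: int) -> Tuple[torch.Tensor, torch.Tensor]:
        return self.store[self.owner(layer)]  # later layers reuse the group's cache

cache = SharedKVCache(num_layers=4, group_size=2)
cache.write(0, torch.randn(16, 128), torch.randn(16, 128))
cache.write(1, torch.randn(16, 128), torch.randn(16, 128))  # ignored: shares layer 0
assert cache.read(1)[0].shape == (16, 128)
```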
### 5️⃣ Quantization
- **GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM.** [code](https://github.com/opengear-project/GEAR)
- **ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification.**
- **No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization.**
- **PQCache: Product Quantization-based KVCache for Long Context LLM Inference.**
- **Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression.** Arxiv 2024.
- **SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models.**
- **QAQ: Quality Adaptive Quantization for LLM KV Cache.** [code](https://github.com/ClubieDong/QAQ-KVCacheQuantization)
- **WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More.**
- **KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.**
- **KVTuner: Sensitivity-Aware Layer-wise Mixed Precision KV Cache Quantization for Efficient and Nearly Lossless LLM Inference.** ICML 2025. [code](https://github.com/cmd2001/KVTuner)
- **TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization.**
- **TaDA: Training-free recipe for Decoding with Adaptive KV Cache Compression and Mean-centering.**
- **KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.** [code](https://github.com/jy-yuan/KIVI)
- **Plug-and-Play 1.x-Bit KV Cache Quantization for Video Large Language Models.** [code](https://github.com/KD-TAO/VidKV)
- **BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache.** [code](https://github.com/DD-DuDa/BitDecoding)
- **CommVQ: Commutative Vector Quantization for KV Cache Compression.**
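Quantization approaches keep every cached token but store keys and values at low bit width. The sketch below shows plain asymmetric uniform quantization along a chosen axis (for example per-channel for keys and per-token for values, as in KIVI-style schemes); the bit width, grouping, and helper names are illustrative assumptions, not any listed system's kernel.

```python
# Minimal sketch of low-bit KV quantization with asymmetric (zero-point) scaling.
import torch

def quantize(x: torch.Tensor, bits: int = 2, dim: int = 0):
    """Asymmetric uniform quantization of x along `dim`; returns codes + metadata."""
    qmax = 2 ** bits - 1
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    codes = ((x - xmin) / scale).round().clamp(0, qmax).to(torch.uint8)
    return codes, scale, xmin

def dequantize(codes, scale, xmin):
    return codes.to(scale.dtype) * scale + xmin

keys = torch.randn(1024, 128)
codes, scale, zero = quantize(keys, bits=2, dim=0)   # per-channel for keys
k_hat = dequantize(codes, scale, zero)
print("max abs error:", (keys - k_hat).abs().max().item())
```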
### 7️⃣ Reuse
- **KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse.** [code](https://github.com/UCSB-NLP-Chang/KVLink)
### 8️⃣ Non-Autoregressive
- **dKV-Cache: The Cache for Diffusion Language Models.** [code](https://github.com/horseee/dKV-Cache)
- **dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching.** [code](https://github.com/maomaocun/dLLM-cache)
- **Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding.** [code](https://github.com/NVlabs/Fast-dLLM)
## 📷 Survey
- **Prompt Compression for Large Language Models: A Survey.**
- **A Survey on Large Language Model Acceleration based on KV Cache Management.** [code](https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management)
- **Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption.** [code](https://github.com/zcli-charlie/Awesome-KV-Cache)
## 📊 Evaluation
- **KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches.** EMNLP 2024. [code](https://github.com/henryzhongsc/longctx_bench)
- **Rethinking Key-Value Cache Compression Techniques for Large Language Model Serving.** [code](https://github.com/LLMkvsys/rethink-kv-compression)
- **SCBench: A KV Cache-Centric Analysis of Long-Context Methods.**
- **More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression.**
## ⚙️ Project
- **kvpress.**
- **KVCache-Factory.** [code](https://github.com/Zefan-Cai/KVCache-Factory)