Awesome-KV-Cache-Compression
📰 Must-read papers on KV Cache Compression (constantly updating 🤗).
https://github.com/October2001/Awesome-KV-Cache-Compression
🔍 Method
1️⃣ Pruning / Evicting / Sparse
- **MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention.** *Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/MInference)](https://github.com/microsoft/MInference)
- **Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers.**
- **Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference.**
- **ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching.**
- **Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference.** [![GitHub Repo stars](https://img.shields.io/github/stars/d-matrix-ai/keyformer-llm)](https://github.com/d-matrix-ai/keyformer-llm)
- **Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference.**
- **Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time.**
- **SnapKV: LLM Knows What You are Looking for Before Generation.**
- **H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models.**
- **Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs.**
- **PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference.**
- **PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling.** [![GitHub Repo stars](https://img.shields.io/github/stars/Zefan-Cai/PyramidKV)](https://github.com/Zefan-Cai/PyramidKV)
- **Transformers are Multi-State RNNs.** [![GitHub Repo stars](https://img.shields.io/github/stars/schwartz-lab-NLP/TOVA)](https://github.com/schwartz-lab-NLP/TOVA)
- **Efficient Streaming Language Models with Attention Sinks.** [![GitHub Repo stars](https://img.shields.io/github/stars/mit-han-lab/streaming-llm)](https://github.com/mit-han-lab/streaming-llm)
- **A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression.**
- **Retrieval Head Mechanistically Explains Long-Context Factuality.**
- **Efficient Sparse Attention needs Adaptive Token Release.**
- **Loki: Low-Rank Keys for Efficient Sparse Attention.**
- **On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference.**
- **CORM: Cache Optimization with Recent Message for Large Language Model Inference.**
- **RazorAttention: Efficient KV Cache Compression Through Retrieval Heads.**
- **ThinK: Thinner Key Cache by Query-Driven Pruning.**
- **A2SF: Accumulative Attention Scoring with Forgetting Factor for Token Pruning in Transformer Decoder.** [![GitHub Repo stars](https://img.shields.io/github/stars/Dirac-Notation/A2SF)](https://github.com/Dirac-Notation/A2SF)
- **Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference.** [![GitHub Repo stars](https://img.shields.io/github/stars/mit-han-lab/Quest)](https://github.com/mit-han-lab/Quest)
- **LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference.**
- **Post-Training Sparse Attention with Double Sparsity.** [![GitHub Repo stars](https://img.shields.io/github/stars/andy-yang-1/DoubleSparse)](https://github.com/andy-yang-1/DoubleSparse)
- **Farewell to Length Extrapolation, a Training-Free Infinite Context with Finite Attention Scope.**
- **Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference.**
- **NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time.**
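
Most of the eviction methods above share a simple core: score each cached token (often by accumulated attention) and keep only a recent window plus the highest-scoring "heavy hitters". The sketch below is purely illustrative and not taken from any paper's released code; the function name, array shapes, and the fixed `recent` window are assumptions.

```python
import numpy as np

def evict_kv(keys, values, attn_scores, budget, recent=32):
    """Prune a per-head KV cache down to `budget` entries.
    keys, values: (seq_len, head_dim); attn_scores: accumulated attention
    mass each cached token has received, shape (seq_len,)."""
    seq_len = keys.shape[0]
    if seq_len <= budget:
        return keys, values
    recent_idx = np.arange(seq_len - recent, seq_len)    # always keep the local window
    n_heavy = max(budget - recent, 0)
    older_scores = attn_scores[: seq_len - recent]
    heavy_idx = (np.argsort(older_scores)[-n_heavy:]     # top-scoring "heavy hitters"
                 if n_heavy > 0 else np.empty(0, dtype=int))
    keep = np.sort(np.concatenate([heavy_idx, recent_idx]))
    return keys[keep], values[keep]
```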
2️⃣ Merging
- **Token Merging: Your ViT But Faster.** *Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, Judy Hoffman.* ICLR 2023. [![GitHub Repo stars](https://img.shields.io/github/stars/facebookresearch/ToMe)](https://github.com/facebookresearch/ToMe)
- **D2O: Dynamic Discriminative Operations for Efficient Generative Inference of Large Language Models.**
- **Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks.**
- **CaM: Cache Merging for Memory-efficient LLMs Inference.**
- **Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs.** *Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin.* ICLR 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/alinlab/HOMER)](https://github.com/alinlab/HOMER)
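
Merging methods keep a token's information without keeping its slot: instead of discarding an evicted KV entry, they fold it into the most similar retained entry. A minimal, purely illustrative sketch (weighted averaging by dot-product similarity; not the exact merge rule of any paper above):

```python
import numpy as np

def merge_evicted(keys, values, keep_idx, evict_idx):
    """keys, values: (seq_len, head_dim); returns caches with len(keep_idx) slots,
    each evicted entry averaged into its nearest retained slot."""
    k_keep, v_keep = keys[keep_idx].copy(), values[keep_idx].copy()
    count = np.ones(len(keep_idx))                        # tokens represented per slot
    for i in evict_idx:
        j = int(np.argmax(k_keep @ keys[i]))              # most similar retained key
        w = count[j]
        k_keep[j] = (w * k_keep[j] + keys[i]) / (w + 1)   # running mean of keys
        v_keep[j] = (w * v_keep[j] + values[i]) / (w + 1) # ...and of values
        count[j] += 1
    return k_keep, v_keep
```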
4️⃣ Low-Rank
- **GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints.** *Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai.* EMNLP 2023.
- **DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.** *DeepSeek-AI.* Arxiv 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/deepseek-ai/DeepSeek-V2)](https://github.com/deepseek-ai/DeepSeek-V2)
- **Effectively Compress KV Heads for LLM.**
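
The low-rank line of work exploits the observation that cached keys (and often values) lie close to a low-dimensional subspace, so the cache can store small factors instead of full vectors. A minimal SVD-based sketch, illustrative only (latent-attention and head-compression designs above choose their projections quite differently):

```python
import numpy as np

def compress_keys(keys, rank):
    """keys: (seq_len, head_dim). Cache (seq_len, rank) coefficients plus a
    shared (rank, head_dim) basis instead of the full keys."""
    u, s, vt = np.linalg.svd(keys, full_matrices=False)
    coeffs = u[:, :rank] * s[:rank]      # per-token low-rank coefficients
    basis = vt[:rank]                    # shared projection basis
    return coeffs, basis

def reconstruct_keys(coeffs, basis):
    return coeffs @ basis                # approximate keys for attention

keys = np.random.randn(1024, 128)
coeffs, basis = compress_keys(keys, rank=32)
approx_keys = reconstruct_keys(coeffs, basis)   # ~4x smaller per-token key storage
```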
6️⃣ Prompt Compression
- **LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models.** *Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu.* EMNLP 2023. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/LLMLingua)](https://github.com/microsoft/LLMLingua)
- **LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression.** *Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, Dongmei Zhang.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/LLMLingua)](https://github.com/microsoft/LLMLingua)
- **LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression.** *Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu.* ACL 2024. [![GitHub Repo stars](https://img.shields.io/github/stars/microsoft/LLMLingua)](https://github.com/microsoft/LLMLingua)
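
Prompt compression shrinks the KV cache indirectly by shortening the input itself, typically keeping the tokens a small reference model finds most informative. The sketch below is not LLMLingua's actual API; `token_logprobs` is assumed to come from whichever small LM scores the prompt.

```python
def compress_prompt(tokens, token_logprobs, keep_ratio=0.5):
    """tokens: list of prompt tokens; token_logprobs: their log-probabilities
    under a small LM (highly predictable tokens are the most droppable)."""
    n_keep = max(1, int(len(tokens) * keep_ratio))
    surprisal = [-lp for lp in token_logprobs]             # information content
    top = sorted(range(len(tokens)), key=lambda i: surprisal[i], reverse=True)[:n_keep]
    return [tokens[i] for i in sorted(top)]                # restore original order

# compress_prompt(["the", "launch", "code", "is", "7421"],
#                 [-0.1, -3.2, -2.8, -0.2, -6.5])
# keeps the two most informative tokens: ["launch", "7421"]
```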
3️⃣ Cross-Layer
- **Layer-Condensed KV Cache for Efficient Inference of Large Language Models.**
- **MiniCache: KV Cache Compression in Depth Dimension for Large Language Models.**
- **MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding.** [![GitHub Repo stars](https://img.shields.io/github/stars/zaydzuhri/pythia-mlkv)](https://github.com/zaydzuhri/pythia-mlkv)
- **You Only Cache Once: Decoder-Decoder Architectures for Language Models.**
- **Reducing Transformer Key-Value Cache Size with Cross-Layer Attention.**
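
Cross-layer methods shrink the cache along the depth dimension: groups of adjacent layers (or all decoder layers, in the decoder-decoder case) reuse one set of keys and values. A minimal illustrative sketch with hypothetical names, where every `share_group` consecutive layers share a cache:

```python
class SharedKVCache:
    """One KV cache per group of `share_group` consecutive layers, so a
    32-layer model with share_group=4 stores only 8 layer caches."""
    def __init__(self, num_layers, share_group):
        assert num_layers % share_group == 0
        self.share_group = share_group
        self.caches = [([], []) for _ in range(num_layers // share_group)]

    def get(self, layer_idx):
        # Every layer in a group reads the same cache.
        return self.caches[layer_idx // self.share_group]

    def append(self, layer_idx, key, value):
        # Only the first layer of each group writes new KV entries.
        if layer_idx % self.share_group == 0:
            ks, vs = self.get(layer_idx)
            ks.append(key)
            vs.append(value)
```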
5️⃣ Quantization
- **ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification.**
- **No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization.**
- **KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache.** [![GitHub Repo stars](https://img.shields.io/github/stars/jy-yuan/KIVI)](https://github.com/jy-yuan/KIVI)
- **GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM.** [![GitHub Repo stars](https://img.shields.io/github/stars/opengear-project/GEAR)](https://github.com/opengear-project/GEAR)
- **PQCache: Product Quantization-based KVCache for Long Context LLM Inference.**
- **Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression.** *Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen.* Arxiv 2024.
- **SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models.**
- **QAQ: Quality Adaptive Quantization for LLM KV Cache.** [![GitHub Repo stars](https://img.shields.io/github/stars/ClubieDong/QAQ-KVCacheQuantization)](https://github.com/ClubieDong/QAQ-KVCacheQuantization)
- **KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.**
- **WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More.**
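
Quantization keeps every cached token but stores its keys and values in low precision, usually with fine-grained scales and zero-points to cope with outliers. A minimal asymmetric per-token sketch, illustrative only (the papers above differ mainly in how they pick the grouping, bit width, and outlier handling):

```python
import numpy as np

def quantize_kv(x, bits=4):
    """x: (seq_len, head_dim) keys or values; one scale/zero-point per token."""
    qmax = 2 ** bits - 1
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = np.maximum(hi - lo, 1e-8) / qmax
    q = np.clip(np.round((x - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo

def dequantize_kv(q, scale, lo):
    return q.astype(np.float32) * scale + lo

k = np.random.randn(16, 128).astype(np.float32)
q, scale, lo = quantize_kv(k, bits=4)
max_err = np.abs(dequantize_kv(q, scale, lo) - k).max()   # small reconstruction error
```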
📊 Evaluation
- **KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches.** *Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu.* Arxiv 2024.
📷 Survey