Awesome-Efficient-LLM
A curated list for Efficient Large Language Models
https://github.com/horseee/Awesome-Efficient-LLM
Papers from August 24, 2024 to now (see the full list from May 22, 2023 [here](#full-list))
-
Inference Acceleration
- ![Star - wise Criticality-based Approach for Prefilling Acceleration in LLMs](https://arxiv.org/abs/2409.12490) <br> Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie |<img width="1002" alt="image" src="https://arxiv.org/html/2409.12490v1/x2.png"> |[Github](https://github.com/66RING/CritiPrefill) <br> [Paper](https://arxiv.org/abs/2409.12490)|[//]: #09/21
- ![Star - the-Fly Self-Speculative Decoding for LLM Inference Acceleration](https://arxiv.org/abs/2410.06916) <br> Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li |<img width="1002" alt="image" src="https://github.com/hemingkx/SWIFT/raw/main/assets/swift.png"> |[Github](https://github.com/hemingkx/SWIFT) <br> [Paper](https://arxiv.org/abs/2410.06916)|[//]: #10/14
- ![Star - Inspired Adaptive Sparse Activation](https://arxiv.org/abs/2410.18311#) <br> Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen |<img width="1002" alt="image" src="https://wangqinsi1.github.io/coreinfer_page/static/images/overview.png"> |[Github](https://github.com/wangqinsi1/CoreInfer) <br> [Paper](https://arxiv.org/abs/2410.18311#)|[//]: #10/29
- [DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads](https://arxiv.org/abs/2410.10819) <br> Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/duo-attention/raw/main/figures/method1.jpg"> |[Github](https://github.com/mit-han-lab/duo-attention) <br> [Paper](https://arxiv.org/abs/2410.10819)|[//]: #10/21
- [![Publish](https://img.shields.io/badge/Conference-ICML'23%20Oral-blue)]()<br> :star: [Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time](https://openreview.net/forum?id=wIPIhHd00i) <br> Zichang Liu, Jue WANG, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen |<img width="202" alt="image" src="figures/DajeVu.png"> |[Github](https://github.com/FMInference/DejaVu) <br> [Paper](https://openreview.net/forum?id=wIPIhHd00i)| [//]: #Recommend
- :star: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) <br> Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis |<img width="1002" alt="image" src="https://github.com/mit-han-lab/streaming-llm/blob/main/figures/schemes.png"> |[Github](https://github.com/mit-han-lab/streaming-llm) <br> [Paper](https://arxiv.org/abs/2309.17453)| [//]: #Recommend
-
KV Cache Compression
- ![Star - Wise Dissimilar KV Cache Sharing](https://arxiv.org/abs/2410.18517) <br> Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |<img width="1002" alt="image" src="https://github.com/yangyifei729/KVSharer/raw/main/img/main_fig.jpg"> |[Github](https://github.com/yangyifei729/KVSharer) <br> [Paper](https://arxiv.org/abs/2410.18517)|[//]: #10/29
- ![Star - KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference](https://arxiv.org/abs/2407.11550) <br> Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |<img width="1002" alt="image" src="figures/adakv.png"> |[Github](https://github.com/FFY0/AdaKV) <br> [Paper](https://arxiv.org/abs/2407.11550)|[//]: #10/13
- ![Star - Layer KV Sharing for Efficient LLM Inference](https://arxiv.org/abs/2410.14442) <br> You Wu, Haoyi Wu, Kewei Tu |<img width="202" alt="image" src="figures/cross-layer-kv.png"> |[Github](https://github.com/whyNLP/LCKV) <br> [Paper](https://arxiv.org/abs/2410.14442)|[//]: #10/30
- ![Star - Level KV Cache Compression Method with Integrated Retrieval and Reasoning](https://arxiv.org/abs/2410.19258) <br> Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao |<img width="1002" alt="image" src="https://github.com/FYYFU/HeadKV/raw/main/main.png"> |[Github](https://github.com/FYYFU/HeadKV) <br> [Paper](https://arxiv.org/abs/2410.19258)|[//]: #11/17
-
Tuning
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models](https://arxiv.org/abs/2410.11772) <br> Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11772v1/x3.png"> |[Github](https://github.com/Kaiseem/IST) <br> [Paper](https://arxiv.org/abs/2410.11772)|[//]: #10/21
- [Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation](https://arxiv.org/abs/2411.04358) <br> Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty |<img width="1002" alt="image" src="https://arxiv.org/html/2411.04358v2/x3.png"> |[Github](https://github.com/LCS2-IIITD/MonteCLoRA) <br> [Paper](https://arxiv.org/abs/2411.04358)|[//]: #11/18
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[QEFT: Quantization for Efficient Fine-Tuning of LLMs](https://arxiv.org/abs/2410.08661) <br> Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park |<img width="1002" alt="image" src="https://arxiv.org/html/2410.08661v1/x2.png"> |[Github](https://github.com/xvyaward/qeft) <br> [Paper](https://arxiv.org/abs/2410.08661)|[//]: #10/21
- ![Star - tuning of MLP Layers](https://arxiv.org/abs/2410.07383) <br> Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07383v1/x1.png"> |[Github](https://github.com/sayankotor/sparse_grads) <br> [Paper](https://arxiv.org/abs/2410.07383)|[//]: #10/13
-
Text Compression
- [AlphaZip: Neural Network-Enhanced Lossless Text Compression](https://arxiv.org/abs/2409.15046) <br> Swathi Shree Narashiman, Nitin Chandrachoodan |<img width="1002" alt="image" src="https://arxiv.org/html/2409.15046v1/extracted/5873563/images/architecture_bloack_diagram.png"> |[Github](https://github.com/Swathi-Shree-Narashiman/AlphaZip) <br> [Paper](https://arxiv.org/abs/2409.15046)|[//]: #09/27
- [Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression](https://arxiv.org/abs/2408.15491) <br> Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2408.15491v1/extracted/5817813/arch.png"> |[Github](https://github.com/howard-hou/instruction-aware-contextual-compressor) <br> [Paper](https://arxiv.org/abs/2408.15491)|[//]: #09/02
- ![Star - Length Tokenization for Efficient LLMs Adapted from LZW Compression](https://arxiv.org/abs/2410.21548) <br> Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21548v1/extracted/5960495/Figures/MultiTok.png"> |[Github](https://github.com/noelkelias/multitok) <br> [Paper](https://arxiv.org/abs/2410.21548)|[//]: #11/17
- [Generative Context Distillation](https://arxiv.org/abs/2411.15927) <br> Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo |<img width="1002" alt="image" src="figures/GCD.png"> |[Github](https://github.com/kaistai/generative-context-distillation) <br> [Paper](https://arxiv.org/abs/2411.15927)|[//]: #12/02
-
Low-Rank Decomposition
- [Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https://arxiv.org/abs/2410.16029) <br> Arijit Das | |[Github](https://github.com/selfsupervised-ai/Natural-GaLore) <br> [Paper](https://arxiv.org/abs/2410.16029)|[//]: #10/30
-
Hardware/System/Serving
- [TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices](https://arxiv.org/abs/2410.00531) <br> Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.00531v1/x4.png"> |[Github](https://github.com/Lizonghang/TPI-LLM) <br> [Paper](https://arxiv.org/abs/2410.00531)|[//]: #10/02
-
Quantization
- [Quamba: A Post-Training Quantization Recipe for Selective State Space Models](https://arxiv.org/abs/2410.13229) <br> Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13229v1/extracted/5933363/figures/outliers.png"> |[Github](https://github.com/enyac-group/Quamba) <br> [Paper](https://arxiv.org/abs/2410.13229)|[//]: #10/21
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs](https://arxiv.org/abs/2406.01721) <br> Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, Ying Wei |<img width="1002" alt="image" src="https://github.com/Hsu1023/DuQuant/blob/main/imgs/duquant.png"> |[Github](https://github.com/Hsu1023/DuQuant?tab=readme-ov-file) <br> [Paper](https://arxiv.org/abs/2406.01721)|[//]: #09/27
- ![Star - bit Vector Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2409.17066) <br> Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang |<img width="1002" alt="image" src="figures/VPTQ.png"> |[Github](https://github.com/microsoft/VPTQ) <br> [Paper](https://arxiv.org/abs/2409.17066)|[//]: #09/27
- [SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs](https://arxiv.org/abs/2410.09615) <br> Mohammad Mozaffari, Maryam Mehri Dehnavi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09615v1/x1.png"> |[Github](https://github.com/Mohammad-Mozaffari/slim) <br> [Paper](https://arxiv.org/abs/2410.09615)|[//]: #10/21
- Matmul or No Matmal in the Era of 1-bit LLMs
- [![Publish](https://img.shields.io/badge/Conference-HPCA'25-blue)]()<br>[BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration](https://arxiv.org/abs/2411.11745) <br> Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah |<img width="1002" alt="image" src="https://arxiv.org/html/2411.11745v1/x5.png"> |[Github](https://github.com/abdelfattah-lab/BitMoD-HPCA-25) <br> [Paper](https://arxiv.org/abs/2411.11745)|[//]: #11/24
-
Efficient Training
- ![Star - Efficient FP8 Training](https://arxiv.org/abs/2410.19313) <br> Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han |<img width="1002" alt="image" src="https://github.com/NVlabs/COAT/blob/main/docs/figs/FP8PrecisionFlow.png"> |[Github](https://github.com/NVlabs/COAT) <br> [Paper](https://arxiv.org/abs/2410.19313)|[//]: #11/17
- [Github](https://github.com/wuhouming/BitPipe) <br> [Paper](https://arxiv.org/abs/2410.19367)|[//]: #11/17
-
Survey (or Benchmark)
- [LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/abs/2411.00136) <br> Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al. | |[Github](https://github.com/argonne-lcf/LLM-Inference-Bench) <br> [Paper](https://arxiv.org/abs/2411.00136)|[//]: #11/18
- [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) <br> Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier |<img width="1002" alt="image" src="https://arxiv.org/html/2410.12388v2/extracted/5933385/Figures/tree_overview.png"> |[Github](https://github.com/ZongqianLi/Prompt-Compression-Survey) <br> [Paper](https://arxiv.org/abs/2410.12388)|[//]: #10/21
- [Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2409.13385) <br> Sourav Verma |<img width="1002" alt="image" src="figures/CCRAG_survey.png"> |[Github](https://github.com/SrGrace/Contextual-Compression) <br> [Paper](https://arxiv.org/abs/2409.13385)|[//]: #09/27
-
Efficient MOE
- [MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) <br> Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan |<img width="1002" alt="image" src="https://arxiv.org/html/2411.01016v1/x1.png"> |[Github](https://github.com/xiaochengsky/MoEI-2) <br> [Paper](https://arxiv.org/abs/2411.01016)|[//]: #11/18
- [MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More](https://arxiv.org/abs/2410.06270) <br> Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi |<img width="1002" alt="image" src="https://github.com/Aaronhuang-778/MC-MoE/raw/main/imgs/[email protected]"> |[Github](https://github.com/Aaronhuang-778/MC-MoE) <br> [Paper](https://arxiv.org/abs/2410.06270)|[//]: #10/14
-
Network Pruning / Sparsity
- Language-specific Calibration for Pruning Multilingual Language Models - Jia Chen, Lucie Flek ||[Paper](https://arxiv.org/abs/2408.14398)|[//]: #08/27
- Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
- [Reassessing Layer Pruning in LLMs: New Insights and Methods](https://arxiv.org/abs/2411.15558) <br> Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu |<img width="1002" alt="image" src="https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning/raw/main/framework.JPG"> |[Github](https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning) <br> [Paper](https://arxiv.org/abs/2411.15558)|[//]: #12/03
- LLM Pruning and Distillation in Practice: The Minitron Approach
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment](https://arxiv.org/abs/2411.10606) <br> Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan Celine Lin |<img width="1002" alt="image" src="https://arxiv.org/html/2411.10606v1/x2.png"> |[Github](https://github.com/GATECH-EIC/AmoebaLLM) <br> [Paper](https://arxiv.org/abs/2411.10606)|[//]: #11/24
-
Knowledge Distillation
-
-
Papers from Sep 2, 2024 to now (see the full list from May 22, 2023 [here](#full-list))
-
Network Pruning / Sparsity
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models](https://arxiv.org/abs/2410.03750) <br> Juan Pablo Munoz, Jinjie Yuan, Nilesh Jain |<img width="1002" alt="image" src="figures/SQFT.png"> |[Github](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT) <br> [Paper](https://arxiv.org/abs/2410.03750)|[//]: #10/01
- KVPruner: Structural Pruning for Faster and Memory-Efficient Large Language Models
- Evaluating the Impact of Compression Techniques on Task-Specific Performance of Large Language Models
- STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning - won hwang, Aurick Qiao, Daniel F Campos, Zhewei Yao, Yuxiong He |<img width="1002" alt="image" src="https://arxiv.org/html/2409.06211v1/x1.png"> |[Paper](https://arxiv.org/abs/2409.06211)|[//]: #09/13
- OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition
-
Inference Acceleration
- Path-Consistency: Prefix Enhancement for Efficient Inference in LLM
- Dynamic-Width Speculative Beam Decoding for Efficient LLM Inference
- Boosting Lossless Speculative Decoding via Feature Sampling and Partial Alignment Distillation
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval
- Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
-
Hardware/System/Serving
- ![Publish - Preserved Microscaling Quantization Accelerator for Generative Large Language Models](https://arxiv.org/abs/2409.05902) <br> Jahyun Koo, Dahoon Park, Sangwoo Jung, Jaeha Kung |<img width="1002" alt="image" src="https://arxiv.org/html/2409.05902v1/x5.png"> |[Paper](https://arxiv.org/abs/2409.05902)|[//]: #09/13
- [Accelerating Large Language Model Training with Hybrid GPU-based Compression](https://arxiv.org/abs/2409.02423) |[Paper](https://arxiv.org/abs/2409.02423)|[//]: #09/06
-
Text Compression
-
Quantization
- [The Uniqueness of LLaMA3-70B with Per-Channel Quantization: An Empirical Study](https://arxiv.org/abs/2408.15301) |[Paper](https://arxiv.org/abs/2408.15301)|[//]: #09/02
- A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B
- Accumulator-Aware Post-Training Quantization
-
Survey (or Benchmark)
-
Tuning
- [Enabling Resource-Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines](https://arxiv.org/abs/2409.15520) |[Paper](https://arxiv.org/abs/2409.15520)|[//]: #09/27
-
Knowledge Distillation
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models
- Efficient Knowledge Distillation: Empowering Small Language Models with Teacher Model Insights - Uwe Kühnberger |<img width="1002" alt="image" src="https://arxiv.org/html/2409.12586v1/x2.png"> |[Paper](https://arxiv.org/abs/2409.12586)|[//]: #09/21
- BabyLlama-2: Ensemble-Distilled Models Consistently Outperform Teachers With Limited Data - Loup Tastet, Inar Timiryasov | |[Paper](https://arxiv.org/abs/2409.17312)|[//]: #09/27
- EchoAtt: Attend, Copy, then Adjust for More Efficient Large Language Models
-
KV Cache Compression
-
Efficient MOE
- [Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning](https://arxiv.org/abs/2412.00069) <br> Mingyu Cao, Gen Li, Jie Ji, Jiaqi Zhang, Xiaolong Ma, Shiwei Liu, Lu Yin |<img width="1002" alt="image" src="https://arxiv.org/html/2412.00069v1/x2.png"> |[Github](https://github.com/duterscmy/CD-MoE) <br> [Paper](https://arxiv.org/abs/2412.00069)|[//]: #12/09
-
-
Knowledge Distillation
- [Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models](https://arxiv.org/abs/2404.02657) |[Paper](https://arxiv.org/abs/2404.02657) <br> [Blog-Eng](https://zhuanlan.zhihu.com/p/690804722)<br> [Blog-Chinese](https://zhuanlan.zhihu.com/p/690748958)|
- Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation
- Teaching Small Language Models to Reason
- [Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing](https://arxiv.org/abs/2305.16635) <br> Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi |<img width="1002" alt="image" src="figures/impossible_distillation.png"> |[Github](https://github.com/jaehunjung1/impossible-distillation) [paper](https://arxiv.org/abs/2305.16635) |
- [Large Language Model Distillation Doesn't Need a Teacher](https://arxiv.org/abs/2305.14864) <br> Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, Iz Beltagy | <img width="2000" alt="image" src="figures/TeacherFreeLLM.png"> | [Github](https://github.com/ananyahjha93/llm-distill) [paper](https://arxiv.org/abs/2305.14864) |
- PaD: Program-aided Distillation Specializes Large Models in Reasoning
- The False Promise of Imitating Proprietary LLMs
- RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment
- [![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]()<br>[Specializing Smaller Language Models towards Multi-Step Reasoning](https://arxiv.org/abs/2301.12726) <br> Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot |<img width="1002" alt="image" src="figures/ModelSpecialization.png"> |[Github](https://github.com/FranxYao/FlanT5-CoT-Specialization) <br> [Paper](https://arxiv.org/abs/2301.12726)|
- [![Publish](https://img.shields.io/badge/Conference-ACL'23%20Outstanding-blue)]()<br>[Distilling Script Knowledge from Large Language Models for Constrained Language Planning](https://arxiv.org/abs/2305.05252) <br> Siyu Yuan, Jiangjie Chen, Ziquan Fu, Xuyang Ge, Soham Shah, Charles Robert Jankowski, Yanghua Xiao, Deqing Yang |<img width="302" alt="image" src="figures/CoScript.png"> |[Github](https://github.com/siyuyuan/coscript) <br> [Paper](https://arxiv.org/abs/2305.05252)|
- ![Publish - Consistent Chain-of-Thought Distillation](https://arxiv.org/abs/2305.01879) <br> Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, Xiang Ren |<img width="1002" alt="image" src="figures/SCOTT.png"> |[Paper](https://arxiv.org/abs/2305.01879)|
- [![Publish](https://img.shields.io/badge/Conference-ACL'23-blue)]()<br>[DISCO: Distilling Counterfactuals with Large Language Models](https://arxiv.org/abs/2212.10534) <br> Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, Kyle Richardson |<img width="1002" alt="image" src="figures/disco.png"> |[Github](https://github.com/eric11eca/disco) <br> [Paper](https://arxiv.org/abs/2212.10534)|
- [![Publish](https://img.shields.io/badge/Conference-ACL'23-blue)]()<br>[I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation](https://arxiv.org/abs/2212.09246) <br> Chandra Bhagavatula, Jena D. Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Lianhui Qin, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin Choi |<img width="1002" alt="image" src="https://i2d2.allen.ai/i2d2-fig1.png"> |[Github](https://github.com/allenai/i2d2) <br> [Paper](https://arxiv.org/abs/2212.09246) <br> [Project](https://i2d2.allen.ai/) |
- [![Publish](https://img.shields.io/badge/Conference-ACL'23-blue)]()<br>[Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step](https://arxiv.org/abs/2306.14050) <br> Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi |<img width="202" alt="image" src="figures/SCoTD.png"> |[Github](https://github.com/allenai/cot_distillation) <br> [Paper](https://arxiv.org/abs/2306.14050)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() <br>[Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind](https://arxiv.org/abs/2306.09299) <br> Swarnadeep Saha, Peter Hase, and Mohit Bansal |<img width="302" alt="image" src="https://github.com/swarnaHub/ExplanationIntervention/blob/main/assets/main_fig.png"> |[Github](https://github.com/swarnaHub/ExplanationIntervention) <br> [Paper](https://arxiv.org/abs/2306.09299)|
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation](https://arxiv.org/abs/2310.14192) <br> Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, Issam H. Laradji |<img width="1002" alt="image" src="figures/PromptMix.png"> |[Github](https://github.com/ServiceNow/PromptMix-EMNLP-2023) <br> [Paper](https://arxiv.org/abs/2310.14192)|
- [![Publish](https://img.shields.io/badge/Conference-AAAI'24-blue)]()<br>[Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data](https://arxiv.org/abs/2312.12832) <br> Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Bin Sun, Xinglin Wang, Heda Wang, Kan Li |<img width="1002" alt="image" src="https://github.com/Yiwei98/TDG/blob/main/img.png"> |[Github](https://github.com/Yiwei98/TDG) <br> [Paper](https://arxiv.org/abs/2312.12832)|
- [![Publish](https://img.shields.io/badge/Conference-ACL'23%20Industry%20Track-blue)]() <br>[GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model](https://arxiv.org/abs/2306.06629) <br> Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie Tang |<img width="1002" alt="image" src="figures/GKD.png"> |[Github](https://github.com/aitsc/GLMKD) <br> [Paper](https://arxiv.org/abs/2306.06629)|
- [![Publish](https://img.shields.io/badge/Conference-ACL'23%20Findings-blue)]() <br> [Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes](https://arxiv.org/abs/2305.02301) <br> Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister | <img width="2000" alt="image" src="figures/Distill_step_by_step.png">| [Github](https://github.com/google-research/distilling-step-by-step) <br> [Paper](https://arxiv.org/abs/2305.02301) |
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23%20Findings-blue)]()<br>[Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models](https://arxiv.org/abs/2310.13395) <br> Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos |<img width="252" alt="image" src="figures/OCaTS.png"> |[Github](https://github.com/stoyian/OCaTS) <br> [Paper](https://arxiv.org/abs/2310.13395)|
- [LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions](https://github.com/mbzuai-nlp/LaMini-LM) <br>Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji | <img width="1002" alt="image" src="https://github.com/mbzuai-nlp/LaMini-LM/blob/main/images/lamini-pipeline.drawio.png"> | [Github](https://github.com/mbzuai-nlp/LaMini-LM) [paper](https://arxiv.org/abs/2304.14402) |
- [Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA](https://arxiv.org/abs/2308.04679) |[Paper](https://arxiv.org/abs/2308.04679)|
- [UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition](https://arxiv.org/abs/2308.03279) <br> Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon |<img width="302" alt="image" src="figures/UniversalNER.png"> |[Github](https://github.com/universal-ner/universal-ner) <br> [Paper](https://arxiv.org/abs/2308.03279) <br> [Project](https://universal-ner.github.io) |
- ![Star - Loup Tastet |<img width="302" alt="image" src="figures/BabyLLaMA.png"> |[Github](https://github.com/timinar/BabyLlama) <br> [Paper](https://arxiv.org/abs/2308.02019) | [Model](https://huggingface.co/timinar/baby-llama-58m) |
- [Zephyr: Direct Distillation of LM Alignment](https://arxiv.org/abs/2310.16944) <br> Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf |<img width="1002" alt="image" src="figures/zephyr.png"> |[Github](https://github.com/huggingface/alignment-handbook) <br> [Paper](https://arxiv.org/abs/2310.16944)|
- Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models
- Mixed Distillation Helps Smaller Language Model Better Reasoning
- Distilling Event Sequence Knowledge From Large Language Models
- Knowledge Distillation for Closed-Source Language Models
- ![Star - Young Yun |<img width="1002" alt="image" src="https://arxiv.org/html/2402.03898v1/x4.png"> |[Github](https://github.com/jongwooko/distillm) <br> [Paper](https://arxiv.org/abs/2402.03898)|
- Large Language Model Meets Graph Neural Network in Knowledge Distillation
- Improving Small Language Models' Mathematical Reasoning via Equation-of-Thought Distillation
- [Scavenging Hyena: Distilling Transformers into Long Convolution Models](https://arxiv.org/abs/2401.17574) |[Paper](https://arxiv.org/abs/2401.17574)|
- Divide-or-Conquer? Which Part Should You Distill Your LLM?
- [Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation](https://arxiv.org/abs/2402.14874) <br> Phuc Phan, Hieu Tran, Long Phan |<img width="1002" alt="image" src="https://github.com/pphuc25/distil-cd/blob/main/assets/figure1-method.jpg"> |[Github](https://github.com/pphuc25/distil-cd) <br> [Paper](https://arxiv.org/abs/2402.14874)|
- PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning
- Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning
- Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
- [Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs](https://arxiv.org/abs/2402.12030) <br> Nicolas Boizard, Kevin El-Haddad, Céline Hudelot, Pierre Colombo |<img width="1002" alt="image" src="figures/CrossTokenizer.png"> |[Github](https://github.com/Nicolas-BZRD/llm-recipes) [Github](https://github.com/Nicolas-BZRD/llm-distillation) <br> [Paper](https://arxiv.org/abs/2402.12030) <br> [Model](https://huggingface.co/collections/Nicolas-BZRD/llms-distillation-65cfa07f1e4ed7404502a9eb)|
- Revisiting Knowledge Distillation for Autoregressive Language Models
- Gecko: Versatile Text Embeddings Distilled from Large Language Models
- DistillSpec: Improving Speculative Decoding via Knowledge Distillation - François Kagy, Rishabh Agarwal |<img width="1002" alt="image" src="figures/DistillSpec.png"> |[Paper](https://arxiv.org/abs/2310.08461)|
- [Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination](https://arxiv.org/abs/2402.10052) <br> Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić |<img width="1002" alt="image" src="https://arxiv.org/html/2402.10052v1/x1.png"> |[Github](https://github.com/dong-river/LLM_unlearning) <br> [Paper](https://arxiv.org/abs/2402.10052)|
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[Democratizing Reasoning Ability: Tailored Learning from Large Language Model](https://aclanthology.org/2023.emnlp-main.120.pdf) <br> Zhaoyang Wang, Shaohan Huang, Yuxuan Liu, Jiahai Wang, Minghui Song, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang |<img width="1002" alt="image" src="figures/learn-to-reason.png"> |[Github](https://github.com/Raibows/Learn-to-Reason) <br> [Paper](https://aclanthology.org/2023.emnlp-main.120.pdf)|
- Leveraging Zero-Shot Prompting for Efficient Language Model Distillation
- Post-Semantic-Thinking: A Robust Strategy to Distill Reasoning Capacity from Large Language Models
- Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning
- [![Publish](https://img.shields.io/badge/Conference-ACL'24-blue)]()<br>[RDRec: Rationale Distillation for LLM-based Recommendation](https://arxiv.org/abs/2405.10587) <br> Xinfeng Wang, Jin Cui, Yoshimi Suzuki, Fumiyo Fukumoto |<img width="1002" alt="image" src="figures/RDRec.png"> |[Github](https://github.com/WangXFng/RDRec) <br> [Paper](https://arxiv.org/abs/2405.10587)|
-
Network Pruning
- [Multilingual Brain Surgeon: Large Language Models Can be Compressed Leaving No Language Behind](https://arxiv.org/abs/2404.04748) <br> Hongchuan Zeng, Hongshen Xu, Lu Chen, Kai Yu |<img width="1002" alt="image" src="https://github.com/HongchuanZeng/MBS/raw/main/mbs.png"> |[Github](https://github.com/X-LANCE/MBS) <br> [Paper](https://arxiv.org/abs/2404.04748)|
- [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br>[Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models](https://openreview.net/forum?id=Tr0lPx9woF) <br> Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, Carlo Vittorio Cannistraci |<img width="1002" alt="image" src="figures/RIA.png"> |[Github](https://github.com/biomedical-cybernetics/Relative-importance-and-activation-pruning) <br> [Paper](https://openreview.net/forum?id=Tr0lPx9woF)|
- Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
- Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
- Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models
- Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
- [![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br> [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://arxiv.org/abs/2301.00774) <br> Elias Frantar, Dan Alistarh |<img width="522" alt="image" src="figures/sparsegpt.png"> |[Github](https://github.com/IST-DASLab/sparsegpt) <br> [Paper](https://arxiv.org/abs/2301.00774)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[LLM-Pruner: On the Structural Pruning of Large Language Models](https://arxiv.org/abs/2305.11627) <br> Xinyin Ma, Gongfan Fang, Xinchao Wang |<img width="561" alt="image" src="figures/llm_pruner.png"> |[Github](https://github.com/horseee/LLM-Pruner) <br> [Paper](https://arxiv.org/abs/2305.11627)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter](https://arxiv.org/abs/2306.03805) <br> Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang |<img width="1002" alt="image" src="https://user-images.githubusercontent.com/6660499/243539825-ca3b1dbe-bc1c-45d9-a6ea-d1d0c991e997.png"> |[Github](https://github.com/VITA-Group/essential_sparsity) <br> [Paper](https://arxiv.org/abs/2306.03805)|
- [![Publish](https://img.shields.io/badge/Conference-VLDB'24-blue)]() [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity](https://arxiv.org/abs/2309.10285) <br> Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song |<img width="602" alt="image" src="figures/FlashLLM.png"> |[Github](https://github.com/AlibabaResearch/flash-llm) <br> [Paper](https://arxiv.org/abs/2309.10285)|
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23%20Findings-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models](https://arxiv.org/abs/2310.10054) <br> Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun |<img width="402" alt="image" src="figures/NASH.png"> |[Github](https://github.com/jongwooko/NASH-Pruning-Official) <br> [Paper](https://arxiv.org/abs/2310.10054)|
- [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) <br> Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter |<img width="1002" alt="image" src="https://user-images.githubusercontent.com/20168304/245999360-f951de47-269d-491d-826a-8e6d85627849.png"> |[Github](https://github.com/locuslab/wanda) <br> [Paper](https://arxiv.org/abs/2306.11695)|
- [Paper](https://arxiv.org/abs/2310.01382)
- [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]()<br>[Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity](https://arxiv.org/abs/2310.02277) <br> Lu Yin, Shiwei Liu, Ajay Jaiswal, Souvik Kundu, Zhangyang Wang |<img width="1002" alt="image" src="figures/junk_DNA.png"> |[Github](https://github.com/VITA-Group/Junk_DNA_Hypothesis) <br> [Paper](https://arxiv.org/abs/2310.02277)|
- [Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity](https://arxiv.org/abs/2310.05175) <br> Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, Shiwei Liu |<img width="1002" alt="image" src="https://github.com/luuyin/OWL/blob/main/Images/Layer_wise_sparsity.png"> |[Github](https://github.com/luuyin/OWL) <br> [Paper](https://arxiv.org/abs/2310.05175)|
- [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning](https://arxiv.org/abs/2310.06694) <br> Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen |<img width="1002" alt="image" src="figures/LLM-shearing.png"> |[Github](https://github.com/princeton-nlp/LLM-Shearing) <br> [Paper](https://arxiv.org/abs/2310.06694)|
- [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[Sparse Finetuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927) <br> Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh |<img width="1002" alt="image" src="figures/SquareHead.png"> |[Github](https://github.com/IST-DASLab/SparseFinetuning) <br> [Paper](https://arxiv.org/abs/2310.06927)|
- [ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models](https://arxiv.org/abs/2310.04564) <br> Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar |<img width="1002" alt="image" src="figures/relufication.png"> |[Paper](https://arxiv.org/abs/2310.04564)|
- [The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning](https://arxiv.org/abs/2310.04680) <br> Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite |<img width="1002" alt="image" src="figures/recall_and_icl.png"> |[Paper](https://arxiv.org/abs/2310.04680)|
- [Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs](https://arxiv.org/abs/2310.08915) <br> Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji |<img width="202" alt="image" src="figures/DSOT.png"> |[Github](https://github.com/zyxxmu/DSnoT) <br> [Paper](https://arxiv.org/abs/2310.08915)|
- [One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models](https://arxiv.org/abs/2310.09499) <br> Hang Shao, Bei Liu, Yanmin Qian |<img width="202" alt="image" src="figures/sensitivity_sparse.png"> |[Paper](https://arxiv.org/abs/2310.09499)|
- [LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery](https://arxiv.org/abs/2310.18356) <br> Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang |<img width="1002" alt="image" src="figures/LoRAShear.png"> |[Github](https://github.com/microsoft/lorashear) <br> [Paper](https://arxiv.org/abs/2310.18356)|
- [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization](https://arxiv.org/abs/2311.01544) <br> Björn Deiseroth, Max Meuer, Nikolas Gritsch, Constantin Eichenberg, Patrick Schramowski, Matthias Aßenmacher, Kristian Kersting |<img width="1002" alt="image" src="figures/FDT.png"> |[Github](https://github.com/Aleph-Alpha/Divergent_Tokens) <br> [Paper](https://arxiv.org/abs/2311.01544)|
- [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models](https://arxiv.org/abs/2311.04902) <br> Rocktim Jyoti Das, Liqun Ma, Zhiqiang Shen |<img width="1002" alt="image" src="figures/GBLM-Pruner.png"> |[Github](https://github.com/VILA-Lab/GBLM-Pruner) <br> [Paper](https://arxiv.org/abs/2311.04902)|
- [E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity](https://arxiv.org/abs/2310.15929) <br> Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang |<img width="1002" alt="image" src="figures/e-sparse.png"> |[Paper](https://arxiv.org/abs/2310.15929)|
- [![Type](https://img.shields.io/badge/Semi-structured-C2A4A6)]() <br>[PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs](https://arxiv.org/abs/2312.15230) <br> Max Zimmer, Megi Andoni, Christoph Spiegel, Sebastian Pokutta |<img width="1002" alt="image" src="figures/PERP.png"> |[Github](https://github.com/ZIB-IOL/PERP) <br> [Paper](https://arxiv.org/abs/2312.15230)|
- [Fast and Optimal Weight Update for Pruned Large Language Models](https://arxiv.org/abs/2401.02938) [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br> Vladimír Boža |<img width="202" alt="image" src="figures/admm.png"> |[Github](https://github.com/fmfi-compbio/admm-pruning) <br> [Paper](https://arxiv.org/abs/2401.02938)|
- [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning](https://arxiv.org/abs/2401.10862) <br> Adib Hasan, Ileana Rugina, Alex Wang |<img width="1002" alt="image" src="figures/eval_safety.png"> |[Github](https://github.com/CrystalEye42/eval-safety) <br> [Paper](https://arxiv.org/abs/2401.10862)|
- ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs
- ![Star - Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang |<img width="1002" alt="image" src="https://github.com/YuchuanTian/RethinkTinyLM/blob/master/fig/improve.png"> |[Github](https://github.com/YuchuanTian/RethinkTinyLM) <br> [Paper](https://arxiv.org/abs/2402.02791)|
- ![Star - Francois Kagey, Virginia Smith, Graham Neubig, Ameet Talwalkar |<img width="1002" alt="image" src="figures/bonsai.png"> |[Github](https://github.com/ldery/Bonsai) <br> [Paper](https://arxiv.org/abs/2402.05406)|
- [Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications](https://arxiv.org/abs/2402.05162) <br> Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson |<img width="1002" alt="image" src="https://boyiwei.com/alignment-attribution/static/images/main.png"> |[Github](https://github.com/boyiwei/alignment-attribution-code) <br> [Paper](https://arxiv.org/abs/2402.05162) <br> [Project](https://boyiwei.com/alignment-attribution/)|
- [SliceGPT: Compress Large Language Models by Deleting Rows and Columns](https://arxiv.org/abs/2401.15024) <br> Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman |<img width="1002" alt="image" src="figures/SliceGPT.png"> |[Github](https://github.com/microsoft/TransformerCompression) <br> [Paper](https://arxiv.org/abs/2401.15024)|
- Efficient Pruning of Large Language Model with Adaptive Estimation Fusion
- [BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation](https://arxiv.org/pdf/2402.16880.pdf) <br> Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="https://arxiv.org/html/2402.16880v1/x1.png"> |[Github](https://github.com/OpenGVLab/LLMPrune-BESA) <br> [Paper](https://arxiv.org/pdf/2402.16880.pdf)|
- LaCo: Large Language Model Pruning via Layer Collapse
- [ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models](https://arxiv.org/abs/2402.13516) <br> Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun |<img width="1002" alt="image" src="https://arxiv.org/html/2402.13516v1/x1.png"> |[Github](https://github.com/Raincleared-Song/sparse_gpu_operator) <br> [Paper](https://arxiv.org/abs/2402.13516) <br> [[Model-7B]](https://huggingface.co/SparseLLM/prosparse-llama-2-7b) [[Model-13B]](https://huggingface.co/SparseLLM/prosparse-llama-2-13b)|
- [EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs](https://arxiv.org/abs/2402.12419) <br> Song Guo, Fan Wu, Lei Zhang, Xiawu Zheng, Shengchuan Zhang, Fei Chao, Yiyu Shi, Rongrong Ji |<img width="1002" alt="image" src="figures/EBFT.png"> |[Github](https://github.com/sunggo/EBFT) <br> [Paper](https://arxiv.org/abs/2402.12419)|
- [![Publish](https://img.shields.io/badge/Workshop-ICLRW'24-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br> [Shortened LLaMA: A Simple Depth Pruning for Large Language Models](https://arxiv.org/abs/2402.02834) <br> Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song |<img width="1002" alt="image" src="figures/ShortenedLLaMA.png"> |[Github](https://github.com/Nota-NetsPresso/shortened-llm)<br>[Paper](https://arxiv.org/abs/2402.02834)|
- NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
- Learn To be Efficient: Build Structured Sparsity in Large Language Models
- [SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks](https://arxiv.org/abs/2402.09025) <br> Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim |<img width="1002" alt="image" src="figures/SLEB.png"> |[Github](https://github.com/leapingjagg-dev/SLEB) <br> [Paper](https://arxiv.org/abs/2402.09025)|
- HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference
- [![Publish](https://img.shields.io/badge/Conference-AAAI'24-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]()<br>[Fluctuation-based Adaptive Structured Pruning for Large Language Models](https://arxiv.org/abs/2312.11983) <br> Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang |<img width="1002" alt="image" src="https://github.com/CASIA-IVA-Lab/FLAP/raw/main/figures/overview.png"> |[Github](https://github.com/CASIA-IVA-Lab/FLAP) <br> [Paper](https://arxiv.org/abs/2312.11983)|
- [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression](https://arxiv.org/abs/2403.15447) <br> Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li |<img width="1002" alt="image" src="https://arxiv.org/html/2403.15447v1/extracted/5477136/fig/teaser.png"> |[Github](https://github.com/decoding-comp-trust/comp-trust) <br> [Paper](https://arxiv.org/abs/2403.15447) <br> [Project](https://decoding-comp-trust.github.io)|
- Compressing Large Language Models by Streamlining the Unimportant Layer
- LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
- LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models
- [CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models](https://arxiv.org/abs/2404.08763) <br> Je-Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini |<img width="1002" alt="image" src="https://arxiv.org/html/2404.08763v1/x5.png"> |[Paper](https://arxiv.org/abs/2404.08763)|
- [![Publish](https://img.shields.io/badge/Conference-NAACL'24%20Findings-blue)]()<br>[Pruning as a Domain-specific LLM Extractor](https://arxiv.org/abs/2405.06275) <br> Nan Zhang, Yanchi Liu, Xujiang Zhao, Wei Cheng, Runxue Bao, Rui Zhang, Prasenjit Mitra, Haifeng Chen |<img width="1002" alt="image" src="https://github.com/psunlpgroup/D-Pruner/raw/main/assets/prune_types_example.png"> |[Github](https://github.com/psunlpgroup/D-Pruner) <br> [Paper](https://arxiv.org/abs/2405.06275)|
- [![Publish](https://img.shields.io/badge/Conference-UNLP'24-blue)]()<br>[Language-Specific Pruning for Efficient Reduction of Large Language Models](https://aclanthology.org/2024.unlp-1.16/) <br> Maksym Shamrai | |[Github](https://github.com/mshamrai/language-specific-pruning) <br> [Paper](https://aclanthology.org/2024.unlp-1.16/)|
- [OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning](https://arxiv.org/abs/2405.05957) <br> Dan Qiao, Yi Su, Pinzheng Wang, Jing Ye, Wenjing Xie, Yuechi Zhou, Yuyang Ding, Zecheng Tang, Jikai Wang, Yixin Ji, Yue Wang, Pei Guo, Zechen Sun, Zikang Zhang, Juntao Li, Pingfu Chao, Wenliang Chen, Guohong Fu, Guodong Zhou, Qiaoming Zhu, Min Zhang |<img width="1002" alt="image" src="figures/OpenBA.png"> |[Github](https://github.com/OpenNLG/OpenBA-v2) <br> [Paper](https://arxiv.org/abs/2405.05957)|
-
Quantization
- Increased LLM Vulnerabilities from Fine-tuning and Quantization
- Lossless and Near-Lossless Compression for Foundation Models
- [How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study](https://arxiv.org/abs/2404.14047) <br> Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno |<img width="1002" alt="image" src="https://arxiv.org/html/2404.14047v1/x1.png"> |[Github](https://github.com/Macaronlin/LLaMA3-Quantization) <br> [Paper](https://arxiv.org/abs/2404.14047) <br> [Model](https://huggingface.co/LLMQ)|
- [![Publish](https://img.shields.io/badge/Conference-NAACL'24%20Findings-blue)]()<br>[When Quantization Affects Confidence of Large Language Models?](https://arxiv.org/abs/2405.00632) <br> Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin |<img width="1002" alt="image" src="figures/quantized-lm-confidence.png"> |[Github](https://github.com/upunaprosk/quantized-lm-confidence) <br> [Paper](https://arxiv.org/abs/2405.00632)|
- [QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving](https://arxiv.org/abs/2405.04532) <br> Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/qserve/blob/main/assets/figures/teaser.png"> |[Github](https://github.com/mit-han-lab/qserve) <br> [Paper](https://arxiv.org/abs/2405.04532)|
- [![Publish](https://img.shields.io/badge/Conference-ICLR'23-blue)]()<br>[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) <br> Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh |<img width="202" alt="image" src="figures/GPTQ.png"> |[Github](https://github.com/IST-DASLab/gptq) <br> [Paper](https://arxiv.org/abs/2210.17323)|
- [![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]() <br>[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438) <br> Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/smoothquant/blob/main/figures/intuition.png"> |[Github](https://github.com/mit-han-lab/smoothquant) <br> [Paper](https://arxiv.org/abs/2211.10438)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() <br>[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) <br> Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer | ![](figures/qlora.png) |[Github](https://github.com/artidoro/qlora) <br> [Paper](https://arxiv.org/abs/2305.14314)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() <br>[QuIP: 2-Bit Quantization of Large Language Models With Guarantees](https://arxiv.org/abs/2307.13304) <br> Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa |<img width="302" alt="image" src="figures/QuIP.png"> |[Github](https://github.com/jerry-chee/QuIP) <br> [Paper](https://arxiv.org/abs/2307.13304)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() <br>[Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing](https://arxiv.org/abs/2306.12929) <br> Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort | ![](figures/QT.png) |[Github](https://github.com/Qualcomm-AI-research/outlier-free-transformers) <br> [Paper](https://arxiv.org/abs/2306.12929)|
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[LLM-FP4: 4-Bit Floating-Point Quantized Transformers](https://arxiv.org/abs/2310.16836) <br> Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng |<img width="1002" alt="image" src="figures/LLM-FP4.png"> |[Github](https://github.com/nbasyl/LLM-FP4) <br> [Paper](https://arxiv.org/abs/2310.16836)|
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23%20Findings-blue)]()<br>[Watermarking LLMs with Weight Quantization](https://arxiv.org/abs/2310.11237) <br> Linyang Li, Botian Jiang, Pengyu Wang, Ke Ren, Hang Yan, Xipeng Qiu |<img width="1002" alt="image" src="figures/watermark_quant.png"> |[Github](https://github.com/Twilight92z/Quantize-Watermark) <br> [Paper](https://arxiv.org/abs/2310.11237)|
- [AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) <br> Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/llm-awq/blob/main/figures/overview.png"> |[Github](https://github.com/mit-han-lab/llm-awq) <br> [Paper](https://arxiv.org/abs/2306.00978)|
- [RPTQ: Reorder-based Post-training Quantization for Large Language Models](https://arxiv.org/abs/2304.01089) <br> Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu | ![](https://github.com/hahnyuan/RPTQ4LLM/blob/master/ims/cover.png) |[Github](https://github.com/hahnyuan/RPTQ4LLM) <br> [Paper](https://arxiv.org/abs/2304.01089)|
- [ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation](https://arxiv.org/abs/2303.08302) |[Paper](https://arxiv.org/abs/2303.08302)|
- [SqueezeLLM: Dense-and-Sparse Quantization](https://arxiv.org/pdf/2306.07629.pdf) <br> Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer |<img width="1102" alt="image" src="figures/SqueezeLLM.png"> |[Github](https://github.com/SqueezeAILab/SqueezeLLM) <br> [Paper](https://arxiv.org/pdf/2306.07629.pdf)|
- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
- Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
- [LLM-QAT: Data-Free Quantization Aware Training for Large Language Models](https://arxiv.org/abs/2305.17888) |[Paper](https://arxiv.org/abs/2305.17888)|
- [SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078) <br> Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh |<img width="1002" alt="image" src="figures/SpQR.png"> |[Github](https://github.com/Vahe1994/SpQR) <br> [Paper](https://arxiv.org/abs/2306.03078)|
- ![Star - Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen |<img width="1002" alt="image" src="figures/QuantizedEmpirical.png"> |[Github](https://github.com/RUCAIBox/QuantizedEmpirical) <br> [Paper](https://arxiv.org/abs/2307.08072)|
- [ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats](https://arxiv.org/abs/2307.09782) |[Paper](https://arxiv.org/abs/2307.09782)|
- FPTQ: Fine-grained Post-Training Quantization for Large Language Models
- QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm
- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
- [Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs](https://arxiv.org/abs/2309.05516) |[Paper](https://arxiv.org/abs/2309.05516)|
- [QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2309.14717) <br> Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian |<img width="1002" alt="image" src="https://github.com/yuhuixu1993/qa-lora/blob/main/image/qalora.png"> |[Github](https://github.com/yuhuixu1993/qa-lora) <br> [Paper](https://arxiv.org/abs/2309.14717)|
- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
- [PB-LLM: Partially Binarized Large Language Models](https://arxiv.org/abs/2310.00034) <br> Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong |<img width="1002" alt="image" src="figures/PB-LLM.png"> |[Github](https://github.com/hahnyuan/PB-LLM) <br> [Paper](https://arxiv.org/abs/2310.00034)|
- Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
- QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
- [TEQ: Trainable Equivalent Transformation for Quantization of LLMs](https://arxiv.org/abs/2310.10944) |[Paper](https://arxiv.org/abs/2310.10944)|
- BitNet: Scaling 1-bit Transformers for Large Language Models
- [Atom: Low-bit Quantization for Efficient and Accurate LLM Serving](https://arxiv.org/abs/2310.19102) <br> Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci |<img width="302" alt="image" src="figures/atom.png"> |[Paper](https://arxiv.org/abs/2310.19102)|
- AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
- A Speed Odyssey for Deployable Quantization of LLMs
- [LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning](https://arxiv.org/abs/2311.12023) <br> Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim |<img width="1002" alt="image" src="figures/LQ-LoRA.png"> |[Github](https://github.com/HanGuo97/lq-lora) <br> [Paper](https://arxiv.org/abs/2311.12023)|
- [Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization](https://arxiv.org/abs/2311.16442) |[Paper](https://arxiv.org/abs/2311.16442)|
- [SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM](https://arxiv.org/abs/2312.03788) <br> Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng |<img width="402" alt="image" src="figures/SmoothQuant+.png"> |[Github](https://github.com/adlik/smoothquant+) <br> [Paper](https://arxiv.org/abs/2312.03788)|
- [ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks](https://arxiv.org/abs/2312.08583) |[Github](https://github.com/microsoft/DeepSpeed) <br> [Paper](https://arxiv.org/abs/2312.08583)|
- [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br>[OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://arxiv.org/abs/2308.13137) <br> Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="figures/omniquant.png"> |[Github](https://github.com/OpenGVLab/OmniQuant) <br> [Paper](https://arxiv.org/abs/2308.13137)|
- L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ - joon Kim |<img width="1002" alt="image" src="figures/L4Q.png"> |[Paper](https://arxiv.org/abs/2402.04902)|
- [QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks](https://arxiv.org/abs/2402.04396) <br> Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa |<img width="1002" alt="image" src="figures/QuIP_sign.png"> |[Github](https://github.com/Cornell-RelaxML/quip-sharp) <br> [Paper](https://arxiv.org/abs/2402.04396)|
- [BiLLM: Pushing the Limit of Post-Training Quantization for LLMs](https://arxiv.org/abs/2402.04291) <br> Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi |<img width="1002" alt="image" src="https://github.com/Aaronhuang-778/BiLLM/blob/main/imgs/main.png"> |[Github](https://github.com/Aaronhuang-778/BiLLM) <br> [Paper](https://arxiv.org/abs/2402.04291)|
- [Accurate LoRA-Finetuning Quantization of LLMs via Information Retention](https://arxiv.org/abs/2402.05445) <br> Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno |<img width="1002" alt="image" src="https://github.com/htqin/IR-QLoRA/blob/main/imgs/overview.png"> |[Github](https://github.com/htqin/ir-qlora) <br> [Paper](https://arxiv.org/abs/2402.05445)|
- ApiQ: Finetuning of 2-Bit Quantized Large Language Model
- [FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design](https://arxiv.org/abs/2401.14112) |[Paper](https://arxiv.org/abs/2401.14112)|
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
- [GPTVQ: The Blessing of Dimensionality for LLM Quantization](https://arxiv.org/abs/2402.15319) <br> Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough |<img width="1002" alt="image" src="https://arxiv.org/html/2402.15319v1/extracted/5412979/fig/new_fig1a_blue.png"> |[Github](https://github.com/qualcomm-ai-research/gptvq) <br> [Paper](https://arxiv.org/abs/2402.15319)|
- A Comprehensive Evaluation of Quantization Strategies for Large Language Models
- [Evaluating Quantized Large Language Models](https://arxiv.org/abs/2402.18158) <br> Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang |<img width="302" alt="image" src="figures/qllm-eval.png"> |[Github](https://github.com/thu-nics/qllm-eval) <br> [Paper](https://arxiv.org/abs/2402.18158)|
- FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
- Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers - young Kim, Joonyoung Kim, Yongkweon Jeon |<img width="1002" alt="image" src="https://arxiv.org/html/2402.08958v1/x1.png"> |[Paper](https://arxiv.org/abs/2402.08958)|
- [EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge](https://arxiv.org/abs/2402.10787) <br> Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam Leeser, Pu Zhao, Yanzhi Wang |<img width="1002" alt="image" src="figures/EdgeQAT.png"> |[Github](https://github.com/shawnricecake/EdgeQAT) <br> [Paper](https://arxiv.org/abs/2402.10787)|
- [BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation](https://arxiv.org/abs/2402.10631) <br> Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu |<img width="202" alt="image" src="https://github.com/DD-DuDa/BitDistiller/raw/main/imgs/overview.jpg"> |[Github](https://github.com/DD-DuDa/BitDistiller) <br> [Paper](https://arxiv.org/abs/2402.10631)|
- ![Star - bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points](https://arxiv.org/abs/2404.12759) <br> Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu |<img width="1002" alt="image" src="https://github.com/bytedance/decoupleQ/raw/main/imgs/img.png"> |[Github](https://github.com/bytedance/decoupleQ) <br> [Paper](https://arxiv.org/abs/2404.12759)|
- OneBit: Towards Extremely Low-bit Large Language Models
- ![Star - Tune May Only Be Worth One Bit](https://arxiv.org/abs/2402.10193) <br> James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai |<img width="1002" alt="image" src="https://github.com/FasterDecoding/BitDelta/raw/main/figures/BitDelta.png"> |[Github](https://github.com/FasterDecoding/BitDelta) <br> [Paper](https://arxiv.org/abs/2402.10193)|
- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
- Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
- ![Star - Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456) <br> Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman |<img width="1002" alt="image" src="https://github.com/spcl/QuaRot/blob/main/img/fig1.png"> |[Github](https://github.com/spcl/QuaRot) <br> [Paper](https://arxiv.org/abs/2404.00456)|
- Accurate Block Quantization in LLMs with Outliers
- [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br>[AffineQuant: Affine Transformation Quantization for Large Language Models](https://arxiv.org/abs/2403.12544) <br> Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji |<img width="1002" alt="image" src="https://github.com/bytedance/AffineQuant/blob/main/fig/overview.png"> |[Github](https://github.com/bytedance/AffineQuant) <br> [Paper](https://arxiv.org/abs/2403.12544)|
- What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
- FrameQuant: Flexible Low-Bit Quantization for Transformers
- [QAQ: Quality Adaptive Quantization for LLM KV Cache](https://arxiv.org/abs/2403.04643) <br> Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2403.04643v1/x1.png"> |[Github](https://github.com/ClubieDong/QAQ-KVCacheQuantization) <br> [Paper](https://arxiv.org/abs/2403.04643)|
- Quantization of Large Language Models with an Overdetermined Basis
- ![Star - window Key and Value Cache Quantization for Large Language Models](https://arxiv.org/abs/2405.06219) <br> Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin |<img width="1002" alt="image" src="figures/SKVQ.png"> |[Github](https://github.com/cat538/SKVQ) <br> [Paper](https://arxiv.org/abs/2405.06219)|
- ![Star - QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models](https://arxiv.org/abs/2405.06001) <br> Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Yunchen Zhang, Xianglong Liu, Dacheng Tao |<img width="1002" alt="image" src="https://github.com/ModelTC/llmc/raw/main/imgs/best_practice.png"> |[Github](https://github.com/ModelTC/llmc) <br> [Paper](https://arxiv.org/abs/2405.06001)|
-
Efficient MOE
- Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
- ![Publish - of-Experts Training via Whole Graph Computation-Communication Overlapping](https://arxiv.org/abs/2404.19429) <br> Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2404.19429v1/x4.png"> |[Paper](https://arxiv.org/abs/2404.19429)|
- SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
- ![Star - of-Experts Attention](https://arxiv.org/abs/2312.07987) <br> Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber |<img width="1002" alt="image" src="figures/switchhead.png"> |[Github](https://github.com/robertcsordas/moe_attention) <br> [Paper](https://arxiv.org/abs/2312.07987)|
- ![Star - Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference](https://arxiv.org/abs/2401.08383) <br> Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda |<img width="1002" alt="image" src="figures/exflow.png"> |[Github](https://github.com/YJHMITWEB/ExFlow) <br> [Paper](https://arxiv.org/abs/2401.08383)|
- [MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving](https://arxiv.org/abs/2401.14361) <br> Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina |<img width="1002" alt="image" src="figures/MOE-Infinity.png"> |[Github](https://github.com/TorchMoE/MoE-Infinity) <br> [Paper](https://arxiv.org/abs/2401.14361)|
- [Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models](https://arxiv.org/abs/2402.14800) <br> Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li |<img width="1002" alt="image" src="https://arxiv.org/html/2402.14800v1/x2.png"> |[Github](https://github.com/Lucky-Lance/Expert_Sparsity) <br> [Paper](https://arxiv.org/abs/2402.14800)|
- ![Star - prompted Mixture of Experts for Efficient LLM Generation](https://arxiv.org/abs/2404.01365) <br> Harry Dong, Beidi Chen, Yuejie Chi |<img width="1002" alt="image" src="https://arxiv.org/html/2404.01365v1/extracted/5509263/figures/algorithm.png"> |[Github](https://github.com/hdong920/GRIFFIN) <br> [Paper](https://arxiv.org/abs/2404.01365)|
- ![Star - GPU Orchestration for Fast Inference of Mixture-of-Experts Models](https://arxiv.org/abs/2402.07033) <br> Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci |<img width="1002" alt="image" src="https://github.com/efeslab/fiddler/blob/main/asset/key-idea.png"> |[Github](https://github.com/efeslab/fiddler) <br> [Paper](https://arxiv.org/abs/2402.07033)|
- Enhancing Efficiency in Sparse Models with Sparser Selection
-
Text Compression
- [Rethinking LLM Memorization through the Lens of Adversarial Compression](https://arxiv.org/abs/2404.15146) <br> Avi Schwarzschild, Zhili Feng, Pratyush Maini, Zachary C. Lipton, J. Zico Kolter |<img width="1002" alt="image" src="https://pratyushmaini-acr-viewer.hf.space/file=/tmp/gradio/f054d27283291fa78df9949f26ac605dbea31398/ACR.png"> |[Github](https://github.com/locuslab/acr-memorization/) <br> [Paper](https://arxiv.org/abs/2404.15146) <br> [Project](https://locuslab.github.io/acr-memorization/)|
- ![Publish - Information Optimization for Language Model-based Text Compression](https://arxiv.org/abs/2308.13399) <br> Alexander Tsvetkov, Alon Kipnis |<img width="1002" alt="image" src="figures/EntropyRank.png"> |[Paper](https://arxiv.org/abs/2308.13399)|
- LLMZip: Lossless Text Compression using Large Language Models - Francois Chamberland, Srinivas Shakkottai |<img width="1002" alt="image" src="figures/LLMZip.png"> |[Paper](https://arxiv.org/abs/2306.04050) \| [Unofficial Github](https://github.com/erika-n/GPTzip)|
- [Adapting Language Models to Compress Contexts](https://arxiv.org/abs/2305.14788) <br> Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen |<img width="202" alt="image" src="figures/AutoCompressor.png"> |[Github](https://github.com/princeton-nlp/AutoCompressors) <br> [Paper](https://arxiv.org/abs/2305.14788)|
- In-context Autoencoder for Context Compression in a Large Language Model - Qing Chen, Furu Wei |<img width="502" alt="image" src="figures/ICAE.png"> |[Paper](https://arxiv.org/abs/2307.06945)|
- Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Model
- Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning - Ting Cheng, Mao Yang |<img width="1002" alt="image" src="figures/CoT-Max.png"> |[Paper](https://arxiv.org/abs/2312.08901)|
- Learning to Compress Prompt in Natural Language Formats - Neng Chuang, Tianwei Xing, Chia-Yuan Chang, Zirui Liu, Xun Chen, Xia Hu |<img width="1002" alt="image" src="https://arxiv.org/html/2402.18700v1/x1.png"> |[Paper](https://arxiv.org/abs/2402.18700)|
- ProPD: Dynamic Token Tree Pruning and Generation for LLM Parallel Decoding
- PROMPT-SAW: Leveraging Relation-Aware Graphs for Textual Prompt Compression
- Training LLMs over Neurally Compressed Text - Dickstein, Noah Constant |<img width="302" alt="image" src="https://arxiv.org/html/2404.03626v1/x1.png"> |[Paper](https://arxiv.org/abs/2404.03626)|
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression
- [PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models](https://arxiv.org/abs/2403.17411) <br> Jinyi Li, Yihuai Lan, Lei Wang, Hao Wang |<img width="1002" alt="image" src="https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression/raw/main/imgs/architecture.png"> |[Github](https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression) <br> [Paper](https://arxiv.org/abs/2403.17411)|
-
Hardware/System
- Transformer-Lite: High-efficiency Deployment of Large Language Models on Mobile Phone GPUs
- [Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity](https://arxiv.org/abs/2404.14527). Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica. [[Paper]](https://arxiv.org/abs/2404.14527)[[Github]](https://github.com/tyler-griggs/melange-release)
- Expert Router: Orchestrating Efficient Language Model Inference through Prompt Classification
- Efficient and Economic Large Language Model Inference with Attention Offloading
- [Efficient LLM Inference on CPUs](https://arxiv.org/abs/2311.00502). Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng. [[Paper]](https://arxiv.org/abs/2311.00502)[[Github]](https://github.com/intel/intel-extension-for-transformers)
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory
- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA
- Efficient LLM inference solution on Intel GPU - extension-for-pytorch/tree/v2.1.10%2Bxpu/examples/gpu/inference/python/llm)
- [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180). Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica. [[Paper]](https://arxiv.org/abs/2309.06180)[[Github]](https://github.com/vllm-project/vllm)
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
- ![Publish - Generation AI Accelerator Design Automation via Large Language Models](https://arxiv.org/abs/2309.10730). Yonggan Fu, Yongan Zhang, Zhongzhi Yu, Sixu Li, Zhifan Ye, Chaojian Li, Cheng Wan, Yingyan Lin. [[Paper]](https://arxiv.org/abs/2309.10730)
- Rethinking Memory and Communication Cost for Efficient Large Language Model Training
- Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
- FlashDecoding++: Faster Large Language Model Inference on GPUs
- [DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference](https://arxiv.org/abs/2401.08671). Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, Yuxiong He. [[Paper]](https://arxiv.org/abs/2401.08671)[[Github]](https://github.com/microsoft/DeepSpeed-MII)
- ![Publish - Throughput Generative Inference of Large Language Models with a Single GPU](https://arxiv.org/abs/2303.06865). Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang. [[Paper]](https://arxiv.org/abs/2303.06865)[[Github]](https://github.com/FMInference/FlexGen)
- BurstAttention: An Efficient Distributed Attention Framework for Extremely Long Sequences
- DeFT: Flash Tree-attention with IO-Awareness for Efficient Tree-search-based LLM Inference
- [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https://arxiv.org/abs/2205.14135). Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. [[Paper]](https://arxiv.org/abs/2205.14135)[[Github]](https://github.com/Dao-AILab/flash-attention)
- ![Star - aware Interleaving and Conflict-free Kernel for efficient LLM inference](https://arxiv.org/abs/2402.10076). Taesu Kim, Jongho Lee, Daehyun Ahn, Sarang Kim, Jiwoong Choi, Minkyu Kim, Hyungjun Kim. [[Paper]](https://arxiv.org/abs/2402.10076)[[Github]](https://github.com/SqueezeBits/QUICK)
- [Efficiently Programming Large Language Models using SGLang](https://arxiv.org/abs/2312.07104). Lianmin Zheng*, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, Ying Sheng*. [[Paper]](https://arxiv.org/abs/2312.07104) [[Github]](https://github.com/sgl-project/sglang/tree/main)
- MELTing point: Mobile Evaluation of Language Transformers
- LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism
-
Paper from May 26, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
Please check out all the papers by selecting the sub-area you're interested in. On this page, we're showing papers released in the past 60 days.
- [![Publish](https://img.shields.io/badge/Conference-ACL24'Findings-blue)]()<br>[Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations](https://arxiv.org/abs/2407.05690) <br> Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.05690v1/x2.png"> |[Github](https://github.com/sbwww/TransAct-pruning) <br> [Paper](https://arxiv.org/abs/2407.05690)|[//]: #07/10
- [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression](https://arxiv.org/abs/2407.04965) <br> Zhichao Xu, Ashim Gupta, Tao Li, Oliver Bentham, Vivek Srikumar | |[Github](https://github.com/zhichaoxu-shufe/Beyond-Perplexity-Compression-Safety-Eval) <br> [Paper](https://arxiv.org/abs/2407.04965)|[//]: #07/10
- ![Star - tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization](https://arxiv.org/abs/2407.08044) <br> Xijie Huang, Zechun Liu, Shih-Yang Liu, Kwang-Ting Cheng |<img width="1002" alt="image" src="https://arxiv.org/html/2407.08044v1/x1.png"> |[Github](https://github.com/HuangOwen/RoLoRA) <br> [Paper](https://arxiv.org/abs/2407.08044)|[//]: #07/12
- ![Star - Quantized LLMs](https://arxiv.org/abs/2407.10960) <br> Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim |<img width="302" alt="image" src="https://arxiv.org/html/2407.10960v1/x1.png"> |[Github](https://github.com/HanGuo97/flute) <br> [Paper](https://arxiv.org/abs/2407.10960)|[//]: #07/16
- [FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation](https://arxiv.org/abs/2407.07093) <br> Liqun Ma, Mingjie Sun, Zhiqiang Shen |<img width="1002" alt="image" src="https://github.com/LiqunMa/FBI-LLM/blob/main/figures/structure_and_training_procedure.png"> |[Github](https://github.com/LiqunMa/FBI-LLM) <br> [Paper](https://arxiv.org/abs/2407.07093)|[//]: #07/10
- ![Star - Tuning of Quantized Large Language Models Through Optimal Balance](https://arxiv.org/abs/2407.17029) <br> Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li |<img width="1002" alt="image" src="figures/Q-BaRA.png"> |[Github](https://github.com/xiaocaigou/qbaraqahira) <br> [Paper](https://arxiv.org/abs/2407.17029)|[//]: #07/26
- [![Publish](https://img.shields.io/badge/Conference-ICML'24%20WANT-blue)]()<br>[Scalify: scale propagation for efficient low-precision LLM training](https://arxiv.org/abs/2407.17353) <br> Paul Balança, Sam Hosegood, Carlo Luschi, Andrew Fitzgibbon | |[Github](https://github.com/graphcore-research/jax-scalify) <br> [Paper](https://arxiv.org/abs/2407.17353)|[//]: #07/26
- ![Star - grained Pruning for Large Language Models](https://arxiv.org/abs/2406.10594) <br> Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li |<img width="1002" alt="image" src="https://arxiv.org/html/2406.10594v2/x3.png"> |[Github](https://github.com/MrGGLS/BlockPruner) <br> [Paper](https://arxiv.org/abs/2406.10594)|[//]: #07/05
- [T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge](https://arxiv.org/abs/2407.00088) <br> Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.00088v1/x2.png"> |[Github](https://github.com/microsoft/T-MAC) <br> [Paper](https://arxiv.org/abs/2407.00088)|[//]: #07/03
- [LiveMind: Low-latency Large Language Models with Simultaneous Inference](https://arxiv.org/abs/2406.14319) <br> Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li |<img width="1002" alt="image" src="https://arxiv.org/html/2406.14319v1/x1.png"> |[Github](https://github.com/ChuangtaoChen-TUM/LiveMind) <br> [Paper](https://arxiv.org/abs/2406.14319)|[//]: #07/05
- [Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs](https://arxiv.org/abs/2407.00945) <br> Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.00945v1/extracted/5697370/Figures/use_case.png"> |[Github](https://github.com/imagination-research/EEP) <br> [Paper](https://arxiv.org/abs/2407.00945)|[//]: #07/03
- [![Publish](https://img.shields.io/badge/Conference-ACL'24%20Findings-blue)]()<br>[Efficient Sparse Attention needs Adaptive Token Release](https://arxiv.org/abs/2407.02328) <br> Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li |<img width="1002" alt="image" src="https://arxiv.org/html/2407.02328v1/x1.png"> |[Github](https://github.com/WHUIR/ADORE) <br> [Paper](https://arxiv.org/abs/2407.02328)|[//]: #07/05
- ![Star - Neng Chuang, Songchen Li et al |<img width="1002" alt="image" src="figures/longctx_bench.png"> |[Github](https://github.com/henryzhongsc/longctx_bench) <br> [Paper](https://arxiv.org/abs/2407.01527)|[//]: #07/03
- [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br>[Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning](https://arxiv.org/abs/2407.01320) <br> Haobo Song, Hao Zhao, Soumajit Majumder, Tao Lin |<img width="1002" alt="image" src="https://arxiv.org/html/2407.01320v1/x2.png"> |[Github](https://github.com/LINs-lab/CapaBoost) <br> [Paper](https://arxiv.org/abs/2407.01320)|[//]: #07/03
- ![Star - Aware Training for Large Language Models](https://arxiv.org/abs/2407.11062) <br> Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="https://arxiv.org/html/2407.11062v1/x5.png"> |[Github](https://github.com/OpenGVLab/EfficientQAT) <br> [Paper](https://arxiv.org/abs/2407.11062)|[//]: #07/21
- ![Star - Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices](https://arxiv.org/abs/2407.11534) <br> Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee |<img width="1002" alt="image" src="https://arxiv.org/html/2407.11534v1/extracted/5734567/Figures/Fig_ablation_samplesize_flexround.png"> |[Github](https://github.com/onliwad101/FlexRound_LRQ) <br> [Paper](https://arxiv.org/abs/2407.11534)|[//]: #07/21
- [GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression](https://arxiv.org/abs/2407.12077) <br> Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah |<img width="202" alt="image" src="https://github.com/recursal/GoldFinch-paper/raw/main/assets/architecture.png"> |[Github](https://github.com/recursal/GoldFinch-paper) <br> [Paper](https://arxiv.org/abs/2407.12077)|[//]: #07/21
-
Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.
- Large Language Model Pruning - Jia Song, Hsing-Kuo Pao |<img width="1002" alt="image" src="https://arxiv.org/html/2406.00030v1/x1.png"> |[Paper](https://arxiv.org/abs/2406.00030)|[//]: #06/05
- ![Star - 2: Accurate Inference for Regressive Lightweight Speculative Decoding](https://arxiv.org/abs/2408.00264) <br> Bin Xiao, Lujun Gui, Lei Su, Weipeng Chen |<img width="1002" alt="image" src="https://github.com/XiaoBin1992/clover/raw/v1/figs/structure.png"> |[Github](https://github.com/XiaoBin1992/clover) <br> [Paper](https://arxiv.org/abs/2408.00264)|[//]: #08/08
- ![Star - Rank Space for KV Cache Compression](https://arxiv.org/abs/2408.05646) <br> Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy |<img width="1002" alt="image" src="https://arxiv.org/html/2408.05646v1/x1.png"> |[Github](https://github.com/UtkarshSaxena1/EigenAttn) <br> [Paper](https://arxiv.org/abs/2408.05646)|[//]: #08/13
- ![Star - Cache with Low-Rank Projection](https://arxiv.org/abs/2407.21118) <br> Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Kai-Chiang Wu |<img width="1002" alt="image" src="https://github.com/shadowpa0327/Palu/blob/master/img/palu_idea.png"> |[Github](https://github.com/shadowpa0327/Palu) <br> [Paper](https://arxiv.org/abs/2407.21118)|[//]: #08/08
- ![Star - 1.png"> |[Github](https://github.com/ZongqianLi/500xCompressor) <br> [Paper](https://arxiv.org/abs/2408.03094)|[//]: #08/08
- ![Star - Context Reasoning through Query-Guided Context Compression](https://arxiv.org/abs/2408.00274) <br> Wenshan Wang, Yihang Wang, Yixing Fan, Huaming Liao, Jiafeng Guo |<img width="1002" alt="image" src="https://github.com/Wenshansilvia/attention_compressor/blob/main/assets/method.png"> |[Github](https://github.com/Wenshansilvia/attention_compressor) <br> [Paper](https://arxiv.org/abs/2408.00274)|[//]: #08/08
- ![Publish - based Applications with Semantic Variable](https://arxiv.org/abs/2405.19888) <br> Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu |<img width="1002" alt="image" src="figures/parrot.png"> |[Paper](https://arxiv.org/abs/2405.19888)| [//]: #05/31
- FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models
- CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
- SpinQuant -- LLM quantization with learned rotations
- I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models |[Paper](https://arxiv.org/abs/2405.17849)| [//]: #05/29
- Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs
- Faster Cascades via Speculative Decoding
- ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
-
Please check out all the papers by selecting the sub-area you're interested in. On this page, we're showing papers released in the past 30 days.
- ![Star - based Contextual Sparsity for Large Language Models](https://arxiv.org/abs/2406.16635) <br> Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah |<img width="1002" alt="image" src="https://arxiv.org/html/2406.16635v1/x4.png"> |[Github](https://github.com/abdelfattah-lab/shadow_llm/) <br> [Paper](https://arxiv.org/abs/2406.16635)|[//]: #06/26
- ![Star - Wise Quantization: A Simple and Effective Approach to Quantize LLMs](https://arxiv.org/abs/2406.17415) <br> Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu |<img width="202" alt="image" src="https://arxiv.org/html/2406.17415v1/x1.png"> |[Github](https://github.com/RazvanDu/LayerwiseQuant) <br> [Paper](https://arxiv.org/abs/2406.17415)|[//]: #06/26
- ![Star - Bit Quantization for Large Language Models](https://arxiv.org/abs/2406.09904) <br> Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin |<img width="202" alt="image" src="https://arxiv.org/html/2406.09904v1/x1.png"> |[Github](https://github.com/HandH1998/QQQ) <br> [Paper](https://arxiv.org/abs/2406.09904)|[//]: #06/18
- [MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression](https://arxiv.org/abs/2406.14909) <br> Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang |<img width="1002" alt="image" src="https://github.com/thu-nics/MoA/blob/master/assets/workflow.png"> |[Github](https://github.com/thu-nics/MoA) <br> [Paper](https://arxiv.org/abs/2406.14909)|[//]: #06/26
- [Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark](https://arxiv.org/abs/2406.08155) <br> Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2406.08155v1/x1.png"> |[Github](https://github.com/UNITES-Lab/moe-quantization) <br> [Paper](https://arxiv.org/abs/2406.08155)|[//]: #06/18
- [MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding](https://arxiv.org/abs/2406.09297) <br> Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji |<img width="1002" alt="image" src="https://arxiv.org/html/2406.09297v1/extracted/5665367/resources/mlkv-All_KV.png"> |[Github](https://github.com/zaydzuhri/pythia-mlkv) <br> [Paper](https://arxiv.org/abs/2406.09297)|[//]: #06/18
- [![Publish](https://img.shields.io/badge/Conference-DAC'24-blue)]()<br>[EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting](https://arxiv.org/abs/2406.15758) <br> Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin |<img width="1002" alt="image" src="https://github.com/GATECH-EIC/Edge-LLM/blob/main/images/Edge-LLM-overview.png"> |[Github](https://github.com/GATECH-EIC/Edge-LLM) <br> [Paper](https://arxiv.org/abs/2406.15758)|[//]: #06/26
- ![Star - Matching Distillation of Large Language Models](https://arxiv.org/abs/2406.02959) <br> Chen Jia |<img width="1002" alt="image" src="https://arxiv.org/html/2406.02959v1/x1.png"> |[Github](https://github.com/jiachenwestlake/MMKD) <br> [Paper](https://arxiv.org/abs/2406.02959)|[//]: #06/11
- [![Publish](https://img.shields.io/badge/Conference-ICML'24-blue)]()<br>[Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models](https://arxiv.org/abs/2406.02924) <br> Peijie Dong, Lujun Li, Zhenheng Tang, Xiang Liu, Xinglin Pan, Qiang Wang, Xiaowen Chu |<img width="1002" alt="image" src="https://raw.githubusercontent.com/pprp/Pruner-Zero/main/.github/images/pruner-zero-main-figure.png"> |[Github](https://github.com/pprp/Pruner-Zero) <br> [Paper](https://arxiv.org/abs/2406.02924)|[//]: #06/11
- [SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs](https://arxiv.org/abs/2405.16325) <br> Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi |<img width="1002" alt="image" src="https://arxiv.org/html/2405.16325v1/x1.png"> |[Github](https://github.com/Mohammad-Mozaffari/slope) <br> [Paper](https://arxiv.org/abs/2405.16325)| [//]: #05/29
- [SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models](https://arxiv.org/abs/2405.16057) <br> Xudong Lu, Aojun Zhou, Yuhui Xu, Renrui Zhang, Peng Gao, Hongsheng Li |<img width="1002" alt="image" src="https://github.com/Lucky-Lance/SPP/raw/main/asserts/SPP.png"> |[Github](https://github.com/Lucky-Lance/SPP) <br> [Paper](https://arxiv.org/abs/2405.16057)| [//]: #05/29
- [ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization](https://arxiv.org/abs/2406.05981) <br> Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin |<img width="1002" alt="image" src="https://github.com/GATECH-EIC/ShiftAddLLM/raw/main/assets/overview.jpg"> |[Github](https://github.com/GATECH-EIC/ShiftAddLLM) <br> [Paper](https://arxiv.org/abs/2406.05981)|[//]: #06/11
- [Exploiting LLM Quantization](https://arxiv.org/abs/2405.18137) <br> Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev |<img width="1002" alt="image" src="figures/exploiting_llm_quantization.png"> |[Github](https://github.com/eth-sri/llm-quantization-attack) <br> [Paper](https://arxiv.org/abs/2405.18137)| [//]: #05/29
- [SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models](https://arxiv.org/abs/2405.14917) <br> Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi |<img width="1002" alt="image" src="https://github.com/Aaronhuang-778/SliM-LLM/blob/main/imgs/[email protected]"> |[Github](https://github.com/Aaronhuang-778/SliM-LLM) <br> [Paper](https://arxiv.org/abs/2405.14917)| [//]: #05/29
- [PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression](https://arxiv.org/abs/2405.14852) <br> Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik |<img width="1002" alt="image" src="figures/pv-tuning.png"> |[Github](https://github.com/Vahe1994/AQLM/tree/pv-tuning) <br> [Paper](https://arxiv.org/abs/2405.14852)| [//]: #05/29
- [![Publish](https://img.shields.io/badge/Conference-ICML'24-blue)]()<br>[When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models](https://arxiv.org/abs/2406.07368) <br> Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin |<img width="1002" alt="image" src="https://arxiv.org/html/2406.07368v1/x5.png"> |[Github](https://github.com/GATECH-EIC/Linearized-LLM) <br> [Paper](https://arxiv.org/abs/2406.07368)|[//]: #06/12
- ![Star - research/Q-LLM)<br>[QuickLLaMA: Query-aware Inference Acceleration for Large Language Models](https://arxiv.org/abs/2406.07528) <br> Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia |<img width="1002" alt="image" src="https://github.com/dvlab-research/Q-LLM/raw/master/img/framework.png"> |[Github](https://github.com/dvlab-research/Q-LLM) <br> [Paper](https://arxiv.org/abs/2406.07528)|[//]: #06/12
- ![Star - prompt-decoding)<br>[Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference](https://arxiv.org/abs/2405.18628) <br> Hao (Mark)Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan |<img width="1002" alt="image" src="https://github.com/hmarkc/parallel-prompt-decoding/raw/main/assets/Overview.png"> |[Github](https://github.com/hmarkc/parallel-prompt-decoding) <br> [Paper](https://arxiv.org/abs/2405.18628)| [//]: #05/31
- ![Star - Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead](https://arxiv.org/abs/2406.03482) <br> Amir Zandieh, Majid Daliri, Insu Han |<img width="1002" alt="image" src="figures/QJL.png"> |[Github](https://github.com/amirzandieh/QJL) <br> [Paper](https://arxiv.org/abs/2406.03482)|[//]: #06/11
- [![Publish](https://img.shields.io/badge/Conference-ACL'24%20Findings-blue)]()<br>[Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning](https://arxiv.org/abs/2406.03792) <br> Naibin Gu, Peng Fu, Xiyu Liu, Bowen Shen, Zheng Lin, Weiping Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2406.03792v1/x5.png"> |[Github](https://github.com/gccnlp/Light-PEFT) <br> [Paper](https://arxiv.org/abs/2406.03792)|[//]: #06/12
- [Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models](https://arxiv.org/abs/2405.14297) <br> Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Tao Lin |<img width="1002" alt="image" src="figures/dynmoe.png"> |[Github](https://github.com/LINs-lab/DynMoE) <br> [Paper](https://arxiv.org/abs/2405.14297)| [//]: #05/29
- [Block Transformer: Global-to-Local Language Modeling for Fast Inference](https://arxiv.org/abs/2406.02657) <br> Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun |<img width="1002" alt="image" src="https://arxiv.org/html/2406.02657v1/x1.png"> |[Github](https://github.com/itsnamgyu/block-transformer) <br> [Paper](https://arxiv.org/abs/2406.02657)|[//]: #06/12
-
-
Paper from July 13, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
Quantization
- LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid
- [SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs](https://arxiv.org/abs/2410.09615) <br> Mohammad Mozaffari, Maryam Mehri Dehnavi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09615v1/x1.png"> |[Github](https://github.com/Mohammad-Mozaffari/slim) <br> [Paper](https://arxiv.org/abs/2410.09615)|[//]: #10/21
- ![Star - Aware Post-Training Weight-Only Quantization For LLMs](https://arxiv.org/abs/2410.12187) <br> Yingsong Luo, Ling Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.12187v2/x1.png"> |[Github](https://github.com/LuoYingSong/DAQ) <br> [Paper](https://arxiv.org/abs/2410.12187)|[//]: #10/21
- [Quamba: A Post-Training Quantization Recipe for Selective State Space Models](https://arxiv.org/abs/2410.13229) <br> Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13229v1/extracted/5933363/figures/outliers.png"> |[Github](https://github.com/enyac-group/Quamba) <br> [Paper](https://arxiv.org/abs/2410.13229)|[//]: #10/21
- Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
- ![Star - bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs](https://arxiv.org/abs/2410.16144) <br> Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16144v2/x1.png"> |[Github](https://github.com/microsoft/BitNet) <br> [Paper](https://arxiv.org/abs/2410.16144)|[//]: #10/30
- ![Star - Yaacov, Ron Banner, Kfir Yehuda Levy |<img width="1002" alt="image" src="figures/EXAQ.png"> |[Github](https://github.com/Anonymous1252022/EXAQ) <br> [Paper](https://arxiv.org/abs/2410.03185)|[//]: #10/14
-
Knowledge Distillation
- Enhancing Data-Limited Graph Neural Networks by Actively Distilling Knowledge from Large Language Models
- DDK: Distilling Domain Knowledge for Efficient Large Language Models
- BOND: Aligning LLMs with Best-of-N Distillation
- Key-Point-Driven Mathematical Reasoning Distillation of Large Language Model
- Multi-Granularity Semantic Revision for Large Language Model Distillation
- Don't Throw Away Data: Better Sequence Knowledge Distillation
- [MiniPLM: Knowledge Distillation for Pre-Training Language Models](https://arxiv.org/abs/2410.17215) <br> Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang |<img width="1002" alt="image" src="https://github.com/thu-coai/MiniPLM/raw/main/figures/method.png"> |[Github](https://github.com/thu-coai/MiniPLM) <br> [Paper](https://arxiv.org/abs/2410.17215)|[//]: #10/29
-
Network Pruning / Sparsity
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[AlphaPruning: Using Heavy-Tailed Self Regularization Theory for Improved Layer-wise Pruning of Large Language Models](https://arxiv.org/abs/2410.10912) <br> Haiquan Lu, Yefan Zhou, Shiwei Liu, Zhangyang Wang, Michael W. Mahoney, Yaoqing Yang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.10912v1/x1.png"> |[Github](https://github.com/haiquanlu/AlphaPruning) <br> [Paper](https://arxiv.org/abs/2410.10912)|[//]: #10/21
- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training
- Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining
- Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
- A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models
- [EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search](https://arxiv.org/abs/2410.14649) <br> Oliver Sieberling, Denis Kuznedelev, Eldar Kurtic, Dan Alistarh |<img width="1002" alt="image" src="figures/evopress.png"> |[Github](https://github.com/IST-DASLab/EvoPress) <br> [Paper](https://arxiv.org/abs/2410.14649)|[//]: #10/30
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning](https://arxiv.org/abs/2410.07461) <br> Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, Ajay Kumar Jaiswal, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07461v1/x1.png"> |[Github](https://github.com/abx393/llm-pruning-calibration-data) <br> [Paper](https://arxiv.org/abs/2410.07461)|[//]: #10/13
- MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models
- Reconstruct the Pruned Model without Any Retraining
-
Inference Acceleration
- [DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads](https://arxiv.org/abs/2410.10819) <br> Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/duo-attention/raw/main/figures/method1.jpg"> |[Github](https://github.com/mit-han-lab/duo-attention) <br> [Paper](https://arxiv.org/abs/2410.10819)|[//]: #10/21
- Accelerating Large Language Model Inference with Self-Supervised Early Exits
- An Efficient Inference Framework for Early-exit Large Language Models
- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
- Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference
- ![Star - Exit LLMs](https://arxiv.org/abs/2410.18952) <br> Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec |<img width="1002" alt="image" src="https://github.com/MatteoNulli/Vocabulary_pruning/raw/main/src/images/final_nips.svg"> |[Github](https://github.com/MatteoNulli/Vocabulary_pruning) <br> [Paper](https://arxiv.org/abs/2410.18952)|[//]: #10/29
- ![Star - Inspired Adaptive Sparse Activation](https://arxiv.org/abs/2410.18311#) <br> Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen |<img width="1002" alt="image" src="https://wangqinsi1.github.io/coreinfer_page/static/images/overview.png"> |[Github](https://github.com/wangqinsi1/CoreInfer) <br> [Paper](https://arxiv.org/abs/2410.18311#)|[//]: #10/29
- [MagicPIG: LSH Sampling for Efficient LLM Generation](https://arxiv.org/abs/2410.16179) <br> Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16179v2/x15.png"> |[Github](https://github.com/Infini-AI-Lab/MagicPIG) <br> [Paper](https://arxiv.org/abs/2410.16179)|[//]: #10/30
- ![Star - the-Fly Self-Speculative Decoding for LLM Inference Acceleration](https://arxiv.org/abs/2410.06916) <br> Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li |<img width="1002" alt="image" src="https://github.com/hemingkx/SWIFT/raw/main/assets/swift.png"> |[Github](https://github.com/hemingkx/SWIFT) <br> [Paper](https://arxiv.org/abs/2410.06916)|[//]: #10/14
- ![Star - Augmented Generation with Precomputed KV Caches for Chunked Text](https://arxiv.org/abs/2410.07590) <br> Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang |<img width="1002" alt="image" src="https://github.com/MooreThreads/TurboRAG/raw/main/assets/image/TurboRAG.png"> |[Github](https://github.com/MooreThreads/TurboRAG) <br> [Paper](https://arxiv.org/abs/2410.07590)|[//]: #10/13
- Adaptive Draft-Verification for Efficient Large Language Model Decoding
-
Efficient Architecture of LLM
- ![Star - Hay So, Ting Cao, Fan Yang, Mao Yang |<img width="202" alt="image" src="https://arxiv.org/html/2410.13276v1/x4.png"> |[Github](https://github.com/microsoft/SeerAttention) <br> [Paper](https://arxiv.org/abs/2410.13276)|[//]: #10/21
- SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context
- [Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression](https://arxiv.org/abs/2410.03765) <br> Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.03765v1/x1.png"> |[Github](https://github.com/TUDa-HWAI/Basis_Sharing) <br> [Paper](https://arxiv.org/abs/2410.03765)|[//]: #10/14
-
Tuning
- ![Star - Column Updates](https://arxiv.org/abs/2410.10075) <br> Md Kowsher, Tara Esmaeilbeig, Chun-Nam Yu, Mojtaba Soltanalian, Niloofar Yousefi |<img width="1002" alt="image" src="https://github.com/Kowsher/RoCoFT/blob/main/figures/rocoft.png"> |[Github](https://github.com/Kowsher/RoCoFT) <br> [Paper](https://arxiv.org/abs/2410.10075)|[//]: #10/21
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'24-blue)]()<br>[Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models](https://arxiv.org/abs/2410.11772) <br> Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11772v1/x3.png"> |[Github](https://github.com/Kaiseem/IST) <br> [Paper](https://arxiv.org/abs/2410.11772)|[//]: #10/21
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[QEFT: Quantization for Efficient Fine-Tuning of LLMs](https://arxiv.org/abs/2410.08661) <br> Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park |<img width="1002" alt="image" src="https://arxiv.org/html/2410.08661v1/x2.png"> |[Github](https://github.com/xvyaward/qeft) <br> [Paper](https://arxiv.org/abs/2410.08661)|[//]: #10/21
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models](https://arxiv.org/abs/2410.09079) <br> Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09079v1/x1.png"> |[Github](https://github.com/Aofei-Chang/BIPEFT) <br> [Paper](https://arxiv.org/abs/2410.09079)|[//]: #10/21
- Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs
- ![Star - tuning of MLP Layers](https://arxiv.org/abs/2410.07383) <br> Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07383v1/x1.png"> |[Github](https://github.com/sayankotor/sparse_grads) <br> [Paper](https://arxiv.org/abs/2410.07383)|[//]: #10/13
-
Survey
- [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) <br> Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier |<img width="1002" alt="image" src="https://arxiv.org/html/2410.12388v2/extracted/5933385/Figures/tree_overview.png"> |[Github](https://github.com/ZongqianLi/Prompt-Compression-Survey) <br> [Paper](https://arxiv.org/abs/2410.12388)|[//]: #10/21
-
KV Cache Compression
- RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
- ![Star - Wise Dissimilar KV Cache Sharing](https://arxiv.org/abs/2410.18517) <br> Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |<img width="1002" alt="image" src="https://github.com/yangyifei729/KVSharer/raw/main/img/main_fig.jpg"> |[Github](https://github.com/yangyifei729/KVSharer) <br> [Paper](https://arxiv.org/abs/2410.18517)|[//]: #10/29
- ![Star - KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference](https://arxiv.org/abs/2407.11550) <br> Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |<img width="1002" alt="image" src="figures/adakv.png"> |[Github](https://github.com/FFY0/AdaKV) <br> [Paper](https://arxiv.org/abs/2407.11550)|[//]: #10/13
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference
-
Text Compression
-
Efficient MOE
- Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
- [MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More](https://arxiv.org/abs/2410.06270) <br> Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi |<img width="1002" alt="image" src="https://github.com/Aaronhuang-778/MC-MoE/raw/main/imgs/[email protected]"> |[Github](https://github.com/Aaronhuang-778/MC-MoE) <br> [Paper](https://arxiv.org/abs/2410.06270)|[//]: #10/14
-
Low-Rank Decomposition
- [Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https://arxiv.org/abs/2410.16029) <br> Arijit Das | |[Github](https://github.com/selfsupervised-ai/Natural-GaLore) <br> [Paper](https://arxiv.org/abs/2410.16029)|[//]: #10/30
-
Hardware/System
-
-
Inference Acceleration
- XC-Cache: Cross-Attending to Cached Context for Efficient LLM Inference - André Noël, Valentina Zantedeschi, David Vázquez, Nicolas Chapados, Christopher Pal, Perouz Taslakian |<img width="1002" alt="image" src="https://arxiv.org/html/2404.15420v1/x2.png"> |[Paper](https://arxiv.org/abs/2404.15420)|
- [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br>[Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing](https://arxiv.org/abs/2404.14618) <br> Dujian Ding, Ankur Mallick, Chi Wang, Robert Sim, Subhabrata Mukherjee, Victor Ruhle, Laks V.S. Lakshmanan, Ahmed Hassan Awadallah |<img width="1002" alt="image" src="figures/hybridLLM.png"> |[Github](https://github.com/m365-core/hybrid_llm_routing) <br> [Paper](https://arxiv.org/abs/2404.14618)|
- Efficient LLM Inference with Kcache
- Better & Faster Large Language Models via Multi-token Prediction - Paz, Gabriel Synnaeve |<img width="1002" alt="image" src="figures/MBPP.png"> |[Paper](https://arxiv.org/abs/2404.19737)|
- You Only Cache Once: Decoder-Decoder Architectures for Language Models
- ![Star - Speculative Decoding via Double Early Exiting](https://arxiv.org/abs/2404.18911) <br> Fangcheng Liu, Yehui Tang, Zhenhua Liu, Yunsheng Ni, Kai Han, Yunhe Wang |<img width="1002" alt="image" src="https://github.com/Equationliu/Kangaroo/blob/main/imgs/kangaroo.png"> |[Github](https://github.com/Equationliu/Kangaroo) <br> [Paper](https://arxiv.org/abs/2404.18911)|
- Accelerating Speculative Decoding using Dynamic Speculation Length
- Clover: Regressive Lightweight Speculative Decoding with Sequential Knowledge
- [![Publish](https://img.shields.io/badge/Conference-ICML'23%20Oral-blue)]()<br>[Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time](https://openreview.net/forum?id=wIPIhHd00i) <br> Zichang Liu, Jue WANG, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen |<img width="202" alt="image" src="figures/DajeVu.png"> |[Github](https://github.com/FMInference/DejaVu) <br> [Paper](https://openreview.net/forum?id=wIPIhHd00i)|
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding](https://arxiv.org/abs/2310.05424) <br> Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun |<img width="1002" alt="image" src="figures/FREE.png"> |[Github](https://github.com/raymin0223/fast_robust_early_exit) <br> [Paper](https://arxiv.org/abs/2310.05424)|
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[Compressing Context to Enhance Inference Efficiency of Large Language Models](https://arxiv.org/abs/2310.06201) <br> Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin |<img width="1002" alt="image" src="figures/selective_context.png"> |[Github](https://github.com/liyucheng09/Selective_Context) <br> [Paper](https://arxiv.org/abs/2310.06201)|
- Inference with Reference: Lossless Acceleration of Large Language Models
- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
- Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
- [Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding](https://arxiv.org/abs/2309.08168) <br> Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra |<img width="1002" alt="image" src="https://github.com/dilab-zju/self-speculative-decoding/blob/main/assets/intro.png"> |[Github](https://github.com/dilab-zju/self-speculative-decoding) <br> [Paper](https://arxiv.org/abs/2309.08168)|
- [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) <br> Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis |<img width="1002" alt="image" src="https://github.com/mit-han-lab/streaming-llm/blob/main/figures/schemes.png"> |[Github](https://github.com/mit-han-lab/streaming-llm) <br> [Paper](https://arxiv.org/abs/2309.17453)|
- (Dynamic) Prompting might be all you need to repair Compressed LLMs
- ![Star - efficient Reasoning](https://arxiv.org/abs/2310.03094) <br> Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao |<img width="1002" alt="image" src="figures/LLM_MoT_cascade.png"> |[Github](https://github.com/MurongYue/LLM_MoT_cascade) <br> [Paper](https://arxiv.org/abs/2310.03094)|
- CacheGen: Fast Context Loading for Language Model Applications
- [![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[Context Compression for Auto-regressive Transformers with Sentinel Tokens](https://arxiv.org/abs/2310.08152) <br> Siyu Ren, Qi Jia, Kenny Q. Zhu |<img width="1002" alt="image" src="figures/KV_compression.png"> |[Github](https://github.com/DRSY/KV_Compression) <br> [Paper](https://arxiv.org/abs/2310.08152)|
- [A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models](https://arxiv.org/abs/2310.09497) <br> Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, Guido Zuccon |<img width="1002" alt="image" src="figures/Setwise.png"> |[Github](https://github.com/ielab/llm-rankers) <br> [Paper](https://arxiv.org/abs/2310.09497)|
- SPEED: Speculative Pipelined Execution for Efficient Decoding
- Accelerating LLM Inference by Enabling Intermediate Layer Decoding
- Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
- [Compressed Context Memory For Online Language Model Interaction](https://arxiv.org/abs/2312.03414) <br> Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song |<img width="1002" alt="image" src="https://github.com/snu-mllab/Context-Memory/blob/main/image/main.png"> |[Github](https://github.com/snu-mllab/context-memory) <br> [Paper](https://arxiv.org/abs/2312.03414)|
- SparQ Attention: Bandwidth-Efficient LLM Inference - Galley, Charlie Blake, Carlo Luschi, Douglas Orr |<img width="1002" alt="image" src="figures/SparQ.png"> |[Paper](https://arxiv.org/abs/2312.04985)|
- Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
- Cascade Speculative Drafting for Even Faster LLM Inference - Chuan Chang |<img width="1002" alt="image" src="figures/CSDrafting.png"> |[Paper](https://arxiv.org/abs/2312.11462)|
- [Fast Inference of Mixture-of-Experts Language Models with Offloading](https://arxiv.org/abs/2312.17238) <br> Artyom Eliseev, Denis Mazur |<img width="1002" alt="image" src="figures/mixtral_offloading.png"> |[Github](https://github.com/dvmazur/mixtral-offloading) <br> [Paper](https://arxiv.org/abs/2312.17238)|
- ![Star - Yew Lin, Yuqing Yang, Lili Qiu |<img width="1002" alt="image" src="figures/longllmlingua.png"> |[Github](https://github.com/microsoft/LLMLingua) <br> [Paper](https://arxiv.org/abs/2310.06839)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]()<br>[H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models](https://arxiv.org/abs/2306.14048) <br> Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen |<img width="1002" alt="image" src="https://github.com/FMInference/H2O/blob/main/Figs/h2o.jpg"> |[Github](https://github.com/FMInference/H2O) <br> [Paper](https://arxiv.org/abs/2306.14048)|
- ![Star - llm) <br> Yuhui Li, Chao Zhang, and Hongyang Zhang |<img width="302" alt="image" src="https://github.com/SafeAILab/EAGLE/blob/main/figs/fig1.png"> |[Github](https://github.com/SafeAILab/EAGLE) <br> [Blog](https://sites.google.com/view/eagle-llm)|
- LoMA: Lossless Compressed Memory Attention
- APAR: LLMs Can Do Auto-Parallel Auto-Regressive Decoding
- ![Star - Directional Tuning for Lossless Acceleration in Large Language Models](https://arxiv.org/abs/2401.12522) <br> Feng Lin, Hanling Yi, Hongbin Li, Yifan Yang, Xiaotian Yu, Guangming Lu, Rong Xiao |<img width="1002" alt="image" src="figures/BiTA.png"> |[Github](https://github.com/linfeng93/BiTA) <br> [Paper](https://arxiv.org/abs/2401.12522)|
- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
- Recurrent Drafter for Fast Speculative Decoding in Large Language Models
- Optimal Block-Level Draft Verification for Accelerating Speculative Decoding
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration
- Speculative Streaming: Fast LLM Inference without Auxiliary Models
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts
- Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement
- Hierarchical Skip Decoding for Efficient Autoregressive Text Generation
- [GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM](https://arxiv.org/abs/2403.05527) <br> Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao |<img width="1002" alt="image" src="https://github.com/HaoKang-Timmy/GEAR/raw/main/Fig/overview.png"> |[Github](https://github.com/HaoKang-Timmy/GEAR) <br> [Paper](https://arxiv.org/abs/2403.05527)|
- CHAI: Clustered Head Attention for Efficient LLM Inference - Jean Wu |<img width="1002" alt="image" src="figures/chai.png"> |[Paper](https://arxiv.org/abs/2403.08058)|
- [Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models](https://arxiv.org/abs/2404.09529) <br> Siyan Zhao, Daniel Israel, Guy Van den Broeck, Aditya Grover |<img width="302" alt="image" src="https://github.com/siyan-zhao/prepacking/raw/main/figures/prepacking_gif_final.gif"> |[Github](https://github.com/siyan-zhao/prepacking) <br> [Paper](https://arxiv.org/abs/2404.09529)|
- Towards Fast Inference: Exploring and Improving Blockwise Parallel Drafts
- ![Star - gram Parallel Decoding](https://arxiv.org/abs/2404.08698) <br> Jie Ou, Yueming Chen, Wenhong Tian |<img width="602" alt="image" src="figures/ANPD.png"> |[Github](https://github.com/oujieww/ANPD) <br> [Paper](https://arxiv.org/abs/2404.08698)|
- Self-Selected Attention Span for Accelerating Large Language Model Inference
- ![Publish - Guided Early Exiting Method for Accelerating Language Models Inference](https://arxiv.org/abs/2312.11882) <br> Ziqian Zeng, Yihuai Hong, Hongliang Dai, Huiping Zhuang, Cen Chen |<img width="1002" alt="image" src="figures/ConsistentEE.png"> |[Paper](https://arxiv.org/abs/2312.11882)|
- ![Publish - LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction](https://arxiv.org/abs/2310.15556) <br> Junyi Liu, Liangzhi Li, Tong Xiang, Bowen Wang, Yiming Qian |<img width="1002" alt="image" src="figures/TCRA-LLM.png"> |[Paper](https://arxiv.org/abs/2310.15556)|
- [EMS-SD: Efficient Multi-sample Speculative Decoding for Accelerating Large Language Models](https://arxiv.org/abs/2405.07542) <br> Yunsheng Ni, Chuanjian Liu, Yehui Tang, Kai Han, Yunhe Wang |<img width="202" alt="image" src="https://github.com/niyunsheng/EMS-SD/raw/main/assets/fig2-method.png"> |[Github](https://github.com/niyunsheng/EMS-SD) <br> [Paper](https://arxiv.org/abs/2405.07542)|
- Distributed Speculative Inference of Large Language Models
-
Efficient Architecture of LLM
- [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br>[Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs](https://arxiv.org/abs/2404.10308) <br> Woomin Song, Seunghyuk Oh, Sangwoo Mo, Jaehyung Kim, Sukmin Yun, Jung-Woo Ha, Jinwoo Shin |<img width="1002" alt="image" src="figures/homer.png"> |[Github](https://github.com/alinlab/HOMER) <br> [Paper](https://arxiv.org/abs/2404.10308)|
- Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models - Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, Guillaume Desjardins, Arnaud Doucet, David Budden, Yee Whye Teh, Razvan Pascanu, Nando De Freitas, Caglar Gulcehre |<img width="1002" alt="image" src="https://arxiv.org/html/2402.19427v1/x3.png"> |[Paper](https://arxiv.org/abs/2402.19427)|
- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases
- Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding
- [MobiLlama: Towards Accurate and Lightweight Fully Transparent GPT](https://arxiv.org/abs/2402.16840) <br> Omkar Thawakar, Ashmal Vayani, Salman Khan, Hisham Cholakal, Rao M. Anwer, Michael Felsberg, Tim Baldwin, Eric P. Xing, Fahad Shahbaz Khan |<img width="402" alt="image" src="https://github.com/mbzuai-oryx/MobiLlama/raw/main/images/mobillama_generation.gif"> |[Github](https://github.com/mbzuai-oryx/MobiLlama) <br> [Paper](https://arxiv.org/abs/2402.16840) <br>[Model](https://huggingface.co/MBZUAI/MobiLlama-05B) |
- Scaling Efficient LLMs
- Tandem Transformers for Inference Efficient LLMs
-
Survey
- A Survey on Efficient Inference for Large Language Models - Ping Zhang, Yuhan Dong, Yu Wang. [[Paper]](https://arxiv.org/abs/2404.14294)
- A Survey on Model Compression for Large Language Models
- [The Efficiency Spectrum of Large Language Models: An Algorithmic Survey](https://arxiv.org/abs/2312.00678). Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang. [[Paper]](https://arxiv.org/abs/2312.00678)[[Github]](https://github.com/tding1/Efficient-LLM-Survey)
- [Efficient Large Language Models: A Survey](https://arxiv.org/abs/2312.03863). Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang. [[Paper]](https://arxiv.org/abs/2312.03863)[[Github]](https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey)
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
- [Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models](https://arxiv.org/abs/2401.00625). Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao. [[Paper]](https://arxiv.org/abs/2401.00625)[[Github]](https://github.com/tiingweii-shii/Awesome-Resource-Efficient-LLM-Papers)
- ![Star - efficient LLM and Multimodal Foundation Models](https://arxiv.org/abs/2401.08092). Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu. [[Paper]](https://arxiv.org/abs/2401.08092)[[Github]](https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey)
- A Survey on Hardware Accelerators for Large Language Models
- ![Star - Qin Zhang, Yunxin Liu. [[Paper]](https://arxiv.org/abs/2401.05459)[[Github]](https://github.com/MobileLLM/Personal_LLM_Agents_Survey)
- A Comprehensive Survey of Compression Algorithms for Language Models
- [Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding](https://arxiv.org/abs/2401.07851). Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui. [[Paper]](https://arxiv.org/abs/2401.07851)[[Github]](https://github.com/hemingkx/Spec-Bench)[[Blog]](https://sites.google.com/view/spec-bench)
- [Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward](https://arxiv.org/abs/2402.01799). Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta. [[Paper]](https://arxiv.org/abs/2402.01799)[[Github]](https://github.com/nyunAI/Faster-LLM-Survey)
- [A Survey on Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2402.13116). Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou. [[Paper]](https://arxiv.org/abs/2402.13116)[[Github]](https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs)
- Efficient Prompting Methods for Large Language Models: A Survey
- A Survey on Transformer Compression
- Model Compression and Efficient Inference for Large Language Models: A Survey
- A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
-
Paper from June 21, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
Quantization
- SDQ: Sparse Decomposed Quantization for LLM Inference - An Tsai, Stephen W. Keckler, Tushar Krishna |<img width="1002" alt="image" src="https://arxiv.org/html/2406.13868v1/x3.png"> |[Paper](https://arxiv.org/abs/2406.13868)|[//]: #06/24
- ![Star - bit Vector Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2409.17066) <br> Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang |<img width="1002" alt="image" src="figures/VPTQ.png"> |[Github](https://github.com/microsoft/VPTQ) <br> [Paper](https://arxiv.org/abs/2409.17066)|[//]: #09/27
- [INT-FlashAttention: Enabling Flash Attention for INT8 Quantization](https://arxiv.org/abs/2409.16997) <br> Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang |<img width="1002" alt="image" src="https://arxiv.org/html/2409.16997v2/x1.png"> |[Github](https://github.com/INT-FlashAttention2024/INT-FlashAttention) <br> [Paper](https://arxiv.org/abs/2409.16997)|[//]: #09/27
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs](https://arxiv.org/abs/2406.01721) <br> Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, Ying Wei |<img width="1002" alt="image" src="https://github.com/Hsu1023/DuQuant/blob/main/imgs/duquant.png"> |[Github](https://github.com/Hsu1023/DuQuant?tab=readme-ov-file) <br> [Paper](https://arxiv.org/abs/2406.01721)|[//]: #09/27
- Attention-aware Post-training Quantization without Backpropagation - young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon |<img width="1002" alt="image" src="https://arxiv.org/html/2406.13474v1/x1.png"> |[Paper](https://arxiv.org/abs/2406.13474)|[//]: #06/24
- CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
- Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models - Joon Kim |<img width="1002" alt="image" src="https://arxiv.org/html/2406.12311v1/x2.png"> |[Paper](https://arxiv.org/abs/2406.12311)|[//]: #06/23
-
Inference Acceleration
- Optimized Speculative Sampling for GPU Hardware Accelerators
- ![Star - Context LLMs with 1000x Input Token Reduction](https://arxiv.org/abs/2409.17422) <br> Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, Shafiq Joty |<img width="1002" alt="image" src="https://arxiv.org/html/2409.17422v1/x1.png"> |[Github](https://github.com/SalesforceAIResearch/GemFilter) <br> [Paper](https://arxiv.org/abs/2409.17422)|[//]: #09/27
- Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
- Interpreting Attention Layer Outputs with Sparse Autoencoders
- Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
- ![Star - wise Criticality-based Approach for Prefilling Acceleration in LLMs](https://arxiv.org/abs/2409.12490) <br> Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie |<img width="1002" alt="image" src="https://arxiv.org/html/2409.12490v1/x2.png"> |[Github](https://github.com/66RING/CritiPrefill) <br> [Paper](https://arxiv.org/abs/2409.12490)|[//]: #09/21
-
Hardware/System
-
Network Pruning / Sparsity
- ![Star - NeurIPS'24-blue)]() <br>[MaskLLM: Learnable Semi-Structured Sparsity for Large Language Models](https://arxiv.org/abs/2409.17481) <br> Gongfan Fang, Hongxu Yin, Saurav Muralidharan, Greg Heinrich, Jeff Pool, Jan Kautz, Pavlo Molchanov, Xinchao Wang |<img width="302" alt="image" src="https://github.com/NVlabs/MaskLLM/blob/main/assets/animation-LQ.gif"> |[Github](https://github.com/NVlabs/MaskLLM) <br> [Paper](https://arxiv.org/abs/2409.17481)|[//]: #09/27
- ![Star - to-Fine Activation Information](https://arxiv.org/abs/2409.13199) <br> Yuxin Wang, Minghua Ma, Zekun Wang, Jingchang Chen, Huiming Fan, Liping Shan, Qing Yang, Dongliang Xu, Ming Liu, Bing Qin |<img width="1002" alt="image" src="https://arxiv.org/html/2409.13199v1/x1.png"> |[Github](https://github.com/wyxscir/CFSP) <br> [Paper](https://arxiv.org/abs/2409.13199)|[//]: #09/27
- FoldGPT: Simple and Effective Large Language Model Compression Scheme
- Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization
- Optimization-based Structural Pruning for Large Language Models without Back-Propagation - Song Xia |<img width="1002" alt="image" src="https://arxiv.org/html/2406.10576v1/extracted/5669159/imgs/overview5.png"> |[Paper](https://arxiv.org/abs/2406.10576)|[//]: #06/23
-
Knowledge Distillation
- [![Publish](https://img.shields.io/badge/Conference-LREC-COLING'24-blue)]()<br>[LLMR: Knowledge Distillation with a Large Language Model-Induced Reward](https://arxiv.org/abs/2409.12500) <br> Dongheng Li, Yongchang Hao, Lili Mou |<img width="1002" alt="image" src="https://github.com/MANGA-UOFA/Prompt-LLMR/blob/main/LLMR-main/assets/model.png"> |[Github](https://github.com/MANGA-UOFA/Prompt-LLMR) <br> [Paper](https://arxiv.org/abs/2409.12500)|[//]: #09/21
-
KV Cache Compression
- ![Star - Cache with Precision-Aligned Quantization](https://arxiv.org/abs/2409.16546) <br> Yifan Tan, Haoze Wang, Chao Yan, Yangdong Deng |<img width="1002" alt="image" src="https://arxiv.org/html/2409.16546v1/extracted/5867591/Figure6.png"> |[Github](https://github.com/AlignedQuant/AlignedKV) <br> [Paper](https://arxiv.org/abs/2409.16546)|[//]: #09/27
-
Text Compression
- [AlphaZip: Neural Network-Enhanced Lossless Text Compression](https://arxiv.org/abs/2409.15046) <br> Swathi Shree Narashiman, Nitin Chandrachoodan |<img width="1002" alt="image" src="https://arxiv.org/html/2409.15046v1/extracted/5873563/images/architecture_bloack_diagram.png"> |[Github](https://github.com/Swathi-Shree-Narashiman/AlphaZip) <br> [Paper](https://arxiv.org/abs/2409.15046)|[//]: #09/27
- Brevity is the soul of wit: Pruning long files for code generation
-
Tuning
- [Bone: Block Affine Transformation as Parameter Efficient Fine-tuning Methods for Large Language Models](https://arxiv.org/abs/2409.15371) <br> Jiale Kang |<img width="1002" alt="image" src="https://arxiv.org/html/2409.15371v1/extracted/5865415/imgs/bone-free.png"> |[Github](https://github.com/JL-er/Bone) <br> [Paper](https://arxiv.org/abs/2409.15371)|[//]: #09/27
- Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead - Gabrielsson, Jiacheng Zhu, Onkar Bhardwaj, Leshem Choshen et al |<img width="1002" alt="image" src="https://arxiv.org/html/2407.00066v1/x1.png"> |[Paper](https://arxiv.org/abs/2407.00066)|[//]: #07/03
- BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks
-
Survey
- [Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2409.13385) <br> Sourav Verma |<img width="1002" alt="image" src="figures/CCRAG_survey.png"> |[Github](https://github.com/SrGrace/Contextual-Compression) <br> [Paper](https://arxiv.org/abs/2409.13385)|[//]: #09/27
-
Low-Rank Decomposition
-
-
Low-Rank Decomposition
- ![Star - ICML'23-blue)]() <br>[LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation](https://arxiv.org/abs/2306.11222) <br> Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, Tuo Zhao |<img width="302" alt="image" src="figures/LoSparse.png"> |[Github](https://github.com/yxli2123/LoSparse) <br> [Paper](https://arxiv.org/abs/2306.11222)|
- [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]()<br>[Matrix Compression via Randomized Low Rank and Low Precision Factorization](https://arxiv.org/abs/2310.11028) <br> Rajarshi Saha, Varun Srivastava, Mert Pilanci |<img width="1002" alt="image" src="figures/LPLR.png"> |[Github](https://github.com/pilancilab/matrix-compressor) <br> [Paper](https://arxiv.org/abs/2310.11028)|
- TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition |[Paper](https://arxiv.org/abs/2307.00526)|
- LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
- [Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models](https://arxiv.org/abs/2312.07046) <br> Arnav Chavan, Nahush Lele, Deepak Gupta |<img width="1002" alt="image" src="figures/LLM-ROM.png"> |[Github](https://github.com/transmuteAI/trailmet/tree/main/trailmet/algorithms/llm-rom) <br> [Paper](https://arxiv.org/abs/2312.07046)|
- Data-free Weight Compress and Denoise for Large Language Models
- [SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression](https://arxiv.org/abs/2403.07378) <br> Xin Wang, Yu Zheng, Zhongwei Wan, Mi Zhang |<img width="1002" alt="image" src="https://github.com/AIoT-MLSys-Lab/SVD-LLM/raw/main/figures/framework.png"> |[Github](https://github.com/AIoT-MLSys-Lab/SVD-LLM) <br> [Paper](https://arxiv.org/abs/2403.07378)|
- ![Star - ACL'24%20Findings-blue)]()<br>[Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization](https://arxiv.org/abs/2405.10616) <br> Yixin Ji, Yang Xiang, Juntao Li, Wei Chen, Zhongyi Liu, Kehai Chen, Min Zhang |<img width="1002" alt="image" src="figures/bolaco.png"> |[Github](https://github.com/Dereck0602/Bolaco) <br> [Paper](https://arxiv.org/abs/2405.10616)|
- [![Publish](https://img.shields.io/badge/Conference-ACL'24-blue)]()<br>[Surgical Feature-Space Decomposition of LLMs: Why, When and How?](https://arxiv.org/abs/2405.13039) <br> Arnav Chavan, Nahush Lele, Deepak Gupta |<img width="1002" alt="image" src="figures/SFSD-LLM.png"> |[Github](https://github.com/nyunAI/SFSD-LLM) <br> [Paper](https://arxiv.org/abs/2405.13039)|
-
Hardware
- [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691). Tri Dao. [[Paper]](https://arxiv.org/abs/2307.08691)[[Github]](https://github.com/Dao-AILab/flash-attention)
- ![Star - Kelley. [[Paper]](https://arxiv.org/abs/2311.09431)[[Github]](https://github.com/exists-forall/striped_attention/)
- [PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU](https://arxiv.org/abs/2312.12456). Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen. [[Paper]](https://arxiv.org/abs/2312.12456)[[Github]](https://github.com/SJTU-IPADS/PowerInfer)
-
Tuning
- CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models
- ![Star - Quan Luo. [[Paper]](https://arxiv.org/abs/2310.10505)[[Github]](https://github.com/liziniu/ReMax)
- TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
- DEFT: Data Efficient Fine-Tuning for Large Language Models via Unsupervised Core-Set Selection
- Towards Better Parameter-Efficient Fine-Tuning for Large Language Models: A Position Paper
- [SPT: Fine-Tuning Transformer-based Language Models Efficiently with Sparsification](https://arxiv.org/abs/2312.10365). Yuntao Gui, Xiao Yan, Peiqi Yin, Han Yang, James Cheng. [[Paper]](https://arxiv.org/abs/2312.10365)[[Github]](https://github.com/ytgui/SPT-proto)
- LoRA-SP: Streamlined Partial Parameter Adaptation for Resource-Efficient Fine-Tuning of Large Language Models
- Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning - Jui Hsieh, Yang You. [[Paper]](https://arxiv.org/abs/2402.15751)
- ![Star - Tuning of Large Language Models by Dropping Backward Propagation](https://arxiv.org/abs/2402.17812). Sunghyeon Woo, Baeseong Park, Byeongwook Kim, Minjung Jo, Sejung Kwon, Dongsuk Jeon, Dongsoo Lee. [[Paper]](https://arxiv.org/abs/2402.17812)[[Github]](https://github.com/WooSunghyeon/dropbp)
- [LoRA+: Efficient Low Rank Adaptation of Large Models](https://arxiv.org/abs/2402.12354). Soufiane Hayou, Nikhil Ghosh, Bin Yu. [[Paper]](https://arxiv.org/abs/2402.12354)[[Github]](https://github.com/nikhil-ghosh-berkeley/loraplus)
- [AILS-NTUA at SemEval-2024 Task 6: Efficient model tuning for hallucination detection and analysis](https://arxiv.org/pdf/2404.01210.pdf). Natalia Griogoriadou, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou. [[Paper]](https://arxiv.org/pdf/2404.01210.pdf)[[Github]](https://github.com/ngregoriade/Semeval2024-Shroom)
- Intuition-aware Mixture-of-Rank-1-Experts for Parameter Efficient Finetuning
- Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
-
Leaderboard
-
Paper from Sep 30, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
KV Cache Compression
- KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head
- MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
- TokenSelect: Efficient Long-Context Inference and Length Extrapolation for LLMs via Dynamic Token-Level KV Cache Selection
- SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression
- Unifying KV Cache Compression for Large Language Models with LeanKV
- Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity
- MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
- Lossless KV Cache Compression to 2%
- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy
-
Quantization
- Channel-Wise Mixed-Precision Quantization for Large Language Models
- Progressive Mixed-Precision Decoding for Efficient LLM Inference
- MixPE: Quantization and Hardware Co-design for Efficient LLM Inference - Ling Zhen, Mingxuan Yuan, Bei Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2411.16158v1/x5.png"> |[Paper](https://arxiv.org/abs/2411.16158)|[//]: #12/03
- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
- Addition is All You Need for Energy-efficient Language Models
- [ResQ: Mixed-Precision Quantization of Large Language Models with Low-Rank Residuals](https://arxiv.org/abs/2412.14363) <br> Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, Xin Wang |<img width="1002" alt="image" src="figures/ResQ.png"> |[Github](https://github.com/utkarsh-dmx/project-resq) <br> [Paper](https://arxiv.org/abs/2412.14363)|[//]: #12/30
- MixLLM: LLM Quantization with Global Mixed-precision between Output-features and Highly-efficient System Design
- GQSA: Group Quantization and Sparsity for Accelerating Large Language Model Inference
- LSAQ: Layer-Specific Adaptive Quantization for Large Language Model Deployment
- GWQ: Gradient-Aware Weight Quantization for Large Language Models
- "Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization
- BitNet a4.8: 4-bit Activations for 1-bit LLMs
- AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
- Scaling laws for post-training quantized large language models
- Understanding the difficulty of low-precision post-training quantization of large language models
- QuAILoRA: Quantization-Aware Initialization for LoRA
- Pyramid Vector Quantization for LLMs
- CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression
- Scaling Laws for Mixed quantization in Large Language Models |[Paper](https://arxiv.org/abs/2410.06722)|[//]: #10/14
- PalmBench: A Comprehensive Benchmark of Compressed Large Language Models on Mobile Platforms
- A Comprehensive Study on Quantization Techniques for Large Language Models
- The Impact of Inference Acceleration Strategies on Bias of LLMs
- SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization
- CPTQuant -- A Novel Mixed Precision Post-Training Quantization Techniques for Large Language Models
- ![Publish - Length Grouped Activation Data Format](https://arxiv.org/abs/2411.15982) <br> Chao Fang, Man Shi, Robin Geens, Arne Symons, Zhongfeng Wang, Marian Verhelst |<img width="1002" alt="image" src="https://arxiv.org/html/2411.15982v1/x1.png"> |[Paper](https://arxiv.org/abs/2411.15982)|[//]: #12/03
- Evaluating Quantized Large Language Models for Code Generation on Low-Resource Language Benchmarks
- SeedLM: Compressing LLM Weights into Seeds of Pseudo-Random Generators
- Continuous Approximations for Improving Quantization Aware Training of LLMs
- AMXFP4: Taming Activation Outliers with Asymmetric Microscaling Floating-Point for 4-bit LLM Inference
- Bi-Mamba: Towards Accurate 1-Bit State Space Models
-
Inference Acceleration
- QSpec: Speculative Decoding with Complementary Quantization Schemes
- A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts
- Faster Language Models with Better Multi-Token Prediction Using Tensor Decomposition
- Efficient Inference for Augmented Large Language Models
- Accelerated AI Inference via Dynamic Execution Methods
- SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference
- Dynamic Strategy Planning for Efficient Question Answering with Large Language Models
- The N-Grammys: Accelerating Autoregressive Inference with Learning-Free Batched Speculation
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
- ParallelSpec: Parallel Drafter for Efficient Speculative Decoding
- DySpec: Faster Speculative Decoding with Dynamic Token Tree Structure
- Speculative Decoding with CTC-based Draft Model for LLM Inference Acceleration
- PLD+: Accelerating LLM inference by leveraging Language Model Artifacts
-
Hardware/System/Serving
- ![Publish - bit Communication Quantization in Sharded Data Parallelism for LLM Training](https://arxiv.org/abs/2410.15526) <br> Jinda Jia, Cong Xie, Hanlin Lu, Daoce Wang, Hao Feng, Chengming Zhang, Baixi Sun, Haibin Lin, Zhi Zhang, Xin Liu, Dingwen Tao |<img width="1002" alt="image" src="https://arxiv.org/html/2410.15526v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.15526)|[//]: #10/30
- KunServe: Elastic and Efficient Large Language Model Serving with Parameter-centric Memory Management
- FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs
- CE-CoLLM: Efficient and Adaptive Large Language Models Through Cloud-Edge Collaboration
- Ripple: Accelerating LLM Inference on Smartphones with Correlation-Aware Neuron Management
- FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models
- POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
-
Knowledge Distillation
- Improving Mathematical Reasoning Capabilities of Small Language Models via Feedback-Driven Distillation
- Knowledge Distillation of Large Language Models
- ![Publish - Evolution Knowledge Distillation for LLM-based Machine Translation](https://arxiv.org/abs/2412.15303) <br> Yuncheng Song, Liang Ding, Changtong Zan, Shujian Huang |<img width="1002" alt="image" src="https://arxiv.org/html/2412.15303v1/extracted/6081708/model_two.png"> |[Paper](https://arxiv.org/abs/2412.15303)|[//]: #12/30
- Large Language Models Compression via Low-Rank Feature Distillation
- [Distilling Fine-grained Sentiment Understanding from Large Language Models](https://arxiv.org/abs/2412.18552) <br> Yice Zhang, Guangyu Xie, Hongling Xu, Kaiheng Hou, Jianzhu Bao, Qianlong Wang, Shiwei Chen, Ruifeng Xu |<img width="302" alt="image" src="https://arxiv.org/html/2412.18552v1/x1.png"> |[Github](https://github.com/HITSZ-HLT/FSA-Distillation) <br> [Paper](https://arxiv.org/abs/2412.18552)|[//]: #12/30
- [Enhancing Knowledge Distillation for LLMs with Response-Priming Prompting](https://arxiv.org/abs/2412.17846) <br> Vijay Goyal, Mustafa Khan, Aprameya Tirupati, Harveer Saini, Michael Lam, Kevin Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2412.17846v1/extracted/6080471/prompt-example.png"> |[Github](https://github.com/alonso130r/knowledge-distillation) <br> [Paper](https://arxiv.org/abs/2412.17846)|[//]: #12/30
- Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling - Yu Lee, Tomas Pfister |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11325v1/x2.png"> |[Paper](https://arxiv.org/abs/2410.11325)|[//]: #10/21
- SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models
- Pre-training Distillation for Large Language Models: A Design Space Exploration
- Evolutionary Contrastive Distillation for Language Model Alignment - Samuels, Zheng Li, Hyokun Yun, Priyanka Nigam, Yi Xu, Vaclav Petricek, Bing Yin, Trishul Chilimbi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07513v1/extracted/5913898/figures/main_alg_v3.png"> |[Paper](https://arxiv.org/abs/2410.07513)|[//]: #10/13
-
Network Pruning / Sparsity
- Mitigating Copy Bias in In-Context Learning through Neuron Pruning
- HashAttention: Semantic Sparsity for Faster Inference
- Adaptive Pruning for Large Language Models with Structural Importance Awareness
- SlimGPT: Layer-wise Structured Pruning for Large Language Models
- Less is More: Towards Green Code Large Language Models via Unified Structural Pruning |[Paper](https://arxiv.org/abs/2412.15921)|[//]: #12/30
- AVSS: Layer Importance Evaluation in Large Language Models via Activation Variance-Sparsity Analysis
- Tailored-LLaMA: Optimizing Few-Shot Learning in Pruned LLaMA Models with Task-Specific Prompts
- ![Publish - Data Distillation for Recovering Quality in Pruned Large Language Models](https://arxiv.org/abs/2410.09982) <br> Vithursan Thangarasa, Ganesh Venkatesh, Nish Sinnadurai, Sean Lie |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09982v2/x1.png"> |[Paper](https://arxiv.org/abs/2410.09982)|[//]: #10/21
- LLM-Rank: A Graph Theoretical Approach to Pruning Large Language Models
- Beyond 2:4: exploring V:N:M sparsity for efficient transformer inference on GPUs
- FedSpaLLM: Federated Pruning of Large Language Models
- Self-calibration for Language Model Quantization and Pruning
- Beware of Calibration Data for Pruning Large Language Models
- Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
- Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
- Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix
- Layer Importance and Hallucination Analysis in Large Language Models via Enhanced Activation Variance-Sparsity
- Scaling Law for Post-training after Model Pruning
-
Text Compression
- [L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression](https://arxiv.org/abs/2412.16642) <br> Junxuan Zhang, Zhengxue Cheng, Yan Zhao, Shihao Wang, Dajiang Zhou, Guo Lu, Li Song |<img width="1002" alt="image" src="https://arxiv.org/html/2412.16642v2/x2.png"> |[Github](https://github.com/alipay/L3TC-leveraging-rwkv-for-learned-lossless-low-complexity-text-compression) <br> [Paper](https://arxiv.org/abs/2412.16642)|[//]: #12/30
- ![Star - Aware Prompt Compression for LLM-based MT Evaluation Metrics](https://arxiv.org/abs/2412.16120) <br> Daniil Larionov, Steffen Eger |<img width="1002" alt="image" src="https://arxiv.org/html/2412.16120v1/x1.png"> |[Github](https://github.com/NL2G/promptoptme) <br> [Paper](https://arxiv.org/abs/2412.16120)|[//]: #12/30
- A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
- JPPO: Joint Power and Prompt Optimization for Accelerated Large Language Model Services
- Perception Compressor: A training-free prompt compression method in long context scenarios - Tao Zheng |<img width="1002" alt="image" src="https://arxiv.org/html/2409.19272v1/x1.png"> |[Paper](https://arxiv.org/abs/2409.19272)|[//]: #10/02
-
Efficient Training
- ![Star - Artzi |<img width="1002" alt="image" src="https://arxiv.org/html/2412.18027v1/x1.png"> |[Github](https://github.com/neiterman21/LDB) <br> [Paper](https://arxiv.org/abs/2412.18027)|[//]: #12/30
- Less is More: Extreme Gradient Boost Rank-1 Adaption for Efficient Finetuning of LLMs
- AutoMixQ: Self-Adjusting Quantization for High Performance Memory-Efficient Fine-Tuning
-
Efficient Architecture of LLM
-
Efficient MOE
- ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
- ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference - Ann Heng, Chao Li, Minyi Guo |<img width="1002" alt="image" src="https://arxiv.org/html/2411.01433v2/extracted/5980843/figures/overview5.png"> |[Paper](https://arxiv.org/abs/2411.01433)|[//]: #11/18
-
Low-Rank Decomposition
-
Tuning
- ![Publish - Efficient Fine-Tuning of Large Language Models using Semantic Knowledge Tuning](https://arxiv.org/abs/2410.08598) <br> Nusrat Jahan Prottasha, Asif Mahmud, Md. Shohanur Islam Sobuj, Prakash Bhat, Md Kowsher, Niloofar Yousefi, Ozlem Ozmen Garibay |<img width="1002" alt="image" src="https://arxiv.org/html/2410.08598v1/x1.png"> |[Paper](https://arxiv.org/abs/2410.08598)|[//]: #10/21
- SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching
- HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization
- ![Publish - Rank Adaptation for Large Language Models Fine-tuning](https://arxiv.org/abs/2410.18035) <br> Jingfan Zhang, Yi Zhao, Dan Chen, Xing Tian, Huanran Zheng, Wei Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.18035v1/extracted/5949512/em_lora_framework.png"> |[Paper](https://arxiv.org/abs/2410.18035)|[//]: #10/29
-
Survey (or Benchmark)
-
-
Paper from June 13, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.
- [Sirius: Contextual Sparsity with Correction for Efficient LLMs](https://arxiv.org/abs/2409.03856) <br> Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen |<img width="1002" alt="image" src="https://infini-ai-lab.github.io/Sirius/static/images/methodsillustration.png"> |[Github](https://github.com/Infini-AI-Lab/Sirius) <br> [Paper](https://arxiv.org/abs/2409.03856)|[//]: #09/13
- ![Star - Pass Unified Generation and Retrieval for LLMs](https://arxiv.org/abs/2409.05152) <br> Jintian Zhang, Cheng Peng, Mengshu Sun, Xiang Chen, Lei Liang, Zhiqiang Zhang, Jun Zhou, Huajun Chen, Ningyu Zhang |<img width="1002" alt="image" src="https://github.com/zjunlp/OneGen/blob/main/assets/train.jpg"> |[Github](https://github.com/zjunlp/OneGen) <br> [Paper](https://arxiv.org/abs/2409.05152)|[//]: #09/13
- ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models
- HiP Attention: Sparse Sub-Quadratic Attention with Hierarchical Attention Pruning
- ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models
- Memory Is All You Need: An Overview of Compute-in-Memory Architectures for Accelerating Large Language Model Inference
-
-
Paper from August 17, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
KV Cache Compression
- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference
- Finch: Prompt-guided Key-Value Cache Compression
- ThinK: Thinner Key Cache by Query-Driven Pruning
- ![Star - Level KV Cache Compression Method with Integrated Retrieval and Reasoning](https://arxiv.org/abs/2410.19258) <br> Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao |<img width="1002" alt="image" src="https://github.com/FYYFU/HeadKV/raw/main/main.png"> |[Github](https://github.com/FYYFU/HeadKV) <br> [Paper](https://arxiv.org/abs/2410.19258)|[//]: #11/17
- [BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference](https://arxiv.org/abs/2410.23079) <br> Junqi Zhao, Zhijin Fang, Shu Li, Shaohui Yang, Shichao He |<img width="1002" alt="image" src="https://arxiv.org/html/2410.23079v1/x1.png"> |[Github](https://github.com/JunqiZhao888/buzz-llm) <br> [Paper](https://arxiv.org/abs/2410.23079)|[//]: #11/17
-
Knowledge Distillation
- LaDiMo: Layer-wise Distillation Inspired MoEfier
- Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models
- Concept Distillation from Strong to Weak Models via Hypotheses-to-Theories Prompting
- ![Star - Distillation Through Time](https://arxiv.org/abs/2410.21035) <br> Justin Deschenaux, Caglar Gulcehre |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21035v1/x3.png"> |[Github](https://github.com/jdeschena/sdtt) <br> [Paper](https://arxiv.org/abs/2410.21035)|[//]: #11/17
-
Quantization
- STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
- [TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction](https://arxiv.org/abs/2410.19103) <br> Yuhang Li, Priyadarshini Panda |<img width="1002" alt="image" src="https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ/raw/main/imgs/tesseraq.png"> |[Github](https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ) <br> [Paper](https://arxiv.org/abs/2410.19103)|[//]: #11/17
- ![Star - Grained Size Control for Compressed Large Language Models in Variable Memory Environments](https://arxiv.org/abs/2410.23918) <br> Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu |<img width="1002" alt="image" src="https://github.com/xinghaow99/BitStack/raw/main/assets/bitstack.png"> |[Github](https://github.com/xinghaow99/BitStack) <br> [Paper](https://arxiv.org/abs/2410.23918)|[//]: #11/17
-
Hardware/System/Serving
-
Inference Acceleration
- Speculative Diffusion Decoding: Accelerating Language Generation through Diffusion
- Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
- [SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents](https://arxiv.org/abs/2411.03284) <br> Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen |<img width="1002" alt="image" src="figures/SMoA.png"> |[Github](https://github.com/David-Li0406/SMoA) <br> [Paper](https://arxiv.org/abs/2411.03284)|[//]: #11/18
-
Low-Rank Decomposition
- MoDeGPT: Modular Decomposition for Large Language Model Compression - Heng Lin, Shangqian Gao, James Seale Smith, Abhishek Patel, Shikhar Tuli, Yilin Shen, Hongxia Jin, Yen-Chang Hsu |<img width="1002" alt="image" src="https://arxiv.org/html/2408.09632v1/x2.png"> |[Paper](https://arxiv.org/abs/2408.09632)|[//]: #08/20
-
Network Pruning / Sparsity
-
Survey (or Benchmark)
- [LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/abs/2411.00136) <br> Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al. | |[Github](https://github.com/argonne-lcf/LLM-Inference-Bench) <br> [Paper](https://arxiv.org/abs/2411.00136)|[//]: #11/18
-
Efficient MOE
- ![Star - of-Experts Training with Network-Traffc-Aware Parallel Optimization](https://arxiv.org/abs/2411.00662) <br> Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li |<img width="1002" alt="image" src="https://arxiv.org/html/2411.00662v1/x1.png"> |[Github](https://github.com/EnflameTechnology/DeepSpeed) <br> [Paper](https://arxiv.org/abs/2411.00662)|[//]: #11/18
- [MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) <br> Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan |<img width="1002" alt="image" src="https://arxiv.org/html/2411.01016v1/x1.png"> |[Github](https://github.com/xiaochengsky/MoEI-2) <br> [Paper](https://arxiv.org/abs/2411.01016)|[//]: #11/18
-
Text Compression
- ![Star - Length Tokenization for Efficient LLMs Adapted from LZW Compression](https://arxiv.org/abs/2410.21548) <br> Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21548v1/extracted/5960495/Figures/MultiTok.png"> |[Github](https://github.com/noelkelias/multitok) <br> [Paper](https://arxiv.org/abs/2410.21548)|[//]: #11/17
-
Tuning
- [Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation](https://arxiv.org/abs/2411.04358) <br> Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty |<img width="1002" alt="image" src="https://arxiv.org/html/2411.04358v2/x3.png"> |[Github](https://github.com/LCS2-IIITD/MonteCLoRA) <br> [Paper](https://arxiv.org/abs/2411.04358)|[//]: #11/18
-
Efficient Training
- [Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention](https://arxiv.org/abs/2411.02063) (EMNLP'24) <br> Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou |<img width="1002" alt="image" src="https://arxiv.org/html/2411.02063v1/x1.png"> |[Github](https://github.com/TsinghuaC3I/LPA) <br> [Paper](https://arxiv.org/abs/2411.02063)|[//]: #11/18
- ![Star - Efficient FP8 Training](https://arxiv.org/abs/2410.19313) <br> Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han |<img width="1002" alt="image" src="https://github.com/NVlabs/COAT/blob/main/docs/figs/FP8PrecisionFlow.png"> |[Github](https://github.com/NVlabs/COAT) <br> [Paper](https://arxiv.org/abs/2410.19313)|[//]: #11/17
- ![Star - v.svg"> |[Github](https://github.com/wuhouming/BitPipe) <br> [Paper](https://arxiv.org/abs/2410.19367)|[//]: #11/17
-
-
Paper from July 4, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
Text Compression
-
Tuning
- Code Less, Align More: Efficient LLM Fine-tuning for Code Generation with Data Pruning - Da Tsai, Mingjie Liu, Haoxing Ren |<img width="1002" alt="image" src="https://arxiv.org/html/2407.05040v1/x1.png"> |[Paper](https://arxiv.org/abs/2407.05040)|[//]: #07/10
- ![Publish - Device Fine-Tuning for Personalized LLMs](https://arxiv.org/abs/2407.01031) <br> Dan Peng, Zhihui Fu, Jun Wang ||[Paper](https://arxiv.org/abs/2407.01031)|[//]: #07/05
-
Knowledge Distillation
-
Quantization
-
Inference Acceleration
-
Survey
-
Network Pruning / Sparsity
- ![Publish - tuning](https://aclanthology.org/2024.findings-naacl.1/) <br> Honghe Zhang, XiaolongShi XiaolongShi, Jingwei Sun, Guangzhong Sun |<img width="1002" alt="image" src="figures/CCEMF.png"> |[Paper](https://aclanthology.org/2024.findings-naacl.1/)|[//]: #07/05
-
Hardware/System
- [TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices](https://arxiv.org/abs/2410.00531) <br> Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.00531v1/x4.png"> |[Github](https://github.com/Lizonghang/TPI-LLM) <br> [Paper](https://arxiv.org/abs/2410.00531)|[//]: #10/02
-
-
Paper from June 6, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.
- VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning
- Loki: Low-Rank Keys for Efficient Sparse Attention
- LLM and GNN are Complementary: Distilling LLM for Multimodal Graph Learning
- Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
- ![Publish - exiting for Faster LLM Inference with Thompson Sampling Control Mechanism](https://arxiv.org/abs/2406.03853) <br> Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai |<img width="1002" alt="image" src="https://arxiv.org/html/2406.03853v1/x3.png"> |[Paper](https://arxiv.org/abs/2406.03853)|[//]: #06/12
- PowerInfer-2: Fast Large Language Model Inference on a Smartphone
- Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
-
-
Paper from June 2, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
-
Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.
- LCQ: Low-Rank Codebook based Quantization for Large Language Models - Pu Cai, Wu-Jun Li |<img width="1002" alt="image" src="https://arxiv.org/html/2405.20973v1/x5.png"> |[Paper](https://arxiv.org/abs/2405.20973)|[//]: #06/05
- MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization
- Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs - Holder | |[Paper](https://arxiv.org/abs/2405.20835)|[//]: #06/05
- [LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models](https://arxiv.org/abs/2408.10631) <br> Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu |<img width="1002" alt="image" src="https://github.com/YupengSu/LLM-Barber/raw/main/img/figure1a.png"> |[Github](https://github.com/YupengSu/LLM-Barber) <br> [Paper](https://arxiv.org/abs/2408.10631)|[//]: #08/27
- [PAT: Pruning-Aware Tuning for Large Language Models](https://arxiv.org/abs/2408.14721) <br> Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du |<img width="1002" alt="image" src="figures/PAT.png"> |[Github](https://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning) <br> [Paper](https://arxiv.org/abs/2408.14721)|[//]: #09/02
- [MobileQuant: Mobile-friendly Quantization for On-device Language Models](https://arxiv.org/abs/2408.13933) <br> Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez |<img width="1002" alt="image" src="https://arxiv.org/html/2408.13933v1/x1.png"> |[Github](https://github.com/saic-fi/MobileQuant) <br> [Paper](https://arxiv.org/abs/2408.13933)|[//]: #08/27
- [ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models](https://arxiv.org/abs/2408.08554) <br> Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei |<img width="1002" alt="image" src="figures/abq-llm.png"> |[Github](https://github.com/bytedance/ABQ-LLM) <br> [Paper](https://arxiv.org/abs/2408.08554)|[//]: #08/20
- [Post-Training Sparse Attention with Double Sparsity](https://arxiv.org/abs/2408.07092) <br> Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng |<img width="302" alt="image" src="https://github.com/andy-yang-1/DoubleSparse/raw/main/assets/double-sparsity-gif-v2.gif"> |[Github](https://github.com/andy-yang-1/DoubleSparse) <br> [Paper](https://arxiv.org/abs/2408.07092)|[//]: #08/20
- [Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression](https://arxiv.org/abs/2408.15491) <br> Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2408.15491v1/extracted/5817813/arch.png"> |[Github](https://github.com/howard-hou/instruction-aware-contextual-compressor) <br> [Paper](https://arxiv.org/abs/2408.15491)|[//]: #09/02
- Low-Rank Quantization-Aware Training for LLMs
- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
-
-
KV Cache Compression
- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
- No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization
- ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition
- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More
- DB-LLM: Accurate Dual-Binarization for Efficient LLMs |[Paper](https://arxiv.org/abs/2402.11960)|
- ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
- ![Star - free Low-bit Quantization with Matrix Decomposition for KV Cache Compression](https://arxiv.org/abs/2405.12591) <br> Peiyu Liu, Ze-Feng Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen |<img width="1002" alt="image" src="figures/DecoQuant.png"> |[Github](https://github.com/lpyhdzx/DecoQuant_code) <br> [Paper](https://arxiv.org/abs/2405.12591)|
- [PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference](https://arxiv.org/abs/2405.12532) (ACL'24) <br> Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, Hai Zhao |<img width="1002" alt="image" src="figures/PyramidInfer.png"> |[Github](https://github.com/mutonix/pyramidinfer) <br> [Paper](https://arxiv.org/abs/2405.12532)|
- Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
- [Layer-Condensed KV Cache for Efficient Inference of Large Language Models](https://arxiv.org/abs/2405.10637) (ACL'24) <br> Haoyi Wu, Kewei Tu |<img width="1002" alt="image" src="figures/LCKV.png"> |[Github](https://github.com/whyNLP/LCKV) <br> [Paper](https://arxiv.org/abs/2405.10637)|
-
Full List
-
Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.
-
Categories
Paper from Sep 30, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
115
Quantization
76
Inference Acceleration
60
Paper from July 13, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
59
Network Pruning
59
Knowledge Distillation
51
Paper from August 24, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
45
Paper from May 26, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
35
Paper from June 21, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
33
Paper from August 17, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
30
Paper from Sep 2, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
30
Hardware/System
25
Paper from 05/26/2024 - Now (see Full List from 05/22/2023 [here](#full-list))
24
Tuning
17
Survey
17
Text Compression
14
Paper from June 2, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
13
Efficient MOE
12
KV Cache Compression
11
Paper from July 4, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
10
Efficient Architecture of LLM
9
Low-Rank Decomposition
9
Paper from June 6, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
7
Leaderboard
6
Paper from June 13, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))
6
Hardware
3
Full List
1
Sub Categories
Quantization
61
Network Pruning / Sparsity
48
Inference Acceleration
48
Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.
41
Knowledge Distillation
30
KV Cache Compression
29
Please check out all the papers by selecting the sub-area you're interested in. On this page, we're showing papers released in the past 30 days.
24
Tuning
21
Please check out all the papers by selecting the sub-area you're interested in. On this page, we're showing papers released in the past 60 days.
21
Text Compression
20
Hardware/System/Serving
17
Efficient MOE
12
Survey (or Benchmark)
10
Efficient Training
8
Low-Rank Decomposition
5
Efficient Architecture of LLM
5
Hardware/System
4
Survey
4