Awesome-Efficient-LLM

A curated list for Efficient Large Language Models
https://github.com/horseee/Awesome-Efficient-LLM


  • Papers from August 24, 2024 to now (see the full list from May 22, 2023 [here](#full-list))

    • Inference Acceleration

      • ![Star - wise Criticality-based Approach for Prefilling Acceleration in LLMs](https://arxiv.org/abs/2409.12490) <br> Junlin Lv, Yuan Feng, Xike Xie, Xin Jia, Qirong Peng, Guiming Xie |<img width="1002" alt="image" src="https://arxiv.org/html/2409.12490v1/x2.png"> |[Github](https://github.com/66RING/CritiPrefill) <br> [Paper](https://arxiv.org/abs/2409.12490)|[//]: #09/21
      • ![Star - the-Fly Self-Speculative Decoding for LLM Inference Acceleration](https://arxiv.org/abs/2410.06916) <br> Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li |<img width="1002" alt="image" src="https://github.com/hemingkx/SWIFT/raw/main/assets/swift.png"> |[Github](https://github.com/hemingkx/SWIFT) <br> [Paper](https://arxiv.org/abs/2410.06916)|[//]: #10/14
      • ![Star - Inspired Adaptive Sparse Activation](https://arxiv.org/abs/2410.18311#) <br> Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen |<img width="1002" alt="image" src="https://wangqinsi1.github.io/coreinfer_page/static/images/overview.png"> |[Github](https://github.com/wangqinsi1/CoreInfer) <br> [Paper](https://arxiv.org/abs/2410.18311#)|[//]: #10/29
      • [DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads](https://arxiv.org/abs/2410.10819) <br> Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han | [Github](https://github.com/mit-han-lab/duo-attention) <br> [Paper](https://arxiv.org/abs/2410.10819)
      • :star: [Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time](https://openreview.net/forum?id=wIPIhHd00i) (ICML'23 Oral) <br> Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen | [Github](https://github.com/FMInference/DejaVu) <br> [Paper](https://openreview.net/forum?id=wIPIhHd00i)
      • :star: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) <br> Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis | [Github](https://github.com/mit-han-lab/streaming-llm) <br> [Paper](https://arxiv.org/abs/2309.17453)
    • KV Cache Compression

      • ![Star - Wise Dissimilar KV Cache Sharing](https://arxiv.org/abs/2410.18517) <br> Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |<img width="1002" alt="image" src="https://github.com/yangyifei729/KVSharer/raw/main/img/main_fig.jpg"> |[Github](https://github.com/yangyifei729/KVSharer) <br> [Paper](https://arxiv.org/abs/2410.18517)|[//]: #10/29
      • ![Star - KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference](https://arxiv.org/abs/2407.11550) <br> Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |<img width="1002" alt="image" src="figures/adakv.png"> |[Github](https://github.com/FFY0/AdaKV) <br> [Paper](https://arxiv.org/abs/2407.11550)|[//]: #10/13
      • ![Star - Layer KV Sharing for Efficient LLM Inference](https://arxiv.org/abs/2410.14442) <br> You Wu, Haoyi Wu, Kewei Tu |<img width="202" alt="image" src="figures/cross-layer-kv.png"> |[Github](https://github.com/whyNLP/LCKV) <br> [Paper](https://arxiv.org/abs/2410.14442)|[//]: #10/30
      • ![Star - Level KV Cache Compression Method with Integrated Retrieval and Reasoning](https://arxiv.org/abs/2410.19258) <br> Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao |<img width="1002" alt="image" src="https://github.com/FYYFU/HeadKV/raw/main/main.png"> |[Github](https://github.com/FYYFU/HeadKV) <br> [Paper](https://arxiv.org/abs/2410.19258)|[//]: #11/17
    • Tuning

      • [Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models](https://arxiv.org/abs/2410.11772) (EMNLP'24) <br> Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu | [Github](https://github.com/Kaiseem/IST) <br> [Paper](https://arxiv.org/abs/2410.11772)
      • [Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation](https://arxiv.org/abs/2411.04358) <br> Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty | [Github](https://github.com/LCS2-IIITD/MonteCLoRA) <br> [Paper](https://arxiv.org/abs/2411.04358)
      • [QEFT: Quantization for Efficient Fine-Tuning of LLMs](https://arxiv.org/abs/2410.08661) (EMNLP'24 Findings) <br> Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park | [Github](https://github.com/xvyaward/qeft) <br> [Paper](https://arxiv.org/abs/2410.08661)
      • ![Star - tuning of MLP Layers](https://arxiv.org/abs/2410.07383) <br> Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07383v1/x1.png"> |[Github](https://github.com/sayankotor/sparse_grads) <br> [Paper](https://arxiv.org/abs/2410.07383)|[//]: #10/13
    • Text Compression

      • [AlphaZip: Neural Network-Enhanced Lossless Text Compression](https://arxiv.org/abs/2409.15046) <br> Swathi Shree Narashiman, Nitin Chandrachoodan | [Github](https://github.com/Swathi-Shree-Narashiman/AlphaZip) <br> [Paper](https://arxiv.org/abs/2409.15046)
      • [Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression](https://arxiv.org/abs/2408.15491) <br> Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu | [Github](https://github.com/howard-hou/instruction-aware-contextual-compressor) <br> [Paper](https://arxiv.org/abs/2408.15491)
      • ![Star - Length Tokenization for Efficient LLMs Adapted from LZW Compression](https://arxiv.org/abs/2410.21548) <br> Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21548v1/extracted/5960495/Figures/MultiTok.png"> |[Github](https://github.com/noelkelias/multitok) <br> [Paper](https://arxiv.org/abs/2410.21548)|[//]: #11/17
      • [Generative Context Distillation](https://arxiv.org/abs/2411.15927) <br> Haebin Shin, Lei Ji, Yeyun Gong, Sungdong Kim, Eunbi Choi, Minjoon Seo | [Github](https://github.com/kaistai/generative-context-distillation) <br> [Paper](https://arxiv.org/abs/2411.15927)
    • Low-Rank Decomposition

      • [Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https://arxiv.org/abs/2410.16029) <br> Arijit Das | [Github](https://github.com/selfsupervised-ai/Natural-GaLore) <br> [Paper](https://arxiv.org/abs/2410.16029)
    • Hardware/System/Serving

      • [TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices](https://arxiv.org/abs/2410.00531) <br> Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu | [Github](https://github.com/Lizonghang/TPI-LLM) <br> [Paper](https://arxiv.org/abs/2410.00531)
    • Quantization

      • [Quamba: A Post-Training Quantization Recipe for Selective State Space Models](https://arxiv.org/abs/2410.13229) <br> Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu | [Github](https://github.com/enyac-group/Quamba) <br> [Paper](https://arxiv.org/abs/2410.13229)
      • [DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs](https://arxiv.org/abs/2406.01721) (NeurIPS'24) <br> Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, Ying Wei | [Github](https://github.com/Hsu1023/DuQuant?tab=readme-ov-file) <br> [Paper](https://arxiv.org/abs/2406.01721)
      • ![Star - bit Vector Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2409.17066) <br> Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang |<img width="1002" alt="image" src="figures/VPTQ.png"> |[Github](https://github.com/microsoft/VPTQ) <br> [Paper](https://arxiv.org/abs/2409.17066)|[//]: #09/27
      • [SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs](https://arxiv.org/abs/2410.09615) <br> Mohammad Mozaffari, Maryam Mehri Dehnavi | [Github](https://github.com/Mohammad-Mozaffari/slim) <br> [Paper](https://arxiv.org/abs/2410.09615)
      • Matmul or No Matmul in the Era of 1-bit LLMs
      • [BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration](https://arxiv.org/abs/2411.11745) (HPCA'25) <br> Yuzong Chen, Ahmed F. AbouElhamayed, Xilai Dai, Yang Wang, Marta Andronic, George A. Constantinides, Mohamed S. Abdelfattah | [Github](https://github.com/abdelfattah-lab/BitMoD-HPCA-25) <br> [Paper](https://arxiv.org/abs/2411.11745)
    • Efficient Training

      • ![Star - Efficient FP8 Training](https://arxiv.org/abs/2410.19313) <br> Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han |<img width="1002" alt="image" src="https://github.com/NVlabs/COAT/blob/main/docs/figs/FP8PrecisionFlow.png"> |[Github](https://github.com/NVlabs/COAT) <br> [Paper](https://arxiv.org/abs/2410.19313)|[//]: #11/17
      • [Github](https://github.com/wuhouming/BitPipe) <br> [Paper](https://arxiv.org/abs/2410.19367)
    • Survey (or Benchmark)

      • [LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/abs/2411.00136) <br> Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al. | [Github](https://github.com/argonne-lcf/LLM-Inference-Bench) <br> [Paper](https://arxiv.org/abs/2411.00136)
      • [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) <br> Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier | [Github](https://github.com/ZongqianLi/Prompt-Compression-Survey) <br> [Paper](https://arxiv.org/abs/2410.12388)
      • [Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2409.13385) <br> Sourav Verma | [Github](https://github.com/SrGrace/Contextual-Compression) <br> [Paper](https://arxiv.org/abs/2409.13385)
    • Efficient MOE

      • [MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) <br> Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan | [Github](https://github.com/xiaochengsky/MoEI-2) <br> [Paper](https://arxiv.org/abs/2411.01016)
      • [MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More](https://arxiv.org/abs/2410.06270) <br> Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, Xiaojuan Qi | [Github](https://github.com/Aaronhuang-778/MC-MoE) <br> [Paper](https://arxiv.org/abs/2410.06270)
    • Network Pruning / Sparsity

      • Language-specific Calibration for Pruning Multilingual Language Models - Jia Chen, Lucie Flek | [Paper](https://arxiv.org/abs/2408.14398)
      • Enhancing One-shot Pruned Pre-trained Language Models through Sparse-Dense-Sparse Mechanism
      • [Reassessing Layer Pruning in LLMs: New Insights and Methods](https://arxiv.org/abs/2411.15558) <br> Yao Lu, Hao Cheng, Yujie Fang, Zeyu Wang, Jiaheng Wei, Dongwei Xu, Qi Xuan, Xiaoniu Yang, Zhaowei Zhu | [Github](https://github.com/yaolu-zjut/Navigation-LLM-layer-pruning) <br> [Paper](https://arxiv.org/abs/2411.15558)
      • LLM Pruning and Distillation in Practice: The Minitron Approach
      • [AmoebaLLM: Constructing Any-Shape Large Language Models for Efficient and Instant Deployment](https://arxiv.org/abs/2411.10606) (NeurIPS'24) <br> Yonggan Fu, Zhongzhi Yu, Junwei Li, Jiayi Qian, Yongan Zhang, Xiangchi Yuan, Dachuan Shi, Roman Yakunin, Yingyan Celine Lin | [Github](https://github.com/GATECH-EIC/AmoebaLLM) <br> [Paper](https://arxiv.org/abs/2411.10606)
    • Knowledge Distillation

  • Papers from Sep 2, 2024 to now (see the full list from May 22, 2023 [here](#full-list))

  • Knowledge Distillation

    • Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models | [Paper](https://arxiv.org/abs/2404.02657) <br> [Blog-Eng](https://zhuanlan.zhihu.com/p/690804722) <br> [Blog-Chinese](https://zhuanlan.zhihu.com/p/690748958)
    • Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation
    • Teaching Small Language Models to Reason
    • [Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing](https://arxiv.org/abs/2305.16635) <br> Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi | [Github](https://github.com/jaehunjung1/impossible-distillation) <br> [Paper](https://arxiv.org/abs/2305.16635)
    • [Large Language Model Distillation Doesn't Need a Teacher](https://arxiv.org/abs/2305.14864) <br> Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, Iz Beltagy | [Github](https://github.com/ananyahjha93/llm-distill) <br> [Paper](https://arxiv.org/abs/2305.14864)
    • PaD: Program-aided Distillation Specializes Large Models in Reasoning
    • The False Promise of Imitating Proprietary LLMs
    • RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment
    • [Specializing Smaller Language Models towards Multi-Step Reasoning](https://arxiv.org/abs/2301.12726) (ICML'23) <br> Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot | [Github](https://github.com/FranxYao/FlanT5-CoT-Specialization) <br> [Paper](https://arxiv.org/abs/2301.12726)
    • [Distilling Script Knowledge from Large Language Models for Constrained Language Planning](https://arxiv.org/abs/2305.05252) (ACL'23 Outstanding) <br> Siyu Yuan, Jiangjie Chen, Ziquan Fu, Xuyang Ge, Soham Shah, Charles Robert Jankowski, Yanghua Xiao, Deqing Yang | [Github](https://github.com/siyuyuan/coscript) <br> [Paper](https://arxiv.org/abs/2305.05252)
    • ![Publish - Consistent Chain-of-Thought Distillation](https://arxiv.org/abs/2305.01879) <br> Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, Xiang Ren |<img width="1002" alt="image" src="figures/SCOTT.png"> |[Paper](https://arxiv.org/abs/2305.01879)|
    • [DISCO: Distilling Counterfactuals with Large Language Models](https://arxiv.org/abs/2212.10534) (ACL'23) <br> Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, Kyle Richardson | [Github](https://github.com/eric11eca/disco) <br> [Paper](https://arxiv.org/abs/2212.10534)
    • [I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation](https://arxiv.org/abs/2212.09246) (ACL'23) <br> Chandra Bhagavatula, Jena D. Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Lianhui Qin, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin Choi | [Github](https://github.com/allenai/i2d2) <br> [Paper](https://arxiv.org/abs/2212.09246) <br> [Project](https://i2d2.allen.ai/)
    • [Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step](https://arxiv.org/abs/2306.14050) (ACL'23) <br> Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi | [Github](https://github.com/allenai/cot_distillation) <br> [Paper](https://arxiv.org/abs/2306.14050)
    • [Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind](https://arxiv.org/abs/2306.09299) (NeurIPS'23) <br> Swarnadeep Saha, Peter Hase, and Mohit Bansal | [Github](https://github.com/swarnaHub/ExplanationIntervention) <br> [Paper](https://arxiv.org/abs/2306.09299)
    • [PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation](https://arxiv.org/abs/2310.14192) (EMNLP'23) <br> Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, Issam H. Laradji | [Github](https://github.com/ServiceNow/PromptMix-EMNLP-2023) <br> [Paper](https://arxiv.org/abs/2310.14192)
    • [Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data](https://arxiv.org/abs/2312.12832) (AAAI'24) <br> Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Bin Sun, Xinglin Wang, Heda Wang, Kan Li | [Github](https://github.com/Yiwei98/TDG) <br> [Paper](https://arxiv.org/abs/2312.12832)
    • [GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model](https://arxiv.org/abs/2306.06629) (ACL'23 Industry Track) <br> Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie Tang | [Github](https://github.com/aitsc/GLMKD) <br> [Paper](https://arxiv.org/abs/2306.06629)
    • [Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes](https://arxiv.org/abs/2305.02301) (ACL'23 Findings) <br> Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister | [Github](https://github.com/google-research/distilling-step-by-step) <br> [Paper](https://arxiv.org/abs/2305.02301)
    • [Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models](https://arxiv.org/abs/2310.13395) (EMNLP'23 Findings) <br> Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos | [Github](https://github.com/stoyian/OCaTS) <br> [Paper](https://arxiv.org/abs/2310.13395)
    • [LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions](https://github.com/mbzuai-nlp/LaMini-LM) <br> Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji | [Github](https://github.com/mbzuai-nlp/LaMini-LM) <br> [Paper](https://arxiv.org/abs/2304.14402)
    • Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA | [Paper](https://arxiv.org/abs/2308.04679)
    • [UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition](https://arxiv.org/abs/2308.03279) <br> Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon | [Github](https://github.com/universal-ner/universal-ner) <br> [Paper](https://arxiv.org/abs/2308.03279) <br> [Project](https://universal-ner.github.io)
    • ![Star - Loup Tastet |<img width="302" alt="image" src="figures/BabyLLaMA.png"> |[Github](https://github.com/timinar/BabyLlama) <br> [Paper](https://arxiv.org/abs/2308.02019) | [Model](https://huggingface.co/timinar/baby-llama-58m) |
    • [Zephyr: Direct Distillation of LM Alignment](https://arxiv.org/abs/2310.16944) <br> Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf | [Github](https://github.com/huggingface/alignment-handbook) <br> [Paper](https://arxiv.org/abs/2310.16944)
    • Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models
    • Mixed Distillation Helps Smaller Language Model Better Reasoning
    • Distilling Event Sequence Knowledge From Large Language Models
    • Knowledge Distillation for Closed-Source Language Models
    • ![Star - Young Yun |<img width="1002" alt="image" src="https://arxiv.org/html/2402.03898v1/x4.png"> |[Github](https://github.com/jongwooko/distillm) <br> [Paper](https://arxiv.org/abs/2402.03898)|
    • Large Language Model Meets Graph Neural Network in Knowledge Distillation
    • Improving Small Language Models' Mathematical Reasoning via Equation-of-Thought Distillation
    • Scavenging Hyena: Distilling Transformers into Long Convolution Models | [Paper](https://arxiv.org/abs/2401.17574)
    • Divide-or-Conquer? Which Part Should You Distill Your LLM?
    • [Distillation Contrastive Decoding: Improving LLMs Reasoning with Contrastive Decoding and Distillation](https://arxiv.org/abs/2402.14874) <br> Phuc Phan, Hieu Tran, Long Phan | [Github](https://github.com/pphuc25/distil-cd) <br> [Paper](https://arxiv.org/abs/2402.14874)
    • PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning
    • Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning
    • Wisdom of Committee: Distilling from Foundation Model to Specialized Application Model
    • [Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs](https://arxiv.org/abs/2402.12030) <br> Nicolas Boizard, Kevin El-Haddad, Céline Hudelot, Pierre Colombo | [Github](https://github.com/Nicolas-BZRD/llm-recipes) [Github](https://github.com/Nicolas-BZRD/llm-distillation) <br> [Paper](https://arxiv.org/abs/2402.12030) <br> [Model](https://huggingface.co/collections/Nicolas-BZRD/llms-distillation-65cfa07f1e4ed7404502a9eb)
    • Revisiting Knowledge Distillation for Autoregressive Language Models
    • Gecko: Versatile Text Embeddings Distilled from Large Language Models
    • DistillSpec: Improving Speculative Decoding via Knowledge Distillation - François Kagy, Rishabh Agarwal | [Paper](https://arxiv.org/abs/2310.08461)
    • [Unmemorization in Large Language Models via Self-Distillation and Deliberate Imagination](https://arxiv.org/abs/2402.10052) <br> Yijiang River Dong, Hongzhou Lin, Mikhail Belkin, Ramon Huerta, Ivan Vulić | [Github](https://github.com/dong-river/LLM_unlearning) <br> [Paper](https://arxiv.org/abs/2402.10052)
    • [Democratizing Reasoning Ability: Tailored Learning from Large Language Model](https://aclanthology.org/2023.emnlp-main.120.pdf) (EMNLP'23) <br> Zhaoyang Wang, Shaohan Huang, Yuxuan Liu, Jiahai Wang, Minghui Song, Zihan Zhang, Haizhen Huang, Furu Wei, Weiwei Deng, Feng Sun, Qi Zhang | [Github](https://github.com/Raibows/Learn-to-Reason) <br> [Paper](https://aclanthology.org/2023.emnlp-main.120.pdf)
    • Leveraging Zero-Shot Prompting for Efficient Language Model Distillation
    • Post-Semantic-Thinking: A Robust Strategy to Distill Reasoning Capacity from Large Language Models
    • Distilling Instruction-following Abilities of Large Language Models with Task-aware Curriculum Planning
    • [RDRec: Rationale Distillation for LLM-based Recommendation](https://arxiv.org/abs/2405.10587) (ACL'24) <br> Xinfeng Wang, Jin Cui, Yoshimi Suzuki, Fumiyo Fukumoto | [Github](https://github.com/WangXFng/RDRec) <br> [Paper](https://arxiv.org/abs/2405.10587)
  • Network Pruning

    • [Multilingual Brain Surgeon: Large Language Models Can be Compressed Leaving No Language Behind](https://arxiv.org/abs/2404.04748) <br> Hongchuan Zeng, Hongshen Xu, Lu Chen, Kai Yu | [Github](https://github.com/X-LANCE/MBS) <br> [Paper](https://arxiv.org/abs/2404.04748)
    • [Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models](https://openreview.net/forum?id=Tr0lPx9woF) (ICLR'24) <br> Yingtao Zhang, Haoli Bai, Haokun Lin, Jialin Zhao, Lu Hou, Carlo Vittorio Cannistraci | [Github](https://github.com/biomedical-cybernetics/Relative-importance-and-activation-pruning) <br> [Paper](https://openreview.net/forum?id=Tr0lPx9woF)
    • Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding
    • Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment
    • Dependency-Aware Semi-Structured Sparsity of GLU Variants in Large Language Models
    • Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
    • [SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot](https://github.com/IST-DASLab/sparsegpt) (ICML'23, Unstructured) <br> Elias Frantar, Dan Alistarh | [Github](https://github.com/IST-DASLab/sparsegpt) <br> [Paper](https://arxiv.org/abs/2301.00774)
    • [LLM-Pruner: On the Structural Pruning of Large Language Models](https://arxiv.org/abs/2305.11627) (NeurIPS'23, Structural) <br> Xinyin Ma, Gongfan Fang, Xinchao Wang | [Github](https://github.com/horseee/LLM-Pruner) <br> [Paper](https://arxiv.org/abs/2305.11627)
    • [The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter](https://arxiv.org/abs/2306.03805) (NeurIPS'23, Unstructured) <br> Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang | [Github](https://github.com/VITA-Group/essential_sparsity) <br> [Paper](https://arxiv.org/abs/2306.03805)
    • [Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity](https://arxiv.org/abs/2309.10285) (VLDB'24, Unstructured) <br> Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song | [Github](https://github.com/AlibabaResearch/flash-llm) <br> [Paper](https://arxiv.org/abs/2309.10285)
    • [NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models](https://arxiv.org/abs/2310.10054) (EMNLP'23 Findings, Structural) <br> Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun | [Github](https://github.com/jongwooko/NASH-Pruning-Official) <br> [Paper](https://arxiv.org/abs/2310.10054)
    • [A Simple and Effective Pruning Approach for Large Language Models](https://arxiv.org/abs/2306.11695) <br> Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter | [Github](https://github.com/locuslab/wanda) <br> [Paper](https://arxiv.org/abs/2306.11695)
    • ![Type - KICK.png"> |[Paper](https://arxiv.org/abs/2310.01382)|
    • ![Star - Group/Junk_DNA_Hypothesis)[![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]()<br>[Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity](https://arxiv.org/abs/2310.02277) <br> Lu Yin, Shiwei Liu, Ajay Jaiswal, Souvik Kundu, Zhangyang Wang |<img width="1002" alt="image" src="figures/junk_DNA.png"> |[Github](https://github.com/VITA-Group/Junk_DNA_Hypothesis) <br> [Paper](https://arxiv.org/abs/2310.02277)|
    • ![Star - C2A4A6)]() <br>[Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity](https://arxiv.org/abs/2310.05175) <br> Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, Shiwei Liu |<img width="1002" alt="image" src="https://github.com/luuyin/OWL/blob/main/Images/Layer_wise_sparsity.png"> |[Github](https://github.com/luuyin/OWL) <br> [Paper](https://arxiv.org/abs/2310.05175)|
    • ![Star - nlp/LLM-Shearing) [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning](https://arxiv.org/abs/2310.06694) <br> Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen |<img width="1002" alt="image" src="figures/LLM-shearing.png"> |[Github](https://github.com/princeton-nlp/LLM-Shearing) <br> [Paper](https://arxiv.org/abs/2310.06694)|
    • ![Star - DASLab/SparseFinetuning) [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[Sparse Finetuning for Inference Acceleration of Large Language Models](https://arxiv.org/abs/2310.06927) <br> Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh |<img width="1002" alt="image" src="figures/SquareHead.png"> |[Github](https://github.com/IST-DASLab/SparseFinetuning) <br> [Paper](https://arxiv.org/abs/2310.06927)|
    • ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models <br> Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar |<img width="1002" alt="image" src="figures/relufication.png"> |[Paper](https://arxiv.org/abs/2310.04564)|
    • The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning <br> Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite |<img width="1002" alt="image" src="figures/recall_and_icl.png"> |[Paper](https://arxiv.org/abs/2310.04680)|
    • ![Star - C2A4A6)]() <br>[Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs](https://arxiv.org/abs/2310.08915) <br> Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji |<img width="202" alt="image" src="figures/DSOT.png"> |[Github](https://github.com/zyxxmu/DSnoT) <br> [Paper](https://arxiv.org/abs/2310.08915)|
    • One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models <br> Hang Shao, Bei Liu, Yanmin Qian |<img width="202" alt="image" src="figures/sensitivity_sparse.png"> |[Paper](https://arxiv.org/abs/2310.09499)|
    • ![Star - C2A4A6)]() <br>[LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery](https://arxiv.org/abs/2310.18356) <br> Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang |<img width="1002" alt="image" src="figures/LoRAShear.png"> |[Github](https://github.com/microsoft/lorashear) <br> [Paper](https://arxiv.org/abs/2310.18356)|
    • ![Star - Alpha/Divergent_Tokens) [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization](https://arxiv.org/abs/2311.01544) <br> Björn Deiseroth, Max Meuer, Nikolas Gritsch, Constantin Eichenberg, Patrick Schramowski, Matthias Aßenmacher, Kristian Kersting |<img width="1002" alt="image" src="figures/FDT.png"> |[Github](https://github.com/Aleph-Alpha/Divergent_Tokens) <br> [Paper](https://arxiv.org/abs/2311.01544)|
    • ![Star - Lab/GBLM-Pruner) [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br>[Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models](https://arxiv.org/abs/2311.04902) <br> Rocktim Jyoti Das, Liqun Ma, Zhiqiang Shen |<img width="1002" alt="image" src="figures/GBLM-Pruner.png"> |[Github](https://github.com/VILA-Lab/GBLM-Pruner) <br> [Paper](https://arxiv.org/abs/2311.04902)|
    • [E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity](https://arxiv.org/abs/2310.15929) <br> Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang |<img width="1002" alt="image" src="figures/e-sparse.png"> |[Paper](https://arxiv.org/abs/2310.15929)|
    • ![Star - IOL/PERP) [![Type](https://img.shields.io/badge/Semi-structured-C2A4A6)]() <br>[PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs](https://arxiv.org/abs/2312.15230) <br> Max Zimmer, Megi Andoni, Christoph Spiegel, Sebastian Pokutta |<img width="1002" alt="image" src="figures/PERP.png"> |[Github](https://github.com/ZIB-IOL/PERP) <br> [Paper](https://arxiv.org/abs/2312.15230)|
    • ![Star - compbio/admm-pruning)<br>[Fast and Optimal Weight Update for Pruned Large Language Models](https://arxiv.org/abs/2401.02938) [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br> Vladimír Boža |<img width="202" alt="image" src="figures/admm.png"> |[Github](https://github.com/fmfi-compbio/admm-pruning) <br> [Paper](https://arxiv.org/abs/2401.02938)|
    • ![Star - safety) [![Type](https://img.shields.io/badge/Unstructured-C2A4A6)]() <br>[Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning](https://arxiv.org/abs/2401.10862) <br> Adib Hasan, Ileana Rugina, Alex Wang |<img width="1002" alt="image" src="figures/eval_safety.png"> |[Github](https://github.com/CrystalEye42/eval-safety) <br> [Paper](https://arxiv.org/abs/2401.10862)|
    • ReLU2 Wins: Discovering Efficient Activation Functions for Sparse LLMs
    • ![Star - Qi Hu, Sichao Liu, Shangling Jui, Kai Han, Yunhe Wang |<img width="1002" alt="image" src="https://github.com/YuchuanTian/RethinkTinyLM/blob/master/fig/improve.png"> |[Github](https://github.com/YuchuanTian/RethinkTinyLM) <br> [Paper](https://arxiv.org/abs/2402.02791)|
    • ![Star - Francois Kagey, Virginia Smith, Graham Neubig, Ameet Talwalkar |<img width="1002" alt="image" src="figures/bonsai.png"> |[Github](https://github.com/ldery/Bonsai) <br> [Paper](https://arxiv.org/abs/2402.05406)|
    • ![Star - attribution-code)<br>[Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications](https://arxiv.org/abs/2402.05162) <br> Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson |<img width="1002" alt="image" src="https://boyiwei.com/alignment-attribution/static/images/main.png"> |[Github](https://github.com/boyiwei/alignment-attribution-code) <br> [Paper](https://arxiv.org/abs/2402.05162) <br> [Project](https://boyiwei.com/alignment-attribution/)|
    • ![Star - C2A4A6)]()<br>[SliceGPT: Compress Large Language Models by Deleting Rows and Columns](https://arxiv.org/abs/2401.15024) <br> Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, James Hensman |<img width="1002" alt="image" src="figures/SliceGPT.png"> |[Github](https://github.com/microsoft/TransformerCompression) <br> [Paper](https://arxiv.org/abs/2401.15024)|
    • Efficient Pruning of Large Language Model with Adaptive Estimation Fusion
    • ![Star - BESA)<br>[BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation](https://arxiv.org/pdf/2402.16880.pdf) <br> Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="https://arxiv.org/html/2402.16880v1/x1.png"> |[Github](https://github.com/OpenGVLab/LLMPrune-BESA) <br> [Paper](https://arxiv.org/pdf/2402.16880.pdf)|
    • LaCo: Large Language Model Pruning via Layer Collapse
    • ![Star - Song/sparse_gpu_operator)<br>[ProSparse: Introducing and Enhancing Intrinsic Activation Sparsity within Large Language Models](https://arxiv.org/abs/2402.13516) <br> Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, Maosong Sun |<img width="1002" alt="image" src="https://arxiv.org/html/2402.13516v1/x1.png"> |[Github](https://github.com/Raincleared-Song/sparse_gpu_operator) <br> [Paper](https://arxiv.org/abs/2402.13516) <br> [[Model-7B]](https://huggingface.co/SparseLLM/prosparse-llama-2-7b) [[Model-13B]](https://huggingface.co/SparseLLM/prosparse-llama-2-13b)|
    • ![Star - Wise Fine-Tuning for Sparse LLMs](https://arxiv.org/abs/2402.12419) <br> Song Guo, Fan Wu, Lei Zhang, Xiawu Zheng, Shengchuan Zhang, Fei Chao, Yiyu Shi, Rongrong Ji |<img width="1002" alt="image" src="figures/EBFT.png"> |[Github](https://github.com/sunggo/EBFT) <br> [Paper](https://arxiv.org/abs/2402.12419)|
    • ![Star - NetsPresso/shortened-llm) [![Publish](https://img.shields.io/badge/Workshop-ICLRW'24-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]() <br> [Shortened LLaMA: A Simple Depth Pruning for Large Language Models](https://arxiv.org/abs/2402.02834) <br> Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song |<img width="1002" alt="image" src="figures/ShortenedLLaMA.png"> |[Github](https://github.com/Nota-NetsPresso/shortened-llm)<br>[Paper](https://arxiv.org/abs/2402.02834)|
    • NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
    • Learn To be Efficient: Build Structured Sparsity in Large Language Models
    • ![Star - dev/SLEB)<br>[SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks](https://arxiv.org/abs/2402.09025) <br> Jiwon Song, Kyungseok Oh, Taesu Kim, Hyungjun Kim, Yulhwa Kim, Jae-Joon Kim |<img width="1002" alt="image" src="figures/SLEB.png"> |[Github](https://github.com/leapingjagg-dev/SLEB) <br> [Paper](https://arxiv.org/abs/2402.09025)|
    • HiRE: High Recall Approximate Top-k Estimation for Efficient LLM Inference
    • ![Star - IVA-Lab/FLAP)[![Publish](https://img.shields.io/badge/Conference-AAAI'24-blue)]() [![Type](https://img.shields.io/badge/Structural-C2A4A6)]()<br>[Fluctuation-based Adaptive Structured Pruning for Large Language Models](https://arxiv.org/abs/2312.11983) <br> Yongqi An, Xu Zhao, Tao Yu, Ming Tang, Jinqiao Wang |<img width="1002" alt="image" src="https://github.com/CASIA-IVA-Lab/FLAP/raw/main/figures/overview.png"> |[Github](https://github.com/CASIA-IVA-Lab/FLAP) <br> [Paper](https://arxiv.org/abs/2312.11983)|
    • ![Star - comp-trust/comp-trust) [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression](https://arxiv.org/abs/2403.15447) <br> Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng Li, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian Bartoldson, Ajay Jaiswal, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li |<img width="1002" alt="image" src="https://arxiv.org/html/2403.15447v1/extracted/5477136/fig/teaser.png"> |[Github](https://github.com/decoding-comp-trust/comp-trust) <br> [Paper](https://arxiv.org/abs/2403.15447) <br> [Project](https://decoding-comp-trust.github.io) |
    • Compressing Large Language Models by Streamlining the Unimportant Layer
    • LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
    • ShortGPT: Layers in Large Language Models are More Redundant Than You Expect
    • LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models
    • CATS: Contextually-Aware Thresholding for Sparsity in Large Language Models - Yong Lee, Donghyun Lee, Genghan Zhang, Mo Tiwari, Azalia Mirhoseini |<img width="1002" alt="image" src="https://arxiv.org/html/2404.08763v1/x5.png"> |[Paper](https://arxiv.org/abs/2404.08763)|
    • ![Star - Pruner)[![Publish](https://img.shields.io/badge/Conference-NAACL'24%20Findings-blue)]()<br>[Pruning as a Domain-specific LLM Extractor](https://arxiv.org/abs/2405.06275) <br> Nan Zhang, Yanchi Liu, Xujiang Zhao, Wei Cheng, Runxue Bao, Rui Zhang, Prasenjit Mitra, Haifeng Chen |<img width="1002" alt="image" src="https://github.com/psunlpgroup/D-Pruner/raw/main/assets/prune_types_example.png"> |[Github](https://github.com/psunlpgroup/D-Pruner) <br> [Paper](https://arxiv.org/abs/2405.06275)|
    • ![Star - specific-pruning)[![Publish](https://img.shields.io/badge/Conference-UNLP'24-blue)]()<br>[Language-Specific Pruning for Efficient Reduction of Large Language Models](https://aclanthology.org/2024.unlp-1.16/) <br> Maksym Shamrai | |[Github](https://github.com/mshamrai/language-specific-pruning) <br> [Paper](https://aclanthology.org/2024.unlp-1.16/)|
    • ![Star - v2)<br>[OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning](https://arxiv.org/abs/2405.05957) <br> Dan Qiao, Yi Su, Pinzheng Wang, Jing Ye, Wenjing Xie, Yuechi Zhou, Yuyang Ding, Zecheng Tang, Jikai Wang, Yixin Ji, Yue Wang, Pei Guo, Zechen Sun, Zikang Zhang, Juntao Li, Pingfu Chao, Wenliang Chen, Guohong Fu, Guodong Zhou, Qiaoming Zhu, Min Zhang |<img width="1002" alt="image" src="figures/OpenBA.png"> |[Github](https://github.com/OpenNLG/OpenBA-v2) <br> [Paper](https://arxiv.org/abs/2405.05957)|
  • Quantization

    • Increased LLM Vulnerabilities from Fine-tuning and Quantization
    • Lossless and Near-Lossless Compression for Foundation Models
    • ![Star - Quantization)<br>[How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study](https://arxiv.org/abs/2404.14047) <br> Wei Huang, Xudong Ma, Haotong Qin, Xingyu Zheng, Chengtao Lv, Hong Chen, Jie Luo, Xiaojuan Qi, Xianglong Liu, Michele Magno |<img width="1002" alt="image" src="https://arxiv.org/html/2404.14047v1/x1.png"> |[Github](https://github.com/Macaronlin/LLaMA3-Quantization) <br> [Paper](https://arxiv.org/abs/2404.14047) <br> [Model](https://huggingface.co/LLMQ)|
    • ![Star - lm-confidence)[![Publish](https://img.shields.io/badge/Conference-NAACL'24%20Findings-blue)]()<br>[When Quantization Affects Confidence of Large Language Models?](https://arxiv.org/abs/2405.00632) <br> Irina Proskurina, Luc Brun, Guillaume Metzler, Julien Velcin |<img width="1002" alt="image" src="figures/quantized-lm-confidence.png"> |[Github](https://github.com/upunaprosk/quantized-lm-confidence) <br> [Paper](https://arxiv.org/abs/2405.00632)|
    • ![Star - han-lab/qserve)<br>[QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving](https://arxiv.org/abs/2405.04532) <br> Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/qserve/blob/main/assets/figures/teaser.png"> |[Github](https://github.com/mit-han-lab/qserve) <br> [Paper](https://arxiv.org/abs/2405.04532)|
    • ![Star - DASLab/gptq)[![Publish](https://img.shields.io/badge/Conference-ICLR'23-blue)]()<br>[GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers](https://arxiv.org/abs/2210.17323) <br> Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh |<img width="202" alt="image" src="figures/GPTQ.png"> |[Github](https://github.com/IST-DASLab/gptq) <br> [Paper](https://arxiv.org/abs/2210.17323)|
    • ![Star - han-lab/smoothquant)[![Publish](https://img.shields.io/badge/Conference-ICML'23-blue)]() <br>[SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2211.10438) <br> Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/smoothquant/blob/main/figures/intuition.png"> |[Github](https://github.com/mit-han-lab/smoothquant) <br> [Paper](https://arxiv.org/abs/2211.10438)|
    • ![Star - NeurIPS'23-blue)]() <br>[QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) <br> Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer | ![](figures/qlora.png) | [Github](https://github.com/artidoro/qlora) <br> [Paper](https://arxiv.org/abs/2305.14314) |
    • ![Star - chee/QuIP) [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() <br>[QuIP: 2-Bit Quantization of Large Language Models With Guarantees](https://arxiv.org/abs/2307.13304) <br> Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa |<img width="302" alt="image" src="figures/QuIP.png"> |[Github](https://github.com/jerry-chee/QuIP) <br> [Paper](https://arxiv.org/abs/2307.13304)|
    • ![Star - AI-research/outlier-free-transformers) [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]() <br>[Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing](https://arxiv.org/abs/2306.12929) <br> Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort | ![](figures/QT.png) | [Github](https://github.com/Qualcomm-AI-research/outlier-free-transformers) <br> [Paper](https://arxiv.org/abs/2306.12929) |
    • ![Star - FP4)[![Publish](https://img.shields.io/badge/Conference-EMNLP'23-blue)]()<br>[LLM-FP4: 4-Bit Floating-Point Quantized Transformers](https://arxiv.org/abs/2310.16836) <br> Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng |<img width="1002" alt="image" src="figures/LLM-FP4.png"> |[Github](https://github.com/nbasyl/LLM-FP4) <br> [Paper](https://arxiv.org/abs/2310.16836)|
    • ![Star - Watermark)[![Publish](https://img.shields.io/badge/Conference-EMNLP'23%20Findings-blue)]()<br>[Watermarking LLMs with Weight Quantization](https://arxiv.org/abs/2310.11237) <br> Linyang Li, Botian Jiang, Pengyu Wang, Ke Ren, Hang Yan, Xipeng Qiu |<img width="1002" alt="image" src="figures/watermark_quant.png"> |[Github](https://github.com/Twilight92z/Quantize-Watermark) <br> [Paper](https://arxiv.org/abs/2310.11237)|
    • ![Star - han-lab/llm-awq) <br>[AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration](https://arxiv.org/abs/2306.00978) <br> Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/llm-awq/blob/main/figures/overview.png"> |[Github](https://github.com/mit-han-lab/llm-awq) <br> [Paper](https://arxiv.org/abs/2306.00978)|
    • ![Star - based Post-training Quantization for Large Language Models](https://arxiv.org/abs/2304.01089) <br> Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu | ![](https://github.com/hahnyuan/RPTQ4LLM/blob/master/ims/cover.png) | [Github](https://github.com/hahnyuan/RPTQ4LLM) <br> [Paper](https://arxiv.org/abs/2304.01089) |
    • ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation |[Paper](https://arxiv.org/abs/2303.08302)|
    • ![Star - and-Sparse Quantization](https://arxiv.org/pdf/2306.07629.pdf) <br>Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer | <img width="1102" alt="image" src="figures/SqueezeLLM.png"> |[Github](https://github.com/SqueezeAILab/SqueezeLLM) <br> [Paper](https://arxiv.org/pdf/2306.07629.pdf)|
    • Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
    • Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
    • LLM-QAT: Data-Free Quantization Aware Training for Large Language Models |[Paper](https://arxiv.org/abs/2305.17888)|
    • ![Star - Quantized Representation for Near-Lossless LLM Weight Compression](https://arxiv.org/abs/2306.03078) <br> Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh |<img width="1002" alt="image" src="figures/SpQR.png"> |[Github](https://github.com/Vahe1994/SpQR) <br> [Paper](https://arxiv.org/abs/2306.03078)|
    • ![Star - Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen |<img width="1002" alt="image" src="figures/QuantizedEmpirical.png"> |[Github](https://github.com/RUCAIBox/QuantizedEmpirical) <br> [Paper](https://arxiv.org/abs/2307.08072)|
    • ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats |[Paper](https://arxiv.org/abs/2307.09782)|
    • FPTQ: Fine-grained Post-Training Quantization for Large Language Models
    • QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm
    • Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
    • Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs |[Paper](https://arxiv.org/abs/2309.05516)|
    • ![Star - lora)<br>[QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2309.14717) <br> Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian |<img width="1002" alt="image" src="https://github.com/yuhuixu1993/qa-lora/blob/main/image/qalora.png"> |[Github](https://github.com/yuhuixu1993/qa-lora) <br> [Paper](https://arxiv.org/abs/2309.14717)|
    • ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
    • ![Star - LLM)<br>[PB-LLM: Partially Binarized Large Language Models](https://arxiv.org/abs/2310.00034) <br> Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong |<img width="1002" alt="image" src="figures/PB-LLM.png"> |[Github](https://github.com/hahnyuan/PB-LLM) <br> [Paper](https://arxiv.org/abs/2310.00034)|
    • Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
    • QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
    • QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
    • LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
    • TEQ: Trainable Equivalent Transformation for Quantization of LLMs |[Paper](https://arxiv.org/abs/2310.10944)|
    • BitNet: Scaling 1-bit Transformers for Large Language Models
    • Atom: Low-bit Quantization for Efficient and Accurate LLM Serving - Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci |<img width="302" alt="image" src="figures/atom.png"> |[Paper](https://arxiv.org/abs/2310.19102)|
    • AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
    • A Speed Odyssey for Deployable Quantization of LLMs
    • ![Star - lora)<br>[LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning](https://arxiv.org/abs/2311.12023) <br> Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim |<img width="1002" alt="image" src="figures/LQ-LoRA.png"> |[Github](https://github.com/HanGuo97/lq-lora) <br> [Paper](https://arxiv.org/abs/2311.12023)|
    • Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization |[Paper](https://arxiv.org/abs/2311.16442)|
    • ![Star - bit Post-Training WeightQuantization for LLM](https://arxiv.org/abs/2312.03788) <br> Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng |<img width="402" alt="image" src="figures/SmoothQuant+.png"> |[Github](https://github.com/adlik/smoothquant+) <br> [Paper](https://arxiv.org/abs/2312.03788)|
    • ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks |[Github](https://github.com/microsoft/DeepSpeed) <br> [Paper](https://arxiv.org/abs/2312.08583)|
    • ![Star - ICLR'24-blue)]()<br>[OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models](https://arxiv.org/abs/2308.13137) <br> Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="figures/omniquant.png"> |[Github](https://github.com/OpenGVLab/OmniQuant) <br> [Paper](https://arxiv.org/abs/2308.13137)|
    • L4Q: Parameter Efficient Quantization-Aware Training on Large Language Models via LoRA-wise LSQ - joon Kim |<img width="1002" alt="image" src="figures/L4Q.png"> |[Paper](https://arxiv.org/abs/2402.04902)|
    • ![Star - RelaxML/quip-sharp)<br>[QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks](https://arxiv.org/abs/2402.04396) <br> Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, Christopher De Sa |<img width="1002" alt="image" src="figures/QuIP_sign.png"> |[Github](https://github.com/Cornell-RelaxML/quip-sharp) <br> [Paper](https://arxiv.org/abs/2402.04396)|
    • ![Star - 778/BiLLM)<br>[BiLLM: Pushing the Limit of Post-Training Quantization for LLMs](https://arxiv.org/abs/2402.04291) <br> Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, Xiaojuan Qi |<img width="1002" alt="image" src="https://github.com/Aaronhuang-778/BiLLM/blob/main/imgs/main.png"> |[Github](https://github.com/Aaronhuang-778/BiLLM) <br> [Paper](https://arxiv.org/abs/2402.04291)|
    • ![Star - qlora)<br>[Accurate LoRA-Finetuning Quantization of LLMs via Information Retention](https://arxiv.org/abs/2402.05445) <br> Haotong Qin, Xudong Ma, Xingyu Zheng, Xiaoyang Li, Yang Zhang, Shouda Liu, Jie Luo, Xianglong Liu, Michele Magno |<img width="1002" alt="image" src="https://github.com/htqin/IR-QLoRA/blob/main/imgs/overview.png"> |[Github](https://github.com/htqin/ir-qlora) <br> [Paper](https://arxiv.org/abs/2402.05445)|
    • ApiQ: Finetuning of 2-Bit Quantized Large Language Model
    • FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design |[Paper](https://arxiv.org/abs/2401.14112)|
    • The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
    • ![Star - ai-research/gptvq)<br>[GPTVQ: The Blessing of Dimensionality for LLM Quantization](https://arxiv.org/abs/2402.15319) <br> Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, Paul Whatmough |<img width="1002" alt="image" src="https://arxiv.org/html/2402.15319v1/extracted/5412979/fig/new_fig1a_blue.png"> |[Github](https://github.com/qualcomm-ai-research/gptvq) <br> [Paper](https://arxiv.org/abs/2402.15319)|
    • A Comprehensive Evaluation of Quantization Strategies for Large Language Models
    • ![Star - nics/qllm-eval)<br>[Evaluating Quantized Large Language Models](https://arxiv.org/abs/2402.18158) <br> Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang |<img width="302" alt="image" src="figures/qllm-eval.png"> |[Github](https://github.com/thu-nics/qllm-eval) <br> [Paper](https://arxiv.org/abs/2402.18158)|
    • FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization
    • Towards Next-Level Post-Training Quantization of Hyper-Scale Transformers - young Kim, Joonyoung Kim, Yongkweon Jeon |<img width="1002" alt="image" src="https://arxiv.org/html/2402.08958v1/x1.png"> |[Paper](https://arxiv.org/abs/2402.08958)|
    • ![Star - Aware Training for the Acceleration of Lightweight LLMs on the Edge](https://arxiv.org/abs/2402.10787) <br> Xuan Shen, Zhenglun Kong, Changdi Yang, Zhaoyang Han, Lei Lu, Peiyan Dong, Cheng Lyu, Chih-hsiang Li, Xuehang Guo, Zhihao Shu, Wei Niu, Miriam Leeser, Pu Zhao, Yanzhi Wang |<img width="1002" alt="image" src="figures/EdgeQAT.png"> |[Github](https://github.com/shawnricecake/EdgeQAT) <br> [Paper](https://arxiv.org/abs/2402.10787)|
    • ![Star - DuDa/BitDistiller)<br>[BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation](https://arxiv.org/abs/2402.10631) <br> Dayou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, Ningyi Xu |<img width="202" alt="image" src="https://github.com/DD-DuDa/BitDistiller/raw/main/imgs/overview.jpg"> |[Github](https://github.com/DD-DuDa/BitDistiller) <br> [Paper](https://arxiv.org/abs/2402.10631)|
    • ![Star - bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points](https://arxiv.org/abs/2404.12759) <br> Yi Guo, Fanliu Kong, Xiaoyang Li, Hui Li, Wei Chen, Xiaogang Tian, Jinping Cai, Yang Zhang, Shouda Liu |<img width="1002" alt="image" src="https://github.com/bytedance/decoupleQ/raw/main/imgs/img.png"> |[Github](https://github.com/bytedance/decoupleQ) <br> [Paper](https://arxiv.org/abs/2404.12759)|
    • OneBit: Towards Extremely Low-bit Large Language Models
    • ![Star - Tune May Only Be Worth One Bit](https://arxiv.org/abs/2402.10193) <br> James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao, Tianle Cai |<img width="1002" alt="image" src="https://github.com/FasterDecoding/BitDelta/raw/main/figures/BitDelta.png"> |[Github](https://github.com/FasterDecoding/BitDelta) <br> [Paper](https://arxiv.org/abs/2402.10193)|
    • Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
    • Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models
    • ![Star - Free 4-Bit Inference in Rotated LLMs](https://arxiv.org/abs/2404.00456) <br> Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman |<img width="1002" alt="image" src="https://github.com/spcl/QuaRot/blob/main/img/fig1.png"> |[Github](https://github.com/spcl/QuaRot) <br> [Paper](https://arxiv.org/abs/2404.00456)|
    • Accurate Block Quantization in LLMs with Outliers
    • ![Star - ICLR'24-blue)]()<br>[AffineQuant: Affine Transformation Quantization for Large Language Models](https://arxiv.org/abs/2403.12544) <br> Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, Rongrong Ji |<img width="1002" alt="image" src="https://github.com/bytedance/AffineQuant/blob/main/fig/overview.png"> |[Github](https://github.com/bytedance/AffineQuant) <br> [Paper](https://arxiv.org/abs/2403.12544)|
    • What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation
    • FrameQuant: Flexible Low-Bit Quantization for Transformers
    • [QAQ: Quality Adaptive Quantization for LLM KV Cache](https://arxiv.org/abs/2403.04643) <br> Shichen Dong, Wen Cheng, Jiayu Qin, Wei Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2403.04643v1/x1.png"> |[Github](https://github.com/ClubieDong/QAQ-KVCacheQuantization) <br> [Paper](https://arxiv.org/abs/2403.04643)|
    • Quantization of Large Language Models with an Overdetermined Basis
    • ![Star - window Key and Value Cache Quantization for Large Language Models](https://arxiv.org/abs/2405.06219) <br> Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin |<img width="1002" alt="image" src="figures/SKVQ.png"> |[Github](https://github.com/cat538/SKVQ) <br> [Paper](https://arxiv.org/abs/2405.06219)|
    • ![Star - QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models](https://arxiv.org/abs/2405.06001) <br> Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Yunchen Zhang, Xianglong Liu, Dacheng Tao |<img width="1002" alt="image" src="https://github.com/ModelTC/llmc/raw/main/imgs/best_practice.png"> |[Github](https://github.com/ModelTC/llmc) <br> [Paper](https://arxiv.org/abs/2405.06001)|
  • Efficient MOE

    • Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
    • SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts
    • Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models
    • ![Publish - of-Experts Training via Whole Graph Computation-Communication Overlapping](https://arxiv.org/abs/2404.19429) <br> Chenyu Jiang, Ye Tian, Zhen Jia, Shuai Zheng, Chuan Wu, Yida Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2404.19429v1/x4.png"> |[Paper](https://arxiv.org/abs/2404.19429)|
    • SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
    • ![Star - of-Experts Attention](https://arxiv.org/abs/2312.07987) <br> Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber |<img width="1002" alt="image" src="figures/switchhead.png"> |[Github](https://github.com/robertcsordas/moe_attention) <br> [Paper](https://arxiv.org/abs/2312.07987)|
    • ![Star - Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference](https://arxiv.org/abs/2401.08383) <br> Jinghan Yao, Quentin Anthony, Aamir Shafi, Hari Subramoni, Dhabaleswar K. (DK) Panda |<img width="1002" alt="image" src="figures/exflow.png"> |[Github](https://github.com/YJHMITWEB/ExFlow) <br> [Paper](https://arxiv.org/abs/2401.08383)|
    • [MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving](https://arxiv.org/abs/2401.14361) <br> Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, Mahesh Marina |<img width="1002" alt="image" src="figures/MOE-Infinity.png"> |[Github](https://github.com/TorchMoE/MoE-Infinity) <br> [Paper](https://arxiv.org/abs/2401.14361)|
    • [Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models](https://arxiv.org/abs/2402.14800) <br> Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, Hongsheng Li |<img width="1002" alt="image" src="https://arxiv.org/html/2402.14800v1/x2.png"> |[Github](https://github.com/Lucky-Lance/Expert_Sparsity) <br> [Paper](https://arxiv.org/abs/2402.14800)|
    • ![Star - prompted Mixture of Experts for Efficient LLM Generation](https://arxiv.org/abs/2404.01365) <br> Harry Dong, Beidi Chen, Yuejie Chi |<img width="1002" alt="image" src="https://arxiv.org/html/2404.01365v1/extracted/5509263/figures/algorithm.png"> |[Github](https://github.com/hdong920/GRIFFIN) <br> [Paper](https://arxiv.org/abs/2404.01365)|
    • ![Star - GPU Orchestration for Fast Inference of Mixture-of-Experts Models](https://arxiv.org/abs/2402.07033) <br> Keisuke Kamahori, Yile Gu, Kan Zhu, Baris Kasikci |<img width="1002" alt="image" src="https://github.com/efeslab/fiddler/blob/main/asset/key-idea.png"> |[Github](https://github.com/efeslab/fiddler) <br> [Paper](https://arxiv.org/abs/2402.07033)|
    • Enhancing Efficiency in Sparse Models with Sparser Selection
  • Text Compression

  • Hardware/System

  • Paper from May 26, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

    • Please check out all the papers by selecting the sub-area you're interested in. On this page, we're showing papers released in the past 60 days.

      • [![Publish](https://img.shields.io/badge/Conference-ACL24'Findings-blue)]()<br>[Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations](https://arxiv.org/abs/2407.05690) <br> Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.05690v1/x2.png"> |[Github](https://github.com/sbwww/TransAct-pruning) <br> [Paper](https://arxiv.org/abs/2407.05690)|[//]: #07/10
      • [![Type](https://img.shields.io/badge/w/Quantization-39B0A9)]() <br>[Beyond Perplexity: Multi-dimensional Safety Evaluation of LLM Compression](https://arxiv.org/abs/2407.04965) <br> Zhichao Xu, Ashim Gupta, Tao Li, Oliver Bentham, Vivek Srikumar | |[Github](https://github.com/zhichaoxu-shufe/Beyond-Perplexity-Compression-Safety-Eval) <br> [Paper](https://arxiv.org/abs/2407.04965)|[//]: #07/10
      • ![Star - tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization](https://arxiv.org/abs/2407.08044) <br> Xijie Huang, Zechun Liu, Shih-Yang Liu, Kwang-Ting Cheng |<img width="1002" alt="image" src="https://arxiv.org/html/2407.08044v1/x1.png"> |[Github](https://github.com/HuangOwen/RoLoRA) <br> [Paper](https://arxiv.org/abs/2407.08044)|[//]: #07/12
      • ![Star - Quantized LLMs](https://arxiv.org/abs/2407.10960) <br> Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim |<img width="302" alt="image" src="https://arxiv.org/html/2407.10960v1/x1.png"> |[Github](https://github.com/HanGuo97/flute) <br> [Paper](https://arxiv.org/abs/2407.10960)|[//]: #07/16
      • [FBI-LLM: Scaling Up Fully Binarized LLMs from Scratch via Autoregressive Distillation](https://arxiv.org/abs/2407.07093) <br> Liqun Ma, Mingjie Sun, Zhiqiang Shen |<img width="1002" alt="image" src="https://github.com/LiqunMa/FBI-LLM/blob/main/figures/structure_and_training_procedure.png"> |[Github](https://github.com/LiqunMa/FBI-LLM) <br> [Paper](https://arxiv.org/abs/2407.07093)|[//]: #07/10
      • ![Star - Tuning of Quantized Large Language Models Through Optimal Balance](https://arxiv.org/abs/2407.17029) <br> Ao Shen, Qiang Wang, Zhiquan Lai, Xionglve Li, Dongsheng Li |<img width="1002" alt="image" src="figures/Q-BaRA.png"> |[Github](https://github.com/xiaocaigou/qbaraqahira) <br> [Paper](https://arxiv.org/abs/2407.17029)|[//]: #07/26
      • [![Publish](https://img.shields.io/badge/Conference-ICML'24%20WANT-blue)]()<br>[Scalify: scale propagation for efficient low-precision LLM training](https://arxiv.org/abs/2407.17353) <br> Paul Balança, Sam Hosegood, Carlo Luschi, Andrew Fitzgibbon | |[Github](https://github.com/graphcore-research/jax-scalify) <br> [Paper](https://arxiv.org/abs/2407.17353)|[//]: #07/26
      • ![Star - grained Pruning for Large Language Models](https://arxiv.org/abs/2406.10594) <br> Longguang Zhong, Fanqi Wan, Ruijun Chen, Xiaojun Quan, Liangzhi Li |<img width="1002" alt="image" src="https://arxiv.org/html/2406.10594v2/x3.png"> |[Github](https://github.com/MrGGLS/BlockPruner) <br> [Paper](https://arxiv.org/abs/2406.10594)|[//]: #07/05
      • [T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge](https://arxiv.org/abs/2407.00088) <br> Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.00088v1/x2.png"> |[Github](https://github.com/microsoft/T-MAC) <br> [Paper](https://arxiv.org/abs/2407.00088)|[//]: #07/03
      • [LiveMind: Low-latency Large Language Models with Simultaneous Inference](https://arxiv.org/abs/2406.14319) <br> Chuangtao Chen, Grace Li Zhang, Xunzhao Yin, Cheng Zhuo, Ulf Schlichtmann, Bing Li |<img width="1002" alt="image" src="https://arxiv.org/html/2406.14319v1/x1.png"> |[Github](https://github.com/ChuangtaoChen-TUM/LiveMind) <br> [Paper](https://arxiv.org/abs/2406.14319)|[//]: #07/05
      • [Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs](https://arxiv.org/abs/2407.00945) <br> Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B. Blaschko, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2407.00945v1/extracted/5697370/Figures/use_case.png"> |[Github](https://github.com/imagination-research/EEP) <br> [Paper](https://arxiv.org/abs/2407.00945)|[//]: #07/03
      • ![Star - ACL'24%20Findings-blue)]()<br>[Efficient Sparse Attention needs Adaptive Token Release](https://arxiv.org/abs/2407.02328) <br> Chaoran Zhang, Lixin Zou, Dan Luo, Min Tang, Xiangyang Luo, Zihao Li, Chenliang Li |<img width="1002" alt="image" src="https://arxiv.org/html/2407.02328v1/x1.png"> |[Github](https://github.com/WHUIR/ADORE) <br> [Paper](https://arxiv.org/abs/2407.02328)|[//]: #07/05
      • ![Star - Neng Chuang, Songchen Li et al |<img width="1002" alt="image" src="figures/longctx_bench.png"> |[Github](https://github.com/henryzhongsc/longctx_bench) <br> [Paper](https://arxiv.org/abs/2407.01527)|[//]: #07/03
      • [![Publish](https://img.shields.io/badge/Conference-ICLR'24-blue)]()<br>[Increasing Model Capacity for Free: A Simple Strategy for Parameter Efficient Fine-tuning](https://arxiv.org/abs/2407.01320) <br> Haobo Song, Hao Zhao, Soumajit Majumder, Tao Lin |<img width="1002" alt="image" src="https://arxiv.org/html/2407.01320v1/x2.png"> |[Github](https://github.com/LINs-lab/CapaBoost) <br> [Paper](https://arxiv.org/abs/2407.01320)|[//]: #07/03
      • ![Star - Aware Training for Large Language Models](https://arxiv.org/abs/2407.11062) <br> Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo |<img width="1002" alt="image" src="https://arxiv.org/html/2407.11062v1/x5.png"> |[Github](https://github.com/OpenGVLab/EfficientQAT) <br> [Paper](https://arxiv.org/abs/2407.11062)|[//]: #07/21
      • ![Star - Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices](https://arxiv.org/abs/2407.11534) <br> Jung Hyun Lee, Jeonghoon Kim, June Yong Yang, Se Jung Kwon, Eunho Yang, Kang Min Yoo, Dongsoo Lee |<img width="1002" alt="image" src="https://arxiv.org/html/2407.11534v1/extracted/5734567/Figures/Fig_ablation_samplesize_flexround.png"> |[Github](https://github.com/onliwad101/FlexRound_LRQ) <br> [Paper](https://arxiv.org/abs/2407.11534)|[//]: #07/21
      • [GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression](https://arxiv.org/abs/2407.12077) <br> Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah |<img width="202" alt="image" src="https://github.com/recursal/GoldFinch-paper/raw/main/assets/architecture.png"> |[Github](https://github.com/recursal/GoldFinch-paper) <br> [Paper](https://arxiv.org/abs/2407.12077)|[//]: #07/21

  • Paper from May 26, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

    • Please check out all the papers by selecting the sub-area you're interested in. On this page, we're showing papers released in the past 30 days.

      • ![Star - based Contextual Sparsity for Large Language Models](https://arxiv.org/abs/2406.16635) <br> Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah |<img width="1002" alt="image" src="https://arxiv.org/html/2406.16635v1/x4.png"> |[Github](https://github.com/abdelfattah-lab/shadow_llm/) <br> [Paper](https://arxiv.org/abs/2406.16635)|[//]: #06/26
      • ![Star - Wise Quantization: A Simple and Effective Approach to Quantize LLMs](https://arxiv.org/abs/2406.17415) <br> Razvan-Gabriel Dumitru, Vikas Yadav, Rishabh Maheshwary, Paul-Ioan Clotan, Sathwik Tejaswi Madhusudhan, Mihai Surdeanu |<img width="202" alt="image" src="https://arxiv.org/html/2406.17415v1/x1.png"> |[Github](https://github.com/RazvanDu/LayerwiseQuant) <br> [Paper](https://arxiv.org/abs/2406.17415)|[//]: #06/26
      • ![Star - Bit Quantization for Large Language Models](https://arxiv.org/abs/2406.09904) <br> Ying Zhang, Peng Zhang, Mincong Huang, Jingyang Xiang, Yujie Wang, Chao Wang, Yineng Zhang, Lei Yu, Chuan Liu, Wei Lin |<img width="202" alt="image" src="https://arxiv.org/html/2406.09904v1/x1.png"> |[Github](https://github.com/HandH1998/QQQ) <br> [Paper](https://arxiv.org/abs/2406.09904)|[//]: #06/18
      • [MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression](https://arxiv.org/abs/2406.14909) <br> Tianyu Fu, Haofeng Huang, Xuefei Ning, Genghan Zhang, Boju Chen, Tianqi Wu, Hongyi Wang, Zixiao Huang, Shiyao Li, Shengen Yan, Guohao Dai, Huazhong Yang, Yu Wang |<img width="1002" alt="image" src="https://github.com/thu-nics/MoA/blob/master/assets/workflow.png"> |[Github](https://github.com/thu-nics/MoA) <br> [Paper](https://arxiv.org/abs/2406.14909)|[//]: #06/26
      • [Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark](https://arxiv.org/abs/2406.08155) <br> Pingzhi Li, Xiaolong Jin, Yu Cheng, Tianlong Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2406.08155v1/x1.png"> |[Github](https://github.com/UNITES-Lab/moe-quantization) <br> [Paper](https://arxiv.org/abs/2406.08155)|[//]: #06/18
      • [MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding](https://arxiv.org/abs/2406.09297) <br> Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji |<img width="1002" alt="image" src="https://arxiv.org/html/2406.09297v1/extracted/5665367/resources/mlkv-All_KV.png"> |[Github](https://github.com/zaydzuhri/pythia-mlkv) <br> [Paper](https://arxiv.org/abs/2406.09297)|[//]: #06/18
      • [![Publish](https://img.shields.io/badge/Conference-DAC'24-blue)]()<br>[EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting](https://arxiv.org/abs/2406.15758) <br> Zhongzhi Yu, Zheng Wang, Yuhan Li, Haoran You, Ruijie Gao, Xiaoya Zhou, Sreenidhi Reedy Bommu, Yang Katie Zhao, Yingyan Celine Lin |<img width="1002" alt="image" src="https://github.com/GATECH-EIC/Edge-LLM/blob/main/images/Edge-LLM-overview.png"> |[Github](https://github.com/GATECH-EIC/Edge-LLM) <br> [Paper](https://arxiv.org/abs/2406.15758)|[//]: #06/26
      • ![Star - Matching Distillation of Large Language Models](https://arxiv.org/abs/2406.02959) <br> Chen Jia |<img width="1002" alt="image" src="https://arxiv.org/html/2406.02959v1/x1.png"> |[Github](https://github.com/jiachenwestlake/MMKD) <br> [Paper](https://arxiv.org/abs/2406.02959)|[//]: #06/11
      • [![Publish](https://img.shields.io/badge/Conference-ICML'24-blue)]()<br>[Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models](https://arxiv.org/abs/2406.02924) <br> Peijie Dong, Lujun Li, Zhenheng Tang, Xiang Liu, Xinglin Pan, Qiang Wang, Xiaowen Chu |<img width="1002" alt="image" src="https://raw.githubusercontent.com/pprp/Pruner-Zero/main/.github/images/pruner-zero-main-figure.png"> |[Github](https://github.com/pprp/Pruner-Zero) <br> [Paper](https://arxiv.org/abs/2406.02924)|[//]: #06/11
      • [SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs](https://arxiv.org/abs/2405.16325) <br> Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi |<img width="1002" alt="image" src="https://arxiv.org/html/2405.16325v1/x1.png"> |[Github](https://github.com/Mohammad-Mozaffari/slope) <br> [Paper](https://arxiv.org/abs/2405.16325)| [//]: #05/29
      • [SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models](https://arxiv.org/abs/2405.16057) <br> Xudong Lu, Aojun Zhou, Yuhui Xu, Renrui Zhang, Peng Gao, Hongsheng Li |<img width="1002" alt="image" src="https://github.com/Lucky-Lance/SPP/raw/main/asserts/SPP.png"> |[Github](https://github.com/Lucky-Lance/SPP) <br> [Paper](https://arxiv.org/abs/2405.16057)| [//]: #05/29
      • [ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization](https://arxiv.org/abs/2406.05981) <br> Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin |<img width="1002" alt="image" src="https://github.com/GATECH-EIC/ShiftAddLLM/raw/main/assets/overview.jpg"> |[Github](https://github.com/GATECH-EIC/ShiftAddLLM) <br> [Paper](https://arxiv.org/abs/2406.05981)|[//]: #06/11
      • [Exploiting LLM Quantization](https://arxiv.org/abs/2405.18137) <br> Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev |<img width="1002" alt="image" src="figures/exploiting_llm_quantization.png"> |[Github](https://github.com/eth-sri/llm-quantization-attack) <br> [Paper](https://arxiv.org/abs/2405.18137)| [//]: #05/29
      • [SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models](https://arxiv.org/abs/2405.14917) <br> Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi | |[Github](https://github.com/Aaronhuang-778/SliM-LLM) <br> [Paper](https://arxiv.org/abs/2405.14917)| [//]: #05/29
      • [PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression](https://arxiv.org/abs/2405.14852) <br> Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik |<img width="1002" alt="image" src="figures/pv-tuning.png"> |[Github](https://github.com/Vahe1994/AQLM/tree/pv-tuning) <br> [Paper](https://arxiv.org/abs/2405.14852)| [//]: #05/29
      • [![Publish](https://img.shields.io/badge/Conference-ICML'24-blue)]()<br>[When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models](https://arxiv.org/abs/2406.07368) <br> Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin |<img width="1002" alt="image" src="https://arxiv.org/html/2406.07368v1/x5.png"> |[Github](https://github.com/GATECH-EIC/Linearized-LLM) <br> [Paper](https://arxiv.org/abs/2406.07368)|[//]: #06/12
      • [QuickLLaMA: Query-aware Inference Acceleration for Large Language Models](https://arxiv.org/abs/2406.07528) <br> Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia |<img width="1002" alt="image" src="https://github.com/dvlab-research/Q-LLM/raw/master/img/framework.png"> |[Github](https://github.com/dvlab-research/Q-LLM) <br> [Paper](https://arxiv.org/abs/2406.07528)|[//]: #06/12
      • [Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference](https://arxiv.org/abs/2405.18628) <br> Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan |<img width="1002" alt="image" src="https://github.com/hmarkc/parallel-prompt-decoding/raw/main/assets/Overview.png"> |[Github](https://github.com/hmarkc/parallel-prompt-decoding) <br> [Paper](https://arxiv.org/abs/2405.18628)| [//]: #05/31
      • ![Star - Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead](https://arxiv.org/abs/2406.03482) <br> Amir Zandieh, Majid Daliri, Insu Han |<img width="1002" alt="image" src="figures/QJL.png"> |[Github](https://github.com/amirzandieh/QJL) <br> [Paper](https://arxiv.org/abs/2406.03482)|[//]: #06/11
      • [![Publish](https://img.shields.io/badge/Conference-ACL'24%20Findings-blue)]()<br>[Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning](https://arxiv.org/abs/2406.03792) <br> Naibin Gu, Peng Fu, Xiyu Liu, Bowen Shen, Zheng Lin, Weiping Wang |<img width="1002" alt="image" src="https://arxiv.org/html/2406.03792v1/x5.png"> |[Github](https://github.com/gccnlp/Light-PEFT) <br> [Paper](https://arxiv.org/abs/2406.03792)|[//]: #06/12
      • [Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models](https://arxiv.org/abs/2405.14297) <br> Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Tao Lin |<img width="1002" alt="image" src="figures/dynmoe.png"> |[Github](https://github.com/LINs-lab/DynMoE) <br> [Paper](https://arxiv.org/abs/2405.14297)| [//]: #05/29
      • [Block Transformer: Global-to-Local Language Modeling for Fast Inference](https://arxiv.org/abs/2406.02657) <br> Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun |<img width="1002" alt="image" src="https://arxiv.org/html/2406.02657v1/x1.png"> |[Github](https://github.com/itsnamgyu/block-transformer) <br> [Paper](https://arxiv.org/abs/2406.02657)|[//]: #06/12
  • Paper from July 13, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

    • Quantization

      • LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid
      • [SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs](https://arxiv.org/abs/2410.09615) <br> Mohammad Mozaffari, Maryam Mehri Dehnavi |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09615v1/x1.png"> |[Github](https://github.com/Mohammad-Mozaffari/slim) <br> [Paper](https://arxiv.org/abs/2410.09615)|[//]: #10/21
      • ![Star - Aware Post-Training Weight-Only Quantization For LLMs](https://arxiv.org/abs/2410.12187) <br> Yingsong Luo, Ling Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.12187v2/x1.png"> |[Github](https://github.com/LuoYingSong/DAQ) <br> [Paper](https://arxiv.org/abs/2410.12187)|[//]: #10/21
      • [Quamba: A Post-Training Quantization Recipe for Selective State Space Models](https://arxiv.org/abs/2410.13229) <br> Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Diana Marculescu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.13229v1/extracted/5933363/figures/outliers.png"> |[Github](https://github.com/enyac-group/Quamba) <br> [Paper](https://arxiv.org/abs/2410.13229)|[//]: #10/21
      • Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization
      • ![Star - bit AI Infra: Part 1.1, Fast and Lossless BitNet b1.58 Inference on CPUs](https://arxiv.org/abs/2410.16144) <br> Jinheng Wang, Hansong Zhou, Ting Song, Shaoguang Mao, Shuming Ma, Hongyu Wang, Yan Xia, Furu Wei |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16144v2/x1.png"> |[Github](https://github.com/microsoft/BitNet) <br> [Paper](https://arxiv.org/abs/2410.16144)|[//]: #10/30
      • ![Star - Yaacov, Ron Banner, Kfir Yehuda Levy |<img width="1002" alt="image" src="figures/EXAQ.png"> |[Github](https://github.com/Anonymous1252022/EXAQ) <br> [Paper](https://arxiv.org/abs/2410.03185)|[//]: #10/14
    • Knowledge Distillation

    • Network Pruning / Sparsity

    • Inference Acceleration

      • [DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads](https://arxiv.org/abs/2410.10819) <br> Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, Song Han |<img width="1002" alt="image" src="https://github.com/mit-han-lab/duo-attention/raw/main/figures/method1.jpg"> |[Github](https://github.com/mit-han-lab/duo-attention) <br> [Paper](https://arxiv.org/abs/2410.10819)|[//]: #10/21
      • Accelerating Large Language Model Inference with Self-Supervised Early Exits
      • An Efficient Inference Framework for Early-exit Large Language Models
      • LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
      • Multi-Token Joint Speculative Decoding for Accelerating Large Language Model Inference
      • ![Star - Exit LLMs](https://arxiv.org/abs/2410.18952) <br> Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec |<img width="1002" alt="image" src="https://github.com/MatteoNulli/Vocabulary_pruning/raw/main/src/images/final_nips.svg"> |[Github](https://github.com/MatteoNulli/Vocabulary_pruning) <br> [Paper](https://arxiv.org/abs/2410.18952)|[//]: #10/29
      • ![Star - Inspired Adaptive Sparse Activation](https://arxiv.org/abs/2410.18311#) <br> Qinsi Wang, Saeed Vahidian, Hancheng Ye, Jianyang Gu, Jianyi Zhang, Yiran Chen |<img width="1002" alt="image" src="https://wangqinsi1.github.io/coreinfer_page/static/images/overview.png"> |[Github](https://github.com/wangqinsi1/CoreInfer) <br> [Paper](https://arxiv.org/abs/2410.18311#)|[//]: #10/29
      • [MagicPIG: LSH Sampling for Efficient LLM Generation](https://arxiv.org/abs/2410.16179) <br> Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, Beidi Chen |<img width="1002" alt="image" src="https://arxiv.org/html/2410.16179v2/x15.png"> |[Github](https://github.com/Infini-AI-Lab/MagicPIG) <br> [Paper](https://arxiv.org/abs/2410.16179)|[//]: #10/30
      • ![Star - the-Fly Self-Speculative Decoding for LLM Inference Acceleration](https://arxiv.org/abs/2410.06916) <br> Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li |<img width="1002" alt="image" src="https://github.com/hemingkx/SWIFT/raw/main/assets/swift.png"> |[Github](https://github.com/hemingkx/SWIFT) <br> [Paper](https://arxiv.org/abs/2410.06916)|[//]: #10/14
      • ![Star - Augmented Generation with Precomputed KV Caches for Chunked Text](https://arxiv.org/abs/2410.07590) <br> Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, Yaohua Tang |<img width="1002" alt="image" src="https://github.com/MooreThreads/TurboRAG/raw/main/assets/image/TurboRAG.png"> |[Github](https://github.com/MooreThreads/TurboRAG) <br> [Paper](https://arxiv.org/abs/2410.07590)|[//]: #10/13
      • Adaptive Draft-Verification for Efficient Large Language Model Decoding
    • Efficient Architecture of LLM

      • ![Star - Hay So, Ting Cao, Fan Yang, Mao Yang |<img width="202" alt="image" src="https://arxiv.org/html/2410.13276v1/x4.png"> |[Github](https://github.com/microsoft/SeerAttention) <br> [Paper](https://arxiv.org/abs/2410.13276)|[//]: #10/21
      • SentenceVAE: Enable Next-sentence Prediction for Large Language Models with Faster Speed, Higher Accuracy and Longer Context
      • [Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression](https://arxiv.org/abs/2410.03765) <br> Jingcun Wang, Yu-Guang Chen, Ing-Chao Lin, Bing Li, Grace Li Zhang |<img width="1002" alt="image" src="https://arxiv.org/html/2410.03765v1/x1.png"> |[Github](https://github.com/TUDa-HWAI/Basis_Sharing) <br> [Paper](https://arxiv.org/abs/2410.03765)|[//]: #10/14
    • Tuning

      • ![Star - Column Updates](https://arxiv.org/abs/2410.10075) <br> Md Kowsher, Tara Esmaeilbeig, Chun-Nam Yu, Mojtaba Soltanalian, Niloofar Yousefi |<img width="1002" alt="image" src="https://github.com/Kowsher/RoCoFT/blob/main/figures/rocoft.png"> |[Github](https://github.com/Kowsher/RoCoFT) <br> [Paper](https://arxiv.org/abs/2410.10075)|[//]: #10/21
      • ![Star - EMNLP'24-blue)]()<br>[Layer-wise Importance Matters: Less Memory for Better Performance in Parameter-efficient Fine-tuning of Large Language Models](https://arxiv.org/abs/2410.11772) <br> Kai Yao, Penlei Gao, Lichun Li, Yuan Zhao, Xiaofeng Wang, Wei Wang, Jianke Zhu |<img width="1002" alt="image" src="https://arxiv.org/html/2410.11772v1/x3.png"> |[Github](https://github.com/Kaiseem/IST) <br> [Paper](https://arxiv.org/abs/2410.11772)|[//]: #10/21
      • ![Star - EMNLP'24%20Findings-blue)]()<br>[QEFT: Quantization for Efficient Fine-Tuning of LLMs](https://arxiv.org/abs/2410.08661) <br> Changhun Lee, Jun-gyu Jin, Younghyun Cho, Eunhyeok Park |<img width="1002" alt="image" src="https://arxiv.org/html/2410.08661v1/x2.png"> |[Github](https://github.com/xvyaward/qeft) <br> [Paper](https://arxiv.org/abs/2410.08661)|[//]: #10/21
      • [![Publish](https://img.shields.io/badge/Conference-EMNLP'24%20Findings-blue)]()<br>[BIPEFT: Budget-Guided Iterative Search for Parameter Efficient Fine-Tuning of Large Pretrained Language Models](https://arxiv.org/abs/2410.09079) <br> Aofei Chang, Jiaqi Wang, Han Liu, Parminder Bhatia, Cao Xiao, Ting Wang, Fenglong Ma |<img width="1002" alt="image" src="https://arxiv.org/html/2410.09079v1/x1.png"> |[Github](https://github.com/Aofei-Chang/BIPEFT) <br> [Paper](https://arxiv.org/abs/2410.09079)|[//]: #10/21
      • Tensor Train Low-rank Approximation (TT-LoRA): Democratizing AI with Accelerated LLMs
      • ![Star - tuning of MLP Layers](https://arxiv.org/abs/2410.07383) <br> Viktoriia Chekalina, Anna Rudenko, Gleb Mezentsev, Alexander Mikhalev, Alexander Panchenko, Ivan Oseledets |<img width="1002" alt="image" src="https://arxiv.org/html/2410.07383v1/x1.png"> |[Github](https://github.com/sayankotor/sparse_grads) <br> [Paper](https://arxiv.org/abs/2410.07383)|[//]: #10/13
    • Survey

      • [Prompt Compression for Large Language Models: A Survey](https://arxiv.org/abs/2410.12388) <br> Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier |<img width="1002" alt="image" src="https://arxiv.org/html/2410.12388v2/extracted/5933385/Figures/tree_overview.png"> |[Github](https://github.com/ZongqianLi/Prompt-Compression-Survey) <br> [Paper](https://arxiv.org/abs/2410.12388)|[//]: #10/21
    • KV Cache Compression

      • RazorAttention: Efficient KV Cache Compression Through Retrieval Heads
      • ![Star - Wise Dissimilar KV Cache Sharing](https://arxiv.org/abs/2410.18517) <br> Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen |<img width="1002" alt="image" src="https://github.com/yangyifei729/KVSharer/raw/main/img/main_fig.jpg"> |[Github](https://github.com/yangyifei729/KVSharer) <br> [Paper](https://arxiv.org/abs/2410.18517)|[//]: #10/29
      • ![Star - KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference](https://arxiv.org/abs/2407.11550) <br> Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou |<img width="1002" alt="image" src="figures/adakv.png"> |[Github](https://github.com/FFY0/AdaKV) <br> [Paper](https://arxiv.org/abs/2407.11550)|[//]: #10/13
      • PQCache: Product Quantization-based KVCache for Long Context LLM Inference
    • Text Compression

    • Efficient MOE

    • Low-Rank Decomposition

      • [Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning](https://arxiv.org/abs/2410.16029) <br> Arijit Das | |[Github](https://github.com/selfsupervised-ai/Natural-GaLore) <br> [Paper](https://arxiv.org/abs/2410.16029)|[//]: #10/30
    • Hardware/System

  • Inference Acceleration

  • Efficient Architecture of LLM

  • Survey

    • A Survey on Efficient Inference for Large Language Models - Ping Zhang, Yuhan Dong, Yu Wang. [[Paper]](https://arxiv.org/abs/2404.14294)
    • A Survey on Model Compression for Large Language Models
    • [The Efficiency Spectrum of Large Language Models: An Algorithmic Survey](https://arxiv.org/abs/2312.00678). Tianyu Ding, Tianyi Chen, Haidong Zhu, Jiachen Jiang, Yiqi Zhong, Jinxin Zhou, Guangzhi Wang, Zhihui Zhu, Ilya Zharkov, Luming Liang. [[Paper]](https://arxiv.org/abs/2312.00678)[[Github]](https://github.com/tding1/Efficient-LLM-Survey)
    • [Efficient Large Language Models: A Survey](https://arxiv.org/abs/2312.03863). Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, Mi Zhang. [[Paper]](https://arxiv.org/abs/2312.03863)[[Github]](https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey)
    • Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems
    • [Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models](https://arxiv.org/abs/2401.00625). Guangji Bai, Zheng Chai, Chen Ling, Shiyu Wang, Jiaying Lu, Nan Zhang, Tingwei Shi, Ziyang Yu, Mengdan Zhu, Yifei Zhang, Carl Yang, Yue Cheng, Liang Zhao. [[Paper]](https://arxiv.org/abs/2401.00625)[[Github]](https://github.com/tiingweii-shii/Awesome-Resource-Efficient-LLM-Papers)
    • ![Star - efficient LLM and Multimodal Foundation Models](https://arxiv.org/abs/2401.08092). Mengwei Xu, Wangsong Yin, Dongqi Cai, Rongjie Yi, Daliang Xu, Qipeng Wang, Bingyang Wu, Yihao Zhao, Chen Yang, Shihe Wang, Qiyang Zhang, Zhenyan Lu, Li Zhang, Shangguang Wang, Yuanchun Li, Yunxin Liu, Xin Jin, Xuanzhe Liu. [[Paper]](https://arxiv.org/abs/2401.08092)[[Github]](https://github.com/UbiquitousLearning/Efficient_Foundation_Model_Survey)
    • A Survey on Hardware Accelerators for Large Language Models
    • ![Star - Qin Zhang, Yunxin Liu. [[Paper]](https://arxiv.org/abs/2401.05459)[[Github]](https://github.com/MobileLLM/Personal_LLM_Agents_Survey)
    • A Comprehensive Survey of Compression Algorithms for Language Models
    • [Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding](https://arxiv.org/abs/2401.07851). Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, Zhifang Sui. [[Paper]](https://arxiv.org/abs/2401.07851)[[Github]](https://github.com/hemingkx/Spec-Bench)[[Blog]](https://sites.google.com/view/spec-bench)
    • [Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward](https://arxiv.org/abs/2402.01799). Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, Deepak Gupta. [[Paper]](https://arxiv.org/abs/2402.01799)[[Github]](https://github.com/nyunAI/Faster-LLM-Survey)
    • [A Survey on Knowledge Distillation of Large Language Models](https://arxiv.org/abs/2402.13116). Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng Tao, Tianyi Zhou. [[Paper]](https://arxiv.org/abs/2402.13116)[[Github]](https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs)
    • Efficient Prompting Methods for Large Language Models: A Survey
    • A Survey on Transformer Compression
    • Model Compression and Efficient Inference for Large Language Models: A Survey
    • A Comprehensive Survey of Accelerated Generation Techniques in Large Language Models
  • Paper from June 21, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

    • Quantization

      • SDQ: Sparse Decomposed Quantization for LLM Inference - An Tsai, Stephen W. Keckler, Tushar Krishna |<img width="1002" alt="image" src="https://arxiv.org/html/2406.13868v1/x3.png"> |[Paper](https://arxiv.org/abs/2406.13868)|[//]: #06/24
      • ![Star - bit Vector Post-Training Quantization for Large Language Models](https://arxiv.org/abs/2409.17066) <br> Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, Mao Yang |<img width="1002" alt="image" src="figures/VPTQ.png"> |[Github](https://github.com/microsoft/VPTQ) <br> [Paper](https://arxiv.org/abs/2409.17066)|[//]: #09/27
      • [INT-FlashAttention: Enabling Flash Attention for INT8 Quantization](https://arxiv.org/abs/2409.16997) <br> Shimao Chen, Zirui Liu, Zhiying Wu, Ce Zheng, Peizhuang Cong, Zihan Jiang, Yuhan Wu, Lei Su, Tong Yang |<img width="1002" alt="image" src="https://arxiv.org/html/2409.16997v2/x1.png"> |[Github](https://github.com/INT-FlashAttention2024/INT-FlashAttention) <br> [Paper](https://arxiv.org/abs/2409.16997)|[//]: #09/27
      • [![Publish](https://img.shields.io/badge/Conference-NeurIPS'24-blue)]()<br>[DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs](https://arxiv.org/abs/2406.01721) <br> Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, Ying Wei |<img width="1002" alt="image" src="https://github.com/Hsu1023/DuQuant/blob/main/imgs/duquant.png"> |[Github](https://github.com/Hsu1023/DuQuant?tab=readme-ov-file) <br> [Paper](https://arxiv.org/abs/2406.01721)|[//]: #09/27
      • Attention-aware Post-training Quantization without Backpropagation - young Kim, Eulrang Cho, Chungman Lee, Joonyoung Kim, Yongkweon Jeon |<img width="1002" alt="image" src="https://arxiv.org/html/2406.13474v1/x1.png"> |[Paper](https://arxiv.org/abs/2406.13474)|[//]: #06/24
      • CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
      • Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models - Joon Kim |<img width="1002" alt="image" src="https://arxiv.org/html/2406.12311v1/x2.png"> |[Paper](https://arxiv.org/abs/2406.12311)|[//]: #06/23
    • Inference Acceleration

    • Hardware/System

    • Network Pruning / Sparsity

    • Knowledge Distillation

      • [![Publish](https://img.shields.io/badge/Conference-LREC-COLING'24-blue)]()<br>[LLMR: Knowledge Distillation with a Large Language Model-Induced Reward](https://arxiv.org/abs/2409.12500) <br> Dongheng Li, Yongchang Hao, Lili Mou |<img width="1002" alt="image" src="https://github.com/MANGA-UOFA/Prompt-LLMR/blob/main/LLMR-main/assets/model.png"> |[Github](https://github.com/MANGA-UOFA/Prompt-LLMR) <br> [Paper](https://arxiv.org/abs/2409.12500)|[//]: #09/21
    • KV Cache Compression

      • ![Star - Cache with Precision-Aligned Quantization](https://arxiv.org/abs/2409.16546) <br> Yifan Tan, Haoze Wang, Chao Yan, Yangdong Deng |<img width="1002" alt="image" src="https://arxiv.org/html/2409.16546v1/extracted/5867591/Figure6.png"> |[Github](https://github.com/AlignedQuant/AlignedKV) <br> [Paper](https://arxiv.org/abs/2409.16546)|[//]: #09/27
    • Text Compression

      • [AlphaZip: Neural Network-Enhanced Lossless Text Compression](https://arxiv.org/abs/2409.15046) <br> Swathi Shree Narashiman, Nitin Chandrachoodan |<img width="1002" alt="image" src="https://arxiv.org/html/2409.15046v1/extracted/5873563/images/architecture_bloack_diagram.png"> |[Github](https://github.com/Swathi-Shree-Narashiman/AlphaZip) <br> [Paper](https://arxiv.org/abs/2409.15046)|[//]: #09/27
      • Brevity is the soul of wit: Pruning long files for code generation
    • Tuning

    • Survey

      • [Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey](https://arxiv.org/abs/2409.13385) <br> Sourav Verma |<img width="1002" alt="image" src="figures/CCRAG_survey.png"> |[Github](https://github.com/SrGrace/Contextual-Compression) <br> [Paper](https://arxiv.org/abs/2409.13385)|[//]: #09/27
    • Low-Rank Decomposition

  • Low-Rank Decomposition

    • ![Star - ICML'23-blue)]() <br>[LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation](https://arxiv.org/abs/2306.11222) <br> Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, Tuo Zhao |<img width="302" alt="image" src="figures/LoSparse.png"> |[Github](https://github.com/yxli2123/LoSparse) <br> [Paper](https://arxiv.org/abs/2306.11222)|
    • [![Publish](https://img.shields.io/badge/Conference-NeurIPS'23-blue)]()<br>[Matrix Compression via Randomized Low Rank and Low Precision Factorization](https://arxiv.org/abs/2310.11028) <br> Rajarshi Saha, Varun Srivastava, Mert Pilanci |<img width="1002" alt="image" src="figures/LPLR.png"> |[Github](https://github.com/pilancilab/matrix-compressor) <br> [Paper](https://arxiv.org/abs/2310.11028)|
    • TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition |[Paper](https://arxiv.org/abs/2307.00526)|
    • LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
    • [Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models](https://arxiv.org/abs/2312.07046) <br> Arnav Chavan, Nahush Lele, Deepak Gupta |<img width="1002" alt="image" src="figures/LLM-ROM.png"> |[Github](https://github.com/transmuteAI/trailmet/tree/main/trailmet/algorithms/llm-rom) <br> [Paper](https://arxiv.org/abs/2312.07046)|
    • Data-free Weight Compress and Denoise for Large Language Models
    • [SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression](https://arxiv.org/abs/2403.07378) <br> Xin Wang, Yu Zheng, Zhongwei Wan, Mi Zhang |<img width="1002" alt="image" src="https://github.com/AIoT-MLSys-Lab/SVD-LLM/raw/main/figures/framework.png"> |[Github](https://github.com/AIoT-MLSys-Lab/SVD-LLM) <br> [Paper](https://arxiv.org/abs/2403.07378)|
    • ![Star - ACL'24%20Findings-blue)]()<br>[Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization](https://arxiv.org/abs/2405.10616) <br> Yixin Ji, Yang Xiang, Juntao Li, Wei Chen, Zhongyi Liu, Kehai Chen, Min Zhang |<img width="1002" alt="image" src="figures/bolaco.png"> |[Github](https://github.com/Dereck0602/Bolaco) <br> [Paper](https://arxiv.org/abs/2405.10616)|
    • [![Publish](https://img.shields.io/badge/Conference-ACL'24-blue)]()<br>[Surgical Feature-Space Decomposition of LLMs: Why, When and How?](https://arxiv.org/abs/2405.13039) <br> Arnav Chavan, Nahush Lele, Deepak Gupta |<img width="1002" alt="image" src="figures/SFSD-LLM.png"> |[Github](https://github.com/nyunAI/SFSD-LLM) <br> [Paper](https://arxiv.org/abs/2405.13039)|
  • Hardware

    • [FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691). Tri Dao. [[Paper]](https://arxiv.org/abs/2307.08691)[[Github]](https://github.com/Dao-AILab/flash-attention)
    • ![Star - Kelley. [[Paper]](https://arxiv.org/abs/2311.09431)[[Github]](https://github.com/exists-forall/striped_attention/)
    • [PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU](https://arxiv.org/abs/2312.12456). Yixin Song, Zeyu Mi, Haotong Xie, Haibo Chen. [[Paper]](https://arxiv.org/abs/2312.12456)[[Github]](https://github.com/SJTU-IPADS/PowerInfer)
  • Tuning

  • Leaderboard

  • Paper from Sep 30, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

  • Paper from June 13, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

  • Paper from August 17, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

    • KV Cache Compression

    • Knowledge Distillation

    • Quantization

      • STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs
      • [TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction](https://arxiv.org/abs/2410.19103) <br> Yuhang Li, Priyadarshini Panda |<img width="1002" alt="image" src="https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ/raw/main/imgs/tesseraq.png"> |[Github](https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ) <br> [Paper](https://arxiv.org/abs/2410.19103)|[//]: #11/17
      • ![Star - Grained Size Control for Compressed Large Language Models in Variable Memory Environments](https://arxiv.org/abs/2410.23918) <br> Xinghao Wang, Pengyu Wang, Bo Wang, Dong Zhang, Yunhua Zhou, Xipeng Qiu |<img width="1002" alt="image" src="https://github.com/xinghaow99/BitStack/raw/main/assets/bitstack.png"> |[Github](https://github.com/xinghaow99/BitStack) <br> [Paper](https://arxiv.org/abs/2410.23918)|[//]: #11/17
    • Hardware/System/Serving

    • Inference Acceleration

    • Low-Rank Decomposition

    • Network Pruning / Sparsity

    • Survey (or Benchmark)

      • [LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators](https://arxiv.org/abs/2411.00136) <br> Krishna Teja Chitty-Venkata, Siddhisanket Raskar, Bharat Kale, Farah Ferdaus et al | |[Github](https://github.com/argonne-lcf/LLM-Inference-Bench) <br> [Paper](https://arxiv.org/abs/2411.00136)|[//]: #11/18
    • Efficient MOE

      • ![Star - of-Experts Training with Network-Traffc-Aware Parallel Optimization](https://arxiv.org/abs/2411.00662) <br> Jingming Guo, Yan Liu, Yu Meng, Zhiwei Tao, Banglan Liu, Gang Chen, Xiang Li |<img width="1002" alt="image" src="https://arxiv.org/html/2411.00662v1/x1.png"> |[Github](https://github.com/EnflameTechnology/DeepSpeed) <br> [Paper](https://arxiv.org/abs/2411.00662)|[//]: #11/18
      • [MoE-I2: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition](https://arxiv.org/abs/2411.01016) <br> Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Yuanlin Duan, Wenqi Jia, Miao Yin, Yu Cheng, Bo Yuan |<img width="1002" alt="image" src="https://arxiv.org/html/2411.01016v1/x1.png"> |[Github](https://github.com/xiaochengsky/MoEI-2) <br> [Paper](https://arxiv.org/abs/2411.01016)|[//]: #11/18
    • Text Compression

      • ![Star - Length Tokenization for Efficient LLMs Adapted from LZW Compression](https://arxiv.org/abs/2410.21548) <br> Noel Elias, Homa Esfahanizadeh, Kaan Kale, Sriram Vishwanath, Muriel Medard |<img width="1002" alt="image" src="https://arxiv.org/html/2410.21548v1/extracted/5960495/Figures/MultiTok.png"> |[Github](https://github.com/noelkelias/multitok) <br> [Paper](https://arxiv.org/abs/2410.21548)|[//]: #11/17
    • Tuning

      • [Robust and Efficient Fine-tuning of LLMs with Bayesian Reparameterization of Low-Rank Adaptation](https://arxiv.org/abs/2411.04358) <br> Ayan Sengupta, Vaibhav Seth, Arinjay Pathak, Natraj Raman, Sriram Gopalakrishnan, Tanmoy Chakraborty |<img width="1002" alt="image" src="https://arxiv.org/html/2411.04358v2/x3.png"> |[Github](https://github.com/LCS2-IIITD/MonteCLoRA) <br> [Paper](https://arxiv.org/abs/2411.04358)|[//]: #11/18
    • Efficient Training

      • ![Star - EMNLP'24-blue)]()<br>[Scalable Efficient Training of Large Language Models with Low-dimensional Projected Attention](https://arxiv.org/abs/2411.02063) <br> Xingtai Lv, Ning Ding, Kaiyan Zhang, Ermo Hua, Ganqu Cui, Bowen Zhou |<img width="1002" alt="image" src="https://arxiv.org/html/2411.02063v1/x1.png"> |[Github](https://github.com/TsinghuaC3I/LPA) <br> [Paper](https://arxiv.org/abs/2411.02063)|[//]: #11/18
      • ![Star - Efficient FP8 Training](https://arxiv.org/abs/2410.19313) <br> Haocheng Xi, Han Cai, Ligeng Zhu, Yao Lu, Kurt Keutzer, Jianfei Chen, Song Han |<img width="1002" alt="image" src="https://github.com/NVlabs/COAT/blob/main/docs/figs/FP8PrecisionFlow.png"> |[Github](https://github.com/NVlabs/COAT) <br> [Paper](https://arxiv.org/abs/2410.19313)|[//]: #11/17
      • ![Star - v.svg"> |[Github](https://github.com/wuhouming/BitPipe) <br> [Paper](https://arxiv.org/abs/2410.19367)|[//]: #11/17
  • Paper from July 4, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

  • Paper from June 6, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

  • Paper from June 2, 2024 - Now (see Full List from May 22, 2023 [here](#full-list))

    • Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.

      • LCQ: Low-Rank Codebook based Quantization for Large Language Models - Pu Cai, Wu-Jun Li |<img width="1002" alt="image" src="https://arxiv.org/html/2405.20973v1/x5.png"> |[Paper](https://arxiv.org/abs/2405.20973)|[//]: #06/05
      • MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization
      • Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs - Holder | |[Paper](https://arxiv.org/abs/2405.20835)|[//]: #06/05
      • [LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models](https://arxiv.org/abs/2408.10631) <br> Yupeng Su, Ziyi Guan, Xiaoqun Liu, Tianlai Jin, Dongkuan Wu, Graziano Chesi, Ngai Wong, Hao Yu |<img width="1002" alt="image" src="https://github.com/YupengSu/LLM-Barber/raw/main/img/figure1a.png"> |[Github](https://github.com/YupengSu/LLM-Barber) <br> [Paper](https://arxiv.org/abs/2408.10631)|[//]: #08/27
      • [PAT: Pruning-Aware Tuning for Large Language Models](https://arxiv.org/abs/2408.14721) <br> Yijiang Liu, Huanrui Yang, Youxin Chen, Rongyu Zhang, Miao Wang, Yuan Du, Li Du |<img width="1002" alt="image" src="figures/PAT.png"> |[Github](https://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning) <br> [Paper](https://arxiv.org/abs/2408.14721)|[//]: #09/02
      • [MobileQuant: Mobile-friendly Quantization for On-device Language Models](https://arxiv.org/abs/2408.13933) <br> Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez |<img width="1002" alt="image" src="https://arxiv.org/html/2408.13933v1/x1.png"> |[Github](https://github.com/saic-fi/MobileQuant) <br> [Paper](https://arxiv.org/abs/2408.13933)|[//]: #08/27
      • [ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models](https://arxiv.org/abs/2408.08554) <br> Chao Zeng, Songwei Liu, Yusheng Xie, Hong Liu, Xiaojian Wang, Miao Wei, Shu Yang, Fangmin Chen, Xing Mei |<img width="1002" alt="image" src="figures/abq-llm.png"> |[Github](https://github.com/bytedance/ABQ-LLM) <br> [Paper](https://arxiv.org/abs/2408.08554)|[//]: #08/20
      • [Post-Training Sparse Attention with Double Sparsity](https://arxiv.org/abs/2408.07092) <br> Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng |<img width="302" alt="image" src="https://github.com/andy-yang-1/DoubleSparse/raw/main/assets/double-sparsity-gif-v2.gif"> |[Github](https://github.com/andy-yang-1/DoubleSparse) <br> [Paper](https://arxiv.org/abs/2408.07092)|[//]: #08/20
      • [Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression](https://arxiv.org/abs/2408.15491) <br> Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu |<img width="1002" alt="image" src="https://arxiv.org/html/2408.15491v1/extracted/5817813/arch.png"> |[Github](https://github.com/howard-hou/instruction-aware-contextual-compressor) <br> [Paper](https://arxiv.org/abs/2408.15491)|[//]: #09/02
      • Low-Rank Quantization-Aware Training for LLMs
      • Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
  • KV Cache Compression

  • Full List

    • Please check out all the papers by selecting the sub-area you're interested in. On this main page, we're showing papers released in the past 90 days.