Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome-Efficient-LLM

A curated list for Efficient Large Language Models
https://github.com/horseee/Awesome-Efficient-LLM

  • LLM and GNN are Complementary: Distilling LLM for Multimodal Graph Learning
  • [Adversarial Moment-Matching Distillation of Large Language Models](https://arxiv.org/abs/2406.02959) · Chen Jia · [GitHub](https://github.com/jiachenwestlake/MMKD)
  • [Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models](https://arxiv.org/abs/2406.02924) (ICML'24) · Peijie Dong, Lujun Li, Zhenheng Tang, Xiang Liu, Xinglin Pan, Qiang Wang, Xiaowen Chu · [GitHub](https://github.com/pprp/Pruner-Zero) (a magnitude-pruning baseline sketch appears after this list)
  • Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
  • VTrans: Accelerating Transformer Compression with Variational Information Bottleneck based Pruning
  • [Large Language Model Pruning](https://arxiv.org/abs/2406.00030) · …-Jia Song, Hsing-Kuo Pao
  • FinerCut: Finer-grained Interpretable Layer Pruning for Large Language Models
  • [SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs](https://arxiv.org/abs/2405.16325) · Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi · [GitHub](https://github.com/Mohammad-Mozaffari/slope)
  • [SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models](https://arxiv.org/abs/2405.16057) · Xudong Lu, Aojun Zhou, Yuhui Xu, Renrui Zhang, Peng Gao, Hongsheng Li · [GitHub](https://github.com/Lucky-Lance/SPP)
  • [ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization](https://arxiv.org/abs/2406.05981) · Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Lin · [GitHub](https://github.com/GATECH-EIC/ShiftAddLLM)
  • Low-Rank Quantization-Aware Training for LLMs
  • [LCQ: Low-Rank Codebook based Quantization for Large Language Models](https://arxiv.org/abs/2405.20973) · …-Pu Cai, Wu-Jun Li
  • MagR: Weight Magnitude Reduction for Enhancing Post-Training Quantization
  • [Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs](https://arxiv.org/abs/2405.20835)
  • [I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models](https://arxiv.org/abs/2405.17849)
  • [Exploiting LLM Quantization](https://arxiv.org/abs/2405.18137) · Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev · [GitHub](https://github.com/eth-sri/llm-quantization-attack)
  • CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs
  • SpinQuant: LLM quantization with learned rotations (a round-to-nearest baseline sketch appears after this list)
  • [SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models](https://arxiv.org/abs/2405.14917) · Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Xianglong Liu, Luca Benini, Michele Magno, Xiaojuan Qi · [GitHub](https://github.com/Aaronhuang-778/SliM-LLM)
  • [PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression](https://arxiv.org/abs/2405.14852) · Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik · [GitHub](https://github.com/Vahe1994/AQLM/tree/pv-tuning)
  • Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs
  • [When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models](https://arxiv.org/abs/2406.07368) (ICML'24) · Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan (Celine) Lin · [GitHub](https://github.com/GATECH-EIC/Linearized-LLM)
  • [QuickLLaMA: Query-aware Inference Acceleration for Large Language Models](https://arxiv.org/abs/2406.07528) · Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia · [GitHub](https://github.com/dvlab-research/Q-LLM)
  • [Speculative Decoding via Early-exiting for Faster LLM Inference with Thompson Sampling Control Mechanism](https://arxiv.org/abs/2406.03853) · Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai
  • Faster Cascades via Speculative Decoding (a minimal speculative-decoding sketch appears after this list)
  • [Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference](https://arxiv.org/abs/2405.18628) · Hao (Mark) Chen, Wayne Luk, Ka Fai Cedric Yiu, Rui Li, Konstantin Mishchenko, Stylianos I. Venieris, Hongxiang Fan · [GitHub](https://github.com/hmarkc/parallel-prompt-decoding)
  • Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
  • [MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models](https://arxiv.org/abs/2405.18832) · Taehyun Kim, Kwanseok Choi, Youngmock Cho, Jaehoon Cho, Hyuk-Jae Lee, Jaewoong Sim
  • [Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models](https://arxiv.org/abs/2405.14297) · Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Tao Lin · [GitHub](https://github.com/LINs-lab/DynMoE)
  • [A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts](https://arxiv.org/abs/2405.16646) · Mohammed Nowaz Rabbani Chowdhury, Meng Wang, Kaoutar El Maghraoui, Naigang Wang, Pin-Yu Chen, Christopher Carothers
  • [Block Transformer: Global-to-Local Language Modeling for Fast Inference](https://arxiv.org/abs/2406.02657) · Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun · [GitHub](https://github.com/itsnamgyu/block-transformer)
  • Loki: Low-Rank Keys for Efficient Sparse Attention
  • [QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead](https://arxiv.org/abs/2406.03482) · Amir Zandieh, Majid Daliri, Insu Han · [GitHub](https://github.com/amirzandieh/QJL)
  • ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification (a per-token KV-quantization sketch appears after this list)
  • PowerInfer-2: Fast Large Language Model Inference on a Smartphone
  • [Parrot: Efficient Serving of LLM-based Applications with Semantic Variable](https://arxiv.org/abs/2405.19888) · Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
  • [Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning](https://arxiv.org/abs/2406.03792) (ACL'24 Findings) · Naibin Gu, Peng Fu, Xiyu Liu, Bowen Shen, Zheng Lin, Weiping Wang · [GitHub](https://github.com/gccnlp/Light-PEFT)
  • Zeroth-Order Fine-Tuning of LLMs with Extreme Sparsity
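
Baseline code sketches

The sketches below are minimal illustrations of the naive baselines that many listed papers improve on; they are not implementations of any specific paper, and all function names are invented for illustration. First, unstructured magnitude pruning, the classic criterion that entries such as Pruner-Zero (which evolves symbolic pruning metrics) and SLoPe build on:

```python
# A minimal sketch of unstructured magnitude pruning (illustrative only).
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude fraction `sparsity` of a weight matrix."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values  # k-th smallest |w|
    return weight * (weight.abs() > threshold)             # keep large weights

w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, sparsity=0.5)
print(f"sparsity: {(w_sparse == 0).float().mean():.2f}")   # ~0.50
```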
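Next, a round-to-nearest (RTN) weight quantizer with per-output-channel symmetric scales: the naive post-training baseline that MagR, SliM-LLM, SpinQuant and others refine (for example, by shrinking weight magnitudes or learning rotations before rounding). This "fake-quant" form quantizes and immediately dequantizes so the error can be inspected in floating point:

```python
# A minimal round-to-nearest weight quantizer (illustrative baseline only).
import torch

def quantize_rtn(weight: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Fake-quantize [out, in] weights: round to an integer grid, scale back."""
    qmax = 2 ** (bits - 1) - 1                              # e.g. 7 for 4 bits
    scale = weight.abs().amax(dim=1, keepdim=True) / qmax   # one scale per row
    scale = scale.clamp(min=1e-8)                           # avoid div-by-zero
    q = torch.round(weight / scale).clamp(-qmax - 1, qmax)  # integer codes
    return q * scale                                        # back to float

w = torch.randn(4096, 11008)
err = (quantize_rtn(w, bits=4) - w).abs().mean()
print(f"mean abs rounding error: {err:.4f}")
```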
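Third, a greedy sketch of speculative decoding, the mechanism behind the cascade, early-exit, and parallel-decoding entries above: a small draft model proposes several tokens cheaply, and the large target model verifies them all in one forward pass. This assumes Hugging Face-style causal LMs whose forward returns `.logits`, a shared tokenizer, and batch size 1; production systems add KV caching and a probabilistic acceptance rule rather than exact greedy matching:

```python
# A minimal greedy speculative-decoding loop (illustrative, batch size 1).
import torch

@torch.no_grad()
def speculative_greedy(target, draft, ids, k=4, max_new=64):
    """ids: [1, seq] prompt token ids; returns ids extended by ~max_new tokens."""
    for _ in range(max_new // k):
        # 1) The small draft model proposes k tokens autoregressively (cheap).
        prop = ids
        for _ in range(k):
            nxt = draft(prop).logits[:, -1].argmax(-1, keepdim=True)
            prop = torch.cat([prop, nxt], dim=-1)
        # 2) The target model scores all k proposals in ONE forward pass;
        #    logits[:, -k-1:-1] are its predictions at each drafted position.
        tgt = target(prop).logits[:, -k - 1:-1].argmax(-1)    # [1, k]
        # 3) Accept drafted tokens up to the first disagreement.
        n = int((tgt == prop[:, -k:]).long().cumprod(-1).sum())
        if n == k:
            ids = prop                                        # all accepted
        else:
            # keep the n accepted tokens, then the target's own next token
            ids = torch.cat([prop[:, :ids.shape[1] + n], tgt[:, n:n + 1]], dim=-1)
    return ids
```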
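Finally, a per-token asymmetric KV-cache quantizer: the uniform baseline that QJL (1-bit JL transforms), ZipCache (salient-token identification), and Loki (low-rank keys) each improve on. Shapes follow the usual [batch, heads, seq, head_dim] cache layout; everything else is illustrative:

```python
# A minimal per-token KV-cache quantizer (illustrative baseline only).
import torch

def quantize_kv(x: torch.Tensor, bits: int = 8):
    """x: [batch, heads, seq, head_dim] keys or values.
    Returns uint8 codes plus a per-token scale and zero point."""
    qmax = 2 ** bits - 1
    lo = x.amin(dim=-1, keepdim=True)                       # per-token minimum
    scale = (x.amax(dim=-1, keepdim=True) - lo).clamp(min=1e-5) / qmax
    codes = torch.round((x - lo) / scale).to(torch.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.to(scale.dtype) * scale + lo

k = torch.randn(1, 32, 1024, 128)
codes, scale, lo = quantize_kv(k)                           # ~4x smaller than fp32
print((dequantize_kv(codes, scale, lo) - k).abs().max())    # small reconstruction error
```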