{"id":13653613,"url":"https://github.com/HuangOwen/Awesome-LLM-Compression","last_synced_at":"2025-04-23T06:31:50.059Z","repository":{"id":172622646,"uuid":"647138126","full_name":"HuangOwen/Awesome-LLM-Compression","owner":"HuangOwen","description":"Awesome LLM compression research papers and tools.","archived":false,"fork":false,"pushed_at":"2025-04-17T09:33:57.000Z","size":778,"stargazers_count":1469,"open_issues_count":0,"forks_count":93,"subscribers_count":44,"default_branch":"main","last_synced_at":"2025-04-18T00:22:22.358Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/HuangOwen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-05-30T06:26:11.000Z","updated_at":"2025-04-17T09:34:00.000Z","dependencies_parsed_at":null,"dependency_job_id":"e81a4eb9-576a-4ebf-b2a7-ec1ca7996fc5","html_url":"https://github.com/HuangOwen/Awesome-LLM-Compression","commit_stats":null,"previous_names":["huangowen/awesome-llm-compression"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangOwen%2FAwesome-LLM-Compression","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangOwen%2FAwesome-LLM-Compression/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangOwen%2FAwesome-LLM-Compression/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/HuangOwen%2FAwesome-LLM-Compression/manif
ests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/HuangOwen","download_url":"https://codeload.github.com/HuangOwen/Awesome-LLM-Compression/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249561463,"owners_count":21291839,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T02:01:13.549Z","updated_at":"2025-04-23T06:31:45.042Z","avatar_url":"https://github.com/HuangOwen.png","language":null,"funding_links":[],"categories":["其他相关论文","Other-Awesome-Lists","Other Papers","Topics","NLP","Building","Other Lists","A01_文本生成_文本对话","Relevant Awesome Lists","Related Awesome Repositories"],"sub_categories":["Popular-LLM","LLM Compression \u0026 Long Context","Tools","TeX Lists","大语言对话模型及数据","2015","Papers"],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003ch1\u003eAwesome LLM Compression\u003c/h1\u003e\n    \u003ca href=\"https://awesome.re\"\u003e\u003cimg src=\"https://awesome.re/badge.svg\"/\u003e\u003c/a\u003e\n    \u003cimg src=https://img.shields.io/github/stars/HuangOwen/Awesome-LLM-Compression.svg?style=social \u003e\n    \u003cimg src=https://img.shields.io/github/watchers/HuangOwen/Awesome-LLM-Compression.svg?style=social \u003e\n\u003c/div\u003e\n\n![](quantization.gif)\n\nAwesome LLM compression research papers and tools to accelerate LLM training and inference. 
\n\n# Contents\n\n- [📑 Papers](#papers)\n  - [Survey](#survey)\n  - [Quantization](#quantization)\n  - [Pruning and Sparsity](#pruning-and-sparsity)\n  - [Distillation](#distillation)\n  - [Efficient Prompting](#efficient-prompting)\n  - [KV Cache Compression](#kv-cache-compression)\n  - [Other](#other)\n- [🔧 Tools](#tools)\n- [🙌 Contributing](#contributing)\n- [🌟 Star History](#star-history)\n\n## Papers\n\n### Survey\n\n- A Survey on Model Compression for Large Language Models \u003cbr\u003e TACL [[Paper]](https://arxiv.org/abs/2308.07633)\n\n- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2312.00960) [[Code]](https://github.com/NamburiSrinath/LLMCompression)\n\n- The Efficiency Spectrum of Large Language Models: An Algorithmic Survey \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.00678)\n\n- Efficient Large Language Models: A Survey \u003cbr\u003e TMLR [[Paper]](https://arxiv.org/abs/2312.03863) [[GitHub Page]](https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey)\n\n- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems \u003cbr\u003e ICML 2024 Tutorial [[Paper]](https://arxiv.org/abs/2312.15234) [[Tutorial]](https://icml.cc/virtual/2024/tutorial/35229)\n\n- Understanding LLMs: A Comprehensive Overview from Training to Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.02038) \n\n- Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward \u003cbr\u003e IJCAI 2024 (Survey Track) [[Paper]](https://arxiv.org/abs/2402.01799) [[GitHub Page]](https://github.com/nyunAI/Faster-LLM-Survey)\n\n- A Survey of Resource-efficient LLM and Multimodal Foundation Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.08092) \n\n- A Survey on Hardware Accelerators for Large Language Models \u003cbr\u003e Arxiv 2024 
[[Paper]](https://arxiv.org/abs/2401.09890) \n\n- A Comprehensive Survey of Compression Algorithms for Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.15347)\n\n- A Survey on Transformer Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.05964)\n\n- Model Compression and Efficient Inference for Large Language Models: A Survey \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.09748) \n\n- LLM Inference Unveiled: Survey and Roofline Model Insights \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.16363) \n\n- A Survey on Knowledge Distillation of Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.13116) [[GitHub Page]](https://github.com/Tebmer/Awesome-Knowledge-Distillation-of-LLMs)\n\n- Efficient Prompting Methods for Large Language Models: A Survey \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.01077)\n\n- Survey on Knowledge Distillation for Large Language Models: Methods, Evaluation, and Application \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.01885)\n\n- On-Device Language Models: A Comprehensive Review \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.00088) [[GitHub Page]](https://github.com/NexaAI/Awesome-LLMs-on-device) [[Download On-device LLMs]](https://nexaai.com/models)\n\n- A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.16694) \n\n- Contextual Compression in Retrieval-Augmented Generation for Large Language Models: A Survey \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.13385) \n\n- Prompt Compression for Large Language Models: A Survey \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.12388) \n\n- A Comprehensive Study on Quantization Techniques for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2411.02530) \n\n### Quantization\n\n- 
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers \u003cbr\u003e NeurIPS 2022 [[Paper]](https://arxiv.org/abs/2206.01861) [[Code (DeepSpeed)]](https://github.com/microsoft/DeepSpeed)\n\n- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale \u003cbr\u003e NeurIPS 2022 [[Paper]](https://arxiv.org/abs/2208.07339) [[Code]](https://github.com/TimDettmers/bitsandbytes)\n\n- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models \u003cbr\u003e NeurIPS 2022 [[Paper]](https://arxiv.org/abs/2209.13325) [[Code]](https://github.com/wimh966/outlier_suppression)\n\n- LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models \u003cbr\u003e Arxiv 2022 [[Paper]](https://arxiv.org/abs/2206.09557) \n\n- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models \u003cbr\u003e ICML 2023 [[Paper]](https://arxiv.org/abs/2211.10438) [[Code]](https://github.com/mit-han-lab/smoothquant)\n\n- FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization \u003cbr\u003e ICML 2023 [[Paper]](https://arxiv.org/abs/2306.00317) [[Code (DeepSpeed)]](https://github.com/microsoft/DeepSpeed)\n\n- Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases \u003cbr\u003e ICML 2023 [[Paper]](https://arxiv.org/abs/2301.12017) [[Code]](https://openreview.net/attachment?id=-tYCaP0phY_\u0026name=supplementary_material)\n\n- The case for 4-bit precision: k-bit Inference Scaling Laws \u003cbr\u003e ICML 2023 [[Paper]](https://proceedings.mlr.press/v202/dettmers23a.html)\n\n- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers \u003cbr\u003e ICLR 2023 [[Paper]](https://arxiv.org/abs/2210.17323) [[Code]](https://github.com/IST-DASLab/gptq)\n\n- PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models \u003cbr\u003e 
ACL 2023 [[Paper]](https://arxiv.org/abs/2306.00014) \n\n- Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization \u003cbr\u003e ACL 2023 [[Paper]](https://aclanthology.org/2023.findings-acl.15.pdf) \n\n- QLoRA: Efficient Finetuning of Quantized LLMs \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2305.14314) [[Code]](https://github.com/artidoro/qlora)\n\n- The Quantization Model of Neural Scaling \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2303.13506)\n\n- Quantized Distributed Training of Large Models with Convergence Guarantees \u003cbr\u003e ICML 2023 [[Paper]](https://arxiv.org/abs/2302.02390)\n\n- RPTQ: Reorder-based Post-training Quantization for Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2304.01089) [[Code]](https://github.com/hahnyuan/RPTQ4LLM)\n\n- ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation \u003cbr\u003e AAAI 2024 [[Paper]](https://arxiv.org/abs/2303.08302) [[Code]](https://github.com/microsoft/DeepSpeed)\n\n- Integer or Floating Point? 
New Outlooks for Low-Bit Quantization on Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.12356)\n\n- Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2305.14152)\n\n- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.11186)\n\n- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration \u003cbr\u003e MLSys 2024 (Best Paper 🏆) [[Paper]](https://arxiv.org/abs/2306.00978) [[Code]](https://github.com/mit-han-lab/llm-awq)\n\n- LLM-QAT: Data-Free Quantization Aware Training for Large Language Models \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2305.17888) [[Code]](https://github.com/facebookresearch/LLM-QAT)\n\n- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2306.03078) [[Code]](https://github.com/Vahe1994/SpQR)\n\n- OWQ: Lessons learned from activation outliers for weight quantization in large language models \u003cbr\u003e AAAI 2024 [[Paper]](https://arxiv.org/abs/2306.02272)\n\n- SqueezeLLM: Dense-and-Sparse Quantization \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2306.07629)  [[Code]](https://github.com/SqueezeAILab/SqueezeLLM)\n\n- INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2306.08162)\n\n- LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2311.12023)\n\n- INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2307.03712) 
[[Code]](https://github.com/lightmatter-ai/INT-FP-QSim)\n\n- QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2307.03738) [[Code]](https://github.com/IST-DASLab/QIGen)\n\n- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study \u003cbr\u003e COLING 2024 [[Paper]](https://arxiv.org/abs/2307.08072)\n\n- ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2307.09782) [[Code (DeepSpeed)]](https://github.com/microsoft/DeepSpeed)\n\n- OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization \u003cbr\u003e ISCA 2023 [[Paper]](https://arxiv.org/abs/2304.07493)\n\n- NUPES : Non-Uniform Post-Training Quantization via Power Exponent Search \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2308.05600)\n\n- GPT-Zip: Deep Compression of Finetuned Large Language Models \u003cbr\u003e ICML 2023 Workshop ES-FoMO [[Paper]](https://openreview.net/forum?id=hO0c2tG2xL)\n\n- Generating Efficient Kernels for Quantized Inference on Large Language Models \u003cbr\u003e ICML 2023 Workshop ES-FoMO [[Paper]](https://openreview.net/forum?id=jjazoNAf1S)\n\n- Gradient-Based Post-Training Quantization: Challenging the Status Quo \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2308.07662)\n\n- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2308.09723)\n\n- OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2308.13137) [[Code]](https://github.com/OpenGVLab/OmniQuant)\n\n- FPTQ: Fine-grained Post-Training Quantization for Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2308.15987)\n\n- eDKM: An Efficient and 
Accurate Train-time Weight Clustering for Large Language Models \u003cbr\u003e IEEE Computer Architecture Letters 2023 [[Paper]](https://arxiv.org/abs/2309.00964)\n\n- QuantEase: Optimization-based Quantization for Language Models -- An Efficient and Intuitive Algorithm \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.01885)\n\n- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models \u003cbr\u003e AAAI 2024 [[Paper]](https://arxiv.org/abs/2309.02784)\n\n- Understanding the Impact of Post-Training Quantization on Large-scale Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.05210)\n\n- MEMORY-VQ: Compression for Tractable Internet-Scale Memory \u003cbr\u003e NAACL 2024 [[Paper]](https://arxiv.org/abs/2308.14903)\n\n- Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2309.05516) [[Code]](https://github.com/intel/auto-round)\n\n- Efficient Post-training Quantization with FP8 Formats \u003cbr\u003e MLSys 2024 [[Paper]](https://arxiv.org/abs/2309.14592) [[Code (Intel® Neural Compressor)]](https://github.com/intel/neural-compressor)\n\n- QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2309.14717) [[Code]](https://github.com/yuhuixu1993/qa-lora)\n\n- Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2309.15531) [[Code]](https://github.com/johnheo/adadim-llm)\n\n- ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers \u003cbr\u003e TMLR (Featured Certification 🌟) [[Paper]](https://arxiv.org/abs/2309.16119) \n\n- PB-LLM: Partially Binarized Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2310.00034) [[Code]](https://github.com/hahnyuan/PB-LLM)\n\n- Dual 
Grained Quantization: Efficient Fine-Grained Quantization for LLM \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.04836) \n\n- QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2310.08041) [[Code]](https://github.com/ModelTC/QLLM)\n\n- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2310.08659) [[Code]](https://github.com/yxli2123/LoftQ)\n\n- QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.07147) \n\n- TEQ: Trainable Equivalent Transformation for Quantization of LLMs \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.10944) [[Code (Intel® Neural Compressor)]](https://github.com/intel/neural-compressor)\n\n- BitNet: Scaling 1-bit Transformers for Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.11453)  [[Code]](https://github.com/Beomi/BitNet-Transformers)\n\n- FP8-LM: Training FP8 Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.18313) [[Code]](https://github.com/Azure/MS-AMP)\n\n- QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2310.09259) [[Code]](https://github.com/IST-DASLab/QUIK)\n\n- AFPQ: Asymmetric Floating Point Quantization for LLMs \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2311.01792) [[Code]](https://github.com/zhangsichengsjtu/AFPQ)\n\n- AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2311.01305) \n\n- Atom: Low-bit Quantization for Efficient and Accurate LLM Serving \u003cbr\u003e MLSys 2024 [[Paper]](https://arxiv.org/abs/2310.19102) [[Code]](https://github.com/efeslab/Atom)\n\n- QMoE: Practical 
Sub-1-Bit Compression of Trillion-Parameter Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.16795) \n\n- Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2311.03687) \n\n- How Does Calibration Data Affect the Post-training Pruning and Quantization of Large Language Models? \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2311.09755)\n\n- A Speed Odyssey for Deployable Quantization of LLMs \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2311.09550)\n\n- Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2311.16442)\n\n- Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2306.12929) [[Code]](https://github.com/Qualcomm-AI-research/outlier-free-transformers)\n\n- Efficient LLM Inference on CPUs \u003cbr\u003e NeurIPS 2023 on Efficient Natural Language and Speech Processing [[Paper]](https://arxiv.org/abs/2311.00502) [[Code]](https://github.com/intel/intel-extension-for-transformers)\n\n- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models \u003cbr\u003e EMNLP Findings 2023 [[Paper]](https://arxiv.org/abs/2312.00960) \n\n- Zero-Shot Sharpness-Aware Quantization for Pre-trained Language Models \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.13315) \n\n- Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference? 
\u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.05079) [[Code]](https://github.com/ChengZhang-98/llm-mixed-q) \n\n- Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2304.09145)\n\n- Watermarking LLMs with Weight Quantization \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.11237) [[Code]](https://github.com/Twilight92z/Quantize-Watermark)\n\n- Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2311.05161)\n\n- LLM-FP4: 4-Bit Floating-Point Quantized Transformers \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.16836) [[Code]](https://github.com/nbasyl/LLM-FP4)\n\n- Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge \u003cbr\u003e AAAI 2024 [[Paper]](https://arxiv.org/abs/2312.05693)\n\n- SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.03788)\n\n- CBQ: Cross-Block Quantization for Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.07950)\n\n- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.08583)\n\n- QuIP: 2-Bit Quantization of Large Language Models With Guarantees \u003cbr\u003e NeurIPS 2023 [[Paper]](https://openreview.net/pdf?id=xrk9g5vcXR) [[Code]](https://github.com/jerry-chee/QuIP)\n\n- A Performance Evaluation of a Quantized Large Language Model on Various Smartphones \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.12472)\n\n- DeltaZip: Multi-Tenant Language Model Serving via Delta Compression \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.05215) 
[[Code]](https://github.com/eth-easl/deltazip)\n\n- FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGA \u003cbr\u003e FPGA 2024 [[Paper]](https://arxiv.org/abs/2401.03868)\n\n- Extreme Compression of Large Language Models via Additive Quantization \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2401.06118)\n\n- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.07159)\n\n- Inferflow: an Efficient and Highly Configurable Inference Engine for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.08294)\n\n- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design \u003cbr\u003e USENIX ATC 2024 [[Paper]](https://arxiv.org/abs/2401.14112)\n\n- Can Large Language Models Understand Context? \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.18079)\n\n- EdgeQAT: Entropy and Distribution Guided Quantization-Aware Training for the Acceleration of Lightweight LLMs on the Edge \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.10787) [[Code]](https://github.com/shawnricecake/EdgeQAT)\n\n- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.10517) \n\n- LQER: Low-Rank Quantization Error Reconstruction for LLMs \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2402.02446) \n\n- BiLLM: Pushing the Limit of Post-Training Quantization for LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.04291) [[Code]](https://github.com/Aaronhuang-778/BiLLM)\n\n- QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2402.04396) [[Code]](https://github.com/Cornell-RelaxML/quip-sharp)\n\n- L4Q: Parameter Efficient Quantization-Aware Training on Large Language 
Models via LoRA-wise LSQ \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.04902) \n\n- TP-Aware Dequantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.04925) \n\n- ApiQ: Finetuning of 2-Bit Quantized Large Language Model \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2402.05147) \n\n- Accurate LoRA-Finetuning Quantization of LLMs via Information Retention \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.05445) [[Code]](https://github.com/htqin/ir-qlora)\n\n- BitDelta: Your Fine-Tune May Only Be Worth One Bit \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.10193) [[Code]](https://github.com/FasterDecoding/BitDelta)\n\n- QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning \u003cbr\u003e EMNLP 2024 Industry Track [[Paper]](https://arxiv.org/abs/2402.10462) \n\n- Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2402.10517) \n\n- BitDistiller: Unleashing the Potential of Sub-4-Bit LLMs via Self-Distillation \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2402.10631) [[Code]](https://github.com/DD-DuDa/BitDistiller)\n\n- OneBit: Towards Extremely Low-bit Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.11295)\n\n- DB-LLM: Accurate Dual-Binarization for Efficient LLMs \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2402.11960)\n\n- WKVQuant: Quantizing Weight and Key/Value Cache for Large Language Models Gains More \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.12065)\n\n- GPTVQ: The Blessing of Dimensionality for LLM Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.15319) [[Code]](https://github.com/qualcomm-ai-research/gptvq)\n\n- APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models \u003cbr\u003e DAC 2024 
[[Paper]](https://arxiv.org/abs/2402.14866) \n\n- A Comprehensive Evaluation of Quantization Strategies for Large Language Models \u003cbr\u003e DAC 2024 [[Paper]](https://arxiv.org/abs/2402.16775) \n\n- Evaluating Quantized Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.18158)\n\n- FlattenQuant: Breaking Through the Inference Compute-bound for Large Language Models with Per-tensor Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.17985)\n\n- LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.01136)\n\n- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2403.01241) [[Code]](https://github.com/ruikangliu/IntactKV)\n\n- On the Compressibility of Quantized Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.01384)\n\n- EasyQuant: An Efficient Data-free Quantization Algorithm for LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.02775)\n\n- What Makes Quantization for Large Language Models Hard? An Empirical Study from the Lens of Perturbation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.06408)\n\n- SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.07378) [[Code]](https://github.com/AIoT-MLSys-Lab/SVD-LLM)\n\n- AffineQuant: Affine Transformation Quantization for Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://browse.arxiv.org/abs/2402.00858) [[Code]](https://github.com/bytedance/AffineQuant)\n\n- Oh! 
We Freeze: Improving Quantized Knowledge Distillation via Signal Propagation Analysis for Large Language Models \u003cbr\u003e ICLR Practical ML for Low Resource Settings Workshop 2024 [[Paper]](https://arxiv.org/abs/2403.18159) \n\n- Accurate Block Quantization in LLMs with Outliers \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.20137)\n\n- QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.00456) [[Code]](https://github.com/spcl/QuaRot)\n\n- Minimize Quantization Output Error with Bias Compensation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.01892) [[Code]](https://github.com/GongCheng1919/bias-compensation)\n\n- Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.02837)\n\n- Increased LLM Vulnerabilities from Fine-tuning and Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.04392)\n\n- Quantization of Large Language Models with an Overdetermined Basis \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.09737)\n\n- How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.14047) [[Code]](https://github.com/Macaronlin/LLaMA3-Quantization) [[Model]](https://huggingface.co/LLMQ)\n\n- How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.16898)\n\n- Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.03605) [[Code]](https://github.com/aninrusimha/qat-pretrain)\n\n- When Quantization Affects Confidence of Large Language Models? 
\u003cbr\u003e NAACL 2024 [[Paper]](https://arxiv.org/abs/2405.00632)\n\n- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.04532) [[Code]](https://github.com/mit-han-lab/qserve)\n\n- Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2405.03103)\n\n- LLM-QBench: A Benchmark Towards the Best Practice for Post-training Quantization of Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.06001) [[Code]](https://github.com/ModelTC/llmc)\n\n- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.06219) \n\n- Combining multiple post-training techniques to achieve most efficient quantized LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.07135) \n\n- Edge Intelligence Optimization for Large Language Model Inference with Batching and Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.07140) \n\n- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.14917) [[Code]](https://github.com/Aaronhuang-778/SliM-LLM)\n\n- OAC: Output-adaptive Calibration for Accurate Post-training Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.15025) \n\n- PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.14852) \n\n- SpinQuant -- LLM quantization with learned rotations \u003cbr\u003e Arxiv 2024 [[Paper]](https://www.arxiv.org/abs/2405.16406) \n\n- Compressing Large Language Models using Low Rank and Low Precision Decomposition \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.18886) 
[[Code]](https://github.com/pilancilab/caldera)\n\n- Athena: Efficient Block-Wise Post-Training Quantization for Large Language Models Using Second-Order Matrix Derivative Information \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.17470) \n\n- Exploiting LLM Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.18137) \n\n- One QuantLLM for ALL: Fine-tuning Quantized LLMs Once for Efficient Deployments \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.20202) \n\n- LCQ: Low-Rank Codebook based Quantization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.20973) \n\n- LoQT: Low Rank Adapters for Quantized Training \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.16528) [[Code]](https://github.com/sebulo/LoQT)\n\n- CLAQ: Pushing the Limits of Low-Bit Post-Training Quantization for LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.17233) [[Code]](https://github.com/fayuge/CLAQ)\n\n- I-LLM: Efficient Integer-Only Inference for Fully-Quantized Low-Bit Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.17849)\n\n- Outliers and Calibration Sets have Diminishing Effect on Quantization of Modern LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.20835)\n\n- DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs \u003cbr\u003e NeurIPS 2024 [[Paper]](https://arxiv.org/abs/2406.01721) [[Code]](https://github.com/Hsu1023/DuQuant)\n\n- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.05981) [[Code]](https://github.com/GATECH-EIC/ShiftAddLLM)\n\n- Low-Rank Quantization-Aware Training for LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.06385)\n\n- TernaryLLM: Ternarized Large Language Model \u003cbr\u003e Arxiv 2024 
[[Paper]](https://arxiv.org/abs/2406.07177)\n\n- Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.08155) [[Code]](https://github.com/UNITES-Lab/moe-quantization)\n\n- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.08903)\n\n- QQQ: Quality Quattuor-Bit Quantization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.09904) [[Code]](https://github.com/HandH1998/QQQ)\n\n- QTIP: Quantization with Trellises and Incoherence Processing \u003cbr\u003e NeurIPS 2024 [[Paper]](https://arxiv.org/abs/2406.11235) [[Code]](https://github.com/Cornell-RelaxML/qtip)\n\n- Prefixing Attention Sinks can Mitigate Activation Outliers for Large Language Model Quantization \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2406.12016) \n\n- Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.12311) \n\n- Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization \u003cbr\u003e ISCA 2024 [[Paper]](https://arxiv.org/abs/2406.12930) \n\n- SDQ: Sparse Decomposed Quantization for LLM Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.13868) \n\n- Attention-aware Post-training Quantization without Backpropagation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.13474) \n\n- EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.15758) [[Code]](https://github.com/GATECH-EIC/Edge-LLM)\n\n- Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.16299) 
\n\n- Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.17415) [[Code]](https://github.com/RazvanDu/LayerwiseQuant)\n\n- CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.17542) \n\n- OutlierTune: Efficient Channel-Wise Quantization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.18832) \n\n- T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.00088) [[Code]](https://github.com/microsoft/T-MAC)\n\n- GPTQT: Quantize Large Language Models Twice to Push the Efficiency \u003cbr\u003e ICORIS 2024 [[Paper]](https://arxiv.org/abs/2407.02891) \n\n- Improving Conversational Abilities of Quantized Large Language Models via Direct Preference Alignment \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2407.03051) \n\n- How Does Quantization Affect Multilingual LLMs? 
\u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2407.03211) \n\n- RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2407.08044) [[Code]](https://github.com/HuangOwen/RoLoRA) \n\n- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.08296) [[Code]](https://github.com/VITA-Group/Q-GaLore) \n\n- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.08608) [[Code]](https://github.com/Dao-AILab/flash-attention) \n\n- Accuracy is Not All You Need \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.09141)\n\n- BitNet b1.58 Reloaded: State-of-the-art Performance Also on Smaller Networks \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.09527)\n\n- LeanQuant: Accurate Large Language Model Quantization with Loss-Error-Aware Grid \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.10032)\n\n- Fast Matrix Multiplications for Lookup Table-Quantized LLMs \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2407.10960) [[Code]](https://github.com/HanGuo97/flute) \n\n- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.11062) [[Code]](https://github.com/OpenGVLab/EfficientQAT) \n\n- LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.11534) [[Code]](https://github.com/onliwad101/FlexRound_LRQ) \n\n- Exploring Quantization for Efficient Pre-Training of Transformer Language Models \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2407.11722) [[Code]](https://github.com/chandar-lab/EfficientLLMs) 
\n\n- Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.12327) [[Code]](https://github.com/NolanoOrg/SpectraSuite) \n\n- Mamba-PTQ: Outlier Channels in Recurrent Large Language Models  \u003cbr\u003e Efficient Systems for Foundation Models Workshop @ ICML 2024 [[Paper]](https://arxiv.org/abs/2407.12397)\n\n- Compensate Quantization Errors+: Quantized Models Are Inquisitive Learners \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.15508)\n\n- Accurate and Efficient Fine-Tuning of Quantized Large Language Models Through Optimal Balance \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.17029) [[Code]](https://github.com/xiaocaigou/qbaraqahira) \n\n- STBLLM: Breaking the 1-Bit Barrier with Structured Binary LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.01803)\n\n- Advancing Multimodal Large Language Models with Quantization-Aware Scale Learning for Efficient Adaptation \u003cbr\u003e ACM MM 2024 [[Paper]](https://arxiv.org/abs/2408.03735)\n\n- ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.08554) \n\n- MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.11743) [[Code (Marlin)]](https://github.com/IST-DASLab/marlin) [[Code (Sparse Marlin)]](https://github.com/IST-DASLab/Sparse-Marlin)\n\n- Matmul or No Matmal in the Era of 1-bit LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.11939)\n\n- MobileQuant: Mobile-friendly Quantization for On-device Language Models  \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2408.13933) [[Code]](https://github.com/saic-fi/MobileQuant) \n\n- GIFT-SW: Gaussian noise Injected Fine-Tuning of Salient Weights for LLMs \u003cbr\u003e Arxiv 2024 
[[Paper]](https://arxiv.org/abs/2408.15300) [[Code]](https://github.com/On-Point-RND/GIFT_SW-v2-Gaussian-noise-Injected-Fine-Tuning-of-Salient-Weights-for-LLMs) \n\n- Foundations of Large Language Model Compression -- Part 1: Weight Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.02026) [[Code]](https://github.com/seannz/cvxq) \n\n- OPAL: Outlier-Preserved Microscaling Quantization Accelerator for Generative Large Language Models \u003cbr\u003e DAC 2024 [[Paper]](https://arxiv.org/abs/2409.05902)\n\n- VPTQ: Extreme Low-bit Vector Post-Training Quantization for Large Language Models \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2409.17066) [[Code]](https://github.com/microsoft/VPTQ)\n\n- Scaling FP8 training to trillion-token LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.12517)\n\n- Accumulator-Aware Post-Training Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.17092)\n\n- Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.17870)\n\n- Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.20361) [[Code]](https://github.com/Coco58323/Rotated_Runtime_Smooth)\n\n- EXAQ: Exponent Aware Quantization For LLMs Acceleration \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.03185) \n\n- ARB-LLM: Alternating Refined Binarizations for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.03129) [[Code]](https://github.com/ZHITENGLI/ARB-LLM)\n\n- PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.05265) [[Code]](https://github.com/ChenMnZ/PrefixQuant)\n\n- SpaLLM: Unified Compressive Adaptation of Large Language Models with Sketching \u003cbr\u003e Arxiv 2024 
[[Paper]](https://arxiv.org/abs/2410.06364) \n\n- Scaling Laws for Mixed quantization in Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.06722) \n\n- Q-VLM: Post-training Quantization for Large Vision-Language Models \u003cbr\u003e NeurIPS 2024 [[Paper]](https://arxiv.org/abs/2410.08119) [[Code]](https://github.com/ChangyuanWang17/QVLM)\n\n- CrossQuant: A Post-Training Quantization Method with Smaller Quantization Kernel for Precise Large Language Model Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.07505) \n\n- FlatQuant: Flatness Matters for LLM Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.09426) [[Code]](https://github.com/ruikangliu/FlatQuant)\n\n- DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.08666) \n\n- QEFT: Quantization for Efficient Fine-Tuning of LLMs \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2410.08661) [[Code]](https://github.com/xvyaward/qeft)\n\n- Continuous Approximations for Improving Quantization Aware Training of LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.10849) \n\n- DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.12187) \n\n- COMET: Towards Partical W4A4KV4 LLMs Serving \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.12168) \n\n- Scaling laws for post-training quantized large language models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.12119) \n\n- Channel-Wise Mixed-Precision Quantization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.13056) \n\n- Understanding the difficulty of low-precision post-training quantization of large language models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.14570) \n\n- 
QuAILoRA: Quantization-Aware Initialization for LoRA \u003cbr\u003e NeurIPS Workshop on Efficient Natural Language and Speech Processing (ENLSP-IV) 2024 [[Paper]](https://arxiv.org/abs/2410.14713) \n\n- SDP4Bit: Toward 4-bit Communication Quantization in Sharded Data Parallelism for LLM Training \u003cbr\u003e NeurIPS 2024 [[Paper]](https://arxiv.org/abs/2410.15526) \n\n- Pyramid Vector Quantization for LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.16926) \n\n- TesseraQ: Ultra Low-Bit LLM Post-Training Quantization with Block Reconstruction \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.19103) [[Code]](https://github.com/Intelligent-Computing-Lab-Yale/TesseraQ)\n\n- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.19313) [[Code]](https://github.com/NVlabs/COAT)\n\n- BitStack: Fine-Grained Size Control for Compressed Large Language Models in Variable Memory Environments \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.23918) [[Code]](https://github.com/xinghaow99/BitStack)\n\n- GWQ: Gradient-Aware Weight Quantization for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2411.00850) \n\n- \"Give Me BF16 or Give Me Death\"? 
Accuracy-Performance Trade-Offs in LLM Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2411.02355) \n\n- Interactions Across Blocks in Post-Training Quantization of Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2411.03934) \n\n### Pruning and Sparsity\n\n- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers \u003cbr\u003e ICLR 2023 [[Paper]](https://openreview.net/forum?id=TJ2nxciYCk-)\n\n- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time \u003cbr\u003e ICML 2023 [[Paper]](https://proceedings.mlr.press/v202/liu23am.html)  [[Code]](https://github.com/FMInference/DejaVu)\n\n- LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation \u003cbr\u003e ICML 2023 [[Paper]](https://arxiv.org/abs/2306.11222)  [[Code]](https://github.com/yxli2123/LoSparse)\n\n- LLM-Pruner: On the Structural Pruning of Large Language Models \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2305.11627) [[Code]](https://github.com/horseee/LLM-Pruner)\n\n- ZipLM: Inference-Aware Structured Pruning of Language Models \u003cbr\u003e NeurIPS 2023  [[Paper]](https://arxiv.org/abs/2302.04089) [[Code]](https://github.com/IST-DASLab/ZipLM)\n\n- H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/pdf/2306.14048.pdf) [[Code]](https://github.com/FMInference/H2O)\n\n- The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter \u003cbr\u003e NeurIPS 2023 [[Paper]](https://openreview.net/pdf?id=bU9hwbsVcy) [[Code]](https://github.com/VITA-Group/essential_sparsity)\n\n- Learning to Compress Prompts with Gist Tokens \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/pdf/2304.08467.pdf)\n\n- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers \u003cbr\u003e NeurIPS 2023 
[[Paper]](https://openreview.net/pdf?id=uvdJgFFzby)\n\n- Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models \u003cbr\u003e ICLR 2023 TinyPapers [[Paper]](https://openreview.net/pdf?id=cKlgcx7nSZ)\n\n- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot \u003cbr\u003e ICML 2023 [[Paper]](https://arxiv.org/abs/2301.00774) [[Code]](https://github.com/IST-DASLab/sparsegpt)\n\n- AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning \u003cbr\u003e ICLR 2023 [[Paper]](https://arxiv.org/abs/2303.10512)\n\n- Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale \u003cbr\u003e ACL 2023 [[Paper]](https://arxiv.org/abs/2212.09095) [[Code]](https://github.com/amazon-science/llm-interpret)\n\n- Structured Pruning for Efficient Generative Pre-trained Language Models \u003cbr\u003e ACL 2023 [[Paper]](https://aclanthology.org/2023.findings-acl.692.pdf)\n\n- A Simple and Effective Pruning Approach for Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2306.11695) [[Code]](https://github.com/locuslab/wanda)\n\n- Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2305.18403) \n\n- Structural pruning of large language models via neural architecture search \u003cbr\u003e AutoML 2023 [[Paper]](https://www.amazon.science/publications/structural-pruning-of-large-language-models-via-neural-architecture-search) \n\n- Pruning Large Language Models via Accuracy Predictor \u003cbr\u003e ICASSP 2024 [[Paper]](https://arxiv.org/abs/2309.09507) \n\n- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity \u003cbr\u003e VLDB 2024 [[Paper]](https://arxiv.org/abs/2309.10285) [[Code]](https://github.com/AlibabaResearch/flash-llm)\n\n- Compressing LLMs: The Truth is Rarely Pure and Never Simple \u003cbr\u003e ICLR 2024 
[[Paper]](https://arxiv.org/abs/2310.01382) \n\n- Pruning Small Pre-Trained Weights Irreversibly and Monotonically Impairs \"Difficult\" Downstream Tasks in LLMs \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2310.02277)  [[Code]](https://github.com/VITA-Group/Junk_DNA_Hypothesis)\n\n- Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.05015) [[Code]](https://github.com/microsoft/Moonlit/tree/main/Compresso)\n\n- Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.05175) [[Code]](https://github.com/luuyin/OWL)\n\n- Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.06694) [[Code]](https://github.com/princeton-nlp/LLM-Shearing)\n\n- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2310.08915) [[Code]](https://github.com/zxyxmu/DSnoT)\n\n- One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models \u003cbr\u003e ICASSP 2024 [[Paper]](https://arxiv.org/abs/2310.09499) \n\n- Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning \u003cbr\u003e EMNLP Findings 2023 [[Paper]](https://arxiv.org/abs/2310.12774) \n\n- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models \u003cbr\u003e EMNLP Findings 2023 [[Paper]](https://arxiv.org/abs/2312.00960) \n\n- Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2311.01544) \n\n- LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery \u003cbr\u003e Arxiv 2023 
[[Paper]](https://arxiv.org/abs/2310.18356) \n\n- ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.04564) \n\n- E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.15929)\n\n- Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2311.04902) [[Code]](https://github.com/RocktimJyotiDas/GBLM-Pruner)\n\n- On the Impact of Calibration Data in Post-training Quantization and Pruning \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2311.09755)\n\n- BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation \u003cbr\u003e OpenReview [[Paper]](https://openreview.net/pdf?id=gC6JTEU3jl) [[Code]](https://github.com/LinkAnonymous/BESA)\n\n- PUSHING GRADIENT TOWARDS ZERO: A NOVEL PRUNING METHOD FOR LARGE LANGUAGE MODELS \u003cbr\u003e OpenReview 2023 [[Paper]](https://openreview.net/attachment?id=IU4L7wiwxw\u0026name=pdf)\n\n- Plug-and-Play: An Efficient Post-training Pruning Method for Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://openreview.net/forum?id=Tr0lPx9woF) [[Code]](https://github.com/biomedical-cybernetics/Relative-importance-and-activation-pruning)\n\n- Lighter, yet More Faithful: Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/pdf/2311.09335.pdf) [[Code]](https://github.com/casszhao/PruneHall)\n\n- LORAPRUNE: PRUNING MEETS LOW-RANK PARAMETER-EFFICIENT FINE-TUNING \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/pdf/2305.18403.pdf)\n\n- Mini-GPTs: Efficient Large Language Models through Contextual Pruning \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.12682) [[Code]](https://github.com/tval2/contextual-pruning)\n\n- The LLM 
Surgeon \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.17244)\n\n- Fluctuation-based Adaptive Structured Pruning for Large Language Models \u003cbr\u003e AAAI 2024 [[Paper]](https://arxiv.org/abs/2312.11983)\n\n- How to Prune Your Language Model: Recovering Accuracy on the \"Sparsity May Cry'' Benchmark \u003cbr\u003e CPAL 2024 [[Paper]](https://arxiv.org/abs/2312.13547)\n\n- PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.15230)\n\n- Fast and Optimal Weight Update for Pruned Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.02938)\n\n- APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.12200)\n\n- Scaling Sparse Fine-Tuning to Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.16405)\n\n- SliceGPT: Compress Large Language Models by Deleting Rows and Columns \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2401.15024) [[Code]](https://github.com/microsoft/TransformerCompression)\n\n- Shortened LLaMA: A Simple Depth Pruning for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.02834)\n\n- Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.05406) [[Code]](https://github.com/ldery/Bonsai)\n\n- NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.09773)\n\n- LaCo: Large Language Model Pruning via Layer Collapse \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2402.11187) \n\n- Why Lift so Heavy? 
Slimming Large Language Models by Cutting Off the Layers \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.11700)\n\n- EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/pdf/2402.12419) [[Code]](https://github.com/sunggo/EBFT)\n\n- Data-free Weight Compress and Denoise for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.16319)\n\n- Gradient-Free Adaptive Global Pruning for Pre-trained Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.17946)\n\n- ShortGPT: Layers in Large Language Models are More Redundant Than You Expect \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.03853)\n\n- LLaVA-PruMerge: Adaptive Token Reduction for Efficient Large Multimodal Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.15388) [[Code]](https://github.com/42Shawn/LLaVA-PruMerge)\n\n- Compressing Large Language Models by Streamlining the Unimportant Layer \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.19135)\n\n- LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.09695)\n\n- LoNAS: Elastic Low-Rank Adapters for Efficient Large Language Models \u003cbr\u003e COLING 2024 [[Paper]](https://aclanthology.org/2024.lrec-main.940) [[Code]](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/LoNAS)\n\n- Shears: Unstructured Sparsity with Neural Low-rank Adapter Search \u003cbr\u003e NAACL 2024 [[Paper]](https://arxiv.org/abs/2404.10934) [[Code]](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/Shears)\n\n- Eigenpruning \u003cbr\u003e NAACL 2024 Abstract [[Paper]](https://arxiv.org/abs/2404.03147)\n\n- OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning \u003cbr\u003e Arxiv 2024 
[[Paper]](https://arxiv.org/abs/2405.05957)\n\n- Pruning as a Domain-specific LLM Extractor \u003cbr\u003e NAACL 2024 Findings [[Paper]](https://arxiv.org/abs/2405.06275) [[Code]](https://github.com/psunlpgroup/D-Pruner)\n\n- Differentiable Model Scaling using Differentiable Topk \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2405.07194)\n\n- COPAL: Continual Pruning in Large Language Generative Models \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2405.02347)\n\n- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for Large Language Models  \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2406.02924) [[Code]](https://github.com/pprp/Pruner-Zero)\n\n- Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2405.10616)\n\n- Surgical Feature-Space Decomposition of LLMs: Why, When and How? \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2405.13039)\n\n- Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2407.05690)\n\n- Light-PEFT: Lightening Parameter-Efficient Fine-Tuning via Early Pruning \u003cbr\u003e ACL Findings 2024 [[Paper]](https://arxiv.org/abs/2406.03792) [[Code]](https://github.com/gccnlp/Light-PEFT)\n\n- Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2406.10774) [[Code]](https://github.com/mit-han-lab/Quest)\n\n- MoreauPruner: Robust Pruning of Large Language Models against Weight Perturbations \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.07017) [[Code]](https://github.com/ShiningSord/MoreauPruner)\n\n- ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.07831)\n\n- HiP Attention: Sparse Sub-Quadratic Attention 
with Hierarchical Attention Pruning \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.09827)\n\n- Optimization-based Structural Pruning for Large Language Models without Back-Propagation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.10576)\n\n- BlockPruner: Fine-grained Pruning for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.10594) [[Code]](https://github.com/MrGGLS/BlockPruner)\n\n- Rethinking Pruning Large Language Models: Benefits and Pitfalls of Reconstruction Error Minimization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.15524) \n\n- RankAdaptor: Hierarchical Dynamic Low-Rank Adaptation for Structural Pruned LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.15734) \n\n- What Matters in Transformers? Not All Attention is Needed \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.15786) [[Code]](https://github.com/Shwai-He/LLM-Drop)\n\n- Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2406.16330) \n\n- ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.16635) [[Code]](https://github.com/abdelfattah-lab/shadow_llm/)\n\n- Finding Transformer Circuits with Edge Pruning \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.16778) [[Code]](https://github.com/princeton-nlp/Edge-Pruning)\n\n- Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.00945) [[Code]](https://github.com/imagination-research/EEP)\n\n- MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.11681) \n\n- Reconstruct the Pruned Model without Any Retraining  \u003cbr\u003e Arxiv 2024 
[[Paper]](https://arxiv.org/abs/2407.13331) \n\n- A deeper look at depth pruning of LLMs \u003cbr\u003e ICML TF2M Workshop 2024 [[Paper]](https://arxiv.org/abs/2407.16286) [[Code]](https://github.com/shoaibahmed/llm_depth_pruning)\n\n- Greedy Output Approximation: Towards Efficient Structured Pruning for LLMs Without Retraining \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.19126) \n\n- Pruning Large Language Models with Semi-Structural Adaptive Sparse Training \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.20584) \n\n- A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.03728) \n\n- ThinK: Thinner Key Cache by Query-Driven Pruning \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.21018) \n\n- LLM-Barber: Block-Aware Rebuilder for Sparsity Mask in One-Shot for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.10631) [[Code]](https://github.com/YupengSu/LLM-Barber)\n\n- LLM Pruning and Distillation in Practice: The Minitron Approach \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.11796) [[Models]](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base)\n\n- Training-Free Activation Sparsity in Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.14690) \n\n- PAT: Pruning-Aware Tuning for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.14721) [[Code]](https://github.com/kriskrisliu/PAT_Pruning-Aware-Tuning)\n\n- Sirius: Contextual Sparsity with Correction for Efficient LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.03856) [[Code]](https://github.com/Infini-AI-Lab/Sirius)\n\n- STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.06211)\n\n- Search for Efficient Large Language Models \u003cbr\u003e NeurIPS 2024 
[[Paper]](https://arxiv.org/abs/2409.17372)\n\n- SlimGPT: Layer-wise Structured Pruning for Large Language Models \u003cbr\u003e NeurIPS 2024 [[Paper]](https://nips.cc/virtual/2024/poster/95477)\n\n- Learn To be Efficient: Build Structured Sparsity in Large Language Models \u003cbr\u003e NeurIPS 2024 [[Paper]](https://arxiv.org/abs/2402.06126)\n\n- ALS: Adaptive Layer Sparsity for Large Language Models via Activation Correlation Assessment \u003cbr\u003e NeurIPS 2024 [[Paper]](https://nips.cc/virtual/2024/poster/95693)\n\n- Getting Free Bits Back from Rotational Symmetries in LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.01309)\n\n- SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.09615) [[Code]](https://github.com/Mohammad-Mozaffari/slim)\n\n- Self-Data Distillation for Recovering Quality in Pruned Large Language Models \u003cbr\u003e NeurIPS 2024 Machine Learning and Compression Workshop [[Paper]](https://arxiv.org/abs/2410.09982) \n\n- EvoPress: Towards Optimal Dynamic Model Compression via Evolutionary Search \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.14649) [[Code]](https://github.com/IST-DASLab/EvoPress)\n\n- Pruning Foundation Models for High Accuracy without Retraining \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2410.15567) [[Code]](https://github.com/piuzha/APT)\n\n- Beware of Calibration Data for Pruning Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.17711)\n\n- SQFT: Low-cost Model Adaptation in Low-precision Sparse Foundation Models \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2410.03750) [[Code]](https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning/tree/main/SQFT)\n\n### Distillation\n\n- Lifting the Curse of Capacity Gap in Distilling Language Models \u003cbr\u003e ACL 2023 [[Paper]](https://arxiv.org/abs/2305.12129) 
[[Code]](https://github.com/GeneZC/MiniMoE)\n\n- Symbolic Chain-of-Thought Distillation: Small Models Can Also \"Think\" Step-by-Step \u003cbr\u003e ACL 2023 [[Paper]](https://arxiv.org/abs/2306.14050) \n\n- Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes \u003cbr\u003e ACL 2023 [[Paper]](https://arxiv.org/abs/2305.02301) \n\n- SCOTT: Self-Consistent Chain-of-Thought Distillation \u003cbr\u003e ACL 2023 [[Paper]](https://arxiv.org/abs/2305.01879) \n\n- DISCO: Distilling Counterfactuals with Large Language Models \u003cbr\u003e ACL 2023 [[Paper]](https://arxiv.org/abs/2212.10534) [[Code]](https://github.com/eric11eca/disco)\n\n- LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2304.14402) [[Code]](https://github.com/mbzuai-nlp/LaMini-LM)\n\n- How To Train Your (Compressed) Large Language Model \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.14864) \n\n- The False Promise of Imitating Proprietary LLMs \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.15717)\n\n- GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo \u003cbr\u003e Arxiv 2023 [[Paper]](https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf) [[Code]](https://github.com/nomic-ai/gpt4all)\n\n- PaD: Program-aided Distillation Specializes Large Models in Reasoning \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.13888) \n\n- MiniLLM: Knowledge Distillation of Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2306.08543) [[Code]](https://github.com/microsoft/LMOps/tree/main/minillm)\n\n- On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2306.13649)\n\n- GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models 
\u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2306.13649)\n\n- Chain-of-Thought Prompt Distillation for Multimodal Named Entity and Multimodal Relation Extraction \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2306.14122)\n\n- Task-agnostic Distillation of Encoder-Decoder Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.12330)\n\n- Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2308.04679)\n\n- Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty \u003cbr\u003e CoNLL 2023 [[Paper]](https://arxiv.org/abs/2308.02019) [[Code]](https://github.com/timinar/BabyLlama)\n\n- Can a student Large Language Model perform as well as it's teacher? \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.02421)\n\n- Multistage Collaborative Knowledge Distillation from Large Language Models \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2311.08640) [[Code]](https://github.com/andotalao24/Multistage-Collaborative-Knowledge-Distillation)\n\n- Lion: Adversarial Distillation of Closed-Source Large Language Model \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2305.12870) [[Code]](https://github.com/YJiangcm/Lion)\n\n- MCC-KD: Multi-CoT Consistent Knowledge Distillation \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.14747)\n\n- PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.14192)\n\n- YODA: Teacher-Student Progressive Learning for Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2401.15670)\n\n- Knowledge Fusion of Large Language Models \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2401.10491) [[Code]](https://github.com/fanqiwan/FuseLLM)\n\n- Knowledge Distillation for 
Closed-Source Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.07013)\n\n- TinyLLM: Learning a Small Student from Multiple Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.04616)\n\n- Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs  \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.12030)\n\n- Revisiting Knowledge Distillation for Autoregressive Language Models \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2402.11890)\n\n- Sinkhorn Distance Minimization for Knowledge Distillation \u003cbr\u003e COLING 2024 [[Paper]](https://arxiv.org/abs/2402.17110) \n\n- Divide-or-Conquer? Which Part Should You Distill Your LLM? \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.15000)\n\n- Learning to Maximize Mutual Information for Chain-of-Thought Distillation \u003cbr\u003e ACL 2024 Findings [[Paper]](https://arxiv.org/abs/2403.03348)\n\n- DistiLLM: Towards Streamlined Distillation for Large Language Models \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2402.03898) [[Code]](https://github.com/jongwooko/distillm)\n\n- Efficiently Distilling LLMs for Edge Applications \u003cbr\u003e NAACL 2024 [[Paper]](https://arxiv.org/abs/2404.01353)\n\n- Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.02657)\n\n- Distilling Algorithmic Reasoning from LLMs via Explaining Solution Programs \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.08148)\n\n- Direct Preference Knowledge Distillation for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.19774) [[Codes]](https://github.com/microsoft/LMOps/tree/main/dpkd)\n\n- Dual-Space Knowledge Distillation for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.17328) [[Codes]](https://github.com/songmzhang/DSKD)\n\n- DDK: 
Distilling Domain Knowledge for Efficient Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.16154)\n\n- Compact Language Models via Pruning and Knowledge Distillation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.14679) [[Code]](https://github.com/NVlabs/Minitron)\n\n- LLM Pruning and Distillation in Practice: The Minitron Approach \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.11796) [[Models]](https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base)\n\n- The Mamba in the Llama: Distilling and Accelerating Hybrid Models  \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.15237) \n\n- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2410.03061) \n\n- SWITCH: Studying with Teacher for Knowledge Distillation of Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.19503)\n\n- Mentor-KD: Making Small Language Models Better Multi-step Reasoners \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2410.09037) [[Code]](https://github.com/2hojae/mentor-kd)\n\n- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.12512) \n\n### Efficient Prompting\n\n- Did You Read the Instructions? 
Rethinking the Effectiveness of Task Definitions in Instruction Learning \u003cbr\u003e ACL 2023 [[Paper]](https://arxiv.org/abs/2306.01150) [[Code]](https://github.com/fanyin3639/Rethinking-instruction-effectiveness)\n\n- Batch Prompting: Efficient Inference with Large Language Model APIs \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2301.08721) [[Code]](https://github.com/HKUNLP/batch-prompting) \n\n- Adapting Language Models to Compress Contexts \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2305.14788) [[Code]](https://github.com/princeton-nlp/AutoCompressors)\n\n- Compressing Context to Enhance Inference Efficiency of Large Language Models \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.06201) [[Code]](https://github.com/liyucheng09/Selective_Context)\n\n- LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.05736) [[Code]](https://github.com/microsoft/LLMLingua)\n\n- Vector-Quantized Prompt Learning for Paraphrase Generation \u003cbr\u003e EMNLP Findings 2023 [[Paper]](https://arxiv.org/abs/2311.14949)\n\n- Efficient Prompting via Dynamic In-Context Learning \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.11170)\n\n- Learning to Compress Prompts with Gist Tokens \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2304.08467) [[Code]](https://github.com/jayelm/gisting)\n\n- In-context Autoencoder for Context Compression in a Large Language Model \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2307.06945) \n\n- Discrete Prompt Compression with Reinforcement Learning \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2308.08758) [[Code]](https://github.com/nenomigami/PromptCompressor)\n\n- BatchPrompt: Accomplish more with less \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.00384) \n\n- (Dynamic) Prompting might be all you need to repair Compressed LLMs \u003cbr\u003e Arxiv 2023 
[[Paper]](https://arxiv.org/abs/2310.00867) \n\n- RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.04408) [[Code]](https://github.com/carriex/recomp)\n\n- LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2310.06839) [[Code]](https://github.com/microsoft/LLMLingua)\n\n- Extending Context Window of Large Language Models via Semantic Compression \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.09571)\n\n- Fewer is More: Boosting LLM Reasoning with Reinforced Context Pruning \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2312.08901) [[Code]](https://github.com/HuangOwen/CoT-Influx)\n\n- The Impact of Reasoning Step Length on Large Language Models \u003cbr\u003e ACL 2024 Findings [[Paper]](https://arxiv.org/abs/2401.04925)\n\n- Compressed Context Memory For Online Language Model Interaction \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2312.03414) [[Code]](https://github.com/snu-mllab/context-memory)\n\n- Learning to Compress Prompt in Natural Language Formats \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.18700)\n\n- Say More with Less: Understanding Prompt Learning Behaviors through Gist Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.04925) [[Code]](https://github.com/OpenMatch/Gist-COCO)\n\n- StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.08312) \n\n- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.12968)  [[Code]](https://github.com/microsoft/LLMLingua)\n\n- PCToolkit: A Unified Plug-and-Play Prompt Compression Toolkit of Large Language Models \u003cbr\u003e Arxiv 2024 
[[Paper]](https://arxiv.org/abs/2403.17411)  [[Code]](https://github.com/3DAgentWorld/Toolkit-for-Prompt-Compression)\n\n- PROMPT-SAW: Leveraging Relation-Aware Graphs for Textual Prompt Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.00489) \n\n- Prompts As Programs: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.02319) [[Code]](https://github.com/microsoft/sammo)\n\n- Adapting LLMs for Efficient Context Processing through Soft Prompt Compression \u003cbr\u003e IPCA 2024 [[Paper]](https://arxiv.org/abs/2404.04997) \n\n- Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.03085) \n\n- Unifying Demonstration Selection and Compression for In-Context Learning \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.17062) \n\n- SelfCP: Compressing Long Prompt to 1/12 Using the Frozen Large Language Model Itself \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.17052) \n\n- Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.15504) \n\n- QUITO: Accelerating Long-Context Reasoning through Query-Guided Context Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.00274) [[Code]](https://github.com/Wenshansilvia/attention_compressor)\n\n- 500xCompressor: Generalized Prompt Compression for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.03094)\n\n- Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.15491)\n\n- Prompt Compression with Context-Aware Sentence Encoding for Fast and Improved LLM Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.01227) 
[[Code]](https://github.com/Workday/cpc)\n\n- Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.07331)\n\n- Parse Trees Guided LLM Prompt Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.15395)\n\n- AlphaZip: Neural Network-Enhanced Lossless Text Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.15046)\n\n- Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.17422) [[Code]](https://github.com/SalesforceAIResearch/GemFilter)\n\n- Perception Compressor: A training-free prompt compression method in long context scenarios \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.19272)\n\n- From Reading to Compressing: Exploring the Multi-document Reader for Prompt Compression \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2410.04139)\n\n- Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2410.11786)\n\n- Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2410.14042)\n\n### KV Cache Compression\n\n- Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/pdf/2305.17118.pdf)\n\n- Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs \u003cbr\u003e ICLR 2024 [[Paper]](https://arxiv.org/abs/2310.01801)  \n\n- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization \u003cbr\u003e NeurIPS 2024 [[Paper]](https://arxiv.org/abs/2401.18079)\n\n- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache \u003cbr\u003e ICML 
2024 [[Paper]](https://arxiv.org/abs/2402.02750) [[Code]](https://github.com/jy-yuan/KIVI)\n\n- No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2402.18096)\n\n- Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference \u003cbr\u003e MLSys 2024 [[Paper]](https://arxiv.org/abs/2403.09054)\n\n- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.05527)\n\n- QAQ: Quality Adaptive Quantization for LLM KV Cache \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.04643) [[Code]](https://github.com/ClubieDong/QAQ-KVCacheQuantization)\n\n- KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.03917)\n\n- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2405.12532) \n\n- Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.12591) \n\n- ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.14256)\n\n- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.14366)\n\n- PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling \u003cbr\u003e Arxiv 2024 [[Paper]](http://arxiv.org/abs/2406.02069)\n\n- QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.03482) [[Code]](https://github.com/amirzandieh/QJL)\n\n- Effectively Compress KV Heads for LLM 
\u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.07056)\n\n- A Simple and Effective L2 Norm-Based Strategy for KV Cache Compression \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2406.11430)\n\n- PQCache: Product Quantization-based KVCache for Long Context LLM Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.12820)\n\n- Palu: Compressing KV-Cache with Low-Rank Projection \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.21118) [[Code]](https://github.com/shadowpa0327/Palu)\n\n- RazorAttention: Efficient KV Cache Compression Through Retrieval Heads \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.15891)\n\n- Finch: Prompt-guided Key-Value Cache Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/pdf/2408.00167)\n\n- Zero-Delay QKV Compression for Mitigating KV Cache and Network Bottlenecks in LLM Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.04107)\n\n- Eigen Attention: Attention in Low-Rank Space for KV Cache Compression \u003cbr\u003e EMNLP Findings 2024 [[Paper]](https://arxiv.org/abs/2408.05646) [[Code]](https://github.com/UtkarshSaxena1/EigenAttn/tree/main)\n\n- CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.10593) [[Code]](https://github.com/wln20/CSKV)\n\n- LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.03111)\n\n- SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.13846) [[Code]](https://github.com/sail-sg/SimLayerKV)\n\n- MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.14731)\n\n- AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations 
\u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.13212) \n\n- Residual vector quantization for KV cache compression in large language model \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.15704) [[Code]](https://github.com/iankur/vqllm)\n\n- Lossless KV Cache Compression to 2% \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.15252)\n\n- KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.18517) [[Code]](https://github.com/yangyifei729/KVSharer)\n\n- Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.19318) [[Code]](https://github.com/Clement25/SharedLLM)\n\n### Other\n\n- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness \u003cbr\u003e NeurIPS 2022 [[Paper]](https://arxiv.org/abs/2205.14135) [[Code]](https://github.com/Dao-AILab/flash-attention)\n\n- TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2307.00526)\n\n- Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2305.15805)\n\n- SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2307.02628)\n\n- Scaling In-Context Demonstrations with Structured Attention \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2307.02690)\n\n- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2305.13144) [[Code]](https://github.com/zhengzangw/Sequence-Scheduling)\n\n- CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models \u003cbr\u003e 
Arxiv 2023 [[Paper]](https://arxiv.org/abs/2307.07705)\n\n- Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2308.07641)\n\n- LLMCad: Fast and Scalable On-device Large Language Model Inference \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.04255)\n\n- vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.06180)\n\n- LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.12307) [[Code]](https://github.com/dvlab-research/LongLoRA)\n\n- LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.14021) [[Code]](https://huggingface.co/nolanoAI)\n\n- Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2310.15961) \n\n- Efficient Streaming Language Models with Attention Sinks \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2309.17453) [[Code]](https://github.com/mit-han-lab/streaming-llm)\n\n- Efficient Large Language Models Fine-Tuning On Graphs \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.04737)\n\n- SparQ Attention: Bandwidth-Efficient LLM Inference \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.04985)\n\n- Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.07046) \n\n- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU \u003cbr\u003e Arxiv 2023 [[Paper]](https://arxiv.org/abs/2312.12456)  [[Code]](https://github.com/SJTU-IPADS/PowerInfer)\n\n- Dataset Quantization \u003cbr\u003e ICCV 2023 [[Paper]](https://arxiv.org/abs/2308.10524) 
[[Code]](https://github.com/magic-research/Dataset_Quantization)\n\n- Text Alignment Is An Efficient Unified Model for Massive NLP Tasks \u003cbr\u003e NeurIPS 2023 [[Paper]](https://arxiv.org/abs/2307.02729) [[Code]](https://github.com/yuh-zha/Align)\n\n- Context Compression for Auto-regressive Transformers with Sentinel Tokens \u003cbr\u003e EMNLP 2023 [[Paper]](https://arxiv.org/abs/2310.08152) [[Code]](https://github.com/DRSY/KV_Compression)\n\n- TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction \u003cbr\u003e EMNLP Findings 2023 [[Paper]](https://arxiv.org/abs/2310.15556)\n\n- Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression \u003cbr\u003e EMNLP Findings 2023 [[Paper]](https://arxiv.org/abs/2310.15594)\n\n- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.04044)\n\n- LoMA: Lossless Compressed Memory Attention \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.09486)\n\n- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.10774) [[Code]](https://github.com/FasterDecoding/Medusa)\n\n- BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.12522) [[Code]](https://github.com/linfeng93/BiTA)\n\n- CompactifAI: Extreme Compression of Large Language Models using Quantum-Inspired Tensor Networks \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2401.14109)\n\n- MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2402.14905) [[Code]](https://github.com/facebookresearch/MobileLLM)\n\n- BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models 
\u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.02827) [[Code]](https://github.com/Ledzy/BAdam)\n\n- NoMAD-Attention: Efficient LLM Inference on CPUs Through Multiply-add-free Attention \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.01273)\n\n- Not all Layers of LLMs are Necessary during Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.02181)\n\n- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.03507)\n\n- Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.09636)\n\n- Smart-Infinity: Fast Large Language Model Training using Near-Storage Processing on a Real System \u003cbr\u003e HPCA 2024 [[Paper]](https://arxiv.org/abs/2403.06664)\n\n- ALoRA: Allocating Low-Rank Adaptation for Fine-tuning Large Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2403.16187)\n\n- Parameter Efficient Quasi-Orthogonal Fine-Tuning via Givens Rotation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.04316)\n\n- Training LLMs over Neurally Compressed Text \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.03626)\n\n- TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.11912) [[Code]](https://github.com/Infini-AI-Lab/TriForce)\n\n- SnapKV: LLM Knows What You are Looking for Before Generation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2404.14469) [[Code]](https://github.com/FasterDecoding/SnapKV)\n\n- Characterizing the Accuracy - Efficiency Trade-off of Low-rank Decomposition in Language Models \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.06626)\n\n- KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation \u003cbr\u003e ICML 2024 
[[Paper]](https://arxiv.org/abs/2405.05329)\n\n- Token-wise Influential Training Data Retrieval for Large Language Models \u003cbr\u003e ACL 2024 [[Paper]](https://arxiv.org/abs/2405.11724) [[Code]](https://github.com/huawei-lin/RapidIn)\n\n- Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2405.15877)\n\n- Demystifying the Compression of Mixture-of-Experts Through a Unified Framework \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2406.02500) [[Code]](https://github.com/CASE-Lab-UMD/Unified-MoE-Compression)\n\n- LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.14057)\n\n- AdaCoder: Adaptive Prompt Compression for Programmatic Visual Question Answering \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2407.19410)\n\n- CaM: Cache Merging for Memory-efficient LLMs Inference \u003cbr\u003e ICML 2024 [[Paper]](https://openreview.net/forum?id=LCTmppB165) [[Code]](https://github.com/zyxxmu/cam)\n\n- CLLMs: Consistency Large Language Models \u003cbr\u003e ICML 2024 [[Paper]](https://arxiv.org/abs/2403.00835) [[Code]](https://github.com/hao-ai-lab/Consistency_LLM)\n\n- MoDeGPT: Modular Decomposition for Large Language Model Compression  \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2408.09632)\n\n- Accelerating Large Language Model Training with Hybrid GPU-based Compression  \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2409.02423)\n\n- Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models \u003cbr\u003e NeurIPS 2024 [[Paper]](https://arxiv.org/abs/2409.17836)\n\n- KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.00161)\n\n- InfiniPot: Infinite Context Processing on Memory-Constrained 
LLMs \u003cbr\u003e EMNLP 2024 [[Paper]](https://arxiv.org/abs/2410.01518)\n\n- SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.02367) [[Code]](https://github.com/thu-ml/SageAttention)\n\n- UNComp: Uncertainty-Aware Long-Context Compressor for Efficient Large Language Model Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.03090)\n\n- Basis Sharing: Cross-Layer Parameter Sharing for Large Language Model Compression \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.03765) [[Code]](https://arxiv.org/abs/2410.03765)\n\n- Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.06577)\n\n- DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.10819) [[Code]](https://github.com/mit-han-lab/duo-attention)\n\n- Progressive Mixed-Precision Decoding for Efficient LLM Inference \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.13461)\n\n- EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation \u003cbr\u003e Arxiv 2024 [[Paper]](https://arxiv.org/abs/2410.21271)\n\n- LLMCBench: Benchmarking Large Language Model Compression for Efficient Deployment \u003cbr\u003e NeurIPS 2024 Datasets and Benchmarks Track [[Paper]](https://arxiv.org/abs/2410.21352) [[Code]](https://github.com/AboveParadise/LLMCBench)\n\n- NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks \u003cbr\u003e Arxiv 2024 [[paper]](https://arxiv.org/abs/2410.20650) [[Code]](https://github.com/BorealisAI/neuzip)\n\n## Tools\n\n- BMCook: Model Compression for Big Models [[Code]](https://github.com/OpenBMB/BMCook)\n  \n- llama.cpp: Inference of LLaMA model in pure C/C++ [[Code]](https://github.com/ggerganov/llama.cpp)\n\n- 
LangChain: Building applications with LLMs through composability [[Code]](https://github.com/hwchase17/langchain)\n\n- GPTQ-for-LLaMA: 4 bits quantization of LLaMA using GPTQ [[Code]](https://github.com/qwopqwop200/GPTQ-for-LLaMa)\n\n- Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface [[Code]](https://github.com/PhoebusSi/Alpaca-CoT)\n\n- vllm: A high-throughput and memory-efficient inference and serving engine for LLMs [[Code]](https://github.com/vllm-project/vllm)\n\n- LLaMA Efficient Tuning: Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA) [[Code]](https://github.com/hiyouga/LLaMA-Efficient-Tuning)\n\n- gpt-fast: Simple and efficient pytorch-native transformer text generation in \u003c1000 LOC of python. [[Code]](https://github.com/pytorch-labs/gpt-fast)\n\n- Efficient-Tuning-LLMs: (Efficient Finetuning of QLoRA LLMs). QLoRA, LLama, bloom, baichuan-7B, GLM [[Code]](https://github.com/jianzhnie/Efficient-Tuning-LLMs)\n\n- bitsandbytes: 8-bit CUDA functions for PyTorch [[Code]](https://github.com/TimDettmers/bitsandbytes)\n\n- ExLlama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. [[Code]](https://github.com/turboderp/exllama)\n\n- lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [[Code]](https://github.com/Lightning-AI/lit-gpt)\n\n- Lit-LLaMA: Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. 
[[Code]](https://github.com/Lightning-AI/lit-llama)\n\n- llama.onnx: LLaMa/RWKV onnx models, quantization and testcase [[Code]](https://github.com/tpoisonooo/llama.onnx)\n\n- fastLLaMa: An experimental high-performance framework for running Decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend. [[Code]](https://github.com/PotatoSpudowski/fastLLaMa)\n\n- Sparsebit: A model compression and acceleration toolbox based on pytorch. [[Code]](https://github.com/megvii-research/Sparsebit)\n\n- llama2.c: Inference Llama 2 in one file of pure C [[Code]](https://github.com/karpathy/llama2.c)\n\n- Megatron-LM: Ongoing research training transformer models at scale [[Code]](https://github.com/NVIDIA/Megatron-LM)\n\n- ggml: Tensor library for machine learning [[Code]](https://github.com/ggerganov/ggml)\n\n- LLamaSharp: C#/.NET binding of llama.cpp, including LLaMa/GPT model inference and quantization, ASP.NET core integration and UI [[Code]](https://github.com/SciSharp/LLamaSharp)\n\n- rwkv.cpp: INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model [[Code]](https://github.com/saharNooby/rwkv.cpp)\n\n- Can my GPU run this LLM?: Calculate GPU memory requirement \u0026 breakdown for training/inference of LLM models. Supports ggml/bnb quantization [[Code]](https://github.com/RahulSChand/gpu_poor)\n\n- TinyChatEngine: On-Device LLM Inference Library [[Code]](https://github.com/mit-han-lab/TinyChatEngine)\n\n- TensorRT-LLM: TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. [[Code]](https://github.com/NVIDIA/TensorRT-LLM)\n\n- IntLLaMA: A fast and light quantization solution for LLaMA [[Code]](https://github.com/megvii-research/IntLLaMA)\n\n- EasyLLM: Built upon Megatron-Deepspeed and HuggingFace Trainer, EasyLLM has reorganized the code logic with a focus on usability. 
While enhancing usability, it also ensures training efficiency [[Code]](https://github.com/ModelTC/EasyLLM)\n\n- GreenBit LLaMA: Advanced Ultra-Low Bitrate Compression Techniques for the LLaMA Family of LLMs [[Code]](https://github.com/GreenBitAI/low_bit_llama)\n\n- Intel® Neural Compressor: An open-source Python library supporting popular model compression techniques on all mainstream deep learning frameworks (TensorFlow, PyTorch, ONNX Runtime, and MXNet) [[Code]](https://github.com/intel/neural-compressor)\n\n- LLM-Viewer: Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface. [[Code]](https://github.com/hahnyuan/LLM-Viewer)\n\n- LLaMA3-Quantization: A repository dedicated to evaluating the performance of quantized LLaMA3 using various quantization methods. [[Code]](https://github.com/Macaronlin/LLaMA3-Quantization)\n\n- LLamaSharp: A C#/.NET library to run LLM models (🦙LLaMA/LLaVA) on your local device efficiently. [[Code]](https://github.com/SciSharp/LLamaSharp)\n\n- Green-bit-LLM: A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs. [[Code]](https://github.com/GreenBitAI/green-bit-llm) [[Model]](https://huggingface.co/GreenBitAI)\n\n- Bitorch Engine: Streamlining AI with Open-Source Low-Bit Quantization. [[Code]](https://github.com/GreenBitAI/bitorch-engine)\n\n- llama-zip: LLM-powered lossless compression tool [[Code]](https://github.com/AlexBuz/llama-zip)\n\n- LLaMA-Factory: Unify Efficient Fine-Tuning of 100+ LLMs [[Code]](https://github.com/hiyouga/LLaMA-Factory)\n\n- LLMC: A tool designed for LLM Compression. [[Code]](https://github.com/ModelTC/llmc)\n\n- BitBLAS: BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. 
[[Code]](https://github.com/microsoft/BitBLAS)\n\n- AutoFP8: Open-source FP8 quantization library for producing compressed checkpoints for running in vLLM [[Code]](https://github.com/neuralmagic/AutoFP8)\n\n- AutoGGUF: Automatically quantize GGUF models [[Code]](https://github.com/leafspark/AutoGGUF)\n\n- Transformer Compression: For releasing code related to compression methods for transformers, accompanying our publications [[Code]](https://github.com/microsoft/TransformerCompression)\n\n- Electron-BitNet: Running Microsoft's BitNet via Electron [[Code]](https://github.com/grctest/Electron-BitNet)\n\n## Contributing\nThis is an active repository and your contributions are always welcome! Before you add papers/tools into the awesome list, please make sure that:\n\n- The paper or tool is related to **Large Language Models (LLMs)**. If the compression algorithms or tools are only evaluated on small-scale language models (e.g., BERT), they should not be included in the list.\n- The paper should be inserted in the correct position in chronological order (publication/arxiv release time). 
\n- The link to [Paper] should be the arxiv abstract page, not the PDF page, if the paper is posted on arxiv.\n- If the paper is accepted, please use the correct publication venue instead of arxiv.\n\nThanks again to all the awesome contributors to this list!\n\n\u003ca href=\"https://github.com/HuangOwen/Awesome-LLM-Compression/graphs/contributors\"\u003e\u003cimg src=\"https://contrib.rocks/image?repo=HuangOwen/Awesome-LLM-Compression\u0026max=240\u0026columns=12\" /\u003e\u003c/a\u003e\n\n## Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=HuangOwen/Awesome-LLM-Compression\u0026type=Date)](https://star-history.com/#HuangOwen/Awesome-LLM-Compression\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHuangOwen%2FAwesome-LLM-Compression","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FHuangOwen%2FAwesome-LLM-Compression","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FHuangOwen%2FAwesome-LLM-Compression/lists"}