Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
https://github.com/xlite-dev/Awesome-LLM-Inference

Sub Categories
- 📖KV Cache Scheduling/Quantize/Dropping (72)
- 📖Weight/Activation Quantize/Compress (52)
- 📖IO/FLOPs-Aware/Sparse Attention (48)
- 📖Long Context Attention/KV Cache Optimization (40)
- 📖Parallel Decoding/Sampling (32)
- 📖LLM Algorithmic/Eval Survey (23)
- 📖LLM Train/Inference Framework/Design (20)
- 📖GEMM/Tensor Cores/MMA/Parallel (19)
- 📖Continuous/In-flight Batching (14)
- 📖Early-Exit/Intermediate Layer Decoding (14)
- 📖Prompt/Context/KV Compression (12)
- 📖DeepSeek/Multi-head Latent Attention(MLA) (12)
- 📖Structured Prune/KD/Weight Sparse (11)
- 📖CPU/Single GPU/FPGA/NPU/Mobile Inference (11)
- 📖Multi-GPUs/Multi-Nodes Parallelism (10)
- 📖Mixture-of-Experts(MoE) LLM Inference (9)
- 📖Non Transformer Architecture (7)
- 📖GEMM/Tensor Cores/WMMA/Parallel (6)
- 📖LLM Train/Inference Framework (6)
- 📖CPU/Single GPU/FPGA/Mobile Inference (5)
- 📖VLM/Position Embed/Others (5)
- 📖Disaggregating Prefill and Decoding (4)
- 📖Prompt/Context Compression (4)
- 📖Trending LLM/VLM Topics (3)
- 📖Position Embed/Others (2)