awesome-llm-inference
📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
https://github.com/deftruth/awesome-llm-inference
-
🎉Awesome LLM Inference Papers with Codes
- Awesome LLM Inference for Beginners.pdf
-
📖Contents
-
📖LLM Train/Inference Framework/Design ([©️back👆🏻](#paperlist))
- **siiRL** [[siiRL]](https://github.com/sii-research/siiRL) |⭐️⭐️ |
- **Megatron-LM** [[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) |⭐️⭐️ |
- prima.cpp
- SpecInfer
- FastServe
- StreamingLLM [[streaming-llm]](https://github.com/mit-han-lab/streaming-llm) |⭐️ |
- Medusa
- **TensorRT-LLM** [[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) |⭐️⭐️ |
- **DeepSpeed-FastGen 2x vLLM?** [[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed) |⭐️⭐️ |
- **PETALS** [[petals]](https://github.com/bigscience-workshop/petals) |⭐️⭐️ |
- inferflow
- **LMDeploy**
- **MLC-LLM** [[mlc-llm]](https://github.com/mlc-ai/mlc-llm) |⭐️⭐️ |
- **LightLLM**
- **llama.cpp**
- **flashinfer** [[flashinfer]](https://github.com/flashinfer-ai/flashinfer) |⭐️⭐️ |
- DynamoLLM
- NanoFlow
- **Decentralized LLM**
- **SparseInfer**
- **Mooncake** [[Mooncake]](https://github.com/kvcache-ai/Mooncake) |⭐️⭐️ |
-
📖Parallel Decoding/Sampling ([©️back👆🏻](#paperlist))
- Multi-Token Speculative Decoding
- **Speculative Sampling**
- **Speculative Sampling**
- **Medusa**
- **OSD**
- **Cascade Speculative**
- **Mamba Drafters**
- **STAND**
- **Speculative Decoding** - mad-dash/decoding-speculative-decoding)  |⭐️|
- **Parallel Decoding**
- LookaheadDecoding [[LookaheadDecoding]](https://github.com/hao-ai-lab/LookaheadDecoding) |⭐️⭐️ |
- **TriForce** [[TriForce]](https://github.com/Infini-AI-Lab/TriForce) |⭐️⭐️ |
- Instructive Decoding [[Instructive-Decoding]](https://github.com/joonkeekim/Instructive-Decoding) |⭐️ |
- S3D
- Token Recycling
- **Speculative Decoding** - lty/ParallelSpeculativeDecoding)  |⭐️⭐️ |
- **FocusLLM**
- **MagicDec** [[MagicDec]](https://github.com/Infini-AI-Lab/MagicDec/) |⭐️ |
- **Speculative Decoding**
- **Hybrid Inference**
- **PARALLELSPEC**
- **Fast Best-of-N**
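
Most of the speculative-decoding entries above share one draft-then-verify loop: a cheap draft model proposes several tokens, the target model scores all of them in a single pass, and each draft token is accepted with probability `min(1, p_target/p_draft)`. A minimal NumPy sketch of that loop with toy categorical distributions standing in for both models (all names and shapes here are illustrative, not taken from any listed repo):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size


def toy_model_probs(prefix, temperature):
    """Stand-in for a language model: a categorical distribution over the
    next token that depends (weakly) on the prefix length."""
    logits = np.sin(np.arange(VOCAB) + len(prefix) / temperature)
    e = np.exp(logits - logits.max())
    return e / e.sum()


def speculative_step(prefix, n_draft=4):
    """One draft-then-verify step (rejection sampling as in speculative sampling)."""
    # 1) Draft model proposes n_draft tokens autoregressively.
    draft_tokens, draft_probs = [], []
    ctx = list(prefix)
    for _ in range(n_draft):
        q = toy_model_probs(ctx, temperature=2.0)   # cheap draft model
        t = rng.choice(VOCAB, p=q)
        draft_tokens.append(t)
        draft_probs.append(q)
        ctx.append(t)
    # 2) Target model scores every drafted position "in parallel".
    target_probs = [toy_model_probs(list(prefix) + draft_tokens[:i], temperature=1.0)
                    for i in range(n_draft + 1)]
    # 3) Accept each draft token with prob min(1, p_target / p_draft).
    accepted = []
    for i, t in enumerate(draft_tokens):
        p, q = target_probs[i][t], draft_probs[i][t]
        if rng.random() < min(1.0, p / q):
            accepted.append(t)
        else:
            # Rejected: resample from the residual distribution max(p - q, 0).
            resid = np.maximum(target_probs[i] - draft_probs[i], 0)
            resid = resid / resid.sum() if resid.sum() > 0 else target_probs[i]
            accepted.append(rng.choice(VOCAB, p=resid))
            return accepted
    # All drafts accepted: take one bonus token from the target model.
    accepted.append(rng.choice(VOCAB, p=target_probs[n_draft]))
    return accepted


print("new tokens:", speculative_step([1, 2, 3]))
```

The speedup in practice comes from step 2 being a single batched target-model forward pass over all drafted positions.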
-
📖Multi-GPUs/Multi-Nodes Parallelism ([©️back👆🏻](#paperlist))
- **SP: Star-Attention, ~11x speedup** [[Star-Attention]](https://github.com/NVIDIA/Star-Attention) |⭐️⭐️ |
- **MP: ZeRO**
- **FSDP 1/2**
- **SP: BPT**
- **SP: DEEPSPEED ULYSSES**
- **CP: Meta**
- **SP: Megatron-LM** [[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) |⭐️⭐️ |
- **CP: Megatron-LM** [[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) |⭐️⭐️ |
- **TP: Comm Compression**
- **SP: TokenRing** [[token-ring]](https://github.com/ACA-Lab-SJTU/token-ring) |⭐️⭐️ |
-
📖Disaggregating Prefill and Decoding ([©️back👆🏻](#paperlist))
-
📖GEMM/Tensor Cores/MMA/Parallel ([©️back👆🏻](#paperlist))
- **TRITONBENCH**
- Tensor Core
- Intra-SM Parallelism
- Microbenchmark
- FP8
- Tensor Cores
- **cutlass/cute**
- QUICK
- Tensor Parallel
- **Tensor Cores**
- **FLASH-ATTENTION RNG**
- **Triton-distributed** [[Triton-distributed]](https://github.com/ByteDance-Seed/Triton-distributed) |⭐️⭐️ |
- **flute**
- **LUT TENSOR CORE**
- **MARLIN** [[marlin]](https://github.com/IST-DASLab/marlin) |⭐️⭐️ |
- **SpMM**
- **TEE**
- **HiFloat8**
- **HADACORE** [[applied-ai]](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/inference/hadamard_transform) |⭐️ |
-
📖Mixture-of-Experts(MoE) LLM Inference ([©️back👆🏻](#paperlist))
- DeepSeek-V2 [[DeepSeek-V2]](https://github.com/deepseek-ai/DeepSeek-V2) | ⭐️⭐️ |
- **Mixtral Offloading** [[mixtral-offloading]](https://github.com/dvmazur/mixtral-offloading) |⭐️⭐️ |
- MoE-Mamba
- MoE Inference
- MoE
- **WINT8/4**
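
As background for the MoE inference entries above, here is a small NumPy sketch of top-k expert routing with softmax gating, the operation whose sparsity these systems exploit when scheduling or offloading experts (toy dimensions, hypothetical names):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k, n_tokens = 16, 4, 2, 5

# Toy parameters: a router and one FFN ("expert") weight matrix per expert.
router_w = rng.normal(size=(d_model, n_experts))
expert_w = rng.normal(size=(n_experts, d_model, d_model))

x = rng.normal(size=(n_tokens, d_model))        # token activations

# 1) Router scores each token against every expert (softmax gate).
logits = x @ router_w                            # (n_tokens, n_experts)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)

# 2) Keep only the top-k experts per token and renormalize their weights.
topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]           # (n_tokens, top_k)
topk_w = np.take_along_axis(probs, topk_idx, axis=-1)
topk_w /= topk_w.sum(axis=-1, keepdims=True)

# 3) Each token's output is the gate-weighted sum of its selected experts.
out = np.zeros_like(x)
for t in range(n_tokens):
    for k in range(top_k):
        e = topk_idx[t, k]
        out[t] += topk_w[t, k] * (x[t] @ expert_w[e])

print(out.shape)  # (5, 16); only 2 of the 4 experts ran for each token
```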
-
📖KV Cache Scheduling/Quantize/Dropping ([©️back👆🏻](#paperlist))
- QK-Sparse/Dropping Attention [[dynamic-sparse-flash-attention]](https://github.com/epfml/dynamic-sparse-flash-attention) |⭐️ |
- LTP
- KV Cache Compress
- H2O
- **GQA**
- **PagedAttention** [[vllm]](https://github.com/vllm-project/vllm) |⭐️⭐️ |
- **KV Cache Prefetch**
- **KVzip** [[KVzip]](https://github.com/snu-mllab/KVzip) |⭐️⭐️|
- KV Cache FP8 + WINT4
- MQA
- **TensorRT-LLM KV Cache FP8** [[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) |⭐️⭐️ |
- **Adaptive KV Cache Compress**
- CacheGen
- KV Cache Compress with LoRA [[Context-Memory]](https://github.com/snu-mllab/Context-Memory) |⭐️⭐️ |
- **DistKV-LLM**
- Prompt Caching
- Less
- MiKV
- **Shared Prefixes**
- **ChunkAttention** [[chunk-attention]](https://github.com/microsoft/chunk-attention) |⭐️⭐️ |
- Keyformer [[keyformer-llm]](https://github.com/d-matrix-ai/keyformer-llm) |⭐️⭐️ |
- FASTDECODE
- Sparsity-Aware KV Caching
- GEAR [[GEAR]](https://github.com/opengear-project/GEAR) |⭐️ |
- Zero-Delay QKV Compression
- **AlignedKV**
- **LayerKV**
- **AdaKV**
- **KV Cache Recomputation**
- **ClusterKV**
- **DynamicKV**
- **CacheCraft**
- **RadixAttention** [[sglang]](https://github.com/sgl-project/sglang) |⭐️⭐️ |
- vAttention
- **Inference-Time Hyper-Scaling**
- QAQ [[QAQ-KVCacheQuantization]](https://github.com/ClubieDong/QAQ-KVCacheQuantization) |⭐️⭐️ |
- DMC
- SqueezeAttention
- SnapKV
- KVCache-1Bit
- KV-Runahead
- ZipCache
- MiniCache
- CacheBlend
- CompressKV
- MemServe
- MLKV [[pythia-mlkv]](https://github.com/zaydzuhri/pythia-mlkv) |⭐️ |
- ThinK
- Palu
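
Many of the scheduling entries above (PagedAttention/vLLM in particular) manage the KV cache as fixed-size blocks drawn from a shared pool, addressed through a per-sequence block table. A toy NumPy sketch of just that bookkeeping, with no real attention kernel and all names invented for illustration:

```python
import numpy as np

BLOCK, HEAD_DIM, N_BLOCKS = 4, 8, 32   # tokens per block, toy sizes

# One physical pool of KV blocks shared by all sequences, plus a free list.
k_pool = np.zeros((N_BLOCKS, BLOCK, HEAD_DIM), dtype=np.float32)
v_pool = np.zeros_like(k_pool)
free_blocks = list(range(N_BLOCKS))
block_table = {}   # seq_id -> list of physical block ids
seq_len = {}       # seq_id -> number of cached tokens


def append_kv(seq_id, k, v):
    """Append one token's K/V to a sequence, allocating a new block on demand."""
    table = block_table.setdefault(seq_id, [])
    n = seq_len.get(seq_id, 0)
    if n % BLOCK == 0:                      # current block full (or none yet)
        table.append(free_blocks.pop())     # grab a free physical block
    blk, off = table[n // BLOCK], n % BLOCK
    k_pool[blk, off], v_pool[blk, off] = k, v
    seq_len[seq_id] = n + 1


def gather_kv(seq_id):
    """Materialize a sequence's logical KV for a (naive) attention call."""
    n, table = seq_len[seq_id], block_table[seq_id]
    ks = np.concatenate([k_pool[b] for b in table])[:n]
    vs = np.concatenate([v_pool[b] for b in table])[:n]
    return ks, vs


rng = np.random.default_rng(0)
for _ in range(6):                          # decode 6 tokens for sequence 0
    append_kv(0, rng.normal(size=HEAD_DIM), rng.normal(size=HEAD_DIM))
print(gather_kv(0)[0].shape, "blocks used:", block_table[0])
```

Because sequences only hold whole blocks, memory fragmentation stays bounded and blocks can be shared or swapped, which is what the scheduling papers above build on.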
-
📖Structured Prune/KD/Weight Sparse ([©️back👆🏻](#paperlist))
- PowerInfer [[PowerInfer]](https://github.com/SJTU-IPADS/PowerInfer) |⭐️ |
- SDMPrune
- **Simba**
- **LASER**
- **Admm Pruning** [[admm-pruning]](https://github.com/fmfi-compbio/admm-pruning) |⭐️ |
- FFSplit
- **FLAP** [[FLAP]](https://github.com/CASIA-IVA-Lab/FLAP) |⭐️⭐️ |
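
For orientation on the pruning entries above, a minimal unstructured magnitude-pruning sketch in NumPy; real methods such as FLAP or ADMM-based pruning use much more careful, often structured, criteria than this global threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))          # toy weight matrix
sparsity = 0.5                         # fraction of weights to remove

# Keep the largest-magnitude weights; zero out the rest.
threshold = np.quantile(np.abs(W), sparsity)
mask = np.abs(W) >= threshold
W_pruned = W * mask

print("kept fraction:", mask.mean())   # ~0.5
x = rng.normal(size=64)
print("relative output change:",
      np.linalg.norm(W @ x - W_pruned @ x) / np.linalg.norm(W @ x))
```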
-
📖CPU/Single GPU/FPGA/NPU/Mobile Inference ([©️back👆🏻](#paperlist))
- OpenVINO
- FlexGen
- LLM CPU Inference [[intel-extension-for-transformers]](https://github.com/intel/intel-extension-for-transformers) |⭐️ |
- LinguaLinked
- FlightLLM
- Transformer-Lite
- **xFasterTransformer**
- Summary
- **FastAttention**
- **NITRO** - lab/nitro) |⭐️ |
-
📖Trending LLM/VLM Topics ([©️back👆🏻](#paperlist))
- Open-Sora Plan [[Open-Sora-Plan]](https://github.com/PKU-YuanGroup/Open-Sora-Plan) | ⭐️⭐️ |
- Open-Sora [[Open-Sora]](https://github.com/hpcaitech/Open-Sora) | ⭐️⭐️ |
- **Mooncake** [[Mooncake]](https://github.com/kvcache-ai/Mooncake) |⭐️⭐️ |
-
📖IO/FLOPs-Aware/Sparse Attention ([©️back👆🏻](#paperlist))
- **Flex Attention** [[attention-gym]](https://github.com/pytorch-labs/attention-gym) | ⭐️⭐️ |
- **SeerAttention**
- **Slim attention** - ai/transformer-tricks)  | ⭐️⭐️⭐️ |
- **FFPA** [[ffpa-attn]](https://github.com/xlite-dev/ffpa-attn) |⭐️⭐️ |
- **SageAttention-3** [[SageAttention]](https://github.com/thu-ml/SageAttention) | ⭐️⭐️ |
- **FlashAttention-3** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- FlashAttention
- Online Softmax
- **FFPA** [[ffpa-attn-mma]](https://github.com/xlite-dev/ffpa-attn-mma) |⭐️⭐️ |
- Hash Attention
- **FlashAttention** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- Online Softmax
- FLOP, I/O
- **FlashAttention-2** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- **Flash-Decoding** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- Flash-Decoding++
- SparseGPT [[sparsegpt]](https://github.com/IST-DASLab/sparsegpt) |⭐️ |
- **GLA**
- SCCA
- **FlashLLM**
- CHAI
- DeFT
- MoA [[MoA]](https://github.com/thu-nics/MoA) | ⭐️ |
- Shared Attention
- **CHESS**
- INT-FLASHATTENTION [[INT-FlashAttention]](https://github.com/INT-FlashAttention2024/INT-FlashAttention) | ⭐️ |
- **SageAttention** [[SageAttention]](https://github.com/thu-ml/SageAttention) | ⭐️⭐️ |
- **SageAttention-2** [[SageAttention]](https://github.com/thu-ml/SageAttention) | ⭐️⭐️ |
- **Squeezed Attention**
- **TurboAttention**
- **SpargeAttention** [[SpargeAttn]](https://github.com/thu-ml/SpargeAttn) | ⭐️⭐️ |
- **MMInference**
- **Sparse Frontier** - frontier)  | ⭐️⭐️ |
- **Parallel Encoding** [[APE]](https://github.com/Infini-AI-Lab/APE) | ⭐️⭐️ |
- **Parallel Encoding** [[Block-attention]](https://github.com/TemporaryLoRA/Block-attention) | ⭐️⭐️ |
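
The FlashAttention and online-softmax entries above rely on one identity: softmax attention can be computed over key/value tiles while carrying a running maximum and normalizer, so the full score matrix is never materialized. A single-query NumPy sketch of that tiled computation (just the math, not a kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_keys, tile = 16, 64, 8
q = rng.normal(size=d)
K = rng.normal(size=(n_keys, d))
V = rng.normal(size=(n_keys, d))
scale = 1.0 / np.sqrt(d)

# Online softmax over tiles: running max m, normalizer l, weighted sum acc.
m, l, acc = -np.inf, 0.0, np.zeros(d)
for start in range(0, n_keys, tile):
    Kt, Vt = K[start:start + tile], V[start:start + tile]
    s = (Kt @ q) * scale                        # scores for this tile
    m_new = max(m, s.max())
    # Rescale the old accumulator to the new max, then add this tile's part.
    correction = np.exp(m - m_new) if np.isfinite(m) else 0.0
    p = np.exp(s - m_new)
    l = l * correction + p.sum()
    acc = acc * correction + p @ Vt
    m = m_new
out_tiled = acc / l

# Reference: ordinary softmax attention over all keys at once.
s_full = (K @ q) * scale
p_full = np.exp(s_full - s_full.max())
p_full /= p_full.sum()
out_ref = p_full @ V
print(np.allclose(out_tiled, out_ref))          # True
```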
-
📖Long Context Attention/KV Cache Optimization ([©️back👆🏻](#paperlist))
- **RingAttention**
- **HOMER**
- **REFORM**
- **Lightning Attention** [[MiniMax-01]](https://github.com/MiniMax-AI/MiniMax-01) | ⭐️⭐️ |
- **StripedAttention** - forall/striped_attention/)  |⭐️⭐️ |
- **YOCO** [[unilm-YOCO]](https://github.com/microsoft/unilm/tree/master/YOCO) |⭐️⭐️ |
- **MInference**
- **Blockwise Attention**
- Landmark Attention [landmark-attention](https://github.com/epfml/landmark-attention/) |⭐️⭐️ |
- **LightningAttention-1**
- **LightningAttention-2** [lightning-attention](https://github.com/OpenNLPLab/lightning-attention) |⭐️⭐️ |
- **HyperAttention** [hyper-attn](https://github.com/insuhan/hyper-attn) |⭐️⭐️ |
- **Streaming Attention**
- **Prompt Cache**
- **KVQuant**
- **RelayAttention**
- Infini-attention
- RAGCache
- **KCache**
- SKVQ
- **CLA**
- LOOK-M [[LOOK-M]](https://github.com/SUSTechBruce/LOOK-M) |⭐️⭐️ |
- **InfiniGen**
- **Quest** [[Quest]](https://github.com/mit-han-lab/Quest) |⭐️⭐️ |
- PQCache
- **SentenceVAE**
- **InstInfer**
- **RetrievalAttention**
- **ShadowKV**
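
Several long-context entries above (StreamingLLM-style streaming attention and related KV-dropping work) keep a few initial "sink" tokens plus a recent window and evict everything in between. A toy sketch of that eviction policy alone, with the actual KV tensors omitted:

```python
from collections import deque

N_SINK, WINDOW = 4, 8          # keep 4 initial tokens + the 8 most recent


class StreamingKVCache:
    """Keeps KV entries for the first N_SINK tokens plus a sliding window."""

    def __init__(self):
        self.sink = []                      # first N_SINK (token_id, kv) pairs
        self.window = deque(maxlen=WINDOW)  # most recent WINDOW pairs

    def append(self, token_id, kv):
        if len(self.sink) < N_SINK:
            self.sink.append((token_id, kv))
        else:
            self.window.append((token_id, kv))  # oldest window entry evicted

    def visible_tokens(self):
        return [t for t, _ in self.sink] + [t for t, _ in self.window]


cache = StreamingKVCache()
for t in range(20):                # stream 20 tokens through the cache
    cache.append(t, kv=None)       # kv tensors omitted in this sketch
print(cache.visible_tokens())      # [0, 1, 2, 3, 12, ..., 19]
```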
-
📖Early-Exit/Intermediate Layer Decoding ([©️back👆🏻](#paperlist))
- **EE-LLM** [[EE-LLM]](https://github.com/pan-x-c/EE-LLM) |⭐️⭐️ |
- DeeBERT
- FastBERT
- BERxiT
- **SkipDecode**
- **LITE**
- **FREE**
- **EE-Tuning** [[EE-Tuning]](https://github.com/pan-x-c/EE-LLM) |⭐️⭐️ |
- Skip Attention
- **KOALA**
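
The early-exit papers above attach classifiers to intermediate layers and stop the forward pass once a prediction is confident enough. A toy NumPy sketch of that control flow with random "layers" and exit heads (purely illustrative, not any listed method):

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, vocab = 6, 16, 10
layers = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(n_layers)]
exit_heads = [rng.normal(size=(d_model, vocab)) / np.sqrt(d_model) for _ in range(n_layers)]


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def early_exit_forward(x, threshold=0.4):
    """Run layers one by one; stop when an intermediate head is confident enough."""
    h = x
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        h = np.tanh(h @ layer)             # toy "transformer layer"
        probs = softmax(h @ head)          # intermediate exit classifier
        if probs.max() >= threshold:       # confident enough: exit early
            return int(probs.argmax()), i + 1
    return int(probs.argmax()), n_layers   # fell through: used every layer


token, layers_used = early_exit_forward(rng.normal(size=d_model))
print(f"predicted token {token} after {layers_used}/{n_layers} layers")
```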
-
📖DeepSeek/Multi-head Latent Attention(MLA) ([©️back👆🏻](#paperlist))
- **DeepSeek-R1** [[DeepSeek-R1]](https://github.com/deepseek-ai/DeepSeek-R1) | ⭐️⭐️ |
- **DeepSeek-NSA**
- **FlashMLA** [[FlashMLA]](https://github.com/deepseek-ai/FlashMLA) |⭐️⭐️ |
- **DualPipe** [[DualPipe]](https://github.com/deepseek-ai/DualPipe) |⭐️⭐️ |
- **DeepEP** [[DeepEP]](https://github.com/deepseek-ai/DeepEP) |⭐️⭐️ |
- **DeepGEMM** [[DeepGEMM]](https://github.com/deepseek-ai/DeepGEMM) |⭐️⭐️ |
- **EPLB** [[EPLB]](https://github.com/deepseek-ai/EPLB) |⭐️⭐️ |
- **3FS** [[3FS]](https://github.com/deepseek-ai/3FS) |⭐️⭐️ |
- **Inference System (推理系统)**
- **MHA2MLA** - Ushio/MHA2MLA)  |⭐️⭐️ |
- **TransMLA**
- **X-EcoMLA**
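
Roughly speaking, Multi-head Latent Attention caches a single low-rank latent per token and reconstructs per-head K/V from it at attention time. The sketch below shows only that cache-size arithmetic in NumPy; it is a simplification (no decoupled RoPE path, toy shapes) and not DeepSeek's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 256, 8, 32, 64   # d_latent << n_heads * d_head

# Down-projection to the cached latent, and up-projections back to K and V.
W_dkv = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_uk = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_uv = rng.normal(size=(d_latent, n_heads * d_head)) / np.sqrt(d_latent)

seq_len = 10
h = rng.normal(size=(seq_len, d_model))     # hidden states of cached tokens

# Only the latent is stored in the KV cache ...
c_kv = h @ W_dkv                            # (seq_len, d_latent)

# ... and per-head K/V are reconstructed from it when attention runs.
K = (c_kv @ W_uk).reshape(seq_len, n_heads, d_head)
V = (c_kv @ W_uv).reshape(seq_len, n_heads, d_head)

mha_cache = seq_len * n_heads * d_head * 2          # standard K+V cache entries
mla_cache = seq_len * d_latent                      # latent-only cache entries
print(K.shape, V.shape, f"cache reduction: {mha_cache / mla_cache:.1f}x")
```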
-
📖Weight/Activation Quantize/Compress ([©️back👆🏻](#paperlist))
- **GuidedQuant** [[GuidedQuant]](https://github.com/snu-mllab/GuidedQuant) |⭐️⭐️ |
- ZeroQuant-V2
- **AWQ** [[llm-awq]](https://github.com/mit-han-lab/llm-awq) |⭐️⭐️ |
- SpQR
- SqueezeLLM
- ZeroQuant-FP
- FP8-LM [[MS-AMP]](https://github.com/Azure/MS-AMP) |⭐️ |
- **ZeroQuant**
- LLM.int8()
- LLM-Shearing [[LLM-Shearing]](https://github.com/princeton-nlp/LLM-Shearing) |⭐️ |
- 1-bit LLMs
- ACTIVATION SPARSITY
- VPTQ
- BitNet
- FP8-Quantization [[FP8-quantization]](https://github.com/Qualcomm-AI-research/FP8-quantization) |⭐️ |
- **GPTQ** [[gptq]](https://github.com/IST-DASLab/gptq) |⭐️⭐️ |
- **SmoothQuant** [[smoothquant]](https://github.com/mit-han-lab/smoothquant) |⭐️⭐️ |
- LLM-FP4 [[LLM-FP4]](https://github.com/nbasyl/LLM-FP4) |⭐️ |
- 2-bit LLM
- **SmoothQuant+**
- OdysseyLLM W4A8
- **SparQ**
- Agile-Quant
- CBQ
- QLLM
- FP6-LLM
- **W4A8KV4** [[qserve]](https://github.com/mit-han-lab/qserve) |⭐️⭐️ |
- SpinQuant
- I-LLM
- OutlierTune
- GPTQT
- ABQ-LLM [[ABQ-LLM]](https://github.com/bytedance/ABQ-LLM) |⭐️ |
- **BitNet v2**
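
The weight-only quantization entries above (WINT8/4, GPTQ, AWQ, ...) all start from the same baseline: round weights to a low-bit grid with a per-channel scale and dequantize on the fly inside the GEMM. A minimal round-to-nearest INT8 sketch in NumPy, with none of the calibration or error compensation those methods add:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 256)).astype(np.float32)    # (out_features, in_features)

# Per-output-channel symmetric scale so that max|w| maps to 127.
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_int8 = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

# Dequantize (real kernels fuse this into the matmul instead).
W_dq = W_int8.astype(np.float32) * scales

x = rng.normal(size=256).astype(np.float32)
y_fp32, y_int8 = W @ x, W_dq @ x
rel_err = np.linalg.norm(y_fp32 - y_int8) / np.linalg.norm(y_fp32)
print(f"weights {W.nbytes / W_int8.nbytes:.0f}x smaller, relative error {rel_err:.4f}")
```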
-
📖Continuous/In-flight Batching ([©️back👆🏻](#paperlist))
- LightSeq
- **BatchLLM**
- **Continuous Batching**
- **In-flight Batching** [[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) |⭐️⭐️ |
- **DeepSpeed-FastGen 2x vLLM?** [[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed) |⭐️⭐️ |
- Splitwise
- SpotServe
- **vTensor** [[glake/GLakeServe]](https://github.com/intelligent-machine-learning/glake/tree/master/GLakeServe) |⭐️⭐️ |
- Automatic Inference Engine Tuning
- **SJF Scheduling**
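
Continuous (in-flight) batching admits new requests into the running batch at every decoding step instead of waiting for the whole batch to drain. A toy scheduler loop showing only that admission/retirement logic, with made-up request lengths and no model:

```python
from collections import deque

MAX_BATCH = 3
# (request_id, number_of_tokens_to_generate) in arrival order.
waiting = deque([("r0", 5), ("r1", 2), ("r2", 4), ("r3", 3), ("r4", 1)])
running = {}     # request_id -> tokens still to generate
step = 0

while waiting or running:
    # Admit new requests whenever a batch slot is free: this is the
    # "continuous" part, no need to wait for the whole batch to finish.
    while waiting and len(running) < MAX_BATCH:
        rid, n = waiting.popleft()
        running[rid] = n
    # One decoding step generates one token for every running request.
    step += 1
    for rid in list(running):
        running[rid] -= 1
        if running[rid] == 0:            # finished: retire immediately,
            del running[rid]             # freeing its slot for the next step
    print(f"step {step:2d}: running={sorted(running)}")

print("total steps:", step)
```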
-
📖VLM/Position Embed/Others ([©️back👆🏻](#paperlist))
-
📖Prompt/Context/KV Compression ([©️back👆🏻](#paperlist))
- **LongLLMLingua**
- **500xCompressor**
- **Eigen Attention**
- **Prompt Compression**
- **Selective-Context** [Selective-Context](https://github.com/liyucheng09/Selective_Context) |⭐️⭐️ |
- **AutoCompressor** [AutoCompressors](https://github.com/princeton-nlp/AutoCompressors) |⭐️ |
- **LLMLingua**
- **Context Distillation**
- **CRITIPREFILL**
- **KV-COMPRESS** [vllm-kvcompress](https://github.com/IsaacRe/vllm-kvcompress) |⭐️⭐️ |
- **LORC**
-
📖Non Transformer Architecture ([©️back👆🏻](#paperlist))
- **FLA** [[flash-linear-attention]](https://github.com/sustcsonglin/flash-linear-attention) |⭐️⭐️ |
- **RWKV** [[RWKV-LM]](https://github.com/BlinkDL/RWKV-LM) |⭐️⭐️ |
- **Mamba** [[mamba]](https://github.com/state-spaces/mamba) |⭐️⭐️ |
- **RWKV-CLIP** [[RWKV-CLIP]](https://github.com/deepglint/RWKV-CLIP) |⭐️⭐️ |
- Kraken
-
📖LLM Algorithmic/Eval Survey ([©️back👆🏻](#paperlist))
- Evaluating [[Awesome-LLMs-Evaluation]](https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers) |⭐️ |
- **Runtime Performance**
- ChatGPT Anniversary
- Algorithmic Survey
- Security and Privacy
- **LLMCompass**
- **Efficient LLMs** [[Efficient-LLMs-Survey]](https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey) |⭐️⭐️ |
- **Serving Survey**
- Understanding LLMs
- LLM-Viewer [[LLM-Viewer]](https://github.com/hahnyuan/LLM-Viewer) |⭐️⭐️ |
- **Internal Consistency & Self-Feedback** [[ICSFSurvey]](https://github.com/IAAR-Shanghai/ICSFSurvey) | ⭐️⭐️ |
- **Low-bit**
- **LLM Inference**
-
-
📖 News 🔥🔥
- 2025-08-18: [cache-dit](https://github.com/vipshop/cache-dit), a Training-free Cache Acceleration Toolbox for DiTs: Cache Acceleration with One-line Code ~ ♥️. Feel free to give it a try!
- 2025-07-13: [flux-fast](https://github.com/huggingface/flux-fast) **made even faster** with **[cache-dit](https://github.com/vipshop/cache-dit)**: **3.3x** speedup on NVIDIA L20 while still maintaining **high precision**.