awesome-llm-inference
📚A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc.
https://github.com/deftruth/awesome-llm-inference
-
🎉Awesome LLM Inference Papers with Codes
-
📖Contents
-
📖Weight/Activation Quantize/Compress ([©️back👆🏻](#paperlist))
- **BitNet v2**
- **GPTQ** [[IST-DASLab/gptq]](https://github.com/IST-DASLab/gptq) |⭐️⭐️ |
- **SmoothQuant** [[mit-han-lab/smoothquant]](https://github.com/mit-han-lab/smoothquant) |⭐️⭐️ |
- ZeroQuant-V2
- **AWQ** [[mit-han-lab/llm-awq]](https://github.com/mit-han-lab/llm-awq) |⭐️⭐️ |
- SpQR
- SqueezeLLM
- ZeroQuant-FP
- FP8-LM [[Azure/MS-AMP]](https://github.com/Azure/MS-AMP) |⭐️ |
- **ZeroQuant**
- FP8-Quantization [[Qualcomm-AI-research/FP8-quantization]](https://github.com/Qualcomm-AI-research/FP8-quantization) |⭐️ |
- LLM.int8()
- LLM-Shearing [[princeton-nlp/LLM-Shearing]](https://github.com/princeton-nlp/LLM-Shearing) |⭐️ |
- LLM-FP4 [[nbasyl/LLM-FP4]](https://github.com/nbasyl/LLM-FP4) |⭐️ |
- 2-bit LLM
- **SmoothQuant+**
- OdysseyLLM W4A8
- **SparQ**
- Agile-Quant
- CBQ
- QLLM
- FP6-LLM
- **W4A8KV4** [[mit-han-lab/qserve]](https://github.com/mit-han-lab/qserve) |⭐️⭐️ |
- SpinQuant
- I-LLM
- OutlierTune
- GPTQT
- ABQ-LLM [[bytedance/ABQ-LLM]](https://github.com/bytedance/ABQ-LLM) |⭐️ |
- 1-bit LLMs
- ACTIVATION SPARSITY
- VPTQ
- BitNet
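
Many of the entries above refine plain weight-only INT8 (the WINT8 baseline named in the repo description). Below is a minimal, illustrative sketch of that baseline only: symmetric per-channel absmax quantization with dequant-on-the-fly matmul. It is not the method of any specific paper listed; the function names and shapes are made up, and real kernels add calibration, group-wise scales, and fused dequant-GEMM.

```python
# Minimal sketch of per-channel weight-only INT8 (WINT8-style) quantization.
# Illustrative only: real methods (GPTQ, AWQ, SmoothQuant, ...) add calibration
# and fused kernels; names here are hypothetical.
import torch

def quantize_weight_int8(w: torch.Tensor):
    """Symmetric per-output-channel absmax quantization of a [out, in] weight."""
    scale = w.abs().amax(dim=1, keepdim=True) / 127.0          # one scale per row
    q = torch.clamp((w / scale).round(), -128, 127).to(torch.int8)
    return q, scale

def dequant_matmul(x: torch.Tensor, q: torch.Tensor, scale: torch.Tensor):
    """y = x @ W^T, dequantizing on the fly (a fused kernel would avoid the fp copy)."""
    return x @ (q.float() * scale).t()

w = torch.randn(4096, 4096)
x = torch.randn(1, 4096)
q, s = quantize_weight_int8(w)
print((dequant_matmul(x, q, s) - x @ w.t()).abs().max())        # small quantization error
```
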
-
📖IO/FLOPs-Aware/Sparse Attention ([©️back👆🏻](#paperlist))
- **Sparse Frontier** - frontier)  | ⭐️⭐️ |
- **Flex Attention** [[pytorch-labs/attention-gym]](https://github.com/pytorch-labs/attention-gym)  | ⭐️⭐️ |
- **SpargeAttention** [[thu-ml/SpargeAttn]](https://github.com/thu-ml/SpargeAttn)  | ⭐️⭐️ |
- **FlashAttention-3** [[Dao-AILab/flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- **FlashAttention** [[Dao-AILab/flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- Online Softmax (see the tiled-softmax sketch after this list)
- FlashAttention
- FLOP, I/O
- **FlashAttention-2** [[Dao-AILab/flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- **Flash-Decoding** [[Dao-AILab/flash-attention]](https://github.com/Dao-AILab/flash-attention) |⭐️⭐️ |
- Flash-Decoding++
- SparseGPT [[IST-DASLab/sparsegpt]](https://github.com/IST-DASLab/sparsegpt)  |⭐️ |
- **GLA**
- SCCA
- **FlashLLM**
- Online Softmax
- Hash Attention
- CHAI
- DeFT
- MoA [[thu-nics/MoA]](https://github.com/thu-nics/MoA)  | ⭐️ |
- Shared Attention
- **CHESS**
- INT-FLASHATTENTION [[INT-FlashAttention2024/INT-FlashAttention]](https://github.com/INT-FlashAttention2024/INT-FlashAttention)  | ⭐️ |
- **SageAttention** [[thu-ml/SageAttention]](https://github.com/thu-ml/SageAttention)  | ⭐️⭐️ |
- **SageAttention-2** [[thu-ml/SageAttention]](https://github.com/thu-ml/SageAttention)  | ⭐️⭐️ |
- **Squeezed Attention**
- **TurboAttention**
- **FFPA** [[xlite-dev/ffpa-attn-mma]](https://github.com/xlite-dev/ffpa-attn-mma) |⭐️⭐️ |
- **MMInference**
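
As referenced at the Online Softmax entry above, here is a minimal sketch of the streaming-softmax recurrence that FlashAttention-style kernels tile over: keys/values are visited block by block while a running max, normalizer, and value accumulator are maintained, so the full score matrix is never materialized. A single query vector, no masking or dropout; block size and names are illustrative, not any listed kernel's implementation.

```python
# Online-softmax / tiled attention sketch (single query, for illustration).
import torch

def tiled_attention(q, K, V, block=128):
    d = q.shape[-1]
    m = torch.tensor(float("-inf"))   # running max of scores
    l = torch.tensor(0.0)             # running softmax normalizer
    acc = torch.zeros(V.shape[-1])    # running sum of softmax(scores) @ V
    for s in range(0, K.shape[0], block):
        k_blk, v_blk = K[s:s+block], V[s:s+block]
        scores = (k_blk @ q) / d**0.5
        m_new = torch.maximum(m, scores.max())
        alpha = torch.exp(m - m_new)              # rescale old statistics
        p = torch.exp(scores - m_new)
        l = l * alpha + p.sum()
        acc = acc * alpha + p @ v_blk
        m = m_new
    return acc / l

q, K, V = torch.randn(64), torch.randn(1000, 64), torch.randn(1000, 64)
ref = torch.softmax((K @ q) / 64**0.5, dim=0) @ V
print((tiled_attention(q, K, V) - ref).abs().max())   # matches the naive result
```
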
-
📖Multi-GPUs/Multi-Nodes Parallelism ([©️back👆🏻](#paperlist))
- **SP: Star-Attention, 11x~ speedup** [[NVIDIA/Star-Attention]](https://github.com/NVIDIA/Star-Attention) |⭐️⭐️ |
- **MP: ZeRO**
- **SP: Megatron-LM** [[NVIDIA/Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) |⭐️⭐️ |
- **FSDP 1/2**
- **SP: BPT**
- **SP: DEEPSPEED ULYSSES**
- **CP: Megatron-LM** [[NVIDIA/Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) |⭐️⭐️ |
- **CP: Meta**
- **TP: Comm Compression**
- **SP: TokenRing** [[ACA-Lab-SJTU/token-ring]](https://github.com/ACA-Lab-SJTU/token-ring) |⭐️⭐️ |
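
To make the TP/SP entries above concrete, here is a minimal single-process sketch of Megatron-style tensor parallelism for one MLP: the first weight is split by columns, the second by rows, and summing the per-"rank" partial outputs stands in for the all-reduce. There is no real communication here; the rank count, shapes, and names are illustrative, not Megatron-LM's implementation.

```python
# Tensor-parallel MLP sketch: column-parallel then row-parallel linear layers.
import torch

def tp_mlp(x, W1, W2, ranks=2):
    cols = torch.chunk(W1, ranks, dim=1)       # column-parallel first GEMM
    rows = torch.chunk(W2, ranks, dim=0)       # row-parallel second GEMM
    partial = [torch.relu(x @ c) @ r for c, r in zip(cols, rows)]
    return sum(partial)                        # stands in for the all-reduce

x = torch.randn(4, 512)
W1, W2 = torch.randn(512, 2048), torch.randn(2048, 512)
ref = torch.relu(x @ W1) @ W2
print((tp_mlp(x, W1, W2) - ref).abs().max())   # matches the unsharded MLP
```
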
-
📖Disaggregating Prefill and Decoding ([©️back👆🏻](#paperlist))
-
📖LLM Train/Inference Framework/Design ([©️back👆🏻](#paperlist))
- **Megatron-LM** [[NVIDIA/Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) |⭐️⭐️ |
- **Mooncake** [[kvcache-ai/Mooncake]](https://github.com/kvcache-ai/Mooncake) |⭐️⭐️ |
- prima.cpp
- SpecInfer
- FastServe
- StreamingLLM [[mit-han-lab/streaming-llm]](https://github.com/mit-han-lab/streaming-llm) |⭐️ |
- Medusa
- **TensorRT-LLM** [[NVIDIA/TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM)  |⭐️⭐️ |
- **DeepSpeed-FastGen 2x vLLM?** [[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed)  |⭐️⭐️ |
- **PETALS** [[bigscience-workshop/petals]](https://github.com/bigscience-workshop/petals) |⭐️⭐️ |
- inferflow
- **LMDeploy**
- **MLC-LLM** [[mlc-ai/mlc-llm]](https://github.com/mlc-ai/mlc-llm) |⭐️⭐️ |
- **LightLLM**
- **llama.cpp**
- **flashinfer** [[flashinfer-ai/flashinfer]](https://github.com/flashinfer-ai/flashinfer) |⭐️⭐️ |
- DynamoLLM
- NanoFlow
- **Decentralized LLM**
- **SparseInfer**
-
📖GEMM/Tensor Cores/MMA/Parallel ([©️back👆🏻](#paperlist))
- **TRITONBENCH**
- Tensor Core
- Intra-SM Parallelism
- Microbenchmark
- FP8
- Tensor Cores
- **cutlass/cute**
- QUICK
- Tensor Parallel
- **flute**
- **LUT TENSOR CORE**
- **MARLIN** [[IST-DASLab/marlin]](https://github.com/IST-DASLab/marlin) |⭐️⭐️ |
- **SpMM**
- **TEE**
- **HiFloat8**
- **Tensor Cores**
- **HADACORE** [[pytorch-labs/applied-ai]](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/inference/hadamard_transform) |⭐️ |
- **FLASH-ATTENTION RNG**
- **Triton-distributed** [[ByteDance-Seed/Triton-distributed]](https://github.com/ByteDance-Seed/Triton-distributed) |⭐️⭐️ |
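
The kernel work above (CUTLASS/CuTe, MARLIN, Triton kernels, tensor-core MMA) is organized around blocked/tiled GEMM. Below is a minimal Python sketch of the tiling loop structure only; real kernels map these tiles onto warps and tensor cores and pipeline the loads. The tile size and names are illustrative.

```python
# Blocked/tiled GEMM sketch: C is computed tile by tile, accumulating over K blocks.
import torch

def tiled_gemm(A, B, tile=64):
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = torch.zeros(M, N)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            acc = torch.zeros(min(tile, M - i), min(tile, N - j))
            for k in range(0, K, tile):          # accumulate over the K dimension
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

A, B = torch.randn(128, 256), torch.randn(256, 192)
print((tiled_gemm(A, B) - A @ B).abs().max())    # matches torch.matmul
```
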
-
📖Trending LLM/VLM Topics ([©️back👆🏻](#paperlist))
- Open-Sora Plan [[PKU-YuanGroup/Open-Sora-Plan]](https://github.com/PKU-YuanGroup/Open-Sora-Plan) | ⭐️⭐️ |
- Open-Sora [[hpcaitech/Open-Sora]](https://github.com/hpcaitech/Open-Sora) | ⭐️⭐️ |
- **Mooncake** [[kvcache-ai/Mooncake]](https://github.com/kvcache-ai/Mooncake) |⭐️⭐️ |
- **microsoft**
- **OpenMachine.ai** [[openmachine-ai/transformer-tricks]](https://github.com/openmachine-ai/transformer-tricks)  | ⭐️⭐️⭐️ |
-
📖DeepSeek/Multi-head Latent Attention(MLA) ([©️back👆🏻](#paperlist))
- **DeepSeek-R1** [[deepseek-ai/DeepSeek-R1]](https://github.com/deepseek-ai/DeepSeek-R1)  | ⭐️⭐️ |
- **DeepSeek-V3** [[deepseek-ai/DeepSeek-V3]](https://github.com/deepseek-ai/DeepSeek-V3)  | ⭐️⭐️ |
- **DeepSeek-NSA**
- **FlashMLA** [[deepseek-ai/FlashMLA]](https://github.com/deepseek-ai/FlashMLA) |⭐️⭐️ |
- **DualPipe** [[deepseek-ai/DualPipe]](https://github.com/deepseek-ai/DualPipe) |⭐️⭐️ |
- **DeepEP** [[deepseek-ai/DeepEP]](https://github.com/deepseek-ai/DeepEP) |⭐️⭐️ |
- **DeepGEMM** [[deepseek-ai/DeepGEMM]](https://github.com/deepseek-ai/DeepGEMM) |⭐️⭐️ |
- **EPLB** [[deepseek-ai/EPLB]](https://github.com/deepseek-ai/EPLB) |⭐️⭐️ |
- **3FS** [[deepseek-ai/3FS]](https://github.com/deepseek-ai/3FS) |⭐️⭐️ |
- **Inference Systems (推理系统)**
- **MHA2MLA** [[JT-Ushio/MHA2MLA]](https://github.com/JT-Ushio/MHA2MLA)  |⭐️⭐️ |
- **TransMLA**
- **X-EcoMLA**
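
A minimal sketch of the Multi-head Latent Attention (MLA) caching idea that the DeepSeek entries above build on: instead of per-head keys and values, the cache stores a small per-token latent and up-projects it to K/V at attention time. This is a single-head toy with made-up dimensions and weight names; DeepSeek's actual MLA adds decoupled RoPE keys, multi-head splitting, and normalization, and FlashMLA fuses it into a kernel.

```python
# MLA-style compressed KV caching sketch (single head, illustrative dimensions).
import torch

d_model, d_latent, d_head = 1024, 64, 128
W_dkv = torch.randn(d_model, d_latent) * 0.02   # down-projection (the cached side)
W_uk  = torch.randn(d_latent, d_head) * 0.02    # up-projection to keys
W_uv  = torch.randn(d_latent, d_head) * 0.02    # up-projection to values
W_q   = torch.randn(d_model, d_head) * 0.02

def mla_step(h_t, latent_cache):
    """Append one token's latent and attend over the compressed cache."""
    latent_cache.append(h_t @ W_dkv)            # only d_latent floats cached per token
    C = torch.stack(latent_cache)               # [t, d_latent]
    q = h_t @ W_q                               # [d_head]
    K, V = C @ W_uk, C @ W_uv                   # keys/values reconstructed on the fly
    attn = torch.softmax((K @ q) / d_head**0.5, dim=0)
    return attn @ V

cache = []
for h_t in torch.randn(5, d_model):             # 5 decode steps
    out = mla_step(h_t, cache)
print(out.shape, len(cache))                    # torch.Size([128]) 5
```
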
-
📖Long Context Attention/KV Cache Optimization ([©️back👆🏻](#paperlist))
- **RingAttention**
- **Lightning Attention** [[MiniMax-AI/MiniMax-01]](https://github.com/MiniMax-AI/MiniMax-01)  | ⭐️⭐️ |
- **Blockwise Attention**
- Landmark Attention [[epfml/landmark-attention]](https://github.com/epfml/landmark-attention/) |⭐️⭐️ |
- **LightningAttention-1**
- **LightningAttention-2** [[OpenNLPLab/lightning-attention]](https://github.com/OpenNLPLab/lightning-attention) |⭐️⭐️ |
- **HyperAttention** [[insuhan/hyper-attn]](https://github.com/insuhan/hyper-attn) |⭐️⭐️ |
- **Streaming Attention** (see the sliding-window + attention-sink cache sketch after this list)
- **Prompt Cache**
- **KVQuant**
- **RelayAttention**
- Infini-attention
- RAGCache
- **KCache**
- SKVQ
- **CLA**
- LOOK-M [[SUSTechBruce/LOOK-M]](https://github.com/SUSTechBruce/LOOK-M)  |⭐️⭐️ |
- **InfiniGen**
- **Quest** [[mit-han-lab/Quest]](https://github.com/mit-han-lab/Quest)  |⭐️⭐️ |
- PQCache
- **SentenceVAE**
- **InstInfer**
- **RetrievalAttention**
- **ShadowKV**
- **StripedAttention** [[exists-forall/striped_attention]](https://github.com/exists-forall/striped_attention/)  |⭐️⭐️ |
- **YOCO** [[unilm-YOCO]](https://github.com/microsoft/unilm/tree/master/YOCO)  |⭐️⭐️ |
- **MInference**
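
The Streaming Attention entry above points here: a minimal sketch of a bounded KV cache that keeps a few initial "attention sink" tokens plus a recent window and evicts the middle, in the spirit of StreamingLLM-style streaming attention. The class name, window sizes, and eviction rule are illustrative, not any listed system's implementation.

```python
# Sliding-window KV cache with attention sinks (illustrative sketch).
import torch

class SinkWindowKVCache:
    def __init__(self, n_sink=4, window=1024):
        self.n_sink, self.window = n_sink, window
        self.k, self.v = [], []                  # per-token key/value tensors

    def append(self, k_t, v_t):
        self.k.append(k_t)
        self.v.append(v_t)
        limit = self.n_sink + self.window
        if len(self.k) > limit:                  # evict the oldest non-sink entry
            del self.k[self.n_sink], self.v[self.n_sink]

    def tensors(self):
        return torch.stack(self.k), torch.stack(self.v)

cache = SinkWindowKVCache(n_sink=4, window=8)
for t in range(20):
    cache.append(torch.randn(64), torch.randn(64))
K, V = cache.tensors()
print(K.shape)                                    # torch.Size([12, 64]): 4 sinks + 8 recent
```
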
-
📖KV Cache Scheduling/Quantize/Dropping ([©️back👆🏻](#paperlist))
- LTP
- KV Cache Compress
- H2O
- Chunked Prefills
- **PagedAttention** [[vllm-project/vllm]](https://github.com/vllm-project/vllm) |⭐️⭐️ | (see the block-table sketch after this list)
- **KV Cache Prefetch**
- MQA
- **TensorRT-LLM KV Cache FP8** [[NVIDIA/TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM)  |⭐️⭐️ |
- **Adaptive KV Cache Compress**
- CacheGen
- KV Cache Compress with LoRA [[snu-mllab/Context-Memory]](https://github.com/snu-mllab/Context-Memory)  |⭐️⭐️ |
- **DistKV-LLM**
- Prompt Caching
- Less
- MiKV
- **Shared Prefixes**
- **ChunkAttention** [[microsoft/chunk-attention]](https://github.com/microsoft/chunk-attention)  |⭐️⭐️ |
- QAQ [[ClubieDong/QAQ-KVCacheQuantization]](https://github.com/ClubieDong/QAQ-KVCacheQuantization)  |⭐️⭐️ |
- DMC
- Keyformer [[d-matrix-ai/keyformer-llm]](https://github.com/d-matrix-ai/keyformer-llm) |⭐️⭐️ |
- FASTDECODE
- Sparsity-Aware KV Caching
- GEAR - project/GEAR) |⭐️ |
- SqueezeAttention
- SnapKV
- KVCache-1Bit
- KV-Runahead
- ZipCache
- MiniCache
- CacheBlend
- CompressKV
- MemServe
- MLKV [[zaydzuhri/pythia-mlkv]](https://github.com/zaydzuhri/pythia-mlkv) |⭐️ |
- ThinK
- Palu
- Zero-Delay QKV Compression
- **AlignedKV**
- **LayerKV**
- **AdaKV**
- **KV Cache Recomputation**
- **ClusterKV**
- **DynamicKV**
- **CacheCraft**
- **GQA**
- QK-Sparse/Dropping Attention [[epfml/dynamic-sparse-flash-attention]](https://github.com/epfml/dynamic-sparse-flash-attention) |⭐️ |
- KV Cache FP8 + WINT4
- **RadixAttention** [[sgl-project/sglang]](https://github.com/sgl-project/sglang)  |⭐️⭐️ |
- vAttention
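
As referenced at the PagedAttention entry above, here is a minimal sketch of paged KV caching: KV entries live in fixed-size physical blocks drawn from a shared pool, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand rather than as one contiguous max-length buffer. The data structures and names are illustrative, not vLLM's actual implementation.

```python
# Paged KV cache with a per-sequence block table (illustrative sketch).
import torch

BLOCK, D = 16, 64
pool_k = torch.zeros(256, BLOCK, D)      # shared physical KV pool (256 blocks)
pool_v = torch.zeros(256, BLOCK, D)
free_blocks = list(range(256))

class Sequence:
    def __init__(self):
        self.block_table, self.length = [], 0

    def append(self, k_t, v_t):
        if self.length % BLOCK == 0:                   # need a new physical block
            self.block_table.append(free_blocks.pop())
        blk = self.block_table[self.length // BLOCK]
        pool_k[blk, self.length % BLOCK] = k_t
        pool_v[blk, self.length % BLOCK] = v_t
        self.length += 1

    def gather_kv(self):
        """Reassemble this sequence's logical KV from its physical blocks."""
        K = pool_k[self.block_table].reshape(-1, D)[: self.length]
        V = pool_v[self.block_table].reshape(-1, D)[: self.length]
        return K, V

seq = Sequence()
for _ in range(40):                                    # 40 tokens -> 3 blocks of 16
    seq.append(torch.randn(D), torch.randn(D))
K, V = seq.gather_kv()
print(len(seq.block_table), K.shape)                   # 3 torch.Size([40, 64])
```
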
-
📖Early-Exit/Intermediate Layer Decoding ([©️back👆🏻](#paperlist))
- **LITE**
- **EE-LLM** [[pan-x-c/EE-LLM]](https://github.com/pan-x-c/EE-LLM)  |⭐️⭐️ |
- **FREE**
- **EE-Tuning** [[pan-x-c/EE-Tuning]](https://github.com/pan-x-c/EE-LLM)  |⭐️⭐️ |
- Skip Attention
- **KOALA**
- DeeBERT
- FastBERT
- BERxiT
- **SkipDecode**
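
A minimal sketch of the early-exit idea this section catalogs: after each transformer layer, an exit head scores the intermediate hidden state, and the forward pass stops traversing layers once the prediction is confident enough. This is a toy model with random weights; the threshold, shared exit head, and names are illustrative, not any listed method's design.

```python
# Early-exit decoding sketch: exit at the first layer whose prediction is confident.
import torch

class ToyEarlyExitLM(torch.nn.Module):
    def __init__(self, d=256, vocab=1000, n_layers=8):
        super().__init__()
        self.layers = torch.nn.ModuleList(
            [torch.nn.Linear(d, d) for _ in range(n_layers)])
        self.lm_head = torch.nn.Linear(d, vocab)      # shared exit head

    def forward(self, h, threshold=0.9):
        for i, layer in enumerate(self.layers):
            h = torch.relu(layer(h))
            probs = torch.softmax(self.lm_head(h), dim=-1)
            conf, token = probs.max(dim=-1)
            if conf.item() >= threshold:              # confident enough: exit early
                return token, i + 1
        return token, len(self.layers)                # fell through all layers

model = ToyEarlyExitLM()
token, layers_used = model(torch.randn(256))
print(int(token), layers_used)
```
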
-
📖Parallel Decoding/Sampling ([©️back👆🏻](#paperlist))
- **Parallel Decoding**
- **Speculative Sampling**
- **Speculative Sampling** (accept/reject sketch after this list)
- **Medusa**
- **OSD**
- **Cascade Speculative**
- LookaheadDecoding [[hao-ai-lab/LookaheadDecoding]](https://github.com/hao-ai-lab/LookaheadDecoding)  |⭐️⭐️ |
- Multi-Token Speculative Decoding
- **Speculative Decoding** [[uw-mad-dash/decoding-speculative-decoding]](https://github.com/uw-mad-dash/decoding-speculative-decoding)  |⭐️|
- **TriForce** [[Infini-AI-Lab/TriForce]](https://github.com/Infini-AI-Lab/TriForce) |⭐️⭐️ |
- **Hidden Transfer**
- Instructive Decoding [[joonkeekim/Instructive-Decoding]](https://github.com/joonkeekim/Instructive-Decoding) |⭐️ |
- S3D
- Token Recycling
- **Speculative Decoding** [[smart-lty/ParallelSpeculativeDecoding]](https://github.com/smart-lty/ParallelSpeculativeDecoding)  |⭐️⭐️ |
- **FocusLLM**
- **MagicDec** [[Infini-AI-Lab/MagicDec]](https://github.com/Infini-AI-Lab/MagicDec/) |⭐️ |
- **Speculative Decoding**
- **Hybrid Inference**
- **PARALLELSPEC**
- **Fast Best-of-N**
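
The accept/reject rule behind the Speculative Sampling entries above, in minimal form: a cheap draft distribution proposes tokens, the target distribution verifies each one, and a rejected position is resampled from the residual so the output still follows the target distribution. The real method drafts k tokens and verifies them in a single target forward pass (and samples a bonus token on full acceptance); this sketch only shows the per-token accept/reject math, with toy fixed distributions standing in for the draft and target models.

```python
# Speculative sampling accept/reject sketch (toy distributions, illustrative).
import torch

def speculative_step(p_target, q_draft, k=4):
    """Propose up to k tokens from q, accept/reject against p. Returns accepted tokens."""
    out = []
    for _ in range(k):
        x = torch.multinomial(q_draft, 1).item()              # draft proposal
        if torch.rand(()) < min(1.0, (p_target[x] / q_draft[x]).item()):
            out.append(x)                                      # accepted
        else:
            residual = torch.clamp(p_target - q_draft, min=0)  # resample from residual
            residual /= residual.sum()
            out.append(torch.multinomial(residual, 1).item())
            break                                              # stop at first rejection
    return out

vocab = 8
p = torch.softmax(torch.randn(vocab), dim=0)    # stand-in for the target model
q = torch.softmax(torch.randn(vocab), dim=0)    # stand-in for the draft model
print(speculative_step(p, q))
```
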
-
📖Continuous/In-flight Batching ([©️back👆🏻](#paperlist))
- LightSeq
- **BatchLLM**
- **Continuous Batching** (see the iteration-level scheduling sketch after this list)
- **In-flight Batching** [[NVIDIA/TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM)  |⭐️⭐️ |
- **DeepSpeed-FastGen 2x vLLM?** [[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed)  |⭐️⭐️ |
- Splitwise
- SpotServe
- **vTensor** [[intelligent-machine-learning/glake]](https://github.com/intelligent-machine-learning/glake/tree/master/GLakeServe) |⭐️⭐️ |
- Automatic Inference Engine Tuning
- **SJF Scheduling**
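
As referenced at the Continuous Batching entry above, the core idea is iteration-level scheduling: the batch is re-formed at every decode step, so finished requests leave immediately and queued requests join without waiting for the whole batch to drain. The sketch below simulates only the scheduling loop; request lengths and names are illustrative.

```python
# Continuous (iteration-level) batching scheduler sketch.
import random
from collections import deque

def continuous_batching(requests, max_batch=4):
    waiting = deque(requests)             # (request_id, tokens_to_generate)
    running, step = [], 0
    while waiting or running:
        while waiting and len(running) < max_batch:       # admit new requests
            running.append(list(waiting.popleft()))
        step += 1                                          # one fused decode step
        for req in running:
            req[1] -= 1                                    # every request emits a token
        done = [r[0] for r in running if r[1] == 0]
        running = [r for r in running if r[1] > 0]         # retire finished requests
        if done:
            print(f"step {step}: finished {done}, running {len(running)}")
    return step

random.seed(0)
reqs = [(i, random.randint(2, 8)) for i in range(6)]
print("total decode steps:", continuous_batching(reqs))
```
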
-
📖Structured Prune/KD/Weight Sparse ([©️back👆🏻](#paperlist))
- PowerInfer [[SJTU-IPADS/PowerInfer]](https://github.com/SJTU-IPADS/PowerInfer) |⭐️ |
- **FLAP** [[CASIA-IVA-Lab/FLAP]](https://github.com/CASIA-IVA-Lab/FLAP) |⭐️⭐️ |
- **LASER**
- **Admm Pruning** [[fmfi-compbio/admm-pruning]](https://github.com/fmfi-compbio/admm-pruning) |⭐️ |
- FFSplit
-
📖Mixture-of-Experts(MoE) LLM Inference ([©️back👆🏻](#paperlist))
- DeepSeek-V2 [[deepseek-ai/DeepSeek-V2]](https://github.com/deepseek-ai/DeepSeek-V2) | ⭐️⭐️ |
- **WINT8/4**
- **Mixtral Offloading** [[dvmazur/mixtral-offloading]](https://github.com/dvmazur/mixtral-offloading) |⭐️⭐️ |
- MoE-Mamba
- MoE Inference
- MoE
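
A minimal sketch of the top-k expert routing behind the MoE entries above: a gating layer scores the experts for each token, only the top-k experts run, and their outputs are combined with the normalized gate weights. The expert count, dimensions, and names are illustrative; real systems dispatch tokens to experts in parallel (expert parallelism) rather than looping as below.

```python
# Top-k MoE routing sketch (dense loop for clarity, illustrative sizes).
import torch

d, n_experts, top_k = 64, 8, 2
gate = torch.nn.Linear(d, n_experts)
experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(n_experts)])

def moe_forward(x):                        # x: [tokens, d]
    scores = gate(x)                       # [tokens, n_experts]
    weights, idx = scores.topk(top_k, dim=-1)
    weights = torch.softmax(weights, dim=-1)
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):            # per-token dispatch, written as a loop
        for j in range(top_k):
            e = idx[t, j].item()
            out[t] += weights[t, j] * experts[e](x[t])
    return out

print(moe_forward(torch.randn(5, d)).shape)   # torch.Size([5, 64])
```
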
-
📖CPU/Single GPU/FPGA/NPU/Mobile Inference ([©️back👆🏻](#paperlist))
- FlexGen
- OpenVINO
- LLM CPU Inference [[intel/intel-extension-for-transformers]](https://github.com/intel/intel-extension-for-transformers)  |⭐️ |
- LinguaLinked
- FlightLLM
- Transformer-Lite
- **xFasterTransformer**
- Summary
- **FastAttention**
- **NITRO** - lab/nitro) |⭐️ |
-
📖VLM/Position Embed/Others ([©️back👆🏻](#paperlist))
-
📖LLM Algorithmic/Eval Survey ([©️back👆🏻](#paperlist))
- Evaluating [[tjunlp-lab/Awesome-LLMs-Evaluation-Papers]](https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers)  |⭐️ |
- **Runtime Performance**
- ChatGPT Anniversary
- Algorithmic Survey
- Security and Privacy
- **LLMCompass**
- **Efficient LLMs** [[AIoT-MLSys-Lab/Efficient-LLMs-Survey]](https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey)  |⭐️⭐️ |
- **Serving Survey**
- Understanding LLMs
- LLM-Viewer [[hahnyuan/LLM-Viewer]](https://github.com/hahnyuan/LLM-Viewer)  |⭐️⭐️ |
- **Internal Consistency & Self-Feedback** [[IAAR-Shanghai/ICSFSurvey]](https://github.com/IAAR-Shanghai/ICSFSurvey)  | ⭐️⭐️ |
- **Low-bit**
- **LLM Inference**
-
📖Prompt/Context/KV Compression ([©️back👆🏻](#paperlist))
- **Selective-Context** [[liyucheng09/Selective_Context]](https://github.com/liyucheng09/Selective_Context) |⭐️⭐️ |
- **AutoCompressor** [[princeton-nlp/AutoCompressors]](https://github.com/princeton-nlp/AutoCompressors) |⭐️ |
- **LLMLingua**
- **LongLLMLingua**
- **LLMLingua-2**
- **500xCompressor**
- **Eigen Attention**
- **Prompt Compression**
- **Context Distillation**
- **CRITIPREFILL**
- **KV-COMPRESS** [[IsaacRe/vllm-kvcompress]](https://github.com/IsaacRe/vllm-kvcompress) |⭐️⭐️ |
- **LORC**
-
📖Non Transformer Architecture ([©️back👆🏻](#paperlist))
- **RWKV** [[BlinkDL/RWKV-LM]](https://github.com/BlinkDL/RWKV-LM) |⭐️⭐️ |
- **Mamba** [[state-spaces/mamba]](https://github.com/state-spaces/mamba) |⭐️⭐️ |
- **RWKV-CLIP** [[deepglint/RWKV-CLIP]](https://github.com/deepglint/RWKV-CLIP) |⭐️⭐️ |
- Kraken
- **FLA** [[sustcsonglin/flash-linear-attention]](https://github.com/sustcsonglin/flash-linear-attention) |⭐️⭐️ |
-
Sub Categories

| Category | Entries |
|---|---|
| 📖KV Cache Scheduling/Quantize/Dropping | 48 |
| 📖Weight/Activation Quantize/Compress | 32 |
| 📖IO/FLOPs-Aware/Sparse Attention | 29 |
| 📖Long Context Attention/KV Cache Optimization | 27 |
| 📖Parallel Decoding/Sampling | 21 |
| 📖LLM Train/Inference Framework/Design | 20 |
| 📖GEMM/Tensor Cores/MMA/Parallel | 19 |
| 📖DeepSeek/Multi-head Latent Attention(MLA) | 13 |
| 📖LLM Algorithmic/Eval Survey | 13 |
| 📖Prompt/Context/KV Compression | 12 |
| 📖CPU/Single GPU/FPGA/NPU/Mobile Inference | 11 |
| 📖Multi-GPUs/Multi-Nodes Parallelism | 10 |
| 📖Continuous/In-flight Batching | 10 |
| 📖Early-Exit/Intermediate Layer Decoding | 10 |
| 📖Mixture-of-Experts(MoE) LLM Inference | 6 |
| 📖VLM/Position Embed/Others | 5 |
| 📖Structured Prune/KD/Weight Sparse | 5 |
| 📖Non Transformer Architecture | 5 |
| 📖Trending LLM/VLM Topics | 5 |
| 📖Disaggregating Prefill and Decoding | 4 |