Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
Awesome-LLM-Inference
📖A curated list of Awesome LLM Inference papers with code: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
https://github.com/DefTruth/Awesome-LLM-Inference
📖Contents
📖KV Cache Scheduling/Quantize/Dropping
- KV Cache Compress with LoRA - https://github.com/snu-mllab/Context-Memory ⭐️⭐️
- FASTDECODE
- Sparsity-Aware KV Caching
- **AdaKV**
- ThinK
- Palu
- ZipCache
- Chunked Prefills
- **Adaptive KV Cache Compress**
- CompressKV
- **DistKV-LLM**
- Prompt Caching
- **AlignedKV**
- **Shared Prefixes**
- SqueezeAttention
- Zero-Delay QKV Compression
- Less
- MiKV
- QAQ - https://github.com/ClubieDong/QAQ-KVCacheQuantization ⭐️⭐️
- DMC
- Keyformer - https://github.com/d-matrix-ai/keyformer-llm ⭐️⭐️
- GEAR - https://github.com/opengear-project/GEAR ⭐️
- **ChunkAttention** - https://github.com/microsoft/chunk-attention ⭐️⭐️
- SnapKV
- KVCache-1Bit
- LTP
- **GQA**
- KV Cache Compress
- H2O
- **Adaptive KV Cache Compress**
- CacheGen
- Less
- MiKV
- FASTDECODE
- CacheGen
- Prompt Caching
- **GQA**
- MQA
- LTP
- KV Cache Compress
- H2O
- **LayerKV**
- QK-Sparse/Dropping Attention - https://github.com/epfml/dynamic-sparse-flash-attention ⭐️
- MQA
- **PagedAttention** - https://github.com/vllm-project/vllm ⭐️⭐️
- DMC
- Keyformer - https://github.com/d-matrix-ai/keyformer-llm ⭐️⭐️
- QAQ - https://github.com/ClubieDong/QAQ-KVCacheQuantization ⭐️⭐️
- **RadixAttention** - https://github.com/sgl-project/sglang ⭐️⭐️
- KV-Runahead
- Chunked Prefills
- KV Cache FP8 + WINT4
- SqueezeAttention
- **Shared Prefixes**
- **TensorRT-LLM KV Cache FP8** - https://github.com/NVIDIA/TensorRT-LLM ⭐️⭐️
- KV Cache Compress with LoRA - https://github.com/snu-mllab/Context-Memory ⭐️⭐️
- Sparsity-Aware KV Caching
- MiniCache
- CacheBlend
- MemServe
- QK-Sparse/Dropping Attention - https://github.com/epfml/dynamic-sparse-flash-attention ⭐️
- **PagedAttention** - https://github.com/vllm-project/vllm ⭐️⭐️
- MLKV - https://github.com/zaydzuhri/pythia-mlkv ⭐️
- vAttention
📖CPU/Single GPU/FPGA/Mobile Inference
- FlightLLM
- LLM CPU Inference - https://github.com/intel/intel-extension-for-transformers ⭐️
- LinguaLinked
- FlexGen
- OpenVINO
📖LLM Train/Inference Framework
- **Megatron-LM** - https://github.com/NVIDIA/Megatron-LM ⭐️⭐️
- SpecInfer
- FastServe
- StreamingLLM - https://github.com/mit-han-lab/streaming-llm ⭐️
- **PETALS** - https://github.com/bigscience-workshop/petals ⭐️⭐️
📖IO/FLOPs-Aware/Sparse Attention
- FLOP, I/O
- **FlashAttention-2** - https://github.com/Dao-AILab/flash-attention ⭐️⭐️
- Flash-Decoding++
- SparseGPT - https://github.com/IST-DASLab/sparsegpt ⭐️
- **CHESS**
- MoA - https://github.com/thu-nics/MoA ⭐️
- INT-FLASHATTENTION - https://github.com/INT-FlashAttention2024/INT-FlashAttention ⭐️
- SCCA
- **FlashLLM**
- CHAI
- **FlashAttention** - https://github.com/Dao-AILab/flash-attention ⭐️⭐️
- Flash-Decoding++
- Online Softmax
- Hash Attention
- **FlashAttention** - https://github.com/Dao-AILab/flash-attention ⭐️⭐️
- Online Softmax
- **FlashLLM**
- **FlashAttention-2** - https://github.com/Dao-AILab/flash-attention ⭐️⭐️
- Online Softmax
- FlashAttention
- Online Softmax
- Hash Attention
- **GLA**
- CHAI
- **Flash-Decoding** - https://github.com/Dao-AILab/flash-attention ⭐️⭐️
- Flash Tree Attention
- SparseGPT - https://github.com/IST-DASLab/sparsegpt ⭐️
- **GLA**
- SCCA
- DeFT
- **FlashAttention-3** - https://github.com/Dao-AILab/flash-attention ⭐️⭐️
- Shared Attention
📖Long Context Attention/KV Cache Optimization
- **HyperAttention** - https://github.com/insuhan/hyper-attn ⭐️⭐️
- **Streaming Attention**
- **CLA**
- **Blockwise Attention**
- **StripedAttention** - https://github.com/exists-forall/striped_attention/ ⭐️⭐️
- **RingAttention**
- **LightningAttention-1**
- RAGCache
- LOOK-M - https://github.com/SUSTechBruce/LOOK-M ⭐️⭐️
- **SentenceVAE**
- **YOCO** - https://github.com/microsoft/unilm/tree/master/YOCO ⭐️⭐️
- **Blockwise Attention**
- RAGCache
- **KCache**
- **Prompt Cache**
- **Streaming Attention**
- **Prompt Cache**
- **LightningAttention-1**
- **LightningAttention-2** - https://github.com/OpenNLPLab/lightning-attention ⭐️⭐️
- **RelayAttention**
- **RingAttention**
- **StripedAttention** - https://github.com/exists-forall/striped_attention/ ⭐️⭐️
- **RelayAttention**
- Landmark Attention - https://github.com/epfml/landmark-attention/ ⭐️⭐️
- **RetrievalAttention**
- **InstInfer**
- **KVQuant**
- Landmark Attention - https://github.com/epfml/landmark-attention/ ⭐️⭐️
- **LightningAttention-2** - https://github.com/OpenNLPLab/lightning-attention ⭐️⭐️
- **HyperAttention** - https://github.com/insuhan/hyper-attn ⭐️⭐️
- Infini-attention
- SKVQ
- **ShadowKV**
- **MInference**
- **InfiniGen**
- **Quest** - https://github.com/mit-han-lab/Quest ⭐️⭐️
- PQCache
📖GEMM/Tensor Cores/WMMA/Parallel
📖Parallel Decoding/Sampling
- **OSD**
- **Cascade Speculative**
- Token Recycling
- **Hidden Transfer**
- **Medusa**
- LookaheadDecoding - https://github.com/hao-ai-lab/LookaheadDecoding ⭐️⭐️
- **Speculative Decoding** - https://github.com/uw-mad-dash/decoding-speculative-decoding ⭐️
- **Hybrid Inference**
- **TriForce** - https://github.com/Infini-AI-Lab/TriForce ⭐️⭐️
- **Parallel Decoding**
- **Speculative Sampling**
- **Speculative Sampling**
- **OSD**
- **Cascade Speculative**
- LookaheadDecoding - https://github.com/hao-ai-lab/LookaheadDecoding ⭐️⭐️
- **Speculative Decoding**
- **Speculative Decoding**
- **Speculative Decoding** - https://github.com/smart-lty/ParallelSpeculativeDecoding ⭐️⭐️
- **FocusLLM**
- **MagicDec** - https://github.com/Infini-AI-Lab/MagicDec/ ⭐️
- Multi-Token Speculative Decoding
- **Parallel Decoding**
- **Speculative Sampling**
- **Speculative Sampling**
- **Hidden Transfer**
- **PARALLELSPEC**
- **Fast Best-of-N**
- Instructive Decoding - https://github.com/joonkeekim/Instructive-Decoding ⭐️
- S3D
📖Structured Prune/KD/Weight Sparse
- **FLAP** - https://github.com/CASIA-IVA-Lab/FLAP ⭐️⭐️
- **LASER**
- **Admm Pruning** - https://github.com/fmfi-compbio/admm-pruning ⭐️
- FFSplit
- **LASER**
- PowerInfer - https://github.com/SJTU-IPADS/PowerInfer ⭐️
- **FLAP** - https://github.com/CASIA-IVA-Lab/FLAP ⭐️⭐️
📖Mixture-of-Experts(MoE) LLM Inference
- **Mixtral Offloading** - https://github.com/dvmazur/mixtral-offloading ⭐️⭐️
- **WINT8/4**
- **Mixtral Offloading** - https://github.com/dvmazur/mixtral-offloading ⭐️⭐️
- MoE-Mamba
- MoE-Mamba
- **WINT8/4**
- MoE Inference
- DeepSeek-V2 - https://github.com/deepseek-ai/DeepSeek-V2 ⭐️⭐️
- MoE
📖Non Transformer Architecture
- **RWKV** - https://github.com/BlinkDL/RWKV-LM ⭐️⭐️
- Kraken
- **RWKV-CLIP** - https://github.com/deepglint/RWKV-CLIP ⭐️⭐️
- **RWKV** - https://github.com/BlinkDL/RWKV-LM ⭐️⭐️
- **FLA** - https://github.com/sustcsonglin/flash-linear-attention ⭐️⭐️
- **Mamba** - https://github.com/state-spaces/mamba ⭐️⭐️
📖Early-Exit/Intermediate Layer Decoding
- FastBERT
- **SkipDecode**
- **EE-Tuning** - https://github.com/pan-x-c/EE-LLM ⭐️⭐️
- **KOALA**
- DeeBERT
- BERxiT
- **LITE**
- DeeBERT
- **LITE**
- **EE-LLM** - https://github.com/pan-x-c/EE-LLM ⭐️⭐️
- **FREE**
- Skip Attention
- **EE-LLM** - https://github.com/pan-x-c/EE-LLM ⭐️⭐️
- **FREE**
📖LLM Train/Inference Framework/Design
- **LightLLM**
- **llama.cpp**
- **DeepSpeed-FastGen 2x vLLM?** - https://github.com/microsoft/DeepSpeed ⭐️⭐️
- inferflow
- **TensorRT-LLM** - https://github.com/NVIDIA/TensorRT-LLM ⭐️⭐️
- DynamoLLM
- Medusa
- SpecInfer
- FastServe
- StreamingLLM - https://github.com/mit-han-lab/streaming-llm ⭐️
- **PETALS** - https://github.com/bigscience-workshop/petals ⭐️⭐️
- **Decentralized LLM**
- NanoFlow
- **Megatron-LM** - https://github.com/NVIDIA/Megatron-LM ⭐️⭐️
- **flashinfer** - https://github.com/flashinfer-ai/flashinfer ⭐️⭐️
- **Mooncake** - https://github.com/kvcache-ai/Mooncake ⭐️⭐️
- **LMDeploy**
- **MLC-LLM** - https://github.com/mlc-ai/mlc-llm ⭐️⭐️
📖GEMM/Tensor Cores/MMA/Parallel
- **LUT TENSOR CORE**
- **SpMM**
- **TEE**
- QUICK
- Tensor Parallel
- Microbenchmark
- **HiFloat8**
- **Tensor Cores**
- Tensor Core
- Intra-SM Parallelism
- FP8
- Tensor Cores
- **MARLIN** - https://github.com/IST-DASLab/marlin ⭐️⭐️
- **cutlass/cute**
- **Tensor Product**
- **flute**
📖Prompt/Context/KV Compression
- **Prompt Compression**
- **Context Distillation**
- **LLMLingua-2**
- **Eigen Attention**
- **AutoCompressor** - https://github.com/princeton-nlp/AutoCompressors ⭐️
- **500xCompressor**
- **LongLLMLingua**
- **Selective-Context** - https://github.com/liyucheng09/Selective_Context ⭐️⭐️
- **LLMLingua**
- **KV-COMPRESS** - https://github.com/IsaacRe/vllm-kvcompress ⭐️⭐️
- **LORC**
- **CRITIPREFILL**
📖Continuous/In-flight Batching
- **DeepSpeed-FastGen 2x vLLM?** - https://github.com/microsoft/DeepSpeed ⭐️⭐️
- Automatic Inference Engine Tuning
- Splitwise
- SpotServe
- LightSeq
- **SJF Scheduling**
- **Continuous Batching**
- **In-flight Batching** - https://github.com/NVIDIA/TensorRT-LLM ⭐️⭐️
- Splitwise
- SpotServe
- LightSeq
- **vTensor** - https://github.com/intelligent-machine-learning/glake/tree/master/GLakeServe ⭐️⭐️
📖Weight/Activation Quantize/Compress
- I-LLM
- ABQ-LLM - https://github.com/bytedance/ABQ-LLM ⭐️
- FP6-LLM
- VPTQ
- **ZeroQuant**
- LLM.int8()
- **GPTQ** - https://github.com/IST-DASLab/gptq ⭐️⭐️
- 2-bit LLM
- SqueezeLLM
- ZeroQuant-FP
- **ZeroQuant**
- FP8-Quantization - https://github.com/Qualcomm-AI-research/FP8-quantization ⭐️
- LLM.int8()
- **GPTQ** - https://github.com/IST-DASLab/gptq ⭐️⭐️
- **SmoothQuant** - https://github.com/mit-han-lab/smoothquant ⭐️⭐️
- ZeroQuant-V2
- **SmoothQuant+**
- OdysseyLLM W4A8
- CBQ
- QLLM
- **SmoothQuant** - https://github.com/mit-han-lab/smoothquant ⭐️⭐️
- ZeroQuant-V2
- Agile-Quant
- CBQ
- QLLM
- ACTIVATION SPARSITY
- 1-bit LLMs
- BitNet
- **SmoothQuant+**
- OdysseyLLM W4A8
- **SparQ**
- **AWQ** - https://github.com/mit-han-lab/llm-awq ⭐️⭐️
- SpQR
- SqueezeLLM
- ZeroQuant-FP
- FP8-LM - https://github.com/Azure/MS-AMP ⭐️
- LLM-Shearing - https://github.com/princeton-nlp/LLM-Shearing ⭐️
- LLM-FP4 - https://github.com/nbasyl/LLM-FP4 ⭐️
- 2-bit LLM
- **SparQ**
- Agile-Quant
- **W4A8KV4** - https://github.com/mit-han-lab/qserve ⭐️⭐️
- FP8-Quantization - https://github.com/Qualcomm-AI-research/FP8-quantization ⭐️
- SpinQuant
- OutlierTune
- FP8-LM - https://github.com/Azure/MS-AMP ⭐️
- LLM-Shearing - https://github.com/princeton-nlp/LLM-Shearing ⭐️
- LLM-FP4 - https://github.com/nbasyl/LLM-FP4 ⭐️
- GPTQT
📖LLM Algorithmic/Eval Survey
- Understanding LLMs
- LLM-Viewer - https://github.com/hahnyuan/LLM-Viewer ⭐️⭐️
- **Low-bit**
- Evaluating - https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers ⭐️
- **Runtime Performance**
- ChatGPT Anniversary
- Algorithmic Survey
- Security and Privacy
- **LLMCompass**
- Evaluating - https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers ⭐️
- **Runtime Performance**
- ChatGPT Anniversary
- Algorithmic Survey
- Security and Privacy
- **LLMCompass**
- **Efficient LLMs** - https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey ⭐️⭐️
- **Serving Survey**
- **Efficient LLMs** - https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey ⭐️⭐️
- **Serving Survey**
- LLM-Viewer - https://github.com/hahnyuan/LLM-Viewer ⭐️⭐️
- **LLM Inference**
- **Internal Consistency & Self-Feedback** - https://github.com/IAAR-Shanghai/ICSFSurvey ⭐️⭐️
📖CPU/Single GPU/FPGA/NPU/Mobile Inference
- Summary
- LLM CPU Inference - https://github.com/intel/intel-extension-for-transformers ⭐️
- LinguaLinked
- FlightLLM
- OpenVINO
- FlexGen
- Transformer-Lite
- **FastAttention**
- **xFasterTransformer**
📖Prompt/Context Compression
- **Selective-Context** - https://github.com/liyucheng09/Selective_Context ⭐️⭐️
- **AutoCompressor** - https://github.com/princeton-nlp/AutoCompressors ⭐️
- **LLMLingua**
- **LLMLingua-2**
📖Trending LLM/VLM Topics
- Open-Sora Plan - https://github.com/PKU-YuanGroup/Open-Sora-Plan ⭐️⭐️
- Open-Sora - https://github.com/hpcaitech/Open-Sora ⭐️⭐️
📖VLM/Position Embed/Others
📖Position Embed/Others
📖Data/Model/Pipeline/Tensor/Sequence/Context Parallelism
- **Model Parallel**
- **Sequence Parallel** - https://github.com/NVIDIA/Megatron-LM ⭐️⭐️
- **Sequence Parallel**
- **Context Parallel** - https://github.com/NVIDIA/Megatron-LM ⭐️⭐️
- **Context Parallel**
📙Awesome LLM Inference Papers with Codes
📖Non Transformer Architecture
📖Parallel Decoding/Sampling
📖Structured Prune/KD/Weight Sparse
- FFSplit
- **Admm Pruning** - https://github.com/fmfi-compbio/admm-pruning ⭐️
📖Long Context Attention/KV Cache Optimization
📖LLM Train/Inference Framework
- **DeepSpeed-FastGen 2x vLLM?** - https://github.com/microsoft/DeepSpeed ⭐️⭐️
- inferflow
📖KV Cache Scheduling/Quantize/Dropping
📖Weight/Activation Quantize/Compress
📖LLM Algorithmic/Eval Survey
Sub Categories
- 📖KV Cache Scheduling/Quantize/Dropping: 65
- 📖Weight/Activation Quantize/Compress: 50
- 📖Long Context Attention/KV Cache Optimization: 38
- 📖IO/FLOPs-Aware/Sparse Attention: 32
- 📖Parallel Decoding/Sampling: 30
- 📖LLM Algorithmic/Eval Survey: 23
- 📖LLM Train/Inference Framework/Design: 18
- 📖GEMM/Tensor Cores/MMA/Parallel: 16
- 📖Early-Exit/Intermediate Layer Decoding: 14
- 📖Prompt/Context/KV Compression: 12
- 📖Continuous/In-flight Batching: 12
- 📖CPU/Single GPU/FPGA/NPU/Mobile Inference: 10
- 📖Structured Prune/KD/Weight Sparse: 9
- 📖Mixture-of-Experts(MoE) LLM Inference: 9
- 📖Non Transformer Architecture: 8
- 📖LLM Train/Inference Framework: 7
- 📖GEMM/Tensor Cores/WMMA/Parallel: 6
- 📖CPU/Single GPU/FPGA/Mobile Inference: 5
- 📖Data/Model/Pipeline/Tensor/Sequence/Context Parallelism: 5
- 📖VLM/Position Embed/Others: 4
- 📖Prompt/Context Compression: 4
- 📖Position Embed/Others: 2
- 📖Trending LLM/VLM Topics: 2