Awesome-LLM-Inference

📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
https://github.com/xlite-dev/Awesome-LLM-Inference

Sub Categories
- 📖KV Cache Scheduling/Quantize/Dropping (72)
- 📖Weight/Activation Quantize/Compress (52)
- 📖IO/FLOPs-Aware/Sparse Attention (48)
- 📖Long Context Attention/KV Cache Optimization (40)
- 📖Parallel Decoding/Sampling (32)
- 📖LLM Algorithmic/Eval Survey (23)
- 📖LLM Train/Inference Framework/Design (20)
- 📖GEMM/Tensor Cores/MMA/Parallel (19)
- 📖Continuous/In-flight Batching (14)
- 📖Early-Exit/Intermediate Layer Decoding (14)
- 📖Prompt/Context/KV Compression (12)
- 📖DeepSeek/Multi-head Latent Attention(MLA) (12)
- 📖Structured Prune/KD/Weight Sparse (11)
- 📖CPU/Single GPU/FPGA/NPU/Mobile Inference (11)
- 📖Multi-GPUs/Multi-Nodes Parallelism (10)
- 📖Mixture-of-Experts(MoE) LLM Inference (9)
- 📖Non Transformer Architecture (7)
- 📖GEMM/Tensor Cores/WMMA/Parallel (6)
- 📖LLM Train/Inference Framework (6)
- 📖CPU/Single GPU/FPGA/Mobile Inference (5)
- 📖VLM/Position Embed/Others (5)
- 📖Disaggregating Prefill and Decoding (4)
- 📖Prompt/Context Compression (4)
- 📖Trending LLM/VLM Topics (3)
- 📖Position Embed/Others (2)