Awesome-LLM-Inference
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism etc. 🎉🎉
https://github.com/xlite-dev/Awesome-LLM-Inference
-
📖Contents
-
📖KV Cache Scheduling/Quantize/Dropping ([©️back👆🏻](#paperlist))
- KV Cache Compress with LoRA [[Context-Memory]](https://github.com/snu-mllab/Context-Memory) ⭐️⭐️
- FASTDECODE
- Sparsity-Aware KV Caching
- H2O
- **AdaKV**
- **KV Cache Recomputation**
- **DynamicKV**
- ThinK
- Palu
- ZipCache
- Chunked Prefills
- **Adaptive KV Cache Compress**
- CompressKV
- **DistKV-LLM**
- Prompt Caching
- **AlignedKV**
- **Shared Prefixes**
- SqueezeAttention
- Zero-Delay QKV Compression
- Less
- MiKV
- QAQ [[QAQ-KVCacheQuantization]](https://github.com/ClubieDong/QAQ-KVCacheQuantization) ⭐️⭐️
- DMC
- GEAR [[GEAR]](https://github.com/opengear-project/GEAR) ⭐️
- **ChunkAttention** [[chunk-attention]](https://github.com/microsoft/chunk-attention) ⭐️⭐️
- SnapKV
- KVCache-1Bit
- LTP
- **GQA**
- KV Cache Compress
- CacheGen
- MQA
- **LayerKV**
- QK-Sparse/Dropping Attention [[dynamic-sparse-flash-attention]](https://github.com/epfml/dynamic-sparse-flash-attention) ⭐️
- Keyformer [[keyformer-llm]](https://github.com/d-matrix-ai/keyformer-llm) ⭐️⭐️
- **RadixAttention** [[sglang]](https://github.com/sgl-project/sglang) ⭐️⭐️
- KV-Runahead
- **ClusterKV**
- **TensorRT-LLM KV Cache FP8** [[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) ⭐️⭐️
- MiniCache
- CacheBlend
- MemServe
- **PagedAttention** [[vllm]](https://github.com/vllm-project/vllm) ⭐️⭐️
- MLKV [[pythia-mlkv]](https://github.com/zaydzuhri/pythia-mlkv) ⭐️
- vAttention
- KV Cache FP8 + WINT4
- **CacheCraft**
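Many of the KV-cache entries above quantize the cache to 8-bit (or lower) to cut decode-time memory. As a rough illustration only, not taken from any specific paper in this list, a symmetric per-token INT8 round-trip over a toy K/V tensor looks like this (shapes and scale granularity are assumptions):

```python
import numpy as np

def quantize_kv_int8(kv: np.ndarray):
    """Symmetric per-token INT8 quantization of a KV tensor.

    kv: float32 array of shape [num_tokens, num_heads, head_dim].
    Returns int8 codes plus one float32 scale per token.
    """
    absmax = np.abs(kv).max(axis=(1, 2), keepdims=True)            # [tokens, 1, 1]
    scale = np.maximum(absmax, 1e-8) / 127.0
    q = np.clip(np.round(kv / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_kv_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    kv = np.random.randn(16, 8, 64).astype(np.float32)             # toy cache: 16 tokens
    q, s = quantize_kv_int8(kv)
    err = np.abs(dequantize_kv_int8(q, s) - kv).mean()
    print(f"INT8 KV cache round-trip mean abs error: {err:.4f}, "
          f"memory {kv.nbytes} -> {q.nbytes + s.nbytes} bytes")
```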
-
📖CPU/Single GPU/FPGA/Mobile Inference ([©️back👆🏻](#paperlist))
- FlightLLM
- LLM CPU Inference [[intel-extension-for-transformers]](https://github.com/intel/intel-extension-for-transformers) ⭐️
- LinguaLinked
- FlexGen
- OpenVINO
-
📖LLM Train/Inference Framework ([©️back👆🏻](#paperlist))
- **Megatron-LM** [[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ⭐️⭐️
- SpecInfer
- FastServe
- StreamingLLM [[streaming-llm]](https://github.com/mit-han-lab/streaming-llm) ⭐️
- **PETALS** [[petals]](https://github.com/bigscience-workshop/petals) ⭐️⭐️
-
📖IO/FLOPs-Aware/Sparse Attention ([©️back👆🏻](#paperlist))
- FLOP, I/O
- **FlashAttention-2** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) ⭐️⭐️
- Flash-Decoding++
- SparseGPT [[sparsegpt]](https://github.com/IST-DASLab/sparsegpt) ⭐️
- **TurboAttention**
- **CHESS**
- MoA [[MoA]](https://github.com/thu-nics/MoA) ⭐️
- INT-FLASHATTENTION [[INT-FlashAttention]](https://github.com/INT-FlashAttention2024/INT-FlashAttention) ⭐️
- SCCA
- **FlashLLM**
- CHAI
- **FlashAttention** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) ⭐️⭐️
- Online Softmax
- Hash Attention
- **GLA**
- **Flash-Decoding** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) ⭐️⭐️
- Flash Tree Attention
- DeFT
- **FFPA** [[ffpa-attn-mma]](https://github.com/DefTruth/ffpa-attn-mma) ⭐️⭐️
- **FlashAttention-3** [[flash-attention]](https://github.com/Dao-AILab/flash-attention) ⭐️⭐️
- Shared Attention
- **SageAttention** [[SageAttention]](https://github.com/thu-ml/SageAttention) ⭐️⭐️
- **SageAttention-2** [[SageAttention]](https://github.com/thu-ml/SageAttention) ⭐️⭐️
- **Squeezed Attention**
- **FFPA** [[cuffpa-py]](https://github.com/DefTruth/cuffpa-py) ⭐️⭐️
- **SpargeAttention** [[SpargeAttn]](https://github.com/thu-ml/SpargeAttn) ⭐️⭐️
- **FFPA** [[ffpa-attn-mma]](https://github.com/xlite-dev/ffpa-attn-mma) ⭐️⭐️
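A recurring building block behind the FlashAttention and Online Softmax entries above is computing softmax over the key dimension block by block, keeping only a running maximum and normalizer so the full score matrix is never materialized. Below is a minimal single-query sketch of that rescaling trick; the block size and shapes are arbitrary and this plain-NumPy illustration is not any of the kernels listed:

```python
import numpy as np

def online_softmax_attention(q, K, V, block=128):
    """Single-query attention over K/V blocks via online softmax.

    q: [d], K: [n, d], V: [n, d]. Numerically equal to softmax(K @ q) @ V.
    """
    running_max = -np.inf          # running max of scores seen so far
    denom = 0.0                    # running softmax normalizer
    acc = np.zeros_like(V[0])      # running weighted sum of values
    for start in range(0, len(K), block):
        s = K[start:start + block] @ q                    # scores for this block
        new_max = max(running_max, s.max())
        p = np.exp(s - new_max)
        rescale = np.exp(running_max - new_max)           # correct previously accumulated state
        denom = denom * rescale + p.sum()
        acc = acc * rescale + p @ V[start:start + block]
        running_max = new_max
    return acc / denom

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)
    K = rng.standard_normal((1000, 64))
    V = rng.standard_normal((1000, 64))
    w = np.exp(K @ q - (K @ q).max()); w /= w.sum()       # reference softmax weights
    assert np.allclose(online_softmax_attention(q, K, V), w @ V, atol=1e-6)
    print("online softmax matches the reference attention output")
```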
-
📖Long Context Attention/KV Cache Optimization ([©️back👆🏻](#paperlist))
- **HyperAttention** [[hyper-attn]](https://github.com/insuhan/hyper-attn) ⭐️⭐️
- **Streaming Attention**
- **CLA**
- **StripedAttention** [[striped_attention]](https://github.com/exists-forall/striped_attention/) ⭐️⭐️
- **RingAttention**
- **LightningAttention-1**
- RAGCache
- LOOK-M [[LOOK-M]](https://github.com/SUSTechBruce/LOOK-M) ⭐️⭐️
- **SentenceVAE**
- **YOCO** [[YOCO]](https://github.com/microsoft/unilm/tree/master/YOCO) ⭐️⭐️
- **Blockwise Attention**
- **KCache**
- **Prompt Cache**
- **LightningAttention-2** [[lightning-attention]](https://github.com/OpenNLPLab/lightning-attention) ⭐️⭐️
- **RelayAttention**
- Landmark Attention [[landmark-attention]](https://github.com/epfml/landmark-attention/) ⭐️⭐️
- **RetrievalAttention**
- **InstInfer**
- **KVQuant**
- Infini-attention
- SKVQ
- **ShadowKV**
- **Lightning Attention** [[MiniMax-01]](https://github.com/MiniMax-AI/MiniMax-01) ⭐️⭐️
- **MInference**
- **InfiniGen**
- **Quest** [[Quest]](https://github.com/mit-han-lab/Quest) ⭐️⭐️
- PQCache
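Several of the long-context entries above bound KV-cache growth by keeping a handful of initial "sink" tokens plus a recent window and evicting everything in between (the streaming-attention style of eviction). A toy illustration of that policy, with the sink and window sizes as made-up parameters:

```python
from collections import deque

class SlidingWindowKVCache:
    """Toy KV cache: keep `num_sink` earliest tokens plus the latest `window`
    tokens; everything in the middle is evicted."""

    def __init__(self, num_sink: int = 4, window: int = 1024):
        self.num_sink = num_sink
        self.sink = []                      # earliest (k, v) pairs, never evicted
        self.recent = deque(maxlen=window)  # rolling window of (k, v) pairs

    def append(self, k, v):
        if len(self.sink) < self.num_sink:
            self.sink.append((k, v))
        else:
            self.recent.append((k, v))      # deque evicts the oldest automatically

    def keys_values(self):
        return list(self.sink) + list(self.recent)

cache = SlidingWindowKVCache(num_sink=2, window=3)
for t in range(8):
    cache.append(f"k{t}", f"v{t}")
print([k for k, _ in cache.keys_values()])  # ['k0', 'k1', 'k5', 'k6', 'k7']
```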
-
📖GEMM/Tensor Cores/WMMA/Parallel ([©️back👆🏻](#paperlist))
-
📖Parallel Decoding/Sampling ([©️back👆🏻](#paperlist))
- **OSD**
- **Cascade Speculative**
- Token Recycling
- **Hidden Transfer**
- **Medusa**
- LookaheadDecoding [[LookaheadDecoding]](https://github.com/hao-ai-lab/LookaheadDecoding) ⭐️⭐️
- **Speculative Decoding** [[decoding-speculative-decoding]](https://github.com/uw-mad-dash/decoding-speculative-decoding) ⭐️
- **Hybrid Inference**
- **TriForce** [[TriForce]](https://github.com/Infini-AI-Lab/TriForce) ⭐️⭐️
- **Parallel Decoding**
- **Speculative Sampling**
- **Speculative Decoding**
- **Speculative Decoding** [[ParallelSpeculativeDecoding]](https://github.com/smart-lty/ParallelSpeculativeDecoding) ⭐️⭐️
- **FocusLLM**
- **MagicDec** [[MagicDec]](https://github.com/Infini-AI-Lab/MagicDec/) ⭐️
- **PARALLELSPEC**
- **Fast Best-of-N**
- Instructive Decoding [[Instructive-Decoding]](https://github.com/joonkeekim/Instructive-Decoding) ⭐️
- S3D
- Multi-Token Speculative Decoding
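Most of this section is speculative decoding: a cheap draft model proposes a few tokens and the target model verifies them in a single forward pass, keeping the longest agreeing prefix (the greedy variant is shown below; stochastic decoding uses rejection sampling instead). This is a schematic sketch in which `draft_model` and `target_model` are stand-in callables that return a greedy next-token prediction for every position, not any specific system listed above:

```python
def speculative_decode_greedy(target_model, draft_model, prompt, num_draft=4, max_new=16):
    """Greedy speculative decoding sketch. Both models map a token list to a list
    of next-token predictions (one per position)."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) draft proposes num_draft tokens autoregressively (cheap model)
        draft = []
        for _ in range(num_draft):
            draft.append(draft_model(tokens + draft)[-1])
        # 2) target scores prompt + draft in one pass (expensive model, run once)
        target_next = target_model(tokens + draft)
        # 3) accept the longest prefix where draft and target agree
        n_accept = 0
        while n_accept < num_draft and draft[n_accept] == target_next[len(tokens) - 1 + n_accept]:
            n_accept += 1
        tokens += draft[:n_accept]
        # 4) the target's own prediction at the first mismatch is always usable
        tokens.append(target_next[len(tokens) - 1])
    return tokens

if __name__ == "__main__":
    # Toy "models": the true next token is (last + 1) % 10; the draft gets it
    # wrong whenever the last token is a multiple of 4.
    target = lambda toks: [(t + 1) % 10 for t in toks]
    draft = lambda toks: [(t + 1) % 10 if t % 4 else 0 for t in toks]
    print(speculative_decode_greedy(target, draft, prompt=[1, 2, 3]))
```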
-
📖Structured Prune/KD/Weight Sparse ([©️back👆🏻](#paperlist))
- **FLAP** [[FLAP]](https://github.com/CASIA-IVA-Lab/FLAP) ⭐️⭐️
- **LASER**
- **Admm Pruning** [[admm-pruning]](https://github.com/fmfi-compbio/admm-pruning) ⭐️
- FFSplit
- PowerInfer [[PowerInfer]](https://github.com/SJTU-IPADS/PowerInfer) ⭐️
-
📖Mixture-of-Experts(MoE) LLM Inference ([©️back👆🏻](#paperlist))
- **Mixtral Offloading** [[mixtral-offloading]](https://github.com/dvmazur/mixtral-offloading) ⭐️⭐️
- **WINT8/4**
- MoE-Mamba
- MoE Inference
- DeepSeek-V2 [[DeepSeek-V2]](https://github.com/deepseek-ai/DeepSeek-V2) ⭐️⭐️
- MoE
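MoE inference cost is set by the router: each token is sent only to its top-k experts by gate score, and their outputs are combined with the re-normalized gate weights. A minimal dense-math sketch of top-k routing (expert count, k, and dimensions are invented; real systems add capacity limits and expert parallelism):

```python
import numpy as np

def moe_layer(x, w_gate, experts, k=2):
    """x: [tokens, d]; w_gate: [d, num_experts]; experts: list of [d, d] matrices."""
    logits = x @ w_gate                               # [tokens, num_experts]
    topk = np.argsort(-logits, axis=-1)[:, :k]        # indices of the k best experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = np.exp(logits[t, topk[t]])
        gates /= gates.sum()                          # softmax over the selected experts only
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ experts[e])      # only k experts run per token
    return out

rng = np.random.default_rng(0)
d, num_experts = 16, 8
x = rng.standard_normal((4, d))
experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]
print(moe_layer(x, rng.standard_normal((d, num_experts)), experts).shape)  # (4, 16)
```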
-
📖Non Transformer Architecture ([©️back👆🏻](#paperlist))
- **RWKV** [[RWKV-LM]](https://github.com/BlinkDL/RWKV-LM) ⭐️⭐️
- Kraken
- **RWKV-CLIP** [[RWKV-CLIP]](https://github.com/deepglint/RWKV-CLIP) ⭐️⭐️
- **Mamba** [[mamba]](https://github.com/state-spaces/mamba) ⭐️⭐️
- **FLA** [[flash-linear-attention]](https://github.com/sustcsonglin/flash-linear-attention) ⭐️⭐️
-
📖Early-Exit/Intermediate Layer Decoding ([©️back👆🏻](#paperlist))
- FastBERT
- **SkipDecode**
- **EE-Tuning** [[EE-Tuning]](https://github.com/pan-x-c/EE-LLM) ⭐️⭐️
- **KOALA**
- DeeBERT
- BERxiT
- **LITE**
- **EE-LLM** [[EE-LLM]](https://github.com/pan-x-c/EE-LLM) ⭐️⭐️
- **FREE**
- Skip Attention
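Early-exit decoding attaches a small prediction head to intermediate layers and stops the forward pass for a token as soon as the intermediate prediction is confident enough. The loop below is only a schematic illustration of that idea; the confidence threshold, the tanh "block", and the per-layer heads are invented, not taken from the papers above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def early_exit_forward(hidden, layers, exit_heads, threshold=0.9):
    """Run layers in order; after each, project to logits with that layer's exit
    head and stop as soon as the top probability exceeds `threshold`."""
    probs = None
    for i, (layer, head) in enumerate(zip(layers, exit_heads)):
        hidden = np.tanh(hidden @ layer)            # stand-in for a transformer block
        probs = softmax(hidden @ head)
        if probs.max() >= threshold:
            return int(probs.argmax()), i + 1       # exited after i + 1 layers
    return int(probs.argmax()), len(layers)         # no early exit: full depth

rng = np.random.default_rng(0)
d, vocab, depth = 32, 100, 12
layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]
heads = [rng.standard_normal((d, vocab)) for _ in range(depth)]
token, used = early_exit_forward(rng.standard_normal(d), layers, heads, threshold=0.5)
print(f"predicted token {token} using {used}/{depth} layers")
```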
-
📖CPU/Single GPU/FPGA/NPU/Mobile Inference ([©️back👆🏻](#paperlist))
- **NITRO** [[nitro]](https://github.com/abdelfattah-lab/nitro) ⭐️
- Summary
- LLM CPU Inference [[intel-extension-for-transformers]](https://github.com/intel/intel-extension-for-transformers) ⭐️
- LinguaLinked
- FlightLLM
- Transformer-Lite
- **FastAttention**
- **xFasterTransformer**
- FlexGen
- OpenVINO
-
📖GEMM/Tensor Cores/MMA/Parallel ([©️back👆🏻](#paperlist))
- **FLASH-ATTENTION RNG**
- QUICK
- **LUT TENSOR CORE**
- **SpMM**
- **TEE**
- Tensor Parallel
- Microbenchmark
- **TRITONBENCH**
- **HiFloat8**
- **Tensor Cores**
- Tensor Core
- Intra-SM Parallelism
- FP8
- Tensor Cores
- **MARLIN** [[marlin]](https://github.com/IST-DASLab/marlin) ⭐️⭐️
- **cutlass/cute**
- **HADACORE** [[applied-ai]](https://github.com/pytorch-labs/applied-ai/tree/main/kernels/cuda/inference/hadamard_transform) ⭐️
- **flute**
-
📖DeepSeek/Multi-head Latent Attention(MLA) ([©️back👆🏻](#paperlist))
- **TransMLA**
- **X-EcoMLA**
- **DeepSeek-NSA**
- **DeepSeek-R1** [[DeepSeek-R1]](https://github.com/deepseek-ai/DeepSeek-R1) ⭐️⭐️
- **FlashMLA** [[FlashMLA]](https://github.com/deepseek-ai/FlashMLA) ⭐️⭐️
- **DeepSeek-V3** [[DeepSeek-V3]](https://github.com/deepseek-ai/DeepSeek-V3) ⭐️⭐️
- **DualPipe** [[DualPipe]](https://github.com/deepseek-ai/DualPipe) ⭐️⭐️
- **DeepEP** [[DeepEP]](https://github.com/deepseek-ai/DeepEP) ⭐️⭐️
- **DeepGEMM** [[DeepGEMM]](https://github.com/deepseek-ai/DeepGEMM) ⭐️⭐️
- **EPLB** [[EPLB]](https://github.com/deepseek-ai/EPLB) ⭐️⭐️
- **3FS** [[3FS]](https://github.com/deepseek-ai/3FS) ⭐️⭐️
- **Inference Systems (推理系统)**
- **MHA2MLA** [[MHA2MLA]](https://github.com/JT-Ushio/MHA2MLA) ⭐️⭐️
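Multi-head Latent Attention (MLA), popularized by the DeepSeek models above, caches one small low-rank latent per token instead of full per-head K/V and re-expands it at attention time. The snippet below is only a shape sketch of that low-rank idea under invented dimensions; it omits RoPE decoupling and everything else in the actual DeepSeek formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent, seq = 512, 8, 64, 64, 16

W_down = rng.standard_normal((d_model, d_latent)) * 0.02             # shared KV down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02    # per-head K up-projection
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02    # per-head V up-projection

h = rng.standard_normal((seq, d_model))            # hidden states for `seq` cached tokens
latent_cache = h @ W_down                          # this is what gets cached: [seq, d_latent]
k = (latent_cache @ W_up_k).reshape(seq, n_heads, d_head)   # K rebuilt on the fly
v = (latent_cache @ W_up_v).reshape(seq, n_heads, d_head)   # V rebuilt on the fly

full_per_token = n_heads * d_head * 2              # floats cached per token with plain MHA
print(f"cached floats per token: {d_latent} (latent) vs {full_per_token} (full K/V), "
      f"{full_per_token // d_latent}x smaller")
```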
-
📖LLM Train/Inference Framework/Design ([©️back👆🏻](#paperlist))
- **LightLLM**
- **llama.cpp**
- **DeepSpeed-FastGen 2x vLLM?** [[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed) ⭐️⭐️
- **TensorRT-LLM** [[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) ⭐️⭐️
- DynamoLLM
- Medusa
- SpecInfer
- FastServe
- StreamingLLM [[streaming-llm]](https://github.com/mit-han-lab/streaming-llm) ⭐️
- **PETALS** [[petals]](https://github.com/bigscience-workshop/petals) ⭐️⭐️
- **Decentralized LLM**
- NanoFlow
- **flashinfer** [[flashinfer]](https://github.com/flashinfer-ai/flashinfer) ⭐️⭐️
- **Mooncake** [[Mooncake]](https://github.com/kvcache-ai/Mooncake) ⭐️⭐️
- **LMDeploy**
- **MLC-LLM** [[mlc-llm]](https://github.com/mlc-ai/mlc-llm) ⭐️⭐️
- **SparseInfer**
- **Megatron-LM** [[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ⭐️⭐️
- inferflow
-
📖Prompt/Context/KV Compression ([©️back👆🏻](#paperlist))
- **Prompt Compression**
- **Context Distillation**
- **LLMLingua-2**
- **Eigen Attention**
- **AutoCompressor** [[AutoCompressors]](https://github.com/princeton-nlp/AutoCompressors) ⭐️
- **500xCompressor**
- **LongLLMLingua**
- **LLMLingua**
- **KV-COMPRESS** [[vllm-kvcompress]](https://github.com/IsaacRe/vllm-kvcompress) ⭐️⭐️
- **LORC**
- **CRITIPREFILL**
- **Selective-Context** [[Selective_Context]](https://github.com/liyucheng09/Selective_Context) ⭐️⭐️
-
📖Continuous/In-flight Batching ([©️back👆🏻](#paperlist))
- **DeepSpeed-FastGen 2x vLLM?** [[deepspeed-fastgen]](https://github.com/microsoft/DeepSpeed) ⭐️⭐️
- Automatic Inference Engine Tuning
- Splitwise
- SpotServe
- LightSeq
- **SJF Scheduling**
- **Continuous Batching**
- **In-flight Batching** [[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) ⭐️⭐️
- **BatchLLM**
- **vTensor** [[GLakeServe]](https://github.com/intelligent-machine-learning/glake/tree/master/GLakeServe) ⭐️⭐️
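Continuous (in-flight) batching admits new requests and retires finished ones at every decode step, instead of waiting for a whole static batch to drain. The scheduler loop below is a toy illustration: the queue, the batch-size cap, and the fake one-token decode step are all invented for the example:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    remaining: int                       # tokens still to generate
    output: list = field(default_factory=list)

def continuous_batching(requests, max_batch=3):
    waiting, running, step = deque(requests), [], 0
    while waiting or running:
        # admit new requests into any free batch slots
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # one decode step for every running request (stand-in for a model call)
        for r in running:
            r.output.append(f"tok{step}")
            r.remaining -= 1
        # retire finished requests immediately, freeing their slots
        running = [r for r in running if r.remaining > 0]
        step += 1
    return step

reqs = [Request(rid=i, remaining=n) for i, n in enumerate([2, 5, 3, 1])]
print("total decode steps:", continuous_batching(reqs))   # 5 steps vs 11 if run sequentially
```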
-
📖Weight/Activation Quantize/Compress ([©️back👆🏻](#paperlist))
- I-LLM
- ABQ-LLM [[ABQ-LLM]](https://github.com/bytedance/ABQ-LLM) ⭐️
- FP6-LLM
- VPTQ
- **ZeroQuant**
- LLM.int8()
- **GPTQ** [[gptq]](https://github.com/IST-DASLab/gptq) ⭐️⭐️
- 2-bit LLM
- SqueezeLLM
- ZeroQuant-FP
- FP8-Quantization [[FP8-quantization]](https://github.com/Qualcomm-AI-research/FP8-quantization) ⭐️
- **SmoothQuant** [[smoothquant]](https://github.com/mit-han-lab/smoothquant) ⭐️⭐️
- ZeroQuant-V2
- **SmoothQuant+**
- OdysseyLLM W4A8
- CBQ
- QLLM
- Agile-Quant
- ACTIVATION SPARSITY
- 1-bit LLMs
- BitNet
- **SparQ**
- **AWQ** [[llm-awq]](https://github.com/mit-han-lab/llm-awq) ⭐️⭐️
- SpQR
- FP8-LM [[MS-AMP]](https://github.com/Azure/MS-AMP) ⭐️
- LLM-Shearing [[LLM-Shearing]](https://github.com/princeton-nlp/LLM-Shearing) ⭐️
- LLM-FP4 [[LLM-FP4]](https://github.com/nbasyl/LLM-FP4) ⭐️
- **W4A8KV4** [[qserve]](https://github.com/mit-han-lab/qserve) ⭐️⭐️
- SpinQuant
- OutlierTune
- GPTQT
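Weight-only quantization (the WINT8/WINT4-style entries above) stores each weight column or group as low-bit integers plus a floating-point scale and dequantizes during the matmul. Below is a per-output-channel INT8 sketch; the group size and layout are simplified relative to real kernels, which keep the weights packed and fuse the scaling:

```python
import numpy as np

def quantize_weight_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization. w: [in_features, out_features]."""
    scale = np.maximum(np.abs(w).max(axis=0), 1e-8) / 127.0          # one scale per column
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def linear_wint8(x: np.ndarray, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Explicit dequantize-then-matmul just to show the math: y = x @ (q * scale).
    return x @ (q.astype(np.float32) * scale)

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 1024)).astype(np.float32)
x = rng.standard_normal((4, 256)).astype(np.float32)
q, s = quantize_weight_int8(w)
rel_err = np.abs(linear_wint8(x, q, s) - x @ w).max() / np.abs(x @ w).max()
print(f"weights: {w.nbytes} -> {q.nbytes + s.nbytes} bytes, max relative error {rel_err:.3e}")
```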
-
📖DP/MP/PP/TP/SP/CP Parallelism ([©️back👆🏻](#paperlist))
- **SP: BPT**
- **MP: ZeRO**
- **SP: Megatron-LM** [[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ⭐️⭐️
- **SP: DEEPSPEED ULYSSES**
- **CP: Megatron-LM** [[Megatron-LM]](https://github.com/NVIDIA/Megatron-LM) ⭐️⭐️
- **CP: Meta**
- **TP: Comm Compression**
- **SP: Star-Attention, 11x~ speedup** [[Star-Attention]](https://github.com/NVIDIA/Star-Attention) ⭐️⭐️
- **SP: TokenRing** [[token-ring]](https://github.com/ACA-Lab-SJTU/token-ring) ⭐️⭐️
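Megatron-style tensor parallelism splits a linear layer across devices: a column-parallel layer lets each rank compute a slice of the output, and the following row-parallel layer produces partial sums that are combined with an all-reduce. The single-process sketch below emulates the two patterns with plain array slicing; `world_size` and the shapes are made-up stand-ins for a real TP group:

```python
import numpy as np

rng = np.random.default_rng(0)
world_size, d_in, d_hidden = 4, 64, 256
x = rng.standard_normal((8, d_in))
W1 = rng.standard_normal((d_in, d_hidden))       # column-parallel: split along output dim
W2 = rng.standard_normal((d_hidden, d_in))       # row-parallel: split along input dim

w1_shards = np.split(W1, world_size, axis=1)     # each "rank" holds one shard of W1 and W2
w2_shards = np.split(W2, world_size, axis=0)

# Column-parallel matmul: every rank produces its own slice of the hidden activation.
hidden_shards = [np.maximum(x @ w1, 0.0) for w1 in w1_shards]        # local ReLU per shard

# Row-parallel matmul: every rank produces a partial sum of the output.
partials = [h @ w2 for h, w2 in zip(hidden_shards, w2_shards)]
y = np.sum(partials, axis=0)                     # the sum stands in for an all-reduce

ref = np.maximum(x @ W1, 0.0) @ W2               # single-device reference
print("max |tensor-parallel - single-device| =", np.abs(y - ref).max())
```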
-
📖LLM Algorithmic/Eval Survey ([©️back👆🏻](#paperlist))
- Understanding LLMs
- LLM-Viewer [[LLM-Viewer]](https://github.com/hahnyuan/LLM-Viewer) ⭐️⭐️
- **Low-bit**
- Evaluating [[Awesome-LLMs-Evaluation]](https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers) ⭐️
- **Runtime Performance**
- ChatGPT Anniversary
- Algorithmic Survey
- Security and Privacy
- **LLMCompass**
- **Efficient LLMs** [[Efficient-LLMs-Survey]](https://github.com/AIoT-MLSys-Lab/Efficient-LLMs-Survey) ⭐️⭐️
- **Serving Survey**
- **LLM Inference**
- **Internal Consistency & Self-Feedback** [[ICSFSurvey]](https://github.com/IAAR-Shanghai/ICSFSurvey) ⭐️⭐️
-
📖Trending LLM/VLM Topics ([©️back👆🏻](#paperlist))
- **Mooncake** [[Mooncake]](https://github.com/kvcache-ai/Mooncake) ⭐️⭐️
- Open-Sora Plan [[Open-Sora-Plan]](https://github.com/PKU-YuanGroup/Open-Sora-Plan) ⭐️⭐️
- Open-Sora [[Open-Sora]](https://github.com/hpcaitech/Open-Sora) ⭐️⭐️
-
📖Disaggregating Prefill and Decoding ([©️back👆🏻](#paperlist))
-
📖Prompt/Context Compression ([©️back👆🏻](#paperlist))
-
📖VLM/Position Embed/Others ([©️back👆🏻](#paperlist))
-
📖Position Embed/Others ([©️back👆🏻](#paperlist))
-
🎉Awesome LLM Inference Papers with Codes — entries per sub-category:

| Sub Category | Entries |
| --- | --- |
| 📖KV Cache Scheduling/Quantize/Dropping | 69 |
| 📖Weight/Activation Quantize/Compress | 50 |
| 📖IO/FLOPs-Aware/Sparse Attention | 40 |
| 📖Long Context Attention/KV Cache Optimization | 38 |
| 📖Parallel Decoding/Sampling | 30 |
| 📖LLM Algorithmic/Eval Survey | 23 |
| 📖LLM Train/Inference Framework/Design | 19 |
| 📖GEMM/Tensor Cores/MMA/Parallel | 18 |
| 📖Continuous/In-flight Batching | 14 |
| 📖Early-Exit/Intermediate Layer Decoding | 14 |
| 📖DeepSeek/Multi-head Latent Attention(MLA) | 13 |
| 📖Prompt/Context/KV Compression | 12 |
| 📖CPU/Single GPU/FPGA/NPU/Mobile Inference | 11 |
| 📖DP/MP/PP/TP/SP/CP Parallelism | 9 |
| 📖Structured Prune/KD/Weight Sparse | 9 |
| 📖Mixture-of-Experts(MoE) LLM Inference | 9 |
| 📖Non Transformer Architecture | 8 |
| 📖LLM Train/Inference Framework | 7 |
| 📖GEMM/Tensor Cores/WMMA/Parallel | 6 |
| 📖CPU/Single GPU/FPGA/Mobile Inference | 5 |
| 📖VLM/Position Embed/Others | 5 |
| 📖Disaggregating Prefill and Decoding | 4 |
| 📖Prompt/Context Compression | 4 |
| 📖Trending LLM/VLM Topics | 3 |
| 📖Position Embed/Others | 2 |