https://github.com/vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
- Host: GitHub
- URL: https://github.com/vllm-project/vllm
- Owner: vllm-project
- License: apache-2.0
- Created: 2023-02-09T11:23:20.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2025-05-12T13:28:59.000Z (9 months ago)
- Last Synced: 2025-05-12T14:48:57.742Z (9 months ago)
- Topics: amd, cuda, deepseek, gpt, hpu, inference, inferentia, llama, llm, llm-serving, llmops, mlops, model-serving, pytorch, qwen, rocm, tpu, trainium, transformer, xpu
- Language: Python
- Homepage: https://docs.vllm.ai
- Size: 49.5 MB
- Stars: 47,114
- Watchers: 384
- Forks: 7,358
- Open Issues: 2,426
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
Awesome Lists containing this project
- awesome-opensource-alternatives-plus - vLLM - High-throughput and memory-efficient LLM serving engine; commercial counterparts: [OpenAI API](https://platform.openai.com/), [Anthropic API](https://www.anthropic.com/api) (Company List / 🔥 Hot New Areas Added in 2025 (AI, Cloud, Science))
- Awesome-LLM-Productization - vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs (Models and Tools / LLM Deployment)
- Awesome-RAG-Production - vLLM
- Ultimate-AI-Resources - vLLM
- Awesome_Multimodel_LLM - vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs (Tools for deploying LLM)
- awesome-llm - vLLM - The go-to inference engine for production; supports PagedAttention with extremely high throughput. (LLM Deployment and Local Running / LLM Evaluation and Data)
- awesome-llm-list - vLLM
- awesome-ml-python-packages - vLLM
- awesome-llmops - vllm - A high-throughput and memory-efficient inference and serving engine for LLMs. (Serving / Large Model Serving)
- StarryDivineSky - vllm-project/vllm
- awesome-llm-services - vLLM
- awesome-llm-projects - vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs. (Projects / 🤯 LLMs Inference and Serving)
- Awesome-LLM - vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs. (LLM Deployment)
- Awesome-LLM-Compression - vLLM
- awesome-production-machine-learning - vLLM - vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs. (Deployment and Serving)
- alan_awesome_llm - vllm - A high-throughput and memory-efficient inference and serving engine for LLMs. (Inference)
- awesome-private-ai - vLLM - High-throughput, low-latency inference engine for LLMs. (Inference Runtimes & Backends)
- awesome - vllm-project/vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (Python)
- awesome-llm-and-aigc - vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs. [docs.vllm.ai](https://docs.vllm.ai/) (Summary)
- awesome-cuda-and-hpc - vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs. [docs.vllm.ai](https://docs.vllm.ai/) (Frameworks)
- awesome-llm-tools - vLLM
- Awesome-LLMs-on-device - vLLM
- awesome-data-analysis - vLLM - High-throughput and memory-efficient inference library for LLMs. (🚀 MLOps / Tools)
- Awesome-Local-LLM - vLLM
- awesome-thesis-tools - vLLM
- awesome-LLM-resources - vllm (`🔥`) - A high-throughput and memory-efficient inference and serving engine for LLMs. (Inference)
- awesome-repositories - vllm-project/vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (Python)
- AiTreasureBox - vllm-project/vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (Repos)
- awesome-local-llms - vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (Open-Source Local LLM Projects)
- best-of-ai-open-source - vLLM (LLM Frameworks & Libraries)
- Awesome-LLMOps - vLLM - A high-throughput and memory-efficient inference and serving engine for LLMs. (Inference / Inference Engine)
- awesome - vllm-project/vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (Python)
- Awesome-LLM-VLM-Foundation-Models - vLLM
- awesome-hacking-lists - vllm-project/vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (Python)
- awesome-local-ai - vLLM - vLLM is a fast and easy-to-use library for LLM inference and serving. (Inference Engine)
- awesome-gpu-engineering - vLLM - Inference and serving engine for LLMs (⚙️ Systems and Multi-GPU Engineering)
- awesome - vllm-project/vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (Python)
- awesome-genai - vllm - Efficient LLM Serving
- awesome-local-ai - vLLM - throughput serving with PagedAttention (Inference Engines & Backends (22))
- awesome-production-llm - vllm - A high-throughput and memory-efficient inference and serving engine for LLMs (LLM Serving / Inference)
- awesome-local-llm - vllm - a high-throughput and memory-efficient inference and serving engine for LLMs (Inference engines)
- awesome-ai-engineering - vLLM - "vLLM is a fast and easy-to-use library for LLM inference and serving". (Tools / Inference)
- awesome-llmops - vLLM - efficient inference for LLMs with continuous batching. (Serving & Inference)
README
Easy, fast, and cheap LLM serving for everyone
| Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |
🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.
For events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.
---
## About
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
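As a rough sketch of how these throughput features surface in the offline Python API (the model name and tuning values below are illustrative placeholders, not recommendations):

```python
from vllm import LLM, SamplingParams

# Any supported Hugging Face checkpoint works; a tiny model keeps the example cheap.
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.90,  # fraction of GPU memory PagedAttention may budget for weights + KV cache
    max_num_seqs=256,             # cap on requests the continuous-batching scheduler runs concurrently
    # quantization="awq",         # for quantized checkpoints; other options include "gptq" and "fp8"
)

sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Submit all prompts at once; the engine interleaves them via continuous batching.
prompts = [f"Write a one-line fun fact about the number {i}." for i in range(32)]
for out in llm.generate(prompts, sampling):
    print(out.outputs[0].text.strip())
```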
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPUs, as well as diverse hardware plugins such as Intel Gaudi, IBM Spyre, and Huawei Ascend
- Prefix caching support
- Multi-LoRA support
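A minimal client-side sketch against the OpenAI-compatible server, assuming the `openai` Python package and an illustrative model name and port:

```python
# Start the server in another shell first, e.g.:
#   vllm serve Qwen/Qwen2.5-1.5B-Instruct --tensor-parallel-size 1
# (--tensor-parallel-size > 1 shards the model across multiple GPUs)

from openai import OpenAI

# vLLM serves the standard OpenAI REST routes; the API key is unused unless one is configured.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
)
print(response.choices[0].message.content)
```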
vLLM seamlessly supports most popular open-source models on Hugging Face, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).
## Getting Started
Install vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):
```bash
pip install vllm
```
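As a quick smoke test after installation (mirroring the Quickstart; the model name is just an example):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # downloads the checkpoint from Hugging Face on first use
params = SamplingParams(temperature=0.8, top_p=0.95)

for output in llm.generate(["Hello, my name is", "The capital of France is"], params):
    print(output.prompt, "->", output.outputs[0].text)
```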
Visit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.
- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)
- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)
- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
## Contributing
We welcome and value any contributions and collaborations.
Please check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.
## Citation
If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```
## Contact Us
- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)
- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)
- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)
- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature
- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)
## Media Kit
- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)