{"id":13409248,"url":"https://github.com/vllm-project/vllm","last_synced_at":"2026-01-29T12:02:57.878Z","repository":{"id":176349946,"uuid":"599547518","full_name":"vllm-project/vllm","owner":"vllm-project","description":"A high-throughput and memory-efficient inference and serving engine for LLMs","archived":false,"fork":false,"pushed_at":"2025-05-12T13:28:59.000Z","size":51878,"stargazers_count":47114,"open_issues_count":2426,"forks_count":7358,"subscribers_count":384,"default_branch":"main","last_synced_at":"2025-05-12T14:48:57.742Z","etag":null,"topics":["amd","cuda","deepseek","gpt","hpu","inference","inferentia","llama","llm","llm-serving","llmops","mlops","model-serving","pytorch","qwen","rocm","tpu","trainium","transformer","xpu"],"latest_commit_sha":null,"homepage":"https://docs.vllm.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vllm-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["vllm-project"],"open_collective":"vllm"}},"created_at":"2023-02-09T11:23:20.000Z","updated_at":"2025-05-12T14:39:13.000Z","dependencies_parsed_at":"2023-09-28T04:33:26.629Z","dependency_job_id":"de85f8f0-6fd4-4d32-b9f5-021ba85120ec","html_url":"https://github.com/vllm-project/vllm","commit_stats":{"total_commits":4106,"total_committers":701,"mean_commits":5.85734664764622,"dds":0.8816366293229421,"last_synced_commit":"7a3a83e3b87f50fe9c0985a5c5bcc1d4cf2e95cd"},"previous_names":["vllm-project/vllm"],"tags_count":62,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vllm-project","download_url":"https://codeload.github.com/vllm-project/vllm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253759774,"owners_count":21959818,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amd","cuda","deepseek","gpt","hpu","inference","inferentia","llama","llm","llm-serving","llmops","mlops","model-serving","pytorch","qwen","rocm","tpu","trainium","transformer","xpu"],"created_at":"2024-07-30T20:00:59.195Z","updated_at":"2026-01-24T22:30:01.831Z","avatar_url":"https://github.com/vllm-project.png","language":"Python","readme":"\u003c!-- markdownlint-disable MD001 MD041 --\u003e\n\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png\"\u003e\n    \u003cimg alt=\"vLLM\" src=\"https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png\" width=55%\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\nEasy, fast, and cheap LLM serving for everyone\n\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n| \u003ca href=\"https://docs.vllm.ai\"\u003e\u003cb\u003eDocumentation\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://blog.vllm.ai/\"\u003e\u003cb\u003eBlog\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://arxiv.org/abs/2309.06180\"\u003e\u003cb\u003ePaper\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://x.com/vllm_project\"\u003e\u003cb\u003eTwitter/X\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://discuss.vllm.ai\"\u003e\u003cb\u003eUser Forum\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://slack.vllm.ai\"\u003e\u003cb\u003eDeveloper Slack\u003c/b\u003e\u003c/a\u003e |\n\u003c/p\u003e\n\n🔥 We have built a vllm website to help you get started with vllm. Please visit [vllm.ai](https://vllm.ai) to learn more.\nFor events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.\n\n---\n\n## About\n\nvLLM is a fast and easy-to-use library for LLM inference and serving.\n\nOriginally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.\n\nvLLM is fast with:\n\n- State-of-the-art serving throughput\n- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)\n- Continuous batching of incoming requests\n- Fast model execution with CUDA/HIP graph\n- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8\n- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer\n- Speculative decoding\n- Chunked prefill\n\nvLLM is flexible and easy to use with:\n\n- Seamless integration with popular Hugging Face models\n- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more\n- Tensor, pipeline, data and expert parallelism support for distributed inference\n- Streaming outputs\n- OpenAI-compatible API server\n- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.\n- Prefix caching support\n- Multi-LoRA support\n\nvLLM seamlessly supports most popular open-source models on HuggingFace, including:\n\n- Transformer-like LLMs (e.g., Llama)\n- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)\n- Embedding Models (e.g., E5-Mistral)\n- Multi-modal LLMs (e.g., LLaVA)\n\nFind the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).\n\n## Getting Started\n\nInstall vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):\n\n```bash\npip install vllm\n```\n\nVisit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.\n\n- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)\n- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)\n- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)\n\n## Contributing\n\nWe welcome and value any contributions and collaborations.\nPlease check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.\n\n## Citation\n\nIf you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):\n\n```bibtex\n@inproceedings{kwon2023efficient,\n  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},\n  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},\n  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},\n  year={2023}\n}\n```\n\n## Contact Us\n\n\u003c!-- --8\u003c-- [start:contact-us] --\u003e\n- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)\n- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)\n- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)\n- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature\n- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)\n\u003c!-- --8\u003c-- [end:contact-us] --\u003e\n\n## Media Kit\n\n- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)\n","funding_links":["https://github.com/sponsors/vllm-project","https://opencollective.com/vllm"],"categories":["Models and Tools","Python","🚀 Deployment \u0026 Serving","Uncategorized","🎯 Tool Categories","Tools for deploying LLM","Software","LLM部署与本地运行","INFERENCING FRAMEWORKS","LLM","Serving","MLOps \u0026 Deployment","Large Model Serving","A01_文本生成_文本对话","\u003cimg src=\"./assets/cpu.svg\" width=\"16\" height=\"16\" style=\"vertical-align: middle;\"\u003e Backends","Projects","LLM Deployment","Tools","NLP","Inference \u0026 Deployment","HarmonyOS","Deployment and Serving","推理 Inference","openai compatible inference engines","Inference Runtimes \u0026 Backends","Deep Learning","Summary","Frameworks","Awesome Open-Sourced LLMSys Projects","Hardware Acceleration and Deployment Strategies","Collections","🚀 MLOps","Local Inference","Generative KI","pytorch","🔓 Open Source Inference Engines","🏠 Local and Self-Hosted AI","Inference Engines","LLM Inference","Repos","⚙️ Systems and Multi-GPU Engineering","Model Serving Frameworks","Inference","开发者工具 \u0026 AI Infra","Language Models","🚀 Model Serving \u0026 Deployment","Inference Engine","Open-Source Local LLM Projects","Inference engines","🛠️ AI 工具与框架","LLM Serving / Inference","Model Serving","🖥 Local Deployment Tools","🧠 Large Language Models (LLMs)","Inference Engines \u0026 Backends (22)","LLM Frameworks \u0026 Libraries","Inference \u0026 Serving","Runtime Engines","Tools for Deployment","Serving \u0026 Inference","LLMs Backend","公司列表"],"sub_categories":["LLM Deployment","3. The Enterprise / High-Scale Stack (The 1%)","Uncategorized","🤖 LLMOps \u0026 GenAI (2024-2025)","vLLM","LLM 评估与数据","Large Model Serving","Model Serving","大语言对话模型及数据","🤯 LLMs Inference and Serving","Other","High-Performance Inference","Windows Manager","Distributed Systems","Popular On-Device LLMs Framework","Useful Repositories","Tools","Local LLM Runners","Platform Guides","LangManus","Inference Engine","推理与部署","Inference \u0026 Serving","LLM 推理与部署","UI/Interface Tutorials","Server Deployment \u0026 High-Performance Inference","🛠️ Self-Hosted Solutions","Inference","Inference Engines","🔥 2025 新增热门领域 (AI, Cloud, Science)"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fvllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvllm-project%2Fvllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fvllm/lists"}