{"id":13409248,"url":"https://github.com/vllm-project/vllm","last_synced_at":"2026-01-29T12:02:57.878Z","repository":{"id":176349946,"uuid":"599547518","full_name":"vllm-project/vllm","owner":"vllm-project","description":"A high-throughput and memory-efficient inference and serving engine for LLMs","archived":false,"fork":false,"pushed_at":"2025-05-12T13:28:59.000Z","size":51878,"stargazers_count":47114,"open_issues_count":2426,"forks_count":7358,"subscribers_count":384,"default_branch":"main","last_synced_at":"2025-05-12T14:48:57.742Z","etag":null,"topics":["amd","cuda","deepseek","gpt","hpu","inference","inferentia","llama","llm","llm-serving","llmops","mlops","model-serving","pytorch","qwen","rocm","tpu","trainium","transformer","xpu"],"latest_commit_sha":null,"homepage":"https://docs.vllm.ai","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vllm-project.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["vllm-project"],"open_collective":"vllm"}},"created_at":"2023-02-09T11:23:20.000Z","updated_at":"2025-05-12T14:39:13.000Z","dependencies_parsed_at":"2023-09-28T04:33:26.629Z","dependency_job_id":"de85f8f0-6fd4-4d32-b9f5-021ba85120ec","html_url":"https://github.com/vllm-project/vllm","commit_stats":{"total_commits":4106,"total_committers":701,"mean_commits":5.85734664764622,"dds":0.8816366293229421,"last_synced_commit":"7a3a83e3b87f50fe9c0985a5c5bcc1d4cf2e95cd"},"previous_names":["vllm-project/vllm"],"tags_count":62,"
template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vllm-project%2Fvllm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vllm-project","download_url":"https://codeload.github.com/vllm-project/vllm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253759774,"owners_count":21959818,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amd","cuda","deepseek","gpt","hpu","inference","inferentia","llama","llm","llm-serving","llmops","mlops","model-serving","pytorch","qwen","rocm","tpu","trainium","transformer","xpu"],"created_at":"2024-07-30T20:00:59.195Z","updated_at":"2026-01-24T22:30:01.831Z","avatar_url":"https://github.com/vllm-project.png","language":"Python","funding_links":["https://github.com/sponsors/vllm-project","https://opencollective.com/vllm"],"categories":["Models and Tools","Python","🚀 Deployment \u0026 Serving","🔬 AI 研究ツール","Uncategorized","*Ops for AI","🎯 Tool Categories","Tools for deploying LLM","LLM部署与本地运行","INFERENCING FRAMEWORKS","LLM","Serving","MLOps \u0026 Deployment","Software","Large Model Serving","A01_文本生成_文本对话","\u003cimg src=\"./assets/cpu.svg\" width=\"16\" height=\"16\" style=\"vertical-align: 
middle;\"\u003e Backends","Projects","LLM Deployment","Tools","3. Inference Engines \u0026 Serving","NLP","Inference \u0026 Deployment","🏗️ Reference Implementations \u0026 Case Studies","HarmonyOS","Deployment and Serving","Tooling \u0026 Infrastructure","推理 Inference","openai compatible inference engines","公司列表","Deep Learning","Agent Infrastructure","Summary","Frameworks","⚡ LLM Inference \u0026 Hosting","Awesome Open-Sourced LLMSys Projects","Hardware Acceleration and Deployment Strategies","Collections","🚀 MLOps","Local Inference","Generative KI","pytorch","🔓 Open Source Inference Engines","🏠 Local and Self-Hosted AI","Inference Engines","LLM Inference","Repos","⚙️ Systems and Multi-GPU Engineering","Model Serving Frameworks","开发者工具 \u0026 AI Infra","Language Models","🚀 Model Serving \u0026 Deployment","Inference Engine","Open-Source Local LLM Projects","Inference engines","🛠️ AI 工具与框架","LLM Serving / Inference","Model Serving","🔢 Papers - Parametric Memory","🧠 Large Language Models (LLMs)","2. **Production Tools**","Inference Engines \u0026 Backends (22)","LLMs Backend","10. Agent Deployment","LLM Frameworks \u0026 Libraries","8. Inference Engines","Language Models for NLP","Inference \u0026 Serving","Inference Runtimes \u0026 Backends","🖥 Local Deployment Tools","Runtime Engines","Tools for Deployment","Serving \u0026 Inference","AI \u0026 LLM","Local Inference and Serving","Supporting Infrastructure","Research \u0026 Data Analysis","Local \u0026 Open AI","Inference","Model Inference","Caching"],"sub_categories":["LLM Deployment","3. 
The Enterprise / High-Scale Stack (The 1%)","自動運転","Uncategorized","Model Serving \u0026 Inference","🤖 LLMOps \u0026 GenAI (2024-2025)","LLM 评估与数据","Large Model Serving","Model Serving","vLLM","大语言对话模型及数据","🤯 LLMs Inference and Serving","Other","High-Performance Inference","T18 · Inference \u0026 Serving","Windows Manager","Deployment \u0026 Optimization","🔥 2025 新增热门领域 (AI, Cloud, Science)","Embedding Models","Distributed Systems","Popular On-Device LLMs Framework","Useful Repositories","Tools","Local LLM Runners","Platform Guides","LangManus","推理与部署","Inference \u0026 Serving","LLM 推理与部署","UI/Interface Tutorials","🎥 Multimodal Memory (for Generation)","🛠️ Self-Hosted Solutions","Self-Hosted","Server / Production","Efficient and Small Language Models","Inference Engines","Server Deployment \u0026 High-Performance Inference","LLM Apps \u0026 Interfaces","Serve at scale","Inference","Inference Engine","Inference infrastructure KV cache"],"readme":"\u003c!-- markdownlint-disable MD001 MD041 --\u003e\n\u003cp align=\"center\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-dark.png\"\u003e\n    \u003cimg alt=\"vLLM\" src=\"https://raw.githubusercontent.com/vllm-project/vllm/main/docs/assets/logos/vllm-logo-text-light.png\" width=55%\u003e\n  \u003c/picture\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003e\nEasy, fast, and cheap LLM serving for everyone\n\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n| \u003ca href=\"https://docs.vllm.ai\"\u003e\u003cb\u003eDocumentation\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://blog.vllm.ai/\"\u003e\u003cb\u003eBlog\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://arxiv.org/abs/2309.06180\"\u003e\u003cb\u003ePaper\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://x.com/vllm_project\"\u003e\u003cb\u003eTwitter/X\u003c/b\u003e\u003c/a\u003e | \u003ca 
href=\"https://discuss.vllm.ai\"\u003e\u003cb\u003eUser Forum\u003c/b\u003e\u003c/a\u003e | \u003ca href=\"https://slack.vllm.ai\"\u003e\u003cb\u003eDeveloper Slack\u003c/b\u003e\u003c/a\u003e |\n\u003c/p\u003e\n\n🔥 We have built a vLLM website to help you get started with vLLM. Please visit [vllm.ai](https://vllm.ai) to learn more.\nFor events, please visit [vllm.ai/events](https://vllm.ai/events) to join us.\n\n---\n\n## About\n\nvLLM is a fast and easy-to-use library for LLM inference and serving.\n\nOriginally developed in the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.\n\nvLLM is fast with:\n\n- State-of-the-art serving throughput\n- Efficient management of attention key and value memory with [**PagedAttention**](https://blog.vllm.ai/2023/06/20/vllm.html)\n- Continuous batching of incoming requests\n- Fast model execution with CUDA/HIP graphs\n- Quantizations: [GPTQ](https://arxiv.org/abs/2210.17323), [AWQ](https://arxiv.org/abs/2306.00978), [AutoRound](https://arxiv.org/abs/2309.05516), INT4, INT8, and FP8\n- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer\n- Speculative decoding\n- Chunked prefill\n\nvLLM is flexible and easy to use with:\n\n- Seamless integration with popular Hugging Face models\n- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more\n- Tensor, pipeline, data, and expert parallelism support for distributed inference\n- Streaming outputs\n- OpenAI-compatible API server\n- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPUs. 
Additionally, plugins support further hardware such as Intel Gaudi, IBM Spyre, and Huawei Ascend.\n- Prefix caching support\n- Multi-LoRA support\n\nvLLM seamlessly supports most popular open-source models on Hugging Face, including:\n\n- Transformer-like LLMs (e.g., Llama)\n- Mixture-of-Experts LLMs (e.g., Mixtral, DeepSeek-V2 and V3)\n- Embedding Models (e.g., E5-Mistral)\n- Multi-modal LLMs (e.g., LLaVA)\n\nFind the full list of supported models [here](https://docs.vllm.ai/en/latest/models/supported_models.html).\n\n## Getting Started\n\nInstall vLLM with `pip` or [from source](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source):\n\n```bash\npip install vllm\n```\n\nVisit our [documentation](https://docs.vllm.ai/en/latest/) to learn more.\n\n- [Installation](https://docs.vllm.ai/en/latest/getting_started/installation.html)\n- [Quickstart](https://docs.vllm.ai/en/latest/getting_started/quickstart.html)\n- [List of Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)\n\n## Contributing\n\nWe welcome and value any contributions and collaborations.\nPlease check out [Contributing to vLLM](https://docs.vllm.ai/en/latest/contributing/index.html) for how to get involved.\n\n## Citation\n\nIf you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):\n\n```bibtex\n@inproceedings{kwon2023efficient,\n  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},\n  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. 
Gonzalez and Hao Zhang and Ion Stoica},\n  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},\n  year={2023}\n}\n```\n\n## Contact Us\n\n\u003c!-- --8\u003c-- [start:contact-us] --\u003e\n- For technical questions and feature requests, please use GitHub [Issues](https://github.com/vllm-project/vllm/issues)\n- For discussing with fellow users, please use the [vLLM Forum](https://discuss.vllm.ai)\n- For coordinating contributions and development, please use [Slack](https://slack.vllm.ai)\n- For security disclosures, please use GitHub's [Security Advisories](https://github.com/vllm-project/vllm/security/advisories) feature\n- For collaborations and partnerships, please contact us at [collaboration@vllm.ai](mailto:collaboration@vllm.ai)\n\u003c!-- --8\u003c-- [end:contact-us] --\u003e\n\n## Media Kit\n\n- If you wish to use vLLM's logo, please refer to [our media kit repo](https://github.com/vllm-project/media-kit)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fvllm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvllm-project%2Fvllm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvllm-project%2Fvllm/lists"}