{"id":27149024,"url":"https://github.com/mahshid1378/production-stack","last_synced_at":"2026-06-23T18:31:13.387Z","repository":{"id":283467792,"uuid":"951863274","full_name":"mahshid1378/production-stack","owner":"mahshid1378","description":"vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization","archived":false,"fork":false,"pushed_at":"2025-03-20T13:21:14.000Z","size":1943,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-05T02:30:03.953Z","etag":null,"topics":["artificial-intelligence","image-classification","image-processing","vllm"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mahshid1378.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-03-20T11:09:56.000Z","updated_at":"2025-03-20T13:21:17.000Z","dependencies_parsed_at":"2025-03-21T14:03:27.596Z","dependency_job_id":null,"html_url":"https://github.com/mahshid1378/production-stack","commit_stats":null,"previous_names":["mahshid1378/production-stack"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mahshid1378/production-stack","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahshid1378%2Fproduction-stack","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahshid1378
%2Fproduction-stack/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahshid1378%2Fproduction-stack/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahshid1378%2Fproduction-stack/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mahshid1378","download_url":"https://codeload.github.com/mahshid1378/production-stack/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mahshid1378%2Fproduction-stack/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34702910,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-23T02:00:07.161Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["artificial-intelligence","image-classification","image-processing","vllm"],"created_at":"2025-04-08T12:35:16.752Z","updated_at":"2026-06-23T18:31:13.356Z","avatar_url":"https://github.com/mahshid1378.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# vLLM Production Stack: reference stack for production vLLM deployment\n\n## Introduction\n\n**vLLM Production Stack** project provides a reference implementation on how to build an inference stack on top of vLLM, which allows you to:\n\n- 🚀 Scale from single vLLM instance to distributed vLLM deployment 
without changing any application code\n- 💻 Monitor the deployment through a web dashboard\n- 😄 Enjoy the performance benefits brought by request routing and KV cache offloading\n\n## Step-By-Step Tutorials\n\n0. How to [*Install Kubernetes (kubectl, helm, minikube, etc.)*]?\n1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs, Azure)*]?\n2. How to [*Set Up a Minimal vLLM Production Stack*]?\n3. How to [*Customize vLLM Configs (optional)*]?\n4. How to [*Load Your LLM Weights*]?\n5. How to [*Launch Different LLMs in vLLM Production Stack*]?\n6. How to [*Enable KV Cache Offloading with LMCache*]?\n\n## Architecture\n\nThe stack contains the following key parts:\n\n- **Serving engine**: The vLLM engines that run different LLMs.\n- **Request router**: Directs requests to appropriate backends based on routing keys or session IDs to maximize KV cache reuse.\n- **Observability stack**: Monitors the metrics of the backends through [Prometheus] + [Grafana].\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/user-attachments/assets/8f05e7b9-0513-40a9-9ba9-2d3acca77c0c\" alt=\"Architecture of the stack\" width=\"80%\"/\u003e\n\u003c/p\u003e\n\n## Roadmap\n\nWe are actively working on this project and will release the following features soon. Please stay tuned!\n\n- **Autoscaling** based on vLLM-specific metrics\n- Support for **disaggregated prefill**\n- **Router improvements** (e.g., a more performant router written in a non-Python language, a KV-cache-aware routing algorithm, better fault tolerance, etc.)\n\n## Deploying the stack via Helm\n\n### Prerequisites\n\n- A running Kubernetes (K8s) environment with GPUs\n  - Run `cd utils \u0026\u0026 bash install-minikube-cluster.sh`\n  - Or follow our [tutorial](tutorials/00-install-kubernetes-env.md)\n\n### Deployment\n\nvLLM Production Stack can be deployed via Helm charts. 
Clone the repository locally and run the following commands for a minimal deployment:\n\n```bash\ngit clone https://github.com/vllm-project/production-stack.git\ncd production-stack/\nhelm repo add vllm https://vllm-project.github.io/production-stack\nhelm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml\n```\n\n### Uninstall\n\n```bash\nhelm uninstall vllm\n```\n\n## Grafana Dashboard\n\n### Features\n\nThe Grafana dashboard provides the following insights:\n\n1. **Available vLLM Instances**: Displays the number of healthy instances.\n2. **Request Latency Distribution**: Visualizes end-to-end request latency.\n3. **Time-to-First-Token (TTFT) Distribution**: Shows how long requests wait before the first token is generated.\n4. **Number of Running Requests**: Tracks the number of active requests per instance.\n5. **Number of Pending Requests**: Tracks requests waiting to be processed.\n6. **GPU KV Usage Percent**: Monitors GPU KV cache usage.\n7. **GPU KV Cache Hit Rate**: Displays the hit rate for the GPU KV cache.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/user-attachments/assets/05766673-c449-4094-bdc8-dea6ac28cb79\" alt=\"Grafana dashboard to monitor the deployment\" width=\"80%\"/\u003e\n\u003c/p\u003e\n\n### Configuration\n\nSee the details in [`observability/README.md`](./observability/README.md).\n\n## Router\n\nThe router ensures efficient request distribution among backends. 
It supports:\n\n- Routing to endpoints that run different models\n- Exporting observability metrics for each serving engine instance, including QPS, time-to-first-token (TTFT), number of pending/running/finished requests, and uptime\n- Automatic service discovery and fault tolerance through the Kubernetes API\n- Multiple routing algorithms\n  - Round-robin routing\n  - Session-ID-based routing\n  - (WIP) Prefix-aware routing\n\nPlease refer to the [router documentation](./src/vllm_router/README.md) for more details.\n\n## Contributing\n\nWe welcome and value any contributions and collaborations. Please check out [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.\n\n## License\n\nThis project is licensed under the Apache License 2.0. See the `LICENSE` file for details.\n\n---\n\nFor any issues or questions, feel free to open an issue or contact us ([@ApostaC], [@YuhanLiu11], [@Shaoting-Feng]).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahshid1378%2Fproduction-stack","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmahshid1378%2Fproduction-stack","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmahshid1378%2Fproduction-stack/lists"}