https://github.com/mahshid1378/production-stack
vLLM’s reference system for K8S-native cluster-wide deployment with community-driven performance optimization
artificial-intelligence image-classification image-processing vllm
- Host: GitHub
- URL: https://github.com/mahshid1378/production-stack
- Owner: mahshid1378
- License: apache-2.0
- Created: 2025-03-20T11:09:56.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-20T13:21:14.000Z (over 1 year ago)
- Last Synced: 2025-11-05T02:30:03.953Z (8 months ago)
- Topics: artificial-intelligence, image-classification, image-processing, vllm
- Language: Python
- Homepage:
- Size: 1.85 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
README
# vLLM Production Stack: reference stack for production vLLM deployment
## Introduction
The **vLLM Production Stack** project provides a reference implementation of an inference stack built on top of vLLM, which allows you to:
- 🚀 Scale from a single vLLM instance to a distributed vLLM deployment without changing any application code
- 💻 Monitor the metrics through a web dashboard
- 😄 Enjoy the performance benefits brought by request routing and KV cache offloading
## Step-By-Step Tutorials
0. How to [*Install Kubernetes (kubectl, helm, minikube, etc.)*]?
1. How to [*Deploy Production Stack on Major Cloud Platforms (AWS, GCP, Lambda Labs, Azure)*]?
2. How to [*Set Up a Minimal vLLM Production Stack*]?
3. How to [*Customize vLLM Configs (optional)*]?
4. How to [*Load Your LLM Weights*]?
5. How to [*Launch Different LLMs in vLLM Production Stack*]?
6. How to [*Enable KV Cache Offloading with LMCache*]?
## Architecture
The stack contains the following key parts:
- **Serving engine**: The vLLM engines that run different LLMs
- **Request router**: Directs requests to appropriate backends based on routing keys or session IDs to maximize KV cache reuse.
- **Observability stack**: Monitors backend metrics through [Prometheus] and [Grafana]
## Roadmap
We are actively working on this project and will release the following features soon. Please stay tuned!
- **Autoscaling** based on vLLM-specific metrics
- Support for **disaggregated prefill**
- **Router improvements** (e.g., a more performant router written in a non-Python language, a KV-cache-aware routing algorithm, better fault tolerance, etc.)
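
To make "autoscaling based on vLLM-specific metrics" concrete, here is a minimal sketch of a proportional scaling rule in the style of the Kubernetes Horizontal Pod Autoscaler formula. It is not the project's autoscaler, which is still on the roadmap. The function name, the choice of pending requests per replica as the metric, and the replica bounds are all assumptions made for illustration.

```python
import math


def desired_replicas(current_replicas, pending_per_replica, target_pending,
                     min_replicas=1, max_replicas=8):
    """Proportional scaling rule: scale replicas by (observed / target).

    All parameters are hypothetical; `pending_per_replica` stands in for any
    vLLM-specific metric averaged over the current replicas.
    """
    if target_pending <= 0:
        raise ValueError("target_pending must be positive")
    raw = math.ceil(current_replicas * pending_per_replica / target_pending)
    # Clamp to the allowed range so the deployment never scales to zero
    # or beyond the available capacity.
    return max(min_replicas, min(max_replicas, raw))


# Two replicas, each with 10 pending requests against a target of 5.
print(desired_replicas(2, 10, 5))  # prints 4
```

A real controller would also need smoothing or a cooldown window to avoid reacting to short bursts.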
## Deploying the stack via Helm
### Prerequisites
- A running Kubernetes (K8s) environment with GPUs
  - Run `cd utils && bash install-minikube-cluster.sh`
  - Or follow our [tutorial](tutorials/00-install-kubernetes-env.md)
### Deployment
vLLM Production Stack can be deployed via Helm charts. Clone the repository locally and run the following commands for a minimal deployment:
```bash
git clone https://github.com/vllm-project/production-stack.git
cd production-stack/
helm repo add vllm https://vllm-project.github.io/production-stack
helm install vllm vllm/vllm-stack -f tutorials/assets/values-01-minimal-example.yaml
```
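
Once the release is running, application code can talk to the stack the same way it would talk to a single vLLM server. The sketch below builds a request for the OpenAI-compatible `/v1/completions` route using only the Python standard library. The base URL, the port, and the model name are assumptions: they depend on how the router service is exposed and on the model configured in your values file.

```python
import json
import urllib.request

# Assumption: the router has been made reachable at this address, for example
# through a port-forward. The host and port are illustrative only.
BASE_URL = "http://localhost:30080"


def build_completion_request(base_url, model, prompt, max_tokens=32):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt):
    """Send the request and return the text of the first completion."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling `complete(BASE_URL, "<your-model-name>", "Once upon a time,")` requires a live deployment. Because the client only depends on the base URL, the same code works whether one engine or many sit behind the router.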
### Uninstall
```bash
helm uninstall vllm
```
## Grafana Dashboard
### Features
The Grafana dashboard provides the following insights:
1. **Available vLLM Instances**: Displays the number of healthy instances.
2. **Request Latency Distribution**: Visualizes end-to-end request latency.
3. **Time-to-First-Token (TTFT) Distribution**: Monitors how long each request waits before its first token is generated.
4. **Number of Running Requests**: Tracks the number of active requests per instance.
5. **Number of Pending Requests**: Tracks requests waiting to be processed.
6. **GPU KV Usage Percent**: Monitors GPU KV cache usage.
7. **GPU KV Cache Hit Rate**: Displays the hit rate for the GPU KV cache.
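
The panels above are built from metrics that Prometheus scrapes in its text exposition format. As a rough illustration of what that raw data looks like, the sketch below parses a small sample into a dictionary. The metric names in `SAMPLE` are assumptions based on vLLM's `vllm:` prefix convention, and the sample values are made up. Check the `/metrics` output of your own deployment for the exact names.

```python
import re

# Assumption: illustrative metric names and made-up values.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="m"} 3.0
vllm:num_requests_waiting{model_name="m"} 1.0
vllm:gpu_cache_usage_perc{model_name="m"} 0.42
"""

# metric name, optional {labels}, then the sample value.
# Simplification: label values containing "}" are not handled.
LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def parse_metrics(text):
    """Return {metric_name: [values]} from Prometheus text exposition."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE.match(line)
        if match is None:
            continue
        name, _labels, value = match.groups()
        samples.setdefault(name, []).append(float(value))
    return samples


metrics = parse_metrics(SAMPLE)
print(metrics["vllm:num_requests_running"])  # prints [3.0]
```

Each metric maps to a list because one name can appear several times with different label sets, for example once per model.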
### Configuration
See the details in [`observability/README.md`](./observability/README.md)
## Router
The router ensures efficient request distribution among backends. It supports:
- Routing to endpoints that run different models
- Exporting observability metrics for each serving engine instance, including QPS, time-to-first-token (TTFT), number of pending/running/finished requests, and uptime
- Automatic service discovery and fault tolerance through the Kubernetes API
- Multiple routing algorithms:
  - Round-robin routing
  - Session-ID based routing
  - (WIP) prefix-aware routing
Please refer to the [router documentation](./src/vllm_router/README.md) for more details.
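
The two stable routing modes listed above can be sketched in a few lines. This is an illustrative model, not the router's actual code: the class names, the `route` method, and the backend URLs are hypothetical, and the session router uses plain modulo hashing as a stand-in for whatever scheme the real router applies.

```python
import hashlib
import itertools


class RoundRobinRouter:
    """Cycle through the backends in a fixed order."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def route(self, session_id=None):
        # The session ID is ignored; every call advances to the next backend.
        return next(self._cycle)


class SessionRouter:
    """Send every request of a session to the same backend via hashing."""

    def __init__(self, backends):
        self._backends = list(backends)

    def route(self, session_id):
        # A stable hash keeps the mapping identical across processes
        # (Python's built-in hash() is randomized per process for strings).
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self._backends)
        return self._backends[index]


backends = ["http://engine-0:8000", "http://engine-1:8000"]  # hypothetical
sessions = SessionRouter(backends)
assert sessions.route("user-42") == sessions.route("user-42")
```

Pinning a session to one backend is what allows that backend's KV cache to be reused across the session's requests. Modulo hashing remaps many sessions whenever the backend set changes, so a production router would likely prefer consistent hashing.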
## Contributing
We welcome and value any contributions and collaborations. Please check out [CONTRIBUTING.md](CONTRIBUTING.md) for how to get involved.
## License
This project is licensed under the Apache License 2.0. See the `LICENSE` file for details.
---
For any issues or questions, feel free to open an issue or contact us ([@ApostaC], [@YuhanLiu11], [@Shaoting-Feng]).