https://github.com/AlibabaPAI/llumnix
Efficient and easy multi-instance LLM serving
- Host: GitHub
- URL: https://github.com/AlibabaPAI/llumnix
- Owner: AlibabaPAI
- License: apache-2.0
- Created: 2024-05-20T06:13:59.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-03-25T12:58:04.000Z (21 days ago)
- Last Synced: 2025-03-25T13:44:10.244Z (21 days ago)
- Language: Python
- Homepage:
- Size: 3.55 MB
- Stars: 343
- Watchers: 11
- Forks: 27
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
- Support: docs/Supported_Models.md
Awesome Lists containing this project
- Awesome-LLMOps - Llumnix - Efficient and easy multi-instance LLM serving. (Inference / Inference Engine)
README
Llumnix
Efficient and easy multi-instance LLM serving

---
## 🔥 Latest News
- [2025.1] We updated the supported vLLM version to v0.6.3.post1.
- [2024.11] Llumnix v0.1.0 launched!
- [2024.7] We officially released the first version of Llumnix.
- [2024.6] We released our OSDI '24 [research paper](https://arxiv.org/abs/2406.03243) on arXiv.

## 🚀 Why Llumnix
Llumnix is a cross-instance request scheduling layer built on top of LLM inference engines such as [vLLM](https://github.com/vllm-project/vllm).
Llumnix provides optimized multi-instance serving performance in terms of:
- *Low latency*
  - **Reduced time-to-first-token** (TTFT) and queuing delays with less memory fragmentation
  - **Reduced time-between-tokens** (TBT) and preemption stalls with better load balancing
- *High throughput*
  - Integration with state-of-the-art inference engines
  - Support for techniques like prefill-decode disaggregation

Llumnix achieves this with:

- Dynamic, fine-grained, KV-cache-aware scheduling (a toy sketch of such a policy follows the lists below)
- Continuous **rescheduling** across instances
  - Enabled by a KV cache migration mechanism with near-zero overhead
  - Exploited for continuous load balancing, de-fragmentation, and prefill-decode disaggregation

Llumnix is easy to use with:

- Minimal code changes required for vanilla vLLM deployments
- Seamless integration with existing multi-instance deployment platforms
- Fault tolerance, elasticity, and high service availability
- Extensibility to more inference engines and scheduling policies
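
To make the scheduling idea concrete, here is a toy sketch (plain Python, not Llumnix's actual code) of a KV-cache-aware dispatch policy: each request is routed to the instance with the most free KV-cache blocks, and instances above a load threshold become candidates for migrating requests away. The names `InstanceStats`, `dispatch`, and `needs_rebalance` are hypothetical.

```python
# Toy illustration of KV-cache-aware dispatch (not Llumnix's actual implementation).
from dataclasses import dataclass


@dataclass
class InstanceStats:
    name: str
    total_kv_blocks: int   # KV-cache capacity of the engine instance
    used_kv_blocks: int    # blocks held by running/queued requests

    @property
    def free_kv_blocks(self) -> int:
        return self.total_kv_blocks - self.used_kv_blocks


def dispatch(request_blocks: int, instances: list[InstanceStats]) -> InstanceStats:
    """Pick the instance with the most free KV-cache blocks that can fit the request."""
    candidates = [i for i in instances if i.free_kv_blocks >= request_blocks]
    if not candidates:
        # A real system would queue the request or scale out instead.
        raise RuntimeError("no instance has enough free KV-cache blocks")
    target = max(candidates, key=lambda i: i.free_kv_blocks)
    target.used_kv_blocks += request_blocks
    return target


def needs_rebalance(instances: list[InstanceStats], threshold: float = 0.9) -> list[InstanceStats]:
    """Instances above the load threshold are candidates for migrating requests away."""
    return [i for i in instances if i.used_kv_blocks / i.total_kv_blocks > threshold]
```

Llumnix's real scheduler goes further by continuously migrating running requests (with their KV cache) between instances, which this sketch omits.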
## Getting Started
Llumnix provides two entrypoints, `api_server` and `serve`, for deployment. The `api_server` entrypoint offers a deployment method compatible with the default single-instance vLLM API server. The `serve` entrypoint, by contrast, lets users deploy Llumnix through the Ray job submission API.
If you already use vLLM for multi-instance LLM serving, simply replace the vLLM deployment command `python -m vllm.entrypoints.api_server ...` for each instance with the command below:
```bash
python -m llumnix.entrypoints.vllm.api_server \
--host $HOST \
--port $PORT \
...
```
During deployment, Llumnix automatically configures itself and serves as the request scheduling layer on top of the multiple vLLM engine instances.

For deploying Llumnix with the `serve` entrypoint, please refer to the [QuickStart](./docs/Quickstart.md).
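
As an illustration of the `serve` path, the sketch below submits the entrypoint as a Ray job using Ray's `JobSubmissionClient`. This is only a sketch under assumptions: the module path `llumnix.entrypoints.vllm.serve`, the model name, and the flags are placeholders; the exact command and arguments are documented in the QuickStart.

```python
# Minimal sketch of launching the Llumnix `serve` entrypoint as a Ray job.
# The entrypoint command below is an assumption; use the one from docs/Quickstart.md.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # default Ray dashboard/job server address

job_id = client.submit_job(
    # Hypothetical command and flags; replace with your Llumnix version's actual ones.
    entrypoint=(
        "python -m llumnix.entrypoints.vllm.serve "
        "--host 0.0.0.0 --port 8000 "
        "--model Qwen/Qwen2.5-7B-Instruct"
    ),
)
print("submitted Llumnix serve job:", job_id)
```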
Visit our [documentation](./docs/) to get started:
- [Quick Start](./docs/Quickstart.md)
- [Supported Models](./docs/Supported_Models.md)
- [Fault Tolerance](./docs/Fault_Tolerance.md)
- [Simulator](./docs/Simulator.md)
- [Prefill-decode Disaggregation](./docs/Prefill-decode_Disaggregation.md)

## Performance
We evaluate the performance of Llumnix's KV-cache-aware load-balancing scheduler and migration mechanism with 16 Qwen2.5-7B instances (each on an A10-24GB GPU) and 16 Llama2-13B instances (each on an A800-80GB GPU). We use Poisson distributions with different request rates to generate request arrivals, and the ShareGPT dataset for request input/output lengths.
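
For reference, request arrivals following a Poisson process with rate λ (requests per second) can be generated by drawing exponentially distributed inter-arrival times with mean 1/λ. A minimal sketch of this workload-generation step (the ShareGPT-based length sampling is omitted):

```python
# Sketch of generating Poisson request arrivals at a given rate (requests/s).
# Inter-arrival times of a Poisson process are exponentially distributed with
# mean 1/rate; request lengths would be sampled from ShareGPT (not shown).
import random


def poisson_arrival_times(rate: float, num_requests: int, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(num_requests):
        t += rng.expovariate(rate)  # exponential gap with mean 1/rate
        arrivals.append(t)
    return arrivals


# Example: 100 requests at an average rate of 4 requests/second.
print(poisson_arrival_times(rate=4.0, num_requests=100)[:5])
```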
Llumnix outperforms a simple round-robin scheduler in TTFT (prefill) by up to 6.4x (mean) and 12.1x (P99), and by 12% for P99 TBT (decode). Llumnix also shows significantly shorter average preemption stalls (by two orders of magnitude).

With the KV-cache-aware load-balancing scheduler and the migration mechanism, Llumnix also outperforms a simple queue-size-based load-balancing scheduler in TTFT (prefill) by up to 4.6x (mean) and 9.1x (P99), and by 15% for P99 TBT (decode).
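
As a reminder of how these metrics are defined, TTFT is the delay from request arrival to the first output token, and TBT is the gap between consecutive output tokens. The sketch below computes mean/P99 TTFT and P99 TBT from per-request timestamps; it is a generic illustration, not Llumnix's benchmark code.

```python
# Generic sketch of computing TTFT and TBT statistics from per-request timestamps.
import numpy as np


def ttft(arrival: float, token_times: list[float]) -> float:
    # Time from request arrival to the first generated token.
    return token_times[0] - arrival


def tbts(token_times: list[float]) -> list[float]:
    # Gaps between consecutive generated tokens.
    return [b - a for a, b in zip(token_times, token_times[1:])]


def summarize(requests: list[tuple[float, list[float]]]) -> dict:
    ttfts = [ttft(arrival, toks) for arrival, toks in requests]
    all_tbts = [gap for _, toks in requests for gap in tbts(toks)]
    return {
        "ttft_mean": float(np.mean(ttfts)),
        "ttft_p99": float(np.percentile(ttfts, 99)),
        "tbt_p99": float(np.percentile(all_tbts, 99)),
    }
```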
## Roadmap
Llumnix is currently in an alpha stage. Moving forward, we have work items planned including but not limited to:
- Architectural improvement: improving the scalability and efficiency of distributed serving and coordination;
- Policy optimization: better dispatching, migration, auto-scaling policies;
- New features: incorporating more inference engine features;
- Engineering: testing, CI/CD, etc.

## Publication
Please cite our paper if you use Llumnix in your research:
```bibtex
@inproceedings{sun2024llumnix,
title={Llumnix: Dynamic Scheduling for Large Language Model Serving},
author={Biao Sun and Ziming Huang and Hanyu Zhao and Wencong Xiao and Xinyi Zhang and Yong Li and Wei Lin},
booktitle={18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24)},
year={2024}
}
```

## License
Llumnix is licensed under the Apache 2.0 License.