Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/zhihu/ZhiLight
A highly optimized inference acceleration engine for Llama and its variants.
https://github.com/zhihu/ZhiLight
cuda gpt inference-engine llama llm llm-serving pytorch
Last synced: about 1 month ago
JSON representation
A highly optimized inference acceleration engine for Llama and its variants.
- Host: GitHub
- URL: https://github.com/zhihu/ZhiLight
- Owner: zhihu
- License: apache-2.0
- Created: 2024-12-06T11:28:17.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2024-12-06T12:05:46.000Z (about 2 months ago)
- Last Synced: 2024-12-06T13:39:15.113Z (about 2 months ago)
- Topics: cuda, gpt, inference-engine, llama, llm, llm-serving, pytorch
- Language: C++
- Homepage:
- Size: 689 KB
- Stars: 3
- Watchers: 6
- Forks: 3
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
- awesome-LLM-resourses - ZhiLight
README
# ZhiLight大模型推理引擎
✨ __ZhiLight__ ✨is a highly optimized LLM inference engine developed by Zhihu and ModelBest Inc. The "Zhi" in its name stands for **Z**hihu. ZhiLight can accelerate the inference of models like Llama and its variants, especially on PCIe-based GPUs. Compared to mainstream open-source inference engines, for example, vllm, it has significant performance advantages.
## 🎉🎉 Main Features
* Asynchronous OpenAI compatible interface adapted from vllm
* Custom defined tensor and unified global memory management
* 🔥 Encode and all-reduce overlap, we named "dual streams"
* Support Int8-quantized all-reduce to further reduce all-reduce cost.
* Host all-reduce based on SIMD instructions
* Optimized fused kernels, qkv, residual & layernorm etc.
* 🔥 Fused batch attention for decoding based on tensor core instructions
* Support TP and PP on one node, TP is recommended
* Support dynamic batch
* Support flashatten prefill
* Support chunked prefill
* Support prefix cache
* Support Native INT8/SmoothQuant/FP8/AWQ/GPTQ quantization
* Support Marlin kernel for GPTQ
* Support MoE, DeepseekV2 MoE and DeepseekV2 MLA
* Support Llama/Llama2, Mixtral, Qwen2 series and similar models
## 🔧 Basic Usage
```bash
# Concurrently compile the wheel package, and turn off the unit test
CMAKE_BUILD_PARALLEL_LEVEL=32 TESTING=0 python setup.py bdist_wheel# Compile with ninja backend
CMAKE_GENERATER="Ninja" python setup.py bdist_wheel# Install directly
cd ./ZhiLight && pip install -e .# Start OpenAI compatible server
python -m zhilight.server.openai.entrypoints.api_server [options]
```
## ✈️ Docker Image
ZhiLight only depends on the CUDA runtime, cuBLAS, NCCL, and a few Python packages in requirements.txt. You can use the image below for running or building it. You can also directly refer to docker/Dockerfile.
```bash
docker pull ghcr.io/zhihu/zhilight/zhilight:0.4.8-cu124
```## 📈 Performance Notes
We conducted performance reviews on various mainstream NVIDIA GPUs with different model sizes and precisions. For dense models ranging from 2B to 110B parameters on PCIe devices, ZhiLight demonstrates significant performance advantages compared to mainstream open-source inference engines.
Test Description:
- Test purpose is to demonstrate performance, applicable scenarios and limitations
- Test dataset contains approximately 3.7k prompts
- Test metrics include:
- QPS: Queries Per Second
- TTFT (Time To First Token): First token generation latency
- TPOT (Time Per Output Token): Generation latency per output token
- Test environments include:
- AD102 PCIe : Consumer-grade GPU for experimental research
- A800: Data center GPU for production deployment
- Test models include:
- Large-scale models: Qwen1.5-110B, Qwen2-72B, LLama-3.1-70B
- Medium-scale models: Qwen2.5-14B, Llama-3.1-8B, Minicpm-2B
- Compared inference engines include:
- vLLM
- SGLang
- ZhiLight### MiniCPM-2B-sft-bf16
- **NVIDIA AD102 PCIe * 1**
| Inference Engine | QPS | TTFT Mean | TTFT P95 | TPOT Mean| TPOT P95|
| :---: | :---: | :---: | :----: | :---: | :---: |
| vLLM | 1.67 | 527.55 | 1062.96 | 16.71 | 31.95 |
| SGLang |1.67 | 466.19 | 1181.5 | 33.96 | 59.44 |
| **ZhiLight** | **1.67**|**434.64** | **989.03**|**26.1** |**61.14**|### Qwen2-72B-Instruct-GPTQ-Int4
- **NVIDIA AD102 PCIe * 4**
| Inference Engine | QPS | TTFT Mean | TTFT P95 | TPOT Mean| TPOT P95|
| :---: | :---: | :---: | :----: | :---: | :---: |
| vLLM | 0.18 | 3493.97 | 6852.07 | 35.47 | 61.74|
| SGLang | 0.18 | 2276.1 | 3820.7 | 38.12 | 65.16|
| **ZhiLight** |**0.18** | **1111.8**|**1882.5**| **26.75**|**41.81**|- **NVIDIA A800 * 4**
| Inference Engine | QPS | TTFT Mean | TTFT P95 | TPOT Mean | TPOT P95|
| :---: | :---: | :---: | :----: | :---: | :---: |
| vLLM | 0.18 | 1457.65 | 2136.5 | 22.14 | 28.96|
| **SGLang** |**0.36** |**1113.06**|**1850.57**| **30.41**|**43.65**|
| ZhiLight | 0.18 | 1227.37 | 1968.95 | 31.95 | 48.53 |
### Qwen1.5-110B-Chat-GPTQ-Int4- **NVIDIA AD102 PCIe * 4**
| Inference Engine | QPS | TTFT Mean | TTFT P95 | TPOT Mean| TPOT P95|
| :---: | :---: | :---: | :----: | :---: | :---: |
| vLLM | 0.09 | 3085.74 | 4274.03 | 30.34 | 44.08|
| SGLang | 0.09 |2418.56 | 3187.73 | 31.39 | 53.1 |
| **ZhiLight** |**0.18** |**1671.38**|**2669.82**| **39.68**|**64.35**|- **NVIDIA A800 * 4**
| Inference Engine | QPS | TTFT Mean | TTFT P95 | TPOT Mean| TPOT P95|
| :---: | :---: | :---: | :----: | :---: | :---: |
| vLLM | 0.09 | 1899.07 | 2719.59 | 23.8 | 33.02|
| **SGLang** |**0.18** |**1514.49**|**2135.75**| **28.5**|**47.28**|
| ZhiLight | 0.1 | 1574.85 | 2086.8 | 27.07 | 38.82 |more benchmarks can be found in [benchmarks.md](docs/benchmarks.md)
## License
Apache License 2.0## Contributors
- [@a710128](https://github.com/a710128)
- [@spetrel](https://github.com/spetrel)
- [@unix1986](https://github.com/unix1986)
- [@gnap](https://github.com/gnap)