https://github.com/jina-ai/rungpt
An open-source, cloud-native serving framework for large multi-modal models (LMMs).
- Host: GitHub
- URL: https://github.com/jina-ai/rungpt
- Owner: jina-ai
- License: apache-2.0
- Created: 2023-04-04T07:57:35.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-09-05T03:32:18.000Z (about 2 years ago)
- Last Synced: 2025-04-02T23:54:55.738Z (7 months ago)
- Topics: flamingo, gpt-4, large-language-models, large-multimadality-models, llama, llm-hosting, llm-serve, lmm-serve, multi-modality, opengpt, self-hosting, transformers
- Language: Python
- Homepage:
- Size: 5.29 MB
- Stars: 161
- Watchers: 21
- Forks: 22
- Open Issues: 5
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
# ☄️ RunGPT
> "A playful and whimsical vector art of a Stochastic Tigger, wearing a t-shirt with a "GPT" text printed logo, surrounded by colorful geometric shapes. –ar 1:1 –upbeta"
>
> — Prompts and logo art were produced with [PromptPerfect](https://promptperfect.jina.ai/) & [Stable Diffusion X](https://clipdrop.co/stable-diffusion)
**RunGPT** is an open-source, _cloud-native_ serving framework for **_large language models_** (LLMs).
It is designed to simplify the deployment and management of large language models on a distributed cluster of GPUs.
We aim to make it a one-stop solution: a centralized, accessible place to gather techniques for optimizing LLMs and make them easy to use for everyone.

## Table of contents
- [Features](#features)
- [Get started](#get-started)
- [Build a model serving in one line](#build-a-model-serving-in-one-line)
- [Cloud-native deployment](#cloud-native-deployment)
- [Roadmap](#roadmap)

## Features
RunGPT provides the following features to make it easy to deploy and serve **large language models** (LLMs) at scale:
- Scalable architecture for handling high traffic loads
- Optimized for low-latency inference
- Automatic model partitioning and distribution across multiple GPUs
- Centralized model management and monitoring
- REST API for easy integration with existing applications

## Updates
- **2023-08-22**: OpenGPT has been renamed to RunGPT. We have also released the first version `v0.1.0` of RunGPT. You can install it with `pip install rungpt`.
- **2023-05-12**: 🎉 We have released the first version `v0.0.1` of OpenGPT. You can install it with `pip install open_gpt_torch`.

## Get Started
### Installation
Install the package with `pip`:
```bash
pip install rungpt
```

### Quickstart
```python
import run_gpt

model = run_gpt.create_model(
'stabilityai/stablelm-tuned-alpha-3b', device='cuda', precision='fp16'
)

prompt = "The quick brown fox jumps over the lazy dog."
output = model.generate(
prompt,
max_length=100,
temperature=0.9,
top_k=50,
top_p=0.95,
repetition_penalty=1.2,
do_sample=True,
num_return_sequences=1,
)
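# Not part of the original example: print the result to see the generated text.
# The exact return type of `generate` may differ between rungpt versions.
print(output)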
```

We use [stabilityai/stablelm-tuned-alpha-3b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b) as the example model because it is relatively small and fast to download.
> **Warning**
> In the above example, we use `precision='fp16'` to reduce memory usage and speed up inference, at the cost of some accuracy on text generation tasks.
> You can also use `precision='fp32'` instead for better generation quality.

> **Note**
> It usually takes a while (several minutes) the first time to download the model and load it into memory.

In most cases of large model serving, the model cannot fit into a single GPU. To solve this problem, we also provide a `device_map` option (supported by the `accelerate` package) to automatically partition the model and distribute it across multiple GPUs:
```python
model = run_gpt.create_model(
'stabilityai/stablelm-tuned-alpha-3b', precision='fp16', device_map='balanced'
)
```

In the above example, `device_map="balanced"` evenly splits the model across all available GPUs, making it possible for you to serve large models.
> **Note**
> The `device_map` option is supported by the [accelerate](https://github.com/huggingface/accelerate) package.

See [examples of how to use RunGPT with different models](./examples). 🔥
## Build a model serving in one line
To do so, you can use the `serve` command:
```bash
rungpt serve stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced
```

💡 **Tip**: you can inspect the available options with `rungpt serve --help`.
This will start a gRPC server and an HTTP server listening on ports `51000` and `52000`, respectively.
Once the server is ready, you can send requests to it:
```python
import requests

prompt = "Once upon a time,"
response = requests.post(
"http://localhost:51000/generate",
json={
"prompt": prompt,
"max_length": 100,
"temperature": 0.9,
"top_k": 50,
"top_p": 0.95,
"repetition_penalty": 1.2,
"do_sample": True,
"num_return_sequences": 1,
},
)
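# Not part of the original example: extract the generated text. The field names
# follow the OpenAI-style sample response shown later in this README
# ("choices" -> "text"); adjust them if your server version returns a different schema.
result = response.json()
print(result["choices"][0]["text"])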
```

What's more, we also provide a [Python client](https://github.com/jina-ai/inference-client/) (`inference-client`) for you to easily interact with the server:
```python
from run_gpt import Client

client = Client()
# connect to the model server
model = client.get_model(endpoint='grpc://0.0.0.0:51000')

prompt = "Once upon a time,"
output = model.generate(
prompt,
max_length=100,
temperature=0.9,
top_k=50,
top_p=0.95,
repetition_penalty=1.2,
do_sample=True,
num_return_sequences=1,
)
```

The output has the same format as the one from OpenAI's Python API:
```
{ "id": "18d92585-7b66-4b7c-b818-71287c122c50",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "Once upon a time, there was an old man who lived in the forest. He had no children",
"finish_reason": "length",
"index": 0.0}],
"prompt": "Once upon a time,",
"usage": {"completion_tokens": 21, "total_tokens": 27, "prompt_tokens": 6}}
```

For the streaming output, you can install `sseclient-py` first:
```bash
pip install sseclient-py
```

And send the request to `http://localhost:51000/generate_stream` with the same payload:
```python
import sseclient
import requests

prompt = "Once upon a time,"
response = requests.post(
"http://localhost:51000/generate_stream",
json={
"prompt": prompt,
"max_length": 100,
"temperature": 0.9,
"top_k": 50,
"top_p": 0.95,
"repetition_penalty": 1.2,
"do_sample": True,
"num_return_sequences": 1,
},
stream=True,
)
client = sseclient.SSEClient(response)
for event in client.events():
print(event.data)
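# Not part of the original example: to reconstruct the full completion you could
# instead join the text chunks, assuming each event.data payload is JSON shaped
# like the sample events shown below:
#
#   import json
#   text = "".join(json.loads(e.data)["choices"][0]["text"] for e in client.events())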
```

The output will be streamed back to you (only the first 3 events are shown here):
```
{ "id": "18d92585-7b66-4b7c-b818-71287c122c51",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": " there", "finish_reason": None, "index": 0.0}],
"prompt": "Once upon a time,",
"usage": {"completion_tokens": 1, "total_tokens": 7, "prompt_tokens": 6}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c52",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "was", "finish_reason": None, "index": 0.0}],
"prompt": None,
"usage": {"completion_tokens": 2, "total_tokens": 9, "prompt_tokens": 7}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c53",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "an", "finish_reason": None, "index": 0.0}],
"prompt": None,
"usage": {"completion_tokens": 3, "total_tokens": 11, "prompt_tokens": 8}}
```

We also support chat mode, which is useful for interactive applications. The input for `chat` should be a list of dictionaries, each containing a role and content. For example:

```python
import requests

messages = [
{"role": "user", "content": "Hello!"},
]

response = requests.post(
"http://localhost:51000/chat",
json={
"messages": messages,
"max_length": 100,
"temperature": 0.9,
"top_k": 50,
"top_p": 0.95,
"repetition_penalty": 1.2,
"do_sample": True,
"num_return_sequences": 1,
},
)
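# Not part of the original example: the assistant's reply is under
# choices[0]["message"]["content"] in the sample chat response shown below.
print(response.json()["choices"][0]["message"]["content"])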
```

The response will be:
```
{"id": "18d92585-7b66-4b7c-b818-71287c122c57",
"object": "chat.completion",
"created": 1692610173,
"choices": [{"message": {
"role": "assistant",
"content": "\n\nHello there, how may I assist you today?",
},
"finish_reason": "stop", "index": 0.0}],
"prompt": "Hello there!",
"usage": {"completion_tokens": 12, "total_tokens": 15, "prompt_tokens": 3}}
```

You can also replace `chat` with `chat_stream` to get streaming output.
## Cloud-native deployment
You can also deploy the server to a cloud provider like Jina Cloud or AWS.
To do so, you can use the `deploy` command:

### Jina Cloud
Using a predefined executor:
```bash
rungpt deploy stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced --cloud jina --replicas 1
```

It will give you an HTTP URL and a gRPC URL by default:
```bash
https://{random-host-name}-http.wolf.jina.ai
grpcs://{random-host-name}-grpc.wolf.jina.ai
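# Not part of the original docs: assuming the deployed HTTP endpoint exposes the
# same /generate route as the local server, you could query it like this
# (replace {random-host-name} with the host name you were given):
curl -X POST "https://{random-host-name}-http.wolf.jina.ai/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Once upon a time,", "max_length": 100}'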
```

### AWS
TBD
## Benchmark
We have benchmarked different model architectures and configurations (with and without quantization, `torch.compile`, paged attention, etc.) with respect to latency, throughput (for both the prefill stage and the whole decoding process), and perplexity.

The benchmarking script is located at `scripts/benchmark.py`. You can run it to reproduce the results.
### Environment Setting
We use a single RTX 3090 (CUDA 11.8) for all benchmarks except Llama-2-13b, which uses 2× RTX 3090. The software environment is:
```
torch==2.0.1 (without torch.compile) / torch==2.1.0.dev20230803 (with torch.compile)
bitsandbytes==0.41.0
transformers==4.31.0
triton==2.0.0
```

### Model Candidates
| Model_Name |
|:----------------------------------:|
| meta-llama/Llama-2-7b-hf |
| mosaicml/mpt-7b |
| stabilityai/stablelm-base-alpha-7b |
| EleutherAI/gpt-j-6B |

### Benchmarking Results
- **Latency/throughput for different models** (precision: fp16)
| Model_Name | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |
|:----------------------------------:|:---------------------------------:|:-----------------------------------:|:--------------------------------:|:----------------------------------:|
| meta-llama/Llama-2-7b-hf | 49 | 20.619 | 49.4 | 20.054 |
| meta-llama/Llama-2-13b-hf | 175 | 5.727 | 188.27 | 4.836 |
| mosaicml/mpt-7b | 27 | 37.527 | 28.04 | 35.312 |
| stabilityai/stablelm-base-alpha-7b | 50 | 20.09 | 45.73 | 21.878 |
| EleutherAI/gpt-j-6B | 75 | 13.301 | 76.15 | 11.181 |

- **Latency/throughput for different models using torch.compile** (precision: fp16)
> **Warning**
> torch.compile doesn't support Flash-Attention-based models like MPT. Also, it cannot be used in a multi-GPU environment.

| Model_Name | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |
|:----------------------------------:|:---------------------------------:|:-----------------------------------:|:--------------------------------:|:----------------------------------:|
| meta-llama/Llama-2-7b-hf | 25 | 40.644 | 26.54 | 37.75 |
| meta-llama/Llama-2-13b-hf | - | - | - | - |
| mosaicml/mpt-7b | - | - | - | - |
| stabilityai/stablelm-base-alpha-7b | 44 | 22.522 | 42.97 | 21.413 |
| EleutherAI/gpt-j-6B | 32 | 31.488 | 33.89 | 25.105 |

- **Latency/throughput for different models using quantization** (precision: fp16 / bit8 / bit4)
| Model_Name | prefill latency (ms/token) fp16 / bit8 / bit4 | prefill throughput (tokens/s) fp16 / bit8 / bit4 | decode latency (ms/token) fp16 / bit8 / bit4 | decode throughput (tokens/s) fp16 / bit8 / bit4 |
|:----------------------------------:|:---------------------:|:------------------------:|:-----------------------:|:------------------------:|
| meta-llama/Llama-2-7b-hf | 49 / 301 / 125 | 20.619 / 3.325 / 8.015 | 49.4 / 256.44 / 112.22 | 20.054 / 3.9 / 8.918 |
| meta-llama/Llama-2-13b-hf | 175 / 974 / 376 | 5.727 / 1.027 / 2.662 | 182.27 / 796.32 / 349.93 | 4.836 / 1.144 / 2.662 |
| mosaicml/mpt-7b | 27 / 139 / 86 | 37.527 / 7.222 / 11.6 | 28.04 / 141.04 / 94.22 | 35.312 / 7.021 / 10.507 |
| stabilityai/stablelm-base-alpha-7b | 50 / 164 / 156 | 20.09 / 6.134 / 6.408 | 45.73 / 148.53 / 147.56 | 21.878 / 6.947 / 6.994 |
| EleutherAI/gpt-j-6B | 75 / 368 / 162 | 13.301 / 2.724 / 6.195 | 76.15 / 365.51 / 138.44 | 11.181 / 2.327 / 5.642 |
- **Perplexity for different models using quantization** (precision: fp16 / bit8 / bit4)
> **Notice**
> From this benchmark we see that quantization doesn't affect the perplexity of the model too much.
| Model_Name | wikitext2 fp16 / bit8 / bit4 | ptb fp16 / bit8 / bit4 | c4 fp16 / bit8 / bit4 |
|:----------------------------------:|:-------------------------:|:----------------------------:|:-------------------------:|
| meta-llama/Llama-2-7b-hf | 5.4721 / 5.506 / 5.6437 | 22.9483 / 23.8797 / 25.0556 | 6.9727 / 7.0098 / 7.1623 |
| meta-llama/Llama-2-13b-hf | 4.8837 / 4.9229 / 4.9811 | 27.6802 / 27.9665 / 28.8417 | 6.4677 / 6.4884 / 6.566 |
| mosaicml/mpt-7b | 7.6829 / 7.7256 / 7.9869 | 10.6002 / 10.6743 / 10.9486 | 9.6001 / 9.6457 / 9.879 |
| stabilityai/stablelm-base-alpha-7b | 14.1886 / 14.268 / 15.9817 | 19.2968 / 19.4904 / 21.3513 | 48.222 / 48.3384 / 57.022 |
| EleutherAI/gpt-j-6B | 8.8563 / 8.8786 / 9.0301 | 13.5946 / 13.6137 / 13.784 | 11.7114 / 11.7293 / 11.8929 |
- **Latency/throughput for different models using vllm** (precision: fp16)
> **Warning**
> vllm brings a significant improvement in latency and throughput, but it is not compatible with streaming output, so we haven't released it yet.
| Model_Name | prefill latency (ms/token) vllm / baseline | prefill throughput (tokens/s) vllm / baseline | decode latency (ms/token) vllm / baseline | decode throughput (tokens/s) vllm / baseline |
|:------------------------:|:----------------:|:------------------:|:----------------:|:------------------:|
| meta-llama/Llama-2-7b-hf | 29 / 49 | 34.939 / 20.619 | 20.34 / 49.40 | 48.67 / 20.054 |
## Contributing
We welcome contributions from the community! To contribute, please submit a pull request following our contributing guidelines.
## License
RunGPT is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.