
# ☄️ RunGPT


RunGPT: An open-source cloud-native large-scale multimodal model serving framework


> "A playful and whimsical vector art of a Stochastic Tigger, wearing a t-shirt with a "GPT" text printed logo, surrounded by colorful geometric shapes. –ar 1:1 –upbeta"
>
> — Prompt and logo art were produced with [PromptPerfect](https://promptperfect.jina.ai/) & [Stable Diffusion X](https://clipdrop.co/stable-diffusion)

![](https://img.shields.io/badge/Made%20with-JinaAI-blueviolet?style=flat)
[![PyPI](https://img.shields.io/pypi/v/run_gpt_torch)](https://pypi.org/project/run_gpt_torch/)
[![PyPI - License](https://img.shields.io/pypi/l/run_gpt_torch)](https://pypi.org/project/run_gpt_torch/)

**RunGPT** is an open-source _cloud-native_ **_large-scale language model_** (LLM) serving framework.
It is designed to simplify the deployment and management of large language models on a distributed cluster of GPUs.
We aim to make it a one-stop solution: a centralized, accessible place that gathers techniques for optimizing LLMs and makes them easy to use for everyone.

## Table of contents

- [Features](#features)
- [Get started](#get-started)
- [Build a model serving in one line](#build-a-model-serving-in-one-line)
- [Cloud-native deployment](#cloud-native-deployment)
- [Roadmap](#roadmap)

## Features

RunGPT provides the following features to make it easy to deploy and serve **large language models** (LLMs) at scale:

- Scalable architecture for handling high traffic loads
- Optimized for low-latency inference
- Automatic model partitioning and distribution across multiple GPUs
- Centralized model management and monitoring
- REST API for easy integration with existing applications

## Updates

- **2023-08-22**: OpenGPT has been renamed to RunGPT. We have also released the first version `v0.1.0` of RunGPT. You can install it with `pip install rungpt`.
- **2023-05-12**: 🎉 We have released the first version `v0.0.1` of OpenGPT. You can install it with `pip install open_gpt_torch`.

## Get Started

### Installation

Install the package with `pip`:

```bash
pip install rungpt
```

### Quickstart

```python
import run_gpt

model = run_gpt.create_model(
    'stabilityai/stablelm-tuned-alpha-3b', device='cuda', precision='fp16'
)

prompt = "The quick brown fox jumps over the lazy dog."

output = model.generate(
    prompt,
    max_length=100,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=1,
)
```

We use [stabilityai/stablelm-tuned-alpha-3b](https://huggingface.co/stabilityai/stablelm-tuned-alpha-3b) as the example model, as it is relatively small and fast to download.

> **Warning**
> In the above example, we use `precision='fp16'` to reduce memory usage and speed up inference, at the cost of some accuracy loss on text generation tasks.
> You can also use `precision='fp32'` instead if you prefer better generation quality.
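
For example, the same quickstart call in full precision would look like this (a minimal sketch; `fp32` uses more GPU memory than `fp16`):

```python
import run_gpt

# Same quickstart model, loaded in full precision. This avoids the small
# accuracy loss mentioned above, but uses more GPU memory than fp16.
model = run_gpt.create_model(
    'stabilityai/stablelm-tuned-alpha-3b', device='cuda', precision='fp32'
)
```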

> **Note**
> It usually takes a while (several minutes) the first time you run the code, as the model has to be downloaded and loaded into memory.

In most cases of large model serving, the model cannot fit on a single GPU. To solve this problem, we also provide a `device_map` option (supported by the `accelerate` package) to automatically partition the model and distribute it across multiple GPUs:

```python
model = run_gpt.create_model(
    'stabilityai/stablelm-tuned-alpha-3b', precision='fp16', device_map='balanced'
)
```

In the above example, `device_map="balanced"` evenly splits the model across all available GPUs, making it possible for you to serve large models.

> **Note**
> The `device_map` option is supported by the [accelerate](https://github.com/huggingface/accelerate) package.

See [examples on how to use rungpt with different models.](./examples) 🔥

## Build a model serving in one line

To do so, use the `serve` command:

```bash
rungpt serve stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced
```

💡 **Tip**: you can inspect the available options with `rungpt serve --help`.

This will start a gRPC server and an HTTP server listening on ports `51000` and `52000`, respectively.

Once the server is ready, you can send requests to it:

```python
import requests

prompt = "Once upon a time,"

response = requests.post(
    "http://localhost:51000/generate",
    json={
        "prompt": prompt,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
)
```
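
The response body is a JSON completion object (its format is shown further below). Assuming that format, a minimal way to read the generated text:

```python
# Parse the JSON body returned by the server and print the text of the
# first choice (field names follow the response format shown below).
result = response.json()
print(result["choices"][0]["text"])
```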

We also provide a [Python client](https://github.com/jina-ai/inference-client/) (`inference-client`) so you can easily interact with the server:

```python
from run_gpt import Client

client = Client()

# connect to the model server
model = client.get_model(endpoint='grpc://0.0.0.0:51000')

prompt = "Once upon a time,"

output = model.generate(
    prompt,
    max_length=100,
    temperature=0.9,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    do_sample=True,
    num_return_sequences=1,
)
```

The output has the same format as the one from OpenAI's Python API:

```
{ "id": "18d92585-7b66-4b7c-b818-71287c122c50",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "Once upon a time, there was an old man who lived in the forest. He had no children",
"finish_reason": "length",
"index": 0.0}],
"prompt": "Once upon a time,",
"usage": {"completion_tokens": 21, "total_tokens": 27, "prompt_tokens": 6}}
```

For the streaming output, you can install `sseclient-py` first:
```bash
pip install sseclient-py
```

And send the request to `http://localhost:51000/generate_stream` with the same payload.

```python
import sseclient
import requests

prompt = "Once upon a time,"

response = requests.post(
    "http://localhost:51000/generate_stream",
    json={
        "prompt": prompt,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
    stream=True,
)
client = sseclient.SSEClient(response)
for event in client.events():
    print(event.data)
```

The output will be streamed back to you (only the first 3 events are shown here):

```
{ "id": "18d92585-7b66-4b7c-b818-71287c122c51",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": " there", "finish_reason": None, "index": 0.0}],
"prompt": "Once upon a time,",
"usage": {"completion_tokens": 1, "total_tokens": 7, "prompt_tokens": 6}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c52",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "was", "finish_reason": None, "index": 0.0}],
"prompt": None,
"usage": {"completion_tokens": 2, "total_tokens": 9, "prompt_tokens": 7}},
{ "id": "18d92585-7b66-4b7c-b818-71287c122c53",
"object": "text_completion",
"created": 1692610173,
"choices": [{"text": "an", "finish_reason": None, "index": 0.0}],
"prompt": None,
"usage": {"completion_tokens": 3, "total_tokens": 11, "prompt_tokens": 8}}
```

We also support chat mode, which is useful for interactive applications. The input to `chat` should be a list of
dictionaries, each containing a `role` and a `content` field. For example:

```python
import requests

messages = [
    {"role": "user", "content": "Hello!"},
]

response = requests.post(
    "http://localhost:51000/chat",
    json={
        "messages": messages,
        "max_length": 100,
        "temperature": 0.9,
        "top_k": 50,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "do_sample": True,
        "num_return_sequences": 1,
    },
)
```

The response will be:

```
{"id": "18d92585-7b66-4b7c-b818-71287c122c57",
"object": "chat.completion",
"created": 1692610173,
"choices": [{"message": {
"role": "assistant",
"content": "\n\nHello there, how may I assist you today?",
},
"finish_reason": "stop", "index": 0.0}],
"prompt": "Hello there!",
"usage": {"completion_tokens": 12, "total_tokens": 15, "prompt_tokens": 3}}
```

You can also replace `chat` with `chat_stream` to get streaming output.
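
For example, assuming the `/chat_stream` endpoint accepts the same payload as `/chat` and streams server-sent events like `/generate_stream` above, a minimal sketch would be:

```python
import requests
import sseclient  # pip install sseclient-py

messages = [
    {"role": "user", "content": "Hello!"},
]

# Assumed to mirror /generate_stream: same payload as /chat, with the
# response streamed back as server-sent events.
response = requests.post(
    "http://localhost:51000/chat_stream",
    json={
        "messages": messages,
        "max_length": 100,
        "do_sample": True,
    },
    stream=True,
)
client = sseclient.SSEClient(response)
for event in client.events():
    print(event.data)
```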

## Cloud-native deployment

You can also deploy the server to a cloud provider like Jina Cloud or AWS.
To do so, use the `deploy` command:

### Jina Cloud

Using the predefined executor:

```bash
rungpt deploy stabilityai/stablelm-tuned-alpha-3b --precision fp16 --device_map balanced --cloud jina --replicas 1
```

It will give you an HTTP URL and a gRPC URL by default:
```bash
https://{random-host-name}-http.wolf.jina.ai
grpcs://{random-host-name}-grpc.wolf.jina.ai
```
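
Assuming the deployed service exposes the same HTTP endpoints as a local `rungpt serve` instance, you could point the earlier request examples at the returned URL; a sketch:

```python
import requests

# Replace with the HTTP URL printed by `rungpt deploy` for your deployment.
endpoint = "https://{random-host-name}-http.wolf.jina.ai"

# Assumes the deployed service serves the same /generate endpoint as the
# local server shown earlier.
response = requests.post(
    endpoint + "/generate",
    json={"prompt": "Once upon a time,", "max_length": 100},
)
print(response.json())
```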

### AWS

TBD

## Benchmark
We have benchmarked different model architectures under different configurations (with and without quantization, torch.compile, paged attention, etc.), measuring latency, throughput (for both the prefill stage and the full decoding process), and perplexity.

The benchmarking script is located at `scripts/benchmark.py`; you can run it to reproduce the benchmarking results.

### Environment Setting
We use a single RTX 3090 (CUDA 11.8) for all benchmarks except Llama-2-13b, which uses 2×RTX 3090. The software environment is:
```
torch==2.0.1 (without torch.compile) / torch==2.1.0.dev20230803 (with torch.compile)
bitsandbytes==0.41.0
transformers==4.31.0
triton==2.0.0
```
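
If you want to check that your local environment matches these versions before running the benchmark script, a small sketch:

```python
from importlib.metadata import version

# Print the installed versions of the packages listed above so they can be
# compared against the benchmark environment.
for pkg in ("torch", "bitsandbytes", "transformers", "triton"):
    print(pkg, version(pkg))
```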

### Model Candidates
| Model_Name |
|:----------------------------------:|
| meta-llama/Llama-2-7b-hf |
| mosaicml/mpt-7b |
| stabilityai/stablelm-base-alpha-7b |
| EleutherAI/gpt-j-6B |

### Benchmarking Results

- **Latency/throughput for different models** (precision: fp16)

| Model_Name | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |
|:----------------------------------:|:---------------------------------:|:-----------------------------------:|:--------------------------------:|:----------------------------------:|
| meta-llama/Llama-2-7b-hf | 49 | 20.619 | 49.4 | 20.054 |
| meta-llama/Llama-2-13b-hf | 175 | 5.727 | 188.27 | 4.836 |
| mosaicml/mpt-7b | 27 | 37.527 | 28.04 | 35.312 |
| stabilityai/stablelm-base-alpha-7b | 50 | 20.09 | 45.73 | 21.878 |
| EleutherAI/gpt-j-6B | 75 | 13.301 | 76.15 | 11.181 |

- **Latency/throughput for different models using torch.compile** (precision: fp16)

> **Warning**
> torch.compile doesn't support Flash-Attention-based models like MPT. It also cannot be used in a multi-GPU environment.

| Model_Name | average_prefill_latency(ms/token) | average_prefill_throughput(token/s) | average_decode_latency(ms/token) | average_decode_throughput(token/s) |
|:----------------------------------:|:---------------------------------:|:-----------------------------------:|:--------------------------------:|:----------------------------------:|
| meta-llama/Llama-2-7b-hf | 25 | 40.644 | 26.54 | 37.75 |
| meta-llama/Llama-2-13b-hf | - | - | - | - |
| mosaicml/mpt-7b | - | - | - | - |
| stabilityai/stablelm-base-alpha-7b | 44 | 22.522 | 42.97 | 21.413 |
| EleutherAI/gpt-j-6B | 32 | 31.488 | 33.89 | 25.105 |

- **Latency/throughput for different models using quantization** (precision: fp16 / bit8 / bit4)



| Model_Name | prefill latency (ms/token)<br>fp16 / bit8 / bit4 | prefill throughput (tokens/s)<br>fp16 / bit8 / bit4 | decode latency (ms/token)<br>fp16 / bit8 / bit4 | decode throughput (tokens/s)<br>fp16 / bit8 / bit4 |
|:----------------------------------:|:----------------------:|:------------------------:|:-----------------------:|:-----------------------:|
| meta-llama/Llama-2-7b-hf | 49 / 301 / 125 | 20.619 / 3.325 / 8.015 | 49.4 / 256.44 / 112.22 | 20.054 / 3.9 / 8.918 |
| meta-llama/Llama-2-13b-hf | 175 / 974 / 376 | 5.727 / 1.027 / 2.662 | 182.27 / 796.32 / 349.93 | 4.836 / 1.144 / 2.662 |
| mosaicml/mpt-7b | 27 / 139 / 86 | 37.527 / 7.222 / 11.6 | 28.04 / 141.04 / 94.22 | 35.312 / 7.021 / 10.507 |
| stabilityai/stablelm-base-alpha-7b | 50 / 164 / 156 | 20.09 / 6.134 / 6.408 | 45.73 / 148.53 / 147.56 | 21.878 / 6.947 / 6.994 |
| EleutherAI/gpt-j-6B | 75 / 368 / 162 | 13.301 / 2.724 / 6.195 | 76.15 / 365.51 / 138.44 | 11.181 / 2.327 / 5.642 |

- **Perplexity for different models using quantization** (precision: fp16 / bit8 / bit4)

> **Notice**
> From this benchmark, we see that quantization has only a minor effect on the model's perplexity.



| Model_Name | wikitext2<br>fp16 / bit8 / bit4 | ptb<br>fp16 / bit8 / bit4 | c4<br>fp16 / bit8 / bit4 |
|:----------------------------------:|:---------------------------:|:-----------------------------:|:---------------------------:|
| meta-llama/Llama-2-7b-hf | 5.4721 / 5.506 / 5.6437 | 22.9483 / 23.8797 / 25.0556 | 6.9727 / 7.0098 / 7.1623 |
| meta-llama/Llama-2-13b-hf | 4.8837 / 4.9229 / 4.9811 | 27.6802 / 27.9665 / 28.8417 | 6.4677 / 6.4884 / 6.566 |
| mosaicml/mpt-7b | 7.6829 / 7.7256 / 7.9869 | 10.6002 / 10.6743 / 10.9486 | 9.6001 / 9.6457 / 9.879 |
| stabilityai/stablelm-base-alpha-7b | 14.1886 / 14.268 / 15.9817 | 19.2968 / 19.4904 / 21.3513 | 48.222 / 48.3384 / 57.022 |
| EleutherAI/gpt-j-6B | 8.8563 / 8.8786 / 9.0301 | 13.5946 / 13.6137 / 13.784 | 11.7114 / 11.7293 / 11.8929 |

- **Latency/throughput for different models using vllm** (precision: fp16)

> **Warning**
> vllm brings a significant improvement in latency and throughput, but it is not compatible with streaming output, so we have not released it yet.



| Model_Name | prefill latency (ms/token)<br>using vllm / baseline | prefill throughput (tokens/s)<br>using vllm / baseline | decode latency (ms/token)<br>using vllm / baseline | decode throughput (tokens/s)<br>using vllm / baseline |
|:------------------------:|:---------------:|:-----------------:|:---------------:|:-----------------:|
| meta-llama/Llama-2-7b-hf | 29 / 49 | 34.939 / 20.619 | 20.34 / 49.40 | 48.67 / 20.054 |

## Contributing

We welcome contributions from the community! To contribute, please submit a pull request following our contributing guidelines.

## License

RunGPT is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.