Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/l1aoxingyu/llm-infer-bench
- Host: GitHub
- URL: https://github.com/l1aoxingyu/llm-infer-bench
- Owner: L1aoXingyu
- Created: 2023-08-22T06:56:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-09-01T15:07:15.000Z (over 1 year ago)
- Last Synced: 2024-04-28T05:10:09.716Z (9 months ago)
- Language: Python
- Size: 1.05 MB
- Stars: 11
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
README
# llm-infer-bench
This repository provides benchmarking tools for different LLM serving backends. Below are the steps to set up each server and run the benchmarks.
## TGI
### Setup
Follow the [TGI repo](https://github.com/huggingface/text-generation-inference/tree/main)[^1] for initial setup. Then, launch the server using Docker:

**13B Model with 1 GPU**
```bash
docker run -itd --name tgi_test \
--env CUDA_VISIBLE_DEVICES=7 \
--gpus all \
--shm-size 1g \
-p 18080:80 \
-v /data0/home/models:/data \
ghcr.io/huggingface/text-generation-inference:1.0.2 \
--model-id /data/Llama-2-13b-hf \
--sharded false \
--max-input-length 1024 \
--max-total-tokens 68700 \
--max-concurrent-requests 10000
```

**70B Model with 4 GPUs**
```bash
# 70B Model
docker run -itd --name tgi_test \
--env CUDA_VISIBLE_DEVICES=4,5,6,7 \
--gpus all \
--shm-size 1g \
-p 18080:80 \
-v /data0/home/models:/data \
ghcr.io/huggingface/text-generation-inference:1.0.2 \
--model-id /data/Llama-2-70b-chat-hf \
--sharded true \
--num-shard 4 \
--max-input-length 1024 \
--max-total-tokens 387000 \
--max-concurrent-requests 10000
```

### Benchmarking
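Before running the benchmark scripts, it can help to confirm that the server answers requests. A minimal sketch using Python `requests` (the `/generate` route and payload follow TGI's HTTP API; the port matches the Docker command above):

```python
import requests

# Smoke test against the TGI container started above (port 18080).
# TGI's /generate route takes "inputs" plus generation "parameters".
resp = requests.post(
    "http://localhost:18080/generate",
    json={
        "inputs": "What is deep learning?",
        "parameters": {"max_new_tokens": 32},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```

With the server responding, run the benchmark scripts: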
```bash
# throughput
bash benchmark_configs/tgi_variable_size_throughput.sh

# latency
bash benchmark_configs/tgi_variable_size_latency.sh
```

## vLLM
### Setup
Follow the [vLLM repo](https://github.com/vllm-project/vllm/tree/791d79de3261402fae1b9d0b1650655071a68095)[^2] for initial setup. Then start the server:

**13B Model with 1 GPU**
```bash
CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.api_server \
--port 18081 \
--max-num-batched-tokens 18700 \
--model /data/Llama-2-13b-hf
```

**70B Model with 4 GPUs**
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m vllm.entrypoints.api_server \
--port 18081 \
--model /data/Llama-2-70b-chat-hf \
--max-num-batched-tokens 8700 \
--tensor-parallel-size 4
```
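Once the server is up, a single request can confirm it is serving. This is a sketch assuming the `/generate` route of `vllm.entrypoints.api_server` at this commit, which takes a `prompt` plus sampling parameters and returns the completions under `"text"`:

```python
import requests

# Quick check against the vLLM API server started above (port 18081).
resp = requests.post(
    "http://localhost:18081/generate",
    json={
        "prompt": "What is deep learning?",
        "max_tokens": 32,
        "temperature": 0.0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["text"])
```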
### Benchmark
```bash
# throughput
bash benchmark_configs/vllm_variable_size_throughput.sh

# latency
bash benchmark_configs/vllm_variable_size_latency.sh
```

## LightLLM
### Setup
Follow the [lightllm repo](https://github.com/ModelTC/lightllm)[^3] for initial setup. Then start the server:

**13B Model with 1 GPU**
```bash
CUDA_VISIBLE_DEVICES=0 python3 -m lightllm.server.api_server \
--port 18082 \
--model_dir /data/Llama-2-13b-hf/ \
--max_total_token_num 68700
```

**70B Model with 4 GPUs**
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m lightllm.server.api_server \
--port 18082 \
--model_dir /data/Llama-2-70b-chat-hf \
--max_total_token_num 387000 \
--tp 4
```

### Benchmark
```bash
# throughput
bash benchmark_configs/lightllm_variable_size_throughput.sh

# latency
bash benchmark_configs/lightllm_variable_size_latency.sh
```

## Benchmarking Results
All benchmarks were conducted on NVIDIA A800-SXM4-80GB GPUs connected via PCIe. We evaluated both 13B and 70B Llama models.

### Throughput
We used a dummy dataset of 1,000 sequences, each containing 512 input tokens. The model was configured to ignore the End-of-Sequence (EoS) token so that every sequence is generated up to its maximum length (`max_tokens`).

To introduce variability in the generated sequence lengths, we sampled the response length from an [exponential distribution](https://en.wikipedia.org/wiki/Exponential_distribution) with a mean of 128 tokens. We ran tests with truncated length caps, keeping only samples from the exponential distribution that are less than or equal to 32, 128, 512, and 1536 tokens. The total sequence length is then at most 512+32=544, 512+128=640, 512+512=1024, and 512+1536=2048 tokens (the maximum sequence length of the model configuration).
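For reference, the capped exponential sampling described above can be reproduced with a few lines of NumPy. This is a sketch of the sampling procedure, not the exact script used to produce these numbers:

```python
import numpy as np

def sample_output_lengths(n, mean=128, cap=512, seed=0):
    """Draw n output lengths from an exponential distribution with the given
    mean, keeping only samples that do not exceed the cap."""
    rng = np.random.default_rng(seed)
    lengths = []
    while len(lengths) < n:
        draws = rng.exponential(mean, size=n)
        lengths.extend(int(d) for d in draws if 1 <= d <= cap)
    return lengths[:n]

# 1,000 requests with 512 input tokens each; total length is at most 512 + cap.
output_lengths = sample_output_lengths(1000, mean=128, cap=512)
print(max(512 + length for length in output_lengths))  # <= 1024 for cap=512
```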
We employed a Python benchmarking script using [asyncio](https://docs.python.org/3/library/asyncio.html) to send HTTP requests to the model server. All requests were submitted in a burst to saturate the compute resources.
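A stripped-down version of such a burst benchmark might look like the following sketch, which uses `aiohttp` against a TGI-style `/generate` endpoint; the actual benchmarks are driven by the scripts under `benchmark_configs/`:

```python
import asyncio
import time

import aiohttp

async def send_request(session, url, payload):
    """Send one generation request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    async with session.post(url, json=payload) as resp:
        await resp.read()
    return time.perf_counter() - start

async def run_burst(url, payloads):
    """Submit every request at once and wait for all of them to finish."""
    timeout = aiohttp.ClientTimeout(total=None)  # long generations can exceed the default timeout
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [asyncio.create_task(send_request(session, url, p)) for p in payloads]
        return await asyncio.gather(*tasks)

# Dummy payloads: the real benchmark controls the exact input/output token counts.
payloads = [
    {"inputs": "hello " * 512, "parameters": {"max_new_tokens": 128}}
    for _ in range(1000)
]
latencies = asyncio.run(run_burst("http://localhost:18080/generate", payloads))
print(f"{len(latencies)} requests, mean latency {sum(latencies) / len(latencies):.2f}s")
```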
The results are as follows:
**13B Model**
| Throughput (tokens/s) vs. output length cap | max 32 tokens | max 128 tokens | max 512 tokens | max 1536 tokens |
|:--|:--|:--|:--|:--|
| TGI | 4937.88 | 3837.65 | 3391.885 | 3307.23 |
| vLLM | 6841.49 | 5086.82 | 3533.67 | 3325.49 |
| lightLLM | 6821.80 | 6063.72 | 4404.93 | 4102.28 |

**70B Model**
| Throughput (tokens/s) vs. output length cap | max 32 tokens | max 128 tokens | max 512 tokens | max 1536 tokens |
|:--|:--|:--|:--|:--|
| TGI | 3197.48 | 2532.61 | 2422.75 | 2325.35 |
| vLLM | 4514.55 | 3615.29 | 2429.91 | 2300.71 |
| lightLLM | 5024.78 | 4562.66 | 3568.23 | 3297.87 |

### Latency
Latency is a critical metric for live-inference endpoints as it directly impacts user experience. We measure how the [cumulative distribution function](https://en.wikipedia.org/wiki/Cumulative_distribution_function) of latencies varies across the different serving frameworks.

As with the throughput tests, the model is configured to ignore the `EoS` token and generate sequences of a specified length. We use a dataset of 1,000 randomly sampled prompts, with lengths drawn from a [uniform distribution](https://en.wikipedia.org/wiki/Discrete_uniform_distribution) ranging from 1 to 512 tokens. The output lengths are sampled from a capped exponential distribution with a mean of 128 and a maximum of 512 tokens. These parameters were selected as a realistic representation of typical workloads.
Unlike the throughput benchmark, where all requests are submitted simultaneously, here we stagger the requests. The gap between consecutive requests is drawn so that arrivals follow a [Poisson distribution](https://en.wikipedia.org/wiki/Poisson_distribution). This setup allows us to evaluate how each serving framework performs under varying Queries Per Second (QPS) loads.
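The request pacing can be sketched by drawing inter-arrival gaps from an exponential distribution with mean `1/QPS`, which produces Poisson-distributed arrivals. This is again a simplified sketch, reusing the TGI-style payload from the throughput example:

```python
import asyncio

import aiohttp
import numpy as np

async def send_request(session, url, payload):
    """Send one generation request and wait for the full response."""
    async with session.post(url, json=payload) as resp:
        await resp.read()

async def run_poisson_load(url, payloads, qps, seed=0):
    """Submit requests with exponentially distributed gaps so that arrivals
    form a Poisson process with the requested average QPS."""
    rng = np.random.default_rng(seed)
    timeout = aiohttp.ClientTimeout(total=None)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = []
        for payload in payloads:
            tasks.append(asyncio.create_task(send_request(session, url, payload)))
            await asyncio.sleep(rng.exponential(1.0 / qps))
        await asyncio.gather(*tasks)

payloads = [{"inputs": "hello", "parameters": {"max_new_tokens": 128}} for _ in range(1000)]
asyncio.run(run_poisson_load("http://localhost:18080/generate", payloads, qps=4))
```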
**13B Model**
| Generation latency vs. QPS (median token latency / median request latency) | QPS=2 | QPS=4 | QPS=8 |
|:--|:--|:--|:--|
| TGI | 0.0256 / 2.3892 | 0.0322 / 2.9769 | 0.0542 / 4.8202 |
| vLLM | 0.0260 / 2.4723 | 0.0348 / 3.2986 | 0.0825 / 7.4093 |
| lightLLM | 0.0337 / 2.7406 | 0.0453 / 3.7370 | 0.1257 / 9.5245 |
**70B Model**
| Generation latency vs. QPS (median token latency / median request latency) | QPS=1 | QPS=2 | QPS=4 |
|:--|:--|:--|:--|
| TGI | 0.0456 / 4.1999 | 0.0515 / 4.7381 | 0.0633 / 5.7956 |
| vLLM | 0.0578 / 5.4908 | 0.0835 / 7.7701 | 0.2079 / 18.6611 |
| lightLLM | 0.0627 / 5.1785 | 0.0712 / 5.8945 | 0.1077 / 8.7133 |
## Acknowledgements
- [llm-continuous-batching-benchmarks](https://github.com/anyscale/llm-continuous-batching-benchmarks)
- [vLLM](https://github.com/vllm-project/vllm/tree/791d79de3261402fae1b9d0b1650655071a68095)
- [TGI](https://github.com/huggingface/text-generation-inference/tree/main)
- [LightLLM](https://github.com/ModelTC/lightllm)

[^1]: Docker image: ghcr.io/huggingface/text-generation-inference:1.0.2
[^2]: commit_id: 28873a2799ddfdd0624edd4619e6fbeeb49cd02c
[^3]: commit_id: 718e6d6dfffc75e7bbfd7ea80ba4afb77aa27726