https://github.com/curt-park/serving-codegen-gptj-triton
Serving Example of CodeGen-350M-Mono-GPTJ on Triton Inference Server with Docker and Kubernetes
https://github.com/curt-park/serving-codegen-gptj-triton
codegen docker fastertransformer huggingface-transformers kubernetes pytorch triton-inference-server
Last synced: 11 months ago
JSON representation
Serving Example of CodeGen-350M-Mono-GPTJ on Triton Inference Server with Docker and Kubernetes
- Host: GitHub
- URL: https://github.com/curt-park/serving-codegen-gptj-triton
- Owner: Curt-Park
- License: apache-2.0
- Created: 2023-05-29T23:21:45.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-05-30T16:07:17.000Z (about 3 years ago)
- Last Synced: 2025-07-25T04:51:41.979Z (12 months ago)
- Topics: codegen, docker, fastertransformer, huggingface-transformers, kubernetes, pytorch, triton-inference-server
- Language: Python
- Homepage:
- Size: 5.47 MB
- Stars: 20
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# Serving codegen-350M-mono-gptj on Triton Inference Server

## Contents
- PyTorch model conversion to [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) (See [Artifacts](https://huggingface.co/curt-park/codegen-350M-mono-gptj)).
- [Triton](https://github.com/triton-inference-server/server) serving with [FasterTransformer Backend](https://github.com/triton-inference-server/fastertransformer_backend).
- Load test on Triton server (Locust)
- A simple chatbot with [Gradio](https://github.com/gradio-app/gradio).
- Docker compose for the server and client.
- Kubernetes helm charts for the server and client.
- Monitoring on K8s (Promtail + Loki & Prometheus & Grafana).
- Autoscaling Triton (gRPC) on K8s (Triton Metrics & Traefik)
## How to Run
### Option1. Docker
```bash
docker compose up # Run the server & client.
```
Open http://localhost:7860
### Option2. Kubernetes (K3S)
Before you start,
- Install [Helm](https://helm.sh/docs/intro/install/)
#### Create a Service Cluster
```bash
make cluster
make charts
```
After a while, `kubectl get pods` will show:
```bash
NAME READY STATUS RESTARTS AGE
dcgm-exporter-ltftk 1/1 Running 0 2m26s
prometheus-kube-prometheus-operator-7958587c67-wxh8c 1/1 Running 0 96s
prometheus-prometheus-node-exporter-vgx65 1/1 Running 0 96s
traefik-677c7d64f8-8zlh9 1/1 Running 0 115s
prometheus-grafana-694f868865-58c2k 3/3 Running 0 96s
alertmanager-prometheus-kube-prometheus-alertmanager-0 2/2 Running 0 94s
prometheus-kube-state-metrics-85c858f4b-8rkzv 1/1 Running 0 96s
client-codegen-client-5d6df644f5-slcm8 1/1 Running 0 87s
prometheus-prometheus-kube-prometheus-prometheus-0 2/2 Running 0 94s
client-codegen-client-5d6df644f5-tms9j 1/1 Running 0 72s
triton-57d47d448c-hkf57 1/1 Running 0 88s
triton-prometheus-adapter-674d9855f-g9d6j 1/1 Running 0 88s
loki-0 1/1 Running 0 113s
promtail-qzvrz 1/1 Running 0 112s
```
#### Access to Client
Open http://localhost:7860
#### Access to Grafana
```bash
kubectl port-forward svc/prometheus-grafana 3000:80
```
Open http://localhost:3000
- id: admin
- pw: prom-operator
If you want to configure loki as data sources to monitor the service logs:
1. Configuration -> Data sources -> Add data sources
2. Select Loki
3. Add URL: http://loki.default.svc.cluster.local:3100
4. Click Save & test on the bottom.
5. Explore -> Select Loki
6. job -> default/client-codegen-client -> Show logs
#### Triton Auto-Scaling
To enable auto-scaling, you need to increase `maxReplicas` in `charts/triton/values.yaml`.
```bash
# For example,
autoscaling:
minReplicas: 1
maxReplicas: 2
```
By default, the autoscaling metric is average queuing time 50 ms for 30 seconds.
You can set the target value as you need.
```bash
autoscaling:
...
metrics:
- type: Pods
pods:
metric:
name: avg_time_queue_us
target:
type: AverageValue
averageValue: 50000 # 1,000 us == 1 ms
```
#### Finalization
```bash
make remove-charts
make finalize
```
## Artifacts
- CodeGen-350M-mono-gptj (for Triton): https://huggingface.co/curt-park/codegen-350M-mono-gptj
## For Developer
```bash
make setup # Install packages for execution.
make setup-dev # Install packages for development.
make format # Format the code.
make lint # Lint the code.
make load-test # Load test (`make setup-dev` is required).
```
## Experiments: Load Test
Device Info:
- CPU: AMD EPYC Processor (with IBPB)
- GPU: A100-SXM4-80GB x 1
- RAM: 1.857TB
Experimental Setups:
- Single Triton instance.
- Dynamic batching.
- Triton docker server.
- Output Length: 8 vs 32 vs 128 vs 512
### Output Length: 8


```bash
# metrics
nv_inference_count{model="ensemble",version="1"} 391768
nv_inference_count{model="postprocessing",version="1"} 391768
nv_inference_count{model="codegen-350M-mono-gptj",version="1"} 391768
nv_inference_count{model="preprocessing",version="1"} 391768
nv_inference_exec_count{model="ensemble",version="1"} 391768
nv_inference_exec_count{model="postprocessing",version="1"} 391768
nv_inference_exec_count{model="codegen-350M-mono-gptj",version="1"} 20439
nv_inference_exec_count{model="preprocessing",version="1"} 391768
nv_inference_compute_infer_duration_us{model="ensemble",version="1"} 6368616649
nv_inference_compute_infer_duration_us{model="postprocessing",version="1"} 51508744
nv_inference_compute_infer_duration_us{model="codegen-350M-mono-gptj",version="1"} 6148437063
nv_inference_compute_infer_duration_us{model="preprocessing",version="1"} 168281250
```
- RPS (Response per Second) reaches around 1,715.
- The average response time is 38 ms.
- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)
- Preprocessing spends 2.73% of the model inference time.
- Postprocessing spends 0.83% of the model inference time.
### Output Length: 32


```bash
# metrics
nv_inference_count{model="ensemble",version="1"} 118812
nv_inference_count{model="codegen-350M-mono-gptj",version="1"} 118812
nv_inference_count{model="postprocessing",version="1"} 118812
nv_inference_count{model="preprocessing",version="1"} 118812
nv_inference_exec_count{model="ensemble",version="1"} 118812
nv_inference_exec_count{model="codegen-350M-mono-gptj",version="1"} 6022
nv_inference_exec_count{model="postprocessing",version="1"} 118812
nv_inference_exec_count{model="preprocessing",version="1"} 118812
nv_inference_compute_infer_duration_us{model="ensemble",version="1"} 7163210716
nv_inference_compute_infer_duration_us{model="codegen-350M-mono-gptj",version="1"} 7090601211
nv_inference_compute_infer_duration_us{model="postprocessing",version="1"} 18416946
nv_inference_compute_infer_duration_us{model="preprocessing",version="1"} 54073590
```
- RPS (Response per Second) reaches around 500.
- The average response time is 122 ms.
- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)
- Preprocessing spends 0.76% of the model inference time.
- Postprocessing spends 0.26% of the model inference time.
### Output Length: 128


```bash
nv_inference_count{model="ensemble",version="1"} 14286
nv_inference_count{model="codegen-350M-mono-gptj",version="1"} 14286
nv_inference_count{model="preprocessing",version="1"} 14286
nv_inference_count{model="postprocessing",version="1"} 14286
nv_inference_exec_count{model="ensemble",version="1"} 14286
nv_inference_exec_count{model="codegen-350M-mono-gptj",version="1"} 1121
nv_inference_exec_count{model="preprocessing",version="1"} 14286
nv_inference_exec_count{model="postprocessing",version="1"} 14286
nference_compute_infer_duration_us{model="ensemble",version="1"} 4509635072
nv_inference_compute_infer_duration_us{model="codegen-350M-mono-gptj",version="1"} 4498667310
nv_inference_compute_infer_duration_us{model="preprocessing",version="1"} 7348176
nv_inference_compute_infer_duration_us{model="postprocessing",version="1"} 3605100
```
- RPS (Response per Second) reaches around 65.
- The average response time is 620 ms.
- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)
- Preprocessing spends 0.16% of the model inference time.
- Postprocessing spends 0.08% of the model inference time.
### Output Length: 512


```bash
nv_inference_count{model="ensemble",version="1"} 7183
nv_inference_count{model="codegen-350M-mono-gptj",version="1"} 7183
nv_inference_count{model="preprocessing",version="1"} 7183
nv_inference_count{model="postprocessing",version="1"} 7183
nv_inference_exec_count{model="ensemble",version="1"} 7183
nv_inference_exec_count{model="codegen-350M-mono-gptj",version="1"} 465
nv_inference_exec_count{model="preprocessing",version="1"} 7183
nv_inference_exec_count{model="postprocessing",version="1"} 7183
nv_inference_compute_infer_duration_us{model="ensemble",version="1"} 5764391176
nv_inference_compute_infer_duration_us{model="codegen-350M-mono-gptj",version="1"} 5757320649
nv_inference_compute_infer_duration_us{model="preprocessing",version="1"} 3678517
nv_inference_compute_infer_duration_us{model="postprocessing",version="1"} 3384699
```
- RPS (Response per Second) reaches around 40.
- The average response time is 1,600 ms.
- The metric shows dynamic batching works (`nv_inference_count` vs `nv_inference_exec_count`)
- Preprocessing spends 0.06% of the model inference time.
- Postprocessing spends 0.06% of the model inference time.
## NOTE
#### NVIDIA-Docker Configurations for K8s
Set `default-runtime` in `/etc/docker/daemon.json`.
```bash
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {
"path": "/usr/bin/nvidia-container-runtime",
"runtimeArgs": []
}
}
}
```
After configuring, restart docker: `sudo systemctl restart docker`
## References
- https://github.com/NVIDIA/FasterTransformer
- https://github.com/triton-inference-server/fastertransformer_backend
- https://github.com/triton-inference-server/python_backend