https://github.com/ganochenkodg/vllm-token-stats

Proxy for vLLM to expose token usage metrics.

Host: GitHub
URL: https://github.com/ganochenkodg/vllm-token-stats
Owner: ganochenkodg
License: mit
Created: 2025-05-20T11:24:06.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-20T13:45:53.000Z (about 1 year ago)
Last Synced: 2025-05-20T14:33:51.338Z (about 1 year ago)
Topics: fastify, prometheus, vllm
Language: JavaScript
Homepage:
Size: 17.6 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# vllm-token-stats

Proxy for vLLM to expose token usage metrics.

# architecture

vllm-token-stats proxies incoming requests to vLLM and collects token usage statistics per client.

It requires the following RBAC permissions to look up each client's name and namespace:

```
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
```
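
The `get` and `list` verbs on pods are what a proxy needs to turn a caller's address into a pod identity. Below is a minimal sketch of such a lookup. It assumes the proxy matches the request's remote IP against `status.podIP` in a pod list and uses the pod name as the client name. The repository may resolve clients differently, and `findClientByIp` and the sample data are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper: map a caller's IP address to the pod that owns it.
// `pods` is assumed to be the `items` array of a Kubernetes PodList.
function findClientByIp(pods, ip) {
  const pod = pods.find((p) => p.status && p.status.podIP === ip);
  if (!pod) {
    // Callers from outside the cluster have no matching pod.
    return { clientName: 'unknown', namespace: 'unknown' };
  }
  return { clientName: pod.metadata.name, namespace: pod.metadata.namespace };
}

// Illustrative data shaped like Kubernetes Pod objects (IPs are made up).
const samplePods = [
  { metadata: { name: 'sh', namespace: 'default' }, status: { podIP: '10.0.0.5' } },
  { metadata: { name: 'benchmark', namespace: 'test' }, status: { podIP: '10.0.0.9' } },
];

console.log(findClientByIp(samplePods, '10.0.0.9'));
```

In a real deployment the pod list would come from the Kubernetes API using the service account bound to the role above.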

```mermaid
graph TD
    subgraph Deployment
        direction LR
        vllm["vLLM<br/>Port: 8000"]
        vllm_stats["vllm-token-stats<br/>Port: 3000"]
    end
    user[Incoming Requests]
    prometheus[Prometheus Metrics]
    user -- "Requests<br/>/v1" --> vllm_stats
    vllm_stats -- "Proxies to<br/>/v1" --> vllm
    vllm_stats -- /metrics --> prometheus
```

It exposes metrics in Prometheus format, for example:

```
# HELP vllm_prompt_tokens Prompt tokens used by VLLM
# TYPE vllm_prompt_tokens counter
vllm_prompt_tokens{namespace="default",client_name="sh",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 6
vllm_prompt_tokens{namespace="test",client_name="benchmark",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 406149
# HELP vllm_completion_tokens Completion tokens produced by VLLM
# TYPE vllm_completion_tokens counter
vllm_completion_tokens{namespace="default",client_name="sh",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 100
vllm_completion_tokens{namespace="test",client_name="benchmark",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 359592
# HELP vllm_total_tokens Total tokens processed by VLLM
# TYPE vllm_total_tokens counter
vllm_total_tokens{namespace="default",client_name="sh",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 106
vllm_total_tokens{namespace="test",client_name="benchmark",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 765741
```
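
Counters like these can be derived from the `usage` object that OpenAI-compatible completion responses carry (`prompt_tokens`, `completion_tokens`, `total_tokens`). The sketch below is illustrative only. It keeps counters in a plain `Map` and renders the text format by hand, whereas the actual proxy may rely on a metrics library. `recordUsage`, `renderCounter`, and `labelString` are hypothetical names, and label values are assumed not to need escaping.

```javascript
// Counter storage: metric name -> (serialized label set -> running total).
const counters = new Map();

// Serialize labels in Prometheus text format: key="value",key="value"
function labelString(labels) {
  return Object.entries(labels)
    .map(([key, value]) => `${key}="${value}"`)
    .join(',');
}

// Add the token counts from one response's `usage` object to the counters.
function recordUsage(labels, usage) {
  const key = labelString(labels);
  for (const field of ['prompt_tokens', 'completion_tokens', 'total_tokens']) {
    const name = `vllm_${field}`;
    if (!counters.has(name)) counters.set(name, new Map());
    const series = counters.get(name);
    series.set(key, (series.get(key) || 0) + (usage[field] || 0));
  }
}

// Render one metric's series as Prometheus exposition lines.
function renderCounter(name) {
  const series = counters.get(name) || new Map();
  const lines = [`# TYPE ${name} counter`];
  for (const [key, value] of series) lines.push(`${name}{${key}} ${value}`);
  return lines.join('\n');
}

const labels = { namespace: 'default', client_name: 'sh', full_path: '/v1/completions' };
recordUsage(labels, { prompt_tokens: 6, completion_tokens: 100, total_tokens: 106 });
console.log(renderCounter('vllm_total_tokens'));
```

Keying each series by its full label set is what lets one counter track several clients and namespaces independently.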

# installation

You can install the example YAML manifest, which contains all required components (a Deployment with vLLM and the proxy, a Service, RBAC, and a PodMonitoring resource), in a GKE Autopilot cluster:

```
kubectl apply -f https://raw.githubusercontent.com/ganochenkodg/vllm-token-stats/refs/heads/main/vllm-l4.yaml
```

Example output:

```
serviceaccount/log-proxy-sa created
clusterrole.rbac.authorization.k8s.io/log-proxy-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/log-proxy-cluster-role-binding created
deployment.apps/vllm created
service/vllm-endpoint created
podmonitoring.monitoring.googleapis.com/vllm-token-stats created
```

# performance

The proxy adds little overhead compared with a direct connection to vLLM.

Benchmark results for a g2-standard-4 node with one NVIDIA L4 GPU and the Llama-3.1-8B-Instruct model:

```
# For the direct run, point --base-url at port 8000 instead:
#   --base-url http://vllm-endpoint.default.svc:8000
python3 benchmark_serving.py \
  --backend openai \
  --base-url http://vllm-endpoint.default.svc:3000 \
  --model unsloth/Meta-Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
```

Direct connection:

```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  343.76
Total input tokens:                      215196
Total generated tokens:                  197107
Request throughput (req/s):              2.91
Output token throughput (tok/s):         573.38
Total Token throughput (tok/s):          1199.39
---------------Time to First Token----------------
Mean TTFT (ms):                          140519.21
Median TTFT (ms):                        138940.03
P99 TTFT (ms):                           294126.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          159.13
Median TPOT (ms):                        139.51
P99 TPOT (ms):                           603.73
---------------Inter-token Latency----------------
Mean ITL (ms):                           138.68
Median ITL (ms):                         95.72
P99 ITL (ms):                            617.65
==================================================
```

Through the proxy:

```
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  357.18
Total input tokens:                      215196
Total generated tokens:                  198054
Request throughput (req/s):              2.80
Output token throughput (tok/s):         554.49
Total Token throughput (tok/s):          1156.96
---------------Time to First Token----------------
Mean TTFT (ms):                          138294.06
Median TTFT (ms):                        126404.99
P99 TTFT (ms):                           300286.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          152.34
Median TPOT (ms):                        138.11
P99 TPOT (ms):                           455.29
---------------Inter-token Latency----------------
Mean ITL (ms):                           135.52
Median ITL (ms):                         95.46
P99 ITL (ms):                            608.47
==================================================
```

Through the proxy, total token throughput is ~96.5% of the direct connection (1156.96 vs. 1199.39 tok/s).
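
The ~96.5% figure is the ratio of total token throughput between the two runs. A short sketch of that arithmetic, using the throughput values copied from the benchmark output above (`relativePercent` is a hypothetical helper name):

```javascript
// Throughput figures copied from the two benchmark runs above.
const direct = { requestThroughput: 2.91, outputTokenThroughput: 573.38, totalTokenThroughput: 1199.39 };
const proxied = { requestThroughput: 2.80, outputTokenThroughput: 554.49, totalTokenThroughput: 1156.96 };

// Express each proxied metric as a percentage of the direct result,
// rounded to one decimal place.
function relativePercent(proxiedRun, directRun) {
  const result = {};
  for (const metric of Object.keys(directRun)) {
    result[metric] = Number(((proxiedRun[metric] / directRun[metric]) * 100).toFixed(1));
  }
  return result;
}

console.log(relativePercent(proxied, direct));
// totalTokenThroughput comes out at 96.5
```

Request and output token throughput land within about half a percentage point of the same ratio, so the overhead is consistent across the three throughput metrics.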

Typical resource consumption under load:

```bash
$ kubectl top pod --containers=true vllm-66855dfbf7-m5njg
POD                     NAME               CPU(cores)   MEMORY(bytes)
vllm-66855dfbf7-m5njg   inference-server   850m         7732Mi
vllm-66855dfbf7-m5njg   vllm-token-stats   83m          284Mi
```

# vLLM monitoring

Use [dashboard.json](dashboard.json) in Google Cloud Monitoring to see token usage.

![Dashboard](dashboard.png)
