https://github.com/ganochenkodg/vllm-token-stats
Proxy for vLLM to expose token usage metrics.
fastify prometheus vllm
- Host: GitHub
- URL: https://github.com/ganochenkodg/vllm-token-stats
- Owner: ganochenkodg
- License: mit
- Created: 2025-05-20T11:24:06.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-20T13:45:53.000Z (about 1 year ago)
- Last Synced: 2025-05-20T14:33:51.338Z (about 1 year ago)
- Topics: fastify, prometheus, vllm
- Language: JavaScript
- Homepage:
- Size: 17.6 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# vllm-token-stats
Proxy for vLLM to expose token usage metrics.
# architecture
vllm-token-stats proxies incoming requests to vLLM and collects token usage statistics per client.
It requires the following RBAC permissions to look up each client's name and namespace:
```
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list"]
```
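The `get` and `list` verbs on pods suggest one way a proxy could identify callers: match the source IP of a request against the pod IPs in the cluster. The README does not describe the actual lookup, so the following is only a minimal sketch under that assumption. The function name `findClientByIp`, the `"unknown"` fallback, and the sample data are hypothetical; the `metadata.name`, `metadata.namespace`, and `status.podIP` fields are those of a Kubernetes PodList.

```javascript
// Hypothetical helper: resolve a caller's source IP to a pod name and namespace.
// `podList` has the shape of a Kubernetes PodList (items[].metadata, items[].status.podIP).
function findClientByIp(podList, clientIp) {
  const pod = podList.items.find((p) => p.status && p.status.podIP === clientIp);
  if (!pod) {
    // Callers outside the cluster (or pods not in the list) have no match.
    return { namespace: "unknown", client_name: "unknown" };
  }
  return { namespace: pod.metadata.namespace, client_name: pod.metadata.name };
}

// Made-up data for illustration only.
const examplePods = {
  items: [
    { metadata: { name: "benchmark", namespace: "test" }, status: { podIP: "10.0.0.7" } },
    { metadata: { name: "sh", namespace: "default" }, status: { podIP: "10.0.0.9" } },
  ],
};

console.log(findClientByIp(examplePods, "10.0.0.7"));
// prints { namespace: 'test', client_name: 'benchmark' }
```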
```mermaid
graph TD
    subgraph Deployment
        direction LR
        vllm["vLLM<br>Port: 8000"]
        vllm_stats["vllm-token-stats<br>Port: 3000"]
    end
    user["Incoming Requests"]
    prometheus["Prometheus Metrics"]
    user -- "Requests<br>/v1" --> vllm_stats
    vllm_stats -- "Proxies to<br>/v1" --> vllm
    vllm_stats -- "/metrics" --> prometheus
```
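The accounting step in the diagram can be sketched as follows. OpenAI-compatible completion responses carry a `usage` object with `prompt_tokens`, `completion_tokens`, and `total_tokens`; a proxy can add these to per-client counters after forwarding each request. This is an illustration, not the project's code: the `tokenCounters` map and the `recordUsage` function are hypothetical, and the real proxy may store its counters differently.

```javascript
// Hypothetical in-memory store: one entry per label set.
const tokenCounters = new Map();

// Add the token counts from one upstream response to the caller's counters.
function recordUsage(labels, responseBody) {
  const usage = responseBody && responseBody.usage;
  if (!usage) return; // error responses may carry no usage object
  const key = JSON.stringify(labels);
  const entry = tokenCounters.get(key) || { prompt: 0, completion: 0, total: 0 };
  entry.prompt += usage.prompt_tokens || 0;
  entry.completion += usage.completion_tokens || 0;
  entry.total += usage.total_tokens || 0;
  tokenCounters.set(key, entry);
}

// Example: one request from the "sh" client in the "default" namespace.
const exampleLabels = { namespace: "default", client_name: "sh", full_path: "/v1/completions" };
recordUsage(exampleLabels, { usage: { prompt_tokens: 6, completion_tokens: 100, total_tokens: 106 } });
console.log(tokenCounters.get(JSON.stringify(exampleLabels)));
// prints { prompt: 6, completion: 100, total: 106 }
```

Keying the map by the serialized label set mirrors how Prometheus treats each distinct label combination as its own time series.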
vllm-token-stats exposes metrics in Prometheus format, for example:
```
# HELP vllm_prompt_tokens Prompt tokens used by VLLM
# TYPE vllm_prompt_tokens counter
vllm_prompt_tokens{namespace="default",client_name="sh",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 6
vllm_prompt_tokens{namespace="test",client_name="benchmark",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 406149
# HELP vllm_completion_tokens Completion tokens produced by VLLM
# TYPE vllm_completion_tokens counter
vllm_completion_tokens{namespace="default",client_name="sh",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 100
vllm_completion_tokens{namespace="test",client_name="benchmark",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 359592
# HELP vllm_total_tokens Total tokens processed by VLLM
# TYPE vllm_total_tokens counter
vllm_total_tokens{namespace="default",client_name="sh",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 106
vllm_total_tokens{namespace="test",client_name="benchmark",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"} 765741
```
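In the sample above, each `vllm_total_tokens` value equals the sum of the matching prompt and completion counters (for example, 6 + 100 = 106). A small consistency check over scraped text might look like the sketch below. The `parseSamples` and `totalsAreConsistent` functions are hypothetical helpers, and the parser handles only the simple `name{labels} value` lines shown here, not the full exposition format.

```javascript
// Parse simple `name{labels} value` lines; skip comments and blank lines.
function parseSamples(text) {
  const samples = [];
  for (const line of text.split("\n")) {
    if (line.startsWith("#") || line.trim() === "") continue;
    const m = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)\{(.*)\}\s+(\S+)$/);
    if (!m) continue;
    samples.push({ name: m[1], labels: m[2], value: Number(m[3]) });
  }
  return samples;
}

// For every label set, check that prompt + completion equals total.
function totalsAreConsistent(samples) {
  const byLabels = new Map();
  for (const s of samples) {
    const entry = byLabels.get(s.labels) || {};
    entry[s.name] = s.value;
    byLabels.set(s.labels, entry);
  }
  return [...byLabels.values()].every(
    (e) => e.vllm_prompt_tokens + e.vllm_completion_tokens === e.vllm_total_tokens
  );
}

const labels = 'namespace="test",client_name="benchmark",full_path="/v1/completions",hostname="vllm-66855dfbf7-m5njg"';
const scraped = [
  "# TYPE vllm_prompt_tokens counter",
  `vllm_prompt_tokens{${labels}} 406149`,
  `vllm_completion_tokens{${labels}} 359592`,
  `vllm_total_tokens{${labels}} 765741`,
].join("\n");

console.log(totalsAreConsistent(parseSamples(scraped))); // prints true
```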
# installation
You can apply the example YAML manifest, which includes all required components (a Deployment with vLLM and the proxy, a Service, RBAC, and a PodMonitoring resource), to a GKE Autopilot cluster:
```
kubectl apply -f https://raw.githubusercontent.com/ganochenkodg/vllm-token-stats/refs/heads/main/vllm-l4.yaml
```
Example output:
```
serviceaccount/log-proxy-sa created
clusterrole.rbac.authorization.k8s.io/log-proxy-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/log-proxy-cluster-role-binding created
deployment.apps/vllm created
service/vllm-endpoint created
podmonitoring.monitoring.googleapis.com/vllm-token-stats created
```
# performance
The performance difference between connecting to vLLM directly and going through vllm-token-stats is small (a few percent in the runs below).
Benchmark results on a g2-standard-4 node with one NVIDIA L4 GPU and the Llama-3.1-8B-Instruct model:
```
# Through the proxy (port 3000). For the direct run, use port 8000 in --base-url.
python3 benchmark_serving.py \
  --backend openai \
  --base-url http://vllm-endpoint.default.svc:3000 \
  --model unsloth/Meta-Llama-3.1-8B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json
```
Direct connection:
```
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 343.76
Total input tokens: 215196
Total generated tokens: 197107
Request throughput (req/s): 2.91
Output token throughput (tok/s): 573.38
Total Token throughput (tok/s): 1199.39
---------------Time to First Token----------------
Mean TTFT (ms): 140519.21
Median TTFT (ms): 138940.03
P99 TTFT (ms): 294126.89
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 159.13
Median TPOT (ms): 139.51
P99 TPOT (ms): 603.73
---------------Inter-token Latency----------------
Mean ITL (ms): 138.68
Median ITL (ms): 95.72
P99 ITL (ms): 617.65
==================================================
```
Through the proxy:
```
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 357.18
Total input tokens: 215196
Total generated tokens: 198054
Request throughput (req/s): 2.80
Output token throughput (tok/s): 554.49
Total Token throughput (tok/s): 1156.96
---------------Time to First Token----------------
Mean TTFT (ms): 138294.06
Median TTFT (ms): 126404.99
P99 TTFT (ms): 300286.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 152.34
Median TPOT (ms): 138.11
P99 TPOT (ms): 455.29
---------------Inter-token Latency----------------
Mean ITL (ms): 135.52
Median ITL (ms): 95.46
P99 ITL (ms): 608.47
==================================================
```
Through the proxy, total token throughput is about 96.5% of the direct-connection result (1156.96 vs. 1199.39 tok/s).
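The 96.5% figure matches the ratio of the two total token throughput values reported above. A short sketch of that arithmetic, using the numbers from both runs (the `relativePercent` helper is hypothetical):

```javascript
// Express the proxied result as a percentage of the direct result.
function relativePercent(proxied, direct) {
  return (proxied / direct) * 100;
}

// Throughput values copied from the two benchmark reports (tok/s).
const directRun = { totalTokens: 1199.39, outputTokens: 573.38 };
const proxiedRun = { totalTokens: 1156.96, outputTokens: 554.49 };

console.log(relativePercent(proxiedRun.totalTokens, directRun.totalTokens).toFixed(1));   // prints 96.5
console.log(relativePercent(proxiedRun.outputTokens, directRun.outputTokens).toFixed(1)); // prints 96.7
```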
Typical resource consumption under load:
```bash
$ kubectl top pod --containers=true vllm-66855dfbf7-m5njg
POD                     NAME               CPU(cores)   MEMORY(bytes)
vllm-66855dfbf7-m5njg   inference-server   850m         7732Mi
vllm-66855dfbf7-m5njg   vllm-token-stats   83m          284Mi
```
# vLLM monitoring
Use [dashboard.json](dashboard.json) in Google Cloud Monitoring to view token usage.
