{"id":28898839,"url":"https://github.com/ganochenkodg/vllm-token-stats","last_synced_at":"2026-04-29T19:32:42.160Z","repository":{"id":294456139,"uuid":"986961124","full_name":"ganochenkodg/vllm-token-stats","owner":"ganochenkodg","description":"Proxy for vLLM to expose token usage metrics.","archived":false,"fork":false,"pushed_at":"2025-05-20T13:45:53.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-20T14:33:51.338Z","etag":null,"topics":["fastify","prometheus","vllm"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ganochenkodg.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-20T11:24:06.000Z","updated_at":"2025-05-20T13:45:56.000Z","dependencies_parsed_at":"2025-05-20T14:33:56.535Z","dependency_job_id":"ccb997d1-7acf-4709-af59-f40070ecfc8a","html_url":"https://github.com/ganochenkodg/vllm-token-stats","commit_stats":null,"previous_names":["ganochenkodg/vllm-token-stats"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ganochenkodg/vllm-token-stats","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ganochenkodg%2Fvllm-token-stats","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ganochenkodg%2Fvllm-token-stats/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ganochenkodg%2Fvllm-token-stats/releases","manifests_url":"https://repos.ecosyste.ms/
api/v1/hosts/GitHub/repositories/ganochenkodg%2Fvllm-token-stats/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ganochenkodg","download_url":"https://codeload.github.com/ganochenkodg/vllm-token-stats/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ganochenkodg%2Fvllm-token-stats/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261088330,"owners_count":23107676,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["fastify","prometheus","vllm"],"created_at":"2025-06-21T08:00:29.922Z","updated_at":"2026-04-29T19:32:42.154Z","avatar_url":"https://github.com/ganochenkodg.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# vllm-token-stats\n\nProxy for vLLM to expose token usage metrics.\n\n# architecture\n\nvllm-token-stats proxies incoming requests to vLLM and collects per-client token usage statistics. 
\nIt requires the following RBAC permissions to look up each client's name and namespace:\n\n```\n- apiGroups: [\"\"]\n  resources: [\"pods\"]\n  verbs: [\"get\", \"list\"]\n```\n\n```mermaid\ngraph TD\n    subgraph Deployment\n        direction LR\n        vllm[vLLM\u003cbr\u003ePort: 8000]\n        vllm_stats[vllm-token-stats\u003cbr\u003ePort: 3000]\n    end\n    user[Incoming Requests]\n    prometheus[Prometheus Metrics\u003cbr\u003e]\n\n    user -- Requests\u003cbr\u003e/v1 --\u003e vllm_stats\n    vllm_stats -- Proxies to\u003cbr\u003e/v1 --\u003e vllm\n    vllm_stats -- /metrics --\u003e prometheus\n```\n\nIt exposes metrics in Prometheus format, for example:\n\n```\n# HELP vllm_prompt_tokens Prompt tokens used by VLLM\n# TYPE vllm_prompt_tokens counter\nvllm_prompt_tokens{namespace=\"default\",client_name=\"sh\",full_path=\"/v1/completions\",hostname=\"vllm-66855dfbf7-m5njg\"} 6\nvllm_prompt_tokens{namespace=\"test\",client_name=\"benchmark\",full_path=\"/v1/completions\",hostname=\"vllm-66855dfbf7-m5njg\"} 406149\n\n# HELP vllm_completion_tokens Completion tokens produced by VLLM\n# TYPE vllm_completion_tokens counter\nvllm_completion_tokens{namespace=\"default\",client_name=\"sh\",full_path=\"/v1/completions\",hostname=\"vllm-66855dfbf7-m5njg\"} 100\nvllm_completion_tokens{namespace=\"test\",client_name=\"benchmark\",full_path=\"/v1/completions\",hostname=\"vllm-66855dfbf7-m5njg\"} 359592\n\n# HELP vllm_total_tokens Total tokens processed by VLLM\n# TYPE vllm_total_tokens counter\nvllm_total_tokens{namespace=\"default\",client_name=\"sh\",full_path=\"/v1/completions\",hostname=\"vllm-66855dfbf7-m5njg\"} 106\nvllm_total_tokens{namespace=\"test\",client_name=\"benchmark\",full_path=\"/v1/completions\",hostname=\"vllm-66855dfbf7-m5njg\"} 765741\n```\n\n# installation\n\nYou can install the example YAML manifest, which includes all required components (a Deployment with vLLM and the proxy, a Service, RBAC, and a PodMonitoring resource), in a GKE Autopilot cluster:\n\n```\nkubectl apply -f 
https://raw.githubusercontent.com/ganochenkodg/vllm-token-stats/refs/heads/main/vllm-l4.yaml\n```\n\nExample output:\n\n```\nserviceaccount/log-proxy-sa created\nclusterrole.rbac.authorization.k8s.io/log-proxy-cluster-role created\nclusterrolebinding.rbac.authorization.k8s.io/log-proxy-cluster-role-binding created\ndeployment.apps/vllm created\nservice/vllm-endpoint created\npodmonitoring.monitoring.googleapis.com/vllm-token-stats created\n```\n\n# performance\n\nThe performance difference between connecting to vLLM directly and going through vllm-token-stats is small.\n\nBenchmark results for a g2-standard-4 node with one NVIDIA L4 GPU and the Llama-3.1-8B-Instruct model:\n\n```\n# Proxied run (port 3000). For the direct run, use --base-url http://vllm-endpoint.default.svc:8000 instead.\npython3 benchmark_serving.py \\\n  --backend openai \\\n  --base-url http://vllm-endpoint.default.svc:3000 \\\n  --model unsloth/Meta-Llama-3.1-8B-Instruct \\\n  --dataset-name sharegpt \\\n  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json\n```\n\nDirect connection:\n\n```\n============ Serving Benchmark Result ============\nSuccessful requests:                     1000\nBenchmark duration (s):                  343.76\nTotal input tokens:                      215196\nTotal generated tokens:                  197107\nRequest throughput (req/s):              2.91\nOutput token throughput (tok/s):         573.38\nTotal Token throughput (tok/s):          1199.39\n---------------Time to First Token----------------\nMean TTFT (ms):                          140519.21\nMedian TTFT (ms):                        138940.03\nP99 TTFT (ms):                           294126.89\n-----Time per Output Token (excl. 
1st token)------\nMean TPOT (ms):                          159.13\nMedian TPOT (ms):                        139.51\nP99 TPOT (ms):                           603.73\n---------------Inter-token Latency----------------\nMean ITL (ms):                           138.68\nMedian ITL (ms):                         95.72\nP99 ITL (ms):                            617.65\n==================================================\n```\n\nThrough the proxy:\n\n```\n============ Serving Benchmark Result ============\nSuccessful requests:                     1000\nBenchmark duration (s):                  357.18\nTotal input tokens:                      215196\nTotal generated tokens:                  198054\nRequest throughput (req/s):              2.80\nOutput token throughput (tok/s):         554.49\nTotal Token throughput (tok/s):          1156.96\n---------------Time to First Token----------------\nMean TTFT (ms):                          138294.06\nMedian TTFT (ms):                        126404.99\nP99 TTFT (ms):                           300286.19\n-----Time per Output Token (excl. 
1st token)------\nMean TPOT (ms):                          152.34\nMedian TPOT (ms):                        138.11\nP99 TPOT (ms):                           455.29\n---------------Inter-token Latency----------------\nMean ITL (ms):                           135.52\nMedian ITL (ms):                         95.46\nP99 ITL (ms):                            608.47\n==================================================\n```\n\nThrough the proxy, throughput is roughly 96.5% of the direct-connection result.\n\nTypical resource consumption under load:\n\n```bash\n$ kubectl top pod --containers=true vllm-66855dfbf7-m5njg\nPOD                     NAME               CPU(cores)   MEMORY(bytes)\nvllm-66855dfbf7-m5njg   inference-server   850m         7732Mi\nvllm-66855dfbf7-m5njg   vllm-token-stats   83m          284Mi\n```\n\n# vLLM monitoring\n\nUse [dashboard.json](dashboard.json) in Google Cloud Monitoring to see token usage.\n\n![Dashboard](dashboard.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fganochenkodg%2Fvllm-token-stats","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fganochenkodg%2Fvllm-token-stats","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fganochenkodg%2Fvllm-token-stats/lists"}