https://github.com/ictup/llm-serving-gateway-vllm

OpenAI-compatible LLM serving gateway with vLLM, FastAPI, Redis quotas, Prometheus/Grafana, GPU metrics, benchmarks, Docker, Kubernetes, Helm, GitOps, Terraform, and CI/CD.
Host: GitHub
URL: https://github.com/ictup/llm-serving-gateway-vllm
Owner: ictup
Created: 2026-05-18T08:34:08.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-05-19T23:19:06.000Z (2 months ago)
Last Synced: 2026-05-20T02:32:45.602Z (2 months ago)
Topics: ai-gateway, ai-infrastructure, argocd, benchmarking, docker, fastapi, gitops, grafana, helm, kubernetes, llm, llm-gateway, llmops, model-serving, openai-compatible, openai-proxy, prometheus, redis, terraform, vllm
Language: Python
Homepage:
Size: 288 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Security: docs/security.md

# OpenAI-Compatible LLM Serving Gateway

[![CI](https://github.com/ictup/llm-serving-gateway-vllm/actions/workflows/ci.yml/badge.svg)](https://github.com/ictup/llm-serving-gateway-vllm/actions/workflows/ci.yml)
[![Security](https://github.com/ictup/llm-serving-gateway-vllm/actions/workflows/security.yml/badge.svg)](https://github.com/ictup/llm-serving-gateway-vllm/actions/workflows/security.yml)
[![Release](https://github.com/ictup/llm-serving-gateway-vllm/actions/workflows/release.yml/badge.svg)](https://github.com/ictup/llm-serving-gateway-vllm/actions/workflows/release.yml)
[![Latest Release](https://img.shields.io/github/v/release/ictup/llm-serving-gateway-vllm?sort=semver)](https://github.com/ictup/llm-serving-gateway-vllm/releases)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)
[![Python 3.11+](https://img.shields.io/badge/Python-3.11%2B-3776AB?logo=python&logoColor=white)](pyproject.toml)
[![FastAPI](https://img.shields.io/badge/FastAPI-OpenAI--compatible-009688?logo=fastapi&logoColor=white)](gateway/app/main.py)
[![vLLM](https://img.shields.io/badge/vLLM-GPU%20serving-6A5ACD)](docker-compose.gpu.yml)
[![Kubernetes + Helm](https://img.shields.io/badge/Kubernetes%20%2B%20Helm-ready-326CE5?logo=kubernetes&logoColor=white)](deploy/helm)
[![Observability](https://img.shields.io/badge/Observability-Prometheus%20%2B%20Grafana-orange)](monitoring)

Production-style FastAPI gateway for OpenAI-compatible LLM serving. It sits in
front of mock or vLLM backends and adds the platform layer that a raw model
server does not own: API keys, request IDs, model aliases, weighted routing,
Redis-backed RPM/TPM/concurrency limits, streaming metrics, Prometheus,
Grafana, GPU telemetry, Docker, Kubernetes, Helm, GitOps, Terraform,
supply-chain checks, release automation, and repeatable direct-vs-gateway
benchmarks.

This is a portfolio-grade AI infrastructure project. It is designed to show
how an LLM serving gateway is built, operated, benchmarked, secured, and
released without pretending to be a full enterprise GPU scheduler.

## Why This Project Exists

vLLM already exposes an OpenAI-compatible server. This project answers the
next platform question: what do you put around that server when teams need a
stable API contract, quotas, routing, metrics, deployment automation, and
release discipline?

The Gateway keeps those concerns outside model execution:

- Clients call one OpenAI-compatible `/v1` API.
- vLLM or the mock backend handles model responses.
- Redis enforces RPM, TPM, and concurrent request limits.
- Prometheus and Grafana expose request, streaming, vLLM, and GPU behavior.
- Docker, Kubernetes, Helm, Argo CD, and Terraform describe deployment paths.
- CI, security scans, SBOM/provenance, GHCR publishing, and SemVer releases
  make the repository behave like a maintained production project.
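To make the quota bullet concrete, here is a minimal fixed-window sketch of RPM, TPM, and concurrency checks. It is an illustration under assumptions, not the repository's implementation: the `QuotaLimiter` class, its parameters, and the fixed-window strategy are hypothetical, and a plain dict stands in for the Redis counters.

```python
import time
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class QuotaLimiter:
    """Hypothetical fixed-window limiter; dicts stand in for Redis counters."""

    rpm: int
    tpm: int
    max_concurrent: int
    _counters: dict = field(default_factory=dict)
    _in_flight: dict = field(default_factory=dict)

    def acquire(self, api_key: str, tokens: int, now: Optional[float] = None) -> Tuple[bool, str]:
        """Return (allowed, reason); reason names the limit that rejected the request."""
        now = time.time() if now is None else now
        window = int(now // 60)  # one counter bucket per minute
        req_key = (api_key, "rpm", window)
        tok_key = (api_key, "tpm", window)

        if self._in_flight.get(api_key, 0) >= self.max_concurrent:
            return False, "concurrency"
        if self._counters.get(req_key, 0) + 1 > self.rpm:
            return False, "rpm"
        if self._counters.get(tok_key, 0) + tokens > self.tpm:
            return False, "tpm"

        # All checks passed: record the request. With Redis this would
        # typically be INCR / INCRBY on keys that expire with the window.
        self._counters[req_key] = self._counters.get(req_key, 0) + 1
        self._counters[tok_key] = self._counters.get(tok_key, 0) + tokens
        self._in_flight[api_key] = self._in_flight.get(api_key, 0) + 1
        return True, "ok"

    def release(self, api_key: str) -> None:
        """Call when the request finishes so the concurrency slot is freed."""
        self._in_flight[api_key] = max(0, self._in_flight.get(api_key, 0) - 1)
```

A caller would pair `acquire` with `release` in a `try`/`finally` around the proxied request. A real Redis-backed version also needs the check and the increment to be atomic, which this single-process sketch does not address.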

## Recruiter Snapshot

| Signal | Evidence |
| --- | --- |
| AI platform engineering | OpenAI-compatible Gateway, vLLM backend, streaming SSE, model routing, token-aware quotas |
| Production operations | Redis rate limits, readiness/warmup, structured logs, Prometheus metrics, Grafana dashboards, alert rules |
| Performance discipline | Direct-vs-Gateway benchmark runner, TTFT, ITL, TPOT, p95/p99, error-rate, tokenizer-level output token metrics |
| Deployment maturity | Docker Compose, K8s overlays, Helm chart, Argo CD examples, Terraform entry point |
| Delivery hygiene | CI, GHCR image publishing, Trivy/pip-audit, SBOM/provenance, Dependabot, SemVer release workflow |

In short, this is more than an inference demo. It also covers benchmarking,
observability, and the platform scope recruiters expect: GitOps, Terraform,
supply-chain checks, and release automation.

## Feature Overview

| Area | Implemented |
| --- | --- |
| API surface | `/v1/models`, `/v1/chat/completions`, OpenAI-compatible request/response schemas |
| Streaming | Server-Sent Events proxying with time-to-first-token measurement |
| Backends | No-GPU mock backend for CI and local demos, vLLM OpenAI server for CUDA serving |
| Routing | Model aliases, backend model mapping, weighted canary routes, fallback targets |
| Auth and safety | API key auth, request IDs, request body limits, chat message limits, normalized errors |
| Rate limiting | Redis-backed RPM, tokenizer-aware TPM, and concurrent request limits |
| Observability | Prometheus metrics, structured JSON logs, Grafana dashboards, alert rules |
| GPU telemetry | DCGM exporter wiring for GPU utilization and framebuffer memory |
| Benchmarking | Async direct-vLLM vs Gateway runs with RPS, latency, TTFT, ITL, TPOT, output tokens/sec, p95/p99, error rate |
| Deployment | Docker Compose, Kubernetes base and GPU overlays, Helm, Argo CD, Terraform skeleton |
| Supply chain | CI, Trivy, pip-audit, SBOM, provenance, Dependabot, GHCR image publishing |
| Release engineering | SemVer validation, changelog, release workflow, versioned GitHub Releases |
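The streaming row can be illustrated from the client side. The sketch below parses OpenAI-style Server-Sent Events lines and yields the content deltas. It assumes the Gateway forwards chunks in the usual OpenAI chat-completions streaming shape (`data: {...}` lines ending with `data: [DONE]`); the function name `iter_sse_deltas` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style chat completion SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments such as ": ping"
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by OpenAI-compatible servers
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {}).get("content")
            if delta:
                yield delta
```

Recording a timestamp when the first delta arrives gives a client-side time-to-first-token reading that can be compared with the Gateway's own measurement.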

## Architecture

```mermaid
flowchart LR
    Client["OpenAI SDK<br/>curl<br/>RAG app<br/>agent runtime"] --> Gateway["FastAPI Gateway<br/>OpenAI-compatible /v1 API"]
    Gateway --> Platform["Platform controls<br/>auth, request ID<br/>limits, logs, metrics"]
    Platform --> Router["Model router<br/>aliases, weights, fallback"]
    Router --> Mock["Mock backend<br/>CI and no-GPU demo"]
    Router --> VLLM["vLLM OpenAI server<br/>CUDA serving"]
    Platform --> Redis["Redis<br/>RPM, TPM, concurrency"]
    Platform --> Prometheus["Prometheus<br/>Gateway and vLLM metrics"]
    VLLM --> DCGM["DCGM exporter<br/>GPU util and memory"]
    DCGM --> Prometheus
    Prometheus --> Grafana["Grafana dashboards<br/>Gateway, vLLM, GPU"]
```

The important design choice is separation of concerns. vLLM executes the model;
the Gateway owns client-facing policy, routing, limits, observability, and the
operational contract.
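As one way to picture the router's role in that split, here is a minimal sketch of alias resolution with weighted selection and a fallback target. The `Target` type, the `ROUTES` and `FALLBACKS` tables, the `vllm-canary` backend name, and the 90/10 weights are all hypothetical; the repository's real routing configuration may look quite different.

```python
import random
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Target:
    backend: str        # illustrative backend label, e.g. "vllm" or "mock"
    backend_model: str  # model name the backend actually serves
    weight: int = 1


# Hypothetical alias table: 90% of traffic to stable, 10% to a canary.
ROUTES = {
    "qwen-small": [
        Target("vllm", "Qwen/Qwen2.5-0.5B-Instruct", weight=90),
        Target("vllm-canary", "Qwen/Qwen2.5-0.5B-Instruct", weight=10),
    ],
}
FALLBACKS = {"qwen-small": Target("mock", "mock")}


def pick_target(alias: str, healthy: Set[str], rng: Optional[random.Random] = None) -> Optional[Target]:
    """Pick a healthy weighted target for an alias, else its fallback, else None."""
    chooser = rng if rng is not None else random
    candidates = [t for t in ROUTES.get(alias, []) if t.backend in healthy]
    if candidates:
        weights = [t.weight for t in candidates]
        return chooser.choices(candidates, weights=weights, k=1)[0]
    fallback = FALLBACKS.get(alias)
    if fallback is not None and fallback.backend in healthy:
        return fallback
    return None
```

Keeping the client-facing alias (`qwen-small`) separate from `backend_model` is what lets a backend be swapped or canaried without clients changing the `model` field they send.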

## Quick Start: No GPU

The mock backend makes the platform reproducible on a laptop and in CI.

Print the portfolio demo walkthrough:

```bash
uv run python scripts/demo_portfolio.py
```

Run the no-GPU live demo:

```bash
uv run python scripts/demo_portfolio.py --execute-local
```

Install the locked dependencies and run the local end-to-end script:

```bash
uv sync --frozen --all-groups
uv run python scripts/local_e2e.py
```

Run the quality gate:

```bash
uv run ruff check .
uv run pytest
```

Start the full no-GPU local stack:

```bash
docker compose up --build
```

| Service | URL |
| --- | --- |
| Gateway | http://localhost:8080 |
| Mock backend | http://localhost:9000 |
| Prometheus | http://localhost:9090 |
| Grafana | http://localhost:3000 |
| Redis | `localhost:6379` |

Grafana defaults to `admin` / `admin`.

## Quick Start: GPU vLLM

Use the GPU override when Docker can access an NVIDIA runtime:

```powershell
$env:VLLM_MODEL="Qwen/Qwen2.5-0.5B-Instruct"
$env:VLLM_IMAGE_TAG="v0.8.5.post1"
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
```

Warm up the Gateway and run an OpenAI SDK smoke test:

```powershell
uv run python scripts/warmup_gateway.py --model qwen-small
$env:OPENAI_BASE_URL="http://localhost:8080/v1"
$env:OPENAI_API_KEY="dev-key"
$env:LLM_MODEL="qwen-small"
uv run python benchmark/client_smoke_test.py
```

The default GPU model is intentionally small because it has been validated on
an 8GB RTX 4060 Laptop GPU. Larger models can be selected by overriding
`VLLM_MODEL` on machines with enough free GPU memory.

## API Example

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer dev-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mock",
    "messages": [
      {
        "role": "user",
        "content": "Explain TTFT in one sentence."
      }
    ],
    "stream": false
  }'
```

For streaming examples, model listing, error shapes, and health checks, see
[docs/api_usage.md](docs/api_usage.md).
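The same non-streaming request can be sketched in Python with only the standard library. The helper names `build_chat_request` and `chat` are hypothetical, and the response parsing assumes the standard OpenAI chat-completions shape (`choices[0].message.content`), which an OpenAI-compatible Gateway is expected to return.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "mock",
    base_url: str = "http://localhost:8080/v1",
    api_key: str = "dev-key",
) -> urllib.request.Request:
    """Build the POST request equivalent to the curl example."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the request to a running local stack and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=30) as resp:
        payload = json.load(resp)
    # Assumes the OpenAI chat-completions response schema.
    return payload["choices"][0]["message"]["content"]
```

Calling `chat("Explain TTFT in one sentence.")` requires the local stack from the quick start to be running; `build_chat_request` alone can be inspected offline.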

## Benchmark Snapshot

Portfolio profile on a local RTX 4060 Laptop GPU with 100 measured streaming
requests per concurrency level:

| Concurrency | Direct RPS | Gateway RPS | Direct P95 Latency | Gateway P95 Latency | Gateway P50 TTFT |
| ---: | ---: | ---: | ---: | ---: | ---: |
| 1 | 1.73 | 1.98 | 1043.08 ms | 896.34 ms | 39.27 ms |
| 4 | 5.71 | 6.31 | 1256.95 ms | 1119.99 ms | 50.63 ms |
| 8 | 9.45 | 10.14 | 1474.63 ms | 1368.80 ms | 54.08 ms |
| 16 | 13.58 | 14.19 | 2004.51 ms | 1902.34 ms | 70.16 ms |
| 32 | 17.87 | 16.11 | 3121.88 ms | 3529.36 ms | 169.88 ms |

Both direct and Gateway paths completed with zero errors. Gateway-faster rows
should be read as local run variance and "no obvious Gateway bottleneck", not
as proof that the Gateway accelerates vLLM.
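To read the snapshot as relative differences rather than raw numbers, a small calculation like the following can help. The RPS values are copied from the table above; the function names are hypothetical and this is not part of the repository's report tooling.

```python
from typing import Dict, List, Tuple

# (concurrency, direct_rps, gateway_rps) copied from the snapshot table.
ROWS: List[Tuple[int, float, float]] = [
    (1, 1.73, 1.98),
    (4, 5.71, 6.31),
    (8, 9.45, 10.14),
    (16, 13.58, 14.19),
    (32, 17.87, 16.11),
]


def rps_delta_pct(direct_rps: float, gateway_rps: float) -> float:
    """Relative Gateway RPS difference in percent; negative means the Gateway was slower."""
    return (gateway_rps - direct_rps) / direct_rps * 100.0


def summarize(rows: List[Tuple[int, float, float]] = ROWS) -> Dict[int, float]:
    """Map each concurrency level to the rounded relative RPS difference."""
    return {c: round(rps_delta_pct(d, g), 1) for c, d, g in rows}


if __name__ == "__main__":
    for concurrency, delta in summarize().items():
        print(f"concurrency={concurrency:>2}  gateway vs direct RPS: {delta:+.1f}%")
```

With these numbers the Gateway path shows higher RPS in the first four rows and is roughly 9.8% lower at concurrency 32, the only level where it trails the direct path.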

Full report: [docs/gateway_overhead_report.md](docs/gateway_overhead_report.md)

## Run Direct vs Gateway Benchmarks

Run direct vLLM:

```bash
uv run python benchmark/run_benchmark.py \
  --profile portfolio \
  --base-url http://localhost:8000/v1 \
  --api-key local-vllm-key \
  --model Qwen/Qwen2.5-0.5B-Instruct \
  --prompts benchmark/prompts/short_prompts.jsonl \
  --output-tokenizer-path D:/models/qwen-tokenizer.json \
  --timeout-seconds 120 \
  --stream true
```

For Gateway serving-capacity runs, raise the demo quota before starting the
Docker stack:

```powershell
$env:RATE_LIMIT_RPM="10000"
$env:RATE_LIMIT_TPM="2000000"
$env:RATE_LIMIT_CONCURRENT_REQUESTS="64"
docker compose -f docker-compose.yml -f docker-compose.gpu.yml up --build
```

Run through the Gateway:

```bash
uv run python benchmark/run_benchmark.py \
  --profile portfolio \
  --base-url http://localhost:8080/v1 \
  --api-key dev-key \
  --model qwen-small \
  --prompts benchmark/prompts/short_prompts.jsonl \
  --output-tokenizer-path D:/models/qwen-tokenizer.json \
  --timeout-seconds 120 \
  --stream true
```

Generate the comparison report:

```bash
uv run python benchmark/compare_results.py \
  --direct-result benchmark/results/.json \
  --gateway-result benchmark/results/.json \
  --prometheus-snapshot benchmark/results/.json \
  --prometheus-timeseries benchmark/results/.json \
  --output docs/gateway_overhead_report.md
```

The `portfolio` profile runs concurrency `1, 4, 8, 16, 32` with 100 measured
requests per level and 10 warmup requests. Supplying
`--output-tokenizer-path` adds tokenizer-level output tokens/sec and TPOT. Use
`--profile stress` for 1000 requests per level after the local GPU path is
stable.
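For readers new to the streaming metrics named here, the sketch below shows one common way to derive TTFT, mean inter-token latency (ITL), and TPOT from per-token arrival times, plus a nearest-rank percentile for p95/p99. These definitions are assumptions for illustration; the benchmark runner in this repository may compute them differently, for example by counting tokens with the supplied tokenizer rather than per streamed chunk.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence


def stream_metrics(start: float, token_times: Sequence[float]) -> Dict[str, float]:
    """Derive latency metrics (seconds) from request start and token arrival times."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - start                 # time to first token
    latency = token_times[-1] - start             # end-to-end latency
    gaps: List[float] = [b - a for a, b in zip(token_times, token_times[1:])]
    n = len(token_times)
    # TPOT: decode time spread over the tokens after the first one.
    tpot = (latency - ttft) / (n - 1) if n > 1 else 0.0
    return {
        "ttft": ttft,
        "itl_mean": mean(gaps) if gaps else 0.0,
        "tpot": tpot,
        "latency": latency,
    }


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

When every streamed chunk carries exactly one token, mean ITL and TPOT coincide; they diverge once chunks batch several tokens, which is one reason a tokenizer-level count is useful.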

See [docs/performance_benchmarking.md](docs/performance_benchmarking.md).

## Observability

The local and deployment assets expose three dashboard layers:

| Dashboard | What it shows |
| --- | --- |
| Gateway Overview | request rate, latency, errors, rejections, streaming TTFT, streaming duration |
| vLLM Engine Overview | running requests, waiting requests, KV cache pressure, prompt/generation tokens/sec |
| GPU Overview | DCGM GPU utilization and framebuffer memory usage |

Prometheus alert examples cover elevated Gateway error rate, p95 latency,
streaming TTFT, rejection rate, vLLM queued requests, and vLLM KV cache usage.

## Deployment Paths

| Target | Entry point | Purpose |
| --- | --- | --- |
| Docker Compose, no GPU | `docker-compose.yml` | Reproducible local demo |
| Docker Compose, vLLM | `docker-compose.gpu.yml` | Local CUDA-backed serving plus DCGM GPU metrics |
| Kubernetes base | `deploy/k8s` | Gateway, mock backend, Redis, Prometheus |
| Kubernetes GPU overlay | `deploy/k8s-gpu` | vLLM backend, vLLM metrics, DCGM scraping |
| Helm | `deploy/helm` | Parameterized mock or vLLM deployment |
| GitOps / Argo CD | `deploy/gitops` | Continuous sync examples for Helm releases |
| Terraform IaC | `deploy/terraform` | Namespace, Secret boundary, Argo CD Application entry point |

Validate manifests:

```bash
kubectl kustomize deploy/k8s
kubectl kustomize deploy/k8s-gpu
helm lint deploy/helm
helm template mini-llm deploy/helm --namespace mini-llm-serving
helm template mini-llm deploy/helm \
  --namespace mini-llm-serving \
  --set vllm.enabled=true \
  --set mockBackend.enabled=false \
  --set dcgmExporter.enabled=true
```

## Repository Guide

| Path | Purpose |
| --- | --- |
| `gateway/app` | FastAPI Gateway, auth, rate limiting, proxying, metrics |
| `serving/mock_backend` | OpenAI-compatible mock backend |
| `benchmark` | SDK smoke tests, async benchmark runner, report tools |
| `monitoring` | Prometheus config, alert rules, Grafana dashboards |
| `deploy/k8s` | No-GPU Kubernetes manifests |
| `deploy/k8s-gpu` | vLLM and DCGM Kubernetes overlay |
| `deploy/helm` | Helm chart for mock and vLLM modes |
| `deploy/gitops` | Argo CD Applications for mock and vLLM modes |
| `deploy/terraform` | Terraform root module for GitOps cluster entry points |
| `docs` | API, configuration, design decisions, operations, reports |

## Verified State

| Area | Status |
| --- | --- |
| No-GPU local path | Verified with mock backend and SDK smoke test |
| GPU path | Verified locally with Docker Desktop and NVIDIA GPU |
| CI | Python lint, tests, Helm lint, Helm template rendering |
| Kubernetes | Base and GPU overlays render with Kustomize |
| Helm | Mock and vLLM modes render successfully |
| Release | `v0.1.1` published with CI, security, release, and container workflows passing |
| External RAG app wiring | Intentionally out of scope |

GPU validation snapshot from May 19, 2026:

| Item | Value |
| --- | --- |
| GPU | NVIDIA GeForce RTX 4060 Laptop GPU, 8GB VRAM |
| vLLM image | `vllm/vllm-openai:v0.8.5.post1` |
| Served model | `Qwen/Qwen2.5-0.5B-Instruct` |
| Gateway alias | `qwen-small` |
| Result | Direct vLLM and Gateway streaming benchmarks completed with zero errors |

## Documentation

- [API usage](docs/api_usage.md)
- [Configuration matrix](docs/configuration.md)
- [Design decisions](docs/design_decisions.md)
- [Failure analysis](docs/failure_analysis.md)
- [Production hardening notes](docs/production_hardening.md)
- [GitOps deployment guide](docs/gitops_deployment.md)
- [Terraform IaC guide](deploy/terraform/README.md)
- [Security and supply chain](docs/security.md)
- [Portfolio demo playbook](docs/demo.md)
- [Release process](docs/release_process.md)
- [Gateway overhead report](docs/gateway_overhead_report.md)
- [Performance benchmarking guide](docs/performance_benchmarking.md)
- [Project status and acceptance checklist](docs/project_status.md)
- [Portfolio summary](docs/portfolio_summary.md)
- [RAG integration guide](docs/rag_integration.md)
- [Recommended GitHub repository metadata](docs/repository_metadata.md)

## License

This project is released under the [MIT License](LICENSE).