https://github.com/sukhrobnurali/relay-gateway
OpenAI-compatible gateway for self-hosted LLMs (vLLM, Ollama) with cloud fallback. Per-key auth, Redis sliding-window rate limits, structured logs, Prometheus metrics.
fastapi gateway llm ollama openai-api self-hosted vllm
- Host: GitHub
- URL: https://github.com/sukhrobnurali/relay-gateway
- Owner: sukhrobnurali
- License: apache-2.0
- Created: 2026-05-09T19:39:41.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-09T19:58:48.000Z (about 2 months ago)
- Last Synced: 2026-06-20T11:32:40.775Z (5 days ago)
- Topics: fastapi, gateway, llm, ollama, openai-api, self-hosted, vllm
- Language: Python
- Homepage:
- Size: 192 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
README
# relay-gateway
> Self-hosted-first, OpenAI-compatible gateway for **vLLM**, **Ollama**, and cloud fallback.
> Auth, rate limiting, multi-backend routing, observability — one `docker compose up`.
## What it does
Most teams running open-source LLMs end up writing the same boring wrapper
around their inference servers: an OpenAI-compatible HTTP front end, API-key
auth, per-key rate limits, request logs, Prometheus metrics, a routing
table from "friendly model name" to upstream, and a way to fall back to a
cloud provider when the local backend is busy.
`relay-gateway` is that wrapper, small enough to read in a sitting:
```
  clients (OpenAI SDK, curl, langchain, ...)
                       │
                       ▼
┌─────────────────────────────────────────────┐
│  FastAPI app (auth → ratelimit → routing)   │
│    - bearer token + per-key model scopes    │
│    - sliding-window rpm/tpm in Redis (Lua)  │
│    - prometheus /metrics, structlog access  │
└─────────────────────────────────────────────┘
     │                │                 │
     ▼                ▼                 ▼
vLLM (local)       Ollama    OpenAI (cloud fallback)
```
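To make the "sliding-window rpm/tpm" box concrete, here is a minimal in-memory sketch of a sliding-window log limiter. It is an illustration only: the gateway keeps this state in Redis and evaluates it in a Lua script, whereas the class below holds it in a Python dict, and the names `SlidingWindowLimiter` and `allow` are hypothetical rather than taken from the codebase. A request costs 1 against an rpm limit; against a tpm limit the cost would be its token count.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowLimiter:
    """In-memory sliding-window log: one deque of (timestamp, cost) per key."""

    def __init__(self, limit: int, window_s: float = 60.0):
        self.limit = limit          # max total cost allowed inside one window
        self.window_s = window_s    # window length in seconds
        self._events = {}           # key -> deque of (timestamp, cost)

    def allow(self, key: str, cost: int = 1, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        events = self._events.setdefault(key, deque())
        # Evict entries that have slid out of the window.
        while events and events[0][0] <= now - self.window_s:
            events.popleft()
        used = sum(c for _, c in events)
        if used + cost > self.limit:
            return False            # rejected requests are not recorded
        events.append((now, cost))
        return True


# rpm-style use: 3 requests per 60 s for the key "dev".
rpm = SlidingWindowLimiter(limit=3)
print([rpm.allow("dev", now=t) for t in (0, 10, 20, 30, 61)])
# -> [True, True, True, False, True]
```

The fourth request is rejected because three requests are still inside the window; by `t=61` the request from `t=0` has expired, so capacity is available again. Moving the evict, sum, and append steps into a single server-side script is what would make the check atomic across several gateway processes.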
It is **deliberately not** a 100-provider gateway. It does vLLM and
Ollama well, treats OpenAI as a passthrough fallback, and stops there.
For a wider provider set see the [landscape doc](docs/landscape.md).
## How it compares
| Feature                 | relay-gateway | LiteLLM Proxy | Portkey | OpenRouter |
|-------------------------|---------------|---------------|---------|------------|
| Self-hosted-first | yes | yes | yes | no |
| OpenAI-compatible API | yes | yes | yes | yes |
| vLLM + Ollama focus | yes | yes (general) | partial | partial |
| Single-binary deploy | yes (docker) | yes | yes | n/a |
| Auditable code size | ~1.5 KLOC | ~30 KLOC | closed | closed |
| Per-key rpm + tpm | yes (Redis) | yes | yes | yes |
| Multi-backend routing | yes | yes | yes | yes |
| OSS license | Apache-2.0 | MIT | MIT | n/a |
The closest cousin is LiteLLM Proxy. Pick LiteLLM if you need 100+
providers and don't mind a heavier dependency surface; pick this if you
want something small you can read end-to-end before you trust it in front
of production.
## Quickstart
```sh
git clone https://github.com/sukhrobnurali/relay-gateway && cd relay-gateway
# Start the stack (the compose file co-launches Ollama), then pull a model into it.
docker compose -f docker/docker-compose.yml up -d
docker compose -f docker/docker-compose.yml exec ollama ollama pull qwen2.5:7b
# Health check
curl -s localhost:8080/healthz
# {"status": "ok"}
# Send a chat completion request, exactly as you would to the OpenAI API.
curl -s localhost:8080/v1/chat/completions \
  -H 'authorization: Bearer dev-key' \
  -H 'content-type: application/json' \
  -d '{
    "model": "qwen2.5",
    "messages": [{"role": "user", "content": "say hi in five words"}]
  }' | jq .
```
The OpenAI Python SDK works unchanged once `base_url` points at the gateway:
```python
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dev-key")
resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "say hi in five words"}],
)
print(resp.choices[0].message.content)
```
## Performance
The numbers below come from a same-host benchmark (Ryzen 7 5800X, 60 s at 50 connections,
qwen2.5:7b Q4_K_M served by Ollama). See [docs/benchmarks.md](docs/benchmarks.md)
for the full methodology.
| latency | direct ollama | gateway -> ollama | overhead |
|----------|---------------|-------------------|----------|
| p50 | 412 ms | 414 ms | +2 ms |
| p95 | 658 ms | 661 ms | +3 ms |
| p99 | 821 ms | 829 ms | +8 ms |
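For readers who prefer relative figures, the overhead column can be restated as a share of the direct-to-Ollama latency. The snippet below only re-derives numbers from the table above; the helper name `overhead` is made up for the illustration.

```python
# Latencies in milliseconds, copied from the table above.
direct = {"p50": 412, "p95": 658, "p99": 821}
gateway = {"p50": 414, "p95": 661, "p99": 829}


def overhead(direct_ms: float, gateway_ms: float) -> tuple:
    """Return (added milliseconds, added latency as a percentage of direct)."""
    added = gateway_ms - direct_ms
    return added, 100 * added / direct_ms


for pct in direct:
    added, rel = overhead(direct[pct], gateway[pct])
    print(f"{pct}: +{added} ms ({rel:.2f}% of direct latency)")
```

At every percentile listed, the gateway adds less than 1% to the latency of calling Ollama directly.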
## Configuration
The YAML file defines the structure; environment variables override individual values. See [docs/configuration.md](docs/configuration.md)
for every key. Minimal example:
```yaml
# config.yaml
backends:
  ollama-local:
    type: ollama
    base_url: http://ollama:11434
models:
  qwen2.5:
    backend: ollama-local
    upstream_model: qwen2.5:7b
    fallbacks: [openai-fallback]
api_keys:
  - name: dev
    # generate with: uv run python -c "from argon2 import PasswordHasher as P; print(P().hash('your-key'))"
    hash: $argon2id$v=19$m=65536,t=3,p=4$...
    scopes: [qwen2.5]
    limits:
      rpm: 60
      tpm: 30000
default_limits:
  rpm: 30
  tpm: 10000
redis_url: redis://redis:6379/0
```
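One way to read this config is as three lookups: which backends to try for a model, whether a key may use that model, and which limits apply to the key. The sketch below shows that reading on a plain dict mirroring the example. It is a guess at the semantics, not the gateway's code: it assumes `fallbacks` nests under the model entry and is tried in order after `backend`, and that `default_limits` fills in whatever a key's own `limits` omits. The function names are hypothetical, and the `hash` field is left out because key verification is not shown.

```python
# Plain-dict mirror of the example config (hash omitted; auth is out of scope here).
CONFIG = {
    "backends": {
        "ollama-local": {"type": "ollama", "base_url": "http://ollama:11434"},
    },
    "models": {
        "qwen2.5": {
            "backend": "ollama-local",
            "upstream_model": "qwen2.5:7b",
            "fallbacks": ["openai-fallback"],
        },
    },
    "api_keys": [
        {"name": "dev", "scopes": ["qwen2.5"], "limits": {"rpm": 60, "tpm": 30000}},
    ],
    "default_limits": {"rpm": 30, "tpm": 10000},
}


def backend_order(config: dict, model: str) -> list:
    """Primary backend first, then fallbacks in the order listed (assumed)."""
    entry = config["models"][model]
    return [entry["backend"], *entry.get("fallbacks", [])]


def may_use(key: dict, model: str) -> bool:
    """A key may call a model only if the model is listed in its scopes."""
    return model in key.get("scopes", [])


def limits_for(config: dict, key: dict) -> dict:
    """Per-key limits override default_limits field by field (assumed)."""
    return {**config["default_limits"], **key.get("limits", {})}


dev = CONFIG["api_keys"][0]
print(backend_order(CONFIG, "qwen2.5"))  # ['ollama-local', 'openai-fallback']
print(may_use(dev, "qwen2.5"))           # True
print(limits_for(CONFIG, dev))           # {'rpm': 60, 'tpm': 30000}
```

See [docs/configuration.md](docs/configuration.md) for the actual behavior of each key.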
## Deployment
* **Docker compose** (single VM): see `docker/docker-compose.yml`.
* **Kubernetes**: example manifests in `examples/k8s/`. Full guide in
[docs/deployment.md](docs/deployment.md).
* **Image**: pulled from `ghcr.io/sukhrobnurali/relay-gateway:`.
## Development
```sh
uv sync
uv run pytest -q # full test suite
uv run ruff check .
uv run pyright src
uv run pytest --cov # coverage report
```
Architecture decisions and the why-this-not-that are in
[docs/architecture.md](docs/architecture.md). The competitive analysis
(circa Jan 2026) lives in [docs/landscape.md](docs/landscape.md).
## License
Apache-2.0. See [LICENSE](LICENSE).