https://github.com/sukhrobnurali/relay-gateway

OpenAI-compatible gateway for self-hosted LLMs (vLLM, Ollama) with cloud fallback. Per-key auth, Redis sliding-window rate limits, structured logs, Prometheus metrics.

# relay-gateway

[![CI](https://github.com/sukhrobnurali/relay-gateway/actions/workflows/ci.yml/badge.svg)](https://github.com/sukhrobnurali/relay-gateway/actions/workflows/ci.yml)
[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/)
[![Container](https://img.shields.io/badge/ghcr.io-relay--gateway-1f6feb)](https://github.com/sukhrobnurali/relay-gateway/pkgs/container/relay-gateway)

> Self-hosted-first, OpenAI-compatible gateway for **vLLM**, **Ollama**, and cloud fallback.
> Auth, rate limiting, multi-backend routing, observability — one `docker compose up`.

## What it does

Most teams running open-source LLMs end up writing the same boring wrapper
around their inference servers: an OpenAI-compatible HTTP front, API-key
auth, per-key rate limits, request logs, Prometheus metrics, a routing
table from "friendly model name" to upstream, and a way to fall back to a
cloud provider when local is busy.

`relay-gateway` is that wrapper, small enough to read in a sitting:

```
  clients (OpenAI SDK, curl, langchain, ...)
                       │
                       ▼
┌─────────────────────────────────────────────┐
│  FastAPI app (auth → ratelimit → routing)   │
│   - bearer token + per-key model scopes     │
│   - sliding-window rpm/tpm in Redis (Lua)   │
│   - prometheus /metrics, structlog access   │
└─────────────────────────────────────────────┘
      │                │                │
      ▼                ▼                ▼
vLLM (local)         Ollama   OpenAI (cloud fallback)
```

It is **deliberately not** a 100-provider gateway. It does vLLM and
Ollama well, treats OpenAI as a passthrough fallback, and stops there.
For a wider provider set see the [landscape doc](docs/landscape.md).
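
To make the "sliding-window rpm/tpm" idea in the diagram concrete, here is a minimal in-memory sketch. It is not the gateway's Redis/Lua implementation; the class name `SlidingWindowLimiter` and its parameters are hypothetical and chosen only for illustration. One counter covers both limits: rpm uses a cost of 1 per request, tpm uses the request's token count as the cost.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowLimiter:
    """Illustrative in-memory sliding window (hypothetical, not the gateway's code)."""

    def __init__(self, limit: int, window_s: float = 60.0) -> None:
        self.limit = limit        # max total cost allowed inside the window
        self.window_s = window_s  # window length in seconds
        self._events = deque()    # (timestamp, cost) pairs, oldest first
        self._used = 0            # sum of costs currently inside the window

    def allow(self, cost: int = 1, now: Optional[float] = None) -> bool:
        """Record `cost` and return True if it fits in the window, else False."""
        now = time.monotonic() if now is None else now
        # Evict events that have slid out of the window.
        while self._events and self._events[0][0] <= now - self.window_s:
            _, old_cost = self._events.popleft()
            self._used -= old_cost
        if self._used + cost > self.limit:
            return False
        self._events.append((now, cost))
        self._used += cost
        return True


rpm = SlidingWindowLimiter(limit=60)     # requests per minute: cost is 1
tpm = SlidingWindowLimiter(limit=30000)  # tokens per minute: cost is the token count
print(rpm.allow(), tpm.allow(cost=1200))
```

A Redis-backed version has to do the evict, check, and record steps atomically across gateway replicas, which is presumably why the diagram mentions Lua.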

## How it compares

| | relay-gateway | LiteLLM Proxy | Portkey | OpenRouter |
|-------------------------|---------------|---------------|---------|------------|
| Self-hosted-first | yes | yes | yes | no |
| OpenAI-compatible API | yes | yes | yes | yes |
| vLLM + Ollama focus | yes | yes (general) | partial | partial |
| Single-binary deploy | yes (docker) | yes | yes | n/a |
| Auditable code size | ~1.5 KLOC | ~30 KLOC | closed | closed |
| Per-key rpm + tpm | yes (Redis) | yes | yes | yes |
| Multi-backend routing | yes | yes | yes | yes |
| OSS license | Apache-2.0 | MIT | MIT | n/a |

The closest cousin is LiteLLM Proxy. Pick LiteLLM if you need 100+
providers and don't mind a heavier dependency surface; pick this if you
want something small you can read end-to-end before you trust it in front
of production.

## Quickstart

```sh
git clone https://github.com/sukhrobnurali/relay-gateway && cd relay-gateway

# Start the stack (the compose file co-launches Ollama), then pull a model into it.
docker compose -f docker/docker-compose.yml up -d
docker compose -f docker/docker-compose.yml exec ollama ollama pull qwen2.5:7b

# Health check
curl -s localhost:8080/healthz
# {"status": "ok"}

# Make a chat completion call as an OpenAI client would.
curl -s localhost:8080/v1/chat/completions \
  -H 'authorization: Bearer dev-key' \
  -H 'content-type: application/json' \
  -d '{
    "model": "qwen2.5",
    "messages": [{"role": "user", "content": "say hi in five words"}]
  }' | jq .
```

Using the OpenAI Python SDK is exactly what you'd expect:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="dev-key")
resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "say hi in five words"}],
)
print(resp.choices[0].message.content)
```

## Performance

Numbers from a same-host benchmark (Ryzen 7 5800X, 60 s @ 50 connections,
qwen2.5:7b Q4_K_M served by Ollama). See [docs/benchmarks.md](docs/benchmarks.md)
for full methodology.

| latency | direct ollama | gateway -> ollama | overhead |
|----------|---------------|-------------------|----------|
| p50 | 412 ms | 414 ms | +2 ms |
| p95 | 658 ms | 661 ms | +3 ms |
| p99 | 821 ms | 829 ms | +8 ms |
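
The overhead column is the gateway latency minus the direct latency. As a quick check, the sketch below recomputes it from the table and also expresses it relative to the direct latency; the helper name `overhead` is just for illustration.

```python
# Latencies in milliseconds, copied from the table above.
direct = {"p50": 412, "p95": 658, "p99": 821}
gateway = {"p50": 414, "p95": 661, "p99": 829}


def overhead(direct_ms: int, gateway_ms: int) -> tuple[int, float]:
    """Return (absolute overhead in ms, overhead as a percentage of direct latency)."""
    delta = gateway_ms - direct_ms
    return delta, 100 * delta / direct_ms


for p in direct:
    delta, pct = overhead(direct[p], gateway[p])
    print(f"{p}: +{delta} ms ({pct:.2f}%)")
# p50: +2 ms (0.49%)
# p95: +3 ms (0.46%)
# p99: +8 ms (0.97%)
```

In this benchmark the gateway adds under 1% at every reported percentile.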

## Configuration

YAML for shape, env vars for overrides. See [docs/configuration.md](docs/configuration.md)
for every key. Minimal example:

```yaml
# config.yaml
backends:
  ollama-local:
    type: ollama
    base_url: http://ollama:11434

models:
  qwen2.5:
    backend: ollama-local
    upstream_model: qwen2.5:7b
    fallbacks: [openai-fallback]

api_keys:
  - name: dev
    # generate with: uv run python -c "from argon2 import PasswordHasher as P; print(P().hash('your-key'))"
    hash: $argon2id$v=19$m=65536,t=3,p=4$...
    scopes: [qwen2.5]
    limits:
      rpm: 60
      tpm: 30000

default_limits:
  rpm: 30
  tpm: 10000

redis_url: redis://redis:6379/0
```
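
To show how the `models`, `api_keys`, and `scopes` keys fit together, here is a rough sketch of a scope check followed by a model lookup. It is not the gateway's code: `resolve_route` is a hypothetical helper, the config is mirrored as a plain dict rather than parsed from YAML, keys are looked up by name instead of by verifying the bearer token against the Argon2 hash, and it assumes `fallbacks` lists backend names tried in order after the primary.

```python
# Python mirror of the minimal config.yaml above (a real loader would parse the YAML).
CONFIG = {
    "backends": {"ollama-local": {"type": "ollama", "base_url": "http://ollama:11434"}},
    "models": {
        "qwen2.5": {
            "backend": "ollama-local",
            "upstream_model": "qwen2.5:7b",
            "fallbacks": ["openai-fallback"],
        }
    },
    "api_keys": [{"name": "dev", "scopes": ["qwen2.5"]}],
}


def resolve_route(config: dict, key_name: str, model: str) -> dict:
    """Hypothetical helper: check the key's scopes, then map model -> upstream."""
    key = next((k for k in config["api_keys"] if k["name"] == key_name), None)
    if key is None:
        raise PermissionError(f"unknown API key: {key_name}")
    if model not in key["scopes"]:
        raise PermissionError(f"key {key_name!r} is not scoped for {model!r}")
    entry = config["models"].get(model)
    if entry is None:
        raise LookupError(f"unknown model: {model}")
    return {
        "backend": entry["backend"],
        "upstream_model": entry["upstream_model"],
        # Assumption: fallbacks are backend names tried in order after the primary.
        "try_order": [entry["backend"], *entry.get("fallbacks", [])],
    }


route = resolve_route(CONFIG, "dev", "qwen2.5")
print(route["upstream_model"], route["try_order"])
# qwen2.5:7b ['ollama-local', 'openai-fallback']
```

See [docs/configuration.md](docs/configuration.md) for the actual semantics of each key.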

## Deployment

* **Docker compose** (single VM): see `docker/docker-compose.yml`.
* **Kubernetes**: example manifests in `examples/k8s/`. Full guide in
[docs/deployment.md](docs/deployment.md).
* **Image**: pulled from `ghcr.io/sukhrobnurali/relay-gateway:`.

## Development

```sh
uv sync
uv run pytest -q # full test suite
uv run ruff check .
uv run pyright src
uv run pytest --cov # coverage report
```

Architecture decisions and the why-this-not-that are in
[docs/architecture.md](docs/architecture.md). The competitive analysis
(circa Jan 2026) lives in [docs/landscape.md](docs/landscape.md).

## License

Apache-2.0. See [LICENSE](LICENSE).