https://github.com/sukhrobnurali/relay-gateway

OpenAI-compatible gateway for self-hosted LLMs (vLLM, Ollama) with cloud fallback. Per-key auth, Redis sliding-window rate limits, structured logs, Prometheus metrics.

Host: GitHub
URL: https://github.com/sukhrobnurali/relay-gateway
Owner: sukhrobnurali
License: apache-2.0
Created: 2026-05-09T19:39:41.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-05-09T19:58:48.000Z (about 2 months ago)
Last Synced: 2026-06-20T11:32:40.775Z (5 days ago)
Topics: fastapi, gateway, llm, ollama, openai-api, self-hosted, vllm
Language: Python
Homepage:
Size: 192 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE


# relay-gateway

[![CI](https://github.com/sukhrobnurali/relay-gateway/actions/workflows/ci.yml/badge.svg)](https://github.com/sukhrobnurali/relay-gateway/actions/workflows/ci.yml)

[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)

[![Python](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/)

[![Container](https://img.shields.io/badge/ghcr.io-relay--gateway-1f6feb)](https://github.com/sukhrobnurali/relay-gateway/pkgs/container/relay-gateway)

> Self-hosted-first, OpenAI-compatible gateway for **vLLM**, **Ollama**, and cloud fallback.

> Auth, rate limiting, multi-backend routing, observability — one `docker compose up`.

## What it does

Most teams running open-source LLMs end up writing the same boring wrapper
around their inference servers: an OpenAI-compatible HTTP front, API-key
auth, per-key rate limits, request logs, Prometheus metrics, a routing
table from "friendly model name" to upstream, and a way to fall back to a
cloud provider when the local backend is busy.

`relay-gateway` is that wrapper, small enough to read in a sitting:

```
clients (OpenAI SDK, curl, langchain, ...)
        │
        ▼
┌─────────────────────────────────────────────┐
│ FastAPI app (auth → ratelimit → routing)    │
│  - bearer token + per-key model scopes      │
│  - sliding-window rpm/tpm in Redis (Lua)    │
│  - prometheus /metrics, structlog access    │
└─────────────────────────────────────────────┘
        │           │             │
        ▼           ▼             ▼
   vLLM (local)   Ollama       OpenAI (cloud fallback)
```

It is **deliberately not** a 100-provider gateway. It does vLLM and
Ollama well, treats OpenAI as a passthrough fallback, and stops there.
For a wider provider set see the [landscape doc](docs/landscape.md).

## How it compares

|                         | relay-gateway | LiteLLM Proxy | Portkey | OpenRouter |
|-------------------------|---------------|---------------|---------|------------|
| Self-hosted-first       | yes           | yes           | yes     | no         |
| OpenAI-compatible API   | yes           | yes           | yes     | yes        |
| vLLM + Ollama focus     | yes           | yes (general) | partial | partial    |
| Single-binary deploy    | yes (docker)  | yes           | yes     | n/a        |
| Auditable code size     | ~1.5 KLOC     | ~30 KLOC      | closed  | closed     |
| Per-key rpm + tpm       | yes (Redis)   | yes           | yes     | yes        |
| Multi-backend routing   | yes           | yes           | yes     | yes        |
| OSS license             | Apache-2.0    | MIT           | MIT     | n/a        |

The closest cousin is LiteLLM Proxy. Pick LiteLLM if you need 100+
providers and don't mind a heavier dependency surface; pick this if you
want something small you can read end-to-end before you trust it in front
of production.

## Quickstart

```sh
git clone https://github.com/sukhrobnurali/relay-gateway && cd relay-gateway

# Start the stack (the compose file co-launches Ollama), then pull a model.
docker compose -f docker/docker-compose.yml up -d
docker compose -f docker/docker-compose.yml exec ollama ollama pull qwen2.5:7b

# Health check
curl -s localhost:8080/healthz
# {"status": "ok"}

# Make a chat completion call, as you would against OpenAI.
curl -s localhost:8080/v1/chat/completions \
  -H 'authorization: Bearer dev-key' \
  -H 'content-type: application/json' \
  -d '{
    "model": "qwen2.5",
    "messages": [{"role": "user", "content": "say hi in five words"}]
  }' | jq .
```

Using the OpenAI Python SDK is exactly what you'd expect:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="dev-key")

resp = client.chat.completions.create(
    model="qwen2.5",
    messages=[{"role": "user", "content": "say hi in five words"}],
)

print(resp.choices[0].message.content)
```

## Performance

Numbers from a same-host benchmark (Ryzen 7 5800X, 60 s @ 50 connections,
qwen2.5:7b Q4_K_M served by Ollama). See [docs/benchmarks.md](docs/benchmarks.md)
for full methodology.

| latency  | direct ollama | gateway -> ollama | overhead |
|----------|---------------|-------------------|----------|
| p50      | 412 ms        | 414 ms            | +2 ms    |
| p95      | 658 ms        | 661 ms            | +3 ms    |
| p99      | 821 ms        | 829 ms            | +8 ms    |
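For readers who prefer relative numbers, the absolute overheads in the table can be turned into percentages with a few lines of arithmetic. This is only a worked calculation on the figures quoted above; the helper name `overhead` is made up for the example.

```python
# Latencies in milliseconds, copied from the benchmark table.
direct = {"p50": 412, "p95": 658, "p99": 821}
gateway = {"p50": 414, "p95": 661, "p99": 829}


def overhead(direct_ms: float, gateway_ms: float) -> tuple[float, float]:
    """Return (absolute overhead in ms, overhead relative to direct, in percent)."""
    delta = gateway_ms - direct_ms
    return delta, 100.0 * delta / direct_ms


for pct in direct:
    delta, rel = overhead(direct[pct], gateway[pct])
    print(f"{pct}: +{delta} ms ({rel:.2f}%)")
# p50: +2 ms (0.49%)
# p95: +3 ms (0.46%)
# p99: +8 ms (0.97%)
```

On these figures the gateway adds under 1% at every listed percentile.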

## Configuration

YAML for shape, env vars for overrides. See [docs/configuration.md](docs/configuration.md)
for every key. Minimal example:

```yaml
# config.yaml
backends:
  ollama-local:
    type: ollama
    base_url: http://ollama:11434

models:
  qwen2.5:
    backend: ollama-local
    upstream_model: qwen2.5:7b
    fallbacks: [openai-fallback]

api_keys:
  - name: dev
    # generate with: uv run python -c "from argon2 import PasswordHasher as P; print(P().hash('your-key'))"
    hash: $argon2id$v=19$m=65536,t=3,p=4$...
    scopes: [qwen2.5]
    limits:
      rpm: 60
      tpm: 30000

default_limits:
  rpm: 30
  tpm: 10000

redis_url: redis://redis:6379/0
```
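As a rough illustration of how the `models`, `scopes`, and `fallbacks` keys could fit together, the sketch below resolves a friendly model name to an upstream model and an ordered list of targets. It is an assumption about the semantics, not the gateway's code: the function `resolve_route` is hypothetical, the config is hand-copied into a dict rather than parsed from YAML, and fallback entries are treated as opaque names tried in order (`openai-fallback` is referenced in the minimal example but not defined there).

```python
# Hand-copied subset of the minimal config.yaml, as a plain dict.
CONFIG = {
    "backends": {
        "ollama-local": {"type": "ollama", "base_url": "http://ollama:11434"},
    },
    "models": {
        "qwen2.5": {
            "backend": "ollama-local",
            "upstream_model": "qwen2.5:7b",
            "fallbacks": ["openai-fallback"],
        },
    },
}


def resolve_route(config: dict, model: str, key_scopes: list[str]) -> tuple[str, list[str]]:
    """Return (upstream model name, targets in try-order) for a friendly model name."""
    # Assumed semantics: a key may only call models listed in its scopes.
    if model not in key_scopes:
        raise PermissionError(f"key is not scoped for model {model!r}")
    entry = config["models"].get(model)
    if entry is None:
        raise LookupError(f"unknown model {model!r}")
    # Primary backend first, then any fallbacks in the order they are listed.
    targets = [entry["backend"], *entry.get("fallbacks", [])]
    return entry["upstream_model"], targets


upstream, targets = resolve_route(CONFIG, "qwen2.5", key_scopes=["qwen2.5"])
print(upstream, targets)
# qwen2.5:7b ['ollama-local', 'openai-fallback']
```

See [docs/configuration.md](docs/configuration.md) for the authoritative meaning of each key.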

## Deployment

* **Docker Compose** (single VM): see `docker/docker-compose.yml`.
* **Kubernetes**: example manifests in `examples/k8s/`. Full guide in
  [docs/deployment.md](docs/deployment.md).
* **Image**: pulled from `ghcr.io/sukhrobnurali/relay-gateway:`.

## Development

```sh
uv sync
uv run pytest -q          # full test suite
uv run ruff check .
uv run pyright src
uv run pytest --cov       # coverage report
```

Architecture decisions and the why-this-not-that are in
[docs/architecture.md](docs/architecture.md). The competitive analysis
(circa Jan 2026) lives in [docs/landscape.md](docs/landscape.md).

## License

Apache-2.0. See [LICENSE](LICENSE).
