https://github.com/ericflo/modelrelay

Central HTTP LLM proxy that routes inference requests to authenticated remote workers over WebSocket — queueing, streaming, and cancellation included.

[![CI](https://github.com/ericflo/modelrelay/actions/workflows/ci.yml/badge.svg)](https://github.com/ericflo/modelrelay/actions/workflows/ci.yml)
[![Latest Release](https://img.shields.io/github/v/release/ericflo/modelrelay)](https://github.com/ericflo/modelrelay/releases/latest)
[![Coverage](https://codecov.io/gh/ericflo/modelrelay/branch/main/graph/badge.svg)](https://codecov.io/gh/ericflo/modelrelay)
[![crates.io](https://img.shields.io/crates/v/modelrelay-protocol)](https://crates.io/crates/modelrelay-protocol)
[![Minimum Rust Version](https://img.shields.io/badge/rustc-1.94+-orange.svg)](rust-toolchain.toml)
[![Documentation](https://img.shields.io/badge/docs-GitHub%20Pages-blue)](https://ericflo.github.io/modelrelay/)
[![MIT License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

# ModelRelay

**Stop configuring clients for every GPU box. Workers connect out; requests route in.**

You have GPU boxes running `llama-server` (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.

ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect *out* to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.

```
  Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
          │
          │  POST /v1/chat/completions
          │  POST /v1/messages
          ▼
  ┌──────────────────────┐
  │  modelrelay-server   │◄─── workers connect out (WebSocket)
  │  (one stable         │     no inbound ports needed on GPU boxes
  │   endpoint)          │
  └──────────────────────┘
          │  routes request to best available worker
          ▼
  ┌────────┐ ┌────────┐ ┌────────┐
  │worker-1│ │worker-2│ │worker-3│
  │ llama  │ │ ollama │ │  vllm  │  ← your GPU boxes,
  │ server │ │        │ │        │    anywhere on any network
  └────────┘ └────────┘ └────────┘
```
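
The routing step in the diagram can be pictured with a toy model. This is only a sketch in Python, not ModelRelay's implementation: the `Worker` class and `pick_worker` function are hypothetical names, and it assumes that "best available" means the worker with the lowest load relative to its `--max-concurrency`, which this README does not specify.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    name: str
    models: set[str]
    max_concurrency: int = 1   # mirrors the worker's --max-concurrency flag
    in_flight: int = 0         # current load, as reported by the worker

    def can_take(self, model: str) -> bool:
        return model in self.models and self.in_flight < self.max_concurrency


def pick_worker(workers: list[Worker], model: str) -> Optional[Worker]:
    """Return the least-loaded worker that serves `model`, or None to queue."""
    candidates = [w for w in workers if w.can_take(model)]
    if not candidates:
        return None  # the caller would enqueue the request instead
    # Assumption: "best available" means lowest load ratio; ties keep list order.
    return min(candidates, key=lambda w: w.in_flight / w.max_concurrency)


workers = [
    Worker("worker-1", {"llama3.2:3b"}, max_concurrency=1, in_flight=1),
    Worker("worker-2", {"llama3.2:3b", "llama3.2:1b"}, max_concurrency=2, in_flight=1),
    Worker("worker-3", {"llama3.2:1b"}, max_concurrency=1, in_flight=0),
]

chosen = pick_worker(workers, "llama3.2:3b")
print(chosen.name if chosen else "queued")  # worker-2
```

Here `worker-1` is saturated and `worker-3` does not serve the model, so the request lands on `worker-2`. A request for a model nobody serves returns `None`, which is where the queue would take over.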

## Desktop App

ModelRelay Desktop is a native tray application that wraps the worker daemon in a lightweight GUI. It stays in your system tray and manages the connection to your relay server — no terminal required.

**Features:**
- System tray icon showing connection status (connected / disconnected / relaying)
- Settings UI for backend URL, relay server, worker secret, model selection, and poll interval
- Auto-reconnect on connection loss with status notifications
- Auto-start on login
- Live model list that refreshes as your backend models change

**Download:** Grab the latest installer for your platform from the [Desktop Releases](https://github.com/ericflo/modelrelay/releases?q=desktop) page.

| Platform | Installer |
|----------|-----------|
| Windows | `.msi` or `.exe` |
| macOS | `.dmg` |
| Linux | `.AppImage` or `.deb` |

**Getting started:**
1. Download and install the app for your platform
2. Launch ModelRelay Desktop — it appears in your system tray
3. Right-click the tray icon and open **Settings**
4. Enter your backend URL (e.g. `http://127.0.0.1:8000`), relay server URL, and worker secret
5. Click **Connect** — the tray icon updates to show your connection status

The desktop app uses the same `modelrelay-worker` library under the hood, so it supports all the same backends (llama-server, Ollama, vLLM, LM Studio, etc.).

## Who is this for?

- **Home GPU users** running local models who want a single API endpoint across multiple machines
- **Teams with on-prem hardware** that need to pool GPU capacity without a service mesh
- **Researchers** juggling models across heterogeneous boxes who are tired of updating client configs

## Why this instead of...

| Alternative | What's missing |
|---|---|
| **Pointing clients directly at llama-server** | No HA, no queue, clients must know about every box, no cancellation |
| **nginx / HAProxy** | Doesn't understand LLM streaming semantics, no queueing, no worker auth, no cancellation propagation |
| **LiteLLM / OpenRouter** | Cloud-first routing — not designed for your own private hardware calling home |

## Hosted Version

Don't want to run the infrastructure yourself? A fully managed hosted version is available at [modelrelay.io](https://modelrelay.io). Get an API key, point your workers at it, and start routing requests: same open protocol, no server to set up or operate.

## Quickstart

### Pre-built binaries (recommended)

Pre-built binaries are the fastest way to get started. Download the latest release for your platform from the [Releases page](https://github.com/ericflo/modelrelay/releases):

| Platform | modelrelay-server | modelrelay-worker |
|----------|-------------------|-------------------|
| Linux x86_64 | `modelrelay-server-linux-amd64` | `modelrelay-worker-linux-amd64` |
| Linux arm64 | `modelrelay-server-linux-arm64` | `modelrelay-worker-linux-arm64` |
| macOS Intel | `modelrelay-server-darwin-amd64` | `modelrelay-worker-darwin-amd64` |
| macOS Apple Silicon | `modelrelay-server-darwin-arm64` | `modelrelay-worker-darwin-arm64` |
| Windows x86_64 | `modelrelay-server-windows-amd64.exe` | `modelrelay-worker-windows-amd64.exe` |
| Windows arm64 | `modelrelay-server-windows-arm64.exe` | `modelrelay-worker-windows-arm64.exe` |

**Start the proxy:**

```bash
./modelrelay-server \
  --listen 0.0.0.0:8080 \
  --worker-secret mysecret
```

**Start a worker** (on a GPU box with `llama-server`, Ollama, vLLM, or any OpenAI-compatible backend):

```bash
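# Replace <proxy-host> with the hostname or IP of the machine running modelrelay-server
./modelrelay-worker \
  --proxy-url http://<proxy-host>:8080 \
  --worker-secret mysecret \
  --backend-url http://127.0.0.1:8000 \
  --models llama3.2:3b,llama3.2:1b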
```

### Docker

Pre-built images are published to GitHub Container Registry on every release and on every push to `main`.

```bash
# Pull the latest images
docker pull ghcr.io/ericflo/modelrelay/modelrelay-server:latest
docker pull ghcr.io/ericflo/modelrelay/modelrelay-worker:latest

# Run the proxy
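docker run -p 8080:8080 \
  -e WORKER_SECRET=mysecret \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  ghcr.io/ericflo/modelrelay/modelrelay-server:latest

# Run a worker (on a GPU box); replace <proxy-host> with the proxy's address.
# On Linux with Docker Engine, host.docker.internal only resolves if you also
# pass --add-host=host.docker.internal:host-gateway
docker run \
  -e PROXY_URL=http://<proxy-host>:8080 \
  -e WORKER_SECRET=mysecret \
  -e BACKEND_URL=http://host.docker.internal:8000 \
  -e MODELS=llama3.2:3b \
  ghcr.io/ericflo/modelrelay/modelrelay-worker:latest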
```

For pinned versions, replace `:latest` with a release tag (e.g. `:0.2.1`).

### Docker Compose (easiest for local dev)

```bash
git clone https://github.com/ericflo/modelrelay.git
cd modelrelay

# Start the proxy + one worker (assumes llama-server on host port 8081)
docker compose up
```

The proxy is now listening on `http://localhost:8080`. The worker connects to it automatically and forwards requests to your backend.

### From crates.io

> **Note:** The crates are not yet published to crates.io. Use [pre-built binaries](#pre-built-binaries-recommended) or [Docker](#docker) in the meantime. See [CONTRIBUTING.md](CONTRIBUTING.md#ci-secrets) for how to configure the `CRATES_IO_TOKEN` secret for publishing.

```bash
cargo install modelrelay-server modelrelay-worker
```

### Build from source

```bash
cargo build --release
# Binaries: target/release/modelrelay-server target/release/modelrelay-worker
```

### Try it

```bash
# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'

# Streaming (SSE chunks pass through from the backend)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
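
If you consume the streaming response from code rather than curl, the SSE chunks need to be split and decoded. The following is a minimal Python sketch, assuming the backend emits the usual OpenAI-style stream (`data: {json}` lines ending with `data: [DONE]`). The helper names `iter_sse_payloads` and `collect_text` are hypothetical, and the sample lines are canned data rather than captured output.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON payloads from an OpenAI-style SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # sentinel that ends an OpenAI-compatible stream
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the delta.content fragments of a chat completion stream."""
    parts = []
    for chunk in iter_sse_payloads(lines):
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


# Canned stream standing in for a real response body.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo!"}}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample))  # Hello!
```

Against a live proxy, the same functions should work on the lines of an HTTP response body once each line is decoded from bytes to text.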

## Connecting your tools

Once the proxy is running, point your existing tools at it — no special client needed.

**curl** — see [Try it](#try-it) above.

**Claude Code / Claude Desktop** — set the base URL to your proxy:

```bash
export ANTHROPIC_BASE_URL=http://localhost:8080
claude # requests now route through ModelRelay
```

**LiteLLM** — add a model entry in your `config.yaml`:

```yaml
model_list:
  - model_name: llama3.2:3b
    litellm_params:
      model: openai/llama3.2:3b
      api_base: http://localhost:8080/v1
```

**Open WebUI** — point the OpenAI-compatible backend at the proxy:

```bash
export OPENAI_API_BASE_URL=http://localhost:8080/v1
```

Any tool that speaks OpenAI or Anthropic API formats works — just change the base URL.
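
To illustrate "just change the base URL" from code, here is a small Python sketch that builds one request in each format using only the standard library. The function names and the `max_tokens` default of 256 are hypothetical; the paths and the model name come from the examples above. Whether the proxy expects extra Anthropic headers such as `anthropic-version` is not stated in this README, so none are set here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # the proxy address used in this README


def openai_chat_request(model: str, prompt: str) -> urllib.request.Request:
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def anthropic_messages_request(model: str, prompt: str,
                               max_tokens: int = 256) -> urllib.request.Request:
    # The Anthropic Messages format requires max_tokens in the body.
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON reply (needs a running proxy)."""
    with urllib.request.urlopen(req, timeout=300) as resp:
        return json.loads(resp.read())


for req in (openai_chat_request("llama3.2:3b", "Hello!"),
            anthropic_messages_request("llama3.2:3b", "Hello!")):
    print(req.get_method(), req.full_url)
```

The loop only prints the method and URL of each prepared request, so it runs without a proxy; call `send(req)` once a server and worker are up.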

## Make it persistent

Once your worker is running, set it up as a system service so it starts automatically on boot:

- **Linux (systemd):** Use the template unit in [`extras/modelrelay-worker@.service`](extras/modelrelay-worker@.service) — supports multiple workers per machine (`modelrelay-worker@gpu0`, `@gpu1`, etc.). See [Systemd](#systemd-bare-metal--vm) below for full instructions.
- **macOS (launchd):** Create a Launch Daemon plist pointing at the binary and your `config.toml`. The worker starts on boot and restarts on crash.
- **Windows (Service):** Register with `sc.exe create` and set env vars with `[Environment]::SetEnvironmentVariable`. See [Windows Service](#windows-service) below for full instructions.

The setup wizard at `/setup` in the web UI walks through this interactively with copy-paste commands.

## llamafile Integration

The `extras/modelrelay-llamafile` script is a self-contained CLI for downloading, running, and relaying [llamafile](https://mozilla-ai.github.io/llamafile/) models through ModelRelay. No dependencies beyond bash and curl.

```bash
# See what fits your hardware
./extras/modelrelay-llamafile recommend

# Browse models by category
./extras/modelrelay-llamafile list --tag reasoning

# Save your relay config once
./extras/modelrelay-llamafile config set proxy-url https://relay.example.com
./extras/modelrelay-llamafile config set worker-secret mysecret

# Now just serve — no flags needed
./extras/modelrelay-llamafile serve qwen3.5-4b

# Verify it works end-to-end
./extras/modelrelay-llamafile test qwen3.5-4b

# Manage running models
./extras/modelrelay-llamafile status
./extras/modelrelay-llamafile logs qwen3.5-4b -f
./extras/modelrelay-llamafile stop all

# Import your own llamafiles
./extras/modelrelay-llamafile import ./my-model.llamafile --slug my-model

# Refresh catalog when Mozilla publishes new models
./extras/modelrelay-llamafile update-catalog
```

Run `./extras/modelrelay-llamafile help` for full usage, or `./extras/modelrelay-llamafile doctor` to check system readiness.

## Features

- **Cross-platform** — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)
- **OpenAI + Anthropic compatible** — `POST /v1/chat/completions`, `POST /v1/responses`, `POST /v1/messages`, `GET /v1/models`
- **No inbound ports on GPU boxes** — workers connect out to the proxy over WebSocket
- **Request queueing** — configurable depth and timeout when all workers are busy
- **Streaming pass-through** — SSE chunks forwarded with preserved ordering and termination
- **End-to-end cancellation** — client disconnect propagates through the proxy to the worker to the backend
- **Automatic requeue** — if a worker dies mid-request, the request is requeued to another worker
- **Heartbeat and load tracking** — stale workers are cleaned up; workers report current load
- **Graceful drain** — workers can shut down while replacement workers pick up queued work
- **Model catalog refresh** — workers can update their model list without reconnecting
- **Auth cooldown recovery** — workers recover gracefully from authentication failures
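
The heartbeat and requeue features in the list above can be pictured with a tiny in-memory model. This Python sketch is an illustration under stated assumptions, not the server's code: `ToyRelay`, the 30-second heartbeat timeout, and the choice to put orphaned requests at the front of the queue are all hypothetical.

```python
from collections import deque


class ToyRelay:
    """Tiny in-memory model of requeue-on-worker-loss; not the real server."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout  # hypothetical value
        self.last_seen = {}   # worker name -> time of last heartbeat
        self.in_flight = {}   # worker name -> list of request ids
        self.queue = deque()  # request ids waiting for a worker

    def heartbeat(self, worker, now):
        self.last_seen[worker] = now
        self.in_flight.setdefault(worker, [])

    def assign(self, worker, request_id):
        self.in_flight[worker].append(request_id)

    def reap_stale(self, now):
        """Drop workers whose heartbeat is too old and requeue their requests."""
        stale = [w for w, seen in self.last_seen.items()
                 if now - seen > self.heartbeat_timeout]
        for worker in stale:
            # Assumption: orphaned requests go to the front so they keep their turn.
            self.queue.extendleft(reversed(self.in_flight.pop(worker)))
            del self.last_seen[worker]
        return stale


relay = ToyRelay(heartbeat_timeout=30.0)
relay.heartbeat("worker-1", now=0.0)
relay.heartbeat("worker-2", now=0.0)
relay.assign("worker-1", "req-a")
relay.queue.append("req-b")

relay.heartbeat("worker-2", now=40.0)  # worker-2 is still alive
print(relay.reap_stale(now=45.0))      # ['worker-1']
print(list(relay.queue))               # ['req-a', 'req-b']
```

In this model `worker-1` stays silent past the timeout, so it is removed and `req-a` returns to the queue ahead of `req-b`, ready for another worker.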

## Configuration

### modelrelay-server

| Flag | Env var | Default | Description |
|------|---------|---------|-------------|
| `--listen` | `LISTEN_ADDR` | `127.0.0.1:8080` | Address to listen on |
| `--worker-secret` | `WORKER_SECRET` | *(required)* | Secret workers must present to authenticate |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name used for worker routing and request dispatch |
| `--max-queue-len` | `MAX_QUEUE_LEN` | `100` | Maximum number of queued requests (0 = unlimited) |
| `--queue-timeout` | `QUEUE_TIMEOUT_SECS` | `30` | Seconds before a queued request times out (0 = no timeout) |
| `--request-timeout` | `REQUEST_TIMEOUT_SECS` | `300` | Seconds before an in-flight HTTP request times out (0 = no timeout) |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_server=debug`). Overridden by `RUST_LOG` if set. |
| `--admin-token` | `MODELRELAY_ADMIN_TOKEN` | *(none)* | Bearer token for `/admin/*` endpoints. If unset, admin endpoints return 403. |
| `--require-api-keys` | `MODELRELAY_REQUIRE_API_KEYS` | `false` | When `true`, client inference requests must include a valid API key as Bearer token. |

### modelrelay-worker

| Flag | Env var | Default | Description |
|------|---------|---------|-------------|
| `--proxy-url` | `PROXY_URL` | `http://127.0.0.1:8080` | Base URL of the proxy server |
| `--worker-secret` | `WORKER_SECRET` | *(required)* | Secret used to authenticate with the proxy |
| `--backend-url` | `BACKEND_URL` | `http://127.0.0.1:8000` | Base URL of the local model backend |
| `--models` | `MODELS` | `default` | Comma-separated list of model names this worker supports |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name to register with on the proxy |
| `--worker-name` | `WORKER_NAME` | `worker` | Human-readable name for this worker instance |
| `--max-concurrency` | `MAX_CONCURRENCY` | `1` | Maximum number of concurrent requests this worker will handle |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_worker=debug`). Overridden by `RUST_LOG` if set. |

All flags can be passed as CLI arguments or set via the corresponding environment variable.
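
As a rough mental model of how these settings combine, the Python sketch below resolves a value from a flag, an environment variable, or the documented default, then applies two rules from the tables: `MODELS` is comma-separated, and a maximum queue length of `0` means unlimited. The precedence shown (flag over environment variable) and the whitespace trimming are assumptions, and the function names are hypothetical.

```python
import os


def resolve(flag_value, env_name, default, environ=os.environ):
    """Pick a setting: explicit flag first, then env var, then the default."""
    if flag_value is not None:
        return flag_value
    return environ.get(env_name, default)


def parse_models(raw: str) -> list[str]:
    """Split a comma-separated --models / MODELS value into clean names."""
    return [name.strip() for name in raw.split(",") if name.strip()]


def queue_accepts(current_len: int, max_queue_len: int) -> bool:
    """Apply the documented rule that a max queue length of 0 means unlimited."""
    return max_queue_len == 0 or current_len < max_queue_len


# A stand-in environment so the example does not depend on your shell.
env = {"MODELS": "llama3.2:3b, llama3.2:1b", "MAX_QUEUE_LEN": "0"}

models = parse_models(resolve(None, "MODELS", "default", env))
backend = resolve(None, "BACKEND_URL", "http://127.0.0.1:8000", env)
max_len = int(resolve(None, "MAX_QUEUE_LEN", "100", env))

print(models)                         # ['llama3.2:3b', 'llama3.2:1b']
print(backend)                        # http://127.0.0.1:8000
print(queue_accepts(5000, max_len))   # True
```

With `MAX_QUEUE_LEN=0` the queue never rejects on length, while the default of `100` would start rejecting once 100 requests are waiting.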

## Admin API & Web Dashboard

ModelRelay includes built-in admin endpoints for monitoring and an embedded web dashboard for managing your deployment.

### Admin API Endpoints

| Method | Path | Auth | Description |
|--------|------|------|-------------|
| GET | `/health` | None | Basic health check — returns version, worker count, queue depth, and uptime |
| GET | `/admin/workers` | Admin token | List connected workers with models, load, and capabilities |
| GET | `/admin/stats` | Admin token | Request counts, queue depth per provider |
| GET | `/admin/keys` | Admin token | List client API key metadata (no secrets) |
| POST | `/admin/keys` | Admin token | Create a new client API key — returns the secret once |
| DELETE | `/admin/keys/{id}` | Admin token | Revoke a client API key |

### Admin Authentication

All `/admin/*` endpoints require a Bearer token matching `MODELRELAY_ADMIN_TOKEN`:

```bash
# Set the admin token when starting the server
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret

# Query admin endpoints
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/workers
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/stats
```

If `MODELRELAY_ADMIN_TOKEN` is not set, all admin endpoints return `403 Forbidden`.

### Client API Key Authentication

When `MODELRELAY_REQUIRE_API_KEYS` is set to `true`, clients must include a valid API key as a Bearer token on inference requests (`/v1/chat/completions`, `/v1/messages`, etc.). Without a valid key, requests are rejected with `401 Unauthorized`.

```bash
# Start the server with API key auth enabled
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret --require-api-keys true

# Create a client API key (the secret is returned only once)
curl -X POST -H "Authorization: Bearer my-admin-secret" \
  -H "Content-Type: application/json" \
  -d '{"name": "my-app"}' \
  http://localhost:8080/admin/keys

# Use the key for inference
curl -H "Authorization: Bearer mr-..." \
  -H "Content-Type: application/json" \
  -d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}' \
  http://localhost:8080/v1/chat/completions

# Revoke a key
curl -X DELETE -H "Authorization: Bearer my-admin-secret" \
  http://localhost:8080/admin/keys/{key-id}
```

When `MODELRELAY_REQUIRE_API_KEYS` is `false` (the default), inference endpoints accept requests without any authentication.
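
The same key lifecycle can be scripted. This Python sketch builds the admin requests with the standard library and maps the two status codes documented above to short hints. The helper names are hypothetical, the token is the placeholder from the curl examples, and the shape of the JSON responses is not documented here, so the sketch does not try to parse them.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"
ADMIN_TOKEN = "my-admin-secret"  # same placeholder value as the curl examples


def admin_request(method, path, payload=None):
    """Build (but do not send) an authenticated request to an /admin/* endpoint."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(f"{BASE_URL}{path}", data=data, method=method)
    req.add_header("Authorization", f"Bearer {ADMIN_TOKEN}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req


def create_key(name):
    return admin_request("POST", "/admin/keys", {"name": name})


def revoke_key(key_id):
    return admin_request("DELETE", f"/admin/keys/{key_id}")


def explain_status(status):
    """Map the status codes this README documents to a short hint."""
    hints = {
        401: "inference request rejected: missing or invalid client API key",
        403: "admin endpoint refused: is MODELRELAY_ADMIN_TOKEN set on the server?",
    }
    return hints.get(status, f"HTTP {status}")


req = create_key("my-app")
print(req.get_method(), req.full_url)  # POST http://localhost:8080/admin/keys
print(explain_status(403))
```

Pass a prepared request to `urllib.request.urlopen` to send it. A non-2xx reply raises `urllib.error.HTTPError`, whose `code` attribute can be fed to `explain_status`.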

### Web Dashboard & Setup Wizard

The `modelrelay-web` crate provides an embedded web UI served by the proxy:

- **Dashboard** at `/dashboard` — real-time view of connected workers, request metrics, and queue depth
- **Setup Wizard** at `/setup` — step-by-step guide for connecting new workers (platform detection, backend configuration, worker binary download, and live connection verification)

The setup wizard is always accessible — not just on first run. Use it to add additional GPU boxes to your fleet at any time.

## Production deployment

### Docker Compose (multi-worker)

The included [`docker-compose.yml`](docker-compose.yml) runs the proxy with two workers, health checks, restart policies, memory limits, and log rotation:

```bash
cp .env.example .env # edit WORKER_SECRET and backend URLs
docker compose up -d
```

Add more workers by duplicating a worker service block and adjusting `MODELS`, `BACKEND_URL`, and `WORKER_NAME`.

### Systemd (bare metal / VM)

Service files live in [`extras/`](extras/):

```bash
# Install binaries (from a release archive or cargo build --release)
sudo install -m 755 modelrelay-server modelrelay-worker /usr/local/bin/

# Create a service user
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay

# Proxy
sudo cp extras/modelrelay-server.service /etc/systemd/system/
sudo cp extras/proxy.env.example /etc/modelrelay/proxy.env
sudo vim /etc/modelrelay/proxy.env # set WORKER_SECRET
sudo systemctl enable --now modelrelay-server

# Workers — the template unit lets you run multiple instances:
sudo cp extras/modelrelay-worker@.service /etc/systemd/system/
sudo cp extras/worker.env.example /etc/modelrelay/worker-gpu0.env
sudo vim /etc/modelrelay/worker-gpu0.env # set PROXY_URL, BACKEND_URL, MODELS
sudo systemctl enable --now modelrelay-worker@gpu0
```

See [`extras/`](extras/) for the full service files and annotated env examples.

### Windows Service

ModelRelay ships Windows binaries that can run as native Windows Services using `sc.exe`. No third-party service wrappers required.

```powershell
# Install the server as a service (run as Administrator)
sc.exe create ModelRelayServer binPath= "C:\ModelRelay\modelrelay-server.exe" start= auto

# Set environment variables for the service (system-wide, persists across reboots)
[Environment]::SetEnvironmentVariable("WORKER_SECRET", "your-secret-here", "Machine")
[Environment]::SetEnvironmentVariable("LISTEN_ADDR", "0.0.0.0:8080", "Machine")

# Start the service
Start-Service ModelRelayServer

# Install a worker service
sc.exe create ModelRelayWorker binPath= '"C:\ModelRelay\modelrelay-worker.exe" --models llama3-8b' start= auto
[Environment]::SetEnvironmentVariable("PROXY_URL", "http://your-proxy:8080", "Machine")
[Environment]::SetEnvironmentVariable("BACKEND_URL", "http://localhost:8000", "Machine")
Start-Service ModelRelayWorker
```

For fully annotated install scripts with error handling and uninstall support, see [`extras/install-windows-service.ps1`](extras/install-windows-service.ps1) and [`extras/install-windows-service-worker.ps1`](extras/install-windows-service-worker.ps1). The service runs as `LocalSystem` by default; to use a dedicated account, set the service log-on via `services.msc` or pass `obj=` and `password=` to `sc.exe create`.

### TLS

The proxy and workers communicate over plain HTTP/WebSocket by default. For production, terminate TLS at a reverse proxy like nginx. An annotated configuration is provided at [`examples/tls-nginx.conf`](examples/tls-nginx.conf) — it handles HTTPS for client requests and `wss://` WebSocket upgrades for workers, with streaming-friendly settings (buffering disabled, long timeouts).

### Load Testing

A ready-made load test script lives at [`extras/load-test.sh`](extras/load-test.sh). It uses `hey` if installed, falls back to `wrk`, and finally to parallel `curl` loops:

```bash
./extras/load-test.sh -n 200 -c 20 -m llama3-8b
```
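
For a sense of what a parallel request loop measures, here is a self-contained Python sketch of the same idea. It is not a port of `load-test.sh`: `run_load` and `fake_send` are hypothetical, and reading `-n` as the total request count and `-c` as the concurrency is an assumption based on the conventions of `hey`.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load(send, total: int, concurrency: int) -> dict:
    """Call send() `total` times across `concurrency` threads and summarize."""
    def timed_call(_):
        start = time.perf_counter()
        try:
            send()
            ok = True
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total)))

    latencies = sorted(elapsed for _, elapsed in results)
    return {
        "requests": total,
        "failures": sum(1 for ok, _ in results if not ok),
        "p50_seconds": latencies[len(latencies) // 2],
        "max_seconds": latencies[-1],
    }


def fake_send():
    # Stand-in for an HTTP POST to /v1/chat/completions; swap in a real call.
    time.sleep(0.01)


summary = run_load(fake_send, total=200, concurrency=20)
print(summary["requests"], summary["failures"])  # 200 0
```

Replacing `fake_send` with a function that posts to the proxy turns this into a crude latency probe. With the default worker `--max-concurrency` of `1`, most of the measured time under load is likely to be queue wait rather than inference.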

### Shell Completions

Both `modelrelay-server` and `modelrelay-worker` can generate shell completion scripts via the hidden `--completions` flag:

```bash
# Bash
modelrelay-server --completions bash > ~/.local/share/bash-completion/completions/modelrelay-server
modelrelay-worker --completions bash > ~/.local/share/bash-completion/completions/modelrelay-worker

# Zsh (add the target directory to $fpath)
modelrelay-server --completions zsh > ~/.zfunc/_modelrelay-server
modelrelay-worker --completions zsh > ~/.zfunc/_modelrelay-worker

# Fish
modelrelay-server --completions fish > ~/.config/fish/completions/modelrelay-server.fish
modelrelay-worker --completions fish > ~/.config/fish/completions/modelrelay-worker.fish
```

Supported shells: `bash`, `zsh`, `fish`, `powershell`, `elvish`.

## Documents

> **Full documentation:** [ericflo.github.io/modelrelay](https://ericflo.github.io/modelrelay/)

- [Behavior contract](docs/behavior-contract.md) — the full specification of proxy, queue, streaming, and cancellation semantics
- [Architecture sketch](docs/architecture.md) — how the pieces fit together internally
- [Protocol walkthrough](docs/protocol-walkthrough.md) — ASCII wire traces for every message flow
- [Operational runbook](docs/runbook.md) — health checks, draining, scaling, troubleshooting

## Validation

The behavior matrix is exercised at three layers: black-box contract harnesses in `modelrelay-contract-tests`, live HTTP integration tests in `modelrelay-server`, and end-to-end live backend tests in `modelrelay-worker`.

```bash
cargo fmt --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
```

## Contributing

Bug reports, feature requests, and PRs are welcome. See
[CONTRIBUTING.md](CONTRIBUTING.md) for code style, test expectations,
branch naming, and CI secrets.

To report a security vulnerability, follow the process in
[SECURITY.md](SECURITY.md).

## License

MIT