https://github.com/ericflo/modelrelay
Central HTTP LLM proxy that routes inference requests to authenticated remote workers over WebSocket — queueing, streaming, and cancellation included.
anthropic-compatible gpu inference llama llm openai-compatible proxy rust websocket worker
- Host: GitHub
- URL: https://github.com/ericflo/modelrelay
- Owner: ericflo
- License: mit
- Created: 2026-04-02T20:11:00.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-09T05:02:20.000Z (3 months ago)
- Last Synced: 2026-04-09T05:03:09.945Z (3 months ago)
- Topics: anthropic-compatible, gpu, inference, llama, llm, openai-compatible, proxy, rust, websocket, worker
- Language: Rust
- Homepage: https://ericflo.github.io/modelrelay/
- Size: 2.03 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
# ModelRelay
**Stop configuring clients for every GPU box. Workers connect out; requests route in.**
You have GPU boxes running `llama-server` (or Ollama, or vLLM, or anything OpenAI-compatible). Today you either expose each one directly — port forwarding, DNS, firewall rules — or you stick a load balancer in front that doesn't understand LLM streaming or cancellation.
ModelRelay flips the model: a central proxy receives standard inference requests while worker daemons on your GPU boxes connect *out* to it over WebSocket. The proxy handles queueing, routing, streaming pass-through, and cancellation propagation. Clients see one stable endpoint and never need to know about your hardware.
```
  Clients (curl, Claude Code, LiteLLM, Open WebUI, ...)
         │
         │  POST /v1/chat/completions
         │  POST /v1/messages
         ▼
  ┌──────────────────────┐
  │  modelrelay-server   │◄─── workers connect out (WebSocket)
  │  (one stable         │     no inbound ports needed on GPU boxes
  │   endpoint)          │
  └──────────────────────┘
         │  routes request to best available worker
         ▼
  ┌────────┐  ┌────────┐  ┌────────┐
  │worker-1│  │worker-2│  │worker-3│
  │ llama  │  │ ollama │  │  vllm  │  ← your GPU boxes,
  │ server │  │        │  │        │    anywhere on any network
  └────────┘  └────────┘  └────────┘
```
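To make "routes request to best available worker" concrete, here is a minimal Python sketch of one plausible selection rule: among workers that advertise the requested model and have a free slot, pick the one with the lowest load ratio. The `Worker` type, the `pick_worker` function, and the rule itself are illustrative assumptions. The README does not specify how the proxy ranks workers, so treat this as a mental model and not as the server's algorithm.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Worker:
    """Hypothetical view of a connected worker, for illustration only."""
    name: str
    models: frozenset      # model names this worker advertises
    in_flight: int         # requests currently being served
    max_concurrency: int   # mirrors the worker's --max-concurrency setting


def pick_worker(workers: Sequence[Worker], model: str) -> Optional[Worker]:
    """Return the least-loaded worker that serves `model` and has a free slot."""
    candidates = [
        w for w in workers
        if model in w.models and w.in_flight < w.max_concurrency
    ]
    if not candidates:
        return None  # no capacity: the caller would queue the request instead
    # Lowest load ratio wins; ties fall back to the name so the choice is stable.
    return min(candidates, key=lambda w: (w.in_flight / w.max_concurrency, w.name))


workers = [
    Worker("worker-1", frozenset({"llama3.2:3b"}), in_flight=1, max_concurrency=2),
    Worker("worker-2", frozenset({"llama3.2:3b", "llama3.2:1b"}), in_flight=0, max_concurrency=1),
    Worker("worker-3", frozenset({"llama3.2:1b"}), in_flight=1, max_concurrency=1),
]
```

With this sample fleet, a request for `llama3.2:3b` goes to `worker-2` (idle) over `worker-1` (half loaded), and a request for a model nobody serves returns `None`.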
## Desktop App
ModelRelay Desktop is a native tray application that wraps the worker daemon in a lightweight GUI. It stays in your system tray and manages the connection to your relay server — no terminal required.
**Features:**
- System tray icon showing connection status (connected / disconnected / relaying)
- Settings UI for backend URL, relay server, worker secret, model selection, and poll interval
- Auto-reconnect on connection loss with status notifications
- Auto-start on login
- Live model list that refreshes as your backend models change
**Download:** Grab the latest installer for your platform from the [Desktop Releases](https://github.com/ericflo/modelrelay/releases?q=desktop) page.
| Platform | Installer |
|----------|-----------|
| Windows | `.msi` or `.exe` |
| macOS | `.dmg` |
| Linux | `.AppImage` or `.deb` |
**Getting started:**
1. Download and install the app for your platform
2. Launch ModelRelay Desktop — it appears in your system tray
3. Right-click the tray icon and open **Settings**
4. Enter your backend URL (e.g. `http://127.0.0.1:8000`), relay server URL, and worker secret
5. Click **Connect** — the tray icon updates to show your connection status
The desktop app uses the same `modelrelay-worker` library under the hood, so it supports all the same backends (llama-server, Ollama, vLLM, LM Studio, etc.).
## Who is this for?
- **Home GPU users** running local models who want a single API endpoint across multiple machines
- **Teams with on-prem hardware** that need to pool GPU capacity without a service mesh
- **Researchers** juggling models across heterogeneous boxes who are tired of updating client configs
## Why this instead of...
| Alternative | What's missing |
|---|---|
| **Pointing clients directly at llama-server** | No HA, no queue, clients must know about every box, no cancellation |
| **nginx / HAProxy** | Doesn't understand LLM streaming semantics, no queueing, no worker auth, no cancellation propagation |
| **LiteLLM / OpenRouter** | Cloud-first routing — not designed for your own private hardware calling home |
## Hosted Version
Don't want to run the infrastructure yourself? A fully managed hosted version is available at [modelrelay.io](https://modelrelay.io), with no server to set up or maintain. Get an API key, point your workers at it, and start routing requests. Same open protocol, zero ops burden.
## Quickstart
### Pre-built binaries (recommended)
Pre-built binaries are the fastest way to get started. Download the latest release for your platform from the [Releases page](https://github.com/ericflo/modelrelay/releases):
| Platform | modelrelay-server | modelrelay-worker |
|----------|-------------------|-------------------|
| Linux x86_64 | `modelrelay-server-linux-amd64` | `modelrelay-worker-linux-amd64` |
| Linux arm64 | `modelrelay-server-linux-arm64` | `modelrelay-worker-linux-arm64` |
| macOS Intel | `modelrelay-server-darwin-amd64` | `modelrelay-worker-darwin-amd64` |
| macOS Apple Silicon | `modelrelay-server-darwin-arm64` | `modelrelay-worker-darwin-arm64` |
| Windows x86_64 | `modelrelay-server-windows-amd64.exe` | `modelrelay-worker-windows-amd64.exe` |
| Windows arm64 | `modelrelay-server-windows-arm64.exe` | `modelrelay-worker-windows-arm64.exe` |
**Start the proxy:**
```bash
./modelrelay-server \
  --listen 0.0.0.0:8080 \
  --worker-secret mysecret
```
**Start a worker** (on a GPU box with `llama-server`, Ollama, vLLM, or any OpenAI-compatible backend):
```bash
./modelrelay-worker \
  --proxy-url http://<proxy-host>:8080 \
  --worker-secret mysecret \
  --backend-url http://127.0.0.1:8000 \
  --models llama3.2:3b,llama3.2:1b
```
### Docker
Pre-built images are published to GitHub Container Registry on every release and on every push to `main`.
```bash
# Pull the latest images
docker pull ghcr.io/ericflo/modelrelay/modelrelay-server:latest
docker pull ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
# Run the proxy
docker run -p 8080:8080 \
  -e WORKER_SECRET=mysecret \
  -e LISTEN_ADDR=0.0.0.0:8080 \
  ghcr.io/ericflo/modelrelay/modelrelay-server:latest
# Run a worker (on a GPU box)
docker run \
  -e PROXY_URL=http://<proxy-host>:8080 \
  -e WORKER_SECRET=mysecret \
  -e BACKEND_URL=http://host.docker.internal:8000 \
  -e MODELS=llama3.2:3b \
  ghcr.io/ericflo/modelrelay/modelrelay-worker:latest
```
For pinned versions, replace `:latest` with a release tag (e.g. `:0.2.1`).
### Docker Compose (easiest for local dev)
```bash
git clone https://github.com/ericflo/modelrelay.git
cd modelrelay
# Start the proxy + one worker (assumes llama-server on host port 8081)
docker compose up
```
The proxy is now listening on `http://localhost:8080`. The worker connects to it automatically and forwards requests to your backend.
### From crates.io
> **Note:** The crates are not yet published to crates.io. Use [pre-built binaries](#pre-built-binaries-recommended) or [Docker](#docker) in the meantime. See [CONTRIBUTING.md](CONTRIBUTING.md#ci-secrets) for how to configure the `CRATES_IO_TOKEN` secret for publishing.
```bash
cargo install modelrelay-server modelrelay-worker
```
### Build from source
```bash
cargo build --release
# Binaries: target/release/modelrelay-server target/release/modelrelay-worker
```
### Try it
```bash
# Non-streaming
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": false
  }'
# Streaming (SSE chunks pass through from the backend)
curl http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
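The same two calls can be made from a script. The sketch below uses only the Python standard library and assumes the backend emits OpenAI-style streaming chunks: `data: {...}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`. The helper names (`build_chat_request`, `iter_content`, `stream_chat`) are made up for this example and are not part of ModelRelay.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_chat_request(base_url: str, model: str, prompt: str,
                       stream: bool) -> urllib.request.Request:
    """Build a POST to /v1/chat/completions without sending it."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style SSE lines, stopping at [DONE]."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


def stream_chat(base_url: str, model: str, prompt: str) -> Iterator[str]:
    """Send the request and yield text as it arrives (needs a running proxy)."""
    req = build_chat_request(base_url, model, prompt, stream=True)
    with urllib.request.urlopen(req) as resp:
        yield from iter_content(resp)  # an HTTP response iterates line by line
```

Against a running proxy, `for piece in stream_chat("http://localhost:8080", "llama3.2:3b", "Hello!"): print(piece, end="", flush=True)` prints tokens as they arrive. Leaving the loop early closes the connection, which is the client-side disconnect that the proxy is described as propagating to the worker.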
## Connecting your tools
Once the proxy is running, point your existing tools at it — no special client needed.
**curl** — see [Try it](#try-it) above.
**Claude Code / Claude Desktop** — set the base URL to your proxy:
```bash
export ANTHROPIC_BASE_URL=http://localhost:8080
claude # requests now route through ModelRelay
```
**LiteLLM** — add a model entry in your `config.yaml`:
```yaml
model_list:
  - model_name: llama3.2:3b
    litellm_params:
      model: openai/llama3.2:3b
      api_base: http://localhost:8080/v1
```
**Open WebUI** — point the OpenAI-compatible backend at the proxy:
```bash
export OPENAI_API_BASE_URL=http://localhost:8080/v1
```
Any tool that speaks OpenAI or Anthropic API formats works — just change the base URL.
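For the Anthropic side, a script can talk to `POST /v1/messages` directly. This is a minimal sketch using the standard Messages request shape (`model`, `max_tokens`, `messages`) and the standard response shape (a `content` list of typed blocks). It assumes the proxy returns that response shape unchanged; the function names are illustrative, and the optional Bearer header only matters when client API keys are enabled on the server.

```python
import json
import urllib.request
from typing import Optional


def build_messages_request(base_url: str, model: str, prompt: str,
                           max_tokens: int = 256,
                           api_key: Optional[str] = None) -> urllib.request.Request:
    """Build an Anthropic-format POST to /v1/messages without sending it."""
    body = {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Only needed when the server runs with API key auth turned on.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def extract_text(response: dict) -> str:
    """Join the text blocks of an Anthropic-style response body."""
    return "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send one non-streaming message (needs a running proxy and worker)."""
    req = build_messages_request(base_url, model, prompt)
    with urllib.request.urlopen(req) as resp:
        return extract_text(json.load(resp))
```

Calling `ask("http://localhost:8080", "llama3.2:3b", "Hello!")` mirrors the curl examples above, just in the Anthropic request format.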
## Make it persistent
Once your worker is running, set it up as a system service so it starts automatically on boot:
- **Linux (systemd):** Use the template unit in [`extras/modelrelay-worker@.service`](extras/modelrelay-worker@.service) — supports multiple workers per machine (`modelrelay-worker@gpu0`, `@gpu1`, etc.). See [Systemd](#systemd-bare-metal--vm) below for full instructions.
- **macOS (launchd):** Create a Launch Daemon plist pointing at the binary and your `config.toml`. The worker starts on boot and restarts on crash.
- **Windows (Service):** Register with `sc.exe create` and set env vars with `[Environment]::SetEnvironmentVariable`. See [Windows Service](#windows-service) below for full instructions.
The setup wizard at `/setup` in the web UI walks through this interactively with copy-paste commands.
## llamafile Integration
The `extras/modelrelay-llamafile` script is a self-contained CLI for downloading, running, and relaying [llamafile](https://mozilla-ai.github.io/llamafile/) models through ModelRelay. No dependencies beyond bash and curl.
```bash
# See what fits your hardware
./extras/modelrelay-llamafile recommend
# Browse models by category
./extras/modelrelay-llamafile list --tag reasoning
# Save your relay config once
./extras/modelrelay-llamafile config set proxy-url https://relay.example.com
./extras/modelrelay-llamafile config set worker-secret mysecret
# Now just serve — no flags needed
./extras/modelrelay-llamafile serve qwen3.5-4b
# Verify it works end-to-end
./extras/modelrelay-llamafile test qwen3.5-4b
# Manage running models
./extras/modelrelay-llamafile status
./extras/modelrelay-llamafile logs qwen3.5-4b -f
./extras/modelrelay-llamafile stop all
# Import your own llamafiles
./extras/modelrelay-llamafile import ./my-model.llamafile --slug my-model
# Refresh catalog when Mozilla publishes new models
./extras/modelrelay-llamafile update-catalog
```
Run `./extras/modelrelay-llamafile help` for full usage, or `./extras/modelrelay-llamafile doctor` to check system readiness.
## Features
- **Cross-platform** — pre-built binaries for Linux, macOS, and Windows (x86_64 + arm64)
- **OpenAI + Anthropic compatible** — `POST /v1/chat/completions`, `POST /v1/responses`, `POST /v1/messages`, `GET /v1/models`
- **No inbound ports on GPU boxes** — workers connect out to the proxy over WebSocket
- **Request queueing** — configurable depth and timeout when all workers are busy
- **Streaming pass-through** — SSE chunks forwarded with preserved ordering and termination
- **End-to-end cancellation** — client disconnect propagates through the proxy to the worker to the backend
- **Automatic requeue** — if a worker dies mid-request, the request is requeued to another worker
- **Heartbeat and load tracking** — stale workers are cleaned up; workers report current load
- **Graceful drain** — workers can shut down while replacement workers pick up queued work
- **Model catalog refresh** — workers can update their model list without reconnecting
- **Auth cooldown recovery** — workers recover gracefully from authentication failures
## Configuration
### modelrelay-server
| Flag | Env var | Default | Description |
|------|---------|---------|-------------|
| `--listen` | `LISTEN_ADDR` | `127.0.0.1:8080` | Address to listen on |
| `--worker-secret` | `WORKER_SECRET` | *(required)* | Secret workers must present to authenticate |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name used for worker routing and request dispatch |
| `--max-queue-len` | `MAX_QUEUE_LEN` | `100` | Maximum number of queued requests (0 = unlimited) |
| `--queue-timeout` | `QUEUE_TIMEOUT_SECS` | `30` | Seconds before a queued request times out (0 = no timeout) |
| `--request-timeout` | `REQUEST_TIMEOUT_SECS` | `300` | Seconds before an in-flight HTTP request times out (0 = no timeout) |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_server=debug`). Overridden by `RUST_LOG` if set. |
| `--admin-token` | `MODELRELAY_ADMIN_TOKEN` | *(none)* | Bearer token for `/admin/*` endpoints. If unset, admin endpoints return 403. |
| `--require-api-keys` | `MODELRELAY_REQUIRE_API_KEYS` | `false` | When `true`, client inference requests must include a valid API key as Bearer token. |
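The two queue settings interact in a way that is easier to see in code. The Python model below is a sketch of the documented semantics only: `max_len=0` means unlimited, `timeout_secs=0` means no timeout, and a request that waits too long is dropped before it reaches a worker. The class and method names are invented for the example, and the real server is written in Rust and may differ in detail (for instance in when expiry is checked).

```python
import time
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class QueueFull(Exception):
    """Raised when the queue is already at --max-queue-len."""


class RequestQueue:
    """Toy model of --max-queue-len and --queue-timeout (0 disables each limit)."""

    def __init__(self, max_len: int = 100, timeout_secs: float = 30,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_len = max_len
        self.timeout_secs = timeout_secs
        self._clock = clock  # injectable so the behavior can be tested without sleeping
        self._items: Deque[Tuple[float, str]] = deque()

    def push(self, request_id: str) -> None:
        """Enqueue a request, or raise QueueFull if the limit is reached."""
        if self.max_len and len(self._items) >= self.max_len:
            raise QueueFull(request_id)
        self._items.append((self._clock(), request_id))

    def pop(self) -> Optional[str]:
        """Return the oldest request that has not timed out, or None."""
        while self._items:
            enqueued_at, request_id = self._items.popleft()
            waited = self._clock() - enqueued_at
            if self.timeout_secs and waited >= self.timeout_secs:
                continue  # expired while waiting; a real proxy would answer with an error
            return request_id
        return None
```

With the defaults (`100` and `30`), the 101st concurrent waiter is rejected immediately, while a request that has waited 30 seconds gives up even though the queue is not full.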
### modelrelay-worker
| Flag | Env var | Default | Description |
|------|---------|---------|-------------|
| `--proxy-url` | `PROXY_URL` | `http://127.0.0.1:8080` | Base URL of the proxy server |
| `--worker-secret` | `WORKER_SECRET` | *(required)* | Secret used to authenticate with the proxy |
| `--backend-url` | `BACKEND_URL` | `http://127.0.0.1:8000` | Base URL of the local model backend |
| `--models` | `MODELS` | `default` | Comma-separated list of model names this worker supports |
| `--provider` | `PROVIDER_NAME` | `local` | Provider name to register with on the proxy |
| `--worker-name` | `WORKER_NAME` | `worker` | Human-readable name for this worker instance |
| `--max-concurrency` | `MAX_CONCURRENCY` | `1` | Maximum number of concurrent requests this worker will handle |
| `--log-level` | `LOG_LEVEL` | `info` | Log level filter (e.g. `info`, `debug`, or `modelrelay_worker=debug`). Overridden by `RUST_LOG` if set. |
All flags can be passed as CLI arguments or set via the corresponding environment variable.
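Because every worker flag has an environment variable, a small launcher script can start several workers with different settings. The sketch below builds the environment from the variable names in the table above. The binary path, worker names, and backend ports in the usage note are placeholders, and `worker_env` and `launch_worker` are hypothetical helpers, not part of the project.

```python
import os
import subprocess
from typing import Dict, Sequence


def worker_env(proxy_url: str, worker_secret: str, backend_url: str,
               models: Sequence[str], worker_name: str = "worker",
               max_concurrency: int = 1) -> Dict[str, str]:
    """Map settings onto the environment variables modelrelay-worker reads."""
    return {
        "PROXY_URL": proxy_url,
        "WORKER_SECRET": worker_secret,
        "BACKEND_URL": backend_url,
        "MODELS": ",".join(models),  # the worker expects a comma-separated list
        "WORKER_NAME": worker_name,
        "MAX_CONCURRENCY": str(max_concurrency),
    }


def launch_worker(binary: str, env: Dict[str, str]) -> subprocess.Popen:
    """Start one worker process, layering its settings over the current environment."""
    return subprocess.Popen([binary], env={**os.environ, **env})
```

For example, `launch_worker("./modelrelay-worker", worker_env("http://127.0.0.1:8080", "mysecret", "http://127.0.0.1:8000", ["llama3.2:3b"], worker_name="gpu0"))` starts one worker; a second call with a different `backend_url` and `worker_name` starts another on the same machine.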
## Admin API & Web Dashboard
ModelRelay includes built-in admin endpoints for monitoring and an embedded web dashboard for managing your deployment.
### Admin API Endpoints
| Method | Path | Auth | Description |
|--------|------|------|-------------|
| GET | `/health` | None | Basic health check — returns version, worker count, queue depth, and uptime |
| GET | `/admin/workers` | Admin token | List connected workers with models, load, and capabilities |
| GET | `/admin/stats` | Admin token | Request counts, queue depth per provider |
| GET | `/admin/keys` | Admin token | List client API key metadata (no secrets) |
| POST | `/admin/keys` | Admin token | Create a new client API key — returns the secret once |
| DELETE | `/admin/keys/{id}` | Admin token | Revoke a client API key |
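Since `/health` needs no authentication, it is a convenient readiness probe for deploy scripts. The sketch below polls it until the proxy answers. It deliberately does not read specific fields from the response, because the exact JSON field names are not given here; `fetch_health` and `wait_until_healthy` are illustrative names.

```python
import json
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_health(base_url: str, timeout: float = 2.0) -> Optional[dict]:
    """GET /health; return the parsed JSON body, or None if unreachable or invalid."""
    try:
        with urllib.request.urlopen(base_url.rstrip("/") + "/health",
                                    timeout=timeout) as resp:
            return json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        return None


def wait_until_healthy(fetch: Callable[[], Optional[dict]], attempts: int = 10,
                       delay_secs: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[dict]:
    """Call `fetch` until it returns a body or the attempts run out."""
    for attempt in range(attempts):
        body = fetch()
        if body is not None:
            return body
        if attempt < attempts - 1:
            sleep(delay_secs)  # no need to sleep after the final attempt
    return None
```

A deploy script could call `wait_until_healthy(lambda: fetch_health("http://localhost:8080"))` and abort if the result is `None`.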
### Admin Authentication
All `/admin/*` endpoints require a Bearer token matching `MODELRELAY_ADMIN_TOKEN`:
```bash
# Set the admin token when starting the server
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret
# Query admin endpoints
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/workers
curl -H "Authorization: Bearer my-admin-secret" http://localhost:8080/admin/stats
```
If `MODELRELAY_ADMIN_TOKEN` is not set, all admin endpoints return `403 Forbidden`.
### Client API Key Authentication
When `MODELRELAY_REQUIRE_API_KEYS` is set to `true`, clients must include a valid API key as a Bearer token on inference requests (`/v1/chat/completions`, `/v1/messages`, etc.). Without a valid key, requests are rejected with `401 Unauthorized`.
```bash
# Start the server with API key auth enabled
modelrelay-server --worker-secret mysecret --admin-token my-admin-secret --require-api-keys true
# Create a client API key (the secret is returned only once)
curl -X POST -H "Authorization: Bearer my-admin-secret" \
-H "Content-Type: application/json" \
-d '{"name": "my-app"}' \
http://localhost:8080/admin/keys
# Use the key for inference
curl -H "Authorization: Bearer mr-..." \
-H "Content-Type: application/json" \
-d '{"model": "llama3.2:3b", "messages": [{"role": "user", "content": "Hello"}]}' \
http://localhost:8080/v1/chat/completions
# Revoke a key
curl -X DELETE -H "Authorization: Bearer my-admin-secret" \
http://localhost:8080/admin/keys/{key-id}
```
When `MODELRELAY_REQUIRE_API_KEYS` is `false` (the default), inference endpoints accept requests without any authentication.
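The key lifecycle shown with curl above can also be scripted. This sketch builds the same three kinds of admin request with the Python standard library. The shape of the JSON returned by `POST /admin/keys` is not documented in this section, so `send` simply returns whatever the server sends back; all function names here are hypothetical.

```python
import json
import urllib.request
from typing import Any, Optional
from urllib.parse import quote


def admin_request(base_url: str, admin_token: str, method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Build an authenticated /admin/* request without sending it."""
    headers = {"Authorization": f"Bearer {admin_token}"}
    data = None
    if body is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(body).encode("utf-8")
    return urllib.request.Request(base_url.rstrip("/") + path, data=data,
                                  headers=headers, method=method)


def create_key_request(base_url: str, admin_token: str, name: str) -> urllib.request.Request:
    """POST /admin/keys with a display name for the new client key."""
    return admin_request(base_url, admin_token, "POST", "/admin/keys", {"name": name})


def revoke_key_request(base_url: str, admin_token: str, key_id: str) -> urllib.request.Request:
    """DELETE /admin/keys/{id}; the id is percent-encoded to stay one path segment."""
    return admin_request(base_url, admin_token, "DELETE",
                         "/admin/keys/" + quote(key_id, safe=""))


def send(req: urllib.request.Request) -> Any:
    """Send a request and decode a JSON body if there is one (needs a running proxy)."""
    with urllib.request.urlopen(req) as resp:
        raw = resp.read()
    return json.loads(raw) if raw else None
```

Because the secret is returned only once, a script should store the result of `send(create_key_request(...))` immediately instead of expecting to fetch it again from `GET /admin/keys`.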
### Web Dashboard & Setup Wizard
The `modelrelay-web` crate provides an embedded web UI served by the proxy:
- **Dashboard** at `/dashboard` — real-time view of connected workers, request metrics, and queue depth
- **Setup Wizard** at `/setup` — step-by-step guide for connecting new workers (platform detection, backend configuration, worker binary download, and live connection verification)
The setup wizard is always accessible — not just on first run. Use it to add additional GPU boxes to your fleet at any time.
## Production deployment
### Docker Compose (multi-worker)
The included [`docker-compose.yml`](docker-compose.yml) runs the proxy with two workers, health checks, restart policies, memory limits, and log rotation:
```bash
cp .env.example .env # edit WORKER_SECRET and backend URLs
docker compose up -d
```
Add more workers by duplicating a worker service block and adjusting `MODELS`, `BACKEND_URL`, and `WORKER_NAME`.
### Systemd (bare metal / VM)
Service files live in [`extras/`](extras/):
```bash
# Install binaries (from a release archive or cargo build --release)
sudo install -m 755 modelrelay-server modelrelay-worker /usr/local/bin/
# Create a service user
sudo useradd --system --no-create-home modelrelay
sudo mkdir -p /var/lib/modelrelay /etc/modelrelay
# Proxy
sudo cp extras/modelrelay-server.service /etc/systemd/system/
sudo cp extras/proxy.env.example /etc/modelrelay/proxy.env
sudo vim /etc/modelrelay/proxy.env # set WORKER_SECRET
sudo systemctl enable --now modelrelay-server
# Workers — the template unit lets you run multiple instances:
sudo cp extras/modelrelay-worker@.service /etc/systemd/system/
sudo cp extras/worker.env.example /etc/modelrelay/worker-gpu0.env
sudo vim /etc/modelrelay/worker-gpu0.env # set PROXY_URL, BACKEND_URL, MODELS
sudo systemctl enable --now modelrelay-worker@gpu0
```
See [`extras/`](extras/) for the full service files and annotated env examples.
### Windows Service
ModelRelay ships Windows binaries that can run as native Windows Services using `sc.exe`. No third-party service wrappers required.
```powershell
# Install the server as a service (run as Administrator)
sc.exe create ModelRelayServer binPath= "C:\ModelRelay\modelrelay-server.exe" start= auto
# Set environment variables for the service (system-wide, persists across reboots)
[Environment]::SetEnvironmentVariable("WORKER_SECRET", "your-secret-here", "Machine")
[Environment]::SetEnvironmentVariable("LISTEN_ADDR", "0.0.0.0:8080", "Machine")
# Start the service
Start-Service ModelRelayServer
# Install a worker service
sc.exe create ModelRelayWorker binPath= '"C:\ModelRelay\modelrelay-worker.exe" --models llama3-8b' start= auto
[Environment]::SetEnvironmentVariable("PROXY_URL", "http://your-proxy:8080", "Machine")
[Environment]::SetEnvironmentVariable("BACKEND_URL", "http://localhost:8000", "Machine")
Start-Service ModelRelayWorker
```
For fully annotated install scripts with error handling and uninstall support, see [`extras/install-windows-service.ps1`](extras/install-windows-service.ps1) and [`extras/install-windows-service-worker.ps1`](extras/install-windows-service-worker.ps1). The service runs as `LocalSystem` by default; to use a dedicated account, set the service log-on via `services.msc` or pass `obj=` and `password=` to `sc.exe create`.
### TLS
The proxy and workers communicate over plain HTTP/WebSocket by default. For production, terminate TLS at a reverse proxy like nginx. An annotated configuration is provided at [`examples/tls-nginx.conf`](examples/tls-nginx.conf) — it handles HTTPS for client requests and `wss://` WebSocket upgrades for workers, with streaming-friendly settings (buffering disabled, long timeouts).
### Load Testing
A ready-made load test script lives at [`extras/load-test.sh`](extras/load-test.sh). It uses `hey` if installed, falls back to `wrk`, and finally to parallel `curl` loops:
```bash
./extras/load-test.sh -n 200 -c 20 -m llama3-8b
```
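If none of those tools is available, the same idea fits in a few lines of Python. This is a rough sketch and not a port of `load-test.sh`: it assumes `-n` and `-c` mean total requests and concurrency (as they do in `hey`), takes the request as a callable so it stays independent of any HTTP client, and reports nearest-rank percentiles. The names `run_load` and `summarize` are invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_load(send_one: Callable[[], None], total: int, concurrency: int) -> List[float]:
    """Call `send_one` `total` times across `concurrency` threads; return latencies in seconds."""
    def timed(_: int) -> float:
        start = time.perf_counter()
        send_one()
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(timed, range(total)))


def summarize(latencies: List[float]) -> Dict[str, float]:
    """Summarize latencies with nearest-rank percentiles."""
    if not latencies:
        raise ValueError("no latencies recorded")
    ordered = sorted(latencies)

    def pct(p: int) -> float:
        # Nearest rank: ceil(p * n / 100), computed with integers, clamped to at least 1.
        rank = max(1, (p * len(ordered) + 99) // 100)
        return ordered[rank - 1]

    return {"count": len(ordered), "p50": pct(50), "p95": pct(95), "max": ordered[-1]}
```

Passing a function that performs one chat completion as `send_one`, then calling `summarize(run_load(send_one, total=200, concurrency=20))`, gives figures comparable in spirit to the command above. Threads are adequate here because each call spends its time waiting on the network.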
### Shell Completions
Both `modelrelay-server` and `modelrelay-worker` can generate shell completion scripts via the hidden `--completions` flag:
```bash
# Bash
modelrelay-server --completions bash > ~/.local/share/bash-completion/completions/modelrelay-server
modelrelay-worker --completions bash > ~/.local/share/bash-completion/completions/modelrelay-worker
# Zsh (add the target directory to $fpath)
modelrelay-server --completions zsh > ~/.zfunc/_modelrelay-server
modelrelay-worker --completions zsh > ~/.zfunc/_modelrelay-worker
# Fish
modelrelay-server --completions fish > ~/.config/fish/completions/modelrelay-server.fish
modelrelay-worker --completions fish > ~/.config/fish/completions/modelrelay-worker.fish
```
Supported shells: `bash`, `zsh`, `fish`, `powershell`, `elvish`.
## Documents
> **Full documentation:** [ericflo.github.io/modelrelay](https://ericflo.github.io/modelrelay/)
- [Behavior contract](docs/behavior-contract.md) — the full specification of proxy, queue, streaming, and cancellation semantics
- [Architecture sketch](docs/architecture.md) — how the pieces fit together internally
- [Protocol walkthrough](docs/protocol-walkthrough.md) — ASCII wire traces for every message flow
- [Operational runbook](docs/runbook.md) — health checks, draining, scaling, troubleshooting
## Validation
The behavior matrix is exercised at three layers: black-box contract harnesses in `modelrelay-contract-tests`, live HTTP integration tests in `modelrelay-server`, and end-to-end live backend tests in `modelrelay-worker`.
```bash
cargo fmt --check
cargo clippy --workspace --all-targets --all-features -- -D warnings
cargo test --workspace
```
## Contributing
Bug reports, feature requests, and PRs are welcome. See
[CONTRIBUTING.md](CONTRIBUTING.md) for code style, test expectations,
branch naming, and CI secrets.
To report a security vulnerability, follow the process in
[SECURITY.md](SECURITY.md).
## License
MIT