https://github.com/robolamp/smol-llm-proxy
Lightweight API key proxy for llama.cpp servers with per-user token usage tracking.
- Host: GitHub
- URL: https://github.com/robolamp/smol-llm-proxy
- Owner: robolamp
- License: apache-2.0
- Created: 2026-04-29T18:36:06.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-27T11:48:36.000Z (about 1 month ago)
- Last Synced: 2026-05-27T13:12:29.298Z (about 1 month ago)
- Topics: llama-cpp, llamacpp, logging, proxy, python
- Language: Python
- Homepage:
- Size: 206 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# smol-llm-proxy
A small API proxy for self-hosted llama.cpp setups. It routes requests across multiple llama-server instances, issues per-user API keys, and tracks token usage. Under 1000 lines of code (excluding blanks, docstrings, and comments), ~53 MB RAM per worker, ~0.2ms per-request overhead.
Built for the case where you run multiple llama-server instances (different models, different GPUs) and want to share them across users with token tracking. Not a replacement for LiteLLM or llama-swap — see comparison below.
## Features
- Per-user API keys (create / delete / toggle active)
- Multi-server routing by model name with in-memory cache
- Model aliases (`alias` -> `model-name.gguf`)
- Token usage logging: prompt/completion tokens, timings
- Streaming and non-streaming proxy support
- Connection-pooled httpx client (keepalive connections to backends)
- SQLite backend (zero external DB dependencies)
## Quick Start
### Docker Compose (recommended)
Clone the repo, then:
```bash
cp .env.example .env # set ADMIN_KEY
cp config.example.yaml config.yaml # fill in your servers
docker compose up -d --build
```
The proxy listens on `0.0.0.0:8000` by default.
### Plain Docker
```bash
docker build -t smol-llm-proxy .
docker run -p 8000:8000 \
-e ADMIN_KEY=secret \
-v db-data:/data \
-v $(pwd)/config.yaml:/app/config.yaml:ro \
smol-llm-proxy
```
### Pip install
```bash
pip install smol-llm-proxy
```
Example configs ship with the package. Copy them:
```bash
# Copy the packaged examples to ./config.yaml and ./.env
python -c "import smol_llm_proxy, shutil, os; d=os.path.dirname(smol_llm_proxy.__file__); shutil.copy2(f'{d}/config.example.yaml','config.yaml'); shutil.copy2(f'{d}/.env.example','.env')"
# Edit .env (set ADMIN_KEY) and config.yaml (fill in your servers), then start the proxy
ADMIN_KEY=secret python -m smol_llm_proxy
```
Or download from GitHub:
```bash
curl -sO https://raw.githubusercontent.com/robolamp/smol-llm-proxy/main/.env.example && cp .env.example .env
curl -sO https://raw.githubusercontent.com/robolamp/smol-llm-proxy/main/config.example.yaml && cp config.example.yaml config.yaml
```
### Quick usage
1. Create a user key:
```bash
curl -X POST http://localhost:8000/admin/keys \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -d '{"name": "my-user"}'
```
The response contains a JSON object with a `key` field — that's the user's Bearer token. Save it; you'll need it for proxy requests.
2. Send a chat completion with the user key (replace the `sk-` placeholder with the `key` value from step 1):
```bash
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer sk-" \
-d '{
"model": "Qwen3.5-2B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
```
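The same request can be made from Python. The sketch below uses only the standard library and assumes the proxy runs at `http://localhost:8000` as in the examples above; the helper names `build_chat_request` and `send_chat` are illustrative and not part of smol-llm-proxy. It sends a non-streaming request so the whole response can be parsed as one JSON document.

```python
import json
import urllib.request


def build_chat_request(base_url: str, user_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # non-streaming: the reply arrives as a single JSON object
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {user_key}",  # the user key, not ADMIN_KEY
        },
        method="POST",
    )


def send_chat(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `send_chat(build_chat_request("http://localhost:8000", user_key, "Qwen3.5-2B", "Hello!"))` with the key from step 1 should return the backend's JSON reply, provided the proxy and a backend serving that model are running.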
## Configuration
The proxy reads two files: `config.yaml` for routing and `.env` for runtime settings.
### `config.yaml` — servers, models, aliases
Loaded into SQLite at startup, persisted across restarts:
```yaml
servers:
  - name: my-server
    url: http://host:port
    api_key: "" # optional, if llama-server requires auth
    models:
      - model-name.gguf
aliases:
  alias: model-name.gguf # short name -> real model name
```
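To illustrate how a requested model name could map to a backend under this config, here is a minimal sketch. It works on a plain dict with the same shape as the YAML above. The functions `build_routes` and `resolve_backend` are hypothetical, and the "first server listed wins" rule for models hosted on several servers is an assumption, not documented behavior of the proxy.

```python
def build_routes(config: dict) -> dict:
    """Map each model name to the URL of a server that lists it."""
    routes = {}
    for server in config.get("servers", []):
        for model in server.get("models", []):
            # Assumption: if several servers list the same model, the first one wins.
            routes.setdefault(model, server["url"])
    return routes


def resolve_backend(requested: str, config: dict) -> str:
    """Resolve an alias (if any) to a real model name, then to a server URL."""
    aliases = config.get("aliases") or {}
    real_name = aliases.get(requested, requested)
    routes = build_routes(config)
    if real_name not in routes:
        raise LookupError(f"no server registered for model {real_name!r}")
    return routes[real_name]


example_config = {
    "servers": [
        {"name": "my-server", "url": "http://host:8080", "api_key": "", "models": ["model-name.gguf"]},
    ],
    "aliases": {"alias": "model-name.gguf"},
}
```

With `example_config`, both `resolve_backend("alias", example_config)` and `resolve_backend("model-name.gguf", example_config)` return the same server URL, while an unknown model name raises `LookupError`.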
### Environment variables
| Variable | Default | Description |
|----------|---------|-------------|
| `ADMIN_KEY` | required | Bearer token for `/admin/*` endpoints |
| `PROXY_HOST` | `0.0.0.0` | Listen address |
| `PROXY_PORT` | `8000` | Listen port |
| `DB_PATH` | `data/proxy.db` | SQLite database location |
| `CONFIG_PATH` | `./config.yaml` | Path to config file |
For `pip install`, set them in shell or via a `.env`-like loader. For Docker Compose, put them in `.env`:
```bash
ADMIN_KEY=secret
PROXY_PORT=8000
```
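For reference, the defaults in the table can be expressed as a small settings loader. This is a sketch of how such variables are commonly read, not the proxy's actual startup code; the `Settings` class and `load_settings` function are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    admin_key: str
    host: str
    port: int
    db_path: str
    config_path: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying the documented defaults."""
    env = os.environ if env is None else env
    admin_key = env.get("ADMIN_KEY")
    if not admin_key:
        # ADMIN_KEY has no default: it is the only required variable.
        raise RuntimeError("ADMIN_KEY is required")
    return Settings(
        admin_key=admin_key,
        host=env.get("PROXY_HOST", "0.0.0.0"),
        port=int(env.get("PROXY_PORT", "8000")),
        db_path=env.get("DB_PATH", "data/proxy.db"),
        config_path=env.get("CONFIG_PATH", "./config.yaml"),
    )
```

Passing an explicit mapping, for example `load_settings({"ADMIN_KEY": "secret"})`, makes the defaults easy to check without touching the real environment.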
### Docker volumes
The Compose setup mounts two volumes:
- `db-data:/data` — SQLite database, persists across container restarts
- `./config.yaml:/config/config.yaml:ro` — config file, read-only
## Admin API
All admin endpoints require an `Authorization: Bearer $ADMIN_KEY` header.
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/admin/servers` | `GET` | List all registered servers |
| `/admin/servers` | `POST` | Register a llama-server |
| `/admin/servers/{id}` | `DELETE` / `PATCH` | Remove or update server |
| `/admin/servers/{id}/models` | `POST` / `DELETE` | Assign/unassign model name |
| `/admin/keys` | `GET` | List all API keys |
| `/admin/keys` | `POST` | Create user key |
| `/admin/keys/{key_id}` | `DELETE` | Revoke key (by integer id) |
| `/admin/keys/{key_id}/toggle` | `PATCH` | Activate/deactivate key (by integer id) |
| `/admin/aliases` | `GET` / `POST` | List or create model aliases |
| `/admin/aliases/{alias_name}` | `DELETE` | Delete alias |
| `/admin/usage` | `GET` | View token usage logs |
**Note:** Key operations (`DELETE`, `PATCH /toggle`) use integer `key_id` from the database, not the API key string itself.
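The endpoints in the table can be wrapped in a few small helpers. The sketch below only builds the requests with the standard library; the function names are illustrative, and response fields other than `key` are not documented here, so nothing in the code relies on them.

```python
import json
import urllib.request


def admin_request(base_url: str, admin_key: str, method: str, path: str, payload=None) -> urllib.request.Request:
    """Build an authenticated request for an /admin/* endpoint."""
    headers = {"Authorization": f"Bearer {admin_key}"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(f"{base_url.rstrip('/')}{path}", data=data, headers=headers, method=method)


def create_key(base_url: str, admin_key: str, name: str) -> urllib.request.Request:
    return admin_request(base_url, admin_key, "POST", "/admin/keys", {"name": name})


def toggle_key(base_url: str, admin_key: str, key_id: int) -> urllib.request.Request:
    # key_id is the integer database id, not the API key string itself.
    if not isinstance(key_id, int):
        raise TypeError("key_id must be the integer id, not the key string")
    return admin_request(base_url, admin_key, "PATCH", f"/admin/keys/{key_id}/toggle")


def revoke_key(base_url: str, admin_key: str, key_id: int) -> urllib.request.Request:
    if not isinstance(key_id, int):
        raise TypeError("key_id must be the integer id, not the key string")
    return admin_request(base_url, admin_key, "DELETE", f"/admin/keys/{key_id}")
```

Each helper returns a `urllib.request.Request` that can be passed to `urllib.request.urlopen`. The integer ids needed for toggling or revoking can presumably be looked up with `GET /admin/keys`.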
## Proxy Endpoints
These forward to llama-server backends based on model name routing.
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | `POST` | Chat completions (streaming + non-streaming) |
| `/v1/completions` | `POST` | Legacy completions |
| `/v1/embeddings` | `POST` | Embeddings |
| `/v1/models` | `GET` | List available models (no auth required) |
| `/health` | `GET` | Health check (no auth required) |
## Usage Logs
Each request logs: user, server, model name, prompt/completion tokens, timings (ms), and total tokens. No conversation content is stored.
```bash
curl "http://localhost:8000/admin/usage?key_id=1" \
-H "Authorization: Bearer $ADMIN_KEY"
```
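Once usage records are fetched, per-user totals are a simple aggregation. The exact JSON schema of `/admin/usage` is not shown in this README, so the field names below (`user`, `prompt_tokens`, `completion_tokens`) are assumptions chosen for illustration; adjust them to match the actual response.

```python
from collections import defaultdict


def tokens_per_user(records: list) -> dict:
    """Sum prompt and completion tokens per user from a list of usage records."""
    totals = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0})
    for record in records:
        entry = totals[record["user"]]  # hypothetical field name
        entry["prompt_tokens"] += record["prompt_tokens"]
        entry["completion_tokens"] += record["completion_tokens"]
        entry["total_tokens"] += record["prompt_tokens"] + record["completion_tokens"]
    return dict(totals)


# Made-up sample records, only to show the shape the function expects.
sample_records = [
    {"user": "alice", "prompt_tokens": 12, "completion_tokens": 30},
    {"user": "bob", "prompt_tokens": 5, "completion_tokens": 7},
    {"user": "alice", "prompt_tokens": 8, "completion_tokens": 20},
]
```

With `sample_records`, the totals for `alice` are 20 prompt tokens, 50 completion tokens, and 70 tokens overall.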
## Benchmarking
Proxy overhead measured with Locust using **parallel concurrent execution** — both benchmarks (direct and proxy) hit the same backend simultaneously for a fair comparison.
All requests authenticated, routed, and logged via SQLite on every call (cold cache mode). In-memory cache disabled to measure worst-case per-request overhead.
**Hardware:** i9-14900K, RTX 4090 | **Model:** Qwen3.5-2B-GGUF | **Backend:** llama.cpp server
### Mock backend (fixed 100ms delay, clean conditions)
| Mode | Users (direct+proxy) | Direct P50 | Proxy P50 | Overhead P50 | Direct P95 | Proxy P95 | Overhead P95 | Direct P99 | Proxy P99 | Overhead P99 | Direct Mean | Proxy Mean | Overhead Mean | Direct RPS | Proxy RPS | RPS Overhead |
|------|-------|-----------|-----------|-------------|-----------|-----------|-------------|-----------|----------|-------------|------------|--------------|--------------|-----------|--------------|-------------|
| Low | 5+5 | 100ms | 100ms | +0ms | 100ms | 100ms | +0ms | 110ms | 130ms | +20ms | 101ms | 103ms | +2ms | 49.3 | 48.4 | -0.9 |
| Medium | 20+20 | 100ms | 100ms | +0ms | 100ms | 110ms | +10ms | 110ms | 130ms | +20ms | 101ms | 104ms | +2ms | 194.8 | 190.9 | -3.9 |
| High | 100+100 | 100ms | 110ms | +10ms | 100ms | 120ms | +20ms | 110ms | 130ms | +20ms | 102ms | 107ms | +4ms | 896.2 | 860.3 | -35.9 |
### Real llama-server backend
| Mode | Users (direct+proxy) | Direct P50 | Proxy P50 | Overhead P50 | Direct P95 | Proxy P95 | Overhead P95 | Direct P99 | Proxy P99 | Overhead P99 | Direct Mean | Proxy Mean | Overhead Mean | Direct RPS | Proxy RPS | RPS Overhead |
|------|-------|-----------|-----------|-------------|-----------|-----------|-------------|-----------|----------|-------------|------------|--------------|--------------|-----------|--------------|-------------|
| Low | 5+5 | 570ms | 570ms | +0ms | 940ms | 930ms | -10ms | ~1200ms | ~1100ms | ~0ms | 605ms | 604ms | -1ms | 8.2 | 8.2 | 0.0 |
| Medium | 20+20 | ~2500ms | ~2500ms | ~0ms | ~2900ms | ~2900ms | ~0ms | ~14000ms | ~14000ms | ~0ms | ~2413ms | ~2460ms | +47ms | 8.0 | 7.9 | -0.1 |
| High | 100+100 | 12000ms | 12000ms | ~0ms | 13000ms | 13000ms | ~0ms | 13000ms | 13000ms | ~0ms | 10073ms | 10058ms | -15ms | 8.1 | 8.1 | 0.0 |
Proxy overhead under clean conditions (mock): **+2–4ms** mean and **+20ms** P99 across all load levels with 4 uvicorn workers. Against the real backend the overhead is **negligible**: latency is identical within measurement noise (~1s variance at the tail).
Run your own benchmarks: `python tests/benchmark/run.py [low|medium|high]` (add `--mock` for fixed-delay backend)
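To put the mock-backend throughput numbers in relative terms, the drop in requests per second can be computed directly from the Direct and Proxy RPS columns of the mock table. The snippet below is plain arithmetic on those published values; the function name `rps_drop_percent` is illustrative.

```python
# RPS values copied from the mock-backend table: (direct, through proxy).
MOCK_RPS = {
    "low": (49.3, 48.4),
    "medium": (194.8, 190.9),
    "high": (896.2, 860.3),
}


def rps_drop_percent(direct_rps: float, proxy_rps: float) -> float:
    """Relative throughput loss through the proxy, as a percentage of direct RPS."""
    return (direct_rps - proxy_rps) / direct_rps * 100.0


relative_drop = {mode: round(rps_drop_percent(d, p), 1) for mode, (d, p) in MOCK_RPS.items()}
```

This works out to roughly 1.8% at low load, 2.0% at medium load, and 4.0% at high load.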
### Memory footprint
| Workers | Idle | Under load | Growth |
|---------|---------|------------|--------|
| 1 | 53 MB | 62 MB | +9 MB |
| 4 | 252 MB | 273 MB | +21 MB |
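Single-worker baseline: **~53 MB** idle. Growth under load: **+9 MB** with one worker and **+21 MB** in total (about 5 MB per worker) with four. The four-worker idle total of 252 MB works out to about 63 MB per worker.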
Identical footprint against mock and real backends — the proxy forwards without buffering responses. No memory growth observed across extended runs.
### Caveat
The aggregate overhead numbers (+2-4ms mean, +20ms P99 on mock) include asyncio event loop contention at high concurrency (100+ concurrent users per worker). Per-request proxy logic itself is **~0.18ms** — the difference comes from how asyncio handles many simultaneous awaits on a single thread. With 4 uvicorn workers, each worker handles ~25 requests, keeping contention minimal.
Real backend P99 spikes (Medium mode, ~14s) are caused by llama.cpp's single-threaded inference bottleneck under 20 concurrent users, not by proxy overhead. The proxy adds negligible latency regardless of backend saturation.
## How it compares
smol-llm-proxy is built for one specific case: **multiple llama-server instances, multiple users, per-user token accounting**.
- **[LiteLLM](https://github.com/BerriAI/litellm)** — much broader scope: 100+ cloud providers, virtual keys, budgets, admin UI, fallbacks. Requires Postgres + Redis for full features. Use it if you need a production gateway across cloud LLMs.
- **[llama-swap](https://github.com/mostlygeek/llama-swap)** — solves a different problem: hot-swapping models on one llama.cpp instance. No users, no accounting. Use it if you run many models on one machine and want them loaded on demand.
- **[llama.cpp router mode](https://github.com/ggerganov/llama.cpp)** — built into llama-server itself. Same scope as llama-swap, no auth layer.
If you self-host several llama-server instances on one or more machines and want to share them with a small group while tracking usage, smol-llm-proxy is the smallest thing that does that. Otherwise, one of the above is probably a better fit.
## Architecture
```
[users] ──HTTPS──> [proxy :port] ──HTTP──> [llama-server 1 :port]
                        │                  [llama-server 2 :port]
                        │                  [llama-server N :port]
                        │
                        ├── in-memory cache (keys, aliases, routes), TTL 30s
                        ├── validate API key + resolve routing (SQLite on first call, then cache)
                        ├── forward request via connection-pooled httpx client
                        └── async log tokens + timings (background worker, no blocking)
```
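The "SQLite on first call, then cache" step in the diagram is a read-through cache with a time-to-live. Below is a minimal sketch of that pattern, assuming a 30-second TTL as stated above; the `TTLCache` class and its `loader` callback are hypothetical and stand in for whatever lookup the proxy actually performs against SQLite.

```python
import time


class TTLCache:
    """Read-through cache: serve a stored value until it expires, then reload it."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}   # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: no database access
        value = loader(key)  # cache miss or expired: hit the backing store
        self._entries[key] = (now + self._ttl, value)
        return value
```

A consequence of this design is that a change in the backing store, such as a deactivated key, may take up to one TTL to become visible unless the cache is invalidated explicitly.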