https://github.com/robolamp/smol-llm-proxy

Lightweight API key proxy for llama.cpp servers with per-user token usage tracking.
Host: GitHub
URL: https://github.com/robolamp/smol-llm-proxy
Owner: robolamp
License: apache-2.0
Created: 2026-04-29T18:36:06.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-05-27T11:48:36.000Z (about 2 months ago)
Last Synced: 2026-05-27T13:12:29.298Z (about 2 months ago)
Topics: llama-cpp, llamacpp, logging, proxy, python
Language: Python
Homepage:
Size: 206 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# smol-llm-proxy

[![PyPI version](https://img.shields.io/pypi/v/smol-llm-proxy)](https://pypi.org/project/smol-llm-proxy/)

[![CI](https://github.com/robolamp/smol-llm-proxy/actions/workflows/ci.yml/badge.svg)](https://github.com/robolamp/smol-llm-proxy/actions)

[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)

A small API proxy for self-hosted llama.cpp setups. It routes requests across multiple llama-server instances, issues per-user API keys, and tracks token usage. Under 1000 lines of code (excluding blanks, docstrings, and comments), ~53 MB RAM per worker, ~0.2ms of per-request overhead.

Built for the case where you run multiple llama-server instances (different models, different GPUs) and want to share them across users with token tracking. Not a replacement for LiteLLM or llama-swap — see comparison below.

## Features

- Per-user API keys (create / delete / toggle active)

- Multi-server routing by model name with in-memory cache

- Model aliases (`alias` -> `model-name.gguf`)

- Token usage logging: prompt/completion tokens, timings

- Streaming and non-streaming proxy support

- Connection-pooled httpx client (keepalive connections to backends)

- SQLite backend (zero external DB dependencies)

## Quick Start

### Docker Compose (recommended)

Clone the repo, then:

```bash
cp .env.example .env                # set ADMIN_KEY
cp config.example.yaml config.yaml  # fill in your servers
docker compose up -d --build
```

The proxy listens on `0.0.0.0:8000` by default.

### Plain Docker

```bash
docker build -t smol-llm-proxy .
docker run -p 8000:8000 \
  -e ADMIN_KEY=secret \
  -v db-data:/data \
  -v "$(pwd)/config.yaml:/app/config.yaml:ro" \
  smol-llm-proxy
```

### Pip install

```bash
pip install smol-llm-proxy
```

Example configs ship with the package. Copy them:

```bash
# copy the bundled examples straight to config.yaml and .env
python -c "import smol_llm_proxy, shutil, os; d=os.path.dirname(smol_llm_proxy.__file__); shutil.copy2(f'{d}/config.example.yaml','config.yaml'); shutil.copy2(f'{d}/.env.example','.env')"
# edit .env to set ADMIN_KEY, and config.yaml to fill in your servers
ADMIN_KEY=secret python -m smol_llm_proxy
```

Or download from GitHub:

```bash
curl -sO https://raw.githubusercontent.com/robolamp/smol-llm-proxy/main/.env.example && cp .env.example .env
curl -sO https://raw.githubusercontent.com/robolamp/smol-llm-proxy/main/config.example.yaml && cp config.example.yaml config.yaml
```

### Quick usage

1. Create a user key:

```bash
curl -X POST http://localhost:8000/admin/keys \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $ADMIN_KEY" \
  -d '{"name": "my-user"}'
```

The response contains a JSON object with a `key` field — that's the user's Bearer token. Save it; you'll need it for proxy requests.

2. Send a chat completion with the user key:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-" \
  -d '{
    "model": "Qwen3.5-2B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```
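The same call can be made from Python. The sketch below uses only the standard library and assumes the proxy is reachable at `http://localhost:8000` and replies with an OpenAI-style JSON body; `build_chat_request` and `send_chat` are hypothetical helper names, not part of smol-llm-proxy, and streaming is switched off to keep response parsing simple.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, prompt):
    """Build (but do not send) a non-streaming chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # the user key, not ADMIN_KEY
        },
        method="POST",
    )


def send_chat(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage against a running proxy (the key value is a placeholder):
# req = build_chat_request("http://localhost:8000", "your-user-key", "Qwen3.5-2B", "Hello!")
# print(send_chat(req))
```

Keeping request construction separate from sending makes it easy to inspect the headers and body before anything goes over the network.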

## Configuration

The proxy reads two files: `config.yaml` for routing and `.env` for runtime settings.

### `config.yaml` — servers, models, aliases

The file is loaded into SQLite at startup, and the loaded data persists across restarts:

```yaml
servers:
  - name: my-server
    url: http://host:port
    api_key: ""              # optional, if llama-server requires auth
    models:
      - model-name.gguf
aliases:
  alias: model-name.gguf     # short name -> real model name
```
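To illustrate what this config means for routing, here is a rough sketch of the lookup: an alias is first mapped to its real model name, then the model name selects a server. The config is mirrored as a plain dict so no YAML parser is needed, and `resolve` is a hypothetical function written for this example; the proxy's actual lookup code may differ.

```python
# Same shape as the config.yaml example, written as a Python dict.
CONFIG = {
    "servers": [
        {
            "name": "my-server",
            "url": "http://host:port",
            "api_key": "",
            "models": ["model-name.gguf"],
        },
    ],
    "aliases": {"alias": "model-name.gguf"},
}


def resolve(config, requested):
    """Return (server_url, real_model_name) for a requested model or alias."""
    # Step 1: an alias resolves to the real model name; other names pass through.
    model = config.get("aliases", {}).get(requested, requested)
    # Step 2: the first server that lists the model receives the request.
    for server in config.get("servers", []):
        if model in server.get("models", []):
            return server["url"], model
    raise LookupError(f"no server is configured for model {requested!r}")


print(resolve(CONFIG, "alias"))
print(resolve(CONFIG, "model-name.gguf"))
```

Both calls resolve to the same server and the same real model name, which is the point of an alias.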

### Environment variables

| Variable | Default | Description |
|----------|---------|-------------|
| `ADMIN_KEY` | required | Bearer token for `/admin/*` endpoints |
| `PROXY_HOST` | `0.0.0.0` | Listen address |
| `PROXY_PORT` | `8000` | Listen port |
| `DB_PATH` | `data/proxy.db` | SQLite database location |
| `CONFIG_PATH` | `./config.yaml` | Path to config file |

For a `pip install` setup, set them in your shell or load them with a `.env` loader. For Docker Compose, put them in `.env`:

```bash
ADMIN_KEY=secret
PROXY_PORT=8000
```

### Docker volumes

The Compose setup mounts two volumes:

- `db-data:/data` — SQLite database, persists across container restarts

- `./config.yaml:/config/config.yaml:ro` — config file, read-only

## Admin API

All admin endpoints require an `Authorization: Bearer $ADMIN_KEY` header.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/admin/servers` | `GET` | List all registered servers |
| `/admin/servers` | `POST` | Register a llama-server |
| `/admin/servers/{id}` | `DELETE` / `PATCH` | Remove or update server |
| `/admin/servers/{id}/models` | `POST` / `DELETE` | Assign/unassign model name |
| `/admin/keys` | `GET` | List all API keys |
| `/admin/keys` | `POST` | Create user key |
| `/admin/keys/{key_id}` | `DELETE` | Revoke key (by integer id) |
| `/admin/keys/{key_id}/toggle` | `PATCH` | Activate/deactivate key (by integer id) |
| `/admin/aliases` | `GET` / `POST` | List or create model aliases |
| `/admin/aliases/{alias_name}` | `DELETE` | Delete alias |
| `/admin/usage` | `GET` | View token usage logs |

**Note:** Key operations (`DELETE`, `PATCH /toggle`) use integer `key_id` from the database, not the API key string itself.
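A short sketch of key management from Python, using only the standard library. `admin_request`, `toggle_key`, and `revoke_key` are hypothetical helpers written for this example, and the type check is there only to make the integer-id rule from the note explicit. The functions build requests without sending them; pass the result to `urllib.request.urlopen` to execute it.

```python
import json
import urllib.request


def admin_request(base_url, admin_key, method, path, payload=None):
    """Build (but do not send) a request to an /admin/* endpoint."""
    headers = {"Authorization": f"Bearer {admin_key}"}
    data = None
    if payload is not None:
        headers["Content-Type"] = "application/json"
        data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}{path}", data=data, headers=headers, method=method
    )


def _require_integer_id(key_id):
    """Reject the API key string; the endpoints expect the database id."""
    if not isinstance(key_id, int):
        raise TypeError("key_id must be the integer id, not the API key string")


def toggle_key(base_url, admin_key, key_id):
    """Build the request that activates or deactivates a key."""
    _require_integer_id(key_id)
    return admin_request(base_url, admin_key, "PATCH", f"/admin/keys/{key_id}/toggle")


def revoke_key(base_url, admin_key, key_id):
    """Build the request that deletes a key."""
    _require_integer_id(key_id)
    return admin_request(base_url, admin_key, "DELETE", f"/admin/keys/{key_id}")


# Usage against a running proxy:
# urllib.request.urlopen(toggle_key("http://localhost:8000", "secret", 1))
```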

## Proxy Endpoints

These forward to llama-server backends based on model name routing.

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/v1/chat/completions` | `POST` | Chat completions (streaming + non-streaming) |
| `/v1/completions` | `POST` | Legacy completions |
| `/v1/embeddings` | `POST` | Embeddings |
| `/v1/models` | `GET` | List available models (no auth required) |
| `/health` | `GET` | Health check (no auth required) |

## Usage Logs

Each request logs: user, server, model name, prompt/completion tokens, timings (ms), and total tokens. No conversation content is stored.

```bash
curl "http://localhost:8000/admin/usage?key_id=1" \
  -H "Authorization: Bearer $ADMIN_KEY"
```
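Once the logs are fetched, per-user totals are a small aggregation. The sketch below assumes the usage endpoint returns a list of records with `user`, `prompt_tokens`, and `completion_tokens` fields; those field names are assumptions for illustration, so check the actual response before relying on them.

```python
from collections import defaultdict


def tokens_per_user(records):
    """Sum prompt, completion, and total tokens for each user."""
    totals = defaultdict(
        lambda: {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    )
    for record in records:
        entry = totals[record["user"]]
        prompt = record.get("prompt_tokens", 0)
        completion = record.get("completion_tokens", 0)
        entry["prompt_tokens"] += prompt
        entry["completion_tokens"] += completion
        entry["total_tokens"] += prompt + completion
    return dict(totals)


# Hypothetical sample records, not real output from the proxy.
sample = [
    {"user": "my-user", "prompt_tokens": 12, "completion_tokens": 30},
    {"user": "my-user", "prompt_tokens": 8, "completion_tokens": 50},
    {"user": "other-user", "prompt_tokens": 5, "completion_tokens": 5},
]
print(tokens_per_user(sample))
```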

## Benchmarking

Proxy overhead measured with Locust using **parallel concurrent execution** — both benchmarks (direct and proxy) hit the same backend simultaneously for a fair comparison.

Every request was authenticated, routed, and logged via SQLite on each call (cold-cache mode). The in-memory cache was disabled to measure worst-case per-request overhead.

**Hardware:** i9-14900K, RTX 4090 | **Model:** Qwen3.5-2B-GGUF | **Backend:** llama.cpp server

### Mock backend (fixed 100ms delay, clean conditions)

| Mode | Users | Direct P50 | Proxy P50 | Overhead P50 | Direct P95 | Proxy P95 | Overhead P95 | Direct P99 | Proxy P99 | Overhead P99 | Direct Mean | Proxy Mean | Overhead Mean | Direct RPS | Proxy RPS | RPS overhead |
|------|-------|-----------|-----------|-------------|-----------|-----------|-------------|-----------|----------|-------------|------------|--------------|--------------|-----------|--------------|-------------|
| Low | 5+5 | 100ms | 100ms | +0ms | 100ms | 100ms | +0ms | 110ms | 130ms | +20ms | 101ms | 103ms | +2ms | 49.3 | 48.4 | -0.9 |
| Medium | 20+20 | 100ms | 100ms | +0ms | 100ms | 110ms | +10ms | 110ms | 130ms | +20ms | 101ms | 104ms | +2ms | 194.8 | 190.9 | -3.9 |
| High | 100+100 | 100ms | 110ms | +10ms | 100ms | 120ms | +20ms | 110ms | 130ms | +20ms | 102ms | 107ms | +4ms | 896.2 | 860.3 | -35.9 |

### Real llama-server backend

| Mode | Users | Direct P50 | Proxy P50 | Overhead P50 | Direct P95 | Proxy P95 | Overhead P95 | Direct P99 | Proxy P99 | Overhead P99 | Direct Mean | Proxy Mean | Overhead Mean | Direct RPS | Proxy RPS | RPS overhead |
|------|-------|-----------|-----------|-------------|-----------|-----------|-------------|-----------|----------|-------------|------------|--------------|--------------|-----------|--------------|-------------|
| Low | 5+5 | 570ms | 570ms | +0ms | 940ms | 930ms | -10ms | ~1200ms | ~1100ms | ~0ms | 605ms | 604ms | -1ms | 8.2 | 8.2 | 0.0 |
| Medium | 20+20 | ~2500ms | ~2500ms | ~0ms | ~2900ms | ~2900ms | ~0ms | ~14000ms | ~14000ms | ~0ms | ~2413ms | ~2460ms | +47ms | 8.0 | 7.9 | -0.1 |
| High | 100+100 | 12000ms | 12000ms | ~0ms | 13000ms | 13000ms | ~0ms | 13000ms | 13000ms | ~0ms | 10073ms | 10058ms | -15ms | 8.1 | 8.1 | -0.0 |

Proxy overhead under clean conditions (mock backend): **+2 to +4ms** mean and **+20ms** P99 across all load levels with 4 uvicorn workers. Against the real backend the overhead is **negligible**: latency is identical within measurement noise (~1s variance at the tail).

Run your own benchmarks: `python tests/benchmark/run.py [low|medium|high]` (add `--mock` for fixed-delay backend)
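The RPS columns of the mock-backend table can also be read as a relative throughput cost. This small calculation uses only the Direct RPS and proxy RPS figures reported above; `throughput_loss_percent` is a helper name made up for this example.

```python
# (direct RPS, proxy RPS) pairs copied from the mock-backend table.
MOCK_RPS = {
    "low": (49.3, 48.4),
    "medium": (194.8, 190.9),
    "high": (896.2, 860.3),
}


def throughput_loss_percent(direct_rps, proxy_rps):
    """Relative drop in requests per second when going through the proxy."""
    return (direct_rps - proxy_rps) / direct_rps * 100.0


for mode, (direct, proxy) in MOCK_RPS.items():
    loss = throughput_loss_percent(direct, proxy)
    print(f"{mode}: {loss:.1f}% fewer requests per second through the proxy")
```

This prints roughly 1.8%, 2.0%, and 4.0% for low, medium, and high load, so the relative cost grows with concurrency but stays in the low single digits on the mock backend.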

### Memory footprint

| Workers | Idle    | Under load | Growth |
|---------|---------|------------|--------|
| 1       | 53 MB   | 62 MB      | +9 MB  |
| 4       | 252 MB  | 273 MB     | +21 MB |

Per-worker baseline: **~53 MB**; load growth: **+5 to +9 MB** per worker (+9 MB with 1 worker, +21 MB in total across 4 workers).

Identical footprint against mock and real backends — the proxy forwards without buffering responses. No memory growth observed across extended runs.

### Caveat

The aggregate overhead numbers (+2 to +4ms mean, +20ms P99 on mock) include asyncio event loop contention at high concurrency (100+ concurrent users). The per-request proxy logic itself takes **~0.18ms**; the rest comes from how asyncio handles many simultaneous awaits on a single thread. With 4 uvicorn workers, each worker handles ~25 of those requests, which keeps contention minimal.

Real backend P99 spikes (Medium mode, ~14s) are caused by llama.cpp's single-threaded inference bottleneck under 20 concurrent users, not by proxy overhead. The proxy adds negligible latency regardless of backend saturation.

## How it compares

smol-llm-proxy is built for one specific case: **multiple llama-server instances, multiple users, per-user token accounting**. 

- **[LiteLLM](https://github.com/BerriAI/litellm)** — much broader scope: 100+ cloud providers, virtual keys, budgets, admin UI, fallbacks. Requires Postgres + Redis for full features. Use it if you need a production gateway across cloud LLMs.

- **[llama-swap](https://github.com/mostlygeek/llama-swap)** — solves a different problem: hot-swapping models on one llama.cpp instance. No users, no accounting. Use it if you run many models on one machine and want them loaded on demand.

- **[llama.cpp router mode](https://github.com/ggerganov/llama.cpp)** — built into llama-server itself. Same scope as llama-swap, no auth layer.

If you self-host several llama-server instances on one or more machines and want to share them with a small group while tracking usage, smol-llm-proxy is the smallest thing that does that. Otherwise, one of the above is probably a better fit.

## Architecture

```
[users] ──HTTPS──> [proxy :port] ──HTTP──> [llama-server 1 :port]
                        │                  [llama-server 2 :port]
                        │                  [llama-server N :port]
                        │
                        ├── in-memory cache (keys, aliases, routes), TTL 30s
                        ├── validate API key + resolve routing (SQLite on first call, then cache)
                        ├── forward request via connection-pooled httpx client
                        └── async log tokens + timings (background worker, no blocking)
```
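The "SQLite on first call, then cache" step in the diagram is a read-through cache with a time-to-live. Below is a minimal sketch of that pattern, assuming a 30-second TTL as shown in the diagram. `TTLCache` and its `loader` callback are illustrative names and not the proxy's actual implementation; in the real proxy the loader's role would be played by a SQLite lookup.

```python
import time


class TTLCache:
    """Read-through cache: call `loader` on a miss or after the entry expires."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader        # stands in for the database lookup
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]            # fresh entry: no loader call
        value = self._loader(key)    # miss or expired: reload and store
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage with a stand-in loader that records how often it is called.
lookups = []


def load_route(model):
    lookups.append(model)
    return f"http://host:port for {model}"


routes = TTLCache(load_route)
routes.get("model-name.gguf")
routes.get("model-name.gguf")
print(len(lookups))  # the second get is served from the cache
```

A monotonic clock is used so that wall-clock adjustments cannot extend or shorten an entry's lifetime.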