{"id":50350995,"url":"https://github.com/robolamp/smol-llm-proxy","last_synced_at":"2026-05-29T21:01:07.223Z","repository":{"id":355277392,"uuid":"1224897278","full_name":"robolamp/smol-llm-proxy","owner":"robolamp","description":"Lightweight API key proxy for llama.cpp servers with per-user token usage tracking.","archived":false,"fork":false,"pushed_at":"2026-05-27T11:48:36.000Z","size":211,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-27T13:12:29.298Z","etag":null,"topics":["llama-cpp","llamacpp","logging","proxy","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/robolamp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-29T18:36:06.000Z","updated_at":"2026-05-27T11:36:22.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/robolamp/smol-llm-proxy","commit_stats":null,"previous_names":["robolamp/smol-llm-proxy"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/robolamp/smol-llm-proxy","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robolamp%2Fsmol-llm-proxy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robolamp%2Fsmol-llm-proxy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robolamp%2Fsmol-llm-proxy/releases","manif
ests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robolamp%2Fsmol-llm-proxy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/robolamp","download_url":"https://codeload.github.com/robolamp/smol-llm-proxy/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/robolamp%2Fsmol-llm-proxy/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33670211,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-29T02:00:06.066Z","response_time":107,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llama-cpp","llamacpp","logging","proxy","python"],"created_at":"2026-05-29T21:00:37.050Z","updated_at":"2026-05-29T21:01:07.194Z","avatar_url":"https://github.com/robolamp.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# smol-llm-proxy\n\n[![PyPI version](https://img.shields.io/pypi/v/smol-llm-proxy)](https://pypi.org/project/smol-llm-proxy/)\n[![CI](https://github.com/robolamp/smol-llm-proxy/actions/workflows/ci.yml/badge.svg)](https://github.com/robolamp/smol-llm-proxy/actions)\n[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)\n\nA small API proxy for self-hosted 
llama.cpp setups. Routes across multiple llama-server instances, per-user API keys, token usage tracking. \u003c1000 code lines (excluding blanks, docstrings, comments), ~53 MB RAM, ~0.2ms overhead.\n\nBuilt for the case where you run multiple llama-server instances (different models, different GPUs) and want to share them across users with token tracking. Not a replacement for LiteLLM or llama-swap — see comparison below.\n\n## Features\n\n- Per-user API keys (create / delete / toggle active)\n- Multi-server routing by model name with in-memory cache\n- Model aliases (`alias` -\u003e `model-name.gguf`)\n- Token usage logging: prompt/completion tokens, timings\n- Streaming and non-streaming proxy support\n- Connection-pooled httpx client (keepalive connections to backends)\n- SQLite backend (zero external DB dependencies)\n\n## Quick Start\n\n### Docker Compose (recommended)\n\nClone the repo, then:\n\n```bash\ncp .env.example .env                # set ADMIN_KEY\ncp config.example.yaml config.yaml  # fill in your servers\ndocker compose up -d --build\n```\n\nThe proxy listens on `0.0.0.0:8000` by default.\n\n### Plain Docker\n\n```bash\ndocker build -t smol-llm-proxy .\ndocker run -p 8000:8000 \\\n  -e ADMIN_KEY=secret \\\n  -v db-data:/data \\\n  -v $(pwd)/config.yaml:/app/config.yaml:ro \\\n  smol-llm-proxy\n```\n\n### Pip install\n\n```bash\npip install smol-llm-proxy\n```\n\nExample configs ship with the package. 
Copy them:\n\n```bash\npython -c \"import smol_llm_proxy, shutil, os; d=os.path.dirname(smol_llm_proxy.__file__); shutil.copy2(f'{d}/config.example.yaml','config.yaml'); shutil.copy2(f'{d}/.env.example','.env')\"\n# The command above already wrote .env and config.yaml to the current directory.\n# Edit .env to set ADMIN_KEY and config.yaml to list your servers, then start the proxy:\nADMIN_KEY=secret python -m smol_llm_proxy\n```\n\nOr download from GitHub:\n\n```bash\ncurl -sO https://raw.githubusercontent.com/robolamp/smol-llm-proxy/main/.env.example \u0026\u0026 cp .env.example .env\ncurl -sO https://raw.githubusercontent.com/robolamp/smol-llm-proxy/main/config.example.yaml \u0026\u0026 cp config.example.yaml config.yaml\n```\n\n### Quick usage\n\n1. Create a user key:\n\n```bash\ncurl -X POST http://localhost:8000/admin/keys \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer $ADMIN_KEY\" \\\n  -d '{\"name\": \"my-user\"}'\n```\n\nThe response is a JSON object with a `key` field; that value is the user's Bearer token. Save it; you'll need it for proxy requests.\n\n2. 
Send a chat completion with the user key:\n\n```bash\ncurl http://localhost:8000/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -H \"Authorization: Bearer sk-\u003cfull-key-from-step-1\u003e\" \\\n  -d '{\n    \"model\": \"Qwen3.5-2B\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n    \"stream\": true\n  }'\n```\n\n## Configuration\n\nThe proxy reads two files: `config.yaml` for routing and `.env` for runtime settings.\n\n### `config.yaml` — servers, models, aliases\n\nLoaded into SQLite at startup, persisted across restarts:\n\n```yaml\nservers:\n  - name: my-server\n    url: http://host:port\n    api_key: \"\"              # optional, if llama-server requires auth\n    models:\n      - model-name.gguf\n\naliases:\n  alias: model-name.gguf     # short name -\u003e real model name\n```\n\n### Environment variables\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `ADMIN_KEY` | required | Bearer token for `/admin/*` endpoints |\n| `PROXY_HOST` | `0.0.0.0` | Listen address |\n| `PROXY_PORT` | `8000` | Listen port |\n| `DB_PATH` | `data/proxy.db` | SQLite database location |\n| `CONFIG_PATH` | `./config.yaml` | Path to config file |\n\nFor `pip install`, set them in shell or via a `.env`-like loader. 
For Docker Compose, put them in `.env`:\n\n```bash\nADMIN_KEY=secret\nPROXY_PORT=8000\n```\n\n### Docker volumes\n\nThe Compose setup mounts two volumes:\n\n- `db-data:/data` — SQLite database, persists across container restarts\n- `./config.yaml:/config/config.yaml:ro` — config file, read-only\n\n\n## Admin API\n\nAll admin endpoints require `Authorization: Bearer \u003cADMIN_KEY\u003e` header.\n\n| Endpoint | Method | Description |\n|----------|--------|-------------|\n| `/admin/servers` | `GET` | List all registered servers |\n| `/admin/servers` | `POST` | Register a llama-server |\n| `/admin/servers/{id}` | `DELETE` / `PATCH` | Remove or update server |\n| `/admin/servers/{id}/models` | `POST` / `DELETE` | Assign/unassign model name |\n| `/admin/keys` | `GET` | List all API keys |\n| `/admin/keys` | `POST` | Create user key |\n| `/admin/keys/{key_id}` | `DELETE` | Revoke key (by integer id) |\n| `/admin/keys/{key_id}/toggle` | `PATCH` | Activate/deactivate key (by integer id) |\n| `/admin/aliases` | `GET` / `POST` | List or create model aliases |\n| `/admin/aliases/{alias_name}` | `DELETE` | Delete alias |\n| `/admin/usage` | `GET` | View token usage logs |\n\n**Note:** Key operations (`DELETE`, `PATCH /toggle`) use integer `key_id` from the database, not the API key string itself.\n\n## Proxy Endpoints\n\nThese forward to llama-server backends based on model name routing.\n\n| Endpoint | Method | Description |\n|----------|--------|-------------|\n| `/v1/chat/completions` | `POST` | Chat completions (streaming + non-streaming) |\n| `/v1/completions` | `POST` | Legacy completions |\n| `/v1/embeddings` | `POST` | Embeddings |\n| `/v1/models` | `GET` | List available models (no auth required) |\n| `/health` | `GET` | Health check (no auth required) |\n\n## Usage Logs\n\nEach request logs: user, server, model name, prompt/completion tokens, timings (ms), and total tokens. 
No conversation content is stored.\n\n```bash\ncurl \"http://localhost:8000/admin/usage?key_id=1\" \\\n  -H \"Authorization: Bearer $ADMIN_KEY\"\n```\n\n## Benchmarking\n\nProxy overhead was measured with Locust using **parallel concurrent execution**: both benchmarks (direct and proxy) hit the same backend simultaneously for a fair comparison.\n\nAll requests were authenticated, routed, and logged via SQLite on every call (cold cache mode). The in-memory cache was disabled to measure worst-case per-request overhead.\n\n**Hardware:** i9-14900K, RTX 4090 | **Model:** Qwen3.5-2B-GGUF | **Backend:** llama.cpp server\n\n### Mock backend (fixed 100ms delay, clean conditions)\n\n| Mode | Users | Direct P50 | Proxy P50 | Overhead P50 | Direct P95 | Proxy P95 | Overhead P95 | Direct P99 | Proxy P99 | Overhead P99 | Direct Mean | Proxy Mean | Overhead Mean | Direct RPS | Proxy RPS | RPS overhead |\n|------|-------|-----------|-----------|-------------|-----------|-----------|-------------|-----------|----------|-------------|------------|--------------|--------------|-----------|--------------|-------------|\n| Low | 5+5 | 100ms | 100ms | +0ms | 100ms | 100ms | +0ms | 110ms | 130ms | +20ms | 101ms | 103ms | +2ms | 49.3 | 48.4 | -0.9 |\n| Medium | 20+20 | 100ms | 100ms | +0ms | 100ms | 110ms | +10ms | 110ms | 130ms | +20ms | 101ms | 104ms | +2ms | 194.8 | 190.9 | -3.9 |\n| High | 100+100 | 100ms | 110ms | +10ms | 100ms | 120ms | +20ms | 110ms | 130ms | +20ms | 102ms | 107ms | +4ms | 896.2 | 860.3 | -35.9 |\n\n### Real llama-server backend\n\n| Mode | Users | Direct P50 | Proxy P50 | Overhead P50 | Direct P95 | Proxy P95 | Overhead P95 | Direct P99 | Proxy P99 | Overhead P99 | Direct Mean | Proxy Mean | Overhead Mean | Direct RPS | Proxy RPS | RPS overhead |\n|------|-------|-----------|-----------|-------------|-----------|-----------|-------------|-----------|----------|-------------|------------|--------------|--------------|-----------|--------------|-------------|\n| Low | 
5+5 | 570ms | 570ms | +0ms | 940ms | 930ms | -10ms | ~1200ms | ~1100ms | ~0ms | 605ms | 604ms | -1ms | 8.2 | 8.2 | 0.0 |\n| Medium | 20+20 | ~2500ms | ~2500ms | ~0ms | ~2900ms | ~2900ms | ~0ms | ~14000ms | ~14000ms | ~0ms | ~2413ms | ~2460ms | +47ms | 8.0 | 7.9 | -0.1 |\n| High | 100+100 | 12000ms | 12000ms | ~0ms | 13000ms | 13000ms | ~0ms | 13000ms | 13000ms | ~0ms | 10073ms | 10058ms | -15ms | 8.1 | 8.1 | -0.0 |\n\nProxy overhead under clean conditions (mock backend): **~2–4ms** mean and **+20ms** P99 across all load levels with 4 uvicorn workers. Against the real backend the overhead is **negligible**: latency is identical within measurement noise (~1s variance at the tail).\n\nRun your own benchmarks: `python tests/benchmark/run.py [low|medium|high]` (add `--mock` for the fixed-delay backend).\n\n### Memory footprint\n\n| Workers | Idle    | Under load | Growth |\n|---------|---------|------------|--------|\n| 1       | 53 MB   | 62 MB      | +9 MB  |\n| 4       | 252 MB  | 273 MB     | +21 MB |\n\nPer-worker baseline: **~53 MB**, load growth: **+4–6 MB** per worker.  \nThe footprint is identical against mock and real backends, because the proxy forwards without buffering responses. No memory growth was observed across extended runs.\n\n### Caveat\n\nThe aggregate overhead numbers (+2–4ms mean, +20ms P99 on mock) include asyncio event loop contention at high concurrency (100+ concurrent users per worker). The per-request proxy logic itself takes **~0.18ms**; the difference comes from how asyncio handles many simultaneous awaits on a single thread. With 4 uvicorn workers, each worker handles ~25 requests, keeping contention minimal.\n\nReal backend P99 spikes (Medium mode, ~14s) are caused by the llama.cpp single-thread inference bottleneck under 20 concurrent users, not by proxy overhead. The proxy adds negligible latency regardless of backend saturation.\n\n## How it compares\n\nsmol-llm-proxy is built for one specific case: **multiple llama-server instances, multiple users, per-user token accounting**. 
\n\n- **[LiteLLM](https://github.com/BerriAI/litellm)** — much broader scope: 100+ cloud providers, virtual keys, budgets, admin UI, fallbacks. Requires Postgres + Redis for full features. Use it if you need a production gateway across cloud LLMs.\n- **[llama-swap](https://github.com/mostlygeek/llama-swap)** — solves a different problem: hot-swapping models on one llama.cpp instance. No users, no accounting. Use it if you run many models on one machine and want them loaded on demand.\n- **[llama.cpp router mode](https://github.com/ggerganov/llama.cpp)** — built into llama-server itself. Same scope as llama-swap, no auth layer.\n\nIf you self-host several llama-server instances on one or more machines and want to share them with a small group while tracking usage, smol-llm-proxy is the smallest thing that does that. Otherwise, one of the above is probably a better fit.\n\n## Architecture\n\n```\n[users] ──HTTPS──\u003e [proxy :port] ──HTTP──\u003e [llama-server 1 :port]\n                        │                  [llama-server 2 :port]\n                        │                  [llama-server N :port]\n                        │\n                        ├── in-memory cache (keys, aliases, routes) — TTL 30s\n                        ├── validate API key + resolve routing (SQLite on first call, then cache)\n                        ├── forward request via connection-pooled httpx client\n                        └── async log tokens + timings (background worker, no blocking)\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobolamp%2Fsmol-llm-proxy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frobolamp%2Fsmol-llm-proxy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frobolamp%2Fsmol-llm-proxy/lists"}