{"id":49074245,"url":"https://github.com/mrmushfiq/llm0-gateway","last_synced_at":"2026-04-20T09:06:07.031Z","repository":{"id":352560628,"uuid":"1209177078","full_name":"mrmushfiq/llm0-gateway","owner":"mrmushfiq","description":"Self-hosted, OpenAI-compatible LLM gateway in a single Go binary. Routes to OpenAI, Anthropic, Gemini, and local Ollama with failover, two-tier caching (exact + semantic), per-key rate limits, and per-customer spend caps.","archived":false,"fork":false,"pushed_at":"2026-04-20T05:27:18.000Z","size":172,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-20T07:28:57.119Z","etag":null,"topics":["ai-gate","ai-infrastructure","anthropic","chatgpt","claude","gemini","golang","gpt","llm","llm-gateway","openai","openai-compatible","pgvector","postgres","rate-limiting","redis","self-hosted","semantic-cache"],"latest_commit_sha":null,"homepage":"http://llm0.ai","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mrmushfiq.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-13T07:03:20.000Z","updated_at":"2026-04-20T05:27:22.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mrmushfiq/llm0-gateway","commit_stats":null,"previous_names":["mrmushfiq/llm0-gateway"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/mrmushfiq/llm0-gateway","repository_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrmushfiq%2Fllm0-gateway","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrmushfiq%2Fllm0-gateway/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrmushfiq%2Fllm0-gateway/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrmushfiq%2Fllm0-gateway/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mrmushfiq","download_url":"https://codeload.github.com/mrmushfiq/llm0-gateway/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mrmushfiq%2Fllm0-gateway/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32040366,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T00:18:06.643Z","status":"online","status_checked_at":"2026-04-20T02:00:06.527Z","response_time":94,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-gate","ai-infrastructure","anthropic","chatgpt","claude","gemini","golang","gpt","llm","llm-gateway","openai","openai-compatible","pgvector","postgres","rate-limiting","redis","self-hosted","semantic-cache"],"created_at":"2026-04-20T09:06:03.929Z","updated_at":"2026-04-20T09:06:07.022Z","avatar_url":"https://github.com/mrmushfiq.png","language":"Go","funding_links":[],"categories":["Quick Comparison"],"sub_categories":[],"readme":"# LLM0 
Gateway\n\n[![Go](https://img.shields.io/badge/Go-1.24-blue?logo=go)](https://go.dev)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)\n[![Docker](https://img.shields.io/badge/Docker-Compose-blue?logo=docker)](docker-compose.yml)\n[![OpenAI Compatible](https://img.shields.io/badge/API-OpenAI_Compatible-412991)](https://platform.openai.com/docs/api-reference)\n\nA production-grade, self-hosted LLM gateway written in Go. One **OpenAI-compatible** API endpoint for **OpenAI**, **Anthropic**, **Google Gemini**, and **local Ollama models** — with configurable cloud/local failover, two-tier caching, streaming, per-key rate limiting, per-customer spend caps, and cost tracking out of the box.\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer llm0_live_...\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"gpt-4o-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}]}'\n```\n\nSwitch `gpt-4o-mini` for `claude-haiku-4-5-20251001`, `gemini-2.0-flash`, or any local Ollama model (`llama3.3`, `qwen2.5`, `gemma3`, …) — same endpoint, no code changes in your application.\n\n### At a glance\n\n| | |\n|---|---|\n| **Cache-hit p50 / p99** | **11 ms / 16 ms** ([how it's measured](#performance)) |\n| **Rate-limit rejection p50** | **2 ms** — fast-fail protects the gateway from abuse bursts |\n| **Throughput** | **~1,480 req/sec** sustained on a single MacBook Air core |\n| **Semantic caching** | `pgvector` + `all-MiniLM-L6-v2` — catches paraphrased duplicates at `$0` |\n| **Binary size / memory** | **30 MB** Go binary, ~50 MB RSS under load |\n| **Dependencies** | Postgres + Redis (+ optional bundled embedding service). That's it. |\n\n\u003e **Faster than LiteLLM, Portkey, and most of the commercial alternatives** — while shipping as a single self-hosted Go binary. 
See [full benchmark \u0026 methodology](#performance).\n\n---\n\n## Why LLM0 Gateway?\n\n- **One endpoint, four backends** — swap between OpenAI, Anthropic, Gemini, and local Ollama models without touching client code.\n- **Local-first or cloud-first, your choice** — a single `FAILOVER_MODE` env var decides whether requests try Ollama first, cloud first, local-only, or cloud-only. Great for privacy-sensitive workloads that need cloud as a backup.\n- **Never get paged for a provider outage** — automatic failover across providers on `429`, `5xx`, `401`/`403`, `404`, timeouts, and connection errors. For non-streaming requests, clients never see the failure.\n- **Save real money with two-tier caching** — exact-match cache returns in `\u003c1ms`; an optional **semantic cache** (`pgvector` + `all-MiniLM-L6-v2` embeddings) catches paraphrased duplicates so \"What's the capital of France?\" and \"Tell me France's capital city\" share one cached answer. Local Ollama calls cost `$0`.\n- **Built-in SaaS controls** — per-API-key rate limits, per-customer spend caps, hard monthly project caps, customer labels for analytics.\n- **Zero lock-in** — single Go binary, standard Postgres + Redis, open source.\n\n---\n\n## Features\n\n### Multi-Provider Routing\nRoute to **OpenAI**, **Anthropic**, **Google Gemini**, and **Ollama** (local models) through a single OpenAI-compatible API. 
The gateway detects the correct provider from the model name automatically and exposes a standard `GET /v1/models` endpoint for SDK discovery.\n\n### Configurable Failover Modes\nSet `FAILOVER_MODE` to control how cloud and local providers are ordered in the failover chain:\n\n| Mode | Behavior | Typical use case |\n|---|---|---|\n| `cloud_first` *(default)* | Cloud providers first, Ollama as last-resort fallback | Production, best quality + cost reduction |\n| `local_first` | Ollama first, cloud as fallback when local fails | Privacy-first apps, air-gapped + cloud-capable |\n| `local_only` | Never contact cloud APIs | Offline, compliance, dev without API keys |\n| `cloud_only` | Never use local models (even if configured) | Pure cloud deployments |\n\n### Automatic Cross-Provider Failover\nWhen a provider returns `429`, `5xx`, `401`/`403`, `404`, a timeout, or a connection failure, the gateway transparently retries the next provider in the chain — without the caller knowing. Preset chains are defined for all major models.\n\n```\ngpt-4o-mini  →  OpenAI (primary)\n             →  Anthropic claude-haiku-4-5\n             →  Google gemini-2.5-flash\n             →  Ollama qwen2.5:14b   (if OLLAMA_BASE_URL is set)\n```\n\nResponse headers `X-Failover: true` and `X-Original-Provider` tell you when a failover happened.\n\n### Local Ollama Support\nPoint the gateway at a running Ollama instance (`OLLAMA_BASE_URL=http://host.docker.internal:11434/v1`) and:\n- All pulled Ollama models become routable through `/v1/chat/completions`\n- They appear automatically in `GET /v1/models`\n- Streaming works identically to cloud providers\n- Cost is always `$0` — skipped in spend checks and logs\n- Tier mapping (`OLLAMA_MODEL_FLAGSHIP`, `_BALANCED`, `_BUDGET`) transparently substitutes local models for cloud equivalents during failover\n\n### Two-Tier Caching — Exact + Semantic\n\nThe gateway ships two independent cache layers that stack together to cut LLM spend 
dramatically:\n\n**1. Exact-match cache** — SHA-256 over `(project_id, model, provider, messages)`. Checked in Redis (`\u003c1ms`) first, falls through to Postgres (`~5ms`) on restart / Redis eviction. Identical requests **never hit the LLM twice**.\n\n**2. Semantic cache** — for when users ask the same thing differently. The first user message is sent to a bundled embedding service, which returns a 384-dim vector. That vector is compared against cached vectors in Postgres using `pgvector` cosine similarity. If the best match exceeds a configurable threshold (default `0.95`), we return that cached response.\n\n```\nUser A: \"What's the capital of France?\"         → cache miss, calls OpenAI\nUser B: \"Tell me France's capital city\"         → semantic hit (0.97) → $0 instant response\nUser C: \"france capital?\"                       → semantic hit (0.96) → $0 instant response\n```\n\nBoth caches are toggleable per-API-key (`cache_enabled`, `semantic_cache_enabled`) and per-project (`semantic_threshold`). 
When a semantic hit occurs you get:\n\n- `X-Cache-Hit: semantic`\n- `X-Cache-Similarity: 0.973`\n- `similarity_score` column populated in `gateway_logs` for offline analysis\n\n### Embedding Service (bundled)\n\nSemantic caching is powered by a small FastAPI service shipped alongside the gateway in `embedding_service/`:\n\n- **Model**: [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) — 22M params, 384-dim output, runs on CPU\n- **Runtime**: ~80–150 MB RAM, ~20–40 ms per embedding on a modern CPU\n- **Deployment**: included in `docker-compose.yml` as the `embedding` service; model weights are baked into the image at build time so the first request does not wait for a model download\n- **Optional**: skip the service entirely and semantic caching disables gracefully — exact-match caching still works\n- **Swappable**: implements a simple `POST /embed` contract, so you can point the gateway at any HTTP embedder (BGE, E5, OpenAI `text-embedding-3-small`, self-hosted Instructor) by changing `EMBEDDING_SERVICE_URL`\n\nThe architecture is deliberate: keeping embeddings in a separate process means you can scale the embedding service independently, swap in a different model without rebuilding the gateway, or point at a GPU-backed embedder for throughput.\n\n### Streaming (SSE)\nFull Server-Sent Events support for **all four providers** (OpenAI, Anthropic, Gemini, Ollama). 
Chunks are normalized to a single OpenAI-compatible `chat.completion.chunk` shape regardless of which provider is upstream, so the same client code works against any backend.\n\nSend `\"stream\": true` to get a stream instead of a blocking JSON response:\n\n```bash\ncurl -N http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer llm0_live_...\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"gpt-4o-mini\",\n    \"messages\": [{\"role\":\"user\",\"content\":\"count to 10 slowly\"}],\n    \"stream\": true\n  }'\n```\n\nThe response starts with standard OpenAI chunks, ends with a **metadata frame** carrying cost / usage / latency (so you don't need a second call to know what the request cost), and terminates with `[DONE]`:\n\n```\ndata: {\"id\":\"chatcmpl-...\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"delta\":{\"content\":\"Sure\"}}],...}\ndata: {\"id\":\"chatcmpl-...\",\"object\":\"chat.completion.chunk\",\"choices\":[{\"delta\":{\"content\":\"!\"}}],...}\n...\ndata: {\"object\":\"chat.completion.chunk.metadata\",\"usage\":{\"prompt_tokens\":5,\"completion_tokens\":38,\"total_tokens\":43},\"cost_usd\":0.0000236,\"latency_ms\":1962,\"provider\":\"openai\"}\ndata: [DONE]\n```\n\n**Streaming behavior notes:**\n- **Cache hits return a single JSON body, not a stream.** The response is already complete — there's nothing to stream — so you get the cached payload with `X-Cache-Hit: exact` or `semantic` set. Treat `Content-Type: application/json` in response to a stream request as \"this was a cache hit.\" This matches OpenAI's own caching semantics.\n- **Failover is disabled for streaming requests.** Once a single chunk has been written to the client, we can't retry against a different provider without breaking the stream. Non-streaming requests keep full automatic failover. 
If provider reliability matters more than streaming UX, set `\"stream\": false`.\n- **No client-side timeout issues.** The gateway disables the server's 60-second `WriteTimeout` on streaming requests only, so long reasoning outputs (o1, Claude extended thinking) and slow local Ollama generations aren't truncated.\n- **Post-stream caching runs in a background goroutine** after `[DONE]`, so the second identical request returns from cache with the full metadata and no LLM call.\n\n### Token Bucket Rate Limiting (per API key)\nEach API key has its own `rate_limit_per_minute` enforced atomically in Redis via Lua scripts — no race conditions under high concurrency. Uses a full token bucket algorithm (not a naive counter), so burst traffic within the minute is allowed as long as the per-minute rate isn't breached.\n\nResponse headers included on every call:\n- `X-RateLimit-Limit`\n- `X-RateLimit-Remaining`\n- `X-RateLimit-Reset` (Unix timestamp)\n\nWhen the limit is exceeded, the gateway returns `429` with a `retry_after` field.\n\n### Per-Customer Spend Caps\nPass `X-Customer-ID` on any request to enable per-end-user daily and monthly USD spend limits. Limits are stored in the `customer_limits` table and support two overflow behaviors:\n- `block` — return `429` with spend details and how much longer until reset\n- `downgrade` — automatically route to a cheaper model (e.g. `gpt-4o` → `gpt-4o-mini`)\n\nCustomer labels (`X-LLM0-Tier: pro`, `X-LLM0-Team: billing`, …) are stored as JSONB on every request log for downstream analytics.\n\n### Hard Project Spend Cap\nSet `monthly_cap_usd` on a project and requests are blocked with `402 Payment Required` once the cap is hit. Checked **before** the LLM call using cost estimation, so runaway prompts can't silently exceed the cap.\n\n### Cost Tracking\nPre-request cost estimation (for spend cap checks) plus post-request reconciliation based on actual token usage. 
Costs are pulled from the `model_pricing` table and stored per request. Local Ollama calls are always `$0`.\n\n### Request Logging\nEvery request is logged to `gateway_logs` with: provider, model, tokens, cost, latency, cache status (exact/semantic/miss), similarity score, failover info, customer ID, and arbitrary labels.\n\n### Background Workers\nThe workers run in-process as Go goroutines, with no separate cron container.\n- **Monthly spend reset** — zeroes `projects.current_month_spend_usd` at 00:00 UTC on the 1st; catches up on missed resets after downtime\n- **Exact cache cleanup** — hourly prune of expired `exact_cache` rows\n- **Semantic cache cleanup** — daily at 02:00 UTC, prunes `semantic_cache` rows past their per-row TTL\n- **Log maintenance** — weekly `gateway_logs` retention cleanup (Sunday 03:00 UTC)\n- **Spend reconciliation** — hourly drift check between Redis counters and Postgres\n\nEach run that does work writes an audit row to `system_logs`. In multi-replica deployments, set `DISABLE_BACKGROUND_WORKERS=true` on every replica except one so that maintenance runs only once; spend enforcement is Redis-authoritative and unaffected. See [Background Worker Schedule](#background-worker-schedule) for the full cadence table and operational notes, and [How Spend Caps Reset](#how-spend-caps-reset) for how these jobs tie into cap enforcement.\n\n---\n\n## Supported Models\n\nPricing ships pre-seeded in [`schema/seed_models.sql`](schema/seed_models.sql) and can be extended at runtime via [`scripts/manage_models.sh`](scripts/manage_models.sh) — no code changes or redeploy required. 
New models from any cloud provider are auto-routable as soon as they're added to the pricing table (see [Dynamic Model Routing](#managing-model-pricing)).\n\n### OpenAI\n| Model | Tier | Context | Input $/1K | Output $/1K |\n|---|---|---:|---:|---:|\n| `gpt-5.4` | Flagship | 1M | $0.0025 | $0.0150 |\n| `gpt-5.4-mini` | Balanced | 1M | $0.00025 | $0.0020 |\n| `gpt-5.4-nano` | Budget | 1M | $0.0001 | $0.0008 |\n| `gpt-4o` | Flagship (prev-gen) | 128K | $0.0025 | $0.0100 |\n| `gpt-4o-mini` | Cost-optimized | 128K | $0.00015 | $0.0006 |\n| `gpt-4-turbo` | Legacy flagship | 128K | $0.0100 | $0.0300 |\n| `gpt-3.5-turbo` | Budget | 16K | $0.0005 | $0.0015 |\n\n### Anthropic\n| Model | Tier | Context | Input $/1K | Output $/1K |\n|---|---|---:|---:|---:|\n| `claude-opus-4-7` | Flagship | 200K | $0.0050 | $0.0250 |\n| `claude-opus-4-6` | Most capable | 200K | $0.0150 | $0.0750 |\n| `claude-sonnet-4-6` | Balanced | 200K | $0.0030 | $0.0150 |\n| `claude-opus-4-5-20251101` | Most capable (dated) | 200K | $0.0150 | $0.0750 |\n| `claude-sonnet-4-5-20250929` | Balanced (dated) | 200K | $0.0030 | $0.0150 |\n| `claude-haiku-4-5-20251001` | Cost-optimized | 200K | $0.0008 | $0.0040 |\n| `claude-sonnet-4-20250514` | Balanced (legacy) | 200K | $0.0030 | $0.0150 |\n| `claude-3-haiku-20240307` | Budget | 200K | $0.00025 | $0.00125 |\n\n### Google Gemini\n| Model | Tier | Context | Input $/1K | Output $/1K |\n|---|---|---:|---:|---:|\n| `gemini-2.5-pro` | Most capable | 2M | $0.00125 | $0.0100 |\n| `gemini-2.5-flash` | Balanced | 1M | $0.0001 | $0.0004 |\n| `gemini-2.0-flash` | Cost-optimized | 1M | $0.0001 | $0.0004 |\n| `gemini-2.0-flash-lite` | Budget | 1M | $0.000075 | $0.00030 |\n\n\u003e Any new model you add to `model_pricing` is **automatically routable** — the provider is selected by name prefix (`gpt-*` → OpenAI, `claude-*` → Anthropic, `gemini-*` → Google). 
No code change or redeploy required when a provider ships a new model.\n\n### Ollama (local)\nAny model pulled on your Ollama instance is automatically routable — `llama3.3:70b`, `qwen2.5:14b`, `gemma3:4b`, `mistral`, `deepseek-r1`, etc. Pull models with `ollama pull \u003cmodel\u003e` and they appear in `GET /v1/models` instantly. All Ollama requests are metered at **$0 cost**.\n\nThe tier env vars (`OLLAMA_MODEL_FLAGSHIP`, `OLLAMA_MODEL_BALANCED`, `OLLAMA_MODEL_BUDGET`) tell the failover engine which local model to substitute when a cloud model is requested. For example, with `OLLAMA_MODEL_BALANCED=qwen2.5:14b` set, a `gpt-4o-mini` request in `local_first` mode tries `qwen2.5:14b` first, then `gpt-4o-mini` on OpenAI if the local call fails.\n\n---\n\n## Quick Start\n\n### Option A — Docker Compose (recommended)\n\nRequires: [Docker Desktop](https://www.docker.com/products/docker-desktop/).\n\n**Step 1 — Clone and configure**\n\n```bash\ngit clone https://github.com/mrmushfiq/llm0-gateway\ncd llm0-gateway\n\ncp .env.example .env\n```\n\nOpen `.env` and add at least one provider API key:\n\n```env\nOPENAI_API_KEY=sk-proj-...\nANTHROPIC_API_KEY=sk-ant-...\nGEMINI_API_KEY=AIza...\n```\n\n**Step 2 — Build the images**\n\n```bash\ndocker compose build\n```\n\n\u003e **This takes 3–5 minutes on first run.** The embedding service downloads and bakes the `all-MiniLM-L6-v2` model weights (~90MB) into the image at build time so startup is instant afterwards. Subsequent builds use the Docker layer cache and complete in seconds.\n\n**Step 3 — Start all services**\n\n```bash\ndocker compose up\n```\n\nPostgres (with `pgvector`), Redis, the embedding service, and the gateway all start together. The database schema is applied automatically on first boot. 
When you see:\n\n```\nllm0_gateway  | ✅ Failover executor initialized with 3 providers\nllm0_gateway  | ✅ Semantic cache initialized\nllm0_gateway  | 🚀 LLM0 Gateway listening on :8080\n```\n\nthe gateway is ready.\n\n**Step 4 — Create an API key**\n\n```bash\n./scripts/create_api_key.sh\n```\n\n**Step 5 — Send your first request**\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer llm0_live_...\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"gpt-4o-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Say hello!\"}]}'\n```\n\n**Useful Docker commands**\n\n```bash\n# Run in background\ndocker compose up -d\n\n# View gateway logs\ndocker compose logs -f gateway\n\n# Stop everything\ndocker compose down\n\n# Stop and wipe all data (full reset)\ndocker compose down -v\n\n# Restart just the gateway (e.g. after editing .env)\ndocker compose up -d gateway\n```\n\n**Step 6 — (Optional) Add local Ollama models**\n\nIf you're running [Ollama](https://ollama.com) on your host machine, point the gateway at it for local, zero-cost inference with cloud failover:\n\n```env\n# In .env\nOLLAMA_BASE_URL=http://host.docker.internal:11434/v1\nFAILOVER_MODE=local_first\n\n# Map local models to tiers (match whatever you've pulled)\nOLLAMA_MODEL_FLAGSHIP=llama3.3:70b\nOLLAMA_MODEL_BALANCED=qwen2.5:14b\nOLLAMA_MODEL_BUDGET=gemma3:4b\n```\n\nThen restart the gateway and test:\n\n```bash\ndocker compose up -d --force-recreate gateway\n\n# Request a cloud model — gets served by Ollama first, cloud as fallback\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer llm0_live_...\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"gpt-4o-mini\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}]}'\n\n# List everything the gateway can route (cloud + local)\ncurl http://localhost:8080/v1/models \\\n  -H \"Authorization: Bearer llm0_live_...\"\n```\n\nThe `X-Provider` response header 
shows which backend actually served the request.\n\n### Option B — Run with Go\n\nRequires: Go 1.24+, Postgres with the `pgvector` extension, Redis.\n\n**Step 1 — Clone and configure**\n\n```bash\ngit clone https://github.com/mrmushfiq/llm0-gateway\ncd llm0-gateway\n\ncp .env.example .env\n# Edit .env — set DATABASE_URL, REDIS_URL, and at least one provider key\n```\n\n**Step 2 — Apply the database schema**\n\n```bash\npsql $DATABASE_URL -f schema/schema.sql\n```\n\n**Step 3 — (Optional) Start the embedding service for semantic caching**\n\n```bash\ncd embedding_service\npip install -r requirements.txt\nuvicorn app:app --host 0.0.0.0 --port 8001\n```\n\nThen set `EMBEDDING_SERVICE_URL=http://localhost:8001` in your `.env`. Skip this step to run without semantic caching — exact-match caching still works.\n\n**Step 4 — Run the gateway**\n\n```bash\ngo run ./cmd/gateway/main.go\n```\n\nOr build a binary:\n\n```bash\ngo build -o llm0-gateway ./cmd/gateway/main.go\n./llm0-gateway\n```\n\n---\n\n## Managing Model Pricing\n\n### How the default list is seeded\n\nThe gateway ships with a curated set of model prices in [`schema/seed_models.sql`](schema/seed_models.sql). It's applied automatically on **first** Postgres boot via the `docker-entrypoint-initdb.d/` mount.\n\n- **Docker Compose users (fresh install)** — no action needed. Works out of the box.\n- **Docker Compose users (existing install)** — Postgres only runs initdb scripts on an empty data volume, so an upgraded `seed_models.sql` won't auto-apply. 
Re-run it manually against your live DB (safe — idempotent):\n  ```bash\n  docker compose exec -T postgres psql -U llm0 -d llm0_gateway \\\n    -f /docker-entrypoint-initdb.d/02_seed_models.sql\n  ```\n- **Non-Docker / manual Postgres** — after applying `schema/schema.sql`, also run:\n  ```bash\n  psql $DATABASE_URL -f schema/seed_models.sql\n  ```\n\nThe seed uses `ON CONFLICT (provider, model) DO NOTHING`, so it's safe to re-run and will never overwrite rows you've managed manually.\n\n\u003e **Want stricter schema versioning?** The project ships a single `schema.sql` + `seed_models.sql` pair for simplicity. If your team prefers versioned, reversible migrations, drop in [`golang-migrate`](https://github.com/golang-migrate/migrate) (classic up/down SQL files) or [Atlas](https://atlasgo.io/) (declarative, diff-based) — both integrate cleanly without changing application code.\n\n### Adding / updating / removing entries\n\nModel prices live in the `model_pricing` table. Use the bundled interactive script to add, update, or delete entries when providers release new models or change prices:\n\n```bash\n./scripts/manage_models.sh           # interactive menu\n./scripts/manage_models.sh list      # list all models\n./scripts/manage_models.sh add       # add a new model\n./scripts/manage_models.sh update    # update pricing for an existing model\n./scripts/manage_models.sh delete    # remove a model\n```\n\nAfter any change, restart the gateway to reload the pricing cache:\n\n```bash\ndocker compose restart gateway\n```\n\nPrices are specified per 1,000 tokens in USD (e.g. `gpt-4o-mini` input is `0.00015`). Ollama models can be added with `0.00000000` prices to make their cost explicit in request logs.\n\n### Keeping pricing current\n\nProvider pricing drifts — new models launch, old ones get cheaper, and context windows change. 
Here's the policy:\n\n| Situation | What to do |\n|---|---|\n| New model released upstream | Add it with `./scripts/manage_models.sh add` — no code change needed. Cloud providers are routed by prefix (`gpt-*`, `claude-*`, `gemini-*`), so new models work immediately. |\n| Want the fix to persist across fresh installs | Submit a PR updating [`schema/seed_models.sql`](schema/seed_models.sql). That single file is the canonical source of truth. |\n| Pricing changed on an existing model | `./scripts/manage_models.sh update` locally; PR the seed file for the upstream fix. |\n| Running a fleet of gateways | Roll out the updated `seed_models.sql` and apply it once per database (`psql ... -f seed_models.sql`). It's idempotent, so re-running is safe. |\n\nWe intentionally **do not** auto-scrape provider pricing pages: those pages are unstable, ToS-ambiguous, and silently reformat. Community-reviewed PRs against `seed_models.sql` are the safest long-term update channel — the same approach LiteLLM uses.\n\n---\n\n## Creating Your First API Key\n\nAPI keys are in the format `llm0_live_\u003c64 hex chars\u003e`. 
Only the `bcrypt(SHA-256(key))` hash is stored — the raw key is shown once.\n\nThe script requires Docker Compose to be running (uses `pgcrypto` inside Postgres — no host dependencies needed):\n\n```bash\n./scripts/create_api_key.sh\n```\n\nExample output:\n\n```\n════════════════════════════════════════════════\n  LLM0 Gateway — Create API Key\n════════════════════════════════════════════════\n\n▶  Generated key (save this — shown only once):\n\n   llm0_live_c0244eec5b7a8426a6a96b5f9748efa8...\n\n▶  Bcrypt hash generated (via pgcrypto)\n▶  Project ID : 54ce26a8-2f93-4afd-924d-28a8832ea52e\n▶  Key prefix : llm0_live_c0244...\n\n  Test it:\n\n  curl http://localhost:8080/v1/chat/completions \\\n    -H \"Authorization: Bearer llm0_live_c0244...\" \\\n    ...\n════════════════════════════════════════════════\n```\n\n---\n\n## Configuration\n\nAll configuration is via environment variables. Copy `.env.example` to `.env`.\n\n### Required\n\n| Variable | Description |\n|---|---|\n| `DATABASE_URL` | Postgres connection string (must have `pgvector` extension) |\n| `REDIS_URL` | Redis connection string |\n| At least one of: `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `GEMINI_API_KEY`, `OLLAMA_BASE_URL` | The gateway routes to whichever providers have keys set |\n\n### Cloud Providers\n\n| Variable | Default | Description |\n|---|---|---|\n| `OPENAI_API_KEY` | — | OpenAI API key |\n| `ANTHROPIC_API_KEY` | — | Anthropic API key |\n| `GEMINI_API_KEY` | — | Google Gemini API key |\n\n### Local Models (Ollama)\n\n| Variable | Default | Description |\n|---|---|---|\n| `OLLAMA_BASE_URL` | `\"\"` | Set to enable local models. In Docker: `http://host.docker.internal:11434/v1`. 
Native: `http://localhost:11434/v1` |\n| `OLLAMA_MODEL_FLAGSHIP` | `llama3.3:70b` | Local model used as substitute for flagship-tier cloud models (gpt-4o, claude-opus, gemini-pro) |\n| `OLLAMA_MODEL_BALANCED` | `qwen2.5:14b` | Local model used as substitute for balanced-tier cloud models (gpt-4o-mini, claude-sonnet, gemini-flash) |\n| `OLLAMA_MODEL_BUDGET` | `gemma3:4b` | Local model used as substitute for budget-tier cloud models (gpt-3.5, claude-haiku, gemini-flash-lite) |\n\n### Failover\n\n| Variable | Default | Description |\n|---|---|---|\n| `FAILOVER_MODE` | `cloud_first` | One of `cloud_first`, `local_first`, `local_only`, `cloud_only`. See [Failover Modes](#configurable-failover-modes) above |\n\n### Server \u0026 Infrastructure\n\n| Variable | Default | Description |\n|---|---|---|\n| `PORT` | `8080` | Gateway listen port |\n| `ENVIRONMENT` | `local` | `local` or `production` (switches Gin to release mode) |\n| `CACHE_TTL_SECONDS` | `3600` | Exact-match cache TTL in seconds |\n| `EMBEDDING_SERVICE_URL` | `\"\"` | Enables semantic caching when set. Docker Compose sets this automatically |\n| `REQUEST_TIMEOUT` | `30s` | Upstream request timeout |\n| `MAX_CONCURRENT_REQUESTS` | `10000` | Concurrency ceiling for the HTTP server |\n| `DISABLE_BACKGROUND_WORKERS` | `false` | Skip starting scheduled goroutines (monthly spend reset, cache/log cleanup, reconciliation). 
Useful in multi-replica deployments where only one replica should run maintenance |\n\n### TLS (optional)\n\n| Variable | Default | Description |\n|---|---|---|\n| `TLS_ENABLED` | `false` | Enable TLS 1.3 |\n| `TLS_CERT_FILE` | — | Path to certificate |\n| `TLS_KEY_FILE` | — | Path to private key |\n\n---\n\n## How It Works\n\n### Request Pipeline\n\n```\nIncoming Request\n        │\n        ▼\n  Auth Middleware          validate Bearer token (bcrypt verify, Redis-cached)\n        │\n        ▼\n  Rate Limit Check         token bucket per API key via atomic Redis Lua script\n        │\n        ▼\n  Spend Cap Check          block if project monthly_cap_usd exceeded\n        │\n        ▼\n  Exact-Match Cache        SHA-256 key → Redis (\u003c1ms) → Postgres (~5ms)\n        │  cache hit: return immediately\n        ▼\n  Semantic Cache           pgvector cosine similarity search (~20–50ms)\n        │  cache hit: return immediately\n        ▼\n  Customer Limit Check     per X-Customer-ID daily/monthly spend cap\n        │\n        ▼\n  Provider Call            OpenAI / Anthropic / Gemini / Ollama\n        │  on 429/5xx/timeout → automatic failover to next provider\n        ▼\n  Response                 streaming SSE or non-streaming JSON\n        │\n        ▼\n  Async Workers            log request, update spend counters, store in cache\n```\n\n### Failover Chains\n\nFailover chains are **dynamically composed at request time** based on `FAILOVER_MODE` and whether Ollama is configured. 
The base cloud chains are defined in `internal/gateway/failover/chains.go`.\n\n**Base cloud chains** (used when no Ollama is configured, or in `cloud_only` mode):\n\n| Requested Model | Step 1 | Step 2 | Step 3 |\n|---|---|---|---|\n| `gpt-4o` | OpenAI | Anthropic claude-sonnet-4-6 | Google gemini-2.5-pro |\n| `gpt-4o-mini` | OpenAI | Anthropic claude-haiku-4-5 | Google gemini-2.5-flash |\n| `claude-sonnet-4-6` | Anthropic | OpenAI gpt-4o | Google gemini-2.5-pro |\n| `claude-haiku-4-5-20251001` | Anthropic | OpenAI gpt-4o-mini | Google gemini-2.5-flash |\n| `gemini-2.5-pro` | Google | OpenAI gpt-4o | Anthropic claude-sonnet-4-6 |\n| `gemini-2.5-flash` | Google | OpenAI gpt-4o-mini | Anthropic claude-haiku-4-5 |\n\n**Effect of `FAILOVER_MODE`** (example: request for `gpt-4o-mini` with `OLLAMA_MODEL_BALANCED=qwen2.5:14b`):\n\n| Mode | Resulting chain |\n|---|---|\n| `cloud_only` | OpenAI → Anthropic haiku → Gemini flash |\n| `cloud_first` | OpenAI → Anthropic haiku → Gemini flash → Ollama qwen2.5:14b |\n| `local_first` | Ollama qwen2.5:14b → OpenAI → Anthropic haiku → Gemini flash |\n| `local_only` | Ollama qwen2.5:14b |\n\n**Tier resolution** — the gateway chooses which Ollama model to substitute based on the cloud model's quality tier: flagship (gpt-4o, claude-opus, gemini-pro), balanced (gpt-4o-mini, claude-sonnet, gemini-flash), or budget (gpt-3.5, claude-haiku, gemini-flash-lite).\n\n**Failover triggers**: `429` (rate limit), `5xx` (server error), connection timeout, connection error, `401`/`403` (auth failure — next provider may have a valid key), `404` (model not available on that provider).\n\n### Exact-Match Cache\n\nCache key: `SHA-256(project_id + provider + model + sorted_messages_json)`\n\nTwo-tier lookup:\n1. **Redis** (hot) — sub-millisecond, in-memory\n2. 
**Postgres** (warm) — ~5ms, survives Redis restarts\n\nCache hits cost `$0.00`. A hit in the Redis tier is looked up in `\u003c1ms`; a hit that falls through to Postgres takes ~5ms.\n\n### Semantic Cache\n\nWhen `EMBEDDING_SERVICE_URL` is configured, the first user message is embedded using `all-MiniLM-L6-v2` (384 dimensions). The embedding is compared against stored vectors in Postgres using `pgvector` cosine similarity.\n\n```\nGateway ──POST /embed──► Embedding Service (all-MiniLM-L6-v2, CPU)\n        ◄─[0.12, -0.34, ...]──\n\n        ──cosine similarity──► pgvector (threshold: 0.95)\n```\n\nCache hits return the stored response without any LLM API call.\n\n**Threshold**: configurable per project (`semantic_threshold` column, default `0.95`). Lower values return more matches but risk returning less relevant cached responses.\n\n### Turning Semantic Cache Off\n\nThere are two ways to disable semantic caching, depending on scope:\n\n**1. Globally (all projects)** — unset `EMBEDDING_SERVICE_URL` in your environment. The gateway logs `⚠️ Semantic cache disabled (no EMBEDDING_SERVICE_URL)` at startup and skips the semantic lookup entirely. Exact-match caching is unaffected. The `embedding` service in `docker-compose.yml` can be removed or left idle — it's never called.\n\n```bash\n# In .env\nEMBEDDING_SERVICE_URL=\n\n# Or stop the embedding container alone\ndocker compose stop embedding\n```\n\n**2. Per project** — flip the `semantic_cache_enabled` column on the `projects` table. 
API keys inherit their project's setting, so every key scoped to that project loses semantic cache immediately on the next auth cache refresh (≤ 60 s by default, tunable via `CUSTOMER_LIMIT_CACHE_TTL_SECONDS`).\n\n```bash\n./scripts/manage_limits.sh           # menu option 6 — \"Update project cache settings\"\n```\n\nOr by SQL:\n\n```sql\nUPDATE projects\nSET semantic_cache_enabled = false\nWHERE id = '\u003cproject-uuid\u003e';\n```\n\nUse per-project disable when you have mixed workloads — e.g., chat UIs benefit from semantic hits, but tool-calling agents need exact matches because a single token difference changes intent. The `cache_enabled` column on the same table toggles the exact-match cache independently, so you can keep one and disable the other.\n\n**Note on existing cache rows** — disabling semantic cache only stops *reads and writes*; rows already in the `semantic_cache` table stay put. They'll age out naturally via the daily cleanup job (see below), or you can clear them manually:\n\n```sql\nDELETE FROM semantic_cache WHERE project_id = '\u003cproject-uuid\u003e';\n```\n\n---\n\n## Response Headers\n\nEvery response includes diagnostic headers:\n\n| Header | Description |\n|---|---|\n| `X-Cache-Hit` | `exact`, `semantic`, or `miss` |\n| `X-Cache-Similarity` | Cosine similarity score (semantic hits only) |\n| `X-Provider` | Which provider served the response |\n| `X-Cost-USD` | Actual cost of the request |\n| `X-Tokens-Prompt` | Prompt token count |\n| `X-Tokens-Completion` | Completion token count |\n| `X-RateLimit-Remaining` | Requests remaining in current window |\n| `X-Failover` | `true` if failover occurred |\n| `X-Original-Provider` | Provider that was tried first (on failover) |\n\n---\n\n## Rate Limiting \u0026 Cost Controls\n\nThe gateway has **three independent layers** of usage control, evaluated in order on every request.\n\n\u003e **TL;DR — tune everything via an interactive CLI:**\n\u003e\n\u003e ```bash\n\u003e 
./scripts/manage_limits.sh\n\u003e ```\n\u003e\n\u003e The script wraps `psql` with a menu-driven UI for updating API-key rate limits, project spend caps, cache/semantic settings, and per-customer limits without writing SQL. Changes take effect without a gateway restart.\n\n### 1. Per-API-Key Rate Limit (requests/minute)\nA token-bucket algorithm runs atomically in Redis via a Lua script — no race conditions even under thousands of concurrent calls. Each API key has its own `rate_limit_per_minute` stored in the `api_keys` table.\n\n```bash\n# Interactive (recommended)\n./scripts/manage_limits.sh set-key-rate\n\n# Or direct SQL\ndocker compose exec postgres psql -U llm0 -d llm0_gateway -c \\\n  \"UPDATE api_keys SET rate_limit_per_minute = 120 WHERE key_prefix = 'llm0_live_abc12';\"\n```\n\nThe client sees:\n- `X-RateLimit-Limit` — the bucket capacity\n- `X-RateLimit-Remaining` — tokens left in the current window\n- `X-RateLimit-Reset` — Unix timestamp when the bucket refills\n- `429` with `retry_after` when exceeded\n\n### 2. Hard Project Spend Cap (USD/month)\nEach project has a `monthly_cap_usd` column. The gateway **estimates** the request cost before calling the LLM; if it would push the project over the cap, the request is blocked with `402 Payment Required`. This prevents runaway prompts from silently burning dollars.\n\n```bash\n./scripts/manage_limits.sh set-project-cap\n```\n\n### 3. Per-Customer Spend Limits (daily + monthly USD)\nSet limits per end-user via the `customer_limits` table. 
The interactive script handles upsert logic, validation, and NULL handling for you:\n\n```bash\n./scripts/manage_limits.sh set-customer-limit\n```\n\nOr directly:\n\n```sql\nINSERT INTO customer_limits (\n    project_id, customer_id,\n    daily_spend_limit_usd, monthly_spend_limit_usd,\n    on_limit_behavior, downgrade_model\n) VALUES (\n    '\u003cyour-project-id\u003e',\n    'user_123',\n    1.00,          -- $1 per day\n    20.00,         -- $20 per month\n    'downgrade',   -- 'block' or 'downgrade'\n    'gpt-4o-mini'  -- used when on_limit_behavior = 'downgrade'\n);\n```\n\nThen pass the customer ID on requests:\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer llm0_live_...\" \\\n  -H \"X-Customer-ID: user_123\" \\\n  ...\n```\n\nSpend headers are included in every response:\n- `X-Customer-Spend-Today`\n- `X-Customer-Limit-Daily`\n- `X-Customer-Remaining-Usd`\n\n### How Spend Caps Reset\n\nAll three spend counters (project `monthly_cap_usd`, customer daily, customer monthly) reset automatically — you don't run a cron job.\n\n**1. Redis is the source of truth for enforcement.**\nEvery request calls into a Lua script that reads and increments counters stored under date-stamped keys:\n\n| Counter | Redis key | Rotation |\n|---|---|---|\n| Project monthly spend | `spend:project:{project_id}:{YYYY-MM}` | New key on 1st of each month |\n| Customer daily spend | `spend:customer:{project_id}:{customer_id}:daily:{YYYY-MM-DD}` | New key at UTC midnight |\n| Customer monthly spend | `spend:customer:{project_id}:{customer_id}:monthly:{YYYY-MM}` | New key on 1st of each month |\n\nWhen the date rolls over, the Lua script computes a new key name and starts fresh at `$0.00`. The old keys are still in Redis but no longer read — they're garbage-collected by a `TTL` set on every write (31 days for monthly keys, 24 hours for daily keys). **No manual intervention, no cron job, no downtime window.**\n\n**2. 
Postgres mirrors for reporting.**\nThe `projects.current_month_spend_usd` column and the `customer_spend` rows exist so you can run SQL dashboards. They're maintained by an async write path (off the hot request path) and reset/pruned by a goroutine scheduler:\n\n- `resetMonthlySpend` runs at 00:00 UTC on the 1st of each month, setting `projects.current_month_spend_usd = 0` and advancing `spend_reset_at` to the next month's 1st. If the gateway was down on the 1st, the next startup catches up via `WHERE spend_reset_at \u003c= NOW()`.\n- `cleanupExpiredCache` and `cleanupSemanticCache` prune stale cache rows hourly and daily.\n- `cleanupOldLogs` runs weekly (Sunday 03:00 UTC) to trim `gateway_logs` retention.\n- `reconcileCustomerSpend` runs hourly to detect drift between Redis and Postgres customer-spend totals (for observability only — Redis remains authoritative).\n\nAll five workers are started from `cmd/gateway/main.go` on boot and cancelled on `SIGINT`/`SIGTERM`. Set `DISABLE_BACKGROUND_WORKERS=true` in multi-replica deployments where only one replica should run maintenance, or in tests.\n\n**3. Redis persistence matters for production.**\nBecause enforcement reads Redis counters directly, Redis restarts without AOF/RDB persistence will reset spend counters mid-month. The bundled `docker-compose.yml` enables `appendonly yes`; verify the same in any managed Redis you use. If you lose Redis data, the `reconcileCustomerSpend` job will flag the drift on its next run — rebuild counters from `SELECT SUM(cost_usd) FROM gateway_logs WHERE project_id = ... 
AND created_at \u003e= date_trunc('month', NOW())` if needed.\n\n**Manually overriding a reset or unblocking a customer:**\n\n```bash\n# Bump a project's monthly cap (immediately picked up — no gateway restart)\n./scripts/manage_limits.sh set-project-cap\n\n# Raise a specific customer's daily or monthly limit\n./scripts/manage_limits.sh set-customer-limit\n\n# Nuclear option: zero out the Redis counter for a project mid-month\ndocker compose exec redis redis-cli DEL \"spend:project:\u003cproject_id\u003e:$(date -u +%Y-%m)\"\n```\n\n### Background Worker Schedule\n\nAll scheduled jobs run as in-process Go goroutines — no cron, no sidecar container, no external dependency. On startup the gateway logs each job's next-run time, e.g.:\n\n```\n⏰ [spend-reset] Next run in 258h2m43s\n⏰ [semantic-cache-cleanup] Next run in 20h2m43s\n⏰ [cache-cleanup] Scheduled hourly, first run in 2m43s\n```\n\n| Job | Cadence | Touches | `system_logs.event_type` |\n|---|---|---|---|\n| `cache-cleanup` | Hourly | `DELETE FROM exact_cache WHERE expires_at \u003c NOW()` | `cache_cleanup` (only if \u003e100 rows) |\n| `semantic-cache-cleanup` | Daily at **02:00 UTC** | `DELETE FROM semantic_cache WHERE created_at + (ttl_seconds ‖ 'seconds')::interval \u003c NOW()` | `semantic_cache_cleanup` (only if \u003e100 rows) |\n| `log-cleanup` | Weekly, Sunday at **03:00 UTC** | Trims `gateway_logs` per retention policy | `log_cleanup` |\n| `reconciliation` | Hourly | Read-only drift check: Redis `spend:customer:…` vs `customer_spend` table | `customer_spend_reconciliation` |\n| `spend-reset` | Monthly, day 1 at **00:00 UTC** | Zeroes `projects.current_month_spend_usd`; advances `spend_reset_at` | `monthly_spend_reset` |\n\n**Why these specific cadences:**\n\n- **Exact-match cache is pruned hourly** because it churns fast (`CACHE_TTL_SECONDS` defaults to 1 hour), and row count grows linearly with traffic.\n- **Semantic cache is pruned daily at 02:00 UTC** because rows live longer (per-row `ttl_seconds`, 
typically hours to days), the `pgvector` HNSW index makes deletes more expensive than a plain b-tree, and scheduling off-peak avoids contention with business-hours traffic.\n- **Log cleanup is weekly** because `gateway_logs` is the most write-heavy table and clients frequently query it for dashboards; running daily would add vacuum pressure.\n- **Reconciliation is hourly** because it's read-only and cheap: it just compares customer-spend totals between Redis and Postgres so you catch drift early.\n- **Spend reset is monthly** on the 1st at 00:00 UTC because that's when new date-stamped Redis keys start being used; Postgres just needs to mirror the rollover.\n\n**Operational notes:**\n\n- **Audit trail** — cleanup jobs only write to `system_logs` when they actually delete something substantial (\u003e100 rows), to keep the audit table from filling with no-op entries. `spend-reset` and `reconciliation` always write a row.\n- **Postgres autovacuum** — `DELETE` marks rows dead but doesn't reclaim space until autovacuum runs. If you do heavy semantic-cache churn (millions of rows/day), schedule a weekly `VACUUM (VERBOSE, ANALYZE) semantic_cache;` outside peak hours.\n- **Catch-up on missed runs** — `spend-reset` uses `WHERE spend_reset_at \u003c= NOW()`, so if the gateway was down on the 1st it catches up at next startup. Cache cleanup is self-healing (rows are filtered by `expires_at`, so a missed run just means the next one deletes more).\n- **Disable for multi-replica** — set `DISABLE_BACKGROUND_WORKERS=true` on all replicas except one dedicated maintenance replica. Enforcement (rate limits, spend caps) is unaffected because it reads directly from Redis; only the Postgres reporting/cleanup layer goes dormant. Startup log confirms: `⚠️ Background workers disabled via DISABLE_BACKGROUND_WORKERS=true`.\n\n### How Cost is Calculated\n\nThe gateway tracks cost in two places: **before** the call (for spend-cap enforcement) and **after** the call (for actual billing).\n\n**1. 
Pricing source** — the `model_pricing` table, one row per `(provider, model)` pair with `input_per_1k_tokens` and `output_per_1k_tokens`. Pricing is loaded into memory at startup — restart the gateway after updates via `./scripts/manage_models.sh`.\n\n**2. Cost formula** — applied identically in every path:\n\n```\ncost_usd = (input_tokens  / 1000) × input_per_1k_tokens\n         + (output_tokens / 1000) × output_per_1k_tokens\n```\n\nBoth input and output prices are always applied. Ollama (local) requests are always `$0`, regardless of token counts.\n\n**3. Pre-request estimation** — used to block requests that would breach a project or customer spend cap *before* any API call is made:\n\n- **Input tokens** are estimated as `sum(len(role) + len(content) + 4) / 4` across all messages (the industry-standard \"~4 chars per token\" heuristic).\n- **Output tokens** use the client-supplied `max_tokens` if present. If not, defaults to `2 × input_tokens` clamped to `[100, 2000]` so neither tiny nor huge prompts produce wildly skewed estimates.\n\nThis means clients can send `max_tokens: 500` to get a tight, accurate pre-estimate — useful when hovering near a spend cap.\n\n**4. Post-request actual cost** — the gateway reads real `prompt_tokens` and `completion_tokens` from the provider's response and recalculates, then reconciles the difference against Redis spend counters. 
Every request log in `gateway_logs` has the real cost.\n\n### Cost Tracking Example\n\n```bash\n# Make a request\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer llm0_live_...\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"gpt-4o-mini\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"What is F1?\"}],\n    \"max_tokens\": 200\n  }'\n```\n\nThe response headers tell you:\n- `X-Cost-USD: 0.000110`\n- `X-Tokens-Prompt: 15`\n- `X-Tokens-Completion: 180`\n\nAggregate spend by customer, model, or day:\n\n```sql\n-- Top 10 costliest customers this month\nSELECT customer_id, SUM(cost_usd) AS total, COUNT(*) AS requests\nFROM gateway_logs\nWHERE created_at \u003e= date_trunc('month', NOW())\nGROUP BY customer_id\nORDER BY total DESC\nLIMIT 10;\n\n-- Spend breakdown by model\nSELECT model, SUM(cost_usd) AS total, SUM(tokens_total) AS tokens\nFROM gateway_logs\nWHERE created_at \u003e= NOW() - INTERVAL '7 days'\nGROUP BY model\nORDER BY total DESC;\n\n-- Average cost per request by customer tier (from labels)\nSELECT labels-\u003e\u003e'Tier' AS tier, AVG(cost_usd) AS avg_cost\nFROM gateway_logs\nWHERE labels ? 'Tier'\nGROUP BY tier;\n```\n\n### Customer Labels\nAttach arbitrary labels to any request for analytics — they're stored as JSONB on `gateway_logs`:\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"X-Customer-ID: user_123\" \\\n  -H \"X-LLM0-Tier: pro\" \\\n  -H \"X-LLM0-Team: billing\" \\\n  ...\n```\n\nQuery the logs later:\n```sql\nSELECT labels-\u003e\u003e'Tier', SUM(cost_usd) FROM gateway_logs GROUP BY 1;\n```\n\n---\n\n## Performance\n\nAll numbers are **in-process latency** (`gateway_logs.latency_ms`) — the time from request arrival at the Go handler to the response being written. 
Excludes client network.\n\n### Test setup\n\nAll numbers below come from a single run of [`bench/load_test.sh`](bench/load_test.sh) against a locally-built gateway:\n\n| Parameter | Value |\n|---|---|\n| Load tool | [`hey`](https://github.com/rakyll/hey) |\n| Concurrency | **20** in-flight workers |\n| Total requests | **200** (of which 67 succeeded, 133 were rate-limited by the test key's 60 req/min cap) |\n| Throughput observed | **~1,480 req/sec** (client-side, mixed 200 + 429) |\n| Payload | `gpt-4o-mini` chat completion, 1 user message, ~40 tokens total |\n| Host | Apple M4 MacBook Air, Go 1.24, gateway native, Redis 7 + Postgres 17 in Docker |\n| Measurement source | `gateway_logs.latency_ms` (server-side, excludes client RTT) |\n\nTo reproduce:\n\n```bash\ndocker compose up -d postgres redis\ngo run ./cmd/gateway \u0026\nexport LLM0_API_KEY=llm0_live_\u003cyour key\u003e\n./bench/load_test.sh\n```\n\n\u003e The 60 req/min cap on the default test API key is why you'll see 429s — bump `token_bucket_capacity` / `token_bucket_refill_per_min` on the key via `psql` or the management scripts if you want a longer clean run.\n\n### Cache-hit workload (measured)\n\n| Response type | p50 | p95 | p99 | n |\n|---|---:|---:|---:|---:|\n| 200 — Exact-match cache hit | **11 ms** | **15 ms** | **16 ms** | 67 |\n| 429 — Rate-limit rejection  | **2.1 ms** | **5.6 ms** | 5.6 ms | 133 |\n\n### Fast-fail on rejected requests\n\nThe gateway is designed to say \"no\" quickly — rejections short-circuit before the cache lookup, provider routing, and response marshaling:\n\n| Response | p50 | p95 | Path |\n|---|---:|---:|---|\n| **429 rate-limited** | **2.1 ms** | **5.6 ms** | auth → Redis Lua token-bucket → 429 |\n| 200 cache hit | 11 ms | 15 ms | auth → Redis Lua → Redis GET → marshal → 200 |\n\nRejections being **~5× faster than accepted requests** is the property that keeps a single gateway instance stable during abuse bursts — a runaway client or credential leak can't 
meaningfully consume gateway CPU because each `DENY` takes ~2ms of work and incurs no provider cost.\n\n### Querying your own percentiles\n\n`hey`'s client-side summary mixes 200s and 429s. For per-status-code percentiles, query `gateway_logs` directly:\n\n```bash\ndocker compose exec -T postgres psql -U llm0 -d llm0_gateway -c \"\nSELECT status,\n       cache_hit,\n       count(*)                                                        AS n,\n       percentile_disc(0.5)  WITHIN GROUP (ORDER BY latency_ms)        AS p50,\n       percentile_disc(0.95) WITHIN GROUP (ORDER BY latency_ms)        AS p95,\n       percentile_disc(0.99) WITHIN GROUP (ORDER BY latency_ms)        AS p99\nFROM gateway_logs\nWHERE created_at \u003e now() - interval '5 minutes'\nGROUP BY status, cache_hit;\"\n```\n\n### What's in each latency bucket\n\nA p50 of 11ms on a cache hit covers:\n\n- Bearer-token auth (Redis cache ~0.3ms)\n- API-key token-bucket rate limit (Redis Lua `EVALSHA`, 1 round trip)\n- Exact-match cache lookup (Redis `GET`, 1 round trip)\n- JSON marshal + HTTP response write\n- Gin middleware chain + logging goroutine spawn\n\nFor cache misses, add the provider round-trip on top (`gpt-4o-mini` ≈ 300–800ms to OpenAI, ≈ 200–500ms to Anthropic).\n\n### A note on Docker Desktop vs production\n\nIf you run the gateway inside **Docker Desktop on macOS**, expect p50 ≈ 15ms and p99 ≈ 150ms+ — that's the Docker-for-Mac VM's network overhead, not the gateway. On Linux hosts (EC2, Kubernetes nodes, bare metal) the container networking penalty is ~0.05ms, so production numbers should closely match the native (non-Docker) measurements above.\n\n### Memory footprint\n\nThe single Go binary is ~30MB RSS at idle, ~50–80MB under load. 
Concurrent request capacity is bounded by `MAX_CONCURRENT_REQUESTS` (default 10,000).\n\n---\n\n## Endpoints\n\n| Method | Path | Auth | Description |\n|---|---|---|---|\n| `POST` | `/v1/chat/completions` | Bearer token | Chat completions — streaming and non-streaming |\n| `GET` | `/v1/models` | Bearer token | OpenAI-compatible model list (includes cloud + pulled Ollama models) |\n| `GET` | `/health` | None | Basic liveness check |\n| `GET` | `/ready` | None | Readiness check (Postgres + Redis connectivity) |\n| `GET` | `/live` | None | Liveness check |\n\n---\n\n## Project Structure\n\n```\nllm0-gateway/\n├── cmd/gateway/main.go              # Entry point, router setup, worker initialization\n├── internal/\n│   ├── gateway/\n│   │   ├── auth/                   # API key validation (bcrypt + Redis cache)\n│   │   ├── cache/                  # Exact-match (Redis+Postgres) and semantic cache\n│   │   ├── cost/                   # Pre/post request cost calculation\n│   │   ├── embeddings/             # HTTP client for embedding service\n│   │   ├── failover/               # Failover executor + preset model chains\n│   │   ├── handlers/               # Gin HTTP handlers (chat, streaming, health)\n│   │   ├── providers/              # OpenAI, Anthropic, Gemini provider clients\n│   │   ├── ratelimit/              # Per-API-key and per-customer rate limiting\n│   │   ├── streaming/              # SSE normalization across providers\n│   │   └── workers/                # Background jobs (cache GC, reconciliation)\n│   └── shared/\n│       ├── config/                 # Environment variable loader\n│       ├── database/               # Postgres connection pool + query helpers\n│       ├── models/                 # Shared Go structs (Project, APIKey, etc.)\n│       ├── redis/                  # Redis client with rate limit + spend cap logic\n│       └── tls/                    # TLS 1.3 config\n├── embedding_service/\n│   ├── app.py                      # FastAPI embedding 
server\n│   ├── requirements.txt\n│   └── Dockerfile                  # Bakes all-MiniLM-L6-v2 weights at build time\n├── schema/schema.sql               # Canonical DB schema (single source of truth)\n├── scripts/\n│   └── create_api_key.sh           # Project + API key creation helper\n├── docker-compose.yml              # Postgres, Redis, embedding service, gateway\n├── Dockerfile\n└── .env.example\n```\n\n---\n\n## Architecture\n\n```\n                        ┌─────────────────────────────┐\n                        │         LLM0 Gateway        │\n                        │         (Go, :8080)         │\n                        └──────────────┬──────────────┘\n                                       │\n               ┌───────────────────────┼───────────────────────┐\n               │                       │                       │\n               ▼                       ▼                       ▼\n       ┌──────────────┐      ┌──────────────────┐    ┌──────────────────┐\n       │    Redis     │      │    PostgreSQL    │    │ Embedding Service│\n       │  Rate limits │      │  API keys, logs  │    │ all-MiniLM-L6-v2 │\n       │  Exact cache │      │  Exact cache     │    │   (Python)       │\n       │  Spend totals│      │  Semantic cache  │    └──────────────────┘\n       └──────────────┘      │  Model pricing   │\n                             └──────────────────┘\n                                       │\n         ┌─────────────────┬───────────┴──────────────┬─────────────────┐\n         │                 │                          │                 │\n         ▼                 ▼                          ▼                 ▼\n ┌──────────────┐  ┌──────────────┐          ┌──────────────┐  ┌──────────────┐\n │    OpenAI    │  │   Anthropic  │          │ Google Gemini│  │    Ollama    │\n │              │  │              │          │              │  │   (local)    │\n └──────────────┘  └──────────────┘          └──────────────┘  └──────────────┘\n                  
◄── cloud providers ──►                      ◄── optional ──►\n```\n\n---\n\n## Contributing\n\nContributions are welcome. Please open an issue before submitting large changes.\n\nAreas where contributions are especially useful:\n- Additional provider support (AWS Bedrock, Azure OpenAI, Mistral La Plateforme, Cohere, Groq)\n- Admin REST API for key/project/limit management\n- Prometheus metrics endpoint (`/metrics`)\n- Additional embedding models for semantic cache\n- Per-model-class routing rules (e.g. \"always route coding tasks to X\")\n\nSee [`CHANGELOG.md`](./CHANGELOG.md) for what shipped in the current release\n(v0.1.1) and what's planned for the next patch (v0.1.2).\n\n---\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrmushfiq%2Fllm0-gateway","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmrmushfiq%2Fllm0-gateway","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmrmushfiq%2Fllm0-gateway/lists"}