https://github.com/skymoonsun/model-maestro
Unified LLM Gateway that proxies multiple providers (Ollama, OpenAI-compatible) behind a single API. Enables IDEs and tools to access multiple models via standard API formats. Manage LLM usage with per-user token limits, request logging, load balancing, model groups, and an admin dashboard.
- Host: GitHub
- URL: https://github.com/skymoonsun/model-maestro
- Owner: skymoonsun
- Created: 2025-10-21T07:16:59.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2026-05-01T00:33:15.000Z (about 1 month ago)
- Last Synced: 2026-05-01T01:13:41.646Z (about 1 month ago)
- Topics: ai, antigravity, claude-code, cursor, cursor-plugin, kiro, llm, model-maestro, ollama, openai-compatible, openclaw, orchestration, vllm
- Language: Python
- Homepage:
- Size: 762 KB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Config-driven Unified LLM Gateway
Route, load-balance and manage Ollama, OpenAI and other LLM providers through a single authenticated API.
Model Maestro gives you user-based access control, model mapping, token usage tracking, health-checked node pooling and a modern Next.js admin dashboard — all wired to PostgreSQL + Redis.
[Quick Start](#quick-start) · [Features](#features) · [Architecture](#architecture) · [API](#api-reference) · [Admin Panel](#admin-panel)
---
## Table of Contents
- [Quick Start](#quick-start)
- [Features](#features)
- [Architecture](#architecture)
- [Tech Stack](#tech-stack)
- [Configuration](#configuration)
- [Admin Panel](#admin-panel)
- [API Reference](#api-reference)
  - [Authentication](#authentication)
  - [LLM Endpoints](#llm-endpoints)
  - [Admin Endpoints](#admin-endpoints)
  - [OpenAI Compatible](#openai-compatible)
- [Model Mapping & Routing](#model-mapping--routing)
- [Troubleshooting](#troubleshooting)
- [Development](#development)
- [Documentation](#documentation)
- [License](#license)
---
## Quick Start
> Requires Docker & Docker Compose.
```bash
# 1. Clone
git clone https://github.com/skymoonsun/model-maestro && cd model-maestro
# 2. Configure
cp .env.example .env
# 3. Launch full stack (PostgreSQL + Redis + FastAPI + Next.js)
docker compose -f docker-compose.dev.yml up --build -d
# 4. Seed the database
docker exec maestro python -m app.seeder
# 5. Open the admin panel at http://localhost:3000
```
| Service | URL | Notes |
|---|---|---|
| **API** | `http://localhost:8000` | FastAPI gateway |
| **Admin Dashboard** | `http://localhost:3000` | Next.js admin panel |
| **API Docs** | `http://localhost:8000/api/docs` | Basic-auth protected |
For a more detailed setup guide, see [`docs/SETUP.md`](docs/SETUP.md).
---
## Features
- **JWT Authentication** — Bearer-token auth on every LLM request.
- **Admin Dashboard** — Next.js 16 panel for visual management of users, nodes, models, groups and audit logs.
- **Model Mapping** — Translate display names (`gpt-oss:120b`) to real names (`gpt-oss:120b-cloud`) via PostgreSQL with JSON-file caching.
- **Multi-Node Load Balancing** — Round-robin, weighted and priority-based strategies across Ollama nodes.
- **Model Groups** — Group models into logical units with fallback chains. Requests dynamically resolve to the best member based on capability tags (vision, tools) and strategy.
- **Node Health Management** — Automatic health checks, model discovery and availability tracking.
- **User-Level Access Control** — Per-user model allowlists and rate limits (requests / tokens per day).
- **Token Usage Tracking** — Background-batched activity logs with prompt / completion / total token breakdowns.
- **Tool Set Filtering** — Restrict which tools a model is allowed to invoke via configurable tool sets.
- **Context Length Config** — Per-model context length stored in mappings (used by Cursor/Antigravity for usage bars).
- **Streaming** — SSE-based streaming on `/api/chat`, `/api/generate` and `/v1/chat/completions`.
- **OpenAI Compatible** — Drop-in `/v1/chat/completions` and `/v1/models` endpoints.
- **Full Ollama API** — `/api/generate`, `/api/chat`, `/api/embeddings`, `/api/tags`, `/api/show`, `/api/copy`, `/api/delete`, `/api/pull`, `/api/push`, `/api/create`.
- **DeepSeek Tool Call Parsing** — Auto-detects DeepSeek's raw XML-style tool-call output and converts it to the OpenAI `tool_calls` format in both streaming and non-streaming responses. The Kimi/Moonshot `<|tool_calls_section_begin|>` format is also supported.
- **Streaming-Aware Background Tasks** — Health checks, model discovery and warmup defer when streams are active, preventing interruptions.
- **Node-Aware Model Warmup** — Warmup requests target only models that exist on each node, eliminating 404 errors from stale model names.
- **Background Tasks** — Redis-backed async queue for activity logging, node health checks, model discovery, model warmup and load cleanup.
- **Audit Logs** — Every admin action is timestamped and queryable.
- **PostgreSQL + Alembic** — Schema migrations run automatically on container startup.
- **Redis Cache** — Hot-path caching for mappings, config and user usage data.
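The Kimi/Moonshot conversion mentioned in the list above can be illustrated with a short sketch. It is not Model Maestro's parser (that lives in `app/proxy.py`). It assumes the Kimi K2 layout, where each call is written as `<|tool_call_begin|>functions.NAME:INDEX<|tool_call_argument_begin|>{JSON}<|tool_call_end|>` inside a `<|tool_calls_section_begin|>` ... `<|tool_calls_section_end|>` section. The `call_N` ids are a hypothetical scheme chosen for illustration.

```python
import json
import re

# Assumed Kimi K2 markup: one section wrapping one or more tool calls.
SECTION_RE = re.compile(
    r"<\|tool_calls_section_begin\|>(?P<body>.*?)<\|tool_calls_section_end\|>",
    re.DOTALL,
)
CALL_RE = re.compile(
    r"<\|tool_call_begin\|>\s*(?P<call_id>[\w.]+:\d+)\s*"
    r"<\|tool_call_argument_begin\|>\s*(?P<args>.*?)\s*<\|tool_call_end\|>",
    re.DOTALL,
)


def kimi_to_openai_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Split raw model output into (plain content, OpenAI-style tool_calls)."""
    match = SECTION_RE.search(text)
    if match is None:
        return text, []  # no tool-call section: pass the text through unchanged

    tool_calls = []
    for index, call in enumerate(CALL_RE.finditer(match.group("body"))):
        # "functions.get_weather:0" -> "get_weather"
        name = call.group("call_id").split(".", 1)[-1].rsplit(":", 1)[0]
        args = call.group("args")
        json.loads(args)  # fail early if the arguments are not valid JSON
        tool_calls.append(
            {
                "id": f"call_{index}",  # hypothetical id scheme
                "type": "function",
                "function": {"name": name, "arguments": args},
            }
        )

    # Keep any prose the model emitted outside the tool-call section.
    content = (text[: match.start()] + text[match.end():]).strip()
    return content, tool_calls
```

OpenAI clients expect `function.arguments` as a JSON string rather than an object, which is why the sketch validates the arguments but keeps them as text.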
---
## Architecture
```
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Cursor    │     │ Antigravity  │     │    Claude    │
│     IDE      │     │     IDE      │     │     Code     │
└──────┬───────┘     └──────┬───────┘     └──────┬───────┘
       │                    │                    │
       └────────────────────┼────────────────────┘
                            │
                     ┌──────┴──────┐
                     │    Load     │
                     │  Balancer   │
                     └──────┬──────┘
                            │
       ┌────────────────────┼────────────────────┐
       │                    │                    │
┌──────┴──────┐    ┌────────┴────────┐    ┌──────┴──────┐
│   Ollama    │    │     Ollama      │    │   OpenAI    │
│   Node 1    │    │     Node 2      │    │   / Other   │
└─────────────┘    └─────────────────┘    └─────────────┘
```
**Request Flow**
```
      Client Request
             │
             ▼
    ┌─────────────────┐
    │ JWT Middleware  │
    └────────┬────────┘
             │
             ▼
    ┌─────────────────┐
    │  Model Group?   │──No──▶┌────────────────┐
    │ (resolve member)│       │  Model Mapper  │
    └────────┬────────┘       │ (display→real) │
             │ Yes            └───────┬────────┘
             │                        │
             ▼                        ▼
    ┌─────────────────┐       ┌────────────────┐
    │  Load Balancer  │──────▶│   Node Pool    │
    │ (pick healthy)  │       │ (health check  │
    └────────┬────────┘       │    + retry)    │
             │                └───────┬────────┘
             │                        │
             ▼                        ▼
    ┌─────────────────┐       ┌────────────────┐
    │  Ollama Proxy   │◀──────│    Ollama /    │
    │  (reverse map)  │       │  Provider API  │
    └────────┬────────┘       └────────────────┘
             │
             ▼
      Client Response
```
For the full architecture documentation, see [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md).
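The "Load Balancer (pick healthy)" and "Node Pool (health check + retry)" steps can be approximated as a loop over healthy nodes that falls through to the next candidate when one fails. This is a hypothetical sketch: the `Node` fields and the `send` callable are illustrative stand-ins rather than the API of `load_balancer.py` or `node_manager.py`. Trying lower `priority` numbers first mirrors the group rule described under Model Mapping & Routing; whether nodes follow the same convention is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    base_url: str
    priority: int          # assumption: lower number = tried first
    healthy: bool = True


class AllNodesFailed(Exception):
    """Raised when every healthy node has been tried without success."""


def proxy_with_failover(nodes: list[Node], send: Callable[[Node], dict]) -> dict:
    """Return the first successful response, trying healthy nodes by priority."""
    candidates = sorted((n for n in nodes if n.healthy), key=lambda n: n.priority)
    errors: list[str] = []
    for node in candidates:
        try:
            return send(node)
        except ConnectionError as exc:  # assumes send() raises ConnectionError on transport failure
            node.healthy = False  # skip it until a later health check re-enables it
            errors.append(f"{node.name}: {exc}")
    raise AllNodesFailed("; ".join(errors) or "no healthy nodes")
```

Marking a failed node unhealthy keeps later requests from paying the same timeout; the periodic health check described in the Features list is what would bring it back.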
---
## Tech Stack
| Layer | Technology |
|---|---|
| **API Gateway** | Python 3.11, FastAPI, Uvicorn |
| **Async HTTP** | httpx (HTTP/2) |
| **Auth** | JWT (PyJWT) |
| **Database** | PostgreSQL 15 + asyncpg + SQLAlchemy async |
| **Migrations** | Alembic |
| **Cache** | Redis 7 |
| **Frontend** | Next.js 16, React 19, Tailwind CSS v4, shadcn/ui |
| **Background Tasks** | Redis-backed async queue |
| **Deployment** | Docker, Docker Compose |
---
## Configuration
Copy `.env.example` to `.env` and set:
```env
# Ollama
OLLAMA_BASE_URL=http://host.docker.internal:11434
# Auth & logging
JWT_SECRET_KEY=change-this-to-a-strong-secret
LOG_LEVEL=INFO
# PostgreSQL
DATABASE_URL=postgresql+asyncpg://maestro_user:maestro_password@postgres:5432/maestro
# Redis
REDIS_URL=redis://redis:6379/0
# Admin Token (for /admin/* endpoints)
ADMIN_TOKEN=change-this-for-production
# Admin Panel Login
ADMIN_USERNAME=admin
ADMIN_PASSWORD=admin
# Swagger / ReDoc Basic Auth
DOCS_USERNAME=admin
DOCS_PASSWORD=admin
```
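Several of the values above are sample placeholders. A small pre-flight check like the hypothetical one below (not part of Model Maestro, which loads settings in `app/config.py`) parses a `.env` file and reports every variable still set to one of the sample values, or not set at all. It handles only simple `KEY=value` lines, not quoting or `export` prefixes.

```python
# Sample values copied from the example .env above.
PLACEHOLDERS = {
    "JWT_SECRET_KEY": "change-this-to-a-strong-secret",
    "ADMIN_TOKEN": "change-this-for-production",
    "ADMIN_PASSWORD": "admin",
    "DOCS_PASSWORD": "admin",
}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        values[key.strip()] = value.strip()
    return values


def unsafe_settings(values: dict[str, str]) -> list[str]:
    """Return names of settings that still hold a sample value or are unset."""
    return sorted(
        key for key, sample in PLACEHOLDERS.items()
        if values.get(key, sample) == sample
    )
```

For example, `unsafe_settings(parse_env(open(".env").read()))` lists what still needs changing before the gateway is exposed.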
---
## Admin Panel
The Next.js dashboard (`http://localhost:3000`) provides a visual interface for managing users, nodes, models, tool sets and logs:
| Page | What you can do |
|---|---|
| **Dashboard** | Node health, model counts, user statistics |
| **Users** | Create users, manage tokens, assign models, set limits |
| **Ollama > Nodes** | Add/edit Ollama nodes, view health status, trigger discovery |
| **Ollama > Models** | Browse discovered models per node |
| **Models > Mappings** | Display↔Real name mappings, set context length, capabilities |
| **Models > Groups** | Create groups, add members, set strategy, reorder fallbacks |
| **Models > Config** | Per-model tool restrictions and settings |
| **Tool Sets** | Create tool groups and assign to models |
| **Settings** | System-wide configuration |
| **Audit Logs** | Filterable history of all admin actions |
**Default login:** username from `ADMIN_USERNAME` and password from `ADMIN_PASSWORD` in `.env` (both `admin` in `.env.example`).
---
## API Reference
For the complete API reference with all request/response examples, see [`docs/API.md`](docs/API.md).
### Authentication
Every LLM request requires the user's JWT as a bearer token:
```
Authorization: Bearer $TOKEN
```
Admin endpoints (`/admin/*`) require the `ADMIN_TOKEN` set in `.env`:
```
Authorization: Bearer $ADMIN_TOKEN
```
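For readers unfamiliar with bearer JWTs, the sketch below shows what HS256 signing and verification involve, using only the standard library. It assumes the gateway signs user tokens with HS256 and `JWT_SECRET_KEY`, which this README does not state; the project itself uses PyJWT, and the `sub`/`exp` claims here are illustrative.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(segment: str) -> bytes:
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))  # restore padding


def sign_hs256(claims: dict, secret: str) -> str:
    """Create base64url(header).base64url(claims).base64url(HMAC-SHA256 signature)."""
    header = {"alg": "HS256", "typ": "JWT"}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode()) for part in (header, claims)
    )
    signature = hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{_b64url(signature)}"


def verify_hs256(token: str, secret: str) -> dict:
    """Return the claims if signature and expiry check out, else raise ValueError."""
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")  # reject alg swaps such as "none"
    expected = hmac.new(
        secret.encode(), f"{header_b64}.{payload_b64}".encode(), hashlib.sha256
    ).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):  # constant-time compare
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    if "exp" in claims and claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims
```

Because the signature covers both header and claims, anyone holding `JWT_SECRET_KEY` can mint valid tokens, which is why the placeholder value must be replaced.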
### LLM Endpoints
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/api/chat` | Chat completions (Ollama format) |
| `POST` | `/api/generate` | Text generation |
| `POST` | `/api/embeddings` | Generate embeddings |
| `GET` | `/api/tags` | List available models |
| `POST` | `/api/show` | Show model info |
| `POST` | `/api/copy` | Copy model |
| `DELETE`| `/api/delete` | Delete model |
| `POST` | `/api/pull` | Pull model |
| `POST` | `/api/push` | Push model |
| `POST` | `/api/create` | Create model from Modelfile |
**Example — Chat**
```bash
curl -X POST http://localhost:8000/api/chat \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss:120b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
```
**Example — Streaming Chat**
```bash
curl -X POST http://localhost:8000/api/chat \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss:120b",
"messages": [{"role": "user", "content": "Tell me a story"}],
"stream": true
}'
```
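The same chat calls can be made from Python with only the standard library. This is a minimal sketch, assuming `/api/chat` relays Ollama's native streaming shape: one JSON object per line, text in `message.content`, and `"done": true` on the last chunk. The Features list describes streaming as SSE, so the parser also tolerates an SSE-style `data:` prefix; check `docs/API.md` for the exact framing. `build_chat_request` and `iter_chat_text` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def build_chat_request(base_url: str, token: str, model: str,
                       messages: list[dict], stream: bool = True) -> urllib.request.Request:
    """Build (but do not send) a POST /api/chat request with bearer auth."""
    body = json.dumps({"model": model, "messages": messages, "stream": stream}).encode()
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/chat",
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


def iter_chat_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield message text from a streamed response, one JSON chunk per line."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if line.startswith("data:"):
            line = line[len("data:"):].strip()  # tolerate SSE framing
        if not line:
            continue
        chunk = json.loads(line)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # final chunk: stop reading


# Usage against a running gateway (TOKEN is a user's JWT):
#   req = build_chat_request("http://localhost:8000", TOKEN, "gpt-oss:120b",
#                            [{"role": "user", "content": "Tell me a story"}])
#   with urllib.request.urlopen(req) as resp:   # iterating resp yields lines
#       for piece in iter_chat_text(resp):
#           print(piece, end="", flush=True)
```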
### Admin Endpoints
**Users**
```bash
# Create user
curl -X POST http://localhost:8000/admin/users \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"username": "john"}'
# List users
curl http://localhost:8000/admin/users \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Refresh token
curl -X PUT http://localhost:8000/admin/users/john/token \
-H "Authorization: Bearer $ADMIN_TOKEN"
```
**Model Assignment**
```bash
# Assign specific models
curl -X POST http://localhost:8000/admin/users/john/models \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"models": ["gpt-oss:120b", "deepseek-v3.1:671b"]}'
# Grant access to all models
curl -X POST http://localhost:8000/admin/users/john/models/all \
-H "Authorization: Bearer $ADMIN_TOKEN"
```
**User Limits**
```bash
# Set limits (null = unlimited)
curl -X POST http://localhost:8000/admin/users/john/limits \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"request_limit": 1000, "token_limit": 1000000}'
```
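How limits like these are typically enforced can be sketched as below. This is a hypothetical illustration, not the gateway's code: it assumes counters reset daily (the Features list describes limits as requests / tokens per day) and treats `None`, the JSON `null` above, as unlimited. The `Limits`, `Usage`, `check_limits` and `record` names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Limits:
    request_limit: Optional[int] = None  # None = unlimited
    token_limit: Optional[int] = None    # None = unlimited


@dataclass
class Usage:
    requests_today: int = 0
    tokens_today: int = 0


def check_limits(usage: Usage, limits: Limits) -> Optional[str]:
    """Return a rejection reason if the next request must be refused, else None."""
    if limits.request_limit is not None and usage.requests_today >= limits.request_limit:
        return f"daily request limit reached ({limits.request_limit})"
    if limits.token_limit is not None and usage.tokens_today >= limits.token_limit:
        return f"daily token limit reached ({limits.token_limit})"
    return None


def record(usage: Usage, prompt_tokens: int, completion_tokens: int) -> None:
    """Count a finished request; token totals are only known after the response."""
    usage.requests_today += 1
    usage.tokens_today += prompt_tokens + completion_tokens
```

Because token counts arrive only after a response completes, a pre-request check like this can let the last request overshoot `token_limit`; it blocks the request after that.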
**Model Mappings**
```bash
# Create mapping with context length
curl -X POST http://localhost:8000/admin/model-mappings \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"display_name": "gpt-oss:120b",
"real_name": "gpt-oss:120b-cloud",
"context_length": 128000,
"capabilities": ["completion", "tools"]
}'
# List
curl http://localhost:8000/admin/model-mappings \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Delete
curl -X DELETE http://localhost:8000/admin/model-mappings/gpt-oss:120b \
-H "Authorization: Bearer $ADMIN_TOKEN"
```
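Scripts that drive the admin API can wrap these calls in a single helper. A minimal sketch, assuming only what the curl examples show (bearer `ADMIN_TOKEN`, JSON bodies); `admin_request` and `mapping_path` are hypothetical names. The display name is percent-encoded as a path segment while `:` stays literal, as in the curl example.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def admin_request(base_url: str, admin_token: str, method: str, path: str,
                  payload: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) an authenticated /admin request."""
    headers = {"Authorization": f"Bearer {admin_token}"}
    data = None
    if payload is not None:
        data = json.dumps(payload).encode()
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        f"{base_url.rstrip('/')}{path}", data=data, headers=headers, method=method
    )


def mapping_path(display_name: str) -> str:
    """Path for one mapping; percent-encode unsafe characters but keep ':'."""
    return "/admin/model-mappings/" + urllib.parse.quote(display_name, safe=":")


# Send with urllib.request.urlopen(req) and decode the body with json.load(resp).
```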
**Nodes**
```bash
# Add node
curl -X POST http://localhost:8000/admin/nodes \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "main", "base_url": "http://localhost:11434", "priority": 100}'
# Toggle activation
curl -X PATCH http://localhost:8000/admin/nodes/1/toggle \
-H "Authorization: Bearer $ADMIN_TOKEN"
```
**Model Groups**
```bash
# Create group
curl -X POST http://localhost:8000/admin/model-groups \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"name": "coding", "strategy": "round_robin", "description": "Code models"}'
# Add member
curl -X POST http://localhost:8000/admin/model-groups/coding/members \
-H "Authorization: Bearer $ADMIN_TOKEN" \
-H "Content-Type: application/json" \
-d '{"model_display_name": "qwen3-coder:480b", "priority": 1}'
```
### OpenAI Compatible
| Method | Endpoint | Description |
|---|---|---|
| `POST` | `/v1/chat/completions` | Chat completions (OpenAI format) |
| `GET` | `/v1/models` | Model list (OpenAI format) |
**Example — OpenAI Compatible**
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-oss:120b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
```
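With `"stream": true`, OpenAI-format clients receive Server-Sent Events: each `data:` line carries a `chat.completion.chunk` whose `choices[].delta` holds the next piece of text, and the stream ends with `data: [DONE]`. That framing comes from the OpenAI Chat Completions API; the sketch below assumes the gateway follows it and handles only text deltas, not `tool_calls` deltas.

```python
import json
from typing import Iterable, Iterator


def iter_openai_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style chat.completion.chunk SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, ": comment" keep-alives and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:  # the first chunk usually carries only the role
                yield text
```

Because the endpoint is OpenAI-compatible, the official `openai` Python SDK should also work when constructed with `base_url="http://localhost:8000/v1"` and the user's token as `api_key`.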
---
## Model Mapping & Routing
**Display Name → Real Name**
```
Client sends: gpt-oss:120b
Proxy looks up: gpt-oss:120b → gpt-oss:120b-cloud
Ollama receives: gpt-oss:120b-cloud
```
**Real Name → Display Name**
```
Ollama returns: gpt-oss:120b-cloud
Proxy translates: gpt-oss:120b-cloud → gpt-oss:120b
Client sees: gpt-oss:120b
```
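The two lookups above amount to a dictionary and its inverse. A minimal sketch, assuming a one-to-one mapping and pass-through for unmapped names; this `ModelMapper` class is illustrative and is not the `ModelMappingManager` in `app/config.py`.

```python
class ModelMapper:
    """Bidirectional display-name <-> real-name lookup (one-to-one)."""

    def __init__(self, mappings: dict[str, str]):
        self.to_real = dict(mappings)
        self.to_display = {real: display for display, real in mappings.items()}
        if len(self.to_display) != len(self.to_real):
            raise ValueError("two display names map to the same real name")

    def outbound(self, name: str) -> str:
        """Client -> provider: unknown names pass through unchanged."""
        return self.to_real.get(name, name)

    def inbound(self, name: str) -> str:
        """Provider -> client: show the display name instead of the real one."""
        return self.to_display.get(name, name)

    def rewrite_response(self, payload: dict) -> dict:
        """Return a copy of a response with its "model" field translated back."""
        if "model" in payload:
            payload = {**payload, "model": self.inbound(payload["model"])}
        return payload
```

Keeping the mapping one-to-one is what makes the reverse lookup unambiguous; if several display names shared one real name, the reverse step would need extra context, such as the name the client originally sent.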
**Model Groups**
If the requested model is a group, the gateway resolves it dynamically:
1. Detect if the request needs vision (image content in messages).
2. Filter members by capability tags (`vision`, `tools`).
3. Pick a member using the group's strategy:
   - `round_robin` — cycle through members
   - `weighted` — weighted random selection
   - `priority` — always pick the member with the lowest priority number
4. If the selected model fails, retry with the next member in priority order.
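The four steps can be sketched as follows. This is an illustrative sketch, not `ModelGroupManager`: the `Member` fields, the weight semantics and the capability checks are assumptions. A request is treated as needing vision if any message has an Ollama-style `images` list or an OpenAI-style `image_url` content part, and as needing tools if the request carries a non-empty `tools` list.

```python
import itertools
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Member:
    name: str
    priority: int                 # lower number = preferred
    weight: int = 1               # used only by the "weighted" strategy
    capabilities: set[str] = field(default_factory=set)


def needs_vision(messages: list[dict]) -> bool:
    """Step 1: look for images in Ollama- or OpenAI-style messages."""
    for msg in messages:
        if msg.get("images"):
            return True
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "image_url" for part in content
        ):
            return True
    return False


class ModelGroup:
    def __init__(self, members: list[Member], strategy: str,
                 rng: Optional[random.Random] = None):
        self.members = sorted(members, key=lambda m: m.priority)
        self.strategy = strategy
        self.rng = rng or random.Random()
        self._counter = itertools.count()  # round-robin position

    def candidates(self, messages: list[dict], tools: Optional[list] = None) -> list[Member]:
        """Steps 2-4: filter by capability, pick one, then list fallbacks by priority."""
        pool = self.members
        if needs_vision(messages):
            pool = [m for m in pool if "vision" in m.capabilities]
        if tools:
            pool = [m for m in pool if "tools" in m.capabilities]
        if not pool:
            return []
        if self.strategy == "round_robin":
            first = pool[next(self._counter) % len(pool)]
        elif self.strategy == "weighted":
            first = self.rng.choices(pool, weights=[m.weight for m in pool])[0]
        elif self.strategy == "priority":
            first = pool[0]
        else:
            raise ValueError(f"unknown strategy: {self.strategy}")
        # Chosen member first; the rest follow in priority order for retries (step 4).
        return [first] + [m for m in pool if m is not first]
```

Returning an ordered candidate list rather than a single member lets the caller implement step 4 as a simple loop: try each entry until one succeeds.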
---
## Troubleshooting
**Restart the full stack**
```bash
docker compose -f docker-compose.dev.yml down
docker compose -f docker-compose.dev.yml up --build -d
```
**Run migrations manually**
```bash
docker exec maestro alembic upgrade head
```
**Re-run seeds**
```bash
docker exec maestro python -m app.seeder --reset
docker exec maestro python -m app.seeder
```
**Clear cache**
```bash
docker exec maestro python scripts/clear_cache.py
```
**Check PostgreSQL health**
```bash
docker exec maestro-postgres pg_isready -U maestro_user -d maestro
```
**Check Redis**
```bash
docker exec maestro-redis redis-cli ping
```
**View logs**
```bash
# All services
docker compose -f docker-compose.dev.yml logs -f
# API only
docker compose -f docker-compose.dev.yml logs -f maestro
# Frontend only
docker compose -f docker-compose.dev.yml logs -f frontend
```
---
## Development
### Project Structure
```
model-maestro/
├── app/
│   ├── main.py               # FastAPI app, routers, docs auth
│   ├── proxy.py              # Proxy logic, model routing, failover, tool call parsing
│   ├── config.py             # Settings, ModelMappingManager, ModelGroupManager
│   ├── auth.py               # JWT authentication
│   ├── models.py             # Pydantic request/response models
│   ├── models_db.py          # SQLAlchemy ORM models
│   ├── database.py           # Async DB engine & session maker
│   ├── redis.py              # Redis client & queue
│   ├── load_balancer.py      # Node selection algorithms
│   ├── node_manager.py       # Health checks, discovery, node CRUD
│   ├── user_manager.py       # User CRUD
│   ├── background_tasks.py   # Activity log processor, health checks, model warmup
│   ├── openclaw.py           # OpenClaw integration
│   ├── admin*.py             # Admin API routers
│   ├── repositories/         # Data access layer
│   ├── services/             # Business logic layer
│   └── seeds/                # DB seed migrations
├── frontend/
│   ├── src/app/              # Next.js App Router pages
│   ├── src/components/       # React components (sidebar, shell, etc.)
│   └── public/               # Static assets (logo, favicon)
├── docs/                     # Documentation (architecture, API, setup)
├── alembic/                  # Alembic migrations
├── tests/                    # pytest suite
├── docker-compose.dev.yml    # Dev stack (PG + Redis + API + Frontend)
├── docker-compose.yml        # Production stack (API + Frontend only)
└── Dockerfile                # FastAPI container
```
### Running Tests
```bash
python -m pytest tests/ -v
```
### Lint & Format
```bash
# Backend
python -m black app/
python -m ruff check app/
# Frontend
cd frontend && npm run lint
```
---
## Documentation
- [`docs/ARCHITECTURE.md`](docs/ARCHITECTURE.md) — System architecture, request flow, database schema
- [`docs/API.md`](docs/API.md) — Complete API reference with all endpoints, requests and responses
- [`docs/SETUP.md`](docs/SETUP.md) — Detailed setup guide, environment variables, production deployment
- [`QUICKSTART.md`](QUICKSTART.md) — Get running in under 5 minutes
---
## License
MIT