An open API service indexing awesome lists of open source software.

https://github.com/nullata/llamaman

A browser-based UI for launching, monitoring, and managing multiple llama.cpp server instances from inside a Docker container. Includes an Ollama-compatible API proxy
https://github.com/nullata/llamaman

frontend llamacpp llm llm-inference llm-infrastructure llm-manager llm-proxy proxy rest-api

Last synced: 3 months ago
JSON representation

A browser-based UI for launching, monitoring, and managing multiple llama.cpp server instances from inside a Docker container. Includes an Ollama-compatible API proxy

Awesome Lists containing this project

README

          

# logo LlamaMan


LlamaMan

A browser-based UI for launching, monitoring, and managing multiple [llama.cpp](https://github.com/ggerganov/llama.cpp) server instances from inside a Docker container. Includes an Ollama-compatible API proxy so it works as a drop-in replacement for Ollama with [Open WebUI](https://github.com/open-webui/open-webui).

## Features

- **Model library** - scans `/models` for GGUF files, shows quant type and file size
- **One-click launch** - configure GPU layers, context size, threads, multi-GPU, extra args
- **Preset configs** - save/load per-model launch settings
- **Download manager** - pull models from HuggingFace with speed throttling and auto-retry on failure
- **Instance management** - stop, restart, remove, view live-streamed logs
- **GPU VRAM indicator** - per-GPU usage bars via nvidia-smi or rocm-smi
- **Idle timeout** - auto-sleep instances after configurable idle period, wake on next request
- **Ollama-compatible proxy** - OpenWebUI discovers models and auto-starts servers on demand
- **Authentication** - user accounts with session login, API key management with bearer tokens
- **Require auth toggle** - enforce bearer token authentication on all endpoints (including model loading) or leave model endpoints open
- **Persistent state** - instance history and configs survive container restarts
- **Storage backends** - JSON files (default) or MariaDB/MySQL via SQLAlchemy
- **Proxy sampling overrides** - force temperature, top-k, top-p, and presence penalty on all proxied requests, configurable per model preset

## Requirements

- Docker with **one** of:
- [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) (for CUDA / NVIDIA GPUs)
- [ROCm-compatible setup](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) (for AMD GPUs) - **experimental, not tested**
- A supported GPU (llama.cpp can offload to CPU/RAM when VRAM is insufficient)

## Quick Start

**NVIDIA (CUDA):**

```bash
docker compose up --build
```

**AMD (ROCm)** - experimental, not tested:

```bash
docker compose --profile rocm up --build llamaman-rocm
```

- **Management UI**: http://localhost:5000
- **Llamaman proxy** (Ollama-compatible API): http://localhost:42069
- **llama-server public instance ports**: 8000-8020

On first launch, visit the UI to create an admin account via `/setup`.

## Authentication

LlamaMan has a built-in auth system with two layers:

### User accounts (session-based)

On first launch, `/setup` lets you create an admin account. After that, all browser access requires login. Session cookies authenticate UI requests.

### API keys (bearer tokens)

Create API keys in the **API Keys** section of the UI. External clients (OpenWebUI, scripts, etc.) authenticate with:

```
Authorization: Bearer llm-xxxxxxxxxx
```

### Require authentication toggle

The **"Require authentication for all endpoints"** toggle (on by default) controls whether model-serving endpoints require a bearer token:

| Toggle | Model endpoints (`/api/chat`, `/v1/chat/completions`, etc.) | Management endpoints (`/api/instances`, etc.) | Per-instance proxy ports |
|--------|--------------------------------------------------------------|-----------------------------------------------|--------------------------|
| **ON** (default) | Bearer token required | Bearer token or session required | Bearer token required |
| **OFF** | Open (no auth) | Bearer token or session required | Open (no auth) |

When the toggle is **ON**, all three port surfaces are protected:
- **Port 5000** (management UI + API) - Flask `before_request` hook
- **Port 42069** (Ollama-compatible proxy) - same Flask app, same hook
- **Ports 8000-8020** (per-instance proxies) - WSGI-level auth check

### OpenWebUI with authentication

When `require_auth` is on, configure OpenWebUI to send a valid API key:

```yaml
open-webui:
environment:
- OLLAMA_BASE_URL=http://llamaman:42069
- OPENAI_API_BASE_URLS=http://llamaman:42069/v1
- OPENAI_API_KEYS=llm-your-api-key-here
```

## Models

Place models inside the `models/` volume:

- **GGUF files**: any `.gguf` file (recommended - llama.cpp native format)
- **HuggingFace repos**: directories containing `config.json`

Or use the **Download** button in the UI to pull from HuggingFace.

## Launching Instances

1. Select a model from the sidebar
2. Configure launch settings (GPU layers, context size, idle timeout, etc.)
3. Click **Launch** - the instance appears with a status badge
4. Optionally click **Save Preset** to remember settings for that model

Each instance exposes an OpenAI-compatible API on its assigned port.

### Layer autodetection

When you select a GGUF model, LlamaMan reads the file's metadata to detect the total number of layers (block count). This is displayed next to the **GPU Layers** input so you can see exactly how many layers are available to offload (e.g. `/ 32`). Set GPU Layers to `-1` to offload all layers to GPU.

### Launch settings reference

| Setting | Default | Description |
|---|---|---|
| **GPU Layers** | `-1` | Number of layers to offload to GPU. `-1` = all layers, `0` = CPU only. Total layers are autodetected from the GGUF file. |
| **Context Size** | `4096` | Maximum context window in tokens (`--ctx-size`). |
| **Parallel** | `1` | Number of parallel sequences the llama-server can process simultaneously (`--parallel`). Controls KV cache slot allocation inside the server itself. |
| **Idle Timeout min** | `0` | Minutes of inactivity before the server is stopped to free VRAM. `0` = disabled. See [Idle Timeout](#idle-timeout). |
| **Max Concurrent** | `0` | Maximum number of inference requests allowed in-flight at once. `0` = unlimited. When set, incoming requests are queued and gated by a semaphore. |
| **Max Queue Depth** | `200` | Maximum number of requests that can wait in the queue when `Max Concurrent` is active. Requests beyond this limit are rejected with HTTP 429. |
| **Share Queue** | off | When enabled, multiple proxy-managed instances of the **same model** share a single request queue. Incoming requests are distributed across instances as slots become available, providing simple load balancing. |
| **Embedding Model** | off | Marks the instance as an embedding model. Embedding instances are **excluded** from the `LLAMAMAN_MAX_MODELS` count and will never be evicted by the proxy's LRU policy. |
| **GPU Devices** | `0` | Comma-separated GPU indices for multi-GPU setups (e.g. `0,1`). |
| **Extra Args** | _(empty)_ | Additional flags passed directly to llama-server (e.g. `--flash-attn`). |
| **Proxy Sampling Overrides** | off | When enabled, the proxy forces the configured sampling parameters on every request forwarded to this instance, regardless of what the client sends. |
| **Temperature** | `0.8` | Sampling temperature to enforce (range: `0.0`–`2.0`). Only active when proxy sampling overrides are enabled. |
| **Top K** | `40` | Top-k sampling value to enforce (min: `0`). Only active when proxy sampling overrides are enabled. |
| **Top P** | `0.95` | Top-p (nucleus) sampling value to enforce (range: `0.01`–`1.0`). Only active when proxy sampling overrides are enabled. |
| **Presence Penalty** | `0.0` | Presence penalty to enforce (range: `-2.0`–`2.0`). Only active when proxy sampling overrides are enabled. |

### Concurrency and queueing

When **Max Concurrent** is set to a value greater than 0, LlamaMan places a concurrency gate in front of the instance. Requests that exceed the limit are held in a FIFO queue (up to **Max Queue Depth**). If the queue is also full, new requests are rejected with HTTP 429.

The gate tracks active and queued request counts, which are visible in the instance list via the API.

**Parallel vs Max Concurrent:** `Parallel` controls how many sequences the llama-server processes internally (KV cache slots). `Max Concurrent` is an external gate that limits how many requests LlamaMan forwards to the server at once. You can use both together - for example, `Parallel=4` with `Max Concurrent=4` ensures the server always has enough KV slots for the requests it receives.

## Idle Timeout

Set **Idle Timeout min** in the launch form (0 = disabled). When enabled:

- The manager proxies the instance port (transparent to clients)
- After N minutes of no requests, the llama-server is stopped to free VRAM
- On the next request, the server auto-relaunches with the same config
- Client sees the same port/API with just a cold-start delay

For instances managed by the llamaman proxy (OpenWebUI), use the `LLAMAMAN_IDLE_TIMEOUT` env var instead.

## Download Settings

The UI provides download-related options under **Settings >> Download Settings**:

- **Auto-retry failed downloads** - automatically retries downloads that fail due to network errors or interruptions. Off by default.
- **Retry count per failed download** - how many times to retry before marking a download as permanently failed (default: 3, min: 1). Only active when auto-retry is enabled.

## Cleanup Settings

The UI provides automatic cleanup under **Settings >> Cleanup Settings**:

- **Auto-clean completed/failed downloads** - removes download records older than a configurable number of hours (default: 24). Only affects completed, failed, or cancelled downloads - active downloads are never touched.
- **Auto-clean stopped instances** - removes stopped instance records older than a configurable number of hours (default: 24). Only affects stopped instances - running instances are never removed.
- **Auto-remove stale instance records** - periodically checks all `starting`/`healthy`/`sleeping` instance records against their actual OS process. Records whose backing process is no longer alive are marked stopped. Configurable check interval (default: 5 minutes). Useful for catching crashes the normal health-check loop may have missed.

Cleanup runs periodically in the background. These settings only remove or update records in the UI/state - they do not delete model files.

## OpenWebUI Integration (llamaman proxy)

The llamaman proxy exposes an Ollama-compatible API on port **42069** (configurable). Point OpenWebUI at it:

```yaml
open-webui:
environment:
- OLLAMA_BASE_URL=http://llamaman:42069
```

**How it works:**

1. OpenWebUI calls `/api/tags` -> LlamaMan returns all available GGUF models
2. User selects a model in OpenWebUI -> `/api/chat` request arrives
3. LlamaMan auto-launches a llama-server (using saved preset or defaults)
4. Waits for healthy, then proxies the request with format translation
5. When `LLAMAMAN_MAX_MODELS` limit is reached, the least-recently-used **Ollama-managed** model is evicted. Admin UI launched models are never evicted by the Ollama API by default (see [Model eviction policy](#model-eviction-policy))

Supported Ollama endpoints: `/api/tags`, `/api/chat`, `/api/generate`, `/api/show`, `/api/version`, `/api/ps`

Also supports OpenAI-compatible endpoints with auto-start: `/v1/models`, `/v1/chat/completions`

### Model eviction policy

The `LLAMAMAN_MAX_MODELS` limit controls how many **chat** models the proxy will keep loaded simultaneously. When a new model is requested and the limit is reached, the least-recently-used (LRU) chat model is evicted to make room.

#### Priority rules

Admin UI launched models have ultimate priority. The two API surfaces have different eviction rights:

| Launcher | Eviction behaviour | Cannot evict |
|----------|--------------------|--------------|
| **Admin UI** | Evicts Ollama-managed models first (LRU), then admin UI models if needed | - |
| **Ollama API** (`/api/chat`, `/api/generate`) | Evicts Ollama-managed models (LRU) | Admin UI launched models (by default) |
| **OpenAI API** (`/v1/chat/completions`) | No eviction - starts model only if a slot is free | Everything |

If the cap is full, requests that cannot evict return HTTP 503:
```
model limit reached (LLAMAMAN_MAX_MODELS=N); admin-launched models cannot be evicted via the API
model limit reached (LLAMAMAN_MAX_MODELS=N); the OpenAI API does not evict running models
```

#### App Settings toggles

Two toggles in **Settings >> App Settings** control eviction behaviour:

- **Enforce `LLAMAMAN_MAX_MODELS` for admin UI launches** - when on, the admin UI silently evicts the LRU model (Ollama-managed first) before launching. When off (default), the UI prompts you to confirm before exceeding the cap.
- **Allow Ollama API to evict admin-launched models** - when on, the Ollama API can also evict admin UI launched models as a fallback if no Ollama-managed models are available to evict. Off by default. Has no effect on the OpenAI API, which never evicts.

#### Other details

- **All running instances count toward the limit** - both admin UI and proxy-managed instances. If you manually launch 2 models and `LLAMAMAN_MAX_MODELS=1`, the proxy sees you are already over the limit.
- **Embedding models are excluded.** Instances marked as **Embedding Model** do not count toward the limit and are never evicted. This lets you keep an embedding model loaded permanently alongside your chat models.
- **`LLAMAMAN_MAX_MODELS=0` (default) disables eviction entirely.** The proxy will launch models on demand without ever stopping existing ones.

## Storage Backends

### JSON (default)

Zero-config. Stores data in JSON files under `DATA_DIR` (`/data`):
- `state.json` - instances and downloads
- `presets.json` - per-model launch presets
- `users.json` - user accounts
- `settings.json` - global settings
- `api_keys.json` - API key hashes

Instance and download logs are written to `LOGS_DIR` (`/tmp/llama-logs`), which is separate from persistent data.

### MariaDB / MySQL

Set `DATABASE_URL` to enable:

```yaml
environment:
- DATABASE_URL=mysql+pymysql://user:password@host:3306/llamaman
```

Tables are auto-created on first connection. Requires `sqlalchemy` and `pymysql` (included in requirements).

## Environment Variables

| Variable | Default | Description |
|---|---|---|
| `MODELS_DIR` | `/models` | Directory scanned for model files |
| `DATA_DIR` | `/data` | Directory for persistent config/state (JSON files) |
| `LOGS_DIR` | `/tmp/llama-logs` | Directory for instance and download logs |
| `PORT_RANGE_START` | `8000` | Start of public llama-server/proxy port pool |
| `PORT_RANGE_END` | `8020` | End of public llama-server/proxy port pool |
| `INTERNAL_PORT_RANGE_START` | `9000` | Start of internal llama-server port pool used when proxy mode is enabled |
| `INTERNAL_PORT_RANGE_END` | `9020` | End of internal llama-server port pool used when proxy mode is enabled |
| `LLAMAMAN_PROXY_PORT` | `42069` | Port for the Ollama-compatible proxy |
| `LLAMAMAN_MAX_MODELS` | `0` | Max concurrent **chat** models via the proxy (LRU eviction, 0 = unlimited) |
| `LLAMAMAN_IDLE_TIMEOUT` | `0` | Idle timeout in minutes for proxy-managed instances (0 = disabled) |
| `SECRET_KEY` | _(auto)_ | Flask session secret. Auto-derived from machine-id if unset. Set this for multi-replica deployments. |
| `DATABASE_URL` | _(unset)_ | MariaDB/MySQL connection string. Unset = use JSON files. |
| `HEALTH_CHECK_TIMEOUT` | `3` | Timeout in seconds for instance health checks |
| `MODEL_LOAD_TIMEOUT` | `300` | Seconds to wait for a model to become healthy during launch/relaunch. Increase for very large models. |
| `REQUEST_TIMEOUT` | `300` | Timeout in seconds for upstream requests to llama-server and gate acquire waits. Increase if requests are being cut off under heavy concurrency. |

## REST API

All endpoints return and accept JSON.

**Authentication:** Management endpoints require either a session cookie (from browser login) or an `Authorization: Bearer ` header. When `require_auth` is enabled (default), model-serving endpoints also require a bearer token.

### Authentication

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/login` | Login page |
| `POST` | `/login` | Authenticate (`username`, `password` form data) |
| `GET` | `/setup` | First-run setup page |
| `POST` | `/setup` | Create first user account |
| `GET` | `/logout` | End session |

### API Keys

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/api-keys` | List all API keys (hashes stripped) |
| `POST` | `/api/api-keys` | Create a new API key (`{"name": "..."}`) |
| `DELETE` | `/api/api-keys/` | Revoke an API key |

### Instances

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/instances` | List all instances |
| `POST` | `/api/instances` | Launch a new instance |
| `GET` | `/api/instances/` | Get a single instance |
| `DELETE` | `/api/instances/` | Stop and remove an instance |
| `POST` | `/api/instances//restart` | Restart a stopped/sleeping instance |
| `DELETE` | `/api/instances//remove` | Remove a stopped instance from the list |
| `GET` | `/api/instances//logs` | Last N log lines |
| `GET` | `/api/instances//logs/stream` | SSE live log tail |
| `GET` | `/api/next-port` | Get next available port from the pool |

**Launch body** (`POST /api/instances`):
```json
{
"model_path": "/models/my-model.gguf",
"port": 8000,
"n_gpu_layers": -1,
"ctx_size": 4096,
"threads": null,
"parallel": null,
"extra_args": "--flash-attn",
"gpu_devices": "0",
"idle_timeout_min": 0,
"max_concurrent": 0,
"max_queue_depth": 200,
"share_queue": false,
"proxy_sampling_override_enabled": false,
"proxy_sampling_temperature": 0.8,
"proxy_sampling_top_k": 40,
"proxy_sampling_top_p": 0.95,
"proxy_sampling_presence_penalty": 0.0
}
```

### Downloads

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/downloads` | List all downloads |
| `POST` | `/api/downloads` | Start a new download |
| `GET` | `/api/downloads/` | Get a single download |
| `DELETE` | `/api/downloads/` | Cancel an active download |
| `DELETE` | `/api/downloads//remove` | Remove a completed/failed entry |
| `GET` | `/api/downloads//logs` | Download log output |
| `GET` | `/api/downloads//logs/stream` | SSE live log tail |

**Download body** (`POST /api/downloads`):
```json
{
"repo_id": "bartowski/Mistral-7B-Instruct-v0.3-GGUF",
"filename": "Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
"hf_token": "hf_...",
"speed_limit_mbps": 0
}
```

Leave `filename` blank to download the full repository.

### Models

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/models` | List discovered models in `MODELS_DIR` (includes `repo_id` when source is known) |
| `POST` | `/api/models/delete` | Delete a model from disk (`{"path": "/models/..."}`) |
| `GET` | `/api/model-layers?path=` | Read layer count from GGUF metadata |
| `GET` | `/api/disk-space` | Free/used space on the models volume |

### Presets

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/presets` | List all saved presets |
| `GET` | `/api/presets/` | Get preset for a model |
| `PUT` | `/api/presets/` | Save/update a preset |
| `DELETE` | `/api/presets/` | Delete a preset |

### Settings

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/settings` | Get global settings |
| `POST` | `/api/settings` | Save global settings |

**Settings body** (`POST /api/settings`):
```json
{
"require_auth": true,
"admin_ui_enforce_max_models": false,
"allow_ollama_api_override_admin": false,
"auto_retry_failed_downloads": false,
"retry_count_per_failed_download": 3,
"cleanup": {
"downloads_enabled": true,
"downloads_max_age_hours": 24,
"downloads_last_run_at": 1710000000,
"instances_enabled": false,
"instances_max_age_hours": 48,
"instances_last_run_at": 1710000000,
"stale_records_enabled": false,
"stale_records_interval_min": 5,
"stale_records_last_run_at": null
}
}
```

### System

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/system-info` | CPU usage, core count, RAM usage |
| `GET` | `/api/gpu-info` | Per-GPU VRAM usage via nvidia-smi |
| `GET` | `/health` | Health check (`{"status": "ok"}`) - always open, no auth required |

### Ollama-compatible (llamaman)

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/api/tags` | List available models (Ollama format) |
| `GET` | `/api/version` | Version info |
| `POST` | `/api/show` | Model metadata |
| `GET` | `/api/ps` | Running models |
| `POST` | `/api/chat` | Chat completion (auto-starts model) |
| `POST` | `/api/generate` | Text generation (auto-starts model) |
| `GET` | `/v1/models` | List models (OpenAI format) |
| `POST` | `/v1/chat/completions` | Chat completion (OpenAI format, auto-starts model) |

## Troubleshooting

| Symptom | Fix |
|---|---|
| _"llama-server binary not found"_ | The base image must be `ghcr.io/ggml-org/llama.cpp:server-cuda` (or `server-rocm` for AMD). Rebuild with `--no-cache`. |
| Instance stuck on **starting** | Check logs via the Logs button. Common causes: OOM, model path typo, corrupt GGUF. |
| No GPU / CUDA error | Ensure the NVIDIA Container Toolkit is installed and `docker run --gpus all` works. |
| No GPU / ROCm error | Ensure `/dev/kfd` and `/dev/dri` exist on the host and your user is in the `video`/`render` groups. The ROCm image is experimental and not tested. |
| Port conflict | The form auto-suggests an unused port; adjust if needed. |
| Model not showing in OpenWebUI | Ensure `OLLAMA_BASE_URL` points to `http://llamaman:42069`. Check `/api/tags` returns models. |
| OpenWebUI gets 401 errors | `require_auth` is on (default). Create an API key in the UI and set `OPENAI_API_KEYS` in OpenWebUI's environment. |
| _"API key required"_ on all requests | Either create an API key, or turn off the "Require authentication" toggle in the API Keys section. |

## Credits

This work would not be possible without the work of [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp)

## License

LlamaMan is licensed under the [Elastic License 2.0](LICENSE). You may use, copy, distribute, and modify the software, subject to the following limitations:

- You may **not** provide it as a hosted or managed service
- You may **not** remove or circumvent license key functionality
- You may **not** alter or remove licensing, copyright, or other notices