{"id":47698566,"url":"https://github.com/nullata/llamaman","last_synced_at":"2026-04-02T16:59:35.909Z","repository":{"id":346060034,"uuid":"1188377910","full_name":"nullata/llamaman","owner":"nullata","description":"A browser-based UI for launching, monitoring, and managing multiple llama.cpp server instances from inside a Docker container. Includes an Ollama-compatible API proxy","archived":false,"fork":false,"pushed_at":"2026-03-29T17:48:24.000Z","size":641,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-29T18:48:16.022Z","etag":null,"topics":["frontend","llamacpp","llm","llm-inference","llm-infrastructure","llm-manager","llm-proxy","proxy","rest-api"],"latest_commit_sha":null,"homepage":"https://nickscripts.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nullata.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-22T01:44:27.000Z","updated_at":"2026-03-29T17:46:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/nullata/llamaman","commit_stats":null,"previous_names":["nullata/llamaman"],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/nullata/llamaman","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullata%2Fllamaman","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullata%2Fllamaman/tags","releas
es_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullata%2Fllamaman/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullata%2Fllamaman/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nullata","download_url":"https://codeload.github.com/nullata/llamaman/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nullata%2Fllamaman/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31310980,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T12:59:32.332Z","status":"ssl_error","status_checked_at":"2026-04-02T12:54:48.875Z","response_time":89,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["frontend","llamacpp","llm","llm-inference","llm-infrastructure","llm-manager","llm-proxy","proxy","rest-api"],"created_at":"2026-04-02T16:59:35.343Z","updated_at":"2026-04-02T16:59:35.902Z","avatar_url":"https://github.com/nullata.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# \u003cimg src=\"static/images/logo.svg\" alt=\"logo\" width=\"24\"\u003e LlamaMan\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/llamaman.jpg\" alt=\"LlamaMan\" width=\"400\"\u003e\n\u003c/p\u003e\n\nA browser-based UI for launching, monitoring, and 
managing multiple [llama.cpp](https://github.com/ggerganov/llama.cpp) server instances from inside a Docker container. Includes an Ollama-compatible API proxy so it works as a drop-in replacement for Ollama with [Open WebUI](https://github.com/open-webui/open-webui).\n\n## Features\n\n- **Model library** - scans `/models` for GGUF files, shows quant type and file size\n- **One-click launch** - configure GPU layers, context size, threads, multi-GPU, extra args\n- **Preset configs** - save/load per-model launch settings\n- **Download manager** - pull models from HuggingFace with speed throttling and auto-retry on failure\n- **Instance management** - stop, restart, remove, view live-streamed logs\n- **GPU VRAM indicator** - per-GPU usage bars via nvidia-smi or rocm-smi\n- **Idle timeout** - auto-sleep instances after configurable idle period, wake on next request\n- **Ollama-compatible proxy** - OpenWebUI discovers models and auto-starts servers on demand\n- **Authentication** - user accounts with session login, API key management with bearer tokens\n- **Require auth toggle** - enforce bearer token authentication on all endpoints (including model loading) or leave model endpoints open\n- **Persistent state** - instance history and configs survive container restarts\n- **Storage backends** - JSON files (default) or MariaDB/MySQL via SQLAlchemy\n- **Proxy sampling overrides** - force temperature, top-k, top-p, and presence penalty on all proxied requests, configurable per model preset\n\n## Requirements\n\n- Docker with **one** of:\n  - [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html) (for CUDA / NVIDIA GPUs)\n  - [ROCm-compatible setup](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/) (for AMD GPUs) - **experimental, not tested**\n- A supported GPU (llama.cpp can offload to CPU/RAM when VRAM is insufficient)\n\n## Quick Start\n\n**NVIDIA (CUDA):**\n\n```bash\ndocker compose up 
--build\n```\n\n**AMD (ROCm)** - experimental, not tested:\n\n```bash\ndocker compose --profile rocm up --build llamaman-rocm\n```\n\n- **Management UI**: http://localhost:5000\n- **Llamaman proxy** (Ollama-compatible API): http://localhost:42069\n- **llama-server public instance ports**: 8000-8020\n\nOn first launch, visit the UI to create an admin account via `/setup`.\n\n## Authentication\n\nLlamaMan has a built-in auth system with two layers:\n\n### User accounts (session-based)\n\nOn first launch, `/setup` lets you create an admin account. After that, all browser access requires login. Session cookies authenticate UI requests.\n\n### API keys (bearer tokens)\n\nCreate API keys in the **API Keys** section of the UI. External clients (OpenWebUI, scripts, etc.) authenticate with:\n\n```\nAuthorization: Bearer llm-xxxxxxxxxx\n```\n\n### Require authentication toggle\n\nThe **\"Require authentication for all endpoints\"** toggle (on by default) controls whether model-serving endpoints require a bearer token:\n\n| Toggle | Model endpoints (`/api/chat`, `/v1/chat/completions`, etc.) | Management endpoints (`/api/instances`, etc.) 
| Per-instance proxy ports |\n|--------|--------------------------------------------------------------|-----------------------------------------------|--------------------------|\n| **ON** (default) | Bearer token required | Bearer token or session required | Bearer token required |\n| **OFF** | Open (no auth) | Bearer token or session required | Open (no auth) |\n\nWhen the toggle is **ON**, all three port surfaces are protected:\n- **Port 5000** (management UI + API) - Flask `before_request` hook\n- **Port 42069** (Ollama-compatible proxy) - same Flask app, same hook\n- **Ports 8000-8020** (per-instance proxies) - WSGI-level auth check\n\n### OpenWebUI with authentication\n\nWhen `require_auth` is on, configure OpenWebUI to send a valid API key:\n\n```yaml\nopen-webui:\n  environment:\n    - OLLAMA_BASE_URL=http://llamaman:42069\n    - OPENAI_API_BASE_URLS=http://llamaman:42069/v1\n    - OPENAI_API_KEYS=llm-your-api-key-here\n```\n\n## Models\n\nPlace models inside the `models/` volume:\n\n- **GGUF files**: any `.gguf` file (recommended - llama.cpp native format)\n- **HuggingFace repos**: directories containing `config.json`\n\nOr use the **Download** button in the UI to pull from HuggingFace.\n\n## Launching Instances\n\n1. Select a model from the sidebar\n2. Configure launch settings (GPU layers, context size, idle timeout, etc.)\n3. Click **Launch** - the instance appears with a status badge\n4. Optionally click **Save Preset** to remember settings for that model\n\nEach instance exposes an OpenAI-compatible API on its assigned port.\n\n### Layer autodetection\n\nWhen you select a GGUF model, LlamaMan reads the file's metadata to detect the total number of layers (block count). This is displayed next to the **GPU Layers** input so you can see exactly how many layers are available to offload (e.g. `/ 32`). 
Set GPU Layers to `-1` to offload all layers to GPU.\n\n### Launch settings reference\n\n| Setting | Default | Description |\n|---|---|---|\n| **GPU Layers** | `-1` | Number of layers to offload to GPU. `-1` = all layers, `0` = CPU only. Total layers are autodetected from the GGUF file. |\n| **Context Size** | `4096` | Maximum context window in tokens (`--ctx-size`). |\n| **Parallel** | `1` | Number of parallel sequences the llama-server can process simultaneously (`--parallel`). Controls KV cache slot allocation inside the server itself. |\n| **Idle Timeout min** | `0` | Minutes of inactivity before the server is stopped to free VRAM. `0` = disabled. See [Idle Timeout](#idle-timeout). |\n| **Max Concurrent** | `0` | Maximum number of inference requests allowed in-flight at once. `0` = unlimited. When set, incoming requests are queued and gated by a semaphore. |\n| **Max Queue Depth** | `200` | Maximum number of requests that can wait in the queue when `Max Concurrent` is active. Requests beyond this limit are rejected with HTTP 429. |\n| **Share Queue** | off | When enabled, multiple proxy-managed instances of the **same model** share a single request queue. Incoming requests are distributed across instances as slots become available, providing simple load balancing. |\n| **Embedding Model** | off | Marks the instance as an embedding model. Embedding instances are **excluded** from the `LLAMAMAN_MAX_MODELS` count and will never be evicted by the proxy's LRU policy. |\n| **GPU Devices** | `0` | Comma-separated GPU indices for multi-GPU setups (e.g. `0,1`). |\n| **Extra Args** | _(empty)_ | Additional flags passed directly to llama-server (e.g. `--flash-attn`). |\n| **Proxy Sampling Overrides** | off | When enabled, the proxy forces the configured sampling parameters on every request forwarded to this instance, regardless of what the client sends. |\n| **Temperature** | `0.8` | Sampling temperature to enforce (range: `0.0`–`2.0`). 
Only active when proxy sampling overrides are enabled. |\n| **Top K** | `40` | Top-k sampling value to enforce (min: `0`). Only active when proxy sampling overrides are enabled. |\n| **Top P** | `0.95` | Top-p (nucleus) sampling value to enforce (range: `0.01`–`1.0`). Only active when proxy sampling overrides are enabled. |\n| **Presence Penalty** | `0.0` | Presence penalty to enforce (range: `-2.0`–`2.0`). Only active when proxy sampling overrides are enabled. |\n\n### Concurrency and queueing\n\nWhen **Max Concurrent** is set to a value greater than 0, LlamaMan places a concurrency gate in front of the instance. Requests that exceed the limit are held in a FIFO queue (up to **Max Queue Depth**). If the queue is also full, new requests are rejected with HTTP 429.\n\nThe gate tracks active and queued request counts, which are visible in the instance list via the API.\n\n**Parallel vs Max Concurrent:** `Parallel` controls how many sequences the llama-server processes internally (KV cache slots). `Max Concurrent` is an external gate that limits how many requests LlamaMan forwards to the server at once. You can use both together - for example, `Parallel=4` with `Max Concurrent=4` ensures the server always has enough KV slots for the requests it receives.\n\n## Idle Timeout\n\nSet **Idle Timeout min** in the launch form (0 = disabled). When enabled:\n\n- The manager proxies the instance port (transparent to clients)\n- After N minutes of no requests, the llama-server is stopped to free VRAM\n- On the next request, the server auto-relaunches with the same config\n- Client sees the same port/API with just a cold-start delay\n\nFor instances managed by the llamaman proxy (OpenWebUI), use the `LLAMAMAN_IDLE_TIMEOUT` env var instead.\n\n## Download Settings\n\nThe UI provides download-related options under **Settings \u003e\u003e Download Settings**:\n\n- **Auto-retry failed downloads** - automatically retries downloads that fail due to network errors or interruptions. 
Off by default.\n- **Retry count per failed download** - how many times to retry before marking a download as permanently failed (default: 3, min: 1). Only active when auto-retry is enabled.\n\n## Cleanup Settings\n\nThe UI provides automatic cleanup under **Settings \u003e\u003e Cleanup Settings**:\n\n- **Auto-clean completed/failed downloads** - removes download records older than a configurable number of hours (default: 24). Only affects completed, failed, or cancelled downloads - active downloads are never touched.\n- **Auto-clean stopped instances** - removes stopped instance records older than a configurable number of hours (default: 24). Only affects stopped instances - running instances are never removed.\n- **Auto-remove stale instance records** - periodically checks all `starting`/`healthy`/`sleeping` instance records against their actual OS process. Records whose backing process is no longer alive are marked stopped. Configurable check interval (default: 5 minutes). Useful for catching crashes the normal health-check loop may have missed.\n\nCleanup runs periodically in the background. These settings only remove or update records in the UI/state - they do not delete model files.\n\n## OpenWebUI Integration (llamaman proxy)\n\nThe llamaman proxy exposes an Ollama-compatible API on port **42069** (configurable). Point OpenWebUI at it:\n\n```yaml\nopen-webui:\n  environment:\n    - OLLAMA_BASE_URL=http://llamaman:42069\n```\n\n**How it works:**\n\n1. OpenWebUI calls `/api/tags` -\u003e LlamaMan returns all available GGUF models\n2. User selects a model in OpenWebUI -\u003e `/api/chat` request arrives\n3. LlamaMan auto-launches a llama-server (using saved preset or defaults)\n4. Waits for healthy, then proxies the request with format translation\n5. When `LLAMAMAN_MAX_MODELS` limit is reached, the least-recently-used **Ollama-managed** model is evicted. 
Admin UI launched models are never evicted by the Ollama API by default (see [Model eviction policy](#model-eviction-policy))\n\nSupported Ollama endpoints: `/api/tags`, `/api/chat`, `/api/generate`, `/api/show`, `/api/version`, `/api/ps`\n\nAlso supports OpenAI-compatible endpoints with auto-start: `/v1/models`, `/v1/chat/completions`\n\n### Model eviction policy\n\nThe `LLAMAMAN_MAX_MODELS` limit controls how many **chat** models the proxy will keep loaded simultaneously. When a new model is requested and the limit is reached, the least-recently-used (LRU) chat model is evicted to make room.\n\n#### Priority rules\n\nAdmin UI launched models have ultimate priority. The two API surfaces have different eviction rights:\n\n| Launcher | Eviction behaviour | Cannot evict |\n|----------|--------------------|--------------|\n| **Admin UI** | Evicts Ollama-managed models first (LRU), then admin UI models if needed | - |\n| **Ollama API** (`/api/chat`, `/api/generate`) | Evicts Ollama-managed models (LRU) | Admin UI launched models (by default) |\n| **OpenAI API** (`/v1/chat/completions`) | No eviction - starts model only if a slot is free | Everything |\n\nIf the cap is full, requests that cannot evict return HTTP 503:\n```\nmodel limit reached (LLAMAMAN_MAX_MODELS=N); admin-launched models cannot be evicted via the API\nmodel limit reached (LLAMAMAN_MAX_MODELS=N); the OpenAI API does not evict running models\n```\n\n#### App Settings toggles\n\nTwo toggles in **Settings \u003e\u003e App Settings** control eviction behaviour:\n\n- **Enforce `LLAMAMAN_MAX_MODELS` for admin UI launches** - when on, the admin UI silently evicts the LRU model (Ollama-managed first) before launching. When off (default), the UI prompts you to confirm before exceeding the cap.\n- **Allow Ollama API to evict admin-launched models** - when on, the Ollama API can also evict admin UI launched models as a fallback if no Ollama-managed models are available to evict. Off by default. 
Has no effect on the OpenAI API, which never evicts.\n\n#### Other details\n\n- **All running instances count toward the limit** - both admin UI and proxy-managed instances. If you manually launch 2 models and `LLAMAMAN_MAX_MODELS=1`, the proxy sees you are already over the limit.\n- **Embedding models are excluded.** Instances marked as **Embedding Model** do not count toward the limit and are never evicted. This lets you keep an embedding model loaded permanently alongside your chat models.\n- **`LLAMAMAN_MAX_MODELS=0` (default) disables eviction entirely.** The proxy will launch models on demand without ever stopping existing ones.\n\n## Storage Backends\n\n### JSON (default)\n\nZero-config. Stores data in JSON files under `DATA_DIR` (`/data`):\n- `state.json` - instances and downloads\n- `presets.json` - per-model launch presets\n- `users.json` - user accounts\n- `settings.json` - global settings\n- `api_keys.json` - API key hashes\n\nInstance and download logs are written to `LOGS_DIR` (`/tmp/llama-logs`), which is separate from persistent data.\n\n### MariaDB / MySQL\n\nSet `DATABASE_URL` to enable:\n\n```yaml\nenvironment:\n  - DATABASE_URL=mysql+pymysql://user:password@host:3306/llamaman\n```\n\nTables are auto-created on first connection. 
Requires `sqlalchemy` and `pymysql` (included in requirements).\n\n## Environment Variables\n\n| Variable | Default | Description |\n|---|---|---|\n| `MODELS_DIR` | `/models` | Directory scanned for model files |\n| `DATA_DIR` | `/data` | Directory for persistent config/state (JSON files) |\n| `LOGS_DIR` | `/tmp/llama-logs` | Directory for instance and download logs |\n| `PORT_RANGE_START` | `8000` | Start of public llama-server/proxy port pool |\n| `PORT_RANGE_END` | `8020` | End of public llama-server/proxy port pool |\n| `INTERNAL_PORT_RANGE_START` | `9000` | Start of internal llama-server port pool used when proxy mode is enabled |\n| `INTERNAL_PORT_RANGE_END` | `9020` | End of internal llama-server port pool used when proxy mode is enabled |\n| `LLAMAMAN_PROXY_PORT` | `42069` | Port for the Ollama-compatible proxy |\n| `LLAMAMAN_MAX_MODELS` | `0` | Max concurrent **chat** models via the proxy (LRU eviction, 0 = unlimited) |\n| `LLAMAMAN_IDLE_TIMEOUT` | `0` | Idle timeout in minutes for proxy-managed instances (0 = disabled) |\n| `SECRET_KEY` | _(auto)_ | Flask session secret. Auto-derived from machine-id if unset. Set this for multi-replica deployments. |\n| `DATABASE_URL` | _(unset)_ | MariaDB/MySQL connection string. Unset = use JSON files. |\n| `HEALTH_CHECK_TIMEOUT` | `3` | Timeout in seconds for instance health checks |\n| `MODEL_LOAD_TIMEOUT` | `300` | Seconds to wait for a model to become healthy during launch/relaunch. Increase for very large models. |\n| `REQUEST_TIMEOUT` | `300` | Timeout in seconds for upstream requests to llama-server and gate acquire waits. Increase if requests are being cut off under heavy concurrency. |\n\n## REST API\n\nAll endpoints return and accept JSON.\n\n**Authentication:** Management endpoints require either a session cookie (from browser login) or an `Authorization: Bearer \u003ckey\u003e` header. 
When `require_auth` is enabled (default), model-serving endpoints also require a bearer token.\n\n### Authentication\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/login` | Login page |\n| `POST` | `/login` | Authenticate (`username`, `password` form data) |\n| `GET` | `/setup` | First-run setup page |\n| `POST` | `/setup` | Create first user account |\n| `GET` | `/logout` | End session |\n\n### API Keys\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/api-keys` | List all API keys (hashes stripped) |\n| `POST` | `/api/api-keys` | Create a new API key (`{\"name\": \"...\"}`) |\n| `DELETE` | `/api/api-keys/\u003cid\u003e` | Revoke an API key |\n\n### Instances\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/instances` | List all instances |\n| `POST` | `/api/instances` | Launch a new instance |\n| `GET` | `/api/instances/\u003cid\u003e` | Get a single instance |\n| `DELETE` | `/api/instances/\u003cid\u003e` | Stop and remove an instance |\n| `POST` | `/api/instances/\u003cid\u003e/restart` | Restart a stopped/sleeping instance |\n| `DELETE` | `/api/instances/\u003cid\u003e/remove` | Remove a stopped instance from the list |\n| `GET` | `/api/instances/\u003cid\u003e/logs` | Last N log lines |\n| `GET` | `/api/instances/\u003cid\u003e/logs/stream` | SSE live log tail |\n| `GET` | `/api/next-port` | Get next available port from the pool |\n\n**Launch body** (`POST /api/instances`):\n```json\n{\n  \"model_path\": \"/models/my-model.gguf\",\n  \"port\": 8000,\n  \"n_gpu_layers\": -1,\n  \"ctx_size\": 4096,\n  \"threads\": null,\n  \"parallel\": null,\n  \"extra_args\": \"--flash-attn\",\n  \"gpu_devices\": \"0\",\n  \"idle_timeout_min\": 0,\n  \"max_concurrent\": 0,\n  \"max_queue_depth\": 200,\n  \"share_queue\": false,\n  \"proxy_sampling_override_enabled\": false,\n  \"proxy_sampling_temperature\": 0.8,\n  \"proxy_sampling_top_k\": 40,\n  \"proxy_sampling_top_p\": 0.95,\n  
\"proxy_sampling_presence_penalty\": 0.0\n}\n```\n\n### Downloads\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/downloads` | List all downloads |\n| `POST` | `/api/downloads` | Start a new download |\n| `GET` | `/api/downloads/\u003cid\u003e` | Get a single download |\n| `DELETE` | `/api/downloads/\u003cid\u003e` | Cancel an active download |\n| `DELETE` | `/api/downloads/\u003cid\u003e/remove` | Remove a completed/failed entry |\n| `GET` | `/api/downloads/\u003cid\u003e/logs` | Download log output |\n| `GET` | `/api/downloads/\u003cid\u003e/logs/stream` | SSE live log tail |\n\n**Download body** (`POST /api/downloads`):\n```json\n{\n  \"repo_id\": \"bartowski/Mistral-7B-Instruct-v0.3-GGUF\",\n  \"filename\": \"Mistral-7B-Instruct-v0.3-Q4_K_M.gguf\",\n  \"hf_token\": \"hf_...\",\n  \"speed_limit_mbps\": 0\n}\n```\n\nLeave `filename` blank to download the full repository.\n\n### Models\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/models` | List discovered models in `MODELS_DIR` (includes `repo_id` when source is known) |\n| `POST` | `/api/models/delete` | Delete a model from disk (`{\"path\": \"/models/...\"}`) |\n| `GET` | `/api/model-layers?path=\u003cpath\u003e` | Read layer count from GGUF metadata |\n| `GET` | `/api/disk-space` | Free/used space on the models volume |\n\n### Presets\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/presets` | List all saved presets |\n| `GET` | `/api/presets/\u003cmodel_path\u003e` | Get preset for a model |\n| `PUT` | `/api/presets/\u003cmodel_path\u003e` | Save/update a preset |\n| `DELETE` | `/api/presets/\u003cmodel_path\u003e` | Delete a preset |\n\n### Settings\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/settings` | Get global settings |\n| `POST` | `/api/settings` | Save global settings |\n\n**Settings body** (`POST /api/settings`):\n```json\n{\n  \"require_auth\": true,\n  \"admin_ui_enforce_max_models\": false,\n  
\"allow_ollama_api_override_admin\": false,\n  \"auto_retry_failed_downloads\": false,\n  \"retry_count_per_failed_download\": 3,\n  \"cleanup\": {\n    \"downloads_enabled\": true,\n    \"downloads_max_age_hours\": 24,\n    \"downloads_last_run_at\": 1710000000,\n    \"instances_enabled\": false,\n    \"instances_max_age_hours\": 48,\n    \"instances_last_run_at\": 1710000000,\n    \"stale_records_enabled\": false,\n    \"stale_records_interval_min\": 5,\n    \"stale_records_last_run_at\": null\n  }\n}\n```\n\n### System\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/system-info` | CPU usage, core count, RAM usage |\n| `GET` | `/api/gpu-info` | Per-GPU VRAM usage via nvidia-smi |\n| `GET` | `/health` | Health check (`{\"status\": \"ok\"}`) - always open, no auth required |\n\n### Ollama-compatible (llamaman)\n\n| Method | Endpoint | Description |\n|---|---|---|\n| `GET` | `/api/tags` | List available models (Ollama format) |\n| `GET` | `/api/version` | Version info |\n| `POST` | `/api/show` | Model metadata |\n| `GET` | `/api/ps` | Running models |\n| `POST` | `/api/chat` | Chat completion (auto-starts model) |\n| `POST` | `/api/generate` | Text generation (auto-starts model) |\n| `GET` | `/v1/models` | List models (OpenAI format) |\n| `POST` | `/v1/chat/completions` | Chat completion (OpenAI format, auto-starts model) |\n\n## Troubleshooting\n\n| Symptom | Fix |\n|---|---|\n| _\"llama-server binary not found\"_ | The base image must be `ghcr.io/ggml-org/llama.cpp:server-cuda` (or `server-rocm` for AMD). Rebuild with `--no-cache`. |\n| Instance stuck on **starting** | Check logs via the Logs button. Common causes: OOM, model path typo, corrupt GGUF. |\n| No GPU / CUDA error | Ensure the NVIDIA Container Toolkit is installed and `docker run --gpus all` works. |\n| No GPU / ROCm error | Ensure `/dev/kfd` and `/dev/dri` exist on the host and your user is in the `video`/`render` groups. The ROCm image is experimental and not tested. 
|\n| Port conflict | The form auto-suggests an unused port; adjust if needed. |\n| Model not showing in OpenWebUI | Ensure `OLLAMA_BASE_URL` points to `http://llamaman:42069`. Check that `/api/tags` returns models. |\n| OpenWebUI gets 401 errors | `require_auth` is on (default). Create an API key in the UI and set `OPENAI_API_KEYS` in OpenWebUI's environment. |\n| _\"API key required\"_ on all requests | Either create an API key, or turn off the \"Require authentication\" toggle in the API Keys section. |\n\n## Credits\n\nThis project would not be possible without the work of [ggml-org/llama.cpp](https://github.com/ggml-org/llama.cpp).\n\n## License\n\nLlamaMan is licensed under the [Elastic License 2.0](LICENSE). You may use, copy, distribute, and modify the software, subject to the following limitations:\n\n- You may **not** provide it as a hosted or managed service\n- You may **not** remove or circumvent license key functionality\n- You may **not** alter or remove licensing, copyright, or other notices\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnullata%2Fllamaman","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnullata%2Fllamaman","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnullata%2Fllamaman/lists"}