https://github.com/mostlygeek/llama-swap
Model swapping for llama.cpp (or any local OpenAPI compatible server)
https://github.com/mostlygeek/llama-swap
golang llama llamacpp localllama localllm openai openai-api vllm
Last synced: about 2 months ago
JSON representation
Model swapping for llama.cpp (or any local OpenAPI compatible server)
- Host: GitHub
- URL: https://github.com/mostlygeek/llama-swap
- Owner: mostlygeek
- License: mit
- Created: 2024-10-04T04:06:01.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-07T19:37:18.000Z (about 1 year ago)
- Last Synced: 2025-04-11T23:19:14.020Z (about 1 year ago)
- Topics: golang, llama, llamacpp, localllama, localllm, openai, openai-api, vllm
- Language: Go
- Homepage:
- Size: 539 KB
- Stars: 507
- Watchers: 7
- Forks: 28
- Open Issues: 5
-
Metadata Files:
- Readme: README.md
- License: LICENSE.md
Awesome Lists containing this project
- awesome-llm-services - llama-swap
- awesome-hacking-lists - mostlygeek/llama-swap - transparent proxy server on demand model swapping for llama.cpp (or any local OpenAPI compatible server) (Go)
- awesome-ChatGPT-repositories - llama-swap - Model swapping for llama.cpp (or any local OpenAPI compatible server) (Langchain)
- awesome-mac-mini-homeserver - llama-swap - Go proxy that hot-swaps llama.cpp or any OpenAI-compatible backend on demand.    (Agentic AI  / Local LLM runtimes)
- awesome-llm-tools - llama-swap - swap?style=flat-square) | MIT | (12. Miscellaneous / Data & Alignment Tools)
- awesome-local-llm - llama-swap - reliable model swapping for any local OpenAI compatible server - llama.cpp, vllm, etc. (Tools / Models)
- awesome-private-ai - llama-swap - Model swapping for llama.cpp (or any local OpenAPI compatible server). (Inference Runtimes & Backends)
- awesome-opensource-ai - llama-swap - Intelligent model swapping proxy for llama.cpp. Enables seamless hot-swapping between different GGUF models without restarting the server, with automatic model loading/unloading and OpenAI-compatible API. MIT licensed.  (3. Inference Engines & Serving)
- StarryDivineSky - mostlygeek/llama-swap - swap是一个专为本地OpenAI/Anthropic兼容服务器设计的模型切换工具,支持在llama.cpp、vllm等框架中实现可靠且高效的模型替换功能。该项目的核心特色是能够在不重启服务的情况下动态切换不同模型,例如从Llama-3切换到Mistral或Mixtral等主流大模型,特别适合需要频繁测试或对比模型效果的开发场景。其工作原理基于对本地模型服务器的插件式集成,通过预加载模型权重和优化内存管理机制,确保切换过程零中断且资源占用合理。 项目支持多种模型框架的兼容性,用户可自定义模型路径和参数配置,同时提供自动化脚本简化切换流程。相比传统手动重启服务的切换方式,llama-swap通过动态加载模型权重和智能资源调度技术,显著降低了模型切换的时间成本。开发者可通过简单的API或命令行接口完成操作,适用于本地部署的AI服务器、研究测试环境或需要多模型协作的场景。项目还注重稳定性,通过严格的错误检测机制防止因模型不兼容导致的服务崩溃,同时支持监控模型运行状态和资源使用情况。对于使用llama.cpp或vllm等轻量化框架的用户,该项目提供了针对性的优化方案,确保在有限硬件资源下也能流畅运行。整体设计兼顾了易用性与高性能需求,是本地模型部署和测试的实用工具。 (A01_文本生成_文本对话 / 大语言对话模型及数据)
README




# llama-swap
Run multiple generative AI models on your machine and hot-swap between them on demand. llama-swap works with any OpenAI and Anthropic API compatible server and is used by thousands of people to power their local AI workflows.
Built in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. Get started in minutes - just one binary and one configuration file.
## Features:
- ✅ Easy to deploy and configure: one binary, one configuration file. no external dependencies
- ✅ On-demand model switching
- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, stable-diffusion.cpp, etc.)
- future proof, upgrade your inference servers at any time.
- ✅ OpenAI API supported endpoints:
- `v1/completions`
- `v1/chat/completions`
- `v1/responses`
- `v1/embeddings`
- `v1/models` - list available models
- `v1/audio/speech` ([#36](https://github.com/mostlygeek/llama-swap/issues/36))
- `v1/audio/transcriptions` ([docs](https://github.com/mostlygeek/llama-swap/issues/41#issuecomment-2722637867))
- `v1/audio/voices`
- `v1/images/generations`
- `v1/images/edits`
- ✅ Anthropic API supported endpoints:
- `v1/messages`
- `v1/messages/count_tokens`
- ✅ llama-server (llama.cpp) supported endpoints
- `v1/rerank`, `v1/reranking`, `/rerank`
- `/infill` - for code infilling
- `/completion` - for completion endpoint
- ✅ SDAPI via [stable-diffusion.cpp's server](https://github.com/leejet/stable-diffusion.cpp/tree/master/examples/server)
- `/sdapi/v1/txt2img`
- `/sdapi/v1/img2img`
- `/sdapi/v1/loras` - requires `model` in request body to fetch the correct loras
- ✅ llama-swap API
- `/ui` - web UI
- `/upstream/:model_id` - direct access to upstream server ([demo](https://github.com/mostlygeek/llama-swap/pull/31))
- `/running` - list currently running models ([#61](https://github.com/mostlygeek/llama-swap/issues/61))
- `POST /api/models/unload` - manually unload all running models ([#58](https://github.com/mostlygeek/llama-swap/issues/58))
- `POST /api/models/unload/:model_id` - unload a specific model
- `/logs` - remote log monitoring
- `GET /logs` returns buffered plain text logs.
- If `Accept: text/html` is sent, `/logs` redirects to `/ui/`.
- `GET /logs/stream` keeps the connection open for live log streaming.
- Stream endpoints send buffered history first by default; add `?no-history` to stream only new lines.
- `GET /logs/stream/proxy` streams proxy logs only.
- `GET /logs/stream/upstream` streams upstream process logs only.
- `GET /logs/stream/{model_id}` streams logs for one model (including IDs with slashes, like `author/model`).
- `/health` - just returns "OK"
- `/metrics` - system and GPU metrics for prometheus
- ✅ API Key support - define keys to restrict access to API endpoints
- ✅ Customizable
- Run concurrent models with a custom DSL swap matrix ([#643](https://github.com/mostlygeek/llama-swap/issues/643))
- Automatic unloading of models after timeout by setting a `ttl`
- Docker and Podman support using `cmd` and `cmdStop` together
- Preload models on startup with `hooks` ([#235](https://github.com/mostlygeek/llama-swap/pull/235))
- Apply filters to requests to control inference with `stripParams`, `setParams` and `setParamsByID`
### Web UI
llama-swap includes a real time web interface with a playground for testing out all sorts of local models:

View detailed token metrics:

Inspect request and responses:

Manually load and unload models:

Real time log streaming:

## Installation
llama-swap can be installed in multiple ways
1. Docker
2. Homebrew (OSX and Linux)
3. WinGet
4. From release binaries
5. From source
### Docker Install ([download images](https://github.com/mostlygeek/llama-swap/pkgs/container/llama-swap))
Two types of container images are built nightly for llama-swap:
1. A unified container with llama-server, ik-llama-server, stable-diffusion.cpp, whisper.cpp and llama-swap built from source. This is only available for cuda and vulkan but has more capabilities. This one is recommended for use.
2. A legacy image that is based on llama.cpp's images and llama-swap copied into the container. Use this one if you prefer to stay close to llama.cpp's container images.
#### Unified container (Recommended)
```shell
$ docker pull ghcr.io/mostlygeek/llama-swap:unified-cuda
# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/etc/llama-swap/config/config.yaml \
ghcr.io/mostlygeek/llama-swap:unified-cuda
```
#### Legacy container
```shell
$ docker pull ghcr.io/mostlygeek/llama-swap:cuda
# run with a custom configuration and models directory
$ docker run -it --rm --runtime nvidia -p 9292:8080 \
-v /path/to/models:/models \
-v /path/to/custom/config.yaml:/app/config.yaml \
ghcr.io/mostlygeek/llama-swap:cuda
```
more examples
```shell
# pull latest images per platform
docker pull ghcr.io/mostlygeek/llama-swap:cpu
docker pull ghcr.io/mostlygeek/llama-swap:cuda
docker pull ghcr.io/mostlygeek/llama-swap:vulkan
docker pull ghcr.io/mostlygeek/llama-swap:intel
docker pull ghcr.io/mostlygeek/llama-swap:musa
# tagged llama-swap, platform and llama-server version images
docker pull ghcr.io/mostlygeek/llama-swap:v166-cuda-b6795
# non-root cuda
docker pull ghcr.io/mostlygeek/llama-swap:cuda-non-root
```
### Homebrew Install (macOS/Linux)
```shell
brew tap mostlygeek/llama-swap
brew install llama-swap
llama-swap --config path/to/config.yaml --listen localhost:8080
```
### WinGet Install (Windows)
> [!NOTE]
> WinGet is maintained by community contributor [Dvd-Znf](https://github.com/Dvd-Znf) ([#327](https://github.com/mostlygeek/llama-swap/issues/327)). It is not an official part of llama-swap.
```shell
# install
C:\> winget install llama-swap
# upgrade
C:\> winget upgrade llama-swap
```
### Pre-built Binaries
Binaries are available on the [release](https://github.com/mostlygeek/llama-swap/releases) page for Linux, Mac, Windows and FreeBSD.
### Building from source
1. Building requires Go and Node.js (for UI).
1. `git clone https://github.com/mostlygeek/llama-swap.git`
1. `make clean all`
1. look in the `build/` subdirectory for the llama-swap binary
## Configuration
```yaml
# minimum viable config.yaml
models:
model1:
cmd: llama-server --port ${PORT} --model /path/to/model.gguf
```
That's all you need to get started:
1. `models` - holds all model configurations
2. `model1` - the ID used in API calls
3. `cmd` - the command to run to start the server.
4. `${PORT}` - an automatically assigned port number
Almost all configuration settings are optional and can be added one step at a time:
- Advanced features
- `matrix` to run concurrent models with a custom swap logic DSL
- `hooks` to run things on startup
- `macros` reusable snippets
- Model customization
- `ttl` to automatically unload models
- `aliases` to use familiar model names (e.g., "gpt-4o-mini")
- `env` to pass custom environment variables to inference servers
- `cmdStop` gracefully stop Docker/Podman containers
- `useModelName` to override model names sent to upstream servers
- `${PORT}` automatic port variables for dynamic port assignment
- `filters` rewrite parts of requests before sending to the upstream server
See the [configuration documentation](docs/configuration.md) for all options.
## How does llama-swap work?
When a request is made to an OpenAI compatible endpoint, llama-swap will extract the `model` value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the "swap" part comes in. The upstream server is automatically swapped to handle the request correctly.
In the most basic configuration llama-swap handles one model at a time. For more advanced use cases, using a `matrix` allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.
## Reverse Proxy Configuration (nginx)
If you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks Server‑Sent Events (SSE) and streaming chat completion. ([#236](https://github.com/mostlygeek/llama-swap/issues/236))
Recommended nginx configuration snippets:
```nginx
# SSE for UI events/logs
location /api/events {
proxy_pass http://your-llama-swap-backend;
proxy_buffering off;
proxy_cache off;
}
# Streaming chat completions (stream=true)
location /v1/chat/completions {
proxy_pass http://your-llama-swap-backend;
proxy_buffering off;
proxy_cache off;
}
```
As a safeguard, llama-swap also sets `X-Accel-Buffering: no` on SSE responses. However, explicitly disabling `proxy_buffering` at your reverse proxy is still recommended for reliable streaming behavior.
## Monitoring Logs on the CLI
```sh
# sends up to the last 10KB of logs
$ curl http://host/logs
# streams combined logs
curl -Ns http://host/logs/stream
# stream llama-swap's proxy status logs
curl -Ns http://host/logs/stream/proxy
# stream logs from upstream processes that llama-swap loads
curl -Ns http://host/logs/stream/upstream
# stream logs only from a specific model
curl -Ns http://host/logs/stream/{model_id}
# stream and filter logs with linux pipes
curl -Ns http://host/logs/stream | grep 'eval time'
# appending ?no-history will disable sending buffered history first
curl -Ns 'http://host/logs/stream?no-history'
```
## Do I need to use llama.cpp's server (llama-server)?
Any OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.
For Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. This provides clean environment isolation as well as responding correctly to `SIGTERM` signals for proper shutdown.
## Star History
> [!NOTE]
> Thank you to everyone who has given this project a ⭐️!
[](https://www.star-history.com/#mostlygeek/llama-swap&Date)