https://github.com/michaelkrauty/llamesh

OpenAI-compatible mesh proxy for llama.cpp

ai inference llama-cpp llm load-balancer openai-api proxy rust

Last synced: 24 days ago
Host: GitHub
URL: https://github.com/michaelkrauty/llamesh
Owner: michaelkrauty
License: apache-2.0
Created: 2026-03-20T05:25:34.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-04-19T08:50:39.000Z (30 days ago)
Last Synced: 2026-04-19T10:26:51.899Z (30 days ago)
Topics: ai, inference, llama-cpp, llm, load-balancer, openai-api, proxy, rust
Language: Rust
Size: 366 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# llamesh

An OpenAI-compatible mesh proxy for [llama.cpp](https://github.com/ggml-org/llama.cpp). It manages `llama-server` instances across one or more machines, handling spawn/evict lifecycle, load balancing, and cluster routing — while exposing a standard OpenAI API to clients.

Point any OpenAI-compatible client at llamesh and it handles the rest: spinning up the right model, routing to the best available instance, and tearing it down when idle.

## Features

- **OpenAI-compatible API** — `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`
- **Automatic instance management** — on-demand spawn, idle eviction, health monitoring
- **Multi-node mesh** — zero-config LAN discovery (mDNS) or explicit WAN peers, encrypted with Noise Protocol
- **Model profiles** — configure multiple profiles per model (e.g. `fast` vs `quality`) with different llama-server args
- **Resource guardrails** — device-wide VRAM telemetry and system memory tracking prevent OOM
- **Hot-reload cookbook** — add/modify models without restarting
- **Auto-build llama.cpp** — clones, builds, smoke tests, and atomically swaps binaries
- **Hugging Face integration** — download models automatically via `hf_repo`/`hf_file`
- **Streaming** — SSE streaming with backpressure, forwarded verbatim
- **Metrics & health** — Prometheus metrics, JSON snapshots, `/healthz` and `/readyz` probes
- **Security** — TLS, API key auth, Noise Protocol encryption for inter-node traffic

## Quick Start

### Build

```bash
git clone https://github.com/michaelkrauty/llamesh.git
cd llamesh
cargo build --release
```

**Requirements:** Rust 1.80+, CMake, C/C++ compiler, git

### Configure

Create a minimal `config.yaml`:

```yaml
node_id: "my-node"
listen_addr: "0.0.0.0:8080"
max_vram_mb: 24000
max_sysmem_mb: 64000

llama_cpp:
  repo_url: "https://github.com/ggml-org/llama.cpp.git"
  build_args:
    - "-DGGML_CUDA=ON"
  enabled: true
```

When NVIDIA NVML is available, llamesh counts device-wide VRAM usage against
`max_vram_mb`, including GPU memory used by processes it does not manage.
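To make the guardrail concrete, here is a minimal Python sketch of the arithmetic such a check implies. It is an illustration only: the function name `fits_in_vram` and the notion of a per-model VRAM estimate are assumptions made for this example, not llamesh's actual accounting.

```python
def fits_in_vram(device_used_mb: int, estimated_need_mb: int, max_vram_mb: int) -> bool:
    """Return True if a new instance would stay within the configured budget.

    device_used_mb is the device-wide figure (all processes, managed or not),
    so memory held by unrelated GPU processes reduces the available headroom.
    estimated_need_mb is a hypothetical estimate for the instance to be spawned.
    """
    return device_used_mb + estimated_need_mb <= max_vram_mb


# With max_vram_mb: 24000 and 20000 MB already in use on the device,
# an instance estimated at 6000 MB exceeds the budget; one at 3000 MB does not.
print(fits_in_vram(20000, 6000, 24000))
print(fits_in_vram(20000, 3000, 24000))
```

The point of the device-wide figure is that the first argument reflects everything on the GPU, so an unrelated process can make a spawn fail the check even when llamesh itself is idle.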

Create a `cookbook.yaml` with your models:

```yaml
models:
  - name: "my-model"
    profiles:
      - id: "default"
        model_path: "./models/my-model.gguf"
        llama_server_args: "-c 32768 -fa on"
```

### Run

```bash
./target/release/llamesh --config ./config.yaml --cookbook ./cookbook.yaml
```

### Use

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```

Or with the OpenAI Python client:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

response = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

for chunk in response:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```
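If you would rather not depend on the `openai` package, a streamed response can also be consumed by parsing the SSE lines directly. The sketch below assumes the stream uses the OpenAI chunk format (`data: {...}` lines terminated by `data: [DONE]`); the helper name `iter_sse_content` is made up for this example.

```python
import json


def iter_sse_content(lines):
    """Yield text fragments from an iterable of OpenAI-style SSE lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Example with canned lines; in practice, iterate over the HTTP response body.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_sse_content(sample)))
```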

## Multi-Node Mesh

Enable clustering to spread load across machines. Nodes discover each other and route requests to wherever capacity is available.

**Zero-config LAN** — just enable it:

```yaml
cluster:
  enabled: true
```

**Explicit WAN peers:**

```yaml
cluster:
  enabled: true
  peers: ["other-node.example.com:8080"]
```

Inter-node traffic is encrypted with Noise Protocol (keys auto-generated, Trust-On-First-Use by default).
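For readers unfamiliar with Trust-On-First-Use, the idea can be sketched in a few lines of Python. This is a generic illustration of key pinning, not llamesh's implementation: the `TofuStore` class, the in-memory dictionary, and the SHA-256 fingerprint are all assumptions chosen for the example.

```python
import hashlib


class TofuStore:
    """Toy Trust-On-First-Use pin store (hypothetical, in-memory only)."""

    def __init__(self):
        self._pins = {}  # peer address -> fingerprint of the first key seen

    def check(self, peer: str, public_key: bytes) -> bool:
        """Pin the key on first contact; afterwards accept only the same key."""
        fingerprint = hashlib.sha256(public_key).hexdigest()
        pinned = self._pins.setdefault(peer, fingerprint)
        return pinned == fingerprint


store = TofuStore()
print(store.check("other-node.example.com:8080", b"key-A"))  # first contact: pinned
print(store.check("other-node.example.com:8080", b"key-A"))  # same key: accepted
print(store.check("other-node.example.com:8080", b"key-B"))  # changed key: rejected
```

The trade-off of this model is that the first connection is trusted without verification, which is why it suits a default but may not suit every WAN deployment.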

## Model Profiles

Request a specific profile with `model:profile` syntax:

```json
{ "model": "my-model:fast", ... }
```

Define profiles in the cookbook to trade off speed vs quality, context size, quantization, etc. — each profile maps to a distinct set of `llama-server` args.
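On the client side, the `model:profile` string can be built and split with two small helpers. This is a sketch with hypothetical function names; it only mirrors the syntax shown above and says nothing about how llamesh parses the field internally or which profile it picks when none is given.

```python
from typing import Optional, Tuple


def model_ref(model: str, profile: Optional[str] = None) -> str:
    """Build the value for the request's "model" field."""
    return f"{model}:{profile}" if profile else model


def split_model_ref(ref: str) -> Tuple[str, Optional[str]]:
    """Split "model:profile" into its parts; profile is None when absent."""
    model, _, profile = ref.partition(":")
    return model, (profile or None)


print(model_ref("my-model", "fast"))
print(split_model_ref("my-model:fast"))
print(split_model_ref("my-model"))
```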

Profiles are enabled by default. Set `enabled: false` on a profile to keep it in
the cookbook while excluding it from listing, routing, prewarming, and advertising:

```yaml
models:
  - name: "my-model"
    enabled: true
    profiles:
      - id: "default"
        model_path: "./models/my-model.gguf"
        llama_server_args: "-c 32768 -fa on"
      - id: "experimental"
        enabled: false
        model_path: "./models/my-model.gguf"
        llama_server_args: "-c 65536 -fa on"
```

## Hugging Face Models

Download models automatically instead of managing files manually:

```yaml
models:
  - name: "qwen2.5-0.5b"
    profiles:
      - id: "default"
        hf_repo: "ggml-org/Qwen2.5-0.5B-Instruct-GGUF"
        hf_file: "qwen2.5-0.5b-instruct-q4_k_m.gguf"
        llama_server_args: "-c 32768 -fa on"
```
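To check that an `hf_repo`/`hf_file` pair points at a real file before adding it to the cookbook, you can build the public Hugging Face download URL and open it in a browser. The sketch below uses the standard `resolve` URL layout of huggingface.co; whether llamesh fetches files through this URL, and which revision it uses, is not stated here, so treat `hf_resolve_url` and the `main` default as assumptions.

```python
from urllib.parse import quote


def hf_resolve_url(hf_repo: str, hf_file: str, revision: str = "main") -> str:
    """Return the huggingface.co download URL for a file in a model repo."""
    return (
        f"https://huggingface.co/{hf_repo}"
        f"/resolve/{quote(revision, safe='')}/{quote(hf_file)}"
    )


print(hf_resolve_url(
    "ggml-org/Qwen2.5-0.5B-Instruct-GGUF",
    "qwen2.5-0.5b-instruct-q4_k_m.gguf",
))
```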

## Environment Variable Overrides

Config values can be overridden with environment variables using the `LLAMESH_` prefix:

```bash
LLAMESH_NODE_ID=my-node LLAMESH_MAX_VRAM_MB=48000 ./target/release/llamesh --config ./config.yaml --cookbook ./cookbook.yaml
```

Use `__` (double underscore) for nested fields: `LLAMESH_CLUSTER__ENABLED=true`.
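The naming rule can be captured in a one-line helper, which is handy when generating environment blocks for containers or service files. This is a sketch based only on the two examples above; the helper name `env_var_name` is hypothetical.

```python
def env_var_name(*path: str, prefix: str = "LLAMESH_") -> str:
    """Map a config key path to its override variable name.

    Each path element is upper-cased; nested levels are joined with "__".
    """
    return prefix + "__".join(part.upper() for part in path)


print(env_var_name("node_id"))
print(env_var_name("max_vram_mb"))
print(env_var_name("cluster", "enabled"))
```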

## API Endpoints

| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completions (streaming supported) |
| `POST /v1/completions` | Text completions (streaming supported) |
| `POST /v1/embeddings` | Embeddings (requires a profile whose `llama_server_args` include `--embedding`) |
| `POST /v1/rerank` | Reranking (requires a profile whose `llama_server_args` include `--reranking`) |
| `GET /v1/models` | List available models |
| `GET /healthz` | Health check |
| `GET /readyz` | Readiness check |
| `GET /metrics` | Prometheus metrics |
| `GET /metrics/json` | JSON metrics snapshot |
| `GET /cluster/nodes` | Cluster state |
| `POST /admin/prewarm` | Pre-warm a model/profile |
| `POST /admin/rebuild-llama` | Trigger llama.cpp rebuild |
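A script that starts llamesh and then sends requests can poll `/readyz` first. The following is a minimal sketch using only the Python standard library. It assumes that an HTTP 200 from `/readyz` means the node is ready, which is the usual convention but is not spelled out in the table above; the function names and the injectable `probe` parameter are choices made for this example.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except OSError:
        return None  # connection refused, timeout, DNS failure, ...


def wait_until_ready(base_url, attempts=30, delay=1.0, probe=http_status):
    """Poll GET /readyz until it returns 200 or the attempts run out."""
    url = base_url.rstrip("/") + "/readyz"
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        time.sleep(delay)
    return False
```

Typical use would be `wait_until_ready("http://localhost:8080")` before the first completion request. The `probe` argument exists so the polling logic can be exercised without a running server.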

## Documentation

- **[SPEC.md](SPEC.md)** — Full technical specification: configuration reference, request lifecycle, routing algorithms, resource management, cluster design, security model, and implementation details.
- **[config.example.yaml](config.example.yaml)** — Annotated example configuration.
- **[cookbook.example.yaml](cookbook.example.yaml)** — Annotated example cookbook with model definitions.

## License

Licensed under the [Apache License, Version 2.0](LICENSE).
