{"id":49239371,"url":"https://github.com/michaelkrauty/llamesh","last_synced_at":"2026-04-24T19:00:36.596Z","repository":{"id":345677961,"uuid":"1186896061","full_name":"michaelkrauty/llamesh","owner":"michaelkrauty","description":"OpenAI-compatible mesh proxy for llama.cpp","archived":false,"fork":false,"pushed_at":"2026-04-19T08:50:39.000Z","size":375,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-19T10:26:51.899Z","etag":null,"topics":["ai","inference","llama-cpp","llm","load-balancer","openai-api","proxy","rust"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/michaelkrauty.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-20T05:25:34.000Z","updated_at":"2026-04-19T08:50:31.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/michaelkrauty/llamesh","commit_stats":null,"previous_names":["michaelkrauty/llamesh"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/michaelkrauty/llamesh","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelkrauty%2Fllamesh","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelkrauty%2Fllamesh/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelkrauty%2Fllamesh/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelkrauty%2Fllamesh/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/michaelkrauty","download_url":"https://codeload.github.com/michaelkrauty/llamesh/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/michaelkrauty%2Fllamesh/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32236744,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-24T13:21:15.438Z","status":"ssl_error","status_checked_at":"2026-04-24T13:21:15.005Z","response_time":64,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","inference","llama-cpp","llm","load-balancer","openai-api","proxy","rust"],"created_at":"2026-04-24T19:00:20.938Z","updated_at":"2026-04-24T19:00:36.583Z","avatar_url":"https://github.com/michaelkrauty.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# llamesh\n\nAn OpenAI-compatible mesh proxy for [llama.cpp](https://github.com/ggml-org/llama.cpp). It manages `llama-server` instances across one or more machines, handling spawn/evict lifecycle, load balancing, and cluster routing — while exposing a standard OpenAI API to clients.\n\nPoint any OpenAI-compatible client at llamesh and it handles the rest: spinning up the right model, routing to the best available instance, and tearing it down when idle.\n\n## Features\n\n- **OpenAI-compatible API** — `/v1/chat/completions`, `/v1/completions`, `/v1/embeddings`, `/v1/models`\n- **Automatic instance management** — on-demand spawn, idle eviction, health monitoring\n- **Multi-node mesh** — zero-config LAN discovery (mDNS) or explicit WAN peers, encrypted with Noise Protocol\n- **Model profiles** — configure multiple profiles per model (e.g. `fast` vs `quality`) with different llama-server args\n- **Resource guardrails** — device-wide VRAM telemetry and system memory tracking prevent OOM\n- **Hot-reload cookbook** — add/modify models without restarting\n- **Auto-build llama.cpp** — clones, builds, smoke tests, and atomically swaps binaries\n- **Hugging Face integration** — download models automatically via `hf_repo`/`hf_file`\n- **Streaming** — SSE streaming with backpressure, forwarded verbatim\n- **Metrics \u0026 health** — Prometheus metrics, JSON snapshots, `/healthz` and `/readyz` probes\n- **Security** — TLS, API key auth, Noise Protocol encryption for inter-node traffic\n\n## Quick Start\n\n### Build\n\n```bash\ngit clone https://github.com/michaelkrauty/llamesh.git\ncd llamesh\ncargo build --release\n```\n\n**Requirements:** Rust 1.80+, CMake, C/C++ compiler, git\n\n### Configure\n\nCreate a minimal `config.yaml`:\n\n```yaml\nnode_id: \"my-node\"\nlisten_addr: \"0.0.0.0:8080\"\nmax_vram_mb: 24000\nmax_sysmem_mb: 64000\n\nllama_cpp:\n  repo_url: \"https://github.com/ggml-org/llama.cpp.git\"\n  build_args:\n    - \"-DGGML_CUDA=ON\"\n  enabled: true\n```\n\nWhen NVIDIA NVML is available, llamesh counts device-wide VRAM usage against\n`max_vram_mb`, including GPU memory used by processes it does not manage.\n\nCreate a `cookbook.yaml` with your models:\n\n```yaml\nmodels:\n  - name: \"my-model\"\n    profiles:\n      - id: \"default\"\n        model_path: \"./models/my-model.gguf\"\n        llama_server_args: \"-c 32768 -fa on\"\n```\n\n### Run\n\n```bash\n./target/release/llamesh --config ./config.yaml --cookbook ./cookbook.yaml\n```\n\n### Use\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"my-model\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],\n    \"stream\": true\n  }'\n```\n\nOr with the OpenAI Python client:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http://localhost:8080/v1\", api_key=\"unused\")\n\nresponse = client.chat.completions.create(\n    model=\"my-model\",\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n    stream=True,\n)\n\nfor chunk in response:\n    print(chunk.choices[0].delta.content or \"\", end=\"\", flush=True)\n```\n\n## Multi-Node Mesh\n\nEnable clustering to spread load across machines. Nodes discover each other and route requests to wherever capacity is available.\n\n**Zero-config LAN** — just enable it:\n\n```yaml\ncluster:\n  enabled: true\n```\n\n**Explicit WAN peers:**\n\n```yaml\ncluster:\n  enabled: true\n  peers: [\"other-node.example.com:8080\"]\n```\n\nInter-node traffic is encrypted with Noise Protocol (keys auto-generated, Trust-On-First-Use by default).\n\n## Model Profiles\n\nRequest a specific profile with `model:profile` syntax:\n\n```json\n{ \"model\": \"my-model:fast\", ... }\n```\n\nDefine profiles in the cookbook to trade off speed vs quality, context size, quantization, etc. — each profile maps to a distinct set of `llama-server` args.\n\nProfiles default to enabled. Set `enabled: false` on a profile to keep it in\nthe cookbook without listing, routing, prewarming, or advertising that profile:\n\n```yaml\nmodels:\n  - name: \"my-model\"\n    enabled: true\n    profiles:\n      - id: \"default\"\n        model_path: \"./models/my-model.gguf\"\n        llama_server_args: \"-c 32768 -fa on\"\n      - id: \"experimental\"\n        enabled: false\n        model_path: \"./models/my-model.gguf\"\n        llama_server_args: \"-c 65536 -fa on\"\n```\n\n## Hugging Face Models\n\nDownload models automatically instead of managing files manually:\n\n```yaml\nmodels:\n  - name: \"qwen2.5-0.5b\"\n    profiles:\n      - id: \"default\"\n        hf_repo: \"ggml-org/Qwen2.5-0.5B-Instruct-GGUF\"\n        hf_file: \"qwen2.5-0.5b-instruct-q4_k_m.gguf\"\n        llama_server_args: \"-c 32768 -fa on\"\n```\n\n## Environment Variable Overrides\n\nConfig values can be overridden with environment variables using the `LLAMESH_` prefix:\n\n```bash\nLLAMESH_NODE_ID=my-node LLAMESH_MAX_VRAM_MB=48000 ./target/release/llamesh --config ./config.yaml --cookbook ./cookbook.yaml\n```\n\nUse `__` (double underscore) for nested fields: `LLAMESH_CLUSTER__ENABLED=true`.\n\n## API Endpoints\n\n| Endpoint | Description |\n|---|---|\n| `POST /v1/chat/completions` | Chat completions (streaming supported) |\n| `POST /v1/completions` | Text completions (streaming supported) |\n| `POST /v1/embeddings` | Embeddings (requires `--embedding` profile) |\n| `POST /v1/rerank` | Reranking (requires `--reranking` profile) |\n| `GET /v1/models` | List available models |\n| `GET /healthz` | Health check |\n| `GET /readyz` | Readiness check |\n| `GET /metrics` | Prometheus metrics |\n| `GET /metrics/json` | JSON metrics snapshot |\n| `GET /cluster/nodes` | Cluster state |\n| `POST /admin/prewarm` | Pre-warm a model/profile |\n| `POST /admin/rebuild-llama` | Trigger llama.cpp rebuild |\n\n## Documentation\n\n- **[SPEC.md](SPEC.md)** — Full technical specification: configuration reference, request lifecycle, routing algorithms, resource management, cluster design, security model, and implementation details.\n- **[config.example.yaml](config.example.yaml)** — Annotated example configuration.\n- **[cookbook.example.yaml](cookbook.example.yaml)** — Annotated example cookbook with model definitions.\n\n## License\n\nLicensed under the [Apache License, Version 2.0](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelkrauty%2Fllamesh","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmichaelkrauty%2Fllamesh","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmichaelkrauty%2Fllamesh/lists"}