{"id":50632421,"url":"https://github.com/hadihonarvar/flock","last_synced_at":"2026-06-11T05:00:59.511Z","repository":{"id":362617756,"uuid":"1259962837","full_name":"hadihonarvar/flock","owner":"hadihonarvar","description":"Self-hosted LLM gateway. One Go binary turns your Macs and Linux boxes into a private inference cluster — multi-machine routing, sharding via llama.cpp-RPC, per-user keys + quotas + audit, OpenAI- and Anthropic-compatible APIs behind one endpoint. Point Cursor / Claude Code / Aider / SDKs at it.","archived":false,"fork":false,"pushed_at":"2026-06-09T21:48:42.000Z","size":1056,"stargazers_count":38,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-10T04:22:01.679Z","etag":null,"topics":["ai-gateway","aider","anthropic","claude-code","cursor","gguf","golang","inference","llama-cpp","llm","local-llm","mlx","multi-tenant","ollama","openai-compatible","opentelemetry","prometheus","self-hosted","sharded-inference","vllm"],"latest_commit_sha":null,"homepage":"https://flockllm.com","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hadihonarvar.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-05T03:06:23.000Z","updated_at":"2026-06-09T22:07:12.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/hadihonarvar/flock","commit_stats":null,"previo
us_names":["hadihonarvar/flock"],"tags_count":74,"template":false,"template_full_name":null,"purl":"pkg:github/hadihonarvar/flock","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hadihonarvar%2Fflock","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hadihonarvar%2Fflock/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hadihonarvar%2Fflock/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hadihonarvar%2Fflock/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hadihonarvar","download_url":"https://codeload.github.com/hadihonarvar/flock/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hadihonarvar%2Fflock/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34183109,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-gateway","aider","anthropic","claude-code","cursor","gguf","golang","inference","llama-cpp","llm","local-llm","mlx","multi-tenant","ollama","openai-compatible","opentelemetry","prometheus","self-hosted","sharded-inference","vllm"],"created_at":"2026-06-06T23:01:45.149Z","updated_at":"2026-06-11T05:00:59.474Z","avatar_url":"https://git
hub.com/hadihonarvar.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Flock\n\n\u003e **Self-hosted AI for your team. One endpoint. Your hardware.**\n\n[**flockllm.com**](https://flockllm.com) · [GitHub](https://github.com/hadihonarvar/flock) · Maintained by [Hadi Honarvar Nazari](https://www.linkedin.com/in/hadi-honarvar-nazari/) · Apache-2.0\n\n\u003e Flock is the **self-hosted control plane for LLMs**. One Go binary turns your Macs and Linux boxes into a private inference cluster — multi-machine routing, per-user keys, daily quotas, full audit log, and a built-in admin dashboard, behind one endpoint that speaks both the **OpenAI** and **Anthropic** APIs.\n\u003e\n\u003e Engine-agnostic: bring **Ollama**, **vLLM**, **MLX-LM**, or **llama.cpp-RPC**. Run open-weight models (Qwen, Llama, DeepSeek, …) on your own hardware, shard a giant model across several machines via llama.cpp-RPC, and transparently fall back to paid Claude / GPT only when you choose.\n\u003e\n\u003e Point Cursor, Claude Code, Aider, Continue, or any OpenAI/Anthropic SDK at Flock. 
It just works.\n\n## 🗺️ Where Flock sits\n\n```\n           ┌──────────────────────────────────────────────────────────────┐\n           │                       YOUR USE CASES                         │\n           │             (the tools your team already uses)               │\n           └──────────────────────────────────────────────────────────────┘\n                  │           │          │             │            │\n                  ▼           ▼          ▼             ▼            ▼\n            ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐\n            │  Cursor  │ │  Claude  │ │  Aider   │ │  Custom  │ │   curl   │\n            │          │ │   Code   │ │          │ │ Python   │ │  scripts │\n            │          │ │          │ │          │ │   SDK    │ │          │\n            └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘ └────┬─────┘\n                 │  OpenAI    │ Anthropic  │  OpenAI    │  Either    │  HTTP\n                 └────────────┴────────────┴────────────┴────────────┘\n                                          │\n                                          │   ONE URL · ONE API KEY\n                                          ▼\n      ╔══════════════════════════════════════════════════════════════════════╗\n      ║                  ⬢ ⬢ ⬢   FLOCK   ⬢ ⬢ ⬢                              ║\n      ║                  (this is what we built)                             ║\n      ║  ════════════════════════════════════════════════════════════════    ║\n      ║  Gateway     OpenAI + Anthropic on /v1/chat/completions              ║\n      ║              per-user keys · daily quotas · full audit log           ║\n      ║              admin dashboard at :8080                                ║\n      ║                                                                      ║\n      ║  Router      Same model on N nodes  → load-balance                   ║\n      ║              Different models per node → route by placement          ║\n   
   ║              Model bigger than any node → split via llama.cpp-RPC    ║\n      ║              Claude / GPT requested → proxy to vendor                ║\n      ║              Engine error or timeout  → retry catalog fallback chain ║\n      ╚═════════════════════════════╤════════════════════════════════════════╝\n                                    │\n              ┌─────────────────────┼─────────────────────┐\n              ▼                     ▼                     ▼\n       ┌─────────────┐       ┌─────────────┐       ┌─────────────┐\n       │   Engines   │       │   Engines   │       │   Egress    │\n       │  (any mix)  │       │  (any mix)  │       │   proxy     │\n       │  • Ollama   │       │  • Ollama   │       │             │\n       │  • vLLM     │       │  • vLLM     │       │ api.anthro- │\n       │  • MLX-LM   │       │  • MLX-LM   │       │ pic.com     │\n       │  • llama.cpp│       │  • llama.cpp│       │ api.openai  │\n       └──────┬──────┘       └──────┬──────┘       │ .com        │\n              │                     │              └──────┬──────┘\n              ▼                     ▼                     ▼\n      ┌──────────────────────────────────────────────────────────────────────┐\n      │                    UNDERLYING LLMs / WEIGHTS                         │\n      │                                                                      │\n      │   YOUR HARDWARE                              VENDOR APIs             │\n      │   • Mac Studio · Mac Mini                    • Claude (Anthropic)    │\n      │   • Linux + RTX GPU                          • GPT, o3, o4 (OpenAI)  │\n      │                                                                      │\n      │   37 curated catalog models (Qwen 3.6,        Each request routed   │\n      │   gpt-oss, Llama 4, Gemma 4, DeepSeek V4,     to EITHER your hard-  │\n      │   Kimi K2.6, Nemotron 3 Ultra, vision +       ware OR a vendor —    │\n      │   embedding models)                         
  you pay vendors only  │\n      │   + any HuggingFace or Ollama model.          when YOU chose to.    │\n      └──────────────────────────────────────────────────────────────────────┘\n```\n\n**One-sentence version:** Flock is the layer that lets your tools talk to *any* LLM — open-weight on your hardware, or hosted Claude / GPT — through **one URL and one API key**, with the team controls (quotas, audit, per-user keys) that the raw vendor APIs don't give you.\n\n---\n\n## 🚀 Try it in 60 seconds\n\nFlock is engine-agnostic. The quickest path uses **Ollama** as the local engine — but vLLM, MLX-LM, and llama.cpp-RPC all work. See [Choose your engine](#choose-your-engine) below for the alternatives.\n\n### 🍎 macOS (Apple Silicon — M1/M2/M3/M4)\n\n```bash\n# 1. install Flock\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh\nexport PATH=\"$HOME/.local/bin:$PATH\"   # if the installer says so\n\n# 2. install an engine (pick one) — Ollama is the simplest default\nbrew install --cask ollama \u0026\u0026 open -a Ollama\n# alternatives: pip install mlx-lm  ·  or run llama.cpp's llama-server  ·  or run vLLM in Docker\n\n# 3. 
start Flock with a tiny model (~1 GB, fast download)\nFLOCK_DEFAULT_MODEL=llama-3.2-1b flock up\n```\n\n### 🐧 Linux (x86_64 or arm64) — including Raspberry Pi, NAS, edge boxes\n\n**Option A — `.deb` / `.rpm` package** (recommended for Debian / Ubuntu / Raspbian / QNAP / Asustor / Fedora / RHEL):\n\n```bash\n# Debian / Ubuntu / Raspbian (arm64 example — also amd64)\ncurl -LO https://github.com/hadihonarvar/flock/releases/latest/download/flock_VERSION_linux_arm64.deb\nsudo dpkg -i flock_VERSION_linux_arm64.deb\n# Binary at /usr/bin/flock, catalog at /usr/share/flock/catalog\n# Recommends llama.cpp for sharding — install via apt if you want it.\n\n# Fedora / RHEL / CentOS\nsudo rpm -i https://github.com/hadihonarvar/flock/releases/latest/download/flock_VERSION_linux_amd64.rpm\n```\n\n(Replace `VERSION` with the latest from [Releases](https://github.com/hadihonarvar/flock/releases). The package version stays current via your distro's normal upgrade path — `flock update` also works as an in-place binary swap for non-package installs.)\n\n**Option B — install.sh** (works everywhere; drops binary in `~/.local/bin/` and catalog in `~/.flock/catalog/`):\n\n```bash\n# 1. install Flock\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh\necho 'export PATH=\"$HOME/.local/bin:$PATH\"' \u003e\u003e ~/.bashrc \u0026\u0026 source ~/.bashrc\n\n# 2. install an engine (pick one) — Ollama is the simplest default\ncurl -fsSL https://ollama.com/install.sh | sh \u0026\u0026 sudo systemctl enable --now ollama\n# alternatives: vLLM in Docker for NVIDIA  ·  llama.cpp's llama-server  ·  MLX-LM (Apple Silicon only)\n\n# 3. start Flock with a tiny model (~1 GB, fast download)\nFLOCK_DEFAULT_MODEL=llama-3.2-1b flock up\n```\n\n\u003e 💡 Not sure which engine to install? 
Run `flock doctor` after step 1 — it inspects your hardware and tells you the single command to run.\n\n### What you should see (both platforms)\n\nFlock prints something like:\n\n```\n✔ default model: llama-3.2-1b\n✔ engine: ollama at http://127.0.0.1:11434\n  Flock is ready.\n  API:    http://localhost:8080/v1\n  Admin API key:   sk-orc-xK9p…\n```\n\n**Every command supports `--help`** — `flock \u003ccmd\u003e --help` prints usage, flags, and examples.\n\n**Copy that admin key.** In another terminal:\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer sk-orc-xK9p…\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"model\":\"auto\",\"messages\":[{\"role\":\"user\",\"content\":\"hi in 5 words\"}]}'\n```\n\nYou should see a JSON response with a 5-word reply. 🎉\n\n**Or use the web dashboard**: open `http://localhost:8080` and paste the admin key.\n\n**Or wire up Claude Code**: in any terminal where you use Claude Code, set:\n\n```bash\nexport ANTHROPIC_BASE_URL=http://localhost:8080\nexport ANTHROPIC_AUTH_TOKEN=sk-orc-xK9p…\nclaude\n```\n\n…and Claude Code talks to your local model instead of paying for the API.\n\n**If something breaks**, run `flock doctor` — it tells you exactly what to fix. Common issues are in the [Troubleshooting installation](#troubleshooting-installation) section.\n\n---\n\n| | |\n|---|---|\n| **Status** | Beta — single-node verified end-to-end (curl, dashboard, CLI); multi-node routing has in-process E2E coverage (`internal/controlplane/two_node_e2e_test.go`); real two-machine verification via the [30-sec smoke script](scripts/two-node-smoke.sh) + [manual walkthrough](docs/TWO_NODE_VERIFICATION.md). Auto-released on every `feat:` / `fix:` commit (see [Releases](https://github.com/hadihonarvar/flock/releases)). 
|\n| **License** | Apache 2.0 |\n| **Language** | Go (orchestrator + embedded HTML UI) |\n| **Platforms** | macOS (Apple Silicon), Linux (x86_64, arm64) |\n\n## What's shipped\n\n### Core (single-node, works today)\n\n- ✅ Single binary (`go build ./cmd/flock` → 23 MB) — no Python or Docker required\n- ✅ **OpenAI-compatible** API (`/v1/chat/completions`, `/v1/models`) — Cursor, Aider, Continue, Zed, OpenAI SDK\n- ✅ **Anthropic-compatible** API (`/v1/messages`, `/v1/messages/count_tokens`) — Claude Code, Anthropic SDK\n- ✅ Streaming (SSE) for both protocols, with proper client-disconnect handling (no goroutine leaks)\n- ✅ **Hybrid fallback** — requests for `claude-*` or `gpt-*` transparently proxy to the real Anthropic / OpenAI API (set `ANTHROPIC_API_KEY` / `OPENAI_API_KEY`); protocol mismatch (e.g., Claude model on OpenAI route) returns a clear 400\n- ✅ Engine drivers: **Ollama**, **vLLM**, **MLX-LM**, **llama.cpp** (single-node *and* RPC mode; llama-server is **auto-spawned** when the catalog entry has `source.repo` set — no manual `llama-server` step)\n- ✅ Engine endpoints + API keys configurable per engine via env (`FLOCK_VLLM_ENDPOINT`, `VLLM_API_KEY`, …)\n- ✅ Hardware auto-detection (mac + linux + NVIDIA) and auto-pick a default model\n- ✅ Catalog with 37 curated model entries spanning Llama, Qwen, Gemma, MiMo, DeepSeek, GPT-OSS, Mistral, Phi, Kimi, GLM, Nemotron, StepFun, Moondream, Pixtral families — with `released:` dates and license metadata enforced by CI\n- ✅ Interactive picker (`flock model add|info|remove`, `flock connect` with no ID launches a fuzzy-filter picker — ↑↓/enter)\n- ✅ Shell completion (`flock completion bash|zsh|fish`)\n- ✅ Colored CLI output (auto-detects TTY; respects `NO_COLOR` / `FLOCK_NO_COLOR`)\n- ✅ `--json` on every read command (`model search/ls/info`, `status`, `usage`, `audit`) for scripting\n- ✅ `flock usage --summary` / `flock audit --summary` aggregate views (top models, p50/p95/p99, error rate, sparkline) — same data as the 
dashboard home view\n- ✅ First-run wizard on `flock up` (picker-driven starter-model install; skip with `--no-wizard`)\n- ✅ Real progress bar on `flock model add` with bytes/sec + ETA\n- ✅ `--dry-run` on `flock model add` (preview download size, engine, RAM check, ETA without pulling weights)\n- ✅ Confirmation prompt on `flock model remove` / `flock node remove` / `flock shard remove` (skip with `--yes`)\n- ✅ Did-you-mean for top-level subcommand typos (Damerau-Levenshtein over the command list)\n\n### Multi-node (cross-node routing — landed, untested with 2 real boxes)\n\n- ✅ `flock token create --node` issues a worker join token\n- ✅ `flock join \u003cleader\u003e?token=…` registers + starts a worker HTTP server bound to the LAN/tailnet address\n- ✅ Workers run their own engine (Ollama / vLLM / MLX); leader proxies inference requests to them\n- ✅ **Router** picks the right node per request: local-preferred if the model is loaded locally, otherwise least-loaded worker that has the model\n- ✅ **Heartbeat carries loaded models** every 5s; leader reconciles the placements table automatically\n- ✅ Agent handles auth errors gracefully (401 → exit, 404 → re-register, transient → exponential backoff)\n- ✅ **Sharding auto-orchestration** — `flock shard create \u003cmodel\u003e \u003cN\u003e` picks N workers, launches `rpc-server` on each via the worker process-supervisor API, launches the coordinator `llama-server --rpc \u003clist\u003e` locally, registers the placement, and the Router routes requests to the coordinator transparently. 
Web UI exposes the same in the Shards tab.\n- ✅ Process supervisor (`internal/agent/supervisor.go`) — Start/Stop/Logs with TCP-port readiness probe, used by the leader for the coordinator and by workers for rpc-server.\n- ⚠️ Tailscale `tsnet` mesh backend — interface defined; LAN backend ships, tsnet implementation pending\n\n### Multi-tenant + observability\n\n- ✅ Per-user API keys with scopes (admin / user / node), daily token quotas, audit log\n- ✅ Usage metering — every request recorded with model/protocol/tokens/latency; metrics fire even in dev mode (no key required)\n- ✅ Prometheus metrics at `/metrics`\n- ✅ Embedded web UI (single HTML, Tailwind via CDN) — dashboard home with sparkline + p50/p95/p99 + error rate + top model + recent activity strip; live polling (5s) on Nodes/Models/Usage/Audit; persistent top-bar chips for role + engine reachability + node/model counts; filterable catalog browser on the Models tab; \"Add a worker\" modal with one-time join token + copy-pasteable install-and-join snippets\n- ⚠️ OIDC for the UI — UI uses pasted admin key for now\n\n### Release + ops\n\n- ✅ GitHub Actions CI workflow\n- ✅ GoReleaser config + release workflow (auto-builds darwin/linux × arm64/amd64, creates Homebrew formula)\n- ✅ Homebrew formula template\n- ✅ install.sh (`curl … | sh`) script — pulls latest from GH Releases when you tag one\n\n### Verified to work\n\n- ✅ `go build ./cmd/flock` — clean on go 1.25 / darwin-arm64\n- ✅ `go vet ./...` — clean\n- ✅ `flock up` boots, bootstraps admin key, starts gateway\n- ✅ `flock up` → `curl /v1/models` returns the auto-picked model\n- ✅ `curl /v1/chat/completions` reaches Ollama and translates errors back as proper OpenAI shape\n- ⚠️ Actual model inference response — Homebrew's `ollama` formula on arm64 is broken (missing internal `llama-server` binary); use `brew install --cask ollama` or `curl -fsSL https://ollama.com/install.sh | sh` for a working Ollama install\n\n**For new users**: see 
[QUICKSTART.md](QUICKSTART.md) — 3-minute install + first chat completion.\n**For full usage docs**: keep reading this file.\n**For contributors**: see [ARCHITECTURE.md](ARCHITECTURE.md).\n**For the dev team's roadmap**: see [TASKS.md](TASKS.md).\n\n---\n\n## Table of contents\n\n- [Why Flock?](#why-flock)\n- [60-second quick start](#60-second-quick-start)\n- [Who is this for?](#who-is-this-for)\n- [Architecture overview](#architecture-overview)\n- [Features](#features)\n- [Supported models](#supported-models)\n- [Supported clients](#supported-clients)\n- [Hardware recommendations](#hardware-recommendations)\n- [Installation](#installation)\n- [Configuration](#configuration)\n- [Cluster operations](#cluster-operations)\n- [Managing models](#managing-models)\n- [Connecting clients](#connecting-clients)\n- [API reference](#api-reference)\n- [CLI reference](#cli-reference)\n- [Web UI](#web-ui)\n- [Troubleshooting](#troubleshooting)\n- [FAQ](#faq)\n- [License](#license)\n\n---\n\n## Why Flock?\n\nAI coding tools are the new dev tax. Cursor, Claude Code, Copilot, custom agents — every team uses them, and the bill grows with usage. A single engineer running modern agentic tools heavily can burn $200–500/month in API tokens. For a team of 10 that's $24–60k a year, and rising. Every request also sends proprietary code to a third party.\n\nThere are excellent open-weight models now — Qwen3-Coder, Llama 3.3, DeepSeek-V3 — that match or exceed paid APIs for most coding work. But running them across a few machines, exposing them through one API, routing traffic intelligently, and making it all feel as easy as `pip install` is *not* a solved problem.\n\n**Flock is the orchestration layer.** It does for self-hosted LLMs what Kubernetes did for web services — minus the YAML. One binary. One install command. Auto-discovery. Auto-placement. Drop-in compatibility with every tool you already use.\n\n### Design principles\n\n1. **One binary, zero dependencies.** Static Go executable. 
No Python, no Docker (unless you want it), no virtualenv. Curl it down and run.\n2. **Zero config to first response.** Smart defaults everywhere. Hardware auto-detected. Model auto-picked. Network auto-meshed.\n3. **The UI tells you the next step.** Every state in the web UI has a clear, copy-pasteable next action. Juniors should never stare at a blank prompt.\n4. **Heterogeneous is invisible.** Mac, NVIDIA, AMD — the user picks models, not hardware.\n5. **OpenAI- and Anthropic-compatible from day one.** Same endpoint serves both protocols.\n6. **Permissive open source.** Apache 2.0. No open-core gotchas.\n7. **The CLI is the source of truth.** Every user-facing capability ships as a `flock` CLI command first. The web UI is a thin wrapper — it invokes the same Go functions the CLI invokes, never reimplements logic. If you can do it in the UI, you can do it in CI / scripts / SSH sessions, and vice versa.\n8. **Adding or switching a model is one action.** No hand-written YAML, no manual GGUF downloads, no separate worker-side setup. `flock model add hf:owner/repo` does the rest — picks engine, picks quant, shards if needed, distributes weights, warms the model. 
The default model is auto-picked from hardware on first `flock up`; to change it later, set `router.default_model` in `~/.flock/config.yaml` and restart, or `FLOCK_DEFAULT_MODEL=\u003cid\u003e flock up`.\n\n---\n\n## 60-second quick start\n\n### On the first machine (becomes the leader)\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh\nflock up\n```\n\nYou'll see:\n\n```\n▶ detected darwin/arm64 · 24 GB RAM · 8 cores\n✔ default model: qwen-coder-7b\n✔ engine: ollama at http://127.0.0.1:11434\n▶ pulling qwen-coder-7b · downloading [████████████████████] 4.7/4.7 GB · 85 MB/s · ETA 0:00\n✔ model ready: qwen-coder-7b\n\n  Flock is ready.\n\n  Dashboard: http://localhost:8080\n  API:    http://localhost:8080/v1\n  Key:    sk-orc-xK9p…  (also in UI)\n\n  Add another machine:\n    curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh -s -- join flock-7f3a.ts.net?token=…\n```\n\n### On any additional machine\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh -s -- join flock-7f3a.ts.net?token=…\n```\n\nThe agent auto-joins the mesh, registers its capabilities, and the leader assigns it a model. You don't pick anything; you don't open any firewall ports.\n\n### Test it from your terminal\n\n```bash\ncurl http://localhost:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer sk-orc-xK9p…\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\n    \"model\": \"auto\",\n    \"messages\": [{\"role\":\"user\",\"content\":\"write fizzbuzz in rust\"}]\n  }'\n```\n\n### Use it from Claude Code\n\n```bash\nexport ANTHROPIC_BASE_URL=http://localhost:8080\nexport ANTHROPIC_AUTH_TOKEN=sk-orc-xK9p…\nclaude\n```\n\nClaude Code is now talking to your local Qwen-Coder. 
Same UX, your hardware.\n\n---\n\n## Who is this for?\n\n| You are… | Flock helps you… |\n|---|---|\n| A **10–50 person dev team** spending $30k+/yr on Claude/GPT APIs | Run the same workflows on hardware that pays for itself in \u003c6 months |\n| A **regulated org** (legal, health, defense) that can't send code to third parties | Keep 100% of inference on-prem; optional opt-in fallback to vendor APIs |\n| An **AI/ML lab** with mixed-spec workstations and lab Macs | Pool all of it into one cluster behind one API |\n| A **solo developer** who wants one endpoint covering their laptop, home server, and lab GPU | Use Cursor/Claude Code anywhere with the same key |\n| A **classroom or research group** | Give every student a real LLM endpoint without per-seat costs |\n| An **MSP or platform team** | Offer \"internal Claude\" as a service to product teams without lock-in |\n\n### Non-goals\n\n- **Training or fine-tuning** — Flock serves inference. Use Axolotl / Unsloth / torchtune for training, import the adapter.\n- **Replacing real Claude Opus** — open models won't match Anthropic's frontier for long agentic runs. Flock makes the hybrid clean, not the choice unnecessary.\n- **A SaaS product** — Flock is the software you run. 
The OSS is always complete.\n\n---\n\n## Architecture overview\n\n```\n   CLIENTS  (Cursor · Claude Code · Aider · SDKs · curl)\n                       │\n                       ▼  one endpoint, one key\n   ┌──────────────────────────────────────────────────┐\n   │  GATEWAY      OpenAI + Anthropic compatible      │\n   │               auth · routing · streaming · log   │\n   └────────────────────┬─────────────────────────────┘\n                        │\n        ┌───────────────┼──────────────────┐\n        ▼               ▼                  ▼\n   ┌────────────┐ ┌────────────┐    ┌──────────────────┐\n   │ Worker A   │ │ Worker B   │    │ External APIs    │\n   │ Linux+GPU  │ │ Mac Mini   │    │ (Claude, GPT…    │\n   │ vLLM       │ │ MLX-LM     │    │  fallback)       │\n   └────────────┘ └────────────┘    └──────────────────┘\n        ▲               ▲\n        │               │  heartbeats, assignments\n   ┌────┴───────────────┴──────────────────────────────┐\n   │  CONTROL PLANE                                    │\n   │  node registry · model registry · scheduler · UI  │\n   └───────────────────────────────────────────────────┘\n                        ▲\n                        │ embedded Tailscale mesh\n                        │ (mTLS, NAT-traversed)\n```\n\nSee [ARCHITECTURE.md](ARCHITECTURE.md) for the full design.\n\n---\n\n## Features\n\n### Inference\n\n- OpenAI-compatible API (`/v1/chat/completions`, `/v1/embeddings`, `/v1/models`)\n- Anthropic-compatible API (`/v1/messages`, `/v1/messages/count_tokens`)\n- SSE streaming\n- Tool / function calling (pass-through for capable models)\n- Vision (image input) on multimodal models — `image_url` content blocks on `/v1/chat/completions` route through the Ollama engine path\n- Structured output (JSON schema)\n- `model=auto` smart routing\n- Sticky sessions by user/session ID for KV cache reuse\n- Typed `engine_unreachable` errors with engine name, endpoint, and start-hint (e.g. 
`ollama serve`) when the upstream engine isn't responding\n- Engine health watchdog on auto-spawned engines (force-restart after 3 consecutive failures, covers hung llama-server)\n- LoRA adapter hot-loading (planned)\n- `/v1/completions`, `/v1/audio/transcriptions`, `/v1/rerank` (planned)\n\n### Cluster\n\n- Auto-discovery — a node joins by running one command with a token\n- Auto-placement — scheduler picks which node(s) host which model\n- Heterogeneous sharding via llama.cpp RPC for models larger than any single node — `flock shard create \u003cmodel\u003e \u003cN\u003e` orchestrates the coordinator + every rpc-server end-to-end\n- Live model migration (planned)\n- Cross-platform workers: Mac (MLX), Linux+NVIDIA (vLLM), Linux+AMD (vLLM ROCm — planned), CPU (llama.cpp fallback)\n- HA leader (planned)\n\n### Multi-tenancy\n\n- Per-user API keys with revocation and scopes (admin / user / node)\n- Daily token quotas per key with usage metering\n- Audit log of every admin mutation\n- OIDC login for the web UI (Google, GitHub, Okta) — **planned**; the UI currently uses a pasted admin key\n\n### Hybrid local + cloud\n\n- Built-in egress adapters for Anthropic + OpenAI; vendor model IDs (`claude-*`, `gpt-*`) transparently proxy upstream when `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` is set\n- Failure-based fallback chain: any catalog entry can declare `fallback: [next-id, …]` and the router will try the chain in order on engine errors, 503s, or timeouts (transparent to the client)\n- **AWS Bedrock**: SigV4 signing for `anthropic.*` models (non-streaming). Streaming body translation for other families pending.\n- **GCP Vertex**: ADC auth probe wired. 
Body translation for `generateContent` pending.\n\n### Observability\n\n- Prometheus metrics endpoint (`/metrics`) — per-model RPS, latency, tokens, errors\n- Per-call usage records (model, protocol, tokens, latency, outcome) via `flock usage` and the Usage tab\n- Admin audit log via `flock audit` and the Audit tab\n- Reference Grafana dashboards in [`dashboards/`](dashboards/) — `cluster-overview.json`, `per-model.json`, `per-node.json`. Import any of them into Grafana 10+ and point at your Prometheus scrape of Flock's `/metrics`.\n- OpenTelemetry / OTLP traces. Set `observability.otlp_endpoint` (or `FLOCK_OTLP_ENDPOINT`) to your collector — e.g. `http://localhost:4318` — and Flock emits a full span hierarchy per request: `http.request` → `router.Chat` (covers the whole stream) → `router.Chat.attempt` (one per fallback retry) → `\u003cengine\u003e.Chat` (engine call with prompt/completion token counts). All four engine drivers (ollama, vllm, mlx, llamacpp) export the same span shape. W3C `traceparent` propagation is always on so Flock participates correctly between two services that both export. 
Empty endpoint = no-op (zero overhead beyond the NoopTracerProvider).\n\n### Developer experience\n\n- One-line install (`curl | sh`)\n- One-line model add (`flock model add qwen3.6-27b`) with a real progress bar and `--dry-run` preview\n- One-line client config (UI generates per-tool snippets)\n- Interactive picker for `flock model add|info|remove` and `flock connect` — no need to memorize IDs\n- Shell completion for bash / zsh / fish (`flock completion \u003cshell\u003e`)\n- Sensible defaults, no required flags\n- Embedded web UI — no separate frontend to deploy\n\n---\n\n## Supported models\n\n\u003e **For the complete per-model walkthrough** (system requirements, performance per platform, install + use snippets for every client) see **[MODELS.md](MODELS.md)**.\n\nFlock ships a curated catalog of **37 open-weight models** in `catalog/*.yaml`, spanning everything from 1 B edge models to 1 T-parameter sharded frontier MoE. Any other model also works via `flock model add hf:\u003cowner\u003e/\u003crepo\u003e` (HuggingFace direct) or `flock model add ollama:\u003cname\u003e` (any Ollama-pullable tag). 
See [catalog/README.md](catalog/README.md) for the YAML schema if you want to PR an entry.\n\n\u003e 📋 **Picker table — what to install** — full table with size, RAM, chat/code/reasoning/vision/audio/context ratings and license per model: **[MODELS.md → Picker table](MODELS.md#-picker-table--what-to-install)**.\n\n### Shipped catalog at a glance\n\n| Tier | Models |\n|---|---|\n| **Edge (≤2 GB RAM)** | `llama-3.2-1b`, `llama-3.2-3b` |\n| **Small / laptop (8-16 GB)** | `qwen-coder-7b`, `deepseek-r1-8b`, `lfm2.5-8b-a1b` ⭐, `qwen3-8b`, `mellum2-12b`, `mistral-nemo-12b`, `gemma4-12b` (multimodal), `qwen3-14b`, `qwen-coder-14b`, `phi-4-14b` |\n| **Consumer big (16-32 GB)** | `gpt-oss-20b` ⭐, `qwen3.6-27b` ⭐, `gemma4-26b`, `qwen3-30b`, `qwen3-coder-30b`, `qwen-coder-32b` |\n| **Single 80 GB GPU** | `llama-3.3-70b-sharded`, `gpt-oss-120b`, `llama-4-scout` (10M ctx, multimodal) |\n| **Sharded frontier (≥128 GB combined)** | `step-3.7-flash-sharded` ⭐ (Apache-2.0), `deepseek-v4-flash-sharded`, `nemotron-3-ultra-sharded` (Mamba-MoE, 1M ctx), `glm-5.1-sharded`, `kimi-k2.6-sharded` |\n\n⭐ = current top picks (June 2026).\n\nRun `flock model search` to list everything live with sizes and capabilities, or `flock model info \u003cid\u003e` for one model's full spec. Add `--sort=released` for newest-first, `--since 2026-01-01` to filter by date, or `--json` for machine-readable output. `flock model ls`, `flock status`, `flock usage`, and `flock audit` also accept `--json`. Running any `flock model add|info|remove` or `flock connect` with no ID launches an interactive picker (type to filter; arrow keys to navigate). 
Output is colored when stdout is a TTY; set `NO_COLOR=1` (or `FLOCK_NO_COLOR=1`) to disable.\n\nThe dashboard at `http://localhost:8080` mirrors the CLI: persistent top-bar chips show role + engine reachability + node/model counts (polled every 5 s); the Home tab summarizes traffic (requests-per-minute sparkline, p50/p95/p99, error rate, top model, recent activity); the Models tab includes a filterable catalog browser with per-row install; Nodes / Models / Usage / Audit refresh live while their tab is active; and \"Add a worker\" generates a one-time join token with copy-pasteable install-and-join snippets.\n\nThe same aggregates are available from the CLI: `flock usage --summary` and `flock audit --summary` print the top-models / p50-p95-p99 / error-rate / sparkline view that the dashboard renders. Both also accept `--json`.\n\nEngine reliability: when Flock auto-spawned the engine itself (`flock up` with `FLOCK_ENGINE=llamacpp`), a health watchdog polls every 30 s and force-restarts the process after three consecutive failures — so a hung `llama-server` no longer requires manual intervention. For user-managed engines (Ollama, vLLM) Flock leaves the process alone but `/v1/chat/completions` now returns a typed `engine_unreachable` error with the engine name, endpoint, and the exact command to start it (`ollama serve`, `mlx_lm.server …`, etc.) 
when the engine isn't responding.\n\n### Proxied (paid APIs — shipped, works today)\n\nWhen a request's model name matches one of these, Flock proxies to the upstream vendor with **your** API key (env-configured) and logs the call as usage like any other request:\n\n- **Anthropic upstream**: any `claude-*` model id\n- **OpenAI upstream**: `gpt-*`, `o1*`, `o3*`, `o4*` model ids\n\nRouting logic lives in `internal/api/egress.go`; vendor detection in `internal/router/router.go`.\n\n### Roadmap — model families not yet in catalog\n\nThese work today via `flock model add hf:owner/repo` but don't have curated YAML entries with hardware specs:\n\n- **Larger general / agent models** — Qwen3-235B, MiniMax-M2.7, MiMo-V2 sharded variants — pending sharded YAML entries.\n- **Speech / transcription** — `/v1/audio/transcriptions` not yet shipped.\n- **Rerank** — `/v1/rerank` not yet shipped (capability declared in catalog schema for future use).\n\nShipped recently (no longer on this list):\n- **Vision (image input)** — `gemma4-12b`, `gemma4-26b`, `gemma4-31b`, `gemma4-e2b`, `gemma4-e4b`, `qwen3-vl-8b`, `qwen3-vl-32b`, `pixtral-12b`, `moondream3`, `mimo-vl-7b`, `llama-4-scout` all serve through `/v1/chat/completions` with `image_url` content blocks.\n- **Embeddings (for RAG)** — `/v1/embeddings` is live; install `nomic-embed-text` and call it from any OpenAI-shape embedding client.\n- **Audio (input)** — `mimo-audio`, `gemma4-e2b`, `gemma4-e4b` declare `audio` capability for future routing; today they serve as `chat` models.\n\n---\n\n## Supported clients\n\nThe web UI generates a copy-pasteable config snippet for each tool.\n\n| Client | Protocol | Config |\n|---|---|---|\n| **Cursor** | OpenAI | Settings → Models → Override OpenAI Base URL |\n| **Continue.dev** | OpenAI or Anthropic | `~/.continue/config.json` → `apiBase` |\n| **Aider** | OpenAI | `aider --openai-api-base http://flock:8080/v1` |\n| **Zed** | OpenAI | `language_models.openai_compatible.api_url` |\n| **Cline / 
Roo Code** (VS Code) | OpenAI or Anthropic | Provider settings panel |\n| **Claude Code** | Anthropic | `ANTHROPIC_BASE_URL` env var |\n| **OpenAI Python SDK** | OpenAI | `OpenAI(base_url=…, api_key=…)` |\n| **Anthropic Python SDK** | Anthropic | `Anthropic(base_url=…, api_key=…)` |\n| **LangChain / LlamaIndex** | Either | `openai_api_base` or `anthropic_api_url` |\n| **`qwen-code` / `OpenCode`** | Anthropic | Same as Claude Code |\n| **curl** | Either | Direct |\n\n---\n\n## Hardware recommendations\n\n### Solo / dev (1 node)\n\n| Hardware | Models that fit | Good for |\n|---|---|---|\n| MacBook M2/M3, 16 GB | 3–7B Q4 | Autocomplete, learning |\n| MacBook M3/M4 Pro, 24–36 GB | 7–14B Q4 | Real coding work |\n| Mac Mini M4 Pro, 64 GB | up to 32B Q4 | Solo agent-grade |\n| Linux + RTX 4090 (24 GB) | up to 32B AWQ | Solo agent-grade, batched |\n\n### Team of ~10 (recommended)\n\n| Role | Box | Cost |\n|---|---|---|\n| Big chat/agent model | Linux + 2× RTX 5090 (64 GB total), Threadripper, 128 GB RAM | ~$11k |\n| Code completion #1 | Mac Mini M4 Pro 64 GB | ~$2k |\n| Code completion #2 | Mac Mini M4 Pro 64 GB | ~$2k |\n| Control plane | Mac Mini base / NUC | ~$1k |\n| Network | 10 GbE switch + cables | ~$0.5k |\n| **Total** | | **~$16k** |\n\nServes ~10 heavy users with headroom. Power draw ~300 W idle, ~900 W peak. Fits one 20 A circuit. Breaks even vs. typical Claude/GPT spend in ~5 months.\n\n### Larger team / production\n\n- 1× H100 80 GB or 2× A100 80 GB for the flagship model\n- 2× Mac Mini for completion\n- 1× dedicated control box\n\nServes 25–50 users comfortably.\n\n---\n\n## Installation\n\n### Prerequisites — read first\n\nFlock is a **gateway** — it doesn't include an LLM engine. 
You need one of:\n- **Ollama** (recommended for most users; works on Mac + Linux + NVIDIA + CPU)\n- vLLM (for NVIDIA GPUs at scale — Linux only)\n- MLX-LM (for fastest perf on Apple Silicon)\n\n\u003e ⚠️ **Apple Silicon heads-up:** the Homebrew `ollama` formula is currently missing the internal `llama-server` binary — model inference fails with `500: llama-server binary not found`. Use the **cask** (`brew install --cask ollama`) or the official installer instead. The Flock installer detects this and warns you.\n\n### macOS (Apple Silicon)\n\n```bash\n# 1. install Ollama (use cask, NOT plain `brew install ollama`)\nbrew install --cask ollama\nopen -a Ollama                      # starts the daemon\n\n# 2. install Flock\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh\n\n# 3. add the install dir to PATH if the installer says so, e.g.:\nexport PATH=\"$HOME/.local/bin:$PATH\"\n\n# 4. start Flock\nflock up\n```\n\n### Linux (x86_64 or arm64)\n\n```bash\n# 1. install Ollama\ncurl -fsSL https://ollama.com/install.sh | sh\nsudo systemctl enable --now ollama   # or just: ollama serve \u0026\n\n# 2. install Flock\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh\n\n# 3. add install dir to PATH if needed\necho 'export PATH=\"$HOME/.local/bin:$PATH\"' \u003e\u003e ~/.bashrc\nsource ~/.bashrc\n\n# 4. start Flock\nflock up\n```\n\n### What the installer does\n\n1. Detects your OS + architecture (must be macOS/arm64, Linux/x86_64, or Linux/arm64)\n2. Checks for required shell tools (curl, tar)\n3. Checks whether Ollama is installed and warns with the install command if not\n4. Detects the broken-Homebrew-ollama case on macOS and tells you how to fix it\n5. Fetches the **latest release** binary from GitHub Releases\n6. Verifies SHA-256 against `checksums.txt`\n7. Installs to `~/.local/bin/flock` (or `/usr/local/bin/flock` with sudo)\n8. 
Drops the bundled model catalog (`*.yaml`) into `~/.flock/catalog/` so `flock up` works without further setup\n9. Prints next steps + tells you if PATH needs updating\n\n### Installer flags (after `| sh -s --`)\n\n```bash\n--help                  show usage\n--version \u003cvX.Y.Z\u003e      install a specific version\n--install-dir \u003cpath\u003e    install to a specific dir\n--no-engine             skip the Ollama check\n--dry-run               show what would happen, no writes\n```\n\n### Installer env vars (alternative to flags)\n\n```bash\n# pin a specific version (skips the GH API lookup — also avoids the 60/hr rate limit)\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh \\\n  | FLOCK_VERSION=v1.14.0 sh\n\n# install to a custom dir\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh \\\n  | FLOCK_INSTALL_DIR=/opt/flock/bin sh\n\n# skip the Ollama check (CI, custom engine setups)\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh \\\n  | FLOCK_SKIP_ENGINE=1 sh\n```\n\nInstall **and** join a cluster in one command:\n\n```bash\ncurl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | \\\n    sh -s -- join https://leader.local:8080?token=\u003cTOKEN\u003e\n```\n\n### Upgrade / uninstall\n\n```bash\n# upgrade in place (no need to re-run the installer)\nflock update              # downloads latest release, verifies SHA-256, swaps binary\nflock update --check      # just check, don't install\n\n# uninstall — remove binary, catalog, and data dir\nrm -f ~/.local/bin/flock       # or /usr/local/bin/flock if you sudo-installed\nrm -rf ~/.flock                 # catalog + data + config (destructive)\n```\n\n### Build from source\n\n```bash\ngit clone https://github.com/hadihonarvar/flock\ncd flock\ngo build -o flock ./cmd/flock\n./flock version\n```\n\nRequires Go 1.25+. 
See [ARCHITECTURE.md → Build from source](ARCHITECTURE.md#build-from-source) for cross-compile + release builds.\n\n### System requirements\n\n- **macOS** 13+ on Apple Silicon (M1 or newer). Intel Macs not tested.\n- **Linux** x86_64 or arm64 (Ubuntu 22.04+, Debian 12+, Fedora 39+, RHEL 9+).\n- **Linux + NVIDIA**: NVIDIA driver 535+ (for vLLM); CUDA installed via the standard NVIDIA repos.\n- **RAM**: 8 GB minimum, 16+ GB recommended; whatever model you load needs to fit.\n- **Disk**: 50 GB for the binary + configs + small model cache; 200+ GB if you'll cache 70B-class models.\n- **Network**: outbound HTTPS to GitHub + HuggingFace for downloading.\n\n### Troubleshooting installation\n\n| Symptom | Cause | Fix |\n|---|---|---|\n| `curl: (22) … 404` from installer | No release yet for your platform | Check https://github.com/hadihonarvar/flock/releases ; specify `--version` if needed |\n| `command not found: flock` after install | Install dir not on PATH | `export PATH=\"$HOME/.local/bin:$PATH\"` in your shell rc |\n| `flock up` works, but chat returns 502 `llama-server binary not found` | Homebrew `ollama` formula on Apple Silicon | `brew uninstall ollama \u0026\u0026 brew install --cask ollama` |\n| `flock up` says \"engine not reachable\" | Ollama daemon not running | `ollama serve \u0026` (Linux: `sudo systemctl start ollama`) |\n| `Port 8080 in use` | Another process is using the port | `FLOCK_LISTEN=:8081 flock up` |\n| `checksum MISMATCH` | Corrupt download or tampering | Re-run installer; if it persists, file a security report (see SECURITY.md) |\n| GH API rate-limited during install | Anonymous GH API limit (60/hr) | Wait, or set `FLOCK_VERSION=v0.x.y` to skip the lookup |\n\n---\n\n## Configuration\n\nFlock follows a strict \"no config required for defaults\" rule. Every flag has a sensible default. 
The config file is YAML at `~/.flock/config.yaml`, or use env vars (`FLOCK_LISTEN`, `FLOCK_DATA_DIR`, …).\n\n### Minimal config (auto-generated on first `flock up`)\n\n```yaml\n# ~/.flock/config.yaml\nlisten: \":8080\"\ndata_dir: \"~/.flock\"\nauth:\n  require_keys: true   # set false for local-only dev mode\n```\n\nThe initial admin key is auto-generated on first `flock up` and printed to stderr — copy it then. There is no `auth.initial_admin_key` field; the key lives in the SQLite store, not the YAML.\n\n### Full reference\n\nEvery field below is parsed by `internal/config/config.go`. Anything not in this list is silently ignored.\n\n```yaml\nlisten: \":8080\"                       # HTTP listen address (used by leader and workers)\nexternal_url: \"\"                      # public URL printed in UI; empty → use listen addr\ndata_dir: \"~/.flock\"                  # root for state.db, models, logs\nlog_level: \"info\"                     # debug | info | warn | error\ncatalog_dir: \"\"                       # empty → built-in catalog/ directory\n\nstorage:\n  type: \"sqlite\"                      # only sqlite ships today\n  dsn: \"~/.flock/state.db\"\n  models_dir: \"~/.flock/models\"\n\nauth:\n  require_keys: true                  # set false to disable API-key auth (dev only)\n\nengine:\n  preferred: \"ollama\"                 # ollama | vllm | mlx | llamacpp\n  ollama_endpoint:   \"http://127.0.0.1:11434\"\n  vllm_endpoint:     \"http://127.0.0.1:8000\"\n  mlx_endpoint:      \"http://127.0.0.1:8080\"\n  llamacpp_endpoint: \"http://127.0.0.1:8089\"   # llama-server (single-node or RPC coordinator) — port chosen to avoid Flock leader :8080 and worker :8081\n\nrouter:\n  default_model: \"\"                   # empty → auto-pick on first up\n  sticky_sessions: true\n  latency_fallback_p95_seconds: 0     # 0 = disabled. 
When \u003e0, the router\n                                       # walks the catalog `fallback:` chain\n                                       # for a faster candidate FIRST whenever\n                                       # the primary's recent p95 latency\n                                       # exceeds this many seconds. Bet #1.\n  fallback:\n    enabled: false                    # true → forward unknown claude-*/gpt-* models to vendor\n    anthropic_url: \"https://api.anthropic.com\"\n    openai_url:    \"https://api.openai.com\"\n    # Bedrock (AWS) — signed via aws-sdk-go-v2 using the standard AWS\n    # credentials chain (env, shared config, instance role). v0.6 supports\n    # the anthropic.* model family non-streaming; amazon.*/meta.*/mistral.*\n    # return 501 (body translation arrives v0.7).\n    bedrock_region: \"\"                # e.g. us-east-1\n    # Vertex (GCP) — ADC auth probe wired; body translation for\n    # generateContent lands v0.7. Set the project and a 501 with ADC\n    # status returns until then.\n    vertex_project:  \"\"               # GCP project id\n    vertex_location: \"us-central1\"\n\nobservability:\n  otlp_endpoint: \"\"                   # e.g. 
http://localhost:4318 — empty disables tracing (no-op overhead)\n```\n\n### Environment variables\n\n| Var | Overrides |\n|---|---|\n| `FLOCK_LISTEN` | `listen` |\n| `FLOCK_DATA_DIR` | `data_dir` |\n| `FLOCK_LOG_LEVEL` | `log_level` |\n| `FLOCK_EXTERNAL_URL` | `external_url` |\n| `FLOCK_ENGINE` | `engine.preferred` |\n| `FLOCK_OLLAMA_ENDPOINT` / `FLOCK_VLLM_ENDPOINT` / `FLOCK_MLX_ENDPOINT` / `FLOCK_LLAMACPP_ENDPOINT` | corresponding `engine.*_endpoint` |\n| `VLLM_API_KEY` | bearer token sent to a vLLM server (no YAML equivalent) |\n| `FLOCK_REQUIRE_KEYS` | `auth.require_keys` (truthy `1/true/yes`) |\n| `FLOCK_DEFAULT_MODEL` | `router.default_model` |\n| `ANTHROPIC_API_KEY` / `OPENAI_API_KEY` | enables `router.fallback` for the matching vendor |\n| `FLOCK_CATALOG_DIR` | `catalog_dir` — overrides catalog lookup. Default search order: `$FLOCK_CATALOG_DIR` → `./catalog` → `\u003cexe-dir\u003e/catalog` → `~/.flock/catalog` (curl installer) → `/usr/local/share/flock/catalog` → `/usr/share/flock/catalog` (.deb/.rpm) |\n| `FLOCK_OTLP_ENDPOINT` | `observability.otlp_endpoint` (OTLP/HTTP collector URL or bare `host:port`) |\n| `FLOCK_COORDINATOR_NODE` | which node hosts the `llama-server` coordinator for sharded models; `local` forces leader, otherwise a node id. Default: highest-RAM worker. |\n| `FLOCK_REJECT_BEARER` | set to `1` on a worker to refuse the bearer-fallback auth path and require HMAC for every `/v1/process/*` call. Use once every leader is on v0.5+. |\n| `FLOCK_BEDROCK_REGION` | `router.fallback.bedrock_region` — enables Bedrock with real SigV4 signing for the anthropic.* family (v0.6); other families return 501 |\n| `FLOCK_VERTEX_PROJECT` | `router.fallback.vertex_project` — wires ADC auth check; body translation lands v0.7 |\n| `FLOCK_VERTEX_LOCATION` | `router.fallback.vertex_location` (default `us-central1`) |\n| `FLOCK_LATENCY_P95_SECONDS` | `router.latency_fallback_p95_seconds` — when primary p95 exceeds this, prefer a faster fallback. 
0 = disabled (default) |\n\n### Not yet configurable (roadmap)\n\nThese features are mentioned elsewhere in this README but have no YAML knob today. The list is here so you don't waste time guessing.\n\n- **Mesh backend selection** — only the LAN backend ships in v0.4. The `tailscale` (tsnet) backend has an interface defined in `internal/mesh/` but no implementation. Tracked in [ROADMAP.md](ROADMAP.md).\n- **OIDC for the UI** — `internal/auth/` ships API keys only. The UI uses a pasted admin key for now.\n- **Scheduler policy / replication / drain timeout** — `internal/scheduler/` ships sharding orchestration only; placement is naive least-loaded with no tunables.\n- **Per-model fallback routing** — the fallback chain is all-or-nothing today (any unknown `claude-*` → Anthropic, any unknown `gpt-*` → OpenAI). Per-model whitelists are not parsed.\n- **Separate Prometheus listener** — Prometheus is hardcoded to the main `/metrics` endpoint; there is no separate metrics listener. (OTLP tracing is configurable via `observability.otlp_endpoint` / `FLOCK_OTLP_ENDPOINT`.)\n- **Per-node config (`~/.flock/node.yaml`)** — not read. Workers inherit engine endpoints from the leader's config or their own env vars.\n\n### Per-node engine override\n\nWorkers run their own engine binary. To point a worker at a non-default endpoint, set env vars before `flock join`:\n\n```bash\n# quote the URL so the shell (zsh in particular) doesn't treat `?` as a glob\nFLOCK_ENGINE=vllm FLOCK_VLLM_ENDPOINT=http://127.0.0.1:8000 flock join \"http://leader:8080?token=...\"\n```\n\n---\n\n## Cluster operations\n\n### Start the leader\n\n```bash\nflock up\n```\n\nIdempotent. Re-running it shows status if already running.\n\n### Add a node\n\n1. From the leader: click **Add Node** in the UI, or run `flock token create --node`\n2. On the new machine: `curl -fsSL https://raw.githubusercontent.com/hadihonarvar/flock/main/installer/install.sh | sh -s -- join \u003cleader-url\u003e?token=\u003ctoken\u003e`\n\nThe token is a single-use, time-limited JWT that includes the tailnet auth key. 
The new node joins the mesh, registers with the leader, and waits for a model assignment.\n\n### Remove a node\n\n```bash\nflock node drain \u003cnode-id\u003e   # gracefully migrate models off\nflock node remove \u003cnode-id\u003e  # forget it\n```\n\n### End-to-end multi-node walkthrough\n\nFor a leader + one worker on the same LAN:\n\n```bash\n# === on the leader machine ===\nbrew install --cask ollama          # working Ollama (not the broken formula)\nollama serve \u0026\nflock up                            # bootstraps admin key, starts gateway on :8080\nflock model add llama-3.2-3b        # pulls on the leader's Ollama\nflock token create --node           # prints the worker join token\n\n# === on the worker machine ===\nbrew install --cask ollama\nollama serve \u0026\nflock join http://\u003cleader-host\u003e:8080?token=\u003ctoken\u003e   # registers + starts worker HTTP server\nflock model add qwen-coder-7b        # pulls on the worker's Ollama (reported back via heartbeat)\n\n# === back on the leader ===\nflock node ls                        # both nodes visible\n# requests for \"llama-3.2-3b\" stay local\n# requests for \"qwen-coder-7b\" get proxied to the worker automatically\n\n# === from your laptop ===\ncurl http://\u003cleader-host\u003e:8080/v1/chat/completions \\\n  -H \"Authorization: Bearer sk-orc-...\" \\\n  -d '{\"model\":\"qwen-coder-7b\",\"messages\":[{\"role\":\"user\",\"content\":\"hi\"}]}'\n# served by the worker, transparently\n```\n\n### Sharded models (split one brain across multiple machines)\n\nFor a model too large to fit on any single machine, Flock can split it across N workers using `llama.cpp`'s RPC backend. Flock orchestrates the whole thing — no SSHing into each box.\n\n**Prereqs:**\n- `brew install llama.cpp` on the leader (provides `llama-server` for the coordinator).\n- `rpc-server` on PATH on every worker that will host a shard. 
(At time of writing this binary needs a source build of llama.cpp with `cmake --preset rpc`; the Homebrew bottle doesn't include it yet.)\n- A catalog entry with `sharding.required: true` and `source.path` pointing at a local GGUF file the leader can read (see `catalog/llama-3.3-70b-sharded.yaml`).\n- N workers already joined and `ready` (`flock node ls`).\n\n**One command on the leader:**\n\n```bash\nflock model add llama-3.3-70b-sharded\n# auto-detects sharding.required=true → delegates to `flock shard create`\n\n# or explicitly:\nflock shard create llama-3.3-70b-sharded 2\n```\n\nWhat Flock does:\n\n1. Picks the 2 workers with the most free RAM\n2. Sends `POST /v1/process/start` to each worker → launches `rpc-server -p 50052`\n3. Waits for both rpc-servers to be TCP-reachable (readiness probe)\n4. On the leader, launches `llama-server -m \u003cgguf\u003e --rpc \u003cworker1\u003e:50052,\u003cworker2\u003e:50052 --port 9001`\n5. Waits for the coordinator to be reachable\n6. Persists shard rows + a `placements` row pointing the model at the local coordinator\n7. The Router routes any request for `llama-3.3-70b-sharded` to the coordinator, which fans out to the rpc-server shards internally\n\n**Manage from the CLI or web UI:**\n\n```bash\nflock shard ls                              # show every shard + coordinator\nflock shard remove llama-3.3-70b-sharded    # stops coordinator + every rpc-server, deletes rows\n```\n\nOr open `http://leader:8080` → **Shards** tab → \"Create sharded model\" form + per-model \"Tear down\" buttons.\n\n**Caveats (v0.4):**\n- Shard crash recovery is automatic for up to 5 restarts with exponential backoff (1s, 2s, 4s, 8s, 16s). After that the process enters `crashloop` state and the admin must intervene — typically by re-running `flock shard create`. Both `rpc-server` and the `llama-server` coordinator restart this way. 
See `internal/agent/supervisor.go`.\n- Coordinator always runs on the leader.\n- Worker bin-packing is naive (descending free-RAM); it doesn't factor in GPU memory or current load.\n\n### List nodes\n\n```bash\nflock node ls\n# ID            HOSTNAME      HARDWARE          ENGINE   MODEL              STATE\n# n_abc123      mac-mini-1    M4 Pro / 64 GB    mlx      qwen-coder-14b     ready\n# n_def456      gpu-tower     2× RTX 5090       vllm     qwen3-72b          ready\n# n_ghi789      lab-mac       M2 Pro / 32 GB    mlx      —                  idle\n```\n\n### Inspect a node\n\n```bash\nflock node show n_abc123\n```\n\nShows: hardware specs, current models, recent requests, error log, resource utilization.\n\n---\n\n## Managing models\n\n### Browse the catalog\n\n```bash\nflock model search coding\nflock model search vision\n```\n\n### Add a model\n\n```bash\nflock model add qwen3-coder           # from catalog\nflock model add hf:Qwen/Qwen3-72B-AWQ # from HuggingFace\nflock model add file:./my-finetune.gguf\n```\n\nThis:\n1. Checks `catalog/\u003cid\u003e.yaml`'s `hardware.min_ram_gb` (and `min_vram_gb`) against the cluster — if the cluster falls short of these minimums, the install is refused with a clear error. Pass `--force` to override (e.g. when you know swap or a quantization knob will save you).\n2. Records the model in the registry\n3. Picks the best node(s) to host it (or shards across multiple)\n4. Pulls the weights to those nodes (with resume support)\n5. Launches the right inference engine\n6. 
Flips the gateway routing to make the model available\n\n### List active models\n\n```bash\nflock model ls\n# MODEL              NODES                   STATE    REQUESTS/MIN   TOK/S\n# qwen-coder-14b     n_abc123, n_ghi789      serving  4.2            42\n# qwen3-72b          n_def456                serving  1.1            68\n```\n\n### Remove a model\n\n```bash\nflock model remove qwen-coder-14b\n```\n\n### Add a LoRA adapter (planned, v0.5)\n\nLoRA adapter loading (`flock model adapter add`) is on the roadmap; see TASKS.md.\n\n---\n\n## Connecting clients\n\nYou have **three ways** to wire up a tool: the CLI, the dashboard, or copy-paste from the snippets below. All three produce the same config — they all invoke the same `internal/control/` code path.\n\n### Fastest: `flock connect \u003cclient\u003e`\n\n```bash\nflock connect claude-code                          # Anthropic-shape: Claude Code, qwen-code, hermes\nflock connect cursor                               # OpenAI-shape: Cursor, Aider, Zed, OpenClaw, Codex CLI, …\nflock connect hermes                               # Nous Research's CLI agent w/ persistent memory\nflock connect open-webui                           # self-hosted ChatGPT-style web UI (Docker)\nflock connect open-notebook                        # OSS NotebookLM clone (sources → chat + podcast)\nflock connect goose                                # Block's OSS terminal agent\nflock connect plandex                              # terminal-native agentic planner (MIT)\nflock connect openhands                            # autonomous coding agent (formerly OpenDevin)\nflock connect codex-cli                            # OpenAI's official CLI\nflock connect opencode                             # terminal coding agent w/ per-provider baseURL\nflock connect --list                               # full client roster (19 today)\n\n# Overrides\nflock connect cursor --model qwen-coder-14b        # suggest a specific model\nflock connect aider --base-url 
https://flock.lan   # override gateway URL\nFLOCK_TOKEN=sk-orc-… flock connect aider           # use a non-default token\nflock connect aider --token sk-orc-…               # same, via flag\n```\n\nAnything that speaks OpenAI or Anthropic's API shape connects with one line. The full roster today: **claude-code**, **cursor**, **aider**, **continue**, **zed**, **cline**, **qwen-code**, **hermes**, **openclaw**, **opencode**, **open-webui**, **open-notebook**, **goose**, **plandex**, **openhands**, **codex-cli**, **openai-sdk**, **anthropic-sdk**, **curl**.\n\nToken comes from `--token`, then `$FLOCK_TOKEN`, then `~/.flock/admin.key` (written when you ran `flock up`). Base URL comes from `--base-url`, then `external_url` in `~/.flock/config.yaml`, then `http://localhost:\u003clisten\u003e`.\n\n### Reversing: `flock disconnect \u003cclient\u003e`\n\n```bash\nflock disconnect claude-code        # prints the unset + sk-ant-… export commands\nflock disconnect cursor             # GUI steps to clear the override\nflock disconnect --list             # same 19 clients\n```\n\nPrints the exact commands to roll back whatever `flock connect` set up — does NOT modify any shell, editor, or config file. You run the commands when you're ready. Once disconnected, the client talks straight to the vendor (`api.anthropic.com`, `api.openai.com`); nothing about your Flock host needs to change. 
Re-run `flock connect \u003cclient\u003e` anytime to go back.\n\n### For a teammate: `flock invite \u003cname\u003e`\n\n```bash\nflock invite hadi --quota 100000\n# Creates a user-scope token with a 100k tokens/day cap.\n# Prints a paste-into-Slack markdown card with snippets for every supported client.\n# Recipient picks the tool they use and pastes — done.\n\n# Filter the share card to specific clients\nflock invite alice --clients claude-code,cursor,curl\n\n# Suggest a specific default model in the snippets\nflock invite bob --model qwen-coder-14b\n\n# Override the gateway URL printed in the card (useful behind a reverse proxy)\nflock invite carol --base-url https://flock.example.com\n\n# Machine-readable output for scripting\nflock invite dave --format json | jq '.token'\n```\n\nFlags: `--quota N` (daily token cap, 0 = unlimited), `--clients id1,id2,…` (subset of clients to include), `--format markdown|json`, `--base-url \u003curl\u003e`, `--model \u003cid\u003e`. The token is shown exactly once — capture it then. Revoke later with `flock token revoke \u003cid\u003e`.\n\n### In the dashboard\n\nOpen `http://localhost:8080` after `flock up`. Tabs:\n\n- **Connect** — pick a tool from a dropdown, copy the snippet, click \"Test connection\" to verify the gateway works end-to-end\n- **Playground** — in-browser chat box: pick a model, send a message, see the streaming response. Useful sanity check before configuring Cursor.\n- **Tokens → + Invite teammate** — same as `flock invite`, with a modal that copies the share card as markdown.\n\n### Reference snippets (manual)\n\nIf you can't run `flock connect`, the snippets below are the same content you'd get from the CLI. 
Substitute your own base URL + token where shown.\n\n### Cursor\n\nSettings → Models → Add Model:\n- Name: `flock`\n- Provider: OpenAI Compatible\n- Base URL: `http://flock.your-tailnet.ts.net/v1`\n- API Key: `sk-orc-…`\n\n### Claude Code\n\n```bash\nexport ANTHROPIC_BASE_URL=http://flock.your-tailnet.ts.net\nexport ANTHROPIC_AUTH_TOKEN=sk-orc-…\nclaude\n```\n\nAdd to `~/.zshrc` or `~/.bashrc` to make permanent.\n\n### Continue.dev\n\n`~/.continue/config.json`:\n\n```json\n{\n  \"models\": [\n    {\n      \"title\": \"Flock - Qwen3-Coder\",\n      \"provider\": \"openai\",\n      \"model\": \"qwen3-coder\",\n      \"apiBase\": \"http://flock.your-tailnet.ts.net/v1\",\n      \"apiKey\": \"sk-orc-…\"\n    }\n  ]\n}\n```\n\n### Aider\n\n```bash\naider --openai-api-base http://flock.your-tailnet.ts.net/v1 \\\n      --openai-api-key sk-orc-… \\\n      --model openai/qwen3-coder\n```\n\n### OpenAI Python SDK\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http://flock.your-tailnet.ts.net/v1\",\n    api_key=\"sk-orc-…\",\n)\n\nresp = client.chat.completions.create(\n    model=\"auto\",\n    messages=[{\"role\": \"user\", \"content\": \"write a haiku about caching\"}],\n)\nprint(resp.choices[0].message.content)\n```\n\n### Anthropic Python SDK\n\n```python\nfrom anthropic import Anthropic\n\nclient = Anthropic(\n    base_url=\"http://flock.your-tailnet.ts.net\",\n    api_key=\"sk-orc-…\",\n)\n\nresp = client.messages.create(\n    model=\"qwen3-coder\",\n    max_tokens=1024,\n    messages=[{\"role\": \"user\", \"content\": \"explain CRDTs\"}],\n)\nprint(resp.content[0].text)\n```\n\n---\n\n## API reference\n\n### OpenAI surface\n\n| Method | Path | Notes |\n|---|---|---|\n| `POST` | `/v1/chat/completions` | Streaming + non-streaming; accepts `image_url` content blocks (Ollama path). Returns typed `engine_unreachable` errors with engine name + start hint when the upstream engine is down. 
|\n| `POST` | `/v1/embeddings` | Ollama embedding models (e.g. `nomic-embed-text`) |\n| `GET` | `/v1/models` | Lists available models |\n\n(Planned: `/v1/completions`, `/v1/audio/transcriptions`, `/v1/rerank`.)\n\n### Anthropic surface\n\n| Method | Path | Notes |\n|---|---|---|\n| `POST` | `/v1/messages` | Streaming (SSE) + non-streaming |\n| `POST` | `/v1/messages/count_tokens` | Pre-flight token count |\n\n### Flock admin surface\n\n| Method | Path | Notes |\n|---|---|---|\n| `GET` | `/healthz` `/readyz` | Liveness / readiness |\n| `GET` | `/metrics` | Prometheus exposition |\n| `GET` | `/admin/v1/nodes` | List nodes |\n| `POST` | `/admin/v1/nodes/register` | (scope=admin or node) Worker registration |\n| `POST` | `/admin/v1/nodes/heartbeat` | (scope=admin or node) Worker heartbeat with loaded models |\n| `POST` | `/admin/v1/nodes/{id}/drain` | Mark node as draining |\n| `DELETE` | `/admin/v1/nodes/{id}` | Forget a node |\n| `GET` | `/admin/v1/models` | List installed models |\n| `GET` | `/admin/v1/catalog` | List catalog entries |\n| `POST` | `/admin/v1/models` | Install a model (auto-delegates to shard orch if `sharding.required`) |\n| `DELETE` | `/admin/v1/models/{id}` | Uninstall (auto-handles sharded teardown) |\n| `GET` | `/admin/v1/tokens` | List API keys (no hash, no plaintext) |\n| `POST` | `/admin/v1/tokens` | Create a key — returns plaintext ONCE |\n| `DELETE` | `/admin/v1/tokens/{id}` | Revoke a key |\n| `GET` | `/admin/v1/shards` | List shards across all models |\n| `POST` | `/admin/v1/shards/create` | Orchestrate a sharded model |\n| `DELETE` | `/admin/v1/shards/{model_id}` | Tear down a sharded model |\n| `GET` | `/admin/v1/usage/recent` | Recent inference records |\n| `GET` | `/admin/v1/usage/summary` | Aggregate stats (top models, p50/p95/p99, error rate, RPM sparkline) |\n| `GET` | `/admin/v1/audit/recent` | Recent admin actions |\n| `GET` | `/admin/v1/audit/summary` | Top actors + top actions |\n| `GET` | `/admin/v1/config` | Effective config, 
secrets redacted |\n| `GET` | `/admin/v1/status` | Compact role + engine reachability + node/model counts (powers dashboard top-bar chips) |\n\nAll admin endpoints require an admin key (`flock token create --admin`).\n\n### Model routing rules\n\n`model` field in the request determines backend:\n\n| Model name | Routes to |\n|---|---|\n| exact catalog ID (`qwen3-coder`) | local cluster, that model |\n| `auto` | local; gateway picks based on heuristics |\n| `claude-…` | Anthropic API (proxied) |\n| `gpt-…`, `o3`, `o4` | OpenAI API (proxied) |\n| `hf:…` | local, if the model is loaded |\n\n---\n\n## CLI reference\n\nEvery admin action is available via the CLI **and** the web UI — full parity. Most subcommands launch an interactive picker (type to filter, ↑↓/enter) when called with no argument or an unknown ID, so you rarely need to memorize an ID.\n\n```\n# --- lifecycle (CLI only — UI can't kill the process running the UI) ---\nflock up [--no-wizard] [--auto-pull=false]   Start the local node (first-run wizard\n                                              picker installs a starter model unless\n                                              --no-wizard is set)\nflock down                        Stop the local node\nflock status [--json]             Show local + cluster status\nflock join \u003curl\u003e?token=…          Join an existing cluster as a worker\nflock doctor                      Diagnose common problems\nflock update [--check]            Check / install the latest Flock release\nflock upgrade                     Alias for `update`\nflock completion \u003cbash|zsh|fish\u003e  Print a shell completion script\nflock version                     Print version\n\n# --- nodes ---\nflock node ls                     List nodes\nflock node show \u003cid\u003e              Inspect a node\nflock node drain \u003cid\u003e             Drain a node (no new requests routed to it)\nflock node remove \u003cid\u003e [--yes]    Forget a node (prompts unless --yes)\n\n# --- 
models (non-sharded) ---\nflock model search [q] [--sort=released] [--since YYYY-MM-DD] [--json]\n                                  Search catalog with optional date filters\nflock model ls [--json]           List installed models\nflock model add \u003cid\u003e [--force] [--dry-run]\n                                  Install a model. --dry-run previews size/RAM/\n                                  engine/ETA without pulling weights.\nflock model info \u003cid\u003e [--json]    Full details for one catalog model\nflock model remove \u003cid\u003e [--yes]   Uninstall a model (prompts unless --yes)\n\n# --- sharded models (one model split across N machines) ---\nflock shard create \u003cmodel\u003e [N]    Orchestrate a sharded model across N workers\nflock shard ls                    List shards across all sharded models\nflock shard remove \u003cmodel\u003e [--yes]  Tear down a sharded model (prompts unless --yes)\n\n# --- API keys / tokens ---\nflock token create [name]         Issue an API key (--admin, --node)\nflock token ls                    List API keys\nflock token revoke \u003cid\u003e           Revoke a key\n\n# --- observability ---\nflock usage [--limit N] [--user X] [--summary] [--json]\n                                  Recent inference records, or aggregate summary\n                                  (top models, p50/p95/p99, error rate, sparkline)\nflock audit [--limit N] [--actor X] [--summary] [--json]\n                                  Recent admin audit entries, or top-actors/top-actions\n                                  summary\n\n# --- config ---\nflock config show [--json]        Show effective runtime config (secrets redacted)\nflock config path                 Print config file path\nflock config edit                 Print the editor command for the config file\n```\n\nOutput is colored when stdout is a TTY. Set `NO_COLOR=1` (or `FLOCK_NO_COLOR=1`) to disable. 
Top-level subcommand typos get a \"did you mean ...\" suggestion via Damerau-Levenshtein over the registered subcommand list.\n\n---\n\n## Web UI\n\nThe UI is shipped embedded in the Go binary via `//go:embed`. It is *not* a separate deployment. Open `http://localhost:8080` and paste the admin key.\n\nAll admin actions are also doable via CLI — see the [CLI reference](#cli-reference).\n\nPersistent top-bar chips (every view) show: role (leader/worker), engine reachability, node count, model count — polled every 5 s. Most tabs auto-refresh every 5 s while visible (pauses when the browser tab is hidden).\n\n| Tab | Capabilities |\n|---|---|\n| **Dashboard (home)** | 4 KPI cards (nodes, models, requests, tokens served); latency card with p50/p95/p99; tier-colored error-rate card; top-model card; full-width SVG sparkline of requests-per-minute over the last 60 minutes; recent-activity strip (last 6 requests with outcome badges); copy-paste curl example |\n| **Nodes** | List + status; **Add a worker** modal generates a one-time node-scope token and shows both an install-and-join curl one-liner and a `flock join` command for boxes that already have the binary; per-row **drain** and **remove** with confirmation |\n| **Models** | Installed models table; **filterable catalog browser** (search, sort by size/newest/id, hide-installed toggle, color-coded license badge, per-row Install button); per-row **remove** button with confirmation (auto-handles sharded teardown) |\n| **Shards** | List shards grouped by sharded model; **Create sharded model** form (id + shard count); per-model **Tear down** button |\n| **Tokens** | List API keys (id/name/scope/quota/status); **Create** form with name + scope (user/admin/node) + daily quota; **Revoke** button per row; new keys shown ONCE in a modal |\n| **Usage** | Recent inference records: time, user, model, protocol, tokens, latency, outcome (live polling) |\n| **Audit** | Recent admin actions with actor + action + target (live polling) 
|\n| **Settings** | Read-only effective config with secrets redacted; instructions for editing `~/.flock/config.yaml` and the env vars (`ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `FLOCK_*`) |\n\nMutating actions surface results via a toast notification (bottom-right, 3 s auto-dismiss) instead of inline error sprawl.\n\n## CLI vs UI parity\n\nEvery cluster action is available both ways. Pick whichever fits your workflow:\n\n| Action | CLI | UI |\n|---|---|---|\n| Add node | `flock token create --node` → `flock join \u003curl\u003e?token=…` on worker | Nodes tab → \"Add node…\" |\n| Drain node | `flock node drain \u003cid\u003e` | Nodes tab → row's \"drain\" |\n| Remove node | `flock node remove \u003cid\u003e` | Nodes tab → row's \"remove\" |\n| Install model | `flock model add \u003cid\u003e` | Models tab → catalog picker → \"Install\" |\n| Remove model | `flock model remove \u003cid\u003e` | Models tab → row's \"remove\" |\n| Create sharded model | `flock shard create \u003cmodel\u003e [N]` | Shards tab → \"Create sharded model\" |\n| Tear down sharded model | `flock shard remove \u003cmodel\u003e` | Shards tab → \"Tear down\" |\n| Create API key | `flock token create \u003cname\u003e` | Tokens tab → \"Create\" form |\n| Revoke API key | `flock token revoke \u003cid\u003e` | Tokens tab → row's \"revoke\" |\n| View recent usage | `flock usage` | Usage tab |\n| View audit log | `flock audit` | Audit tab |\n| View effective config | `flock config show` | Settings tab |\n| Edit config | edit `~/.flock/config.yaml`, restart | (read-only via UI; CLI shows the path) |\n\n**The only thing that can't be done from the UI**: starting / stopping `flock up` itself — the UI is served by that process, so it can't safely tear itself down. 
Use `flock up` / `flock down` from the terminal.\n\n---\n\n## Troubleshooting\n\n### `flock up` fails to start\n\n```bash\nflock doctor\n```\n\nCommon issues:\n\n- Port 8080 in use → set `listen: \":8081\"` in config\n- macOS firewall blocking mesh → System Settings → Privacy \u0026 Security → allow Flock\n- Insufficient memory → pick a smaller model (`flock model add llama-3.2-3b`)\n\n### A node won't join\n\n- Token expired (5-minute TTL by default) — generate a fresh one in the UI\n- Clock skew \u003e5 minutes between leader and node — fix NTP\n- Tailscale already running on the node — set `mesh.backend: lan` to use direct LAN\n\n### Slow inference\n\n- Check GPU utilization (`flock node show \u003cid\u003e`). If pinned at 100% under load: add a replica or upgrade.\n- Sticky sessions disabled? Re-enable for better KV cache reuse.\n- Model is CPU-falling-back? Check the leader's stderr where `flock up` is running — engine driver errors are logged there. Per-node log streaming is on the roadmap.\n\n### Claude Code shows \"model not found\"\n\n- Make sure the model ID in your request matches a local catalog ID, or one of the proxied vendor IDs.\n- `flock model ls` to confirm what's loaded.\n\n### Slow inference?\n\n- Check engine reachability: `flock doctor`\n- Add a node + install the model there: `flock node` / `flock model add` (router auto-load-balances)\n- For sharded large models: `flock shard create`\n\n---\n\n## FAQ\n\n**Can I run Claude or GPT on my hardware?**\nNo — those are closed-weight proprietary models. Flock proxies to their APIs when you ask for them, so they appear in the same endpoint, but inference happens at Anthropic/OpenAI and you pay per token.\n\n**Do I need a GPU?**\nFor real coding work, yes — either an NVIDIA GPU on Linux or an Apple Silicon Mac. CPU-only works via llama.cpp for tiny models (3B and under) and is useful for testing only.\n\n**Can I mix Macs and NVIDIA boxes in one cluster?**\nYes. That's a core design goal. 
The scheduler treats them as distinct pools and assigns models that fit each.\n\n**Does Flock work without internet?**\nYes, after initial model download. The mesh requires a Tailscale coordination server reachable from each node for *joining*; once joined, traffic is direct. For air-gapped deployments, use Headscale (open-source Tailscale control server) or set `mesh.backend: lan`.\n\n**How is this different from Ollama?**\nOllama is a great single-node inference engine. Flock is the *orchestration layer* across many machines. Flock uses Ollama as one of its supported engine backends.\n\n**How is this different from vLLM?**\nvLLM is a single-node inference server. Flock orchestrates vLLM (and others) across your fleet.\n\n**How is this different from exo?**\nexo is the closest project conceptually. Flock differs by: (1) Anthropic-API compatibility for Claude Code, (2) explicit hybrid local+vendor routing, (3) multi-tenant API keys / quotas / audit log (OIDC planned), (4) embedded UI and observability stack, (5) Go single-binary install.\n\n**Does Flock train models?**\nNo. Use Axolotl / Unsloth / torchtune for training. Bring back a LoRA adapter; Flock will serve it.\n\n**Why Go and not Rust?**\nGo ships a static binary as fast as Rust for this workload, with a faster development loop. We may rewrite hot paths in Rust if measurements justify it.\n\n**Is there a hosted version?**\nNot initially. The product is the software you run.\n\n**Can I use my own Tailscale account?**\nYes — set `mesh.tailnet_name` and `mesh.auth_key` to your tailnet. Otherwise Flock spins up a dedicated tailnet for the cluster.\n\n**Does Flock support AMD GPUs?**\nLinux + ROCm via vLLM-ROCm is on the roadmap.\n\n**Can I run this on Windows?**\nWorkers no (no MLX, no native vLLM). Leader/CLI yes via WSL2. 
Native Windows isn't a near-term priority.\n\n---\n\n## License\n\nApache License 2.0 — see [LICENSE](LICENSE).\n\nYou can use Flock commercially, modify it, fork it, embed it, redistribute it. The only requirements are (a) keep the license + notice, (b) state significant changes you made. No copyleft.\n\n## Acknowledgments\n\nFlock stands on the shoulders of:\n\n- **vLLM** — for fast NVIDIA inference\n- **MLX-LM** — for Apple Silicon inference\n- **llama.cpp** — for the universal fallback\n- **Ollama** — for proving the developer-experience bar\n- **Tailscale** — for the mesh and the `tsnet` library\n- **LiteLLM** — for cross-provider protocol translation\n- **Hugging Face** — for the open-weight model ecosystem\n- The teams behind **Qwen, Llama, DeepSeek, Mistral, GLM, Phi, Gemma, StarCoder** — for releasing open weights\n\n---\n\n**Project links**\n\n- Website: https://flockllm.com\n- GitHub: https://github.com/hadihonarvar/flock\n- Maintainer: [Hadi Honarvar Nazari](https://www.linkedin.com/in/hadi-honarvar-nazari/) — `hadi.work.ca@gmail.com`\n- Security disclosures: see [SECURITY.md](SECURITY.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhadihonarvar%2Fflock","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhadihonarvar%2Fflock","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhadihonarvar%2Fflock/lists"}