{"id":46676659,"url":"https://github.com/defai-digital/ax-serving","last_synced_at":"2026-04-06T13:01:03.119Z","repository":{"id":342757633,"uuid":"1172204618","full_name":"defai-digital/ax-serving","owner":"defai-digital","description":"Offline OpenAI-compatible serving and orchestration plane for AX Fabric on Apple Silicon, with runtime model lifecycle, routing, metrics, and multi-worker control.","archived":false,"fork":false,"pushed_at":"2026-04-06T02:24:08.000Z","size":1031,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-06T04:09:42.857Z","etag":null,"topics":["apple-silicon","automatosx","ax-fabric","control-plane","enterprise-ai","llm-serving","model-lifecycle","model-routing","offline-ai","openai-compatible","orchestration","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"agpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/defai-digital.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-04T03:38:20.000Z","updated_at":"2026-04-06T02:24:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/defai-digital/ax-serving","commit_stats":null,"previous_names":["defai-digital/ax-serving"],"tags_count":22,"template":false,"template_full_name":null,"purl":"pkg:github/defai-digital/ax-serving","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%
2Fax-serving","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-serving/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-serving/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-serving/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/defai-digital","download_url":"https://codeload.github.com/defai-digital/ax-serving/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-serving/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31473271,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-06T08:36:52.050Z","status":"ssl_error","status_checked_at":"2026-04-06T08:36:51.267Z","response_time":112,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apple-silicon","automatosx","ax-fabric","control-plane","enterprise-ai","llm-serving","model-lifecycle","model-routing","offline-ai","openai-compatible","orchestration","rust"],"created_at":"2026-03-08T23:02:32.773Z","updated_at":"2026-04-06T13:01:03.107Z","avatar_url":"https://github.com/defai-digital.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AX Serving\n\n**Category:** Department-Scale Private AI Fleet Control 
Plane\n\n**Product:** The serving and orchestration layer for multi-model private AI fleets operated by SMEs and enterprise departments.\n\n\n[![macOS 14+](https://img.shields.io/badge/macOS-14%2B-black)](https://github.com/defai-digital/ax-serving)\n[![rust-1.88+](https://img.shields.io/badge/rust-1.88%2B-orange)](https://www.rust-lang.org)\n[![Tests: 384 passing](https://img.shields.io/badge/tests-384%20passing-brightgreen)](https://github.com/defai-digital/ax-serving/actions/workflows/ci.yml)\n[![license-AGPL-3.0-or-later](https://img.shields.io/badge/license-AGPL--3.0--or--later-blue)](LICENSE)\n\nAX Serving is the serving and orchestration control plane behind\n[AX Fabric](https://github.com/defai-digital/ax-fabric). It is designed for\ndepartment-scale private AI fleets that need OpenAI-compatible APIs, runtime\nmodel lifecycle control, scheduling, metrics, audit surfaces, and multi-worker\nrouting across heterogeneous workers.\n\nFor inference execution, AX Serving uses:\n- `llama.cpp` by default for all model loads\n- `ax-engine` when explicitly requested via `native` backend override\n\nAX Fabric is the product-facing layer for retrieval, knowledge, and grounded\nagent workflows. 
AX Serving is the infrastructure layer that makes that stack\ndeployable and operable across Mac-led and mixed-worker environments.\n\nStatus: production-ready Rust workspace for Apple Silicon\n(`aarch64-apple-darwin`) with OpenAI-compatible REST, gRPC, runtime model\nmanagement, and multi-worker orchestration oriented around department-scale\nprivate AI serving.\n\n## Market Focus\n\nAX Serving is built to win in four adjacent niches:\n\n- department-scale private AI fleet control planes\n- Mac-native serving and orchestration for single-node and Mac-grid deployments\n- enterprise mixed-worker orchestration across NVIDIA / Thor-class, Mac Studio-class, and future workers\n- serving infrastructure for governed private AI stacks such as AX Fabric\n\nWho it is for:\n\n- SMEs and enterprise departments with fewer than ~100 users or operators\n- platform and infra teams running private AI fleets\n- operators who need more than a single local runtime process\n- teams that care about model lifecycle, routing, metrics, health, audit, and fleet operations\n- private deployments that need an OpenAI-compatible serving layer without a cloud-first dependency\n\nWhat it is not:\n\n- not an end-user desktop chat app\n- not a generic CUDA hyperscale serving stack\n- not the low-level token-generation engine itself\n\nDeployment fit:\n\n- `Single Mac`: the default open-source deployment path\n- `Mac grid`: the default open-source multi-worker deployment path\n- `Enterprise heterogeneous fleet`: commercial path for NVIDIA / Thor-class workers, governed mixed-node deployments, and enterprise delivery requirements\n\nFor market positioning, competitive analysis, and ICP details, see:\n\n- [docs/market-positioning.md](docs/market-positioning.md)\n- [docs/competitive-landscape.md](docs/competitive-landscape.md)\n- [docs/icp-and-demand.md](docs/icp-and-demand.md)\n- [docs/prd/PRD-AX-SERVING-v3.0.md](docs/prd/PRD-AX-SERVING-v3.0.md)\n- 
[docs/prd/PRD-AX-SERVING-OSS-ENTERPRISE-BOUNDARY-v1.0.md](docs/prd/PRD-AX-SERVING-OSS-ENTERPRISE-BOUNDARY-v1.0.md)\n- [docs/prd/PRD-AX-SERVING-ENTERPRISE-EXECUTION-v1.0.md](docs/prd/PRD-AX-SERVING-ENTERPRISE-EXECUTION-v1.0.md)\n- [docs/contracts/ax-serving-public-contract-inventory.md](docs/contracts/ax-serving-public-contract-inventory.md)\n- [docs/runbooks/enterprise-private-repo-bootstrap.md](docs/runbooks/enterprise-private-repo-bootstrap.md)\n- [docs/runbooks/enterprise-release-governance.md](docs/runbooks/enterprise-release-governance.md)\n- [docs/maintainability-refactor-plan.md](docs/maintainability-refactor-plan.md)\n\n* * *\n\n## Licensing And Commercial Use\n\nAX Serving is dual-licensed:\n\n- Open-source use: `AGPL-3.0-or-later`\n- Commercial use: available under separate written license\n\nCommercial licensing is intended for organizations that want to use AX Serving\nas a proprietary serving backend, private inference/control plane, embedded\nruntime, OEM component, managed fleet, or enterprise integration layer without\nAGPL obligations.\n\nCommercial engagements may include:\n\n- commercial runtime licensing\n- private deployment rights\n- OEM / embedded redistribution rights\n- enterprise fleet and mixed-node integration work\n- support, service, and deployment terms\n\n### Open-Source And Enterprise Boundary\n\nThe public repository is the open-source core of AX Serving.\n\nThe default open-source product scope is:\n\n- single-Mac serving\n- Mac-led local serving\n- Mac worker grids\n- core serving, orchestration, worker, metrics, and admin protocols\n\nCommercial offerings cover one or both of the following:\n\n- non-AGPL licensing rights for the AX Serving core itself\n- separate enterprise modules, deployment bundles, and supported integrations\n\nThe intended enterprise expansion path is:\n\n- NVIDIA / Thor-class workers\n- heterogeneous Mac + accelerator fleets\n- enterprise auth, governance, and deployment packaging\n- supported private 
integrations and fleet operations tooling\n\nThe public repository contains the public source distribution, including\nsingle-node and multi-worker serving/orchestration capabilities. Commercial\nagreements govern usage outside AGPL obligations, private packaging, and\nenterprise delivery terms. The recommended technical boundary is service-level\nintegration, not private crates mixed into the public workspace.\n\nSee [LICENSING.md](LICENSING.md) and\n[LICENSE-COMMERCIAL.md](LICENSE-COMMERCIAL.md).\n\nExecution artifacts for the open-source / enterprise split:\n\n- [docs/contracts/ax-serving-public-contract-inventory.md](docs/contracts/ax-serving-public-contract-inventory.md)\n- [docs/contracts/enterprise-compatibility-metadata.example.yaml](docs/contracts/enterprise-compatibility-metadata.example.yaml)\n- [docs/runbooks/enterprise-private-repo-bootstrap.md](docs/runbooks/enterprise-private-repo-bootstrap.md)\n- [docs/runbooks/enterprise-release-governance.md](docs/runbooks/enterprise-release-governance.md)\n\n* * *\n\n## Quick Start\n\nPrerequisites:\n- Apple Silicon macOS\n- Rust toolchain\n- `llama-server` on `PATH` for `llama.cpp` fallback and explicit `llama_cpp` loads\n- a GGUF model file\n\nValidate your environment:\n\n```bash\ncargo check --workspace\nwhich llama-server\n```\n\nBackend model:\n- `native` = explicit `ax-engine`\n- `llama_cpp` = `llama-server` (default when backend is omitted)\n- `auto` = try native first, then `llama.cpp` on unsupported architectures\n\nStart the simplest local runtime:\n\n```bash\nAXS_ALLOW_NO_AUTH=true \\\ncargo run -p ax-serving-cli --bin ax-serving -- serve \\\n  -m ./models/\u003cmodel\u003e.gguf \\\n  --model-id default \\\n  --host 127.0.0.1 \\\n  --port 18080\n```\n\nSend a request:\n\n```bash\ncurl -sS http://127.0.0.1:18080/v1/chat/completions \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"model\": \"default\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Give me three short points about 
Rust.\"}],\n    \"stream\": false,\n    \"max_tokens\": 96\n  }'\n```\n\nFor fuller setup paths, see [QUICKSTART.md](QUICKSTART.md):\n- single runtime\n- authenticated offline deployment\n- gateway + workers\n- model management\n- embeddings\n\nTypeScript SDK (Zod-validated):\n\n```bash\ncd sdk/javascript\nnpm install\nnpm run build\n```\n\n---\n\n## Why AX Serving\n\nMost local runtimes focus on single-process inference. AX Serving focuses on the operational layer above inference:\n\n- OpenAI-compatible REST and gRPC serving\n- runtime model load/unload/reload\n- admission queueing and concurrency control\n- metrics, dashboard, diagnostics, and audit surfaces\n- multi-worker orchestration in the public repo\n- benchmark and soak tooling in the same repo\n\nPositioning:\n- AX Fabric is the product layer\n- AX Serving is the serving and orchestration layer underneath it\n- inference runtimes such as `ax-engine` and `llama.cpp` remain lower-level execution backends\n\n### Backend Architecture\n\nAX Serving is not itself the token-generation engine. 
It is the serving layer that routes requests into lower-level runtimes.\n\n- `llama.cpp` is the default backend for model loading across families.\n- `ax-engine` remains an explicit opt-in path for environments that can benefit from native execution.\n- routing between those backends is controlled through [`config/backends.yaml`](config/backends.yaml)\n- `ax-engine` is pinned to v1.2.2-compatible `0959a65` because `v1.3.1` regressed the shipped `gdn.metal` file and the `v1.3.2` commit path does not currently compile cleanly in this workspace snapshot.\n\nIn practice, this means AX Serving owns the APIs, scheduling, orchestration, health, metrics, and model lifecycle, while model execution defaults to `llama.cpp` with `ax-engine` as an explicit override.\n\n### Best With AX Fabric\n\nAX Serving is designed to work with AX Fabric as part of one complete system.\n\n- AX Serving: execution control plane, model lifecycle, routing, scheduling, APIs\n- AX Fabric: document ingestion, vector search, BM25/hybrid retrieval, MCP-native data access\n- Together: AX Fabric is the product layer; AX Serving is the execution layer underneath it\n\n---\n\n## Core Capabilities\n\n| Capability | AX Serving |\n|---|---|\n| OpenAI-compatible chat/completions/embeddings | ✅ |\n| Streaming SSE + non-streaming responses | ✅ |\n| Runtime model management (`/v1/models`) | ✅ |\n| Multi-worker orchestration (`ax-serving-api`) | ✅ |\n| Dispatch policies (`least_inflight`, `weighted_round_robin`, `model_affinity`, `token_cost`) | ✅ |\n| Scheduler queue/inflight controls | ✅ |\n| Prometheus + JSON metrics | ✅ |\n| Embedded dashboard (`/dashboard`) | ✅ |\n| Built-in benchmarking (`ax-serving-bench`) | ✅ |\n\n---\n\n## Run Modes\n\n### 1. Single Inference CLI\n\n```bash\ncargo run -p ax-serving-cli --bin ax-serving -- \\\n  -m ./models/\u003cmodel\u003e.gguf \\\n  -p \"Hello from AX Serving\" \\\n  -n 128\n```\n\n### 2. 
Single Runtime (`ax-serving serve`)\n\n```bash\nAXS_ALLOW_NO_AUTH=true \\\ncargo run -p ax-serving-cli --bin ax-serving -- serve \\\n  -m ./models/\u003cmodel\u003e.gguf \\\n  --model-id default \\\n  --port 18080\n```\n\n### 3. Gateway + Workers (`ax-serving-api` + workers)\n\nGateway:\n\n```bash\nAXS_ALLOW_NO_AUTH=true \\\ncargo run -p ax-serving-cli --bin ax-serving-api -- \\\n  --port 18080 \\\n  --internal-port 19090 \\\n  --policy least_inflight\n```\n\nWorker:\n\n```bash\nAXS_ALLOW_NO_AUTH=true \\\ncargo run -p ax-serving-cli --bin ax-serving -- serve \\\n  -m ./models/\u003cmodel\u003e.gguf \\\n  --model-id default \\\n  --port 18081 \\\n  --orchestrator http://127.0.0.1:19090\n```\n\nThis gateway + worker path is part of the open-source Mac-native deployment\nstory. Enterprise fleet products build on the same serving contracts while\nadding supported NVIDIA / Thor-class worker integrations, deployment bundles,\nand governance layers under commercial terms.\n\n---\n\n## API Surface\n\n### Serving runtime (`ax-serving serve`)\n\n- `POST /v1/chat/completions`\n- `POST /v1/completions`\n- `POST /v1/embeddings`\n- `GET /v1/models`\n- `POST /v1/models`\n- `DELETE /v1/models/{id}`\n- `POST /v1/models/{id}/reload`\n- `GET /health`\n- `GET /v1/metrics`\n- `GET /metrics`\n- `GET /dashboard`\n- `GET /v1/license`\n- `POST /v1/license`\n- `GET /v1/admin/status`\n- `GET /v1/admin/startup-report`\n- `GET /v1/admin/diagnostics`\n- `GET /v1/admin/audit`\n- `GET /v1/admin/policy`\n\n### Orchestrator (`ax-serving-api`)\n\n- `POST /v1/chat/completions`\n- `POST /v1/completions`\n- `POST /v1/embeddings`\n- `GET /v1/models`\n- `GET /health`\n- `GET /v1/metrics`\n- `GET /v1/license`\n- `POST /v1/license`\n- `GET /v1/admin/status`\n- `GET /v1/admin/startup-report`\n- `GET /v1/admin/diagnostics`\n- `GET /v1/admin/audit`\n- `GET /v1/admin/policy`\n- `GET /v1/admin/fleet`\n- `GET /v1/workers`\n- `GET /v1/workers/{id}`\n- `POST /v1/workers/{id}/drain`\n- `POST 
/v1/workers/{id}/drain-complete`\n- `DELETE /v1/workers/{id}`\n\nRuntime health contract:\n- `GET /health` is liveness plus readiness, not just process-up status\n- `status=ok` means the runtime is ready and at least one model is available\n- `status=degraded` means the process is alive but either no model is loaded or the runtime is thermally constrained\n\nAX Fabric integration contract:\n- documented in [docs/contracts/ax-fabric-runtime-contract.md](docs/contracts/ax-fabric-runtime-contract.md)\n\nAdmin/control-plane notes:\n- all authenticated admin responses preserve `X-Request-ID`\n- `GET /v1/admin/status` gives an operational summary\n- `GET /v1/admin/startup-report` and `GET /v1/admin/diagnostics` are for runtime inspection\n- worker inventory and drain APIs are orchestrator-only\n\n### v1.4 Runtime Controls\n\n- `AXS_SPLIT_SCHEDULER=true`\n  - enables prefill/decode activity tracking in scheduler metrics\n\nRelevant scheduler metrics:\n- `prefill_tokens_active`\n- `decode_sequences_active`\n- `split_scheduler_enabled`\n\n---\n\n## Authentication\n\n- If `AXS_API_KEY` is set, protected endpoints require bearer auth.\n- If `AXS_API_KEY` is unset, startup requires `AXS_ALLOW_NO_AUTH=true`.\n\nRecommended offline enterprise startup:\n\n```bash\nAXS_CONFIG=config/serving.offline-enterprise.yaml \\\nAXS_API_KEY=\"change-me\" \\\nAXS_MODEL_ALLOWED_DIRS=\"/absolute/path/to/models\" \\\ncargo run -p ax-serving-cli --bin ax-serving -- serve \\\n  -m /absolute/path/to/models/\u003cmodel\u003e.gguf \\\n  --model-id default\n```\n\n```bash\nAXS_API_KEY=\"token1,token2\" cargo run -p ax-serving-cli --bin ax-serving -- serve -m ./models/\u003cmodel\u003e.gguf\n```\n\nClient header:\n\n```bash\nAuthorization: Bearer token1\n```\n\n---\n\n## Build, Lint, Test\n\n```bash\ncargo check --workspace\ncargo fmt --all -- --check\ncargo clippy --workspace --tests -- -D warnings\ncargo test --workspace\n```\n\nIntegration tests (no model required — uses in-process mock 
servers):\n\n```bash\nAXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test orchestration\nAXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test model_management\nAXS_ALLOW_NO_AUTH=true cargo test -p ax-serving-api --test graceful_shutdown\n```\n\nRelease build:\n\n```bash\ncargo build --workspace --release\n```\n\n### Test Coverage\n\nAll tests run automatically in CI on every push and pull request against `main`. No model file or GPU is required — tests use in-process backends (`NullBackend`, `EchoBackend`, `FailingUnloadBackend`) that exercise the full request path without hardware.\n\nExact test counts change over time. Use the linked CI badge and workflow runs as the source of truth.\n\n| Suite | What It Covers |\n|---|---|\n| **Unit — serving API** | Scheduler (permits, AIMD, TTFT histogram, split prefill/decode), model registry (lifecycle, idle eviction, capacity), orchestration (queue, dispatch policies, worker registry, DashMap), REST helpers (cache key normalisation, cache hit ratio), config (env layering, validation), gRPC status mapping, auth, metrics |\n| **Unit — engine** | Backend routing, GGUF metadata parsing, thermal state, memory budget |\n| **Unit — C shim** | Null-safe llama.h ABI compatibility |\n| **Integration — model\\_management** | Auth (Bearer, whitespace tolerance, 401+WWW-Authenticate), model load/unload/reload (201/200/409/404/503), health semantics (ok/degraded/critical-thermal/no-models), input validation (400/422 on every field), full inference path (chat + completions via EchoBackend), embeddings, security response headers, metrics JSON keys, dashboard HTML, license GET/SET |\n| **Integration — orchestration** | Worker register/heartbeat/eviction, dispatch (least-inflight, weighted round-robin, model-affinity, token-cost), queue admission and backpressure, reroute on 5xx, chaos (all workers fail → 503), overload (queue full → 429) |\n| **Integration — graceful\\_shutdown** | In-flight request drains to completion before 
server exits |\n\nEvery CI run posts a test summary to the GitHub Actions job summary page — see the [Actions tab](https://github.com/defai-digital/ax-serving/actions) for per-run results.\n\n---\n\n## Benchmarking\n\n```bash\ncargo run -p ax-serving-bench --release -- bench -m ./models/\u003cmodel\u003e.gguf\n```\n\nOther benchmark modes:\n\n- `profile`\n- `mixed`\n- `cache-bench`\n- `soak`\n- `compare`\n- `regression-check`\n- `multi-worker`\n\n---\n\n## Repository Layout\n\n- `crates/ax-serving-engine`: backend abstraction, routing, model internals\n- `crates/ax-serving-api`: REST/gRPC serving, scheduler, orchestration\n- `crates/ax-serving-cli`: `ax-serving` and `ax-serving-api` binaries\n- `crates/ax-serving-bench`: benchmark and soak runners\n- `crates/ax-serving-shim`: C-compatible shim\n- `crates/ax-serving-py`: Python bindings\n- `config/`: serving and routing configuration\n- `docs/`: runbooks and architecture notes\n\n---\n\n## Documentation\n\n- [QUICKSTART.md](QUICKSTART.md)\n- [docs/market-positioning.md](docs/market-positioning.md)\n- [docs/competitive-landscape.md](docs/competitive-landscape.md)\n- [docs/icp-and-demand.md](docs/icp-and-demand.md)\n- [docs/prd/PRD-AX-SERVING-v3.0.md](docs/prd/PRD-AX-SERVING-v3.0.md)\n- [docs/maintainability-refactor-plan.md](docs/maintainability-refactor-plan.md)\n- [docs/adr/README.md](docs/adr/README.md)\n- `docs/contracts/ax-fabric-runtime-contract.md`\n- `sdk/javascript/README.md` (TypeScript SDK with Zod validation)\n- `sdk/python/` (Python SDK)\n- `docs/runbooks/multi-worker.md`\n- `docs/perf/service-tuning.md`\n\n---\n\n## Licensing\n\n- Open-source terms: [AGPL v3 text](LICENSE) and [licensing guide](LICENSING.md)\n- Commercial terms: [commercial licensing summary](LICENSE-COMMERCIAL.md)\n- Issue reporting policy: 
[CONTRIBUTING.md](CONTRIBUTING.md)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefai-digital%2Fax-serving","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefai-digital%2Fax-serving","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefai-digital%2Fax-serving/lists"}