{"id":47640697,"url":"https://github.com/defai-digital/ax-engine","last_synced_at":"2026-05-10T00:22:28.592Z","repository":{"id":346314324,"uuid":"1189204277","full_name":"defai-digital/ax-engine","owner":"defai-digital","description":"Mac-native Rust inference engine for running larger local GGUF models with more control on Apple Silicon M3+.","archived":false,"fork":false,"pushed_at":"2026-03-31T16:11:34.000Z","size":2220,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-03T00:39:41.291Z","etag":null,"topics":["ai-interface","apple-silicon","generative-ai","gguf","inference-engine","llama-cpp","llm","local-llm","macos","metal","metal-shaders","rust"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/defai-digital.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-23T04:42:30.000Z","updated_at":"2026-03-31T16:11:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/defai-digital/ax-engine","commit_stats":null,"previous_names":["defai-digital/ax-engine"],"tags_count":24,"template":false,"template_full_name":null,"purl":"pkg:github/defai-digital/ax-engine","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-engine","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-engine/
tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-engine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-engine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/defai-digital","download_url":"https://codeload.github.com/defai-digital/ax-engine/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/defai-digital%2Fax-engine/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31428645,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T02:22:46.605Z","status":"ssl_error","status_checked_at":"2026-04-05T02:22:33.263Z","response_time":75,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-interface","apple-silicon","generative-ai","gguf","inference-engine","llama-cpp","llm","local-llm","macos","metal","metal-shaders","rust"],"created_at":"2026-04-02T00:52:44.303Z","updated_at":"2026-05-10T00:22:28.578Z","avatar_url":"https://github.com/defai-digital.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AX Engine\n\n[![Preview Surfaces](https://github.com/defai-digital/ax-engine/actions/workflows/python-preview.yml/badge.svg?branch=main)](https://github.com/defai-digital/ax-engine/actions/workflows/python-preview.yml)\n[![Coverage 
Report](https://github.com/defai-digital/ax-engine/actions/workflows/coverage.yml/badge.svg?branch=main)](https://github.com/defai-digital/ax-engine/actions/workflows/coverage.yml)\n\nAX Engine is a Mac-first LLM inference runtime, local server, SDK layer, and\nbenchmark toolkit for Apple Silicon.\n\nIt is not \"AX MLX\" as a product. MLX is the primary Apple Silicon execution\nbackend for supported model families, while AX Engine also exposes explicit\ncompatibility routes for upstream `mlx-lm` and `llama.cpp` so users can stay on\none AX surface while model coverage grows.\n\n\u003e Requires **macOS 14 (Sonoma) or later** on **Apple Silicon M2 Max or newer** with **32 GB RAM minimum**.\n\u003e Rust 1.85+ for source builds.\n\n### Supported Hardware\n\nAX Engine targets high-memory Apple Silicon Macs running **macOS 14 (Sonoma) or later**.\n\n| Machine | Minimum spec | Suggested spec |\n|---|---|---|\n| Mac Mini | M4 Pro, 32 GB | M4 Pro, 64 GB |\n| MacBook Pro 14″ / 16″ | M2 Pro / M2 Max, 32 GB | M3 Max, 96 GB |\n| Mac Studio | M2 Max / M2 Ultra, 32 GB | M4 Max, 96 GB |\n\nM3, M4, M5 chip variants are supported across all three lines. M1 is not supported. 
M2 base chip (max 24 GB) is below the 32 GB minimum.\n\n## 30-Second Setup\n\nInstall the released command-line tools and open the local TUI cockpit:\n\n```bash\nbrew install defai-digital/ax-engine/ax-engine\nax-engine-manager --check\nax-engine-manager\n```\n\nThen connect it to a model and server:\n\n```bash\n# Download an mlx-community model and generate its manifest in one step\npython scripts/download_model.py mlx-community/Qwen3-4B-4bit\nMODEL_DIR=\"$HOME/.cache/ax-engine/models/mlx-community--Qwen3-4B-4bit\"\n\n# Start the server\nax-engine-server --mlx --mlx-model-artifacts-dir \"$MODEL_DIR\" --port 8080\n\n# In another terminal, open the TUI cockpit with live server metadata\nax-engine-manager --model-dir \"$MODEL_DIR\" --server-url http://127.0.0.1:8080\n```\n\nOr from Python (after `maturin develop` or `pip install ax-engine`):\n\n```python\nfrom ax_engine import download_model, Session\npath = download_model(\"mlx-community/Qwen3-4B-4bit\")\nwith Session(mlx=True, mlx_model_artifacts_dir=str(path)) as s:\n    print(s.generate([1, 2, 3], max_output_tokens=8).output_tokens)\n```\n\n`download_model()` downloads weights and auto-runs `ax-engine-bench generate-manifest`.\nSee [Getting a Model](#getting-a-model) for all paths including raw HF checkpoints,\nand see [AX Engine Manager](docs/MANAGER.md) for the full TUI workflow.\n\n## Why AX Engine\n\nAX Engine gives local inference work a stable runtime contract:\n\n- `ax-engine-server` exposes a local HTTP adapter over the runtime.\n- `ax-engine-bench` records workload contracts, route identity, correctness,\n  determinism, and performance evidence.\n- `ax-engine-sdk`, Python bindings, and the JavaScript preview client provide\n  thin integration surfaces over the same backend-resolution rules.\n- Repo-owned MLX execution is optimized for supported Qwen and Gemma families.\n- Delegated `mlx_lm.server` and `llama.cpp` routes cover explicit\n  compatibility cases without turning delegated results into 
AX-owned\n  throughput claims.\n\n[mlx_lm](https://github.com/ml-explore/mlx-lm) and\n[mlx-swift-lm](https://github.com/ml-explore/mlx-swift) remain the canonical\nMLX references. AX Engine compares against them, learns from them, and delegates\nto `mlx-lm` for unsupported MLX text models when requested. The AX-owned value\nis the runtime layer around supported workloads: request lifecycle, scheduling,\nKV/cache policy, n-gram acceleration, and auditable benchmark artifacts.\n\nFor supported transformer families on Apple Silicon, the AX-owned runtime layer\ncan produce higher effective throughput than the reference MLX runtimes on\nmatching benchmark shapes:\n\n- **N-gram acceleration** reaches up to 3.1x mlx_lm decode\n  throughput on high-hit benchmark rows, with no second draft model and no\n  model changes\n- **Coding-shaped decode is a natural fit when local repetition exists**:\n  completion, edit loops, structured diffs, JSON/tool output, imports,\n  indentation, and repeated identifiers often contain patterns that n-gram\n  acceleration can predict and the target model can verify. Novel, high-entropy,\n  or very short coding requests may see little or no gain.\n- **AX-owned request lifecycle** provides deterministic, auditable scheduling,\n  KV block management, and prefix reuse that upstream Python runtimes do not\n  expose as stable contracts\n- **Workload-contract tooling** (`ax-engine-bench`) validates correctness,\n  determinism, route identity, and regression across checked-in manifests, not\n  just throughput snapshots\n\nThe thesis is not \"our MLX tensor ops are faster.\" MLX compiles and executes the\nsame compute graph either way. 
The thesis is that **AX's decode strategy above\nMLX** — how tokens are speculated, how requests are scheduled, how KV state is\nmaterialized — produces measurably higher effective throughput on supported\nworkloads.\n\n## Runtime Paths\n\n| Path | Use it for | Current scope |\n|---|---|---|\n| Repo-owned MLX runtime | Supported Qwen/Gemma MLX model artifacts and repo-owned performance claims | Local Apple Silicon inference, token-based server/SDK requests, benchmarked direct and n-gram acceleration modes |\n| `mlx_lm_delegated` | MLX text models that upstream `mlx-lm` supports before AX has a repo-owned graph | Blocking and SSE text generation through a user-provided `mlx_lm.server`; `/v1/generate`, `/v1/generate/stream`, and OpenAI-compatible completion/chat text endpoints. Streaming is delegated text compatibility evidence, not repo-owned token/KV performance |\n| `llama_cpp` | GGUF and non-MLX local inference | Delegated llama.cpp server/CLI compatibility; route-contract evidence, not repo-owned MLX throughput |\n\nThe runtime report exposes `selected_backend`, `support_tier`, and\n`resolution_policy` so callers and benchmark artifacts can distinguish these\npaths.\n\nFor the exact OpenAI-shaped endpoint contract, including what is and is not\ncompatible today, see `docs/API-COMPATIBILITY.md`.\n\n## Design\n\n### Execution Layer\n\nThe repo-owned MLX path uses MLX directly for tensor operations via the official\n`mlx-c` C API. Matrix multiply, quantized matmul, attention, RMSNorm, and RoPE\ngo through MLX's Apple-maintained Metal kernels. AX owns the runtime behavior\nabove that graph.\n\nWhat AX Engine adds around model execution:\n\n- **N-gram acceleration**: a bigram/trigram table built at runtime predicts\n  up to 4 draft tokens per step. The target model verifies them in one forward\n  pass over `[last_token, D1, …, D_n]`. An EMA accept-rate gate (α=0.1,\n  threshold 0.5) disables acceleration after a bad sequence and re-enables when\n  the table recovers. 
No second draft model required.\n- **Scheduler and KV manager**: request lifecycle, batching, memory-blocked\n  recovery, and execution planning live in `ax-engine-core` — deterministic,\n  async-free, no framework dependencies. See [`docs/SCHEDULER.md`](docs/SCHEDULER.md)\n  and [`docs/KV-CACHE.md`](docs/KV-CACHE.md) for design details.\n- **Chunked KV cache**: keys and values grow in pre-allocated backing buffers via\n  `slice_update`. Draft rollback is O(1) — only the sequence-length\n  pointer moves. After each decode step, all KV buffers are evaluated with the\n  output token to flatten the lazy-eval graph and prevent O(N²) graph depth.\n- **Graph compilation**: `mlx_enable_compile()` is called once at startup so\n  Metal shader compilation and dispatch tables are reused across steps with the\n  same shape — equivalent to `mx.compile()` in mlx_lm.\n- **GatedDelta linear attention**: hybrid architectures (Qwen3.5, Qwen3-Next)\n  use a custom SIMD-group Metal kernel for the recurrent GatedDelta state update.\n  All other ops in the same models (dense attention, FFN, projections) delegate\n  to MLX's hardware-optimized paths.\n\n### Memory Layer\n\n`mlx_set_wired_limit(recommendedMaxWorkingSetSize)` wires model weights into GPU\nmemory at startup, preventing Metal from paging them between requests. 
A\ndedicated GPU stream avoids cross-stream synchronization on the shared default\nstream.\n\nSee [`docs/KV-CACHE.md`](docs/KV-CACHE.md) for a detailed description of the\ntwo-layer KV cache architecture, prefix caching coordination, model-specific\ncache variants, and memory pressure handling.\n\n## Supported Models\n\n| Family | Model | Architecture notes |\n|---|---|---|\n| Gemma 4 | gemma-4-e2b-it, gemma-4-e4b-it, gemma-4-26b-a4b-it, gemma-4-31b-it | Dense, per-layer embedding, and MoE variants; MLX affine 4/5/6/8-bit weights, sliding-window + full attention, K=V full-attention layers, logit softcapping |\n| Qwen 3.5 | Qwen3.5-9B | Linear attention + MoE FFN, attn_output_gate per-head interleaving |\n| Qwen 3.6 / Coder Next | Qwen3.6-35B-A3B 4/5/6/8-bit MLX, Qwen3-Coder-Next-4bit | `qwen3_next` architecture: GatedDelta linear attention (3 of every 4 layers) + full attention with per-head sigmoid gate (every 4th layer) + sparse top-k MoE with shared expert |\n\nAll models use MLX safetensors format with the AX `model-manifest.json`\ndescriptor. Each supported architecture has a hand-written forward pass in\n`ax-engine-mlx`. Adding a new architecture means implementing the model graph,\nnot wiring up a generic loader.\n\nRecent community-model checks are tracked according to the evidence they have.\nOn 2026-05-06, `mlx-community/GLM-4.7-Flash-4bit` was promoted to a repo-owned\nMLX runtime path after the GLM MLA attention, sigmoid router, and latent-KV\ncache contracts landed and an AX server benchmark completed.\nSee\n`benchmarks/results/mlx-inference/2026-05-06/README.md` for commands and\nartifacts. 
Before promoting any additional architecture, run\n`scripts/probe_mlx_model_support.py --model-dir \u003cmodel-dir\u003e`: GLM now reports\n`repo_owned_runtime_ready` when the runtime-ready manifest and local reference\nfiles are present.\n\n## Limitations\n\n- **GatedDelta prefill (Qwen3.5)**: The recurrent state update in GatedDelta\n  linear-attention layers serializes over time steps and cannot be parallelized.\n  On **Qwen3.5 9B** this puts AX prefill ~9% behind mlx-swift-lm at 512 tokens;\n  decode throughput is unaffected. **Qwen3-Next (Coder Next) is not affected** —\n  AX prefill exceeds mlx-swift-lm by 2× on that architecture because the sparse\n  MoE forward path dominates the runtime, not the GatedDelta layers.\n- **Raw HuggingFace weights**: ax-engine loads MLX community (pre-sanitized)\n  weights only. For hybrid architectures (Qwen3.5, Qwen3-Next), loading an\n  unsanitized checkpoint now raises a hard error — norm weight mean is sampled at\n  load time and a clear remediation message is shown. Convert first with\n  `mlx_lm.convert`, or download a pre-sanitized model from mlx-community. See\n  [Getting a Model](#getting-a-model).\n- **N-gram acceleration rows**: effective-throughput measurements, not raw\n  model-kernel speedups. The n-gram hit rate is prompt- and output-pattern\n  dependent. Coding-shaped workloads with repeated local structure are the\n  intended high-value case; random, high-entropy, very short, or deliberately\n  diverse outputs may see little benefit, and the runtime backs off toward the\n  direct path when the accept rate drops below threshold.\n- **TurboQuant KV compression**: experimental and off by default. The\n  `turboquant-shadow` and `turboquant-fused-experimental` modes are evidence and\n  route-telemetry surfaces, not production support claims. 
The correctness quality\n  gate (K8/V4 fused path, zero fallbacks) now passes for Gemma 4 E2B; the\n  remaining blocker is a long-context performance promotion artifact (≥8192-token\n  context) required before public docs can drop the experimental label. Run\n  `scripts/check_turboquant_promotion_readiness.py` to see the current gate\n  status before changing any public support wording.\n\n## Performance ([methodology](docs/PERFORMANCE.md))\n\nAX Engine columns were refreshed on 2026-05-09 from\n`benchmarks/results/mlx-inference/2026-05-09-post-v4.5.0/`. This run covers 25\ncommits since the q-slice-fix run, the two most performance-relevant being native\ntop-p/top-k sampling (`d3a8615`) and a TurboQuant fused decode hot-path improvement\n(`685ca98`). The `mlx_lm` and `mlx_swift_lm` columns are matched reference rows\nreused from the previous run.\n\n**Prefill** — AX engine prefill is faster than mlx_lm on most models at short\nprompts (+40–170% at 128 tokens), driven by chunked KV allocation and a tuned\npipeline. At 512-token prompts the gap narrows; several Gemma quantizations\n(5-bit, 6-bit, E4B 4-bit) are 5–10% behind mlx_lm.\n\n**Decode** — Direct decode (n-gram disabled): Qwen 3.6 35B variants are +3–8%\nabove mlx_lm; Gemma 4-bit models and most others are within ±4%; Gemma 5–8-bit\nmodels are 5–15% below mlx_lm, a regression attributable to per-step sampling\noverhead introduced in `d3a8615`. With n-gram acceleration (the default),\neffective throughput reaches up to 3.1× mlx_lm; the speculator backs off on\nhigh-entropy outputs.\n\n**TTFT** — Qwen 3.6 and Coder Next TTFT leads are maintained: −37–63% vs\nmlx_lm across all prompt sizes. Gemma E2B at 128 tokens: −29–39%. Several\n512-token rows (E2B 5-bit, E2B 6-bit, E4B) are 7–12% above mlx_lm due to\nprefill parity or regression in this run. Source:\n`benchmarks/results/mlx-inference/2026-05-09-post-v4.5.0/`. 
mlx_lm TTFT is\nderived from reported prefill throughput; ax engine TTFT is measured directly\nfrom per-step runner timing.\n\nAdditional long-context validation artifacts are checked in separately from the\nshort/mid-prompt public tables. On 2026-05-07, `mlx-community/Qwen3-4B-4bit`\nwas run on Apple M5 Max through the P1 prefill-scaling gate and the P2\nstartup/concurrent-prefill gate:\n[P1 prefill scaling](benchmarks/results/mlx-inference/2026-05-07-real-p1/qwen3-4b-4bit-prefill-scaling/prefill-scaling.md),\n[P2 startup and concurrency](benchmarks/results/mlx-inference/2026-05-07-real-p2/qwen3-4b-4bit-p2-latency/p2-latency.md).\nThese artifacts measure direct AX MLX behavior, not n-gram decode acceleration:\nthe 8k P1 AX/MLX prefill ratio was 0.840x, and the 4-request P2 concurrent\nprefill row was classified as serialized. Treat them as expectation-management\nevidence for long-context serving claims, not as proof of continuous batching.\n\n### Prefill throughput (tok/s) — percentages vs mlx_lm\n\n| Model | MLX quantization | Prompt tok | mlx_lm | mlx_swift_lm | ax engine |\n|---|---|---:|---:|---:|---:|\n| Gemma 4 E2B | 4-bit · group=64 · affine | 128 | 2,265.8 | 2,450.4 (+8.1%) | 3,413.2 (+50.7%) |\n|    |    | 512 | 7,634.1 | 6,664.3 (-12.7%) | 7,744.1 (+1.4%) |\n| Gemma 4 E2B | 5-bit · group=64 · affine | 128 | 2,267.5 | 2,393.9 (+5.6%) | 3,306.6 (+45.8%) |\n|    |    | 512 | 8,405.7 | 6,742.6 (-19.8%) | 7,532.0 (-10.4%) |\n| Gemma 4 E2B | 6-bit · group=64 · affine | 128 | 2,156.3 | 3,436.8 (+59.4%) | 3,058.0 (+41.8%) |\n|    |    | 512 | 7,320.7 | 7,962.3 (+8.8%) | 6,833.5 (-6.7%) |\n| Gemma 4 E2B | 8-bit · group=64 · affine | 128 | 1,911.7 | 3,082.0 (+61.2%) | 3,113.2 (+62.9%) |\n|    |    | 512 | 6,582.8 | 6,758.1 (+2.7%) | 7,201.9 (+9.4%) |\n| Gemma 4 E4B | 4-bit · group=64 · affine | 128 | 1,586.0 | 2,006.2 (+26.5%) | 2,339.7 (+47.5%) |\n|    |    | 512 | 4,432.6 | 4,362.5 (-1.6%) | 4,101.3 (-7.5%) |\n| Gemma 4 26B A4B | 4-bit · group=64 · affine | 128 
| 545.3 | 1,227.3 (+125.1%) | 1,127.2 (+106.7%) |\n|    |    | 512 | 1,620.7 | 2,938.6 (+81.3%) | 2,887.7 (+78.2%) |\n| Gemma 4 31B | 4-bit · group=64 · affine | 128 | 336.5 | 641.6 (+90.7%) | 510.4 (+51.7%) |\n|    |    | 512 | 563.5 | 760.6 (+35.0%) | 662.9 (+17.7%) |\n| Qwen 3.5 9B | 4-bit · group=64 · affine | 128 | 1,131.5 | 2,101.1 (+85.7%) | 1,924.5 (+70.1%) |\n|    |    | 512 | 2,285.3 | 3,165.8 (+38.5%) | 2,711.2 (+18.6%) |\n| Qwen 3.6 35B A3B | UD-MLX 4-bit · group=64 · affine | 128 | 531.7 | 963.2 (+81.1%) | 981.5 (+84.6%) |\n|    |    | 512 | 1,594.2 | 2,546.5 (+59.7%) | 2,517.2 (+57.9%) |\n| Qwen 3.6 35B A3B | MLX 5-bit · group=64 · affine | 128 | 474.4 | 861.8 (+81.7%) | 960.8 (+102.5%) |\n|    |    | 512 | 1,484.5 | 2,416.7 (+62.8%) | 2,434.7 (+64.0%) |\n| Qwen 3.6 35B A3B | MLX 6-bit · group=64 · affine | 128 | 420.0 | 762.4 (+81.5%) | 908.9 (+116.4%) |\n|    |    | 512 | 1,377.9 | 2,350.6 (+70.6%) | 2,328.1 (+69.0%) |\n| Qwen 3.6 35B A3B | MLX 8-bit · group=64 · affine | 128 | 393.1 | 617.7 (+57.1%) | 923.2 (+134.8%) |\n|    |    | 512 | 1,202.2 | 2,305.2 (+91.7%) | 2,275.8 (+89.3%) |\n| Qwen Coder Next | 4-bit · group=64 · affine | 128 | 267.1 | 384.9 (+44.1%) | 714.4 (+167.4%) |\n|    |    | 512 | 815.4 | 1,417.0 (+73.8%) | 1,665.1 (+104.2%) |\n| GLM 4.7 Flash | 4-bit · group=64 · affine | 128 | 502.9 | 1,045.0 (+107.8%) | 819.2 (+62.9%) |\n|    |    | 512 | 1,584.7 | 2,588.8 (+63.4%) | 2,230.9 (+40.8%) |\n\n### Decode throughput (tok/s) — generation=128 tokens, temp=0\n\nThe direct AX column is a same-policy diagnostic baseline with n-gram acceleration\ndisabled. The n-gram column is the default AX decode policy and the row to use for\nuser-facing throughput expectations. 
For Qwen 3.5 at 512 prompt tokens, the default\nn-gram row falls back to the direct pipeline after a no-draft probe window.\n\n| Model | MLX quantization | Prompt tok | mlx_lm | mlx_swift_lm | ax direct baseline | ax default n-gram |\n|---|---|---:|---:|---:|---:|---|\n| Gemma 4 E2B | 4-bit · group=64 · affine | 128 | 197.5 | 192.4 (-2.6%) | 192.1 (-2.7%) | **581.5 (+194.5%)** |\n|    |    | 512 | 191.9 | 179.5 (-6.5%) | 184.9 (-3.6%) | **575.1 (+199.6%)** |\n| Gemma 4 E2B | 5-bit · group=64 · affine | 128 | 182.9 | 174.1 (-4.8%) | 169.7 (-7.2%) | **457.4 (+150.0%)** |\n|    |    | 512 | 178.1 | 167.0 (-6.2%) | 164.6 (-7.6%) | **454.2 (+155.0%)** |\n| Gemma 4 E2B | 6-bit · group=64 · affine | 128 | 161.3 | 153.0 (-5.1%) | 137.2 (-14.9%) | **377.9 (+134.3%)** |\n|    |    | 512 | 154.2 | 147.1 (-4.6%) | 137.6 (-10.8%) | **403.3 (+161.5%)** |\n| Gemma 4 E2B | 8-bit · group=64 · affine | 128 | 139.4 | 134.9 (-3.2%) | 125.3 (-10.1%) | **412.5 (+195.9%)** |\n|    |    | 512 | 134.5 | 130.8 (-2.8%) | 128.2 (-4.7%) | **416.6 (+209.6%)** |\n| Gemma 4 E4B | 4-bit · group=64 · affine | 128 | 121.3 | 116.4 (-4.0%) | 109.9 (-9.4%) | **332.2 (+173.9%)** |\n|    |    | 512 | 120.0 | 117.9 (-1.7%) | 109.5 (-8.7%) | **340.6 (+184.0%)** |\n| Gemma 4 26B A4B | 4-bit · group=64 · affine | 128 | 118.3 | 109.4 (-7.5%) | 115.6 (-2.2%) | **259.2 (+119.2%)** |\n|    |    | 512 | 113.1 | 104.7 (-7.5%) | 111.0 (-1.8%) | **220.0 (+94.5%)** |\n| Gemma 4 31B | 4-bit · group=64 · affine | 128 | 26.2 | 24.8 (-5.5%) | 25.2 (-3.8%) | **57.3 (+118.4%)** |\n|    |    | 512 | 24.9 | 24.7 (-0.9%) | 23.8 (-4.5%) | **55.5 (+122.7%)** |\n| Qwen 3.5 9B | 4-bit · group=64 · affine | 128 | 95.2 | 93.7 (-1.6%) | 91.9 (-3.5%) | **186.4 (+95.8%)** |\n|    |    | 512 | 93.4 | 91.4 (-2.2%) | 89.9 (-3.8%) | 86.3 (-7.6%) |\n| Qwen 3.6 35B A3B | UD-MLX 4-bit · group=64 · affine | 128 | 107.6 | 103.6 (-3.7%) | 104.3 (-3.1%) | **250.1 (+132.4%)** |\n|    |    | 512 | 103.3 | 101.4 (-1.9%) | 107.1 (+3.7%) | **254.6 
(+146.5%)** |\n| Qwen 3.6 35B A3B | MLX 5-bit · group=64 · affine | 128 | 116.8 | 110.2 (-5.6%) | 124.1 (+6.3%) | **261.6 (+123.9%)** |\n|    |    | 512 | 113.7 | 108.7 (-4.4%) | 122.6 (+7.8%) | **256.1 (+125.2%)** |\n| Qwen 3.6 35B A3B | MLX 6-bit · group=64 · affine | 128 | 102.9 | 99.1 (-3.6%) | 106.1 (+3.1%) | **259.6 (+152.4%)** |\n|    |    | 512 | 101.1 | 98.0 (-3.1%) | 106.0 (+4.9%) | **256.5 (+153.8%)** |\n| Qwen 3.6 35B A3B | MLX 8-bit · group=64 · affine | 128 | 93.6 | 89.3 (-4.6%) | 98.0 (+4.7%) | **227.5 (+143.1%)** |\n|    |    | 512 | 91.4 | 89.1 (-2.6%) | 97.6 (+6.8%) | **225.4 (+146.5%)** |\n| Qwen Coder Next | 4-bit · group=64 · affine | 128 | 92.2 | 89.4 (-3.0%) | 89.3 (-3.1%) | **223.2 (+142.2%)** |\n|    |    | 512 | 90.4 | 89.2 (-1.3%) | 89.2 (-1.3%) | **220.8 (+144.3%)** |\n| GLM 4.7 Flash | 4-bit · group=64 · affine | 128 | 93.0 | 88.0 (-5.4%) | 91.0 (-2.1%) | **250.5 (+169.3%)** |\n|    |    | 512 | 90.4 | 84.5 (-6.6%) | 88.3 (-2.3%) | **243.0 (+168.8%)** |\n\n### Time to first token (ms) — generation=128 tokens, temp=0\n\nLower is better. mlx_lm and mlx_swift_lm values are derived from reported prefill\nthroughput (`prompt_tokens / prefill_tok_s × 1000 ms`); ax engine values are directly\nmeasured from per-step runner timing in the SSE event stream. 
Source:\n`benchmarks/results/mlx-inference/2026-05-09-post-v4.5.0/`.\n\n| Model | MLX quantization | Prompt tok | mlx_lm | mlx_swift_lm | ax engine |\n|---|---|---:|---:|---:|---:|\n| Gemma 4 E2B | 4-bit · group=64 · affine | 128 | 56.5 | 52.2 (-7.5%) | **37.5 (-33.6%)** |\n|    |    | 512 | 67.1 | 76.8 (+14.6%) | **66.1 (-1.4%)** |\n| Gemma 4 E2B | 5-bit · group=64 · affine | 128 | 56.4 | 53.5 (-5.3%) | **38.7 (-31.4%)** |\n|    |    | 512 | 60.9 | 75.9 (+24.7%) | 68.0 (+11.6%) |\n| Gemma 4 E2B | 6-bit · group=64 · affine | 128 | 59.4 | 37.2 (-37.3%) | **41.9 (-29.5%)** |\n|    |    | 512 | 69.9 | 64.3 (-8.1%) | 74.9 (+7.1%) |\n| Gemma 4 E2B | 8-bit · group=64 · affine | 128 | 67.0 | 41.5 (-38.0%) | **41.1 (-38.6%)** |\n|    |    | 512 | 77.8 | 75.8 (-2.6%) | **71.1 (-8.6%)** |\n| Gemma 4 E4B | 4-bit · group=64 · affine | 128 | 80.7 | 63.8 (-20.9%) | **54.7 (-32.2%)** |\n|    |    | 512 | 115.5 | 117.4 (+1.6%) | 124.8 (+8.1%) |\n| Gemma 4 26B A4B | 4-bit · group=64 · affine | 128 | 234.7 | 104.3 (-55.6%) | **113.6 (-51.6%)** |\n|    |    | 512 | 315.9 | 174.2 (-44.8%) | **177.3 (-43.9%)** |\n| Gemma 4 31B | 4-bit · group=64 · affine | 128 | 380.4 | 199.5 (-47.6%) | **250.8 (-34.1%)** |\n|    |    | 512 | 908.7 | 673.1 (-25.9%) | **772.3 (-15.0%)** |\n| Qwen 3.5 9B | 4-bit · group=64 · affine | 128 | 113.1 | 60.9 (-46.1%) | **66.5 (-41.2%)** |\n|    |    | 512 | 224.0 | 161.7 (-27.8%) | **188.8 (-15.7%)** |\n| Qwen 3.6 35B A3B | UD-MLX 4-bit · group=64 · affine | 128 | 240.7 | 132.9 (-44.8%) | **130.4 (-45.8%)** |\n|    |    | 512 | 321.2 | 201.1 (-37.4%) | **203.4 (-36.7%)** |\n| Qwen 3.6 35B A3B | MLX 5-bit · group=64 · affine | 128 | 269.8 | 148.5 (-45.0%) | **133.2 (-50.6%)** |\n|    |    | 512 | 344.9 | 211.9 (-38.6%) | **210.3 (-39.0%)** |\n| Qwen 3.6 35B A3B | MLX 6-bit · group=64 · affine | 128 | 304.8 | 167.9 (-44.9%) | **140.8 (-53.8%)** |\n|    |    | 512 | 371.6 | 217.8 (-41.4%) | **219.9 (-40.8%)** |\n| Qwen 3.6 35B A3B | MLX 8-bit · group=64 · affine 
| 128 | 325.6 | 207.2 (-36.4%) | **138.7 (-57.4%)** |\n|    |    | 512 | 425.9 | 222.1 (-47.8%) | **225.0 (-47.2%)** |\n| Qwen Coder Next | 4-bit · group=64 · affine | 128 | 479.2 | 332.6 (-30.6%) | **179.2 (-62.6%)** |\n|    |    | 512 | 627.9 | 361.3 (-42.5%) | **307.5 (-51.0%)** |\n| GLM 4.7 Flash | 4-bit · group=64 · affine | 128 | 254.5 | 122.5 (-51.9%) | **156.2 (-38.6%)** |\n|    |    | 512 | 323.1 | 197.8 (-38.8%) | **229.5 (-29.0%)** |\n\n### Embedding throughput (tok/s) — runtime apples-to-apples\n\nMeasured on the same tokenized inputs with matching pooling (`last`) and normalization (`true`) settings across backends. Source: `benchmarks/results/embedding/ab-postfix/`.\n\nSingle-request median throughput (ax-engine-py vs mlx-lm, same session):\n\n| Model | mlx-lm (baseline) | ax-engine-py |\n|---|---:|---:|\n| Qwen3-Embedding 0.6B 8-bit | 1,410.3 | 1,398.8 (≈-6%) † |\n| Qwen3-Embedding 4B 4-bit | 536.6 | 444.3 (-17.2%) |\n| Qwen3-Embedding 8B 4-bit DWQ | 319.8 | 280.4 (-12.3%) |\n\n† The 0.6B model completes in ~6ms/sentence, making it sensitive to thermal variance. 
Run-to-run gap typically ranges from -5% to -10%.\n\n## Installation\n\n### Homebrew\n\nFor tagged macOS arm64 releases, install the preview command-line tools from\nthe AutomatosX tap:\n\n```bash\nbrew install defai-digital/ax-engine/ax-engine\n```\n\nThis installs:\n\n- `ax-engine-server`: local HTTP adapter over the SDK runtime\n- `ax-engine-bench`: workload-contract, readiness, direct-generate, and\n  benchmark-support CLI\n- `ax-engine-manager`: Ratatui local manager for readiness, server metadata,\n  benchmark artifacts, guarded job plans, and redacted support bundles\n- the Homebrew `mlx-c` runtime dependency required by the released binaries\n\nCheck the installed tools:\n\n```bash\nax-engine-server --help\nax-engine-bench doctor\nax-engine-manager --check\n```\n\nHomebrew is the quickest path for the released server and benchmark binaries.\nIf `ax-engine-bench doctor` fails with `Library not loaded:\n/opt/homebrew/opt/mlx-c/lib/libmlxc.dylib`, install or repair the runtime with\n`brew install mlx-c` and `brew reinstall defai-digital/ax-engine/ax-engine`.\nUse the source build when you need the full Rust workspace, Python extension,\nlocal examples, or changes that have not been tagged yet.\n\nThe release archive attached to GitHub is the Homebrew formula payload. It is\nnot a standalone installer with bundled dynamic libraries. Use Homebrew unless\nyou are prepared to provide `mlx-c` and its dynamic library path yourself.\n\n### Source\n\nDevelopment builds require Rust and the MLX C runtime on Apple Silicon:\n\n```bash\nbrew install mlx-c\ncargo build --workspace --release\n```\n\nPython bindings are built from source:\n\n```bash\nmaturin develop\npython -m unittest discover -s python/tests -v\n```\n\n## Quick Start\n\nThe fastest local workflow is:\n\n1. install or build the command-line tools;\n2. download a supported MLX model and generate its manifest;\n3. start the local server;\n4. 
open `ax-engine-manager` to inspect readiness, server metadata, benchmark\n   artifacts, guarded job plans, and redacted support bundles.\n\nFor a complete manager walkthrough, see [docs/MANAGER.md](docs/MANAGER.md).\n\nThe commands below use source-build paths. If you installed with Homebrew, use\n`ax-engine-server`, `ax-engine-bench`, and `ax-engine-manager` directly instead\nof `./target/release/...`.\n\n```bash\n# Download a model and generate its manifest\npython scripts/download_model.py mlx-community/Qwen3-4B-4bit\n# prints the local path when ready, e.g. ~/.cache/ax-engine/models/mlx-community--Qwen3-4B-4bit\nMODEL_DIR=\"$HOME/.cache/ax-engine/models/mlx-community--Qwen3-4B-4bit\"\n\n# Check readiness without entering terminal raw mode\n./target/release/ax-engine-manager --check --model-dir \"$MODEL_DIR\"\n\n# HTTP inference server (repo-owned MLX runtime)\n./target/release/ax-engine-server \\\n  --mlx \\\n  --mlx-model-artifacts-dir \"$MODEL_DIR\" \\\n  --port 8080\n\n# Local Ratatui cockpit\n./target/release/ax-engine-manager \\\n  --model-dir \"$MODEL_DIR\" \\\n  --server-url http://127.0.0.1:8080\n```\n\n```python\n# Python bindings (after maturin develop)\nimport ax_engine\n\npath = ax_engine.download_model(\"mlx-community/Qwen3-4B-4bit\")\nwith ax_engine.Session(mlx=True, mlx_model_artifacts_dir=str(path)) as s:\n    result = s.generate([1, 2, 3], max_output_tokens=32)\n    print(result.output_tokens)\n```\n\nFor an unsupported MLX text model that upstream `mlx-lm` can serve, keep AX\nEngine as the CLI/server surface and delegate the model execution explicitly:\n\n```bash\nmlx_lm.server --model /path/to/local/mlx-model --host 127.0.0.1 --port 8090\n\n./target/release/ax-engine-bench generate \\\n  --prompt \"Hello from mlx-lm\" \\\n  --support-tier mlx_lm_delegated \\\n  --mlx-lm-server-url http://127.0.0.1:8090\n```\n\n`mlx_lm_delegated` is a compatibility route, not a repo-owned MLX throughput\nclaim. 
It forwards text generation to upstream `mlx_lm.server`, preserves AX\nsampling fields such as `temperature`, `top_p`, `top_k`, `repetition_penalty`,\nand `seed`, and exposes blocking plus SSE text surfaces through AX. Streamed\nchunks are delegated text deltas; they are not AX-owned token IDs, KV state, or\nmodel-kernel throughput evidence. Tool calls and visual/multimodal inputs are\nnot yet AX compatibility contracts.\n\n```bash\n# Primary benchmark: AX vs mlx_lm vs mlx-swift-lm\npython3 scripts/bench_mlx_inference_stack.py \\\n  --model-dir /path/to/local/mlx-model \\\n  --prompt-tokens 128,512 --generation-tokens 128 \\\n  --ax-compare-policies --repetitions 3 \\\n  --mlx-swift-lm-command './scripts/mlx-swift-bench/.build/release/mlx-swift-bench \\\n    --model {model} --prompt-token-ids {prompt_token_ids_path} \\\n    --generation-tokens {generation_tokens} --trials {trials} \\\n    --delay {delay} --prefill-step-size {prefill_step_size}' \\\n  --output benchmarks/results/mlx-inference/2026-05-04/gemma-4-e2b-it-4bit.json\n\n# Secondary workload-contract benchmark\n./target/release/ax-engine-bench scenario \\\n  --manifest benchmarks/manifests/scenario/chat_gemma4_e2b_short.json \\\n  --output-root benchmarks/results\n\n# Smoke checks\n./target/release/ax-engine-manager --check --model-dir \"$MODEL_DIR\"\nbash scripts/check-server-preview.sh\nbash scripts/check-python-preview.sh\n```\n\n## Getting a Model\n\nax-engine requires pre-sanitized MLX weights. The recommended source is\n[mlx-community](https://huggingface.co/mlx-community) — models there are already\nconverted and validated. 
Loading an unsanitized raw HF checkpoint into a hybrid\narchitecture (Qwen3.5, Qwen3-Next) raises a hard error at load time.\n\n### mlx-community model (recommended)\n\n`download_model()` and `scripts/download_model.py` download weights and auto-generate\nthe required `model-manifest.json` in one step:\n\n```bash\n# Script (works with Homebrew install or source build)\npython scripts/download_model.py mlx-community/Qwen3-4B-4bit\n\n# For automation and future TUI integration, emit a parseable summary\npython scripts/download_model.py mlx-community/Qwen3-4B-4bit --json\n\n# Python SDK (run through a heredoc so the snippet works from the same shell)\npython - \u003c\u003c'PY'\nfrom ax_engine import download_model\npath = download_model(\"mlx-community/Qwen3-4B-4bit\")\nprint(path)\nPY\n```\n\nIf you already have `mlx_lm` installed, its download also lands in the standard HF\ncache that ax-engine can auto-discover:\n\n```bash\npython -m mlx_lm.generate --model mlx-community/Qwen3-4B-4bit --prompt \"x\" --max-tokens 1\nax-engine-bench generate-manifest ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/\u003chash\u003e\nax-engine-server --mlx --resolve-model-artifacts hf-cache --preset qwen3_dense --port 8080\n```\n\n### Raw HuggingFace checkpoint\n\nRaw checkpoints need sanitization before ax-engine can load them. Use `mlx_lm.convert`:\n\n```bash\npip install mlx-lm\nmlx_lm.convert --hf-path \u003corg/model\u003e --mlx-path /path/to/dest -q --q-bits 4\nax-engine-bench generate-manifest /path/to/dest\nax-engine-server --mlx --mlx-model-artifacts-dir /path/to/dest --port 8080\n```\n\n### Manifest generation\n\nBoth paths above require `model-manifest.json`. The download helpers generate it\nautomatically. 
To run manifest generation directly:\n\n```bash\nax-engine-bench generate-manifest /path/to/model      # Homebrew or built binary\ncargo run -p ax-engine-core --bin generate-manifest -- /path/to/model  # source\n```\n\n## SDKs\n\nax-engine-server exposes OpenAI-compatible HTTP endpoints, and several SDKs\nwrap those endpoints or the in-process Rust session directly.\n\n| Language | Package / path | LangChain |\n|----------|---------------|-----------|\n| **Python** | `python/ax_engine` | `ax_engine.langchain` — `AXEngineChatModel`, `AXEngineLLM` |\n| **TypeScript / JS** | `javascript/ax-engine` (`@ax-engine/sdk`) | `@ax-engine/sdk/langchain` — `ChatAXEngine`, `AXEngineLLM` |\n| **Go** | `sdk/go/axengine` | Use [langchaingo](https://github.com/tmc/langchaingo) OpenAI provider — see `examples/go/langchain/` |\n| **Ruby** | `sdk/ruby` (`ax-engine-sdk`) | `ax_engine/langchain` — `ChatModel`, `LLM` (requires langchain-rb) |\n| **Mojo** | `sdk/mojo/ax_engine.mojo` | Via Python — use `ax_engine.langchain` from Mojo's Python interop |\n\n### TypeScript / JavaScript\n\n```bash\nnpm install @ax-engine/sdk\n```\n\n```typescript\nimport AxEngineClient from \"@ax-engine/sdk\";\n\nconst client = new AxEngineClient({ baseUrl: \"http://127.0.0.1:8080\" });\nconst resp = await client.chatCompletion({\n  messages: [{ role: \"user\", content: \"Hello!\" }],\n  max_tokens: 128,\n});\nconsole.log(resp.choices[0].message.content);\n\n// Streaming\nfor await (const event of client.streamChatCompletion({ messages: [...], stream: true })) {\n  process.stdout.write(event.data.choices[0]?.delta?.content ?? 
\"\");\n}\n```\n\nLangChain integration (requires `@langchain/core`):\n\n```typescript\nimport { ChatAXEngine } from \"@ax-engine/sdk/langchain\";\nimport { HumanMessage } from \"@langchain/core/messages\";\n\nconst chat = new ChatAXEngine({ maxTokens: 128 });\nconst response = await chat.invoke([new HumanMessage(\"Hello!\")]);\n```\n\n### Go\n\nThe Go SDK lives at `sdk/go/axengine` (module `github.com/ax-engine/ax-engine-go`).\n\n```go\nclient := axengine.NewClient(nil)\n\nresp, err := client.ChatCompletion(ctx, axengine.OpenAiChatCompletionRequest{\n    Messages:  []axengine.OpenAiChatMessage{{Role: \"user\", Content: \"Hello!\"}},\n    MaxTokens: axengine.Ptr(128),\n})\n\n// Streaming\nch, errCh := client.StreamChatCompletion(ctx, req)\nfor chunk := range ch {\n    fmt.Print(*chunk.Choices[0].Delta.Content)\n}\n```\n\nSee `examples/go/` for runnable examples. For LangChain, point\n[langchaingo](https://github.com/tmc/langchaingo)'s OpenAI provider at\n`http://127.0.0.1:8080/v1` — see `examples/go/langchain/` and `docs/GO.md`.\n\n### Ruby\n\nThe Ruby SDK lives at `sdk/ruby/` (`ax-engine-sdk` gem). Zero dependencies —\nstdlib `net/http` only. 
Streaming uses a block interface.\n\n```ruby\nrequire \"ax_engine\"\n\nclient = AxEngine::Client.new\n\n# Blocking chat\nresp = client.chat_completion(\n  messages: [{ role: \"user\", content: \"Hello!\" }],\n  max_tokens: 128\n)\nputs resp.dig(\"choices\", 0, \"message\", \"content\")\n\n# Streaming\nclient.stream_chat_completion(\n  messages: [{ role: \"user\", content: \"Count from 1 to 5.\" }],\n  max_tokens: 64\n) do |event|\n  print event.dig(\"data\", \"choices\", 0, \"delta\", \"content\").to_s\nend\n```\n\nLangChain via [langchain-rb](https://github.com/patterns-ai-core/langchain):\n\n```ruby\nrequire \"ax_engine/langchain\"\n\nchat = AxEngine::Langchain::ChatModel.new(max_tokens: 256)\nputs chat.chat(messages: [{ role: \"user\", content: \"Hello!\" }]).chat_completion\n```\n\nSee `examples/ruby/` and `docs/RUBY.md` for full details.\n\n### Python — LangChain\n\n```python\nfrom ax_engine.langchain import AXEngineChatModel\nfrom langchain_core.messages import HumanMessage\n\nchat = AXEngineChatModel(base_url=\"http://127.0.0.1:8080\", max_tokens=256)\nresponse = chat.invoke([HumanMessage(content=\"Hello!\")])\nprint(response.content)\n\n# Streaming\nfor chunk in chat.stream([HumanMessage(content=\"Count from 1 to 5.\")]):\n    print(chunk.content, end=\"\", flush=True)\n```\n\nRequires `pip install langchain-core`. See `docs/PYTHON.md` for full details.\n\n### Mojo\n\nThe Mojo SDK (`sdk/mojo/ax_engine.mojo`) wraps the Python `ax_engine` package\nvia Mojo's `PythonObject` interop. 
Requires the Python extension to be built\nfirst (`maturin develop`).\n\n```mojo\nfrom sdk.mojo.ax_engine import Session\n\nvar session = Session(\n    \"qwen3_dense\",\n    mlx=True,\n    mlx_model_artifacts_dir=\"/path/to/artifacts\",\n)\nvar result = session.generate(\"Hello from Mojo!\", max_output_tokens=64)\nprint(result.output_text)\nsession.close()\n```\n\n## Workspace\n\n```\ncrates/ax-engine-core    Engine state machine, scheduler, KV manager, sampler\ncrates/ax-engine-mlx     MLX model graph, n-gram acceleration, KV cache, runner\ncrates/mlx-sys           bindgen FFI over mlx-c; safe MlxArray RAII wrappers\ncrates/ax-engine-sdk     Session API, backend resolution (MLX, mlx-lm delegated, or llama.cpp)\ncrates/ax-engine-server  Axum HTTP/SSE adapter (OpenAI-compatible routes)\ncrates/ax-engine-bench   Manifest-driven workload-contract CLI\ncrates/ax-engine-py      PyO3 extension (ABI3, Python 3.10+)\njavascript/ax-engine     TypeScript/JS HTTP SDK + LangChain adapter\nsdk/go/axengine          Go HTTP SDK\nsdk/ruby/                Ruby HTTP SDK (ax-engine-sdk gem)\nsdk/mojo/                Mojo SDK (Python-interop)\n```\n\nUnsupported MLX text models can use the explicit delegated `mlx_lm_delegated`\nroute through a user-provided `mlx_lm.server`. Non-MLX inference routes through\nthe delegated `llama.cpp` contract.\n\n## Development\n\n```bash\ncargo build --workspace                                           # build all crates\ncargo test --quiet                                                # full Rust test suite\ncargo clippy --all-targets --all-features -- -D warnings         # lint (CI gate)\ncargo fmt                                                         # format\nmaturin develop                                                   # rebuild Python extension\npython -m unittest discover -s python/tests -v                   # Python tests\n```\n\nCoverage is collected by the report-only GitHub Actions workflow in\n`.github/workflows/coverage.yml`. 
It publishes Rust `cargo llvm-cov` and Python\n`coverage.py` artifacts without enforcing a percentage threshold yet; add a gate\nonly after the project has a stable baseline across macOS, MLX, and PyO3 paths.\n\nPublic documentation is in `docs/`. Canonical benchmark manifests are in\n`benchmarks/manifests/`. Key design documents:\n[SDK / API](docs/SDK.md) ·\n[Manager](docs/MANAGER.md) ·\n[Python](docs/PYTHON.md) ·\n[JavaScript / TypeScript](docs/JAVASCRIPT.md) ·\n[Go](docs/GO.md) ·\n[Ruby](docs/RUBY.md) ·\n[Mojo](docs/MOJO.md) ·\n[Scheduler](docs/SCHEDULER.md) ·\n[KV Cache](docs/KV-CACHE.md) ·\n[Benchmarking](docs/BENCH-DESIGN.md)\n\n## Contributing\n\nAX Engine welcomes public contributions. See [CONTRIBUTING.md](CONTRIBUTING.md)\nfor guidelines.\n\n## Community\n\n- Website: [automatosx.com](https://automatosx.com)\n- Discord: [Join us](https://discord.com/invite/cTavsMgu)\n- Email: enquiry@defai.digital\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\nCopyright (c) 2026 [DEFAI Private Limited](https://defai.digital)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefai-digital%2Fax-engine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdefai-digital%2Fax-engine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdefai-digital%2Fax-engine/lists"}