https://github.com/cklxx/arle

Pure-Rust LLM runtime: one binary serves (OpenAI-compatible), runs local agents, and distills models on their own rollouts — on Apple Silicon and NVIDIA. No Python on the hot path.
agent cuda flashinfer gspo inference infra kv-cache llm metal mlx openai-compatible qwen3 qwen35 rl rust

Host: GitHub
URL: https://github.com/cklxx/arle
Owner: cklxx
License: mit
Created: 2026-03-30T09:27:33.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-06-29T16:53:41.000Z (16 days ago)
Last Synced: 2026-06-29T17:28:43.694Z (16 days ago)
Topics: agent, cuda, flashinfer, gspo, inference, infra, kv-cache, llm, metal, mlx, openai-compatible, qwen3, qwen35, rl, rust
Language: Rust
Homepage: https://cklxx.github.io/arle/
Size: 117 MB
Stars: 17
Watchers: 1
Forks: 1
Open Issues: 26
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Support: docs/support-matrix.md
- Roadmap: ROADMAP.md
- Agents: AGENTS.md

README

  One pure-Rust binary that serves LLMs (OpenAI-compatible), runs local agents, and distills them on their own rollouts — on Apple Silicon and NVIDIA. No Python on the hot path.





  _35B-A3B MoE at 85 tok/s on a MacBook · bit-identical speculative decode · OPD lifts a 4B student +27pp on MATH-500_





  

  

  

  

  





  Quick Start · HTTP API · Support Matrix · Onboarding · Architecture · Roadmap · Changelog





  English · 简体中文



---

## Quick Start

```bash
# Apple Silicon: Homebrew
brew install cklxx/tap/arle

# Apple Silicon or Linux x86_64: one-line installer
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh

# Linux + NVIDIA: Docker, no compile
docker run --rm --gpus all -p 8000:8000 -v /path/to/Qwen3.5-4B:/model:ro \
  ghcr.io/cklxx/arle:latest serve --backend cuda --model-path /model

# Serve
arle serve --backend cuda  --model-path /path/to/Qwen3.5-4B --port 8000
arle serve --backend metal --model-path mlx-community/Qwen3.5-0.8B-MLX-4bit --port 8000
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
print(client.chat.completions.create(
    model="qwen3.5-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
).choices[0].message.content)
```
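Because the server is OpenAI-compatible, the same request can also be issued without the `openai` package. The following is a minimal standard-library sketch, assuming the server accepts the usual `POST /v1/chat/completions` JSON body that the client above sends; the helper names `build_chat_request` and `ask` are hypothetical, and the URL and model name are copied from the example above.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(base_url: str, model: str, prompt: str) -> str:
    """Send the request and return the first choice's text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(base_url, model, prompt)) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# With `arle serve` listening on port 8000:
# print(ask("http://localhost:8000/v1", "qwen3.5-4b", "Hello from ARLE"))
```

Splitting request construction from sending keeps the payload inspectable without a running server.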

Build from source, full install matrix, uninstall: [docs/install.md](docs/install.md) · more copy-paste: [`examples/`](examples/).

`arle` is one binary:

| Command | What it does |
|---|---|
| `arle` (no args) | Picks a model, serves it locally, and hands the session to the [Eli](https://github.com/cklxx/eli) agent framework against it, or to the built-in `python`/`shell` REPL if Eli isn't installed. `--agent arle` forces the REPL; `--gateway` runs Eli's serve mode. Remembers the choice (defaults to Eli next run). |
| `arle run --prompt "…"` | One-shot agent prompt. `--no-tools` to disable tools. |
| `arle serve --backend …` | OpenAI-compatible HTTP server. |
| `arle train opd` | **On-Policy Distillation**: teacher on the serving runtime, student in `train`. [Manual](docs/projects/2026-05-21-arle-opd-cuda-usage-manual.md). |
| `arle --doctor [--json]` | Backend / hardware / model-resolution self-check. |

_Eli is an optional runtime dependency, discovered via `$ELI_BIN`, `PATH`, or a sibling `../eli` build; never a Cargo build-dep. Install it for the full agent runtime (governed self-evolution, gateway channels); without it `arle` uses its own REPL. `arle` points Eli at the local server through Eli's keyless local provider, leaving `~/.eli/config.toml` untouched._

---

## Performance

Measured on the runtime, not projected — fresh `arle serve` benches, one binary.

**Apple Silicon — one M4 Pro laptop (48 GB), single user.** A 35B-A3B MoE decodes as fast as the 4B dense and 1.7× the 9B, because only ~3B params activate per token:

| Model · Metal 4-bit | Decode | TPOT | TTFT |
|---|---:|---:|---:|
| Qwen3.5-0.8B | **318 tok/s** | 3.2 ms | 0.17 s |
| Qwen3.5-4B | 84 tok/s | 11.9 ms | 0.82 s |
| Qwen3.5-9B | 50 tok/s | 20.0 ms | 1.45 s |
| **Qwen3.6-35B-A3B** · MoE | **85 tok/s** | 11.7 ms | 1.23 s |

_512-in / 128-out · c=1 · temp=0 · M4 Pro · build 4ea77e11 · decode = single-stream generation rate · snapshot + method_
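The columns above are related by simple arithmetic: decode rate is roughly the reciprocal of TPOT, and a single-stream response takes about TTFT plus one TPOT per further token. The sketch below checks this against the table's own numbers. It assumes TTFT already covers the first output token; the README does not say how each column was measured, so small mismatches are expected. The function names are illustrative.

```python
# (decode tok/s, TPOT ms, TTFT s) copied from the table above.
ROWS = {
    "Qwen3.5-0.8B": (318, 3.2, 0.17),
    "Qwen3.5-4B": (84, 11.9, 0.82),
    "Qwen3.5-9B": (50, 20.0, 1.45),
    "Qwen3.6-35B-A3B": (85, 11.7, 1.23),
}


def tokens_per_second(tpot_ms: float) -> float:
    """Decode rate implied by a time-per-output-token given in milliseconds."""
    return 1000.0 / tpot_ms


def estimated_latency_s(ttft_s: float, tpot_ms: float, output_tokens: int) -> float:
    """Assumption: TTFT covers the first token; each later token costs one TPOT."""
    return ttft_s + (output_tokens - 1) * tpot_ms / 1000.0


for name, (decode, tpot_ms, ttft_s) in ROWS.items():
    implied = tokens_per_second(tpot_ms)
    total = estimated_latency_s(ttft_s, tpot_ms, 128)
    print(f"{name}: 1000/TPOT = {implied:.1f} tok/s (table: {decode}), "
          f"about {total:.2f} s for 128 output tokens")
```

The implied rates land within a few percent of the Decode column, and 85 / 50 gives the 1.7× ratio quoted for the MoE against the 9B dense model.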

**Speculative decode beats the HBM-bandwidth wall.** Qwen3.6-27B (OptiQ 4/8-bit): the model's own NextN/MTP head drafts, the base verifies, **output bit-identical to greedy** — **12.3 → 17.75 tok/s (+44%)**, past the 15.2 tok/s HBM floor no kernel can reach.

_Quality held: PPL 7.82 (vs 8.56 uniform-4bit) · 68.8% draft acceptance · default-on, `--no-speculative` to disable._
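Why draft-then-verify can be bit-identical to greedy decoding is easiest to see on a toy model. The sketch below is a generic greedy speculative-decoding loop, not ARLE's implementation: `toy_target` and `toy_draft` are made-up stand-ins for the base model and its draft head, and verification calls the target once per token, whereas a real runtime verifies all drafted tokens in one batched forward pass (which is where the speedup comes from).

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Reference: plain greedy decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], float]:
    """Draft k tokens, keep the prefix the target agrees with, repeat."""
    tokens = list(prompt)
    drafted_total = accepted_total = 0
    while len(tokens) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        ctx, drafted = list(tokens), []
        for _ in range(k):
            drafted.append(draft(ctx))
            ctx.append(drafted[-1])
        drafted_total += k
        # 2. The target checks each guess; its own token is always the one kept.
        for guess in drafted:
            verified = target(tokens)
            tokens.append(verified)
            if verified != guess:
                break  # first mismatch: discard the rest of the draft
            accepted_total += 1
        else:
            tokens.append(target(tokens))  # all accepted: one bonus token
    rate = accepted_total / drafted_total if drafted_total else 0.0
    return tokens[len(prompt):len(prompt) + n_new], rate


def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 7 + 3) % 11


def toy_draft(tokens: List[int]) -> int:
    guess = toy_target(tokens)
    return guess if tokens[-1] % 4 else (guess + 1) % 11  # wrong on some inputs


out, rate = speculative_decode(toy_target, toy_draft, [1, 2], 32)
print(out == greedy_decode(toy_target, [1, 2], 32), f"acceptance={rate:.2f}")
```

Every emitted token is the target's own greedy choice, so the draft only affects how many target steps can be batched, never the output.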

**NVIDIA — DeepSeek-V4-Flash, 8×H20 (TP=8 / EP=8, FP8 MoE).** B=1 decode **53 tok/s** (prefill 23 ms); the concurrent batched-decode lane adds **+48%** at c=8. Qwen3.6 FP8 MoE now serves on CUDA too (batched paged decode, tok/s scales c=1→8).



  



_DSv4 B=1 decode, 33.5 → 53.3 tok/s across the 2026-06-13 → 06-14 campaign; every step traced to a `docs/experience/wins/` entry._


**On-Policy Distillation lifts the student for real.** A Qwen3.5-4B LoRA student distilled on its *own* rollouts against the Qwen3.6-35B-A3B teacher (same serving runtime) lifts MATH-500 **0.518 → 0.792** (**+27pp, CI-separated**), reaching the teacher's neighborhood (**0.82**):



  



_MATH-500 greedy exact-match, n=500/seed @4096 tokens, 0 request-error · 3 recipe arms × 5 seeds, base→step25→step50 trajectory · error bars = ±1σ across seeds · base 0.518 (n=500) → reverse-KL 0.792, fully CI-separated · 2026-06-20. method._
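The caption names reverse-KL as the winning recipe arm. For readers unfamiliar with the term, the sketch below computes the textbook per-token reverse KL divergence, KL(student ‖ teacher), from two logit vectors and averages it over the positions of a student rollout. This illustrates the general objective only; ARLE's actual loss code, masking and weighting are not described in this README, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def reverse_kl(student_logits: Sequence[float], teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher) = sum_v p_s(v) * (log p_s(v) - log p_t(v))."""
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(s) * (s - t) for s, t in zip(log_s, log_t))


def rollout_reverse_kl(student_steps: Sequence[Sequence[float]],
                       teacher_steps: Sequence[Sequence[float]]) -> float:
    """Mean per-token reverse KL over a rollout sampled from the student."""
    per_token = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(per_token) / len(per_token)


# Two positions, vocabulary of three tokens (made-up logits).
student = [[2.0, 0.5, 0.1], [0.2, 0.2, 0.2]]
teacher = [[2.0, 0.5, 0.1], [3.0, 0.0, 0.0]]
print(f"{rollout_reverse_kl(student, teacher):.4f}")
```

The first position contributes zero because the two distributions match; only the second, where the student is uniform and the teacher is confident, adds to the loss.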


The same loop lifts *agentic* capability: with think-on OPD the 4B student learns to **decline irrelevant tool calls — BFCL-live abstention 0.60 → 1.00**. [method]

**Stability:** CUDA **Stable** · Metal **Beta** (DFlash + Qwen3.6 NextN-MTP: bit-identical spec decode) · OPD train **Beta** (~2× vs HF TRL `GKDTrainer` — measured 2.04–2.49× on Qwen3-0.6B; LoRA fits 4 GB cards) · CPU dev-only. Models: Qwen3-dense + Qwen3.5/3.6 (hybrid · MoE) on CUDA + Metal · DeepSeek-V4-Flash + GLM-5.2 (CUDA 8×H20 TP=8/EP=8; GLM-5.2 verify pending) · Qwen3.6 + Gemma4 · DeepSeek-OCR VLMs + DiffusionGemma (Metal). Full tiers: [support-matrix](docs/support-matrix.md) · [stability-policy](docs/stability-policy.md).

---

## Why ARLE

Agent and RL workloads waste compute re-processing the same prompt + history + tool output every turn. ARLE fixes this once and shares the fix across serving and training:

- **KV stays hot across turns.** Prior-turn KV stays on GPU so only new tokens prefill; prefix pages are shared across requests via the host radix cache, demote to a host-RAM tier under pressure (opt-in disk spill), and promote back on the next hit instead of re-prefilling. ([support-matrix §4b](docs/support-matrix.md#4b-multi-turn-kv-reuse--tiered-kv-matrix))

- **Quantized KV on CUDA.** INT8/FP8/INT4 paged-KV kernels behind a `--kv-cache-dtype` serve flag — correctness-gated, opt-in (default stays BF16).

- **KV-recall = long-context memory (Metal, opt-in).** When a session outgrows the window, decode attends only `sink + recent + top-k recalled` older blocks (scored by mean-key relevance to the current query) instead of the whole history. On Qwen3.6-35B a mid-context passkey resolves at **9.6% of the KV, identical to full attention** — where plain sliding-window truncation forgets it ([note](docs/notes/2026-06-23-kv-as-infinite-memory.md)). Behind `--kv-recall` (bf16, default off); the recall mechanism is live (compute-saving), L3 tier offload for the flat-VRAM-vs-history win is in progress.

- **One runtime, three surfaces.** Serving, the local agent, and OPD training run the same Rust + model code — the OPD teacher *is* the production server.
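The KV-recall bullet above describes attending to `sink + recent + top-k recalled` blocks, with older blocks scored by mean-key relevance to the current query. A minimal sketch of that selection step follows. It assumes relevance is the dot product between the query and each block's mean key, which the README does not specify; the parameter names `n_sink`, `n_recent` and `top_k` are illustrative, not ARLE flags.

```python
from typing import List, Sequence

Vector = Sequence[float]


def mean_key(block: Sequence[Vector]) -> List[float]:
    """Average the key vectors stored in one KV block."""
    return [sum(column) / len(block) for column in zip(*block)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_blocks(blocks: Sequence[Sequence[Vector]], query: Vector,
                  n_sink: int = 1, n_recent: int = 2, top_k: int = 1) -> List[int]:
    """Return the indices of the blocks that decode would attend to."""
    n = len(blocks)
    sink = set(range(min(n_sink, n)))            # always keep the first blocks
    recent = set(range(max(n - n_recent, 0), n))  # always keep the newest blocks
    older = [i for i in range(n) if i not in sink and i not in recent]
    # Assumed scoring rule: dot product of the query with the block's mean key.
    ranked = sorted(older, key=lambda i: dot(query, mean_key(blocks[i])), reverse=True)
    return sorted(sink | recent | set(ranked[:top_k]))


# Six blocks of two 2-d keys each; only block 3 points the same way as the query.
blocks = [[(0.0, 1.0), (0.0, 1.0)] for _ in range(6)]
blocks[3] = [(1.0, 0.0), (1.0, 0.0)]
print(select_blocks(blocks, query=(1.0, 0.0)))
```

In this toy case the recalled block is the one a sliding window would have dropped, which is the behaviour the passkey result above relies on.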

```mermaid
flowchart TB
  subgraph Surfaces["One arle binary"]
    Serve["arle serve<br/>OpenAI v1 HTTP"]
    Agent["arle<br/>local agent / REPL"]
    Train["arle train opd<br/>OPD: teacher is the production server"]
  end

  subgraph Serving["Serving layer"]
    Server["infer-server<br/>HTTP · streaming · ServeHandle"]
    API["infer-api<br/>LoadedInferenceEngine: programmatic front door"]
  end

  Core["infer-core: device-neutral Engine&lt;E,K&gt;<br/>continuous scheduler · RadixCache prefix reuse<br/>chunked prefill · paged-KV admission · sampling"]
  Seam["infer-plan IR · infer-seam<br/>the narrow waist: two host-only traits, BackendExecutor · KvPool"]

  subgraph Exec["Executors: a new backend = implement the two traits"]
    CUDA["infer-cuda<br/>official FlashMLA · DeepGEMM · DeepEP + TileLang AOT<br/>TP=8 / EP=8 · Qwen3.5 · Qwen3.6 · DeepSeek-V4-Flash · GLM-5.2"]
    Metal["infer-metal<br/>MLX bridge · packed varlen decode · wired weights<br/>Qwen3.5 · Qwen3.6 · Gemma4 · DeepSeek-OCR · DiffusionGemma"]
  end

  Serve --> Server
  Agent --> API
  Train --> API
  Server --> Core
  API --> Core
  Core --> Seam
  Seam --> CUDA
  Seam --> Metal
```
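The "RadixCache prefix reuse" box in the diagram, and the "only new tokens prefill" claim earlier in this section, come down to longest-prefix matching over token sequences. The sketch below uses a plain per-token trie to show the accounting. It is a simplification under stated assumptions: a real radix cache compresses edges and works on KV pages with eviction and tiering, and the `PrefixCache` class and `tokens_to_prefill` helper are hypothetical names, not ARLE types.

```python
from typing import Dict, List


class PrefixCache:
    """Toy token trie standing in for a radix cache of already-computed KV."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


def tokens_to_prefill(cache: PrefixCache, tokens: List[int]) -> int:
    """Count tokens that still need prefill, then record the full sequence."""
    hit = cache.match_len(tokens)
    cache.insert(tokens)
    return len(tokens) - hit


cache = PrefixCache()
turn_1 = [1, 2, 3, 4]            # system prompt + first user message
turn_2 = turn_1 + [5, 6]         # same history plus a tool result
other = [1, 2, 9]                # different request sharing the system prompt
print(tokens_to_prefill(cache, turn_1),
      tokens_to_prefill(cache, turn_2),
      tokens_to_prefill(cache, other))
```

The second turn pays only for its two new tokens, and the unrelated request still reuses the shared two-token prefix.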

Deep dive: [onboarding](docs/onboarding.md) (30 min) · [architecture](docs/architecture.md) · [codebase-map](docs/codebase-map.md).

---

## Documentation

[http-api](docs/http-api.md) · [support-matrix](docs/support-matrix.md) · [architecture](docs/architecture.md) · [codebase-map](docs/codebase-map.md) · [environment](docs/environment.md) · [troubleshooting](docs/troubleshooting.md) · [comparison vs vLLM / SGLang / mistral.rs / llama.cpp](docs/comparison.md) · [CONTRIBUTING](CONTRIBUTING.md) · [docs/index.md](docs/index.md)

---

## License

[MIT](LICENSE)
