https://github.com/cklxx/arle
Rust-native inference runtime for Qwen3 / Qwen3.5 — OpenAI-compatible serving + integrated agent, train, and self-evolution workflows. CUDA + Metal, no PyTorch on the hot path.
- Host: GitHub
- URL: https://github.com/cklxx/arle
- Owner: cklxx
- License: mit
- Created: 2026-03-30T09:27:33.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-04-27T08:15:39.000Z (about 2 months ago)
- Last Synced: 2026-04-27T08:28:49.878Z (about 2 months ago)
- Topics: agent, cuda, flashinfer, gspo, inference, infra, kv-cache, llm, metal, mlx, openai-compatible, qwen3, qwen35, rl, rust
- Language: Rust
- Homepage: https://cklxx.github.io/arle/
- Size: 9.98 MB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: SECURITY.md
- Support: docs/support-matrix.md
- Roadmap: ROADMAP.md
- Agents: AGENTS.md
# ARLE
Pure-Rust runtime for serving, local agents, training, and evaluation. `infer` is the OpenAI-compatible serving binary; `arle` is the unified front door.
Quick Start · HTTP API · Support Matrix · Architecture · Roadmap · Changelog · Contributing
English · 简体中文
---
## Quick Start
### 1. Install
**Apple Silicon — Homebrew (recommended):**
```bash
brew install cklxx/tap/arle
arle --doctor
```
**Apple Silicon or Linux x86_64 — one-line installer:**
```bash
curl -fsSL https://github.com/cklxx/arle/releases/latest/download/install.sh | sh
```
The script grabs the matching tarball from the latest GitHub Release,
SHA256-verifies it, and drops the binaries into `~/.local/bin` (override
with `INSTALL_DIR=...`). See [docs/install.md](docs/install.md) for the full
matrix, env-var overrides, and uninstall steps.
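The checksum step can also be reproduced by hand when installing from a manually downloaded tarball. The following is a minimal sketch, not the installer's actual logic: it assumes you already have the tarball on disk and its expected SHA-256 digest as a hex string. `sha256_of` and `verify_tarball` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read fixed-size chunks so large tarballs never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_tarball(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with an expected hex digest (case-insensitive)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

A mismatch means the download is corrupt or not the file the digest was published for, so it should not be unpacked.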
**Linux + NVIDIA — pull the published Docker image, no compile:**
```bash
docker run --rm --gpus all -p 8000:8000 \
  -v /path/to/Qwen3-4B:/model:ro \
  ghcr.io/cklxx/arle:latest \
  serve --backend cuda --model-path /model --port 8000
```
The `:latest` tag tracks `main`; tagged releases are published as
`ghcr.io/cklxx/arle:X.Y.Z` (note: no `v` prefix — the docker metadata-action
strips it). For the current release: `ghcr.io/cklxx/arle:0.1.5`.
**From source** (any backend; needed for `cpu`, `tilelang-attn`, or local hacking):
```bash
git clone https://github.com/cklxx/arle && cd arle
# Apple Silicon:
cargo build --release --no-default-features --features metal,no-cuda,cli --bin arle
# Linux + NVIDIA:
cargo build --release --features cli --bin arle
```
### 2. Serve a model
```bash
# Apple Silicon:
arle serve --backend metal \
  --model-path mlx-community/Qwen3-0.6B-4bit --port 8000
# Linux + NVIDIA:
arle serve --backend cuda \
  --model-path /path/to/Qwen3-4B --port 8000
```
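Scripts that start the server and then send requests usually need to wait until the port is open. Here is a minimal sketch using only the Python standard library. `wait_for_port` is a hypothetical helper, and a successful TCP connect only shows that something is listening on the port; it does not show that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 0.5) -> bool:
    """Poll until a TCP connection to host:port succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

For the commands above, `wait_for_port("localhost", 8000)` would be called after launching `arle serve` in the background.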
### 3. Talk to it
```python
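# pip install openai
from openai import OpenAI

# Point the SDK at the local ARLE server. The api_key value is a placeholder;
# the SDK just needs a non-empty string.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen3-4b",
    messages=[{"role": "user", "content": "Hello from ARLE"}],
)
print(response.choices[0].message.content)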
```
Or with curl: see [`examples/curl_chat.sh`](examples/curl_chat.sh).
More copy-paste paths: [`examples/`](examples/).
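If installing the `openai` package is not an option, the same request can be sent with the standard library. This is a sketch assuming the server accepts the usual OpenAI chat-completions request and returns the usual response shape at `/v1/chat/completions`, which is the route the SDK call above targets. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-style chat completion."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_text: str, timeout_s: float = 60.0) -> str:
    """Send the request and return the first choice's message content."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running as in step 2, `chat("http://localhost:8000/v1", "qwen3-4b", "Hello from ARLE")` mirrors the SDK example.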
### 4. Run the local agent
```bash
arle # interactive REPL with built-in tools
arle --model-path /path/to/Qwen3-4B run --prompt "Summarize this repo" # one-shot
arle --doctor --json # self-check, machine-readable
```
CPU-only smoke build (no GPU required, source build):
```bash
cargo build --release --no-default-features --features cpu,no-cuda,cli --bin arle
./target/release/arle --doctor
```
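The machine-readable self-check, `arle --doctor --json`, is convenient to call from CI. The sketch below only wraps the flags shown in this README (`--doctor`, `--json`, `--strict`). The shape of the JSON report and the meaning of the exit code are not documented here, so the helper returns both unmodified. `doctor_argv` and `run_json` are hypothetical names.

```python
import json
import subprocess
from typing import Any, List, Tuple


def doctor_argv(binary: str = "arle", strict: bool = False) -> List[str]:
    """Build the argv for the self-check, using only flags listed in this README."""
    argv = [binary, "--doctor", "--json"]
    if strict:
        argv.append("--strict")
    return argv


def run_json(argv: List[str], timeout_s: float = 60.0) -> Tuple[int, Any]:
    """Run a command and parse its stdout as JSON; the report is None if parsing fails."""
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    try:
        report = json.loads(proc.stdout)
    except json.JSONDecodeError:
        report = None
    return proc.returncode, report
```

A CI step could call `run_json(doctor_argv(strict=True))` and fail the job based on whichever fields or exit codes the installed version actually reports.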
---
## Status at a glance
| Backend | Platform | Status | Notes |
|---|---|:---:|---|
| **CUDA** | Linux + NVIDIA | **Stable** | Continuous batching, paged KV, radix-backed reuse, FlashInfer, CUDA Graph decode, packed paged-prefill for Qwen3 / Qwen3.5. **L4 / Qwen3-4B BF16 + FP8 paged KV (auto): 197 tok/s @ c=16 / 4096-in, peak_active=16 saturated.** |
| **Metal** | Apple Silicon | **Beta** | Live scheduler-backed serving, chunked prefill, replay-backed prefix reuse. Qwen3.5-0.8B MLX 4bit single-request step-driver reaches 305.5 tok/s on M4 Pro 20c; GGUF Q4_K_M exact default is 202.1 tok/s direct, with an opt-in native-q4 Metal load path at 236.7 tok/s direct / 239.8 tok/s step-driver on the matched 1024/256 profile. |
| **Metal DFlash** | Apple Silicon | **Beta, default-on** | Speculative decode for Qwen3 / Qwen3.5. Qwen3-4B bf16 achieves a 5.9× decode speedup; Qwen3.5-4B-4bit maintains bit-identical parity, validated for c=1..8. |
| **CPU** | Portable | **Dev-only** | Smoke tests and request-path validation; not a serving target. |
Models: **Qwen3 (0.6B – 72B)** and the **Qwen3.5 family** (including 0.8B GGUF Q4_K_M and 4B hybrid linear + full attention) are supported on CUDA and Metal according to the current matrix. **DeepSeek V3 / Qwen3.5-MoE** has a narrow Metal Beta path; CUDA remains stubbed. Llama 3 / 4 and DeepSeek V3 / R1 are planned — see [ROADMAP.md](ROADMAP.md).
Authoritative matrix (HTTP API tiers, quantization, agent / train / eval surfaces): [docs/support-matrix.md](docs/support-matrix.md).
Stability tiers: [docs/stability-policy.md](docs/stability-policy.md).
---
## Why ARLE
In agent and RL workloads every turn pays a prefill tax: system prompt + history + tool results must be re-processed. As context grows, **prefill dominates latency**. ARLE treats this as the core problem in both serving and agent / RL loops:
- **Multi-turn KV reuse.** Slot-sticky reuse keeps prior-turn KV hot for the next turn. CUDA also includes a radix-backed tiered-KV path (`T0 GPU → T1 host pinned → T2 local disk → T3 cluster-shared`) for full-block reuse and staged readmission, so only the new user message requires prefill each turn when the prefix stays reusable.
- **Paged KV pool.** Main CUDA KV formats use `page_size=16` with direct GPU page attach and tail-page CoW on shared prefixes — predictable accounting, reusable full blocks, cheaper prefix sharing.
- **Shared runtime authority.** `infer`, `arle`, and the in-tree train / eval jobs resolve models the same way and share one Rust runtime and one set of model contracts, so serving, local agent work, and RL tooling stay on one code path instead of drifting across separate stacks.
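The effect of full-block prefix reuse on per-turn prefill can be estimated with a toy model. This is an illustrative sketch, not ARLE's implementation. It takes `page_size=16` from the text and assumes that only whole pages of the shared prefix are reused, which ignores slot-sticky reuse and tail-page CoW and so overestimates the remaining prefill. The function names and token IDs are made up for the example.

```python
from typing import List, Sequence, Tuple

PAGE_SIZE = 16  # from the text: main CUDA KV formats use page_size=16


def reusable_prefix_tokens(cached: Sequence[int], prompt: Sequence[int],
                           page_size: int = PAGE_SIZE) -> int:
    """Length of the shared prefix, rounded down to a whole number of pages."""
    common = 0
    for cached_tok, prompt_tok in zip(cached, prompt):
        if cached_tok != prompt_tok:
            break
        common += 1
    return (common // page_size) * page_size


def prefill_per_turn(prompts: Sequence[Sequence[int]],
                     page_size: int = PAGE_SIZE) -> List[Tuple[int, int]]:
    """For each turn, return (prefill tokens without reuse, prefill tokens with reuse)."""
    results = []
    cached: Sequence[int] = []
    for prompt in prompts:
        reused = reusable_prefix_tokens(cached, prompt, page_size)
        results.append((len(prompt), len(prompt) - reused))
        cached = prompt  # the next turn can reuse this turn's context
    return results


# Toy conversation: a 1000-token first prompt, then 120 and 300 new tokens.
turn1 = list(range(1000))
turn2 = turn1 + list(range(5000, 5120))
turn3 = turn2 + list(range(9000, 9300))
print(prefill_per_turn([turn1, turn2, turn3]))
# prints [(1000, 1000), (1120, 128), (1420, 300)]
```

In turn 2 the 1000-token prefix covers 62 full pages (992 tokens), so 128 tokens are prefilled rather than 1120. Without reuse the prefill grows with the whole history, while with reuse it grows roughly with the new tokens only.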
Architecture deep-dive: [docs/architecture.md](docs/architecture.md) · [docs/codebase-map.md](docs/codebase-map.md).
Latest benchmark snapshots (per change, dated): [docs/experience/wins/](docs/experience/wins/) · run your own with [`scripts/bench_guidellm.sh`](scripts/bench_guidellm.sh).
---
## Entry surfaces
`arle` is the single binary users interact with:
| Command | What it does |
|---|---|
| `arle` (no args) | Interactive agent REPL with built-in `python` and `shell` tools (sandboxed). |
| `arle run --prompt "…"` / `--stdin --json` | Script-friendly one-shot agent prompt. Use `--no-tools` to disable tool execution. |
| `arle serve --backend {cuda,metal,cpu} --model-path …` | Launch the OpenAI-compatible HTTP server. |
| `arle train {pretrain,sft,grpo,multi-turn,eval}` | In-tree training and RL workflows on the same runtime. |
| `arle data {download,convert}` | Dataset utilities. |
| `arle --doctor [--json] [--strict]` | Self-check: backend, hardware, HF cache, model resolution. CI-friendly. |
The REPL persists line history at `~/.arle-history` and exposes slash commands: `/help`, `/reset`, `/clear`, `/tools`, `/model`, `/stats`, `/models`, `/save`, `/load`, `/export`.
Operators who want only the serving binary can use `infer` directly (`cargo build -p infer --release --features cuda` on Linux, `--features metal,no-cuda` on Apple Silicon) — same HTTP contract, without the agent / train / data surface.
---
## 📰 Latest Updates
- **2026-04-28**: Metal `Qwen3.5-0.8B` MLX 4bit single-request step-driver reaches **305.5 tok/s mean / 304.7 p50** on M4 Pro 20c for the `1024/256` profile, matching the Apple-native public SOTA band. GGUF `Q4_K_M` keeps the exact affine path as the default at **202.1 tok/s direct**; setting `AGENT_INFER_METAL_GGUF_NATIVE_Q4=all` enables a lossy MLX native-q4 load path at **236.7 tok/s direct / 239.8 tok/s step-driver** on the matched profile. Evidence: [`docs/experience/wins/2026-04-28-bench-metal-qwen35-0p8b-mlx4bit-qknorm-default.md`](docs/experience/wins/2026-04-28-bench-metal-qwen35-0p8b-mlx4bit-qknorm-default.md), [`docs/experience/wins/2026-04-28-bench-metal-qwen35-0p8b-gguf-native-q4.md`](docs/experience/wins/2026-04-28-bench-metal-qwen35-0p8b-gguf-native-q4.md).
- **2026-04-28**: CUDA L4 `Qwen3-4B` BF16 throughput at c=16 / 4096-in rose from **120 to 197 tok/s (+64%)** after automatic HBM-tier `chunked_prefill_size` selection and FP8 paged KV became the defaults on L4-class GPUs. `peak_active` saturates at 16/16, and throughput is +42% vs the SGLang reference on the same workload. Evidence: [`docs/experience/wins/2026-04-28-bench-guidellm-cuda-l4-kv-fp8-auto.md`](docs/experience/wins/2026-04-28-bench-guidellm-cuda-l4-kv-fp8-auto.md).
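The percentages in the CUDA entry can be checked with simple arithmetic. The sketch below recomputes the gain from the two quoted throughputs and back-solves the SGLang reference throughput that a +42% lead would imply. That implied figure is a back-of-envelope derivation and is not a measurement reported in this README; the function names are illustrative.

```python
def pct_gain(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


def implied_baseline(value: float, gain_pct: float) -> float:
    """Baseline that `value` would exceed by `gain_pct` percent."""
    return value / (1.0 + gain_pct / 100.0)


print(f"{pct_gain(120, 197):.1f}%")              # prints 64.2%
print(f"{implied_baseline(197, 42):.1f} tok/s")  # prints 138.7 tok/s
```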
Full history: [CHANGELOG.md](CHANGELOG.md). Next up: [ROADMAP.md](ROADMAP.md).
---
## Documentation map
- [docs/http-api.md](docs/http-api.md) — HTTP route contract, streaming behavior, boundary guarantees
- [docs/support-matrix.md](docs/support-matrix.md) — backend / model / quant / API support tiers
- [docs/stability-policy.md](docs/stability-policy.md) — stability levels and compatibility posture
- [docs/architecture.md](docs/architecture.md) — package boundaries and dependency direction
- [docs/codebase-map.md](docs/codebase-map.md) — workspace layout and main execution paths
- [docs/environment.md](docs/environment.md) — environment variables and runtime knobs
- [docs/troubleshooting.md](docs/troubleshooting.md) — common build / runtime errors and fixes
- [docs/comparison.md](docs/comparison.md) — how ARLE compares to vLLM / SGLang / mistral.rs / llama.cpp
- [docs/release-checklist.md](docs/release-checklist.md) · [docs/perf-and-correctness-gates.md](docs/perf-and-correctness-gates.md)
- [CONTRIBUTING.md](CONTRIBUTING.md) — contributor setup, validation, release expectations
- [SECURITY.md](SECURITY.md) — vulnerability reporting policy
- [examples/](examples/) — copy-paste smoke paths (curl, OpenAI SDK, Docker, Metal, train fixtures)
- [docs/index.md](docs/index.md) — maintainer-facing PARA index, plans, and experience logs
---
## License
[MIT](LICENSE)