https://github.com/thebasedcapital/ane-infer
Apple Neural Engine (ANE) LLM inference engine — reverse-engineered private APIs, Metal GPU shaders, hybrid ANE+GPU+CPU on Apple Silicon. 32 tok/s matching llama.cpp, 3.6 TFLOPS fused ANE mega-kernels.
ane apple-neural-engine apple-silicon deltanet edge-ai gguf llm-inference macos metal-gpu neural-engine npu on-device-ai quantization qwen reverse-engineering rust
- Host: GitHub
- URL: https://github.com/thebasedcapital/ane-infer
- Owner: thebasedcapital
- Created: 2026-03-05T02:36:31.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-03-05T02:44:14.000Z (3 months ago)
- Last Synced: 2026-03-05T07:50:52.421Z (3 months ago)
- Topics: ane, apple-neural-engine, apple-silicon, deltanet, edge-ai, gguf, llm-inference, macos, metal-gpu, neural-engine, npu, on-device-ai, quantization, qwen, reverse-engineering, rust
- Language: Rust
- Size: 181 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# ane-infer
> **Apple Neural Engine (ANE) LLM Inference Engine** — reverse-engineered private APIs, Metal GPU compute shaders, hybrid ANE+GPU+CPU on Apple Silicon M1/M2/M3/M4/M5
**Hybrid ANE+Metal+CPU inference engine for LLMs on Apple Silicon.**
First implementation of Qwen3.5 (Gated DeltaNet + GQA) running natively on Apple Neural Engine via reverse-engineered private APIs. 32 tok/s Metal GPU decode matching llama.cpp, 3.6 TFLOPS fused ANE mega-kernels, built from scratch in Rust + Obj-C + Metal.
Built on the shoulders of [maderix/ANE](https://github.com/maderix/ANE) — the project that cracked open ANE training. We took it further into inference with DeltaNet, Metal GPU shaders, and a complete decode pipeline.
### Keywords
`apple-neural-engine` `ane` `apple-silicon` `metal-gpu` `llm-inference` `on-device-ai` `neural-engine` `m1` `m2` `m3` `m4` `m5` `private-api` `reverse-engineering` `coreml` `gguf` `quantization` `q4` `q8` `deltanet` `qwen` `rust` `metal-shaders` `npu` `mlx-alternative` `llama-cpp-alternative` `macos` `ios` `edge-ai` `low-power-inference`
---
## What This Is — Apple Neural Engine LLM Inference
A from-scratch LLM inference engine that runs Qwen3.5-2B on three Apple Silicon accelerators simultaneously:
- **Apple Neural Engine (ANE)** — batched prefill via 1x1 convolutions through private `_ANEClient` APIs
- **Metal GPU** — single-token decode with 13 custom compute shaders, ONE command buffer per token
- **CPU (NEON/AMX)** — parallel Q8_0 GEMV via rayon, Accelerate BLAS fallback
No CoreML. No Python. No MLX. Just system frameworks + `objc_msgSend`.
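The ANE bullet above depends on a linear projection over a batch of tokens being the same computation as a 1x1 convolution over a `[channels, 1, seq]` tensor. Here is a minimal CPU sketch of that equivalence. The function names and layouts are illustrative assumptions, not the project's MIL generation code.

```rust
/// Token-major linear projection: y[t][o] = sum_i w[o][i] * x[t][i].
fn linear(w: &[f32], x: &[f32], c_in: usize, c_out: usize, seq: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; seq * c_out];
    for t in 0..seq {
        for o in 0..c_out {
            for i in 0..c_in {
                y[t * c_out + o] += w[o * c_in + i] * x[t * c_in + i];
            }
        }
    }
    y
}

/// 1x1 convolution over a channel-major input x[c][s] (a [C_in, 1, S] tensor)
/// with kernel w[o][i] (a [C_out, C_in, 1, 1] tensor). Output is y[o][s].
fn conv1x1(w: &[f32], x: &[f32], c_in: usize, c_out: usize, seq: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; c_out * seq];
    for o in 0..c_out {
        for s in 0..seq {
            for i in 0..c_in {
                y[o * seq + s] += w[o * c_in + i] * x[i * seq + s];
            }
        }
    }
    y
}

/// Transpose a row-major [rows, cols] matrix into [cols, rows].
fn transpose(m: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; m.len()];
    for r in 0..rows {
        for c in 0..cols {
            out[c * rows + r] = m[r * cols + c];
        }
    }
    out
}
```

Both functions use the same weights. Only the activation layout differs, so transposing the input and output makes the results identical.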
## What This Is Not
- Not faster than llama.cpp (yet). Decode is close to theirs (32 vs. 34.8 tok/s on Q8_0); prefill is not.
- Not production-ready. Private API usage means it breaks with macOS updates.
- Not a general inference framework. Built specifically for Qwen3.5 DeltaNet hybrid architecture.
---
## Performance — ANE vs Metal GPU vs CPU on Apple Silicon
**Qwen3.5-2B Q8_0 on Apple M5 (same chip as llama.cpp benchmarks)**
| Backend | Speed | Power | Notes |
|---------|-------|-------|-------|
| Metal GPU Q8 decode | **32 tok/s** | ~15W | Close to llama.cpp (34.8 tok/s) |
| Metal GPU Q4 decode | **42 tok/s** | ~15W | Q6K dequant WIP |
| CPU Q8 decode | 23 tok/s | ~5W | Rayon + NEON |
| ANE prefill pp16 | 33 tok/s | ~3W | Fused FFN mega-kernel |
| ANE fused FFN | **3.6 TFLOPS** | ~3W | 3x single-op throughput |
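
The speed and power columns can be combined into per-token latency and energy. The helpers below are a small arithmetic sketch with hypothetical names. Since the power figures in the table are approximate, the results are rough estimates: about 0.47 J per token for Metal GPU Q8 decode (15 W at 32 tok/s) and about 0.22 J per token for CPU Q8 decode (5 W at 23 tok/s).

```rust
/// Energy per generated token in joules: power (W) divided by throughput (tok/s).
fn joules_per_token(watts: f64, tok_per_s: f64) -> f64 {
    watts / tok_per_s
}

/// Latency per token in milliseconds for a given throughput.
fn ms_per_token(tok_per_s: f64) -> f64 {
    1000.0 / tok_per_s
}
```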
## Apple Neural Engine (ANE) Reverse Engineering — Private API Discoveries
We went deeper than anyone into Apple's private Neural Engine framework. Key discoveries:
### What We Cracked
| Discovery | Impact |
|-----------|--------|
| `doEvaluateDirectWithModel:` | Bypasses ANE daemon, 10% faster eval |
| Multi-procedure MIL models | N functions in one compiled program, dispatch by `procedureIndex` |
| `prepareChainingWithModel:` **succeeds** | First public success; error 15 was caused by using the wrong `_ANEIOSurfaceOutputSets` factory method |
| `_ANEIOSurfaceOutputSets.objectWithstatsSurRef:outputBuffer:` | The correct factory method (not `outputSetsWithBuffers:`) |
| CoreML MLProgram → `MLProgramEngine` → `MLNeuralNetworkEngine` | Confirmed ANE enabled (`isANEPathForbidden=NO`, `modelIsMIL=YES`) |
| Espresso C++ runtime path | CoreML uses Espresso internally, no `_ANEModel` exposed |
| H11ANE IOKit user client type=1,4 | Direct kernel driver access via `IOServiceOpen` |
| `_ANEDaemonConnection` XPC surface | 19 methods including chaining, RT, telemetry |
### ANE Chaining — The Breakthrough
After 7 probe iterations across two sessions, we discovered that `ANEProgramChainingPrepare()` error 15 was **not a firmware limitation** — it was caused by using the wrong `_ANEIOSurfaceOutputSets` factory method.
```
Before: outputSetsWithBuffers:@[buf_out] → error 15
After: objectWithstatsSurRef:ioStats outputBuffer:@[buf_out] → SUCCESS
```
Both `prepareChainingWithModel:` (daemon) and `doPrepareChainingWithModel:` (direct) succeed. `buffersReady` remains blocked — the next frontier.
### Fused Mega-Kernels
Instead of dispatching one ANE kernel per linear projection (1.1 TFLOPS per op), we fuse multiple operations into single MIL programs:
- **Fused FFN**: gate_proj conv → sigmoid → mul → up_proj conv → mul → down_proj conv = **8 ops, ONE dispatch, 3.6 TFLOPS**
- **Fused QKV**: 3 parallel convolutions from same input = 1 dispatch
- **Fused dual projection**: gate + ssm_out in one program
The ANE compiler automatically spills weight blobs larger than the 32 MB SRAM to DRAM, so no manual tiling is needed.
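
For checking a fused FFN kernel against known-good output, a scalar CPU reference of the same op chain (gate projection, SiLU as `x * sigmoid(x)`, multiply by the up projection, down projection) is useful. The sketch below uses hypothetical function names and plain row-major `f32` weights. It is not the project's MIL program, and its FLOP count ignores the elementwise ops.

```rust
/// Row-major matrix-vector product: y[r] = sum_c w[r][c] * x[c].
fn matvec(w: &[f32], x: &[f32], rows: usize, cols: usize) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(x.len(), cols);
    (0..rows)
        .map(|r| {
            w[r * cols..(r + 1) * cols]
                .iter()
                .zip(x)
                .map(|(a, b)| a * b)
                .sum::<f32>()
        })
        .collect()
}

/// SiLU activation, written as x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU FFN for one token: down(silu(gate(x)) * up(x)).
fn swiglu_ffn(
    x: &[f32],
    w_gate: &[f32],
    w_up: &[f32],
    w_down: &[f32],
    dim: usize,
    hidden: usize,
) -> Vec<f32> {
    let gate = matvec(w_gate, x, hidden, dim);
    let up = matvec(w_up, x, hidden, dim);
    let act: Vec<f32> = gate.iter().zip(&up).map(|(g, u)| silu(*g) * u).collect();
    matvec(w_down, &act, dim, hidden)
}

/// FLOPs of the three projections for one token (2 FLOPs per weight).
fn ffn_flops_per_token(dim: usize, hidden: usize) -> usize {
    3 * 2 * dim * hidden
}
```

With the dimensions listed for Qwen3.5-2B (dim=2048, hidden=6144), `ffn_flops_per_token` gives 75,497,472 FLOPs per token per layer.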
---
## Metal GPU Compute Shaders for LLM Decode on Apple Silicon
13 custom Metal compute shaders encode the entire DeltaNet + FullAttention forward pass into **one command buffer per token**:
| Shader | Purpose |
|--------|---------|
| `q8_gemv` | Q8_0 GEMV (NR0=2, NQ=8, 4 simdgroups, simd_sum) |
| `q4_gemv` | Q4_0 GEMV (same pattern, nibble unpacking) |
| `deltanet_recurrence` | Full per-head state update (decay/recall/delta/update/query) |
| `conv1d_silu` | Shift + apply + SiLU activation |
| `compute_beta_decay` | Gate `sigmoid(beta)` and decay `exp(a*softplus(alpha+bias))` |
| `sdpa_causal` | Flash Attention decode (single-pass online softmax) |
| `rope_apply` | Rotary position embeddings |
| `rmsnorm_simple` | 128-thread reduction RMSNorm |
| `rmsnorm_gated` | Per-head RMSNorm with SiLU gate |
| `sigmoid_gate` | Output gating |
| `q_gate_split` | Deinterleave packed Q+gate projection |
| `residual_add` | Element-wise residual connection |
| `silu_mul` | Fused SiLU(gate) * up |
**Zero per-token Metal buffer allocations.** All params pre-allocated at model load.
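
As a reference point for the `q8_gemv` shader, here is a scalar sketch of a Q8_0-style GEMV: weights are stored in blocks of 32 signed 8-bit values with one scale per block, and each block contributes `scale * dot(q, x)`. Two things are assumptions for illustration: the type and function names, and the `f32` scale (GGUF stores it as `f16`). The sketch does not reproduce the shader's NR0/simdgroup tiling.

```rust
const QK8_0: usize = 32; // values per quantization block

/// One Q8_0-style block: 32 int8 values sharing a single scale.
struct BlockQ8 {
    scale: f32, // f16 in the on-disk format; f32 here to stay dependency-free
    qs: [i8; QK8_0],
}

/// Quantize one weight row into blocks using the max absolute value per block.
fn quantize_row_q8(x: &[f32]) -> Vec<BlockQ8> {
    assert!(x.len() % QK8_0 == 0);
    x.chunks_exact(QK8_0)
        .map(|chunk| {
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            let scale = amax / 127.0;
            let inv = if scale > 0.0 { 1.0 / scale } else { 0.0 };
            let mut qs = [0i8; QK8_0];
            for (q, v) in qs.iter_mut().zip(chunk) {
                *q = (v * inv).round() as i8;
            }
            BlockQ8 { scale, qs }
        })
        .collect()
}

/// y[r] = sum over blocks of scale * dot(int8 weights, f32 activations).
fn q8_gemv(rows: &[Vec<BlockQ8>], x: &[f32]) -> Vec<f32> {
    rows.iter()
        .map(|row| {
            row.iter()
                .zip(x.chunks_exact(QK8_0))
                .map(|(blk, xs)| {
                    let dot: f32 = blk.qs.iter().zip(xs).map(|(q, v)| *q as f32 * v).sum();
                    blk.scale * dot
                })
                .sum::<f32>()
        })
        .collect()
}
```

Applying the scale once per block rather than once per element is the key saving. The inner loop is an integer-to-float multiply-accumulate over 32 values.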
### The GPU Performance Journey
| Optimization | Speed | Gain |
|---|---|---|
| Starting point (params buffer corruption) | 0.1 tok/s | — |
| Fix shared params buffer | 3.5 tok/s | 35x |
| Single command buffer per token | 5.0 tok/s | 1.4x |
| llama.cpp-style Q8 GEMV shader | 32.6 tok/s | 6.5x |
| NR0=2 threadgroup dispatch fix | 34.7 tok/s | 1.06x |
| FullAttention layers on GPU | 30.0 tok/s | (added 6 layers) |
| Flash SDPA (single-pass softmax) | 42.3 tok/s | +10% |
| **Total improvement** | **0.1 → 42 tok/s** | **420x** |
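
The "single-pass softmax" in the Flash SDPA row refers to the online softmax trick: keep a running maximum, a running denominator, and a running weighted sum, and rescale the latter two whenever the maximum grows. The CPU sketch below shows the technique for one query and compares it against a conventional two-pass version. The function names are hypothetical and this is not the `sdpa_causal` shader itself.

```rust
/// Single-pass attention output for one query: softmax(scores) weighted sum of values.
fn attend_online(scores: &[f32], values: &[Vec<f32>]) -> Vec<f32> {
    assert_eq!(scores.len(), values.len());
    assert!(!scores.is_empty());
    let dim = values[0].len();
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; dim];
    for (&s, v) in scores.iter().zip(values) {
        let new_max = running_max.max(s);
        // Rescale everything accumulated so far to the new maximum.
        let rescale = (running_max - new_max).exp();
        let p = (s - new_max).exp();
        denom = denom * rescale + p;
        for (a, x) in acc.iter_mut().zip(v) {
            *a = *a * rescale + p * x;
        }
        running_max = new_max;
    }
    acc.iter().map(|a| a / denom).collect()
}

/// Two-pass reference: find the max first, then normalize and accumulate.
fn attend_two_pass(scores: &[f32], values: &[Vec<f32>]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let denom: f32 = weights.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (w, v) in weights.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += w * x / denom;
        }
    }
    out
}
```

The online version reads each score and value row exactly once, which is what lets a decode kernel stream over the KV cache without a separate max-reduction pass.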
---
## Architecture — Hybrid ANE + Metal GPU + CPU Pipeline
```
           ┌─────────────┐
           │ GGUF Model  │
           │  (Q8/Q4_0)  │
           └──────┬──────┘
                  │
     ┌────────────┼────────────┐
     │            │            │
┌────▼────┐   ┌───▼───┐   ┌────▼────┐
│   ANE   │   │  CPU  │   │  Metal  │
│ Prefill │   │ NEON  │   │   GPU   │
│ 33 tk/s │   │ 23t/s │   │  32t/s  │
└─────────┘   └───────┘   └─────────┘
     │            │            │
     │     ┌──────┴──────┐     │
     │     │  DeltaNet   │     │
     │     │ Recurrence  │     │
     │     │ (sequential)│     │
     │     └─────────────┘     │
     │                         │
     └────────┬────────────────┘
              │
         ┌────▼─────┐
         │ Tokenizer│
         │  (BPE)   │
         └──────────┘
```
### Qwen3.5-2B Hybrid Architecture
- **24 layers**: 18 DeltaNet (linear attention + SSM recurrence) + 6 Full Attention (GQA)
- **DeltaNet**: O(1) per token, 128-dim recurrent state, conv1d with kernel=4
- **Full Attention**: 8 Q heads, 2 KV heads, head_dim=256, partial RoPE
- **FFN**: SwiGLU, dim=2048 → hidden=6144
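
The "O(1) per token" property of the DeltaNet layers comes from a fixed-size recurrent state that is updated in place for each token. The sketch below follows the five steps named for the `deltanet_recurrence` shader (decay, recall, delta, update, query) using the gated delta rule `S ← αS + β(v − αSk)kᵀ`. The exact formulation, the struct, and the parameter names are assumptions for illustration, not code from this repository.

```rust
/// Recurrent state for one DeltaNet head: a d_v x d_k matrix, row-major.
struct DeltaState {
    d_k: usize,
    d_v: usize,
    s: Vec<f32>,
}

impl DeltaState {
    fn new(d_k: usize, d_v: usize) -> Self {
        DeltaState { d_k, d_v, s: vec![0.0; d_k * d_v] }
    }

    /// Process one token. `alpha` is the decay factor, `beta` the write strength.
    /// The cost is O(d_k * d_v), independent of how many tokens came before.
    fn step(&mut self, q: &[f32], k: &[f32], v: &[f32], alpha: f32, beta: f32) -> Vec<f32> {
        let (d_k, d_v) = (self.d_k, self.d_v);
        assert!(q.len() == d_k && k.len() == d_k && v.len() == d_v);
        let mut out = vec![0.0f32; d_v];
        for i in 0..d_v {
            let row = &mut self.s[i * d_k..(i + 1) * d_k];
            // 1. Decay the old state.
            for x in row.iter_mut() {
                *x *= alpha;
            }
            // 2. Recall what the state currently stores for key k.
            let recall: f32 = row.iter().zip(k).map(|(s, kj)| s * kj).sum();
            // 3. Delta between the new value and the recalled one.
            let delta = beta * (v[i] - recall);
            // 4. Rank-1 update of the state.
            for (s, kj) in row.iter_mut().zip(k) {
                *s += delta * kj;
            }
            // 5. Query the updated state.
            out[i] = row.iter().zip(q).map(|(s, qj)| s * qj).sum();
        }
        out
    }
}
```

With a unit-norm key and `beta = 1`, writing a value and then querying with the same key returns that value exactly. A smaller `beta` blends the new value with what was stored before.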
---
## Building
```bash
# Prerequisites: Rust, Xcode Command Line Tools
git clone https://github.com/thebasedcapital/ane-infer
cd ane-infer
# Compile Metal shaders
cd crates/engine/metal
xcrun -sdk macosx metal -c q8_gemv.metal -o q8_gemv.air
xcrun -sdk macosx metal -c deltanet.metal -o deltanet.air
xcrun -sdk macosx metal -c attention.metal -o attention.air
xcrun -sdk macosx metal -c q4_gemv.metal -o q4_gemv.air
xcrun -sdk macosx metallib q8_gemv.air deltanet.air attention.air q4_gemv.air -o q8_gemv.metallib
cd ../../..
# Build
cargo build --release
# Download model (Q8_0)
# Place at ~/models/Qwen3.5-2B-Q8_0.gguf
```
## Usage
```bash
# Generate text
ane-infer generate -m model.gguf -p "The capital of France is" --max-tokens 256 --temp 0.7
# Full benchmark suite
ane-infer bench -m model.gguf --prompt-tokens 128 --gen-tokens 32
# Test ANE hardware
ane-infer test-ane
# ANE throughput benchmark
ane-infer bench-ane
# Model info
ane-infer info -m model.gguf
```
---
## File Structure
```
crates/
├── ane-bridge/                  # ANE private framework FFI
│   ├── objc/
│   │   ├── ane_runtime.m        # _ANEClient, compile/eval/free lifecycle
│   │   ├── ane_runtime.h        # C ABI for Rust FFI
│   │   ├── coreml_probe.m       # CoreML MLProgram reverse engineering
│   │   ├── chaining_e2e.m       # ANE chaining end-to-end test
│   │   ├── iokit_probe.m        # IOKit H11ANE direct access
│   │   └── test_fused_ffn.m     # Fused FFN mega-kernel test
│   └── src/lib.rs               # Safe Rust wrappers (AneKernel, weight blobs)
├── mil-gen/                     # MIL program text generation
│   └── src/
│       ├── lib.rs               # MIL header/footer, conv op helper
│       ├── mega.rs              # Fused FFN, dual/triple projections
│       ├── attention.rs         # QKV, output projection
│       └── ffn.rs               # FFN up/down projections
├── engine/                      # Core inference engine
│   ├── metal/
│   │   ├── q8_gemv.metal        # Q8_0 GEMV + SiLU (optimized)
│   │   ├── q4_gemv.metal        # Q4_0 GEMV (tiled + simple)
│   │   ├── deltanet.metal       # DeltaNet recurrence shaders (9 kernels)
│   │   └── attention.metal      # RoPE, SDPA, gating (4 kernels)
│   └── src/
│       ├── metal_graph.rs       # GpuContext, GpuGraph, all pipeline states
│       ├── gpu_full_decode.rs   # Full-GPU token decode (ONE cmd buffer)
│       ├── gpu_decode.rs        # GPU weight upload, GpuBuffer types
│       ├── ane_prefill.rs       # ANE batched prefill with mega-kernels
│       ├── deltanet.rs          # CPU DeltaNet recurrence (NEON)
│       ├── q8_gemv.rs           # CPU Q8/Q4 GEMV (rayon parallel)
│       ├── model.rs             # Model weight types, config
│       ├── tokenizer.rs         # GPT-2 BPE tokenizer
│       └── scratch.rs           # Pre-allocated scratch buffers
├── gguf/                        # GGUF file parser
│   └── src/
│       ├── parser.rs            # GGUF v2/v3 parsing
│       ├── to_ane.rs            # Tensor extraction helpers
│       └── dequant.rs           # Q4/Q8/Q6K dequantization
└── cli/                         # CLI binary
    └── src/main.rs              # Commands: generate, bench, test-ane, info
```
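
To give a feel for what the `gguf` crate has to handle first, here is a minimal sketch of reading the fixed GGUF header: the 4-byte magic `GGUF`, a `u32` version, then `u64` tensor and metadata key-value counts. It assumes a little-endian file and accepts versions 2 and 3 only. The struct and function names are hypothetical and are not taken from `parser.rs`.

```rust
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

/// Read a little-endian u32 at `off`; the caller guarantees bounds.
fn read_u32_le(bytes: &[u8], off: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&bytes[off..off + 4]);
    u32::from_le_bytes(b)
}

/// Read a little-endian u64 at `off`; the caller guarantees bounds.
fn read_u64_le(bytes: &[u8], off: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&bytes[off..off + 8]);
    u64::from_le_bytes(b)
}

/// Parse the 24-byte fixed header at the start of a GGUF v2/v3 file.
fn parse_gguf_header(bytes: &[u8]) -> Result<GgufHeader, String> {
    if bytes.len() < 24 {
        return Err("file too short for a GGUF header".to_string());
    }
    if &bytes[0..4] != b"GGUF" {
        return Err("bad magic, not a GGUF file".to_string());
    }
    let version = read_u32_le(bytes, 4);
    if version != 2 && version != 3 {
        return Err(format!("unsupported GGUF version {}", version));
    }
    Ok(GgufHeader {
        version,
        tensor_count: read_u64_le(bytes, 8),
        metadata_kv_count: read_u64_le(bytes, 16),
    })
}
```

The metadata key-value pairs and tensor descriptors follow this header. Parsing them is where a full parser does most of its work.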
---
## Limitations — Apple Neural Engine Private API Caveats
- **Private APIs**: Uses `_ANEClient`, `_ANEInMemoryModel`, etc. Will break on macOS updates.
- **Q6K dequant**: Partially broken — Q4 models with Q6K embeddings produce degraded output.
- **No speculative decoding**: Same-model speculation doesn't help (draft ~= verify speed). Needs separate tiny draft model.
- **Sequential recurrence**: During prefill the DeltaNet state update runs token by token, so an L-token prompt takes O(L) dependent steps. The chunked parallel algorithm (FLA) is not yet implemented.
- **FullAttention prefill**: Not yet batched on ANE — only DeltaNet layers use ANE prefill.
- **Single sequence**: No batched inference (batch_size=1 only).
## Acknowledgments
- [maderix/ANE](https://github.com/maderix/ANE) — The breakthrough project that reverse-engineered ANE training. We built on their `_ANEInMemoryModelDescriptor`, weight blob format, and MIL compilation pipeline.
- [hollance/neural-engine](https://github.com/hollance/neural-engine) — Comprehensive ANE documentation.
- [eiln/ane](https://github.com/eiln/ane) — Linux ANE driver reverse engineering.
- [llama.cpp](https://github.com/ggml-org/llama.cpp) — Metal Q8 GEMV shader patterns, GGUF format, performance reference.
- [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention) — Chunked parallel DeltaNet algorithm reference.
- [metalQwen3](https://github.com/BoltzmannEntropy/metalQwen3) — Metal GPU inference reference for Qwen.
## Disclaimer
This project uses Apple's **private, undocumented frameworks** (`AppleNeuralEngine.framework`). These APIs have no stability guarantee and may change or break with any macOS update. Use at your own risk. Not affiliated with Apple.
## License
MIT