{"id":48117252,"url":"https://github.com/thebasedcapital/ane-infer","last_synced_at":"2026-04-04T16:16:55.498Z","repository":{"id":342170438,"uuid":"1173106840","full_name":"thebasedcapital/ane-infer","owner":"thebasedcapital","description":"Apple Neural Engine (ANE) LLM inference engine — reverse-engineered private APIs, Metal GPU shaders, hybrid ANE+GPU+CPU on Apple Silicon. 32 tok/s matching llama.cpp, 3.6 TFLOPS fused ANE mega-kernels.","archived":false,"fork":false,"pushed_at":"2026-03-05T02:44:14.000Z","size":185,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-05T07:50:52.421Z","etag":null,"topics":["ane","apple-neural-engine","apple-silicon","deltanet","edge-ai","gguf","llm-inference","macos","metal-gpu","neural-engine","npu","on-device-ai","quantization","qwen","reverse-engineering","rust"],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/thebasedcapital.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-05T02:36:31.000Z","updated_at":"2026-03-05T03:28:36.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/thebasedcapital/ane-infer","commit_stats":null,"previous_names":["thebasedcapital/ane-infer"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/thebasedcapital/ane-infer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thebasedcapital%2Fane-infer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thebasedcapital%2Fane-infer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thebasedcapital%2Fane-infer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thebasedcapital%2Fane-infer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/thebasedcapital","download_url":"https://codeload.github.com/thebasedcapital/ane-infer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/thebasedcapital%2Fane-infer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31405700,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ane","apple-neural-engine","apple-silicon","deltanet","edge-ai","gguf","llm-inference","macos","metal-gpu","neural-engine","npu","on-device-ai","quantization","qwen","reverse-engineering","rust"],"created_at":"2026-04-04T16:16:54.561Z","updated_at":"2026-04-04T16:16:55.486Z","avatar_url":"https://github.com/thebasedcapital.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ane-infer\n\n\u003e **Apple Neural Engine (ANE) LLM Inference Engine** — reverse-engineered private APIs, Metal GPU compute shaders, hybrid ANE+GPU+CPU on Apple Silicon M1/M2/M3/M4/M5\n\n**Hybrid ANE+Metal+CPU inference engine for LLMs on Apple Silicon.**\n\nFirst implementation of Qwen3.5 (Gated DeltaNet + GQA) running natively on Apple Neural Engine via reverse-engineered private APIs. 32 tok/s Metal GPU decode matching llama.cpp, 3.6 TFLOPS fused ANE mega-kernels, built from scratch in Rust + Obj-C + Metal.\n\nBuilt on the shoulders of [maderix/ANE](https://github.com/maderix/ANE) — the project that cracked open ANE training. We took it further into inference with DeltaNet, Metal GPU shaders, and a complete decode pipeline.\n\n### Keywords\n`apple-neural-engine` `ane` `apple-silicon` `metal-gpu` `llm-inference` `on-device-ai` `neural-engine` `m1` `m2` `m3` `m4` `m5` `private-api` `reverse-engineering` `coreml` `gguf` `quantization` `q4` `q8` `deltanet` `qwen` `rust` `metal-shaders` `npu` `mlx-alternative` `llama-cpp-alternative` `macos` `ios` `edge-ai` `low-power-inference`\n\n---\n\n## What This Is — Apple Neural Engine LLM Inference\n\nA from-scratch LLM inference engine that runs Qwen3.5-2B on three Apple Silicon accelerators simultaneously:\n\n- **Apple Neural Engine (ANE)** — batched prefill via 1x1 convolutions through private `_ANEClient` APIs\n- **Metal GPU** — single-token decode with 13 custom compute shaders, ONE command buffer per token\n- **CPU (NEON/AMX)** — parallel Q8_0 GEMV via rayon, Accelerate BLAS fallback\n\nNo CoreML. No Python. No MLX. Just system frameworks + `objc_msgSend`.\n\n## What This Is Not\n\n- Not faster than llama.cpp (yet). We match their decode speed, not their prefill.\n- Not production-ready. Private API usage means it breaks with macOS updates.\n- Not a general inference framework. Built specifically for Qwen3.5 DeltaNet hybrid architecture.\n\n---\n\n## Performance — ANE vs Metal GPU vs CPU on Apple Silicon\n\n**Qwen3.5-2B Q8_0 on Apple M5 (same chip as llama.cpp benchmarks)**\n\n| Backend | Speed | Power | Notes |\n|---------|-------|-------|-------|\n| Metal GPU Q8 decode | **32 tok/s** | ~15W | Matches llama.cpp (34.8) |\n| Metal GPU Q4 decode | **42 tok/s** | ~15W | Q6K dequant WIP |\n| CPU Q8 decode | 23 tok/s | ~5W | Rayon + NEON |\n| ANE prefill pp16 | 33 tok/s | ~3W | Fused FFN mega-kernel |\n| ANE fused FFN | **3.6 TFLOPS** | ~3W | 3x single-op throughput |\n\n## Apple Neural Engine (ANE) Reverse Engineering — Private API Discoveries\n\nWe went deeper than anyone into Apple's private Neural Engine framework. Key discoveries:\n\n### What We Cracked\n\n| Discovery | Impact |\n|-----------|--------|\n| `doEvaluateDirectWithModel:` | Bypasses ANE daemon, 10% faster eval |\n| Multi-procedure MIL models | N functions in one compiled program, dispatch by `procedureIndex` |\n| `prepareChainingWithModel:` **succeeds** | First public success — error 15 was wrong `_ANEIOSurfaceOutputSets` API |\n| `_ANEIOSurfaceOutputSets.objectWithstatsSurRef:outputBuffer:` | The correct factory method (not `outputSetsWithBuffers:`) |\n| CoreML MLProgram → `MLProgramEngine` → `MLNeuralNetworkEngine` | Confirmed ANE enabled (`isANEPathForbidden=NO`, `modelIsMIL=YES`) |\n| Espresso C++ runtime path | CoreML uses Espresso internally, no `_ANEModel` exposed |\n| H11ANE IOKit user client type=1,4 | Direct kernel driver access via `IOServiceOpen` |\n| `_ANEDaemonConnection` XPC surface | 19 methods including chaining, RT, telemetry |\n\n### ANE Chaining — The Breakthrough\n\nAfter 7 probe iterations across two sessions, we discovered that `ANEProgramChainingPrepare()` error 15 was **not a firmware limitation** — it was caused by using the wrong `_ANEIOSurfaceOutputSets` factory method.\n\n```\nBefore: outputSetsWithBuffers:@[buf_out]  → error 15\nAfter:  objectWithstatsSurRef:ioStats outputBuffer:@[buf_out]  → SUCCESS\n```\n\nBoth `prepareChainingWithModel:` (daemon) and `doPrepareChainingWithModel:` (direct) succeed. `buffersReady` remains blocked — the next frontier.\n\n### Fused Mega-Kernels\n\nInstead of dispatching one ANE kernel per linear projection (1.1 TFLOPS per op), we fuse multiple operations into single MIL programs:\n\n- **Fused FFN**: gate_proj conv → sigmoid → mul → up_proj conv → mul → down_proj conv = **8 ops, ONE dispatch, 3.6 TFLOPS**\n- **Fused QKV**: 3 parallel convolutions from same input = 1 dispatch\n- **Fused dual projection**: gate + ssm_out in one program\n\nThe ANE compiler handles weight blobs \u003e32MB SRAM automatically via DRAM spilling — no manual tiling needed.\n\n---\n\n## Metal GPU Compute Shaders for LLM Decode on Apple Silicon\n\n13 custom Metal compute shaders encode the entire DeltaNet + FullAttention forward pass into **one command buffer per token**:\n\n| Shader | Purpose |\n|--------|---------|\n| `q8_gemv` | Q8_0 GEMV (NR0=2, NQ=8, 4 simdgroups, simd_sum) |\n| `q4_gemv` | Q4_0 GEMV (same pattern, nibble unpacking) |\n| `deltanet_recurrence` | Full per-head state update (decay/recall/delta/update/query) |\n| `conv1d_silu` | Shift + apply + SiLU activation |\n| `compute_beta_decay` | sigmoid(beta) + exp(a*softplus(alpha+bias)) |\n| `sdpa_causal` | Flash Attention decode (single-pass online softmax) |\n| `rope_apply` | Rotary position embeddings |\n| `rmsnorm_simple` | 128-thread reduction RMSNorm |\n| `rmsnorm_gated` | Per-head RMSNorm with SiLU gate |\n| `sigmoid_gate` | Output gating |\n| `q_gate_split` | Deinterleave packed Q+gate projection |\n| `residual_add` | Element-wise residual connection |\n| `silu_mul` | Fused SiLU(gate) * up |\n\n**Zero per-token Metal buffer allocations.** All params pre-allocated at model load.\n\n### The GPU Performance Journey\n\n| Optimization | Speed | Gain |\n|---|---|---|\n| Starting point (params buffer corruption) | 0.1 tok/s | — |\n| Fix shared params buffer | 3.5 tok/s | 35x |\n| Single command buffer per token | 5.0 tok/s | 1.4x |\n| llama.cpp-style Q8 GEMV shader | 32.6 tok/s | 6.5x |\n| NR0=2 threadgroup dispatch fix | 34.7 tok/s | 1.06x |\n| FullAttention layers on GPU | 30.0 tok/s | (added 6 layers) |\n| Flash SDPA (single-pass softmax) | 42.3 tok/s | +10% |\n| **Total improvement** | **0.1 → 42 tok/s** | **420x** |\n\n---\n\n## Architecture — Hybrid ANE + Metal GPU + CPU Pipeline\n\n```\n                    ┌─────────────┐\n                    │  GGUF Model │\n                    │  (Q8/Q4_0)  │\n                    └──────┬──────┘\n                           │\n              ┌────────────┼────────────┐\n              │            │            │\n         ┌────▼────┐  ┌───▼───┐  ┌────▼────┐\n         │   ANE   │  │  CPU  │  │  Metal  │\n         │ Prefill │  │ NEON  │  │   GPU   │\n         │ 33 tk/s │  │ 23t/s │  │  32t/s  │\n         └─────────┘  └───────┘  └─────────┘\n              │            │            │\n              │     ┌──────┴──────┐     │\n              │     │ DeltaNet    │     │\n              │     │ Recurrence  │     │\n              │     │ (sequential)│     │\n              │     └─────────────┘     │\n              │                         │\n              └────────┬────────────────┘\n                       │\n                  ┌────▼────┐\n                  │ Tokenizer│\n                  │ (BPE)    │\n                  └──────────┘\n```\n\n### Qwen3.5-2B Hybrid Architecture\n- **24 layers**: 18 DeltaNet (linear attention + SSM recurrence) + 6 Full Attention (GQA)\n- **DeltaNet**: O(1) per token, 128-dim recurrent state, conv1d with kernel=4\n- **Full Attention**: 8 Q heads, 2 KV heads, head_dim=256, partial RoPE\n- **FFN**: SwiGLU, dim=2048 → hidden=6144\n\n---\n\n## Building\n\n```bash\n# Prerequisites: Rust, Xcode Command Line Tools\ngit clone https://github.com/youruser/ane-infer\ncd ane-infer\n\n# Compile Metal shaders\ncd crates/engine/metal\nxcrun -sdk macosx metal -c q8_gemv.metal -o q8_gemv.air\nxcrun -sdk macosx metal -c deltanet.metal -o deltanet.air\nxcrun -sdk macosx metal -c attention.metal -o attention.air\nxcrun -sdk macosx metal -c q4_gemv.metal -o q4_gemv.air\nxcrun -sdk macosx metallib q8_gemv.air deltanet.air attention.air q4_gemv.air -o q8_gemv.metallib\ncd ../../..\n\n# Build\ncargo build --release\n\n# Download model (Q8_0)\n# Place at ~/models/Qwen3.5-2B-Q8_0.gguf\n```\n\n## Usage\n\n```bash\n# Generate text\nane-infer generate -m model.gguf -p \"The capital of France is\" --max-tokens 256 --temp 0.7\n\n# Full benchmark suite\nane-infer bench -m model.gguf --prompt-tokens 128 --gen-tokens 32\n\n# Test ANE hardware\nane-infer test-ane\n\n# ANE throughput benchmark\nane-infer bench-ane\n\n# Model info\nane-infer info -m model.gguf\n```\n\n---\n\n## File Structure\n\n```\ncrates/\n├── ane-bridge/           # ANE private framework FFI\n│   ├── objc/\n│   │   ├── ane_runtime.m        # _ANEClient, compile/eval/free lifecycle\n│   │   ├── ane_runtime.h        # C ABI for Rust FFI\n│   │   ├── coreml_probe.m       # CoreML MLProgram reverse engineering\n│   │   ├── chaining_e2e.m       # ANE chaining end-to-end test\n│   │   ├── iokit_probe.m        # IOKit H11ANE direct access\n│   │   └── test_fused_ffn.m     # Fused FFN mega-kernel test\n│   └── src/lib.rs               # Safe Rust wrappers (AneKernel, weight blobs)\n├── mil-gen/              # MIL program text generation\n│   └── src/\n│       ├── lib.rs               # MIL header/footer, conv op helper\n│       ├── mega.rs              # Fused FFN, dual/triple projections\n│       ├── attention.rs         # QKV, output projection\n│       └── ffn.rs               # FFN up/down projections\n├── engine/               # Core inference engine\n│   ├── metal/\n│   │   ├── q8_gemv.metal        # Q8_0 GEMV + SiLU (optimized)\n│   │   ├── q4_gemv.metal        # Q4_0 GEMV (tiled + simple)\n│   │   ├── deltanet.metal       # DeltaNet recurrence shaders (9 kernels)\n│   │   └── attention.metal      # RoPE, SDPA, gating (4 kernels)\n│   └── src/\n│       ├── metal_graph.rs       # GpuContext, GpuGraph, all pipeline states\n│       ├── gpu_full_decode.rs   # Full-GPU token decode (ONE cmd buffer)\n│       ├── gpu_decode.rs        # GPU weight upload, GpuBuffer types\n│       ├── ane_prefill.rs       # ANE batched prefill with mega-kernels\n│       ├── deltanet.rs          # CPU DeltaNet recurrence (NEON)\n│       ├── q8_gemv.rs           # CPU Q8/Q4 GEMV (rayon parallel)\n│       ├── model.rs             # Model weight types, config\n│       ├── tokenizer.rs         # GPT-2 BPE tokenizer\n│       └── scratch.rs           # Pre-allocated scratch buffers\n├── gguf/                 # GGUF file parser\n│   └── src/\n│       ├── parser.rs            # GGUF v2/v3 parsing\n│       ├── to_ane.rs            # Tensor extraction helpers\n│       └── dequant.rs           # Q4/Q8/Q6K dequantization\n└── cli/                  # CLI binary\n    └── src/main.rs              # Commands: generate, bench, test-ane, info\n```\n\n---\n\n## Limitations — Apple Neural Engine Private API Caveats\n\n- **Private APIs**: Uses `_ANEClient`, `_ANEInMemoryModel`, etc. Will break on macOS updates.\n- **Q6K dequant**: Partially broken — Q4 models with Q6K embeddings produce degraded output.\n- **No speculative decoding**: Same-model speculation doesn't help (draft ~= verify speed). Needs separate tiny draft model.\n- **Sequential recurrence**: DeltaNet state update is O(L) per token for prefill. Chunked parallel algorithm (FLA) not yet implemented.\n- **FullAttention prefill**: Not yet batched on ANE — only DeltaNet layers use ANE prefill.\n- **Single sequence**: No batched inference (batch_size=1 only).\n\n## Acknowledgments\n\n- [maderix/ANE](https://github.com/maderix/ANE) — The breakthrough project that reverse-engineered ANE training. We built on their `_ANEInMemoryModelDescriptor`, weight blob format, and MIL compilation pipeline.\n- [hollance/neural-engine](https://github.com/hollance/neural-engine) — Comprehensive ANE documentation.\n- [eiln/ane](https://github.com/eiln/ane) — Linux ANE driver reverse engineering.\n- [llama.cpp](https://github.com/ggml-org/llama.cpp) — Metal Q8 GEMV shader patterns, GGUF format, performance reference.\n- [Flash Linear Attention](https://github.com/fla-org/flash-linear-attention) — Chunked parallel DeltaNet algorithm reference.\n- [metalQwen3](https://github.com/BoltzmannEntropy/metalQwen3) — Metal GPU inference reference for Qwen.\n\n## Disclaimer\n\nThis project uses Apple's **private, undocumented frameworks** (`AppleNeuralEngine.framework`). These APIs have no stability guarantee and may change or break with any macOS update. Use at your own risk. Not affiliated with Apple.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthebasedcapital%2Fane-infer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthebasedcapital%2Fane-infer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthebasedcapital%2Fane-infer/lists"}