An open API service indexing awesome lists of open source software.

https://github.com/christopherkarani/Espresso

Train and run transformers directly on Apple's Neural Engine — 4.76x faster than CoreML.
https://github.com/christopherkarani/Espresso

ane apple apple-neural-engine neural-engine swift

Last synced: about 2 months ago
JSON representation

Train and run transformers directly on Apple's Neural Engine — 4.76x faster than CoreML.

Awesome Lists containing this project

README

          


Espresso


Direct Neural Engine inference for transformers on Apple Silicon — 4.76x faster than CoreML.


CI
ANE Matrix
Swift 6.2
License: MIT
macOS 15+
Zero Dependencies
Latest Release

---

Espresso compiles MIL programs straight to ANE silicon through reverse-engineered private APIs (`_ANEClient`, `_ANEInMemoryModel`). No CoreML. No per-token recompilation. Just IOSurface buffers, fused multi-layer kernels, and two verified tokens per decode step.

- **4.76x faster decode** — 1.08 ms/token vs CoreML's 5.09 ms/token on the same 6-layer model
- **Fused 3-layer kernels** — 6 transformer layers in 2 ANE dispatches, not 6
- **Zero-copy I/O** — NEON-vectorized reads, vDSP argmax, no marshaling
- **Full training on ANE** — forward + backward passes with gradient accumulation and Adam
- **Pure Swift 6.2** — `~Copyable` move-only tensors, strict concurrency, typed throws, zero dependencies


Espresso generating tokens on ANE

## Quick Start

```bash
git clone https://github.com/christopherkarani/Espresso.git
cd Espresso
./espresso # builds, downloads demo weights, launches TUI
```

Five lines to first ANE inference in your own project:

```swift
// Package.swift — add the dependency
.package(url: "https://github.com/christopherkarani/Espresso.git", from: "1.0.0")

import ANERuntime

let kernel = try ANEKernel(milText: myMIL, weights: blobs, inputSizes: [input], outputSizes: [output])
try kernel.eval() // runs on Neural Engine
let result = kernel.outputSurface(at: 0) // zero-copy read
```

Other entry points:

```bash
./espresso "Hello" # generate text
./espresso doctor # check host readiness
./espresso compare --no-power "Hello" # side-by-side vs CoreML
./espresso install # install to ~/.local/bin
swift run espresso-bench --ane-only --inference --layers 6
swift run espc pack-native /path/to/model /tmp/model.esp --overwrite
swift run esprun inspect /tmp/model.esp
swift run esprun generate /tmp/model.esp "Hello" 32
```

## ESP Model Platform

Espresso now ships a private-only model platform around portable `.esp` bundles and bundle-aware runtime selection.

- `.esp` is the canonical portable model bundle
- `.espc` is the derived compiled-cache layer
- `espc` packs native model directories into `.esp`
- `esprun` inspects, resolves, and runs bundle artifacts
- `espresso-generate --bundle ` runs the same bundle boundary used by the runtime

Current public docs for this layer:

- [Convert / Optimize / Native-Fast strategy](docs/platform/2026-03-26-convert-optimize-native-fast-plan.md)
- [Stories Convert -> Optimize execution plan](docs/platform/2026-03-26-stories-convert-optimize-execution-plan.md)
- [Stories agent prompt](docs/platform/2026-03-26-stories-convert-optimize-agent-prompt.md)

## Benchmark

### Espresso vs CoreML vs llama.cpp

| Backend | ms/token | tok/s | Notes |
|---------|----------|-------|-------|
| **Espresso ANE** (exact two-step) | **1.08** | **926** | Direct ANE, 2 dispatches / 6 layers |
| CoreML `.cpuAndNeuralEngine` | 5.09 | 196 | Apple's standard ANE path |
| llama.cpp Metal | ~12–20 | ~50–85 | GPU path, CPU-bound decode¹ |
| llama.cpp CPU (`ggml`) | ~25–40 | ~25–40 | Pure CPU, no ANE¹ |
| **Espresso speedup vs CoreML** | | **4.76x** | |
| **Espresso speedup vs llama.cpp Metal** | | **~11x** | |

> ¹ llama.cpp has no ANE backend. Metal figures are representative for GPT-2 117M on M3 Max; actual performance varies by quantization and prompt length.
> All Espresso / CoreML numbers: 6-layer local artifact · dim=768 · 12 heads · 32k vocab · seqLen=256 · M3 Max · macOS 15.

Reproduce Espresso benchmarks

```bash
RESULTS_DIR=results/$(date +%Y%m%d-%H%M%S) \
REPEATS=5 WARMUP=3 ITERATIONS=20 \
./scripts/reproduce_local_real_artifact_claim.sh
```

Machine-readable output lands in `artifacts/benchmarks/` and is kept out of git.

### Platform Compatibility

| SoC | Neural Engine | Tested | Notes |
|-----|---------------|--------|-------|
| M1 / M1 Pro / M1 Max / M1 Ultra | 16-core ANE | ✅ | Full feature set |
| M2 / M2 Pro / M2 Max / M2 Ultra | 16-core ANE | ✅ | Full feature set |
| M3 / M3 Pro / M3 Max | 18-core ANE | ✅ | Reference hardware (M3 Max) |
| M4 / M4 Pro / M4 Max | 38-core ANE | ✅ | Faster compile cache warm-up |
| Intel Mac | — | ❌ | No Neural Engine |
| Apple A-series (iOS) | ✅ | ⚠️ | Requires entitlement; not App Store safe |

macOS 15+ required. iOS / tvOS not supported out of the box (private API entitlements differ per platform).

## How It Works

```
┌─────────────────────┐
│ MIL Program Text │ Generated per-kernel
└──────────┬──────────┘

┌─────────────────────┐
│ _ANEClient compile │ Private API (dlopen)
└──────────┬──────────┘

┌─────────────────────┐
│ ANE E5 Binary │ Cached by system
└──────────┬──────────┘

┌────────────────┼────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ IOSurface │ │ IOSurface │ │ IOSurface │
│ (input) │ │ (weights) │ │ (output) │
└──────┬───────┘ └──────────────┘ └──────┬───────┘
│ ANE Hardware │
└──────────────eval───────────────┘
```

The decode loop compiles once and reuses the program across all steps. KV cache lives in IOSurface buffers — not marshaled through CoreML. Each step produces two exact tokens with verified parity. Fused triplet kernels process 3 layers per dispatch, reducing 6 layers to 2 eval calls.

## Architecture

```
ANEInterop (ObjC/C — private API bridge)
└── ANETypes (~Copyable value types, IOSurface I/O)
├── MILGenerator (28+ kernel variants)
│ └── ANERuntime (compile, eval, surface management)
│ └── Espresso (training, generation, decode)
│ ├── EspressoTrain (CLI)
│ └── EspressoBench (CLI)
└── CPUOps (Accelerate/vDSP kernels)
└── Espresso
```

| Module | What it does |
|--------|-------------|
| **ANEInterop** | `dlopen` bridge to `_ANEClient` and `_ANEInMemoryModel`. NEON-vectorized I/O. |
| **ANETypes** | `~Copyable` tensors, `SurfaceIO`, weight serialization, model config. |
| **MILGenerator** | Generates MIL text for forward, backward, decode, and fused kernels. |
| **CPUOps** | RMSNorm, RoPE, embedding, softmax, Adam via Accelerate/vDSP. |
| **ANERuntime** | Compiles MIL to ANE E5 binaries. Manages IOSurface buffers and compile budget. |
| **Espresso** | Transformer layers, generation harnesses, exact two-token decode, training loop. |

## SPM Integration

```swift
// Package.swift
dependencies: [
.package(url: "https://github.com/christopherkarani/Espresso.git", from: "1.0.0")
],
targets: [
.target(name: "MyApp", dependencies: [
.product(name: "ANERuntime", package: "Espresso"),
.product(name: "ANETypes", package: "Espresso"),
])
]
```

```swift
import ANERuntime
import ANETypes

// 1. Define your kernel shape
let gen = MyMILGenerator(config: .init(dim: 768, heads: 12))

// 2. Compile once to ANE E5 binary
let kernel = try ANEKernel(
milText: gen.milText,
weights: gen.weightBlobs,
inputSizes: [gen.inputSize],
outputSizes: [gen.outputSize]
)

// 3. Run inference — stays on ANE the whole time
try kernel.eval()

// 4. Read results via zero-copy IOSurface
let output = kernel.outputSurface(at: 0)
```

## Requirements

| | Minimum |
|---|---|
| Hardware | Apple Silicon (M1+) with Neural Engine |
| macOS | 15.0+ |
| Swift | 6.0+ |
| Dependencies | None — only Apple system frameworks |

## Testing

```bash
swift test # unit tests (no ANE needed)
ANE_HARDWARE_TESTS=1 swift test --filter "ANERuntimeTests|EspressoTests" # hardware tests
OBJC_CROSS_VALIDATION=1 ANE_HARDWARE_TESTS=1 swift test --filter CrossValidationTests # parity
```

7 test suites cover MIL generation, tensor ops, CPU kernels, ANE compilation, hardware eval, cross-validation, and end-to-end generation.

## Disclaimer

> **App Store**: Apps using private ANE APIs (`_ANEClient`, `_ANEInMemoryModel`) will be rejected.
>
> **Everywhere else**: Internal tools, research, sideloaded apps, enterprise distribution — all fine.

This project uses undocumented private Apple APIs discovered through runtime introspection. Results are hardware- and OS-dependent. Benchmarks run on a local artifact family built by this repo, not a pretrained production model. Not affiliated with or endorsed by Apple Inc.

## Contributing

Contributions welcome — see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
File bugs and feature requests via [GitHub Issues](https://github.com/christopherkarani/Espresso/issues).

## License

MIT — see [LICENSE](LICENSE).