https://github.com/petlukk/eacompute

Explicit compute kernels → shared libraries + native bindings for Python, Rust, C++, PyTorch.
aarch64 avx-512 avx2 code-generation compiler compute-kernels cpp ffi high-performance-computing llvm neon programming-language python pytorch rust simd

Host: GitHub
URL: https://github.com/petlukk/eacompute
Owner: petlukk
License: apache-2.0
Created: 2026-02-13T08:24:27.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-05-08T09:45:04.000Z (2 months ago)
Last Synced: 2026-05-08T11:37:52.252Z (2 months ago)
Topics: aarch64, avx-512, avx2, code-generation, compiler, compute-kernels, cpp, ffi, high-performance-computing, llvm, neon, programming-language, python, pytorch, rust, simd
Language: Rust
Homepage:
Size: 9.06 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# Eä

Write compute kernels in explicit, portable syntax. Compile to shared libraries. Generate native bindings for Python, Rust, C++, PyTorch, and CMake.

No runtime. No garbage collector. No glue code.

Targets x86-64 (AVX2, AVX-512) and AArch64 (NEON, FP16, dot-product, I8MM).

## The Performance Story

Three workloads, measured honestly: warm-up discarded, 10 trials × 50 iterations, reporting peak throughput in GB/s. 16M float32 elements (64 MB). All Eä kernels are autoresearch-optimized (dual accumulators, FMA, restrict pointers). [Full benchmark script and methodology.](benchmarks/METHODOLOGY.md)
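
The measurement protocol described above can be sketched as a small timing harness. This is a minimal sketch, not the project's benchmark script: the function name `peak_gbps`, the stand-in workload, and the reading of "peak" as the fastest of the trials are all assumptions.

```python
import time

def peak_gbps(fn, bytes_moved, trials=10, iters=50, warmup=1):
    """Return the best trial's throughput in GB/s (1 GB = 1e9 bytes)."""
    for _ in range(warmup):
        fn()  # warm-up runs are executed but never timed
    best = float("inf")
    for _ in range(trials):
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        per_call = (time.perf_counter() - start) / iters
        best = min(best, per_call)  # keep the fastest trial (assumed meaning of "peak")
    return bytes_moved / best / 1e9

# Stand-in workload (hypothetical): copy a 1 MB buffer, counting read + write.
src = bytes(1_000_000)
rate = peak_gbps(lambda: bytearray(src), bytes_moved=2 * len(src))
print(f"{rate:.1f} GB/s")
```

Averaging 50 calls per trial smooths timer noise, and taking the best trial filters out runs disturbed by other processes.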

**FMA: `out[i] = a[i]*b[i] + c[i]` — compute-bound**

| Method | Time | GB/s | vs NumPy |
|--------|------|------|----------|
| NumPy (2-pass multiply+add) | 45,994 µs | 5.6 | baseline |
| **Eä 1 thread** | **6,921 µs** | **37.0** | **6.6×** |
| Eä 2 threads | 6,540 µs | 39.1 | 7.0× |
| Dask (2 chunks) | 56,448 µs | 4.5 | 0.81× |
| Ray (2 workers) | 89,106 µs | 2.9 | 0.52× |

**Dot product: `sum(a[i]*b[i])` — bandwidth-bound**

| Method | Time | GB/s | vs NumPy |
|--------|------|------|----------|
| NumPy BLAS sdot | 3,570 µs | 35.9 | baseline |
| **Eä 1 thread** | **3,517 µs** | **36.4** | **1.01×** |
| Dask (2 chunks) | 6,657 µs | 19.2 | 0.54× |
| Ray (2 workers) | 26,159 µs | 4.9 | 0.14× |

**SAXPY: `y[i] = a*x[i] + y[i]` — bandwidth-bound**

| Method | Time | GB/s | vs NumPy |
|--------|------|------|----------|
| NumPy (2-pass multiply+add) | 7,637 µs | 16.8 | baseline |
| **Eä 1 thread** | **3,635 µs** | **35.2** | **2.1×** |
| Dask (2 chunks) | 57,131 µs | 2.2 | 0.13× |
| Ray (2 workers) | 91,306 µs | 1.4 | 0.08× |

Why: Eä fuses operations into a single SIMD pass (one FMA instruction where NumPy makes two array passes). The dot product matches BLAS because dual accumulators with 4× unroll hide FMA latency and saturate memory bandwidth. Ray and Dask add serialization overhead; in the tables above they run roughly 2–25× slower than single-threaded Eä on this single-machine work.
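
The GB/s columns can be reproduced from the Time columns with a few lines of arithmetic. The sketch below is a check on the tables, not part of the project's tooling. It assumes "16M" means 16,000,000 elements, and it assumes the benchmark counts four arrays of traffic for FMA and two for dot product and SAXPY. Those counts are not stated in the README; they are inferred because they match the published figures.

```python
ELEMENTS = 16_000_000            # assumption: "16M" means 16 * 10**6 elements
BYTES_PER_ARRAY = ELEMENTS * 4   # float32, so 64 MB per array

def gbps(arrays_counted, time_us):
    """Throughput in GB/s for a given number of arrays moved and a time in microseconds."""
    return arrays_counted * BYTES_PER_ARRAY / (time_us * 1e-6) / 1e9

# (label, arrays counted, time in microseconds) taken from the tables above
rows = [
    ("FMA   NumPy", 4, 45_994),
    ("FMA   Ea",    4, 6_921),
    ("dot   BLAS",  2, 3_570),
    ("dot   Ea",    2, 3_517),
    ("SAXPY NumPy", 2, 7_637),
    ("SAXPY Ea",    2, 3_635),
]
for label, arrays, t in rows:
    print(f"{label}: {gbps(arrays, t):.1f} GB/s")
```

With these assumptions the script prints 5.6, 37.0, 35.9, 36.4, 16.8 and 35.2 GB/s, matching the tables.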

## What the code looks like

```
export kernel vscale(data: *f32, out result: *mut f32 [cap: n], factor: f32)
    over i in n step 8
    tail scalar { result[i] = data[i] * factor }
{
    let v: f32x8 = load(data, i)
    store(result, i, v .* splat(factor))
}
```
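
For readers new to the syntax, here is a plain-Python model of what the kernel above computes. It is an illustrative reading of the construct, not compiler output. It assumes `over i in n step 8` visits the full 8-element chunks and that the `tail scalar` block handles the remaining `n % 8` elements. The name `vscale_reference` is hypothetical.

```python
STEP = 8  # matches `step 8` and the f32x8 vector width in the kernel above

def vscale_reference(data, factor):
    """Plain-Python model of the kernel: 8-wide chunks, then a scalar tail."""
    n = len(data)
    result = [0.0] * n
    main_end = n - n % STEP  # first index not covered by a full chunk
    for i in range(0, main_end, STEP):
        v = data[i:i + STEP]                          # load(data, i)
        result[i:i + STEP] = [x * factor for x in v]  # store(result, i, v .* splat(factor))
    for i in range(main_end, n):                      # tail scalar { ... }
        result[i] = data[i] * factor
    return result

print(vscale_reference([float(k) for k in range(11)], 2.0))
```

With 11 elements, one full chunk covers indices 0 to 7 and the scalar tail covers indices 8 to 10.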

Compile, bind, call:

```bash
ea kernel.ea --lib                        # -> kernel.so + kernel.ea.json
ea bind kernel.ea --python --rust --cpp   # -> kernel.py, kernel.rs, kernel.hpp
```

```python
import numpy as np
import kernel

data = np.random.rand(1_000_000).astype(np.float32)
result = kernel.vscale(data, 2.0)  # output auto-allocated, length auto-filled, dtype checked
```

One kernel. Any host language. The binding handles allocation, length inference, and type checking.

## Measured results

Three workloads benchmarked against industry tools. Warm-cache medians, 20–50 timed runs, 5–10 warmup. Source, data, and scripts in each demo directory.

| Workload | Compared against | Speedup | Method |
|----------|-----------------|---------|--------|
| [Vector search](demo/eavec/) (dim=384) | FAISS IndexFlatIP | **4–8×** | Dual-acc FMA, f32x8, next-vector prefetch |
| [Sobel edge detection](demo/sobel/) (720p–4K) | OpenCV | **5–6×** (single-threaded) | Stencil f32x4, prefetch, L3 scaling analysis |
| [CSV analytics](demo/eastat/) (10–544 MB) | polars | **1.4–2.2×** | Structural scan, SIMD reduction, binary search |

All three use `ea bind` for Python integration — zero manual ctypes. Validated across multiple input sizes. Full methodology and additional demos (conv2d at 265×, tokenizer at 406× vs NumPy) in [`COMPUTE_PATTERNS.md`](COMPUTE_PATTERNS.md).
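
The "dual-acc FMA" technique named in the table keeps two independent running sums so that consecutive multiply-adds do not wait on each other's results. The sketch below shows only the dependency structure, in scalar Python with a hypothetical function name. In a real kernel each accumulator would be a SIMD vector such as `f32x8`, and Python itself gains no speed from this arrangement.

```python
def dot_dual_acc(a, b):
    """Dot product with two independent running sums, merged at the end."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    acc0 = 0.0
    acc1 = 0.0
    n = len(a)
    paired_end = n - n % 2
    for i in range(0, paired_end, 2):
        acc0 += a[i] * b[i]          # chain 0: does not depend on acc1
        acc1 += a[i + 1] * b[i + 1]  # chain 1: does not depend on acc0
    if paired_end < n:               # odd leftover element
        acc0 += a[paired_end] * b[paired_end]
    return acc0 + acc1               # single merge after the loop

print(dot_dual_acc([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))  # 32.0
```

Because floating-point addition is not associative, splitting the sum this way can change the last bits of the result compared with a single accumulator.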

## `ea bind`

Reads the compiler's JSON metadata and generates idiomatic wrappers per target:

```bash
ea bind kernel.ea --python    # -> kernel.py         (NumPy + ctypes)
ea bind kernel.ea --rust      # -> kernel.rs         (FFI + safe wrappers)
ea bind kernel.ea --cpp       # -> kernel.hpp        (std::span + extern "C")
ea bind kernel.ea --pytorch   # -> kernel_torch.py   (autograd.Function)
ea bind kernel.ea --cmake     # -> CMakeLists.txt + EaCompiler.cmake
```

Pointer arguments become slices, arrays, or tensors. Length parameters are inferred from the arguments and dropped from the wrapper signature. Types are checked at the boundary. Several targets can be generated in one invocation: `ea bind kernel.ea --python --rust --cpp`
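
To make those wrapper responsibilities concrete, here is a rough standard-library sketch of the shape such a wrapper could take. It is not the code `ea bind` generates: the real Python target uses NumPy and ctypes, while this version uses `array.array` and a pure-Python stand-in named `_native_vscale` (hypothetical) in place of the compiled kernel.

```python
from array import array

def _native_vscale(data, result, n, factor):
    """Hypothetical stand-in for the compiled kernel's C entry point."""
    for i in range(n):
        result[i] = data[i] * factor

def vscale(data, factor):
    """Wrapper sketch: check the type, infer the length, allocate the output, call the kernel."""
    if not isinstance(data, array) or data.typecode != "f":
        raise TypeError("data must be array('f') (float32)")
    n = len(data)                                  # length inferred, never passed by the caller
    result = array("f", bytes(n * data.itemsize))  # zero-filled output with capacity n
    _native_vscale(data, result, n, float(factor))
    return result

print(vscale(array("f", [1.0, 2.0, 3.0]), 10.0).tolist())  # [10.0, 20.0, 30.0]
```

The caller sees a two-argument function, while the underlying entry point takes four.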

## `ea inspect`

See what the compiler produced:

```bash
ea kernel.ea --emit-asm       # assembly output
ea kernel.ea --emit-llvm      # LLVM IR
ea kernel.ea --header         # C header
```

## Quick start

```bash
# Requirements: LLVM 18, Rust
sudo apt install llvm-18-dev clang-18 libpolly-18-dev libzstd-dev
cargo build --features=llvm

# Compile + bind + run
ea kernel.ea --lib
ea bind kernel.ea --python
python -c "import kernel; print(kernel.vscale([1.0, 2.0, 3.0], 10.0))"

# Run a demo
cd demo/eastat && python run.py

# Tests
cargo test --tests --features=llvm
```

## SIMD types and operations

`f32x4`, `f32x8`, `f32x16`¹, `f64x2`, `f64x4`, `i32x4`, `i32x8`, `i32x16`¹, `i8x16`, `i8x32`, `u8x16`, `i16x8`, `i16x16`, `f16x4`², `f16x8`²

`load`, `store`, `splat`, `fma`, `shuffle`, `select`, `load_masked`, `store_masked`, `gather`³, `scatter`¹, `prefetch`

`reduce_add`, `reduce_max`, `reduce_min`, `min`, `max`

`maddubs_i16(u8x16, i8x16) -> i16x8` — SSSE3 pmaddubsw. Chain with `madd_i16` for i32 accumulation.

`madd_i16(i16xN, i16xN) -> i32x(N/2)` — SSE2/AVX2/AVX-512 pmaddwd (x86-only; ARM error points at `wmul_i32 + addp_i32`).
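
The lane arithmetic behind these two intrinsics is easy to misread, so here is a scalar Python model of the underlying x86 instructions (`pmaddubsw` and `pmaddwd`). It is a sketch of the instruction semantics, not of Eä's implementation, and the `_ref` function names are hypothetical. The chaining step multiplies by a vector of ones, which is the common x86 idiom and is assumed here rather than taken from the README.

```python
def _sat_i16(x):
    """Clamp to the signed 16-bit range."""
    return max(-32768, min(32767, x))

def maddubs_i16_ref(u8, i8):
    """pmaddubsw: unsigned * signed bytes, adjacent pairs summed with i16 saturation."""
    assert len(u8) == len(i8) and len(u8) % 2 == 0
    return [_sat_i16(u8[k] * i8[k] + u8[k + 1] * i8[k + 1])
            for k in range(0, len(u8), 2)]

def madd_i16_ref(a, b):
    """pmaddwd: signed i16 products, adjacent pairs summed into i32.

    The hardware wraps in the single corner case where all four inputs
    are -32768; this model does not reproduce that wraparound.
    """
    assert len(a) == len(b) and len(a) % 2 == 0
    return [a[k] * b[k] + a[k + 1] * b[k + 1] for k in range(0, len(a), 2)]

u8 = list(range(16))                         # 16 unsigned bytes
i8 = [1, -1] * 8                             # 16 signed bytes
pairs_i16 = maddubs_i16_ref(u8, i8)          # 8 x i16
sums_i32 = madd_i16_ref(pairs_i16, [1] * 8)  # 4 x i32
print(pairs_i16, sums_i32)
```

The saturation in the first step is why products are widened to i32 before long accumulations: two products of 255 × 127 already exceed the i16 range.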

`vdot_i32`, `vdot_lane_i32` (ARM `--dotprod`); `smmla_i32`, `ummla_i32`, `usmmla_i32` (ARM `--i8mm`).

`exp_poly_f32(f32xN) -> f32xN` — polynomial vector exp on `[-50, 50]`, no libm scalarization. Measured 2.93× isolated vs scalar `exp()` on AMD Zen 4 + glibc 2.42; 2.23× in real `gemma4_gelu` on Pi 5 Cortex-A76 (other ops in GELU are Amdahl-capped).

`widen_u8_f32x4`, `widen_i8_f32x4`, `widen_u8_f32x8`, `widen_i8_f32x8`, `widen_u8_f32x16`¹, `widen_i8_f32x16`¹, `widen_u8_i32x4`, `widen_u8_i32x8`, `widen_u8_i32x16`¹, `widen_u8_u16`, `narrow_f32x4_i8`, `pack_sat_*`, `pack_usat_*`, `round_f32x{4,8}_i32x{4,8}`, `sat_add`, `sat_sub`, `sqrt`, `rsqrt`, `exp`, `to_f32`, `to_i32`, `to_f64`, `to_i64`, `to_f16`², `to_i16`, `cvt_f16_f32`, `cvt_f32_f16`.

Bitwise: `.&`, `.|`, `.^`, `.<<`, `.>>` on integer vectors; `&`, `|`, `^`, `<<`, `>>` on integer scalars. Restrict pointers: `*restrict T`, `*mut restrict T`.

¹ Requires `--avx512`. ² Requires `--fp16` (ARM-only). ³ x86-only; ARM users compose via `f32x{4,8}_from_scalars` — see [`docs/idioms/neon-gather.md`](docs/idioms/neon-gather.md).

## Kernel constructs

```
export kernel name(...) over i in n step N tail  { ... }
```

Tail strategies: `tail scalar { ... }` (scalar fallback), `tail mask { ... }` (masked SIMD), `tail pad` (caller pads input). Output annotations (`out name: *mut T [cap: expr]`) drive auto-allocation in bindings.
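
The three tail strategies differ only in how they treat the `n % step` leftover elements. The sketch below works through that arithmetic for a given `n` and `step`. It is an interpretation of the one-line descriptions above, not documented behaviour: it assumes a masked tail enables the first `n % step` lanes, and that `tail pad` expects the caller to round the buffer up to the next multiple of `step`. The function name `tail_plan` is hypothetical.

```python
def tail_plan(n, step):
    """Split n elements into full SIMD chunks plus a remainder and describe each tail option."""
    remainder = n % step
    main_end = n - remainder
    mask = [lane < remainder for lane in range(step)]    # active lanes for a masked tail
    padded_n = n if remainder == 0 else main_end + step  # buffer length assumed for `tail pad`
    return {
        "main_chunks": main_end // step,
        "scalar_tail_iterations": remainder,
        "mask_lanes": mask,
        "padded_length": padded_n,
    }

print(tail_plan(11, 8))
```

For `n = 11` and `step = 8` this gives one full chunk, three scalar tail iterations, a mask with the first three lanes active, and a padded length of 16.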

Also: `for i in 0..n step 8 { ... }` counted loops, `foreach (i in 0..n) { ... }` element-wise loops (LLVM auto-vectorizes at O2+), `unroll(N)`, compile-time `const`, `static_assert`, `#[cfg(x86_64)]` / `#[cfg(aarch64)]` conditional compilation, C-compatible structs, multi-kernel files, pointer-to-pointer `**T` parameters.

## Kernel fusion

Fusion eliminates memory round-trips between pipeline stages:

```
3 kernels (unfused):  8.55 ms   — 0.9× (slightly slower, FFI + memory overhead)
1 kernel  (fused):    0.07 ms   — 111× faster than NumPy
```

> If data leaves registers, you probably ended a kernel too early.

Analysis of when fusion helps and when it hurts: [`COMPUTE_PATTERNS.md`](COMPUTE_PATTERNS.md).
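
The difference between the two shapes can be shown with a toy pipeline. The stages below (scale, offset, clamp at zero) are hypothetical and are not the pipeline measured above. The sketch only illustrates why the unfused form writes and re-reads two full intermediate arrays while the fused form keeps each value local.

```python
def unfused(xs, scale, offset):
    """Three passes; each stage writes a full intermediate array."""
    scaled = [x * scale for x in xs]        # pass 1: write an intermediate
    shifted = [x + offset for x in scaled]  # pass 2: read it back, write another
    return [max(x, 0.0) for x in shifted]   # pass 3: read again, write the result

def fused(xs, scale, offset):
    """One pass; intermediate values never leave local temporaries."""
    return [max(x * scale + offset, 0.0) for x in xs]

data = [-2.0, -0.5, 0.0, 1.5]
assert unfused(data, 2.0, 1.0) == fused(data, 2.0, 1.0)
print(fused(data, 2.0, 1.0))  # [0.0, 0.0, 1.0, 4.0]
```

In a compiled kernel the same restructuring turns two extra trips through memory into register-to-register operations.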

## Design

Explicit over implicit. SIMD width, loop stepping, and memory access are programmer-controlled. No hidden allocations, no auto-vectorizer in the default path, no runtime. Eä is not a general-purpose language: it has no strings, collections, or modules. It accelerates host languages; it does not replace them.

## Architecture

```
.ea -> Lexer -> Parser -> Desugar -> Type Check -> Codegen (LLVM 18) -> .o / .so
                                                                     -> .ea.json -> ea bind
```

~17,000 lines of Rust. 778 tests covering SIMD ops, C interop, structs, kernel constructs, tail strategies, binding generation, error suggestions, ARM targets. CI on x86-64, AArch64, Windows.

[`BENCHMARKS.md`](BENCHMARKS.md) — performance tables. [`CHANGELOG.md`](CHANGELOG.md) — version history. Language reference: [`docs/src/reference/`](docs/src/reference/) (mdbook).

## License

Apache 2.0
