{"id":46218900,"url":"https://github.com/petlukk/eacompute","last_synced_at":"2026-05-13T19:01:00.387Z","repository":{"id":338192024,"uuid":"1156937219","full_name":"petlukk/eacompute","owner":"petlukk","description":"Explicit compute kernels → shared libraries + native bindings for Python, Rust, C++, PyTorch.","archived":false,"fork":false,"pushed_at":"2026-05-08T09:45:04.000Z","size":9504,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-08T11:37:52.252Z","etag":null,"topics":["aarch64","avx-512","avx2","code-generation","compiler","compute-kernels","cpp","ffi","high-performance-computing","llvm","neon","programming-language","python","pytorch","rust","simd"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/petlukk.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-13T08:24:27.000Z","updated_at":"2026-03-30T11:19:50.000Z","dependencies_parsed_at":"2026-03-30T12:02:01.840Z","dependency_job_id":null,"html_url":"https://github.com/petlukk/eacompute","commit_stats":null,"previous_names":["petlukk/e-","petlukk/eacompute"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/petlukk/eacompute","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petlukk%2Feacompute","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositori
es/petlukk%2Feacompute/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petlukk%2Feacompute/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petlukk%2Feacompute/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/petlukk","download_url":"https://codeload.github.com/petlukk/eacompute/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/petlukk%2Feacompute/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32995915,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T13:14:54.681Z","status":"ssl_error","status_checked_at":"2026-05-13T13:14:51.610Z","response_time":115,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aarch64","avx-512","avx2","code-generation","compiler","compute-kernels","cpp","ffi","high-performance-computing","llvm","neon","programming-language","python","pytorch","rust","simd"],"created_at":"2026-03-03T11:10:55.301Z","updated_at":"2026-05-13T19:01:00.381Z","avatar_url":"https://github.com/petlukk.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Eä\n\nWrite compute kernels in explicit, portable syntax. Compile to shared libraries. Generate native bindings for Python, Rust, C++, PyTorch, and CMake.\n\nNo runtime. No garbage collector. 
No glue code.\n\nTargets x86-64 (AVX2, AVX-512) and AArch64 (NEON, FP16, dot-product, I8MM).\n\n## The Performance Story\n\nThree workloads, measured honestly: warm-up discarded, 10 trials × 50 iterations, reporting peak throughput in GB/s. 16M float32 elements (64 MB). All Eä kernels are autoresearch-optimized (dual accumulators, FMA, restrict pointers). [Full benchmark script and methodology.](benchmarks/METHODOLOGY.md)\n\n**FMA: `out[i] = a[i]*b[i] + c[i]` — compute-bound**\n\n| Method | Time | GB/s | vs NumPy |\n|--------|------|------|----------|\n| NumPy (2-pass multiply+add) | 45,994 µs | 5.6 | baseline |\n| **Eä 1 thread** | **6,921 µs** | **37.0** | **6.6×** |\n| Eä 2 threads | 6,540 µs | 39.1 | 7.0× |\n| Dask (2 chunks) | 56,448 µs | 4.5 | 0.81× |\n| Ray (2 workers) | 89,106 µs | 2.9 | 0.52× |\n\n**Dot product: `sum(a[i]*b[i])` — bandwidth-bound**\n\n| Method | Time | GB/s | vs NumPy |\n|--------|------|------|----------|\n| NumPy BLAS sdot | 3,570 µs | 35.9 | baseline |\n| **Eä 1 thread** | **3,517 µs** | **36.4** | **1.01×** |\n| Dask (2 chunks) | 6,657 µs | 19.2 | 0.54× |\n| Ray (2 workers) | 26,159 µs | 4.9 | 0.14× |\n\n**SAXPY: `y[i] = a*x[i] + y[i]` — bandwidth-bound**\n\n| Method | Time | GB/s | vs NumPy |\n|--------|------|------|----------|\n| NumPy (2-pass multiply+add) | 7,637 µs | 16.8 | baseline |\n| **Eä 1 thread** | **3,635 µs** | **35.2** | **2.1×** |\n| Dask (2 chunks) | 57,131 µs | 2.2 | 0.13× |\n| Ray (2 workers) | 91,306 µs | 1.4 | 0.08× |\n\nWhy: Eä fuses operations into single-pass SIMD (one FMA instruction where NumPy does two array passes). The dot product matches BLAS because dual accumulators with 4× unroll hide FMA latency and saturate memory bandwidth. 
Ray and Dask add serialization overhead that makes them 7–50× slower for single-machine work.\n\n## What the code looks like\n\n```\nexport kernel vscale(data: *f32, out result: *mut f32 [cap: n], factor: f32)\n    over i in n step 8\n    tail scalar { result[i] = data[i] * factor }\n{\n    let v: f32x8 = load(data, i)\n    store(result, i, v .* splat(factor))\n}\n```\n\nCompile, bind, call:\n\n```bash\nea kernel.ea --lib                        # -\u003e kernel.so + kernel.ea.json\nea bind kernel.ea --python --rust --cpp   # -\u003e kernel.py, kernel.rs, kernel.hpp\n```\n\n```python\nimport numpy as np, kernel\ndata = np.random.rand(1_000_000).astype(np.float32)\nresult = kernel.vscale(data, 2.0)  # output auto-allocated, length auto-filled, dtype checked\n```\n\nOne kernel. Any host language. The binding handles allocation, length inference, and type checking.\n\n## Measured results\n\nThree workloads benchmarked against industry tools. Warm-cache medians, 20–50 timed runs, 5–10 warmup. Source, data, and scripts in each demo directory.\n\n| Workload | Compared against | Speedup | Method |\n|----------|-----------------|---------|--------|\n| [Vector search](demo/eavec/) (dim=384) | FAISS IndexFlatIP | **4–8×** | Dual-acc FMA, f32x8, next-vector prefetch |\n| [Sobel edge detection](demo/sobel/) (720p–4K) | OpenCV | **5–6×** (single-threaded) | Stencil f32x4, prefetch, L3 scaling analysis |\n| [CSV analytics](demo/eastat/) (10–544 MB) | polars | **1.4–2.2×** | Structural scan, SIMD reduction, binary search |\n\nAll three use `ea bind` for Python integration — zero manual ctypes. Validated across multiple input sizes. 
Full methodology and additional demos (conv2d at 265×, tokenizer at 406× vs NumPy) in [`COMPUTE_PATTERNS.md`](COMPUTE_PATTERNS.md).\n\n## `ea bind`\n\nReads the compiler's JSON metadata and generates idiomatic wrappers per target:\n\n```bash\nea bind kernel.ea --python    # -\u003e kernel.py         (NumPy + ctypes)\nea bind kernel.ea --rust      # -\u003e kernel.rs         (FFI + safe wrappers)\nea bind kernel.ea --cpp       # -\u003e kernel.hpp        (std::span + extern \"C\")\nea bind kernel.ea --pytorch   # -\u003e kernel_torch.py   (autograd.Function)\nea bind kernel.ea --cmake     # -\u003e CMakeLists.txt + EaCompiler.cmake\n```\n\nPointer args become slices/arrays/tensors. Length params collapse. Types are checked at the boundary. Multiple targets in one invocation: `ea bind kernel.ea --python --rust --cpp`\n\n## `ea inspect`\n\nSee what the compiler produced:\n\n```bash\nea kernel.ea --emit-asm       # assembly output\nea kernel.ea --emit-llvm      # LLVM IR\nea kernel.ea --header         # C header\n```\n\n## Quick start\n\n```bash\n# Requirements: LLVM 18, Rust\nsudo apt install llvm-18-dev clang-18 libpolly-18-dev libzstd-dev\ncargo build --features=llvm\n\n# Compile + bind + run\nea kernel.ea --lib\nea bind kernel.ea --python\npython -c \"import kernel; print(kernel.vscale([1.0, 2.0, 3.0], 10.0))\"\n\n# Run a demo\ncd demo/eastat \u0026\u0026 python run.py\n\n# Tests\ncargo test --tests --features=llvm\n```\n\n## SIMD types and operations\n\n`f32x4`, `f32x8`, `f32x16`¹, `f64x2`, `f64x4`, `i32x4`, `i32x8`, `i32x16`¹, `i8x16`, `i8x32`, `u8x16`, `i16x8`, `i16x16`, `f16x4`², `f16x8`²\n\n`load`, `store`, `splat`, `fma`, `shuffle`, `select`, `load_masked`, `store_masked`, `gather`³, `scatter`¹, `prefetch`\n\n`reduce_add`, `reduce_max`, `reduce_min`, `min`, `max`\n\n`maddubs_i16(u8x16, i8x16) -\u003e i16x8` — SSSE3 pmaddubsw. 
Chain with `madd_i16` for i32 accumulation.\n`madd_i16(i16xN, i16xN) -\u003e i32x(N/2)` — SSE2/AVX2/AVX-512 pmaddwd (x86-only; ARM error points at `wmul_i32 + addp_i32`).\n`vdot_i32`, `vdot_lane_i32` (ARM `--dotprod`); `smmla_i32`, `ummla_i32`, `usmmla_i32` (ARM `--i8mm`).\n`exp_poly_f32(f32xN) -\u003e f32xN` — polynomial vector exp on `[-50, 50]`, no libm scalarization. Measured 2.93× isolated vs scalar `exp()` on AMD Zen 4 + glibc 2.42; 2.23× in real `gemma4_gelu` on Pi 5 Cortex-A76 (other ops in GELU are Amdahl-capped).\n\n`widen_u8_f32x4`, `widen_i8_f32x4`, `widen_u8_f32x8`, `widen_i8_f32x8`, `widen_u8_f32x16`¹, `widen_i8_f32x16`¹, `widen_u8_i32x4`, `widen_u8_i32x8`, `widen_u8_i32x16`¹, `widen_u8_u16`, `narrow_f32x4_i8`, `pack_sat_*`, `pack_usat_*`, `round_f32x{4,8}_i32x{4,8}`, `sat_add`, `sat_sub`, `sqrt`, `rsqrt`, `exp`, `to_f32`, `to_i32`, `to_f64`, `to_i64`, `to_f16`²,\n`to_i16`, `cvt_f16_f32`, `cvt_f32_f16`.\n\nBitwise: `.\u0026`, `.|`, `.^`, `.\u003c\u003c`, `.\u003e\u003e` on integer vectors; `\u0026`, `|`, `^`, `\u003c\u003c`, `\u003e\u003e` on integer scalars. Restrict pointers: `*restrict T`, `*mut restrict T`.\n\n¹ Requires `--avx512`. ² Requires `--fp16` (ARM-only). ³ x86-only; ARM users compose via `f32x{4,8}_from_scalars` — see [`docs/idioms/neon-gather.md`](docs/idioms/neon-gather.md).\n\n## Kernel constructs\n\n```\nexport kernel name(...) over i in n step N tail \u003cstrategy\u003e { ... }\n```\n\nTail strategies: `tail scalar { ... }` (scalar fallback), `tail mask { ... }` (masked SIMD), `tail pad` (caller pads input). Output annotations (`out name: *mut T [cap: expr]`) drive auto-allocation in bindings.\n\nAlso: `for i in 0..n step 8 { ... }` counted loops, `foreach (i in 0..n) { ... 
}` element-wise loops (LLVM auto-vectorizes at O2+), `unroll(N)`, compile-time `const`, `static_assert`, `#[cfg(x86_64)]` / `#[cfg(aarch64)]` conditional compilation, C-compatible structs, multi-kernel files, pointer-to-pointer `**T` parameters.\n\n## Kernel fusion\n\nFusion eliminates memory round-trips between pipeline stages:\n\n```\n3 kernels (unfused):  8.55 ms   — 0.9× (slightly slower, FFI + memory overhead)\n1 kernel  (fused):    0.07 ms   — 111× faster than NumPy\n```\n\n\u003e If data leaves registers, you probably ended a kernel too early.\n\nAnalysis of when fusion helps and when it hurts: [`COMPUTE_PATTERNS.md`](COMPUTE_PATTERNS.md).\n\n## Design\n\nExplicit over implicit. SIMD width, loop stepping, and memory access are programmer-controlled. No hidden allocations, no auto-vectorizer in the default path, no runtime. Eä is not a general-purpose language — no strings, collections, or modules. It accelerates host languages; it does not replace them.\n\n## Architecture\n\n```\n.ea -\u003e Lexer -\u003e Parser -\u003e Desugar -\u003e Type Check -\u003e Codegen (LLVM 18) -\u003e .o / .so\n                                                                      -\u003e .ea.json -\u003e ea bind\n```\n\n~17,000 lines of Rust. 778 tests covering SIMD ops, C interop, structs, kernel constructs, tail strategies, binding generation, error suggestions, ARM targets. CI on x86-64, AArch64, Windows.\n\n[`BENCHMARKS.md`](BENCHMARKS.md) — performance tables. [`CHANGELOG.md`](CHANGELOG.md) — version history. Language reference: [`docs/src/reference/`](docs/src/reference/) (mdbook).\n\n## License\n\nApache 2.0\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetlukk%2Feacompute","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpetlukk%2Feacompute","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpetlukk%2Feacompute/lists"}