https://github.com/verilean/hesper

Verified GPU programming framework for Lean 4. Write type-safe WebGPU shaders with formal verification, hardware-accelerated matrix ops, and cross-platform support (Metal/Vulkan/D3D12). Build provably correct GPU compute and ML inference engines.
Host: GitHub
URL: https://github.com/verilean/hesper
Owner: Verilean
License: apache-2.0
Created: 2026-01-24T08:13:51.000Z (5 months ago)
Default Branch: main
Last Pushed: 2026-05-14T07:16:59.000Z (about 1 month ago)
Last Synced: 2026-05-30T00:14:28.235Z (27 days ago)
Topics: formal-verification, gpgpu, lean4, webgpu
Language: Lean
Homepage:
Size: 3.72 MB
Stars: 24
Watchers: 3
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# Hesper

**Write GPU programs in Lean 4. Prove them correct. Run on WebGPU.**

> [!IMPORTANT]
> **This is Alpha Software.**
> The APIs, verification features, and compiler are under active development and subject to breaking changes. While core functionality works, this project is primarily for research and experimentation.

Hesper is a verified GPU programming framework that brings the power of formal verification to GPU computing. Write type-safe shaders, execute tensor operations, and build graphics applications—all in Lean 4.

```lean

import Hesper.WGSL.DSL

-- Type-safe shader expressions with compile-time verification

let x : Exp (.scalar .f32) := var "x"

let y : Exp (.scalar .f32) := var "y"

let result := sqrt (x * x + y * y)  -- Generates: sqrt(x * x + y * y)

-- Cannot mix types (compile error!)

-- let wrong := x + (var "i" : Exp (.scalar .i32))  ✗ Type error!

```

## BitNet b1.58 Inference: 125 TPS on M4 Max

Hesper includes a complete **BitNet b1.58 2B** inference engine running entirely on WebGPU, achieving **125 tokens/second** on Apple M4 Max:

```

$ lake exe bitnet-complete --stats

> Hello, world!

Hello, world! I'm a 20-year-old college student...

Performance: 125.6 TPS (8.0 ms/token)

  Model: BitNet b1.58 2B (30 layers, 2560 dim, i2_s ternary weights)

```

**Key optimizations:**

- **Flash Attention**: fused score + online softmax + apply in 1 kernel (3 kernels → 1)

- Ternary weight kernel (i2_s): 2-bit packed weights, addition-only matmul

- Kernel fusion: fused gate+up+ReLU²×mul and fused KV cache write (150 fewer dispatches/token)

- Shared memory F16 matmul for LM head (128K vocab)

- PreparedDispatch graph capture: ~99% pipeline cache hit rate

- Command buffer batching: single GPU submit per token

- KV cache with grouped-query attention (20 heads, 5 KV heads)
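
The "addition-only matmul" item follows from the weight alphabet: when every weight is -1, 0, or +1, each term of a dot product is a subtraction, a skip, or an addition. Below is a minimal CPU-side sketch in plain Lean 4. The names `Trit` and `ternaryDot` are hypothetical and exist only for illustration; this is not Hesper's i2_s kernel, and the 2-bit packing is left out.

```lean
-- Hypothetical illustration only; not part of Hesper's API.
inductive Trit where
  | neg | zero | pos

/-- Dot product with ternary weights: only additions and subtractions. -/
def ternaryDot (w : Array Trit) (x : Array Float) : Float :=
  (w.zip x).foldl (fun (acc : Float) (p : Trit × Float) =>
    match p.1 with
    | Trit.neg  => acc - p.2   -- weight -1
    | Trit.zero => acc         -- weight  0
    | Trit.pos  => acc + p.2)  -- weight +1
    0.0

-- 1.5 (weight +1), 2.0 (weight 0), 0.5 (weight -1): 1.5 - 0.5 = 1.0
#eval ternaryDot #[Trit.pos, Trit.zero, Trit.neg] #[1.5, 2.0, 0.5]
```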

**Also: 40 TPS on RTX 4070 Ti (Vulkan)**

See [bitnet.lean](https://github.com/Verilean/bitnet.lean) for the full inference pipeline.

### LoRA Finetuning (Alpaca-style Instruction Tuning)

Hesper supports LoRA finetuning with a **verified backward pass**:

```bash

# Train on Alpaca-format dataset

lake exe alpaca-finetune --model model.gguf --data alpaca_data.json --epochs 50 --rank 8

# Inference with LoRA adapter

lake exe bitnet-complete model.gguf "What is Hesper?" 60 --lora lora_weights.bin

```

**Training features:**

- Complete backward chain: 13/13 ops (attention 7 + FFN 6)

- Verified AD: each backward op numerically checked against CPU spec

- GPU ↔ CPU consistency: all backward kernels match CPU spec (error = 0.0)

- Type-safe backward chain: missing ops cause compile-time error

- AdamW optimizer with gradient clipping, LR scheduling (cosine + warmup)

- GPU-batched forward + backward (1 GPU submit per token)
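
The README does not spell out the exact "cosine + warmup" schedule, so the following is only one common formulation, written in plain Lean 4. The function name `lrAt` and its parameters are assumptions for illustration and are not taken from Hesper.

```lean
/-- Learning rate at `step`: linear warmup to `peak`, then cosine decay to 0.
    Hypothetical sketch; Hesper's actual schedule may differ. -/
def lrAt (step warmup total : Nat) (peak : Float) : Float :=
  if step < warmup then
    -- Linear ramp; `warmup > 0` holds in this branch because `step < warmup`.
    peak * (step.toFloat + 1.0) / warmup.toFloat
  else
    -- Fraction of the decay phase completed (guarded against division by zero).
    let progress := (step - warmup).toFloat / (Nat.max 1 (total - warmup)).toFloat
    0.5 * peak * (1.0 + Float.cos (3.141592653589793 * progress))

#eval lrAt 0 10 100 0.001    -- early in warmup
#eval lrAt 10 10 100 0.001   -- start of decay: equals the peak rate
#eval lrAt 100 10 100 0.001  -- end of training: approximately 0
```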

### Verified Automatic Differentiation

Every backward operation is verified correct:

```bash

$ lake exe verified-ad

  PASS Softmax, RoPE, RMSNorm, ScaledDot, ReLU²×Mul  (numerical gradient check)

  PASS Chain rule composition: error = 0.0

  ✓ All AD verifications PASSED

$ lake exe gpu-vs-cpu-test

  ✓ SoftmaxBackward, RMSNormBackward, RoPEBackward, ReLU²×Mul  (GPU matches CPU spec)

$ lake exe chain-completeness

  ✓ Backward chain is COMPLETE (13/13 ops)

```
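
A numerical gradient check of this kind compares an analytic derivative with a finite-difference estimate. Below is a minimal scalar sketch in plain Lean 4 for the ReLU² activation. All names here are hypothetical, and the real `verified-ad` executable presumably checks tensor-valued ops rather than scalars.

```lean
/-- Central-difference estimate of `f'(x)`. -/
def numericalGrad (f : Float → Float) (x : Float) (h : Float := 1e-4) : Float :=
  (f (x + h) - f (x - h)) / (2.0 * h)

/-- ReLU²: square of the positive part. -/
def reluSq (x : Float) : Float := if x > 0.0 then x * x else 0.0

/-- Hand-written derivative of ReLU². -/
def reluSqGrad (x : Float) : Float := if x > 0.0 then 2.0 * x else 0.0

/-- True when the analytic and numerical gradients agree within `tol`. -/
def checkGrad (x : Float) (tol : Float := 1e-3) : Bool :=
  decide (Float.abs (numericalGrad reluSq x - reluSqGrad x) < tol)

#eval checkGrad 1.5     -- positive side, gradient 3.0
#eval checkGrad (-2.0)  -- negative side, gradient 0.0
```

Both checks are expected to report `true`; an incorrect backward rule (for example `x` instead of `2.0 * x`) would fail the first one.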

## Why Hesper?

Modern GPU programming lacks safety guarantees. Hesper provides:

- **Type Safety**: Shaders are type-checked at compile time, preventing type mismatches

- **Formal Verification**: Prove correctness properties about your GPU programs

- **Verified Training**: Backward ops numerically checked, GPU kernels match CPU specs

- **WebGPU Backend**: Cross-platform GPU access via Dawn (Metal, Vulkan, D3D12)

- **Lean Integration**: Use Lean's powerful theorem proving alongside GPU computation

- **Multi-GPU Support**: Select and coordinate across multiple GPU adapters

## Quick Start

### Prerequisites

- **Platform**: macOS (Metal), Linux (Vulkan), or Windows (D3D12/Vulkan)

### 🐳 Docker Environment (Recommended for Linux/CI)

For a reproducible build environment, especially on Linux, you can use the provided Docker image:

```bash

# Build the image

docker build -t hesper-ci .

# Run build and tests inside container

docker run -it hesper-ci lake exe test-all

```

### Installation

```bash

# Clone the repository

git clone https://github.com/Verilean/hesper.git

cd hesper

# Build native dependencies (this will take a while on first build)

lake script run buildNative

# Build and run a demo

lake build dsl-basics

./.lake/build/bin/dsl-basics

```

### Your First Hesper Program

Create `MyFirst.lean`:

```lean

import Hesper

import Hesper.WebGPU.Device

def main : IO Unit := do

  -- Initialize WebGPU

  Hesper.init

  -- Get a GPU device

  let device ← Hesper.WebGPU.getDevice

  IO.println "✓ GPU ready!"

```

Build and run:

```bash

lake build myfirst

./.lake/build/bin/myfirst

```
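
`lake build myfirst` only succeeds once the project's `lakefile.lean` declares an executable target with that name. The README does not show this step, so the fragment below is a sketch under two assumptions: `MyFirst.lean` sits at the repository root, and any native link settings the executable needs are configured as for the existing examples (they are omitted here).

```lean
-- Sketch of an addition to the existing lakefile.lean (link settings omitted)
lean_exe myfirst where
  root := `MyFirst
```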

## Features

### 🚀 Portable SIMD CPU Backend (Google Highway)

Hardware-accelerated CPU operations powered by **Google Highway**, providing high-performance SIMD across x86, ARM, and RISC-V:

```lean
import Hesper.Simd
import Hesper.Float32
import Hesper.Float16

-- Float64 (8 bytes): Native Lean Float, NEON 2/op, AVX2 4/op
let a64 := FloatArray.mk #[1.0, 2.0, 3.0, 4.0]
-- Second operand, chosen to match the sample output [6.0, 8.0, 10.0, 12.0]
let b64 := FloatArray.mk #[5.0, 6.0, 7.0, 8.0]
let c64 := Hesper.Simd.simdAdd a64 b64

-- Float32 (4 bytes): 2x memory savings, NEON 4/op, AVX2 8/op
let a32 := Float32.fromFloatArray a64
let b32 := Float32.fromFloatArray b64
let c32 := Float32.simdAdd a32 b32

-- Float16 (2 bytes): 4x memory savings, NEON 8/op, AVX2+F16C 8/op
-- Requires ARMv8.2-A FP16 or x86_64 F16C - returns error if unavailable
let hasFP16 ← Float16.hasHardwareSupport
if hasFP16 then
  let a16 ← Float16.fromFloatArray a64
  let b16 ← Float16.fromFloatArray b64
  let c16 ← Float16.simdAdd a16 b16
```

**Features:**

- **Google Highway Integration**: Portable SIMD implementation with runtime dispatch

- **Architecture Support**: NEON (ARM), AVX2/AVX-512 (x86), optional FP16 vector arithmetic

- **Multi-Precision**: Optimized paths for Float64, Float32, and Float16

- **OpenMP Support**: Optional multithreading for large tensor operations

**Zero-Conversion Architecture:**

All operations work directly on raw `ByteArray` with no automatic type conversions. Conversions are explicit only when needed.

### ⚡️ High-Level Parallel API

Inspired by `webgpu-dawn`, Hesper provides an easy-to-use API for data-parallelism that handles all GPU boilerplate (buffers, shaders, synchronization) in a single call.

#### parallelFor

Quickly execute a WGSL shader over a `Float` array:

```lean

import Hesper.Compute

-- Multiply each element by 1000 on the GPU

let result ← parallelFor device shader inputData

```

#### Device.compute

Run a computation with multiple named buffers directly on the `Device`:

```lean

device.compute myKernel [("input", inputBuf), ("output", outputBuf)] config

```

### 🎯 Type-Safe Shader DSL

Write WGSL shaders with Lean's type system guaranteeing correctness:

```lean

import Hesper.WGSL.DSL

-- Expressions are typed and checked at compile time

let x : Exp (.scalar .f32) := var "x"

let y : Exp (.scalar .f32) := var "y"

-- Arithmetic operators work naturally

let distance := sqrt (x * x + y * y)

-- Built-in functions

let clamped := Exp.clamp x (lit 0.0) (lit 1.0)

let power := Exp.pow x (lit 2.0)

-- Generate WGSL code

IO.println distance.toWGSL  -- Output: sqrt((x * x) + (y * y))

```
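
The type-safety claim can be illustrated with a self-contained miniature of the idea: index the expression type by the scalar type it produces, so mixed-type arithmetic has no valid typing. Everything below lives in a hypothetical `Sketch` namespace and is not Hesper's actual definition of `Exp`; it only shows the technique in plain Lean 4.

```lean
namespace Sketch  -- hypothetical miniature, not Hesper's definitions

inductive Scalar where
  | f32 | i32

/-- Expressions indexed by the scalar type they produce. -/
inductive Exp : Scalar → Type where
  | var  {t : Scalar} : String → Exp t
  | add  {t : Scalar} : Exp t → Exp t → Exp t
  | mul  {t : Scalar} : Exp t → Exp t → Exp t
  | sqrt : Exp Scalar.f32 → Exp Scalar.f32

-- `+` and `*` are only available when both operands share the same index.
instance {t : Scalar} : Add (Exp t) := ⟨Exp.add⟩
instance {t : Scalar} : Mul (Exp t) := ⟨Exp.mul⟩

/-- Pretty-print an expression as WGSL source text. -/
def Exp.toWGSL {t : Scalar} : Exp t → String
  | .var n   => n
  | .add a b => s!"({a.toWGSL} + {b.toWGSL})"
  | .mul a b => s!"({a.toWGSL} * {b.toWGSL})"
  | .sqrt a  => s!"sqrt({a.toWGSL})"

def x : Exp Scalar.f32 := Exp.var "x"
def y : Exp Scalar.f32 := Exp.var "y"
def i : Exp Scalar.i32 := Exp.var "i"

#eval (Exp.sqrt (x * x + y * y)).toWGSL

-- Rejected by the type checker: no `+` exists between `Exp f32` and `Exp i32`.
-- def wrong := x + i

end Sketch
```

Uncommenting `wrong` makes elaboration fail, which is the same class of error the DSL example above reports for mixed `f32`/`i32` operands.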

### 🧩 Verified Composable Kernels (Operator Fusion)

Hesper's `VerifiedOpFusion` architecture allows you to compose multiple GPU operations into a single kernel pass while maintaining formal correctness:

```lean

-- Fuses MatMul and ReLU into a single GPU kernel

-- Correctness is proven by construction

let fusedOp := matmulKernel |> reluKernel

```

**Key Advantages:**

- **Zero-Copy Fusion**: Eliminate expensive memory roundtrips between kernels.

- **Formal Correctness**: Each fused kernel is verified against a high-level CPU specification (`spec_forward`).

- **Unified Interface**: Same code runs on GPU (via WGSL) or CPU (via Google Highway) for easy debugging.
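
As an analogy for what "verified against a specification" can mean for a fused kernel, the following plain Lean 4 sketch proves that a single fused element-wise pass equals the two-pass version. It works on `List Float` rather than GPU buffers, and the names `unfused`, `fused`, and `fused_eq_unfused` are hypothetical; it is not Hesper's `VerifiedOpFusion` implementation.

```lean
/-- Specification: two element-wise passes with an intermediate list. -/
def unfused (f g : Float → Float) (xs : List Float) : List Float :=
  (xs.map f).map g

/-- Fused version: one pass that applies both functions per element. -/
def fused (f g : Float → Float) (xs : List Float) : List Float :=
  xs.map (g ∘ f)

/-- The fused pass computes exactly what the two-pass specification computes. -/
theorem fused_eq_unfused (f g : Float → Float) (xs : List Float) :
    fused f g xs = unfused f g xs := by
  simp [fused, unfused]
```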

### 📈 Unified Verified Automatic Differentiation

Hesper's unique architecture unifies **formal verification** with **automatic differentiation** via a shared **Differentiable** interface. This allows the AD engine to treat complex, verified GPU kernels as first-class primitives.

#### The Differentiable Interface

All operations in Hesper—from simple scalar addition to fused ResNet blocks—implement this common trait:

```lean

class Differentiable (I O : Type) where

  /-- Primal execution (Forward pass) -/

  forward : I → O

  

  /-- Adjoint computation (backward pass): matrix-free vector-Jacobian product (Jᵀv) -/

  backward : I → O → I

```
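
To make the `forward`/`backward` pairing concrete, here is a small self-contained sketch in plain Lean 4 showing how two such operations compose by the chain rule. It uses a structure named `DiffOp` (a hypothetical stand-in, so that several operations on the same types can coexist) rather than Hesper's `Differentiable` class, and scalar `Float` values instead of tensors.

```lean
/-- Hypothetical stand-in for a differentiable operation. -/
structure DiffOp (I O : Type) where
  forward  : I → O
  backward : I → O → I  -- (input, upstream cotangent) ↦ cotangent for the input

/-- Chain rule: run `f` then `g`; pull the cotangent back through `g`, then `f`. -/
def DiffOp.comp {A B C : Type} (f : DiffOp A B) (g : DiffOp B C) : DiffOp A C where
  forward a := g.forward (f.forward a)
  backward a dc := f.backward a (g.backward (f.forward a) dc)

def square : DiffOp Float Float where
  forward x := x * x
  backward x dy := 2.0 * x * dy

def scale (k : Float) : DiffOp Float Float where
  forward x := k * x
  backward _ dy := k * dy

-- d/dx (3x²) at x = 2 is 12
#eval (square.comp (scale 3.0)).backward 2.0 1.0
```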

#### Why it Matters:

- **Unified Logic**: Scalar-CPU logic and Tensor-GPU kernels share the same mathematical abstraction.

- **End-to-End Correctness**: By "lifting" `VerifiedOp` instances into the AD tape, Hesper ensures that backpropagation is as formally correct as the forward pass.

- **Zero-Copy Fusion**: The AD engine can calculate gradients across fused kernels (e.g., `MatMul |> ReLU`) without writing intermediate tensors to VRAM.

```lean

-- AD engine automatically dispatches to hand-optimized GPU kernels

let grad := diff (matmul |> relu |> crossEntropy) input 

```

**Key Features:**

- **Hybrid AD**: Seamlessly switch between CPU-scalar AD and GPU-tensor AD.

- **Verified Primitives**: Every AD node is backed by a verified `spec_forward` and `spec_backward`.

- **High Performance**: Leverages hand-optimized WGSL and Google Highway SIMD.

### ⚙️ High-Level Optimizers

Train models using state-of-the-art optimizers that integrate with Hesper's verified tensors:

```lean

import Hesper.Optimizer.SGD

-- Configure SGD with momentum

let opt := SGDConfig.default

  |>.withLearningRate 0.01 

  |>.withMomentum 0.9

-- Perform optimization step

let (newParams, newState) := opt.step params grads state

```
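
For reference, the classic SGD-with-momentum update that a `step` function of this shape typically performs can be sketched in plain Lean 4 as follows. `MomentumConfig` and `sgdStep` are hypothetical names, and the sketch assumes the state is a velocity vector; Hesper's `SGDConfig` and `opt.step` may be defined differently.

```lean
/-- Hypothetical configuration mirroring the settings used above. -/
structure MomentumConfig where
  learningRate : Float := 0.01
  momentum     : Float := 0.9

/-- One update: v ← μ·v + g, then p ← p − η·v. Returns (new params, new velocity). -/
def sgdStep (cfg : MomentumConfig) (params grads velocity : Array Float) :
    Array Float × Array Float :=
  let newVelocity :=
    (velocity.zip grads).map fun (p : Float × Float) => cfg.momentum * p.1 + p.2
  let newParams :=
    (params.zip newVelocity).map fun (p : Float × Float) => p.1 - cfg.learningRate * p.2
  (newParams, newVelocity)

-- With zero initial velocity, the first step reduces to plain gradient descent.
#eval sgdStep {} #[1.0, 2.0] #[0.5, -0.5] #[0.0, 0.0]
```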

### 🎮 Graphics & Windowing

Build interactive graphics applications with GLFW integration:

```lean

import Hesper.GLFW

def main : IO Unit := do

  Hesper.init

  withGLFW do

    let window ← createWindow 800 600 "Hesper Graphics"

    let device ← Hesper.WebGPU.getDevice

    let surface ← createSurface device window

    -- Render loop

    gameLoop window surface

```

### 🔌 Multi-GPU Support

Enumerate and select GPUs in multi-GPU systems:

```lean

import Hesper.WebGPU.Device

-- List all available GPUs

Hesper.WebGPU.listAdapters

-- Select specific GPU

let device0 ← getDeviceByIndex 0  -- First GPU

let device1 ← getDeviceByIndex 1  -- Second GPU

-- Get adapter information

let info ← getAdapterInfo 0

IO.println s!"GPU: {info.name} (Backend: {info.backendType})"

```

## Examples

### WebGPU Tetris

A full Tetris implementation using GLFW and WebGPU, demonstrating:

- Dynamic shader generation

- Real-time rendering

- Input handling

- Game state management

```bash

lake build tetris

./.lake/build/bin/tetris

```

**Controls**: A/D (move), S (drop), Space (rotate), ESC (exit)

### Matrix Multiplication

High-performance matrix multiplication with subgroup optimizations:

```bash

lake build matmul-demo

./.lake/build/bin/matmul-demo

```

Demonstrates:

- GPU buffer management

- Compute shader execution

- Performance profiling

- Result verification

### SIMD CPU Backend

Multi-precision SIMD operations with hardware acceleration:

```bash

# Run multi-precision test (Float64/Float32/Float16)

lake script run buildSimd

lake build multi-precision

./.lake/build/bin/multi-precision

# Run SIMD benchmarks

lake build simd-bench

./.lake/build/bin/simd-bench

```

Output:

```

Backend: NEON (ARM64) - F64: 2/op, F32: 4/op, FP16

─── Float64 (8 bytes/element) ───

Result: #[6.0, 8.0, 10.0, 12.0] ✓

─── Float32 (4 bytes/element) ───

Result: Float32[4]: [6.0, 8.0, 10.0, 12.0] ✓

─── Float16 (2 bytes/element) ───

FP16 hardware detected!

Result: Float16[4]: [6.0, 8.0, 10.0, 12.0] ✓

```

### Multi-GPU Demo

Enumerate GPUs and create devices from specific adapters:

```bash

lake build multigpu

./.lake/build/bin/multigpu

```

Output:

```

Found 2 GPU adapter(s):

  [0] NVIDIA GeForce RTX 3080 (Backend: Vulkan)

  [1] Intel UHD Graphics 630 (Backend: Vulkan)

✓ Device created from GPU 0

```

### Neural Network Training

Automatic differentiation and gradient descent on GPU:

```bash

lake build nn-gpu-demo

./.lake/build/bin/nn-gpu-demo

```

Features:

- Conv2D layers with verified gradients

- Backpropagation on GPU

- Real-time training visualization

## Building and Testing

### Building the Project

Hesper requires building both native C++ dependencies (Google Dawn) and Lean code.

**Step 1: Build Native Dependencies**

The first build will take 5-15 minutes as it compiles Google Dawn from source:

```bash

# Build the native WebGPU bridge (hesper_native library)

lake script run buildNative

```

This compiles:

- Google Dawn WebGPU implementation

- C++ FFI bridge (`native/bridge.cpp`)

- SIMD CPU backend (`c_src/simd_ops.cpp`)

**Step 2: Build Lean Code**

Once native dependencies are built, compile the Lean libraries and executables:

```bash

# Build the core library

lake build Hesper

# Or build a specific executable

lake build simple-write

```

**Clean Build** (if you encounter issues):

```bash

lake clean

lake script run buildNative

lake build

```

### Testing the Installation

#### 1. Simple GPU Test (Raw WGSL + DSL)

This test verifies both raw WGSL shaders and DSL-generated shaders execute correctly on your GPU:

```bash

lake build simple-write

./.lake/build/bin/simple-write

```

**Expected output:**

```

╔══════════════════════════════════════╗
║   GPU Double Test (DSL + Raw)        ║
╚══════════════════════════════════════╝

📝 DSL-generated WGSL:

─────────────────────────────────────

@group(0) @binding(0) var<storage, read_write> input: array<f32>;

@group(0) @binding(1) var<storage, read_write> output: array<f32>;

@compute @workgroup_size(1)

fn main(@builtin(global_invocation_id) gid: vec3<u32>) {

    let idx = gid.x;

    if (idx < arrayLength(&input)) {

        output[idx] = input[idx] * 2.0;

    }

}

🚀 Initializing WebGPU...

  ✓ Created input buffer

  ✓ Wrote input: [1.0, 2.0, 3.0, 4.0]

  ✓ Created output buffer

  🔹 Test 1: Raw WGSL shader

  ✓ Raw WGSL executed

  🔹 Test 2: DSL-generated shader

  ✓ DSL shader executed

📊 Results:

  Input → Expected → Raw WGSL → DSL WGSL

  [0] 1.0 → 2.0 → 2.0 ✓ → 2.0 ✓

  [1] 2.0 → 4.0 → 4.0 ✓ → 4.0 ✓

  [2] 3.0 → 6.0 → 6.0 ✓ → 6.0 ✓

  [3] 4.0 → 8.0 → 8.0 ✓ → 8.0 ✓

✅ SUCCESS: Both shaders work correctly!

   - Raw WGSL shader: ✓

   - DSL-generated shader (ShaderM monad): ✓

   - Both produce identical correct results

```

This test validates:

- ✓ WebGPU initialization and GPU discovery

- ✓ Buffer creation and data transfer (CPU ↔ GPU)

- ✓ Raw WGSL shader compilation and execution

- ✓ DSL shader code generation (ShaderM monad → WGSL)

- ✓ DSL shader execution on GPU

- ✓ Correct data marshalling across the FFI boundary

#### 2. FFI Boundary Tests

Test data conversion across the Lean ↔ C++ FFI boundary:

```bash

lake build ffi-tests

./.lake/build/bin/ffi-tests

```

**Expected output:**

```

╔══════════════════════════════════════╗
║   FFI Boundary Tests                 ║
╚══════════════════════════════════════╝

Test 1: Lean writes data, C++ reads

  ✓ Lean wrote: [1.0, 2.0, 3.0, 4.0]

  ✓ C++ verified byte-level accuracy

Test 2: C++ writes data, Lean reads

  ✓ GPU wrote: [10.0, 20.0, 30.0, 40.0]

  ✓ Lean verified byte-level accuracy

Test 3: Round-trip (Lean → GPU → Lean)

  ✓ Input: [5.0, 10.0, 15.0, 20.0]

  ✓ Output: [10.0, 20.0, 30.0, 40.0]

  ✓ Data integrity preserved

✅ All FFI boundary tests passed!

```

This validates:

- Lean writes ByteArray → C++ reads correct bytes

- C++ writes bytes → Lean reads correct Float values

- Round-trip data integrity across FFI boundary

#### 3. SIMD CPU Backend Test

Test multi-precision SIMD operations (Float64/Float32/Float16):

```bash

lake script run buildSimd

lake build multi-precision

./.lake/build/bin/multi-precision

```

**Expected output (on ARM64 with FP16 support):**

```

Backend: NEON (ARM64) - F64: 2/op, F32: 4/op, FP16

─── Float64 (8 bytes/element) ───

Result: #[6.0, 8.0, 10.0, 12.0] ✓

─── Float32 (4 bytes/element) ───

Result: Float32[4]: [6.0, 8.0, 10.0, 12.0] ✓

─── Float16 (2 bytes/element) ───

FP16 hardware detected!

Result: Float16[4]: [6.0, 8.0, 10.0, 12.0] ✓

```

### For Contributors: Testing Your Changes

When making changes to Hesper, run these tests to ensure you haven't broken anything:

#### 1. Core FFI Tests

```bash

# Test Lean ↔ C++ data conversion

lake build ffi-tests

./.lake/build/bin/ffi-tests

```

#### 2. GPU Shader Tests

```bash

# Test raw WGSL and DSL shader execution

lake build simple-write

./.lake/build/bin/simple-write

```

#### 3. SIMD Tests

```bash

# Rebuild SIMD library and run tests

lake script run buildSimd

lake build simd-test

./.lake/build/bin/simd-test

```

#### 4. Full Test Suite

```bash

# Run all tests

lake build test-all

./.lake/build/bin/test-all

```

### Troubleshooting

#### Issue: "Build failed: native library not found"

**Solution:** Rebuild the native library:

```bash

lake clean

lake script run buildNative

lake build

```

#### Issue: "No GPU adapters found"

**Solution:** Ensure you have proper GPU drivers:

- **macOS**: No action needed (Metal is built-in)

- **Linux**: Install Vulkan drivers (`vulkan-tools`, `mesa-vulkan-drivers`)

- **Windows**: Install latest GPU drivers with D3D12/Vulkan support

#### Issue: "SIMD library not found"

**Solution:** Build the SIMD backend:

```bash

lake script run buildSimd

```

#### Issue: "FP16 not supported"

**Solution:** This is expected on older hardware. Float16 requires:

- ARM64: ARMv8.2-A with FP16 extension (Apple M1+, AWS Graviton2+)

- x86_64: F16C extension (Intel Ivy Bridge+ / AMD Piledriver+)

The library will gracefully fall back to Float32 operations.

#### Issue: Dawn build takes too long

**Solution:** Dawn's first build can take 5-15 minutes. Subsequent builds are incremental and much faster. To speed up the first build:

```bash

# Use more CPU cores (adjust -j value)

lake script run buildNative -- -j 16

```

## How It Works

```
┌─────────────────────────────────────────────────────────────┐
│                    Lean 4 Code                               │
│  • Type-safe shader DSL                                      │
│  • Tensor operations                                         │
│  • Formal proofs                                             │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              WGSL Code Generation                            │
│  Exp (.scalar .f32) → WGSL shader source                    │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              Lean FFI (C++ Bridge)                           │
│  • lean_hesper_* functions                                   │
│  • Resource management via Lean.External                     │
└─────────────────────┬───────────────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────────────┐
│              Google Dawn (WebGPU Native)                     │
│  • Metal (macOS)                                             │
│  • Vulkan (Linux/Windows)                                    │
│  • D3D12 (Windows)                                           │
└─────────────────────────────────────────────────────────────┘
```

### Architecture Layers

1. **DSL Layer**: Type-safe WGSL expression builder with dependent types

2. **Tensor Layer**: High-level operations (matmul, conv2d, pooling)

3. **Compute Layer**: Shader compilation, buffer management, execution

4. **WebGPU Layer**: FFI bindings to Dawn native implementation

5. **Backend Layer**: Platform-specific GPU drivers (Metal/Vulkan/D3D12)

## Project Structure

```
Hesper/
├── Hesper/
│   ├── WGSL/          # Type-safe shader DSL
│   │   ├── Types.lean      # WGSL type system
│   │   ├── Exp.lean        # Expression AST
│   │   └── DSL.lean        # User-facing DSL
│   ├── WebGPU/        # WebGPU bindings
│   │   ├── Device.lean     # GPU device management
│   │   ├── Buffer.lean     # GPU buffers
│   │   ├── Shader.lean     # Shader modules
│   │   ├── Pipeline.lean   # Compute/render pipelines
│   │   └── Errors.lean     # Comprehensive error handling
│   ├── Tensor/        # Tensor operations
│   │   └── MatMul.lean     # Matrix multiplication
│   ├── NN/            # Neural network layers
│   │   └── Conv.lean       # Convolution layers
│   ├── GLFW/          # Windowing and graphics
│   │   └── GLFW.lean       # GLFW bindings
│   ├── Simd.lean      # SIMD Float64 operations
│   ├── Float32.lean   # SIMD Float32 operations
│   ├── Float16.lean   # SIMD Float16 operations
│   └── Compute.lean   # High-level compute API
├── Examples/          # Example programs
│   ├── Tetris.lean         # Full game demo
│   ├── MultiGPU.lean       # Multi-GPU support
│   ├── DSLBasics.lean      # DSL tutorial
│   └── ...
├── native/            # C++ WebGPU bridge
│   ├── bridge.cpp          # FFI implementation
│   └── CMakeLists.txt      # Build configuration
├── c_src/             # SIMD CPU backend
│   └── simd_ops.cpp        # NEON/AVX2 implementations
├── Tests/             # Comprehensive test suite
│   ├── ErrorTests.lean     # Error handling tests
│   ├── ShaderTests.lean    # Shader monad tests
│   └── ...
└── lakefile.lean      # Lake build script
```

## Roadmap

**Current Status**: Early Development (Alpha)

- [x] **Multi-precision SIMD CPU backend (Google Highway)**

- [x] **Architecture detection (NEON/AVX2/F16C)**

- [x] **Comprehensive error handling with structured error types**

- [x] **Complete test suite (error handling, shader monad)**

- [x] **Docker-based CI environment**

- [x] **Verified Composable Kernels (VerifiedOpFusion)**

- [x] **BitNet b1.58 inference engine (125 TPS on M4 Max)**

- [x] **Kernel fusion: fused gate+up+ReLU²×mul, fused KV cache write**

- [x] **KV cache with grouped-query attention**

- [x] **PreparedDispatch graph capture (99%+ cache hit rate)**

In Progress:

- [ ] Comprehensive tensor operation library (GEMM, Conv3D)

- [ ] Gemma 3 / Transformer support

- [ ] Automatic differentiation on GPU kernels

- [ ] Formal proofs of kernel numerical stability

- [ ] Integration with Lean's tactic framework

## Contributing

Hesper is part of the **Verilean** organization's effort to bring verified computing to GPUs.

### How to Contribute

1. **Fork the repository** and create a feature branch

2. **Make your changes** following the existing code style

3. **Run the test suite** to ensure nothing broke:

   ```bash

   # Core FFI boundary tests

   lake build ffi-tests

   ./.lake/build/bin/ffi-tests

   # GPU shader tests (raw WGSL + DSL)

   lake build simple-write

   ./.lake/build/bin/simple-write

   # SIMD tests (if you modified SIMD code)

   lake script run buildSimd

   lake build simd-test

   ./.lake/build/bin/simd-test

   ```

4. **Add tests** for new features (see `Examples/Tests/` for examples)

5. **Submit a pull request** with a clear description of changes

### Testing Guidelines

- **FFI changes**: Always run `ffi-tests` to verify Lean ↔ C++ data marshalling

- **DSL changes**: Run `simple-write` to verify WGSL code generation

- **GPU operations**: Test with real GPU hardware, not just compilation

- **SIMD changes**: Test on both ARM64 (NEON) and x86_64 (AVX2) if possible

- **Cross-platform**: macOS (Metal), Linux (Vulkan), Windows (D3D12/Vulkan)

### Code Organization for Contributors

```
Hesper/
├── Hesper/               # Core library
│   ├── WGSL/            # Type-safe shader DSL
│   ├── WebGPU/          # WebGPU bindings (Device, Buffer, Shader, Pipeline)
│   ├── Compute.lean     # High-level compute API
│   ├── Simd.lean        # SIMD Float64 operations
│   ├── Float32.lean     # SIMD Float32 operations
│   └── Float16.lean     # SIMD Float16 operations
├── Examples/             # Example programs (organized by category)
│   ├── DSL/             # DSL feature demonstrations
│   ├── Compute/         # GPU compute examples
│   ├── MachineLearning/ # Neural network training
│   ├── Graphics/        # GLFW rendering demos
│   ├── SIMD/            # CPU SIMD benchmarks
│   ├── Tests/           # Integration tests
│   └── Utilities/       # Helper utilities
├── Tests/                # Unit tests
│   ├── FFIBoundaryTests.lean  # Lean ↔ C++ data conversion tests
│   └── FusionTest.lean        # Operator fusion tests
├── native/               # C++ WebGPU bridge
│   ├── bridge.cpp       # FFI implementation (lean_hesper_* functions)
│   └── CMakeLists.txt   # Dawn build configuration
├── c_src/                # SIMD CPU backend
│   └── simd_ops.cpp     # NEON/AVX2 implementations
└── lakefile.lean         # Lake build script
```

**Key files for contributors:**

- **`native/bridge.cpp`**: FFI boundary - all Lean ↔ C++ data conversion happens here

- **`Hesper/WGSL/Monad.lean`**: ShaderM monad for imperative shader construction

- **`Hesper/WGSL/Execute.lean`**: Compiles ShaderM → WGSL and executes on GPU

- **`Examples/Tests/SimpleWrite.lean`**: Reference test showing raw WGSL vs DSL execution

- **`Tests/FFIBoundaryTests.lean`**: Reference test for FFI data conversion

### Links

- **Report Issues**: [GitHub Issues](https://github.com/Verilean/hesper/issues)

- **Discussions**: [GitHub Discussions](https://github.com/Verilean/hesper/discussions)

- **Sister Project**: [Sparkle HDL](https://github.com/Verilean/sparkle) - Verified hardware design in Lean 4

## Author

**Junji Hashimoto**

Twitter/X: [@junjihashimoto3](https://twitter.com/junjihashimoto3)

## License

Apache License 2.0 - see LICENSE file for details

## Acknowledgments

- **Google Dawn** for the WebGPU native implementation

- **Lean 4** for the foundation of verified programming

- **WebGPU Working Group** for the standard

- **gpu.cpp (Answer.AI):** High-level C++ API wrapper inspiration.

- **[llama.cpp](https://github.com/ggerganov/llama.cpp) (Georgi Gerganov & contributors):** Reference implementation for Gemma 4 architecture and Q4_K_M quantization. Its per-op trace (`llama-eval-callback`) is the golden oracle for Hesper's per-layer parity work, and its `ggml_gallocr` compute-buffer allocator inspired Hesper's `ScratchPool` design for eliminating per-forward `cuMemAlloc` churn.

---

*Write GPU code that's not just fast—make it correct by construction.*