https://github.com/townsendmerino/wgpu

Minimal compute-only Go (CGO) binding over wgpu-native v29 — a drop-in for the cogentcore/webgpu subset, with full control over the WGSL dot4I8Packed builtin.
https://github.com/townsendmerino/wgpu

cgo compute dp4a go gpu metal webgpu wgpu-native wgsl

Last synced: 8 days ago
JSON representation

Minimal compute-only Go (CGO) binding over wgpu-native v29 — a drop-in for the cogentcore/webgpu subset, with full control over the WGSL dot4I8Packed builtin.

Host: GitHub
URL: https://github.com/townsendmerino/wgpu
Owner: townsendmerino
License: other
Created: 2026-06-18T01:56:30.000Z (10 days ago)
Default Branch: main
Last Pushed: 2026-06-18T02:55:22.000Z (10 days ago)
Last Synced: 2026-06-18T04:25:41.249Z (10 days ago)
Topics: cgo, compute, dp4a, go, gpu, metal, webgpu, wgpu-native, wgsl
Language: C++
Size: 54.7 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE

Awesome Lists containing this project

README

          # wgpu — minimal compute-only Go binding over wgpu-native v29.0.0.0

A small **CGO** binding over [wgpu-native](https://github.com/gfx-rs/wgpu-native)

**v29.0.0.0**, built for one purpose: full control over the WGSL

`dot4I8Packed` builtin from Go, with an API that is a **drop-in for the slice of

[`github.com/cogentcore/webgpu`](https://github.com/cogentcore/webgpu) that

goinfer's `./gpu` package uses**. Migrating goinfer is a near-mechanical import

swap.

## Why this exists

`go-webgpu` (the zero-CGO `goffi` binding) SIGABRTs at `RequestAdapter` on Go

1.26: `goffi` targets the Go 1.25 `crosscall2` callback ABI, which broke. **cgo

callbacks are robust across Go versions.** This binding uses cgo and statically

links a prebuilt `libwgpu_native.a` into a single binary.

## Status

Built and validated on **darwin/arm64 (Apple M1 Pro, Metal)** against the real

v29.0.0.0 static lib:

```

adapter: Apple M1 Pro | backend=Metal | type=IntegratedGPU

ABI validation: PASS (dot4I8Packed results match CPU reference)

dot4I8Packed :  38.7 Gdot4/s

scalar       :  12.3 Gdot4/s

speedup (scalar/dot4): 3.16x   →  GO ✅

```

`dot4I8Packed` needs **no device feature** for correctness (naga polyfills it

everywhere). There is **no packed-dot feature flag** in wgpu-native v29 — the

DP4A fast path is selected automatically by wgpu-core/naga. Detection is

empirical: compile the builtin and **measure** (see `cmd/dot4probe`).

## Layout

```

*.go                       the binding (package wgpu)

lib///libwgpu_native.a   vendored static libs

lib/webgpu.h, lib/wgpu.h   headers from the v29.0.0.0 release (match the libs)

lib/licenses/              wgpu-native MIT + Apache-2.0 texts

scripts/fetch.sh           re-vendor the libs/headers (build works offline after)

cmd/dot4probe/             Phase-2 go/no-go: ABI validation + DP4A measurement

```

Vendored platforms: `darwin/arm64`, `darwin/amd64`, `linux/amd64`,

`linux/arm64`, `windows/amd64` (GNU). Re-fetch with `bash scripts/fetch.sh`.

## Build & run

```sh

CGO_ENABLED=1 go build ./...

CGO_ENABLED=1 go test ./...          # ABI correctness test (skips without a GPU)

CGO_ENABLED=1 go run ./cmd/dot4probe # full DP4A measurement

```

## Migrating goinfer

goinfer's `./gpu` imports `github.com/cogentcore/webgpu/wgpu` as `wgpu`. The

exported type, method, and descriptor names here match that subset, so:

```sh

# in goinfer/gpu

grep -rl 'cogentcore/webgpu/wgpu' . | xargs sed -i '' \

  's#github.com/cogentcore/webgpu/wgpu#github.com/townsendmerino/wgpu#g'

```

The import alias stays `wgpu`, every `wgpu.X` call site is unchanged, and the

blocking call style (`RequestAdapter` returns `(*Adapter, error)`; `MapAsync` +

`Poll(true, nil)`) is preserved. v29's async futures are hidden behind

synchronous wrappers.

## Beyond the drop-in (v29 extras)

The cogentcore surface is mirrored exactly; on top of it this binding also

exposes v29-only capabilities useful for the dot4 work:

| Feature | API |

|---|---|

| GPU timestamp queries | `Device.CreateQuerySet`, `ComputePassDescriptor.TimestampWrites`, `CommandEncoder.ResolveQuerySet`, `Queue.GetTimestampPeriod` |

| Pipeline-overridable WGSL constants | `ProgrammableStageDescriptor.Constants []ConstantEntry` |

| Push-constant-equivalent immediates | `ComputePassEncoder.SetImmediates`, `NativeFeatureImmediates`, `Limits.MaxPushConstantSize` (→ `maxImmediateSize`) |

| Subgroup adapter info | `AdapterInfo.SubgroupMinSize/MaxSize`, `FeatureNameSubgroups`, `NativeFeatureSubgroup` |

| Batched dispatch recording (one CGO crossing for a whole chain; ~5× faster record) | `ComputePassEncoder.RecordSteps([]ComputeStep)` |

## License

MIT (this binding). Vendored wgpu-native blobs are MIT OR Apache-2.0 — see

`NOTICE` and `lib/licenses/`.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/townsendmerino/wgpu

Awesome Lists containing this project

README