https://github.com/townsendmerino/wgpu
Minimal compute-only Go (CGO) binding over wgpu-native v29 — a drop-in for the cogentcore/webgpu subset, with full control over the WGSL dot4I8Packed builtin.
https://github.com/townsendmerino/wgpu
cgo compute dp4a go gpu metal webgpu wgpu-native wgsl
Last synced: 8 days ago
JSON representation
Minimal compute-only Go (CGO) binding over wgpu-native v29 — a drop-in for the cogentcore/webgpu subset, with full control over the WGSL dot4I8Packed builtin.
- Host: GitHub
- URL: https://github.com/townsendmerino/wgpu
- Owner: townsendmerino
- License: other
- Created: 2026-06-18T01:56:30.000Z (10 days ago)
- Default Branch: main
- Last Pushed: 2026-06-18T02:55:22.000Z (10 days ago)
- Last Synced: 2026-06-18T04:25:41.249Z (10 days ago)
- Topics: cgo, compute, dp4a, go, gpu, metal, webgpu, wgpu-native, wgsl
- Language: C++
- Size: 54.7 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
Awesome Lists containing this project
README
# wgpu — minimal compute-only Go binding over wgpu-native v29.0.0.0
A small **CGO** binding over [wgpu-native](https://github.com/gfx-rs/wgpu-native)
**v29.0.0.0**, built for one purpose: full control over the WGSL
`dot4I8Packed` builtin from Go, with an API that is a **drop-in for the slice of
[`github.com/cogentcore/webgpu`](https://github.com/cogentcore/webgpu) that
goinfer's `./gpu` package uses**. Migrating goinfer is a near-mechanical import
swap.
## Why this exists
`go-webgpu` (the zero-CGO `goffi` binding) SIGABRTs at `RequestAdapter` on Go
1.26: `goffi` targets the Go 1.25 `crosscall2` callback ABI, which broke. **cgo
callbacks are robust across Go versions.** This binding uses cgo and statically
links a prebuilt `libwgpu_native.a` into a single binary.
## Status
Built and validated on **darwin/arm64 (Apple M1 Pro, Metal)** against the real
v29.0.0.0 static lib:
```
adapter: Apple M1 Pro | backend=Metal | type=IntegratedGPU
ABI validation: PASS (dot4I8Packed results match CPU reference)
dot4I8Packed : 38.7 Gdot4/s
scalar : 12.3 Gdot4/s
speedup (scalar/dot4): 3.16x → GO ✅
```
`dot4I8Packed` needs **no device feature** for correctness (naga polyfills it
everywhere). There is **no packed-dot feature flag** in wgpu-native v29 — the
DP4A fast path is selected automatically by wgpu-core/naga. Detection is
empirical: compile the builtin and **measure** (see `cmd/dot4probe`).
## Layout
```
*.go the binding (package wgpu)
lib///libwgpu_native.a vendored static libs
lib/webgpu.h, lib/wgpu.h headers from the v29.0.0.0 release (match the libs)
lib/licenses/ wgpu-native MIT + Apache-2.0 texts
scripts/fetch.sh re-vendor the libs/headers (build works offline after)
cmd/dot4probe/ Phase-2 go/no-go: ABI validation + DP4A measurement
```
Vendored platforms: `darwin/arm64`, `darwin/amd64`, `linux/amd64`,
`linux/arm64`, `windows/amd64` (GNU). Re-fetch with `bash scripts/fetch.sh`.
## Build & run
```sh
CGO_ENABLED=1 go build ./...
CGO_ENABLED=1 go test ./... # ABI correctness test (skips without a GPU)
CGO_ENABLED=1 go run ./cmd/dot4probe # full DP4A measurement
```
## Migrating goinfer
goinfer's `./gpu` imports `github.com/cogentcore/webgpu/wgpu` as `wgpu`. The
exported type, method, and descriptor names here match that subset, so:
```sh
# in goinfer/gpu
grep -rl 'cogentcore/webgpu/wgpu' . | xargs sed -i '' \
's#github.com/cogentcore/webgpu/wgpu#github.com/townsendmerino/wgpu#g'
```
The import alias stays `wgpu`, every `wgpu.X` call site is unchanged, and the
blocking call style (`RequestAdapter` returns `(*Adapter, error)`; `MapAsync` +
`Poll(true, nil)`) is preserved. v29's async futures are hidden behind
synchronous wrappers.
## Beyond the drop-in (v29 extras)
The cogentcore surface is mirrored exactly; on top of it this binding also
exposes v29-only capabilities useful for the dot4 work:
| Feature | API |
|---|---|
| GPU timestamp queries | `Device.CreateQuerySet`, `ComputePassDescriptor.TimestampWrites`, `CommandEncoder.ResolveQuerySet`, `Queue.GetTimestampPeriod` |
| Pipeline-overridable WGSL constants | `ProgrammableStageDescriptor.Constants []ConstantEntry` |
| Push-constant-equivalent immediates | `ComputePassEncoder.SetImmediates`, `NativeFeatureImmediates`, `Limits.MaxPushConstantSize` (→ `maxImmediateSize`) |
| Subgroup adapter info | `AdapterInfo.SubgroupMinSize/MaxSize`, `FeatureNameSubgroups`, `NativeFeatureSubgroup` |
| Batched dispatch recording (one CGO crossing for a whole chain; ~5× faster record) | `ComputePassEncoder.RecordSteps([]ComputeStep)` |
## License
MIT (this binding). Vendored wgpu-native blobs are MIT OR Apache-2.0 — see
`NOTICE` and `lib/licenses/`.