https://github.com/pwhiddy/webgpu-atomics-benchmark
Atomics Benchmark using WebGPU
- Host: GitHub
- URL: https://github.com/pwhiddy/webgpu-atomics-benchmark
- Owner: PWhiddy
- License: mit
- Created: 2024-04-03T21:21:16.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-10-29T20:23:20.000Z (11 months ago)
- Last Synced: 2025-03-17T19:11:37.881Z (7 months ago)
- Language: HTML
- Homepage: https://pwhiddy.github.io/webgpu-atomics-benchmark/
- Size: 29.3 KB
- Stars: 5
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
### WebGPU Atomics Benchmark
A simple test of the throughput of atomics on your GPU using WebGPU.
While building a custom GPU memory allocator for my game engine, I've been relying heavily on atomics. This project was inspired by an old [CUDA blog post on warp-aggregated atomics](https://developer.nvidia.com/blog/cuda-pro-tip-optimized-filtering-warp-aggregated-atomics/), which demonstrated that compiler magic can counterintuitively make certain GPU atomics extremely fast (faster than on a CPU). I've been curious whether the same holds true for modern APIs and GPUs in general. The results indicate that even on non-NVIDIA hardware and through high-level APIs such as WebGPU, these optimizations are clearly available!
*PRs adding results for your GPU are welcome!*
----
The current configuration is 32 atomic adds per thread across a total of 15 million threads, all writing to a single global memory address.
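As a quick sanity check on the configuration above, the total atomic adds per dispatch and the implied dispatch time can be worked out (a rough sketch; the 20B ops/s figure is the M1 Max row from the results table below):

```python
# Benchmark configuration from the README: 32 atomic adds per thread,
# 15 million threads, all targeting one global memory address.
adds_per_thread = 32
num_threads = 15_000_000
total_atomic_adds = adds_per_thread * num_threads  # 480,000,000 adds per dispatch

# At a measured rate of 20 billion ops/s (M1 Max row below),
# a single dispatch would take roughly:
ops_per_second = 20e9
dispatch_time_ms = total_atomic_adds / ops_per_second * 1000
print(total_atomic_adds, dispatch_time_ms)  # 480000000 adds, 24.0 ms
```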
| GPU | Max Bandwidth | Ops/s | Bandwidth Utilization* |
|----- | ----- | ----- | ----- |
| M1 Max | 400 GB/s | 20B | 40% |
| RTX 4090 | 1008 GB/s | 62B | 49% |

\*This may not be actual global memory utilization, but the utilization that would be required if operations were not aggregated prior to global memory.
----
### Find out your GPU's performance
1. Go to https://pwhiddy.github.io/webgpu-atomics-benchmark/
2. Copy the result: `Operations per second`
3. Calculate your bandwidth utilization using this formula:
```python
# Example values shown are the M1 Max row of the table above;
# replace them with your own numbers.
operations_per_second = 20e9  # <- your benchmark result here
gpu_max_bandwidth = 400e9     # <- your GPU's max bandwidth in bytes/s (look this up online)
# each atomic op implies 1 read + 1 write of a 4-byte u32
bandwidth_utilized = ((operations_per_second * 4 * 2) / gpu_max_bandwidth) * 100  # 40.0
```
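Applied to the RTX 4090 row of the table above (values taken from the table, with bandwidth expressed in bytes per second), the formula reproduces the reported utilization:

```python
operations_per_second = 62e9  # RTX 4090 Ops/s from the table
gpu_max_bandwidth = 1008e9    # RTX 4090 max bandwidth, in bytes/s
# 1 read + 1 write of a 4-byte u32 per atomic op
bandwidth_utilized = ((operations_per_second * 4 * 2) / gpu_max_bandwidth) * 100
print(round(bandwidth_utilized, 1))  # 49.2, matching the ~49% in the table
```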