https://github.com/masterskepticista/parallel_reductions_cuda
Iteratively optimizing parallel reductions in CUDA.
- Host: GitHub
- URL: https://github.com/masterskepticista/parallel_reductions_cuda
- Owner: MasterSkepticista
- Created: 2025-02-23T15:46:51.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-02-23T16:13:02.000Z (2 months ago)
- Last Synced: 2025-02-23T17:24:38.606Z (2 months ago)
- Topics: cuda, reduce-sum, reductions
- Language: Cuda
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Parallel Reductions in CUDA
Iteratively optimizing a `reduce_sum` operation in CUDA until it reaches >95% of the effective memory bandwidth of a `jnp.sum` reference on the same GPU. This code accompanies the blog post [Embarrassingly Parallel Reduction in CUDA](https://masterskepticista.github.io/posts/2025/02/reducesum/).
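The kernel variants themselves live in `reduce_sum.cu`. As a rough illustration of an early step in the sequence, a classic interleaved-addressing block reduction looks something like the sketch below. This is a minimal, hypothetical version for orientation, not the repo's exact code; the kernel and variable names are assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of an interleaved-addressing block reduction
// (kernel #2 in the results table). Each block reduces blockDim.x
// elements into one partial sum; a second pass combines the partials.
__global__ void reduce_interleaved(const float *in, float *out, int n) {
  extern __shared__ float smem[];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  smem[tid] = (i < n) ? in[i] : 0.0f;  // pad out-of-range reads with 0
  __syncthreads();

  // Tree reduction with interleaved addressing: the stride doubles each
  // step, and the modulo test idles most threads of each warp
  // (divergence), which the later kernels progressively eliminate.
  for (int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) smem[tid] += smem[tid + s];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = smem[0];
}

// Launch sketch:
// reduce_interleaved<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```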
### Results
Effective bandwidth achieved on an RTX 3090 (`N = 1<<25` elements):

| # | Kernel | Bandwidth (GB/s) | Relative to `jnp.sum` |
|---|--------|-----------------:|----------------------:|
| 1 | Vector Loads | 9.9 | 1.1% |
| 2 | Interleaved Addressing | 223 | 24.7% |
| 3 | Non-divergent Threads | 317 | 36.3% |
| 4 | Sequential Addressing | 331 | 38.0% |
| 5 | Reduce on First Loads | 618 | 70.9% |
| 6 | Warp Unrolling | 859 | 98.6% |
| 0 | `jnp.sum` (reference) | 871 | 100% |
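A reduction is memory-bound, so effective bandwidth here is presumably bytes read divided by kernel time (4·N bytes for a float32 input). The last two rows of the table come from reducing during the first loads and from unrolling the final warp. A hedged sketch of what such a warp-unrolled kernel might look like, using warp shuffles for the last 32 threads, follows; the block size of 256 and all names are illustrative assumptions, not the repo's exact code.

```cuda
#include <cuda_runtime.h>

// Once <= 32 threads remain, they form a single warp: no __syncthreads()
// is needed, and the last five steps can use warp shuffles.
__inline__ __device__ float warp_reduce_sum(float val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}

// Hypothetical combination of kernels #5 and #6: reduce on first load,
// then a shared-memory tree, then a barrier-free warp reduction.
__global__ void reduce_warp_unrolled(const float *in, float *out, int n) {
  __shared__ float smem[256];  // assumes blockDim.x == 256
  int tid = threadIdx.x;

  // Reduce on first load: each thread sums a grid-strided slice.
  float sum = 0.0f;
  for (int i = blockIdx.x * blockDim.x + tid; i < n;
       i += gridDim.x * blockDim.x)
    sum += in[i];
  smem[tid] = sum;
  __syncthreads();

  // Shared-memory tree reduction down to a single warp.
  for (int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s) smem[tid] += smem[tid + s];
    __syncthreads();
  }

  // Final warp reduces without barriers.
  if (tid < 32) {
    float v = smem[tid] + smem[tid + 32];
    v = warp_reduce_sum(v);
    if (tid == 0) out[blockIdx.x] = v;
  }
}
```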
### Run benchmarks
```bash
# Compile
nvcc -arch=native -O3 --use_fast_math reduce_sum.cu -lcublas -lcublasLt -o ./reduce_sum

# Run
./reduce_sum <1...6>
```

### Acknowledgements
Benchmarking setup borrowed from [karpathy/llm.c](https://github.com/karpathy/llm.c/).

### License
MIT