https://github.com/masterskepticista/parallel_reductions_cuda
Iteratively optimizing parallel reductions in CUDA.
- Host: GitHub
- URL: https://github.com/masterskepticista/parallel_reductions_cuda
- Owner: MasterSkepticista
- Created: 2025-02-23T15:46:51.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-02-23T16:13:02.000Z (2 months ago)
- Last Synced: 2025-02-23T17:24:38.606Z (2 months ago)
- Topics: cuda, reduce-sum, reductions
- Language: Cuda
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Parallel Reductions in CUDA
Iteratively optimizing a `reduce_sum` operation in CUDA until it reaches >95% of the effective memory bandwidth of a `jnp.sum` reference on the same GPU. This code accompanies the blog post [Embarrassingly Parallel Reduction in CUDA](https://masterskepticista.github.io/posts/2025/02/reducesum/).
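The kernel variants themselves live in `reduce_sum.cu`. As a rough illustration of an early step in the sequence, a classic interleaved-addressing block reduction looks something like the sketch below. This is a minimal, hypothetical version for orientation, not the repo's exact code; the kernel and variable names are assumptions.

```cuda
#include <cuda_runtime.h>

// Hypothetical sketch of an interleaved-addressing block reduction
// (kernel #2 in the results table). Each block reduces blockDim.x
// elements into one partial sum; a second pass combines the partials.
__global__ void reduce_interleaved(const float *in, float *out, int n) {
  extern __shared__ float smem[];
  int tid = threadIdx.x;
  int i = blockIdx.x * blockDim.x + tid;
  smem[tid] = (i < n) ? in[i] : 0.0f;  // pad out-of-range reads with 0
  __syncthreads();

  // Tree reduction with interleaved addressing: the stride doubles each
  // step, and the modulo test idles most threads of each warp
  // (divergence), which the later kernels progressively eliminate.
  for (int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) smem[tid] += smem[tid + s];
    __syncthreads();
  }
  if (tid == 0) out[blockIdx.x] = smem[0];
}

// Launch sketch:
// reduce_interleaved<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_out, n);
```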
### Results
Effective bandwidth achieved on an RTX 3090 (`N = 1<<25` elements):

| # | Kernel | Bandwidth (GB/s) | Relative to `jnp.sum` |
|---|--------|-----------------:|----------------------:|
| 1 | Vector Loads | 9.9 | 1.1% |
| 2 | Interleaved Addressing | 223 | 24.7% |
| 3 | Non-divergent Threads | 317 | 36.3% |
| 4 | Sequential Addressing | 331 | 38.0% |
| 5 | Reduce on First Loads | 618 | 70.9% |
| 6 | Warp Unrolling | 859 | 98.6% |
| 0 | `jnp.sum` (reference) | 871 | 100% |
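A reduction is memory-bound, so effective bandwidth here is presumably bytes read divided by kernel time (4·N bytes for a float32 input). The last two rows of the table come from reducing during the first loads and from unrolling the final warp. A hedged sketch of what such a warp-unrolled kernel might look like, using warp shuffles for the last 32 threads, follows; the block size of 256 and all names are illustrative assumptions, not the repo's exact code.

```cuda
#include <cuda_runtime.h>

// Once <= 32 threads remain, they form a single warp: no __syncthreads()
// is needed, and the last five steps can use warp shuffles.
__inline__ __device__ float warp_reduce_sum(float val) {
  for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xffffffff, val, offset);
  return val;
}

// Hypothetical combination of kernels #5 and #6: reduce on first load,
// then a shared-memory tree, then a barrier-free warp reduction.
__global__ void reduce_warp_unrolled(const float *in, float *out, int n) {
  __shared__ float smem[256];  // assumes blockDim.x == 256
  int tid = threadIdx.x;

  // Reduce on first load: each thread sums a grid-strided slice.
  float sum = 0.0f;
  for (int i = blockIdx.x * blockDim.x + tid; i < n;
       i += gridDim.x * blockDim.x)
    sum += in[i];
  smem[tid] = sum;
  __syncthreads();

  // Shared-memory tree reduction down to a single warp.
  for (int s = blockDim.x / 2; s > 32; s >>= 1) {
    if (tid < s) smem[tid] += smem[tid + s];
    __syncthreads();
  }

  // Final warp reduces without barriers.
  if (tid < 32) {
    float v = smem[tid] + smem[tid + 32];
    v = warp_reduce_sum(v);
    if (tid == 0) out[blockIdx.x] = v;
  }
}
```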
### Run benchmarks
```bash
# Compile
nvcc -arch=native -O3 --use_fast_math reduce_sum.cu -lcublas -lcublasLt -o ./reduce_sum

# Run
./reduce_sum <1...6>
```

### Acknowledgements
Benchmarking setup borrowed from [karpathy/llm.c](https://github.com/karpathy/llm.c/).

### License
MIT