https://github.com/williamzhang20/cuda-practice
Exercises in CUDA
- Host: GitHub
- URL: https://github.com/williamzhang20/cuda-practice
- Owner: WilliamZhang20
- Created: 2025-02-23T17:54:06.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2025-03-23T03:09:55.000Z (over 1 year ago)
- Last Synced: 2025-03-23T03:26:03.174Z (over 1 year ago)
- Topics: cuda, n-body-problem
- Language: Cuda
- Homepage:
- Size: 97.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CUDA Practice
A collection of exercises for learning and practicing parallel algorithms in CUDA.
Implemented as part of the course "Getting Started with Accelerated Computing in CUDA C/C++".
Contains a final project to engineer a GPU-accelerated simulation of gravitationally-induced motion between many bodies, aka the N-Body Problem.
See section below on [the final project](#the-final-project).
## Contents:
Section 1: Introduction to CUDA
- Heat Conduction
- Matrix Multiplication
- Adding two vectors with strides
Section 2: Unified Memory
- Adding 2 very large vectors
- Fast implementation of SAXPY
Section 3: Streaming & Visual Profiling
- Initializing memory with CUDA streams
- Solving the n-body problem with 4096 & 65536 bodies.
## The Final Project
The final assignment was to simulate the n-body problem on a GPU, including the force, velocity, and position computations, by accelerating a provided CPU-only implementation with CUDA.
The overall mathematical procedure is simple. First, the force on each mass from every other mass is calculated in 3D space using Newton's law of universal gravitation.
Once that is done, the calculated forces are used to update the velocities and obtain the new positions of all masses.
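
For reference, the two steps above can be written out as follows. This is a sketch under assumptions: the symbols ($m_i$, $\mathbf{r}_i$, $\mathbf{v}_i$, $\Delta t$) and the simple Euler-style update are illustrative choices, not taken from the course code, and practical implementations often add a small softening term to the denominator to avoid division by zero.

$$
\mathbf{F}_i = G \sum_{j \neq i} \frac{m_i m_j \, (\mathbf{r}_j - \mathbf{r}_i)}{\lVert \mathbf{r}_j - \mathbf{r}_i \rVert^{3}}
$$

$$
\mathbf{v}_i \leftarrow \mathbf{v}_i + \frac{\mathbf{F}_i}{m_i} \, \Delta t,
\qquad
\mathbf{r}_i \leftarrow \mathbf{r}_i + \mathbf{v}_i \, \Delta t
$$

The force step costs $O(N^2)$ work per iteration because every body interacts with every other body, while the position update is only $O(N)$.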
The development was iterative and profile-driven, with each version profiled in Nsight Systems to identify bottlenecks and memory usage patterns.
In my first successful version, only the force calculations were parallelized. While this "passed" the threshold, it was inefficient: each of the 10 iterations involved a transfer between host and device.
In my second version, I restricted memory transfers to before and after the full set of iterations, and the interaction computation within each iteration was entirely parallelized. This meant that position integration had to become a separate kernel, which takes the device address of the array as an argument.
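
The structure of this second version might look roughly like the sketch below. It is not the repository's code: the `Body` struct, the kernel names `bodyForce` and `integratePositions`, the host function `simulate`, the unit masses with G = 1, and the softening constant are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical body layout: position and velocity (unit mass, G = 1 assumed).
struct Body { float x, y, z, vx, vy, vz; };

// Assumed softening value; keeps the i == j term finite.
constexpr float kSoftening = 1e-9f;

// Grid-stride loop over bodies: accumulate the pull of every other body,
// then update this body's velocity. Positions are only read here.
__global__ void bodyForce(Body *p, float dt, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        float fx = 0.0f, fy = 0.0f, fz = 0.0f;
        for (int j = 0; j < n; ++j) {
            float dx = p[j].x - p[i].x;
            float dy = p[j].y - p[i].y;
            float dz = p[j].z - p[i].z;
            float distSqr = dx * dx + dy * dy + dz * dz + kSoftening;
            float invDist = rsqrtf(distSqr);
            float invDist3 = invDist * invDist * invDist;
            fx += dx * invDist3;
            fy += dy * invDist3;
            fz += dz * invDist3;
        }
        p[i].vx += dt * fx;
        p[i].vy += dt * fy;
        p[i].vz += dt * fz;
    }
}

// Separate kernel: advance positions using the freshly updated velocities.
__global__ void integratePositions(Body *p, float dt, int n) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        p[i].x += p[i].vx * dt;
        p[i].y += p[i].vy * dt;
        p[i].z += p[i].vz * dt;
    }
}

// Host driver: one copy in, nIters kernel pairs, one copy out.
// Returns false if a CUDA call reports an error.
bool simulate(std::vector<Body> &bodies, float dt, int nIters) {
    int n = static_cast<int>(bodies.size());
    if (n == 0) return true;  // nothing to simulate
    size_t bytes = static_cast<size_t>(n) * sizeof(Body);

    Body *dBodies = nullptr;
    if (cudaMalloc(&dBodies, bytes) != cudaSuccess) return false;
    if (cudaMemcpy(dBodies, bodies.data(), bytes,
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dBodies);
        return false;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    for (int iter = 0; iter < nIters; ++iter) {
        // Both kernels receive the same device pointer; launches on the
        // default stream run in order, so no host transfer is needed here.
        bodyForce<<<blocks, threads>>>(dBodies, dt, n);
        integratePositions<<<blocks, threads>>>(dBodies, dt, n);
    }

    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(bodies.data(), dBodies, bytes,
                                     cudaMemcpyDeviceToHost);
    cudaFree(dBodies);
    return launchErr == cudaSuccess && copyErr == cudaSuccess;
}
```

Splitting integration into its own kernel also gives a grid-wide ordering point: no position changes until every thread has finished reading positions in the force kernel. Calling `simulate(bodies, 0.01f, 10)` on a host-side `std::vector<Body>` would run ten iterations with exactly two host-device copies.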
In my third version, I tried incorporating streams for further acceleration, since they were introduced in the same section. However, that was clearly not the right solution: it adds unnecessary overhead by implementing the same logic across several sections of memory when they all access the same data and the threads do not diverge on branches.
Here is what the Nsight Systems (`nsys`) profile looks like for the best version by far:
- The green at the start and pink at the end are memory transfers.
- The big blue blocks in the middle are the force calculations - they can still be accelerated with streams.
- Between the big blue blocks, one can notice a few ticks. These are the position integration kernel launches.
