https://github.com/jmaczan/cuda_cpp_programming_guide
Working through NVIDIA's CUDA C++ Programming Guide - version 12.8
https://github.com/jmaczan/cuda_cpp_programming_guide
Last synced: about 1 year ago
JSON representation
Working through NVIDIA's CUDA C++ Programming Guide - version 12.8
- Host: GitHub
- URL: https://github.com/jmaczan/cuda_cpp_programming_guide
- Owner: jmaczan
- Created: 2025-04-12T18:52:41.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-13T08:19:13.000Z (about 1 year ago)
- Last Synced: 2025-04-13T09:27:05.151Z (about 1 year ago)
- Language: Cuda
- Size: 222 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# cuda_cpp_programming_guide
Working through NVIDIA's CUDA C++ Programming Guide - version 12.8
## Notes
Thread -> Block -> Grid
Typically 1024 threads per block
- Block (Thread Block): A group of threads. Threads within the same block can:
- Cooperate: By synchronizing their execution using __syncthreads(). This is a barrier; threads reaching it wait until all threads in the block reach it before any proceed.
- Share Data: Using a fast, on-chip shared memory (__shared__). This is much faster than global device memory.
<<< GridSize, BlockSize, SharedMemoryBytes, Stream >>>
- GridSize: Specifies the dimensions of the grid (how many blocks). It can be an int (for 1D) or a dim3 type (for 1D, 2D, or 3D). dim3(Gx, Gy, Gz) means Gx * Gy * Gz total blocks.
- BlockSize: Specifies the dimensions of each block (how many threads per block). It can be an int or a dim3. dim3(Bx, By, Bz) means Bx * By * Bz threads per block. The total threads launched is GridSize * BlockSize.
- SharedMemoryBytes (Optional): Amount of dynamic shared memory to allocate per block (defaults to 0).
- Stream (Optional): Specifies the CUDA stream for asynchronous execution (defaults to the default stream 0).