https://github.com/flosmume/cpp-cuda-streams-and-pinned-mem
A CUDA C++ demo showing how to overlap data transfer and kernel execution using multiple streams and pinned (page-locked) host memory. This project illustrates asynchronous memcpy, event timing, and the performance benefits of concurrent GPU execution, which are essential for building high-throughput pipelines.
- Host: GitHub
- URL: https://github.com/flosmume/cpp-cuda-streams-and-pinned-mem
- Owner: FlosMume
- Created: 2025-10-28T06:52:25.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-10-29T03:57:20.000Z (7 months ago)
- Last Synced: 2025-10-29T05:44:12.528Z (7 months ago)
- Topics: asynchronous-execution, cuda, cuda-streams, gpu, parallel-programming, performance-optimization, pinned-memory
- Language: C++
- Homepage:
- Size: 1.03 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CUDA Streams & Pinned Memory: Overlap Compute & Transfers
## Overview
This project demonstrates **how to overlap CUDA memory transfers and kernel execution** using:
- Multiple CUDA streams
- Pinned (page-locked) host memory
- Asynchronous `cudaMemcpyAsync`
- A simple SAXPY-like kernel (`z = a*x + b`)
The goal is to show how **PCIe transfers** and **kernel execution** can run concurrently, with **host/device synchronization** kept off the critical path, to maximize GPU utilization.
---
## Project Structure
```
streams-and-pinned-mem/
├── CMakeLists.txt
├── overlap_streams.cu
├── README.md        (this file)
├── scripts/
│   └── check_cuda_streams_status.sh
└── build/           (generated)
```
---
## Key Concepts Demonstrated
### 1. CUDA Streams
Operations within a stream execute **in order**, but operations in different streams can run **concurrently**:
- Independent **compute** and **memcpy** paths
- Helps hide PCIe transfer latency
- Enables multi-chunk pipelining
### 2. Pinned (Page-Locked) Memory
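Pinned memory provides:
- True asynchronous DMA transfers
- Higher PCIe bandwidth
- The precondition for overlapping copies with kernel execution (`cudaMemcpyAsync` on pageable memory is staged through an intermediate buffer and does not overlap reliably)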
Allocated using:
```cpp
cudaHostAlloc(&h_x, N*sizeof(float), cudaHostAllocDefault);
```
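A slightly fuller sketch of the allocation, with error checking and the matching release, might look like the following. `CUDA_CHECK`, `alloc_pinned`, and `free_pinned` are hypothetical helpers written for illustration; they are not taken from `overlap_streams.cu`.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,  \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical helper: allocate n floats of page-locked host memory.
float* alloc_pinned(std::size_t n) {
    float* p = nullptr;
    CUDA_CHECK(cudaHostAlloc(reinterpret_cast<void**>(&p), n * sizeof(float),
                             cudaHostAllocDefault));
    return p;
}

// Pinned memory must be released with cudaFreeHost, not free() or delete[].
void free_pinned(float* p) {
    CUDA_CHECK(cudaFreeHost(p));
}
```

Pinned pages cannot be swapped out by the operating system, so it is usually sensible to pin only the buffers that take part in transfers.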
### 3. Overlapping Execution
The program splits the data into chunks and assigns each chunk to its own stream. Within a stream, the operations run in this order:
```
H2D copy → Kernel → D2H copy
```
Because the streams are independent, one chunk's transfer can overlap another chunk's kernel, which creates a pipeline.
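The per-stream sequence can be sketched roughly as follows. This is a minimal illustration, not the code in `overlap_streams.cu`: the names `axpb` and `run_pipeline`, the parameter layout, and the block size of 256 are all assumptions. It also assumes `h_x` and `h_z` are pinned host buffers and `d_x` and `d_z` are device buffers of `n` floats each.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Element-wise z = a*x + b, as described in this README.
__global__ void axpb(const float* x, float* z, float a, float b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = a * x[i] + b;  // guard the tail of the last block
}

// Hypothetical driver: one chunk per stream, H2D -> kernel -> D2H in each.
// Return codes are ignored here to keep the sketch short.
void run_pipeline(const float* h_x, float* h_z, float* d_x, float* d_z,
                  int n, float a, float b, int num_streams) {
    std::vector<cudaStream_t> streams(num_streams);
    for (auto& s : streams) cudaStreamCreate(&s);

    const int chunk = (n + num_streams - 1) / num_streams;  // ceil(n / num_streams)
    for (int s = 0; s < num_streams; ++s) {
        const int offset = s * chunk;
        const int count = std::min(chunk, n - offset);
        if (count <= 0) break;  // more streams than chunks
        const std::size_t bytes = count * sizeof(float);

        cudaMemcpyAsync(d_x + offset, h_x + offset, bytes,
                        cudaMemcpyHostToDevice, streams[s]);
        axpb<<<(count + 255) / 256, 256, 0, streams[s]>>>(
            d_x + offset, d_z + offset, a, b, count);
        cudaMemcpyAsync(h_z + offset, d_z + offset, bytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }

    // The host waits only once, after all work has been queued.
    for (auto& s : streams) {
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
    }
}
```

All three calls inside the loop return immediately, so the host queues work for every stream before any of it has finished. That is what gives the hardware the chance to overlap one stream's copy with another stream's kernel.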
---
## Timeline Diagram (Conceptual)
```
Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
Stream 3:                [H2D][Compute][D2H]
```
**Result:** PCIe transfers and kernels run **at the same time**, improving throughput.
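To get a feel for how much the overlap can help, the diagram can be turned into a back-of-the-envelope estimate. The sketch below is an idealized model, not a measurement: it assumes each stage (H2D, kernel, D2H) runs on its own hardware resource, that every chunk costs an equal share of its stage's total time, and that there is no launch or transfer overhead. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Time when the three stages run back to back with no overlap.
double serial_time(double h2d, double kernel, double d2h) {
    return h2d + kernel + d2h;
}

// Idealized time when the work is split into `chunks` equal pieces and the
// three stages are pipelined across them.
double pipelined_time(double h2d, double kernel, double d2h, int chunks) {
    assert(chunks >= 1);
    // The first chunk has to pass through all three stages.
    const double fill = (h2d + kernel + d2h) / chunks;
    // Every later chunk finishes one bottleneck-stage step after the previous one.
    const double slowest = std::max({h2d, kernel, d2h}) / chunks;
    return fill + (chunks - 1) * slowest;
}
```

With made-up stage times of 4 ms, 2 ms, and 4 ms, the serial estimate is 10 ms and the four-chunk pipelined estimate is 5.5 ms. Real runs will land somewhere between the two, because very small chunks use PCIe less efficiently and each launch adds overhead.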
---
## Kernel Explanation
The compute is intentionally simple:
```cpp
z[i] = a * x[i] + b;
```
This allows the demo to focus on **stream behavior**, not algorithm complexity.
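A simple kernel also makes the result easy to verify on the host. The following is a hedged sketch of such a check; `max_abs_error` is a hypothetical helper, and the use of `std::vector` for the host copies is an assumption made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute difference between z and the host reference a*x[i] + b.
float max_abs_error(const std::vector<float>& x, const std::vector<float>& z,
                    float a, float b) {
    assert(x.size() == z.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float expected = a * x[i] + b;
        worst = std::fmax(worst, std::fabs(z[i] - expected));
    }
    return worst;
}
```

The GPU may fuse the multiply and add into a single FMA instruction, so it is safer to compare the returned error against a small tolerance than to require exact equality.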
---
## Build Instructions (Clean & Simple)
### **Prerequisites**
- Linux (WSL2 Ubuntu recommended)
- NVIDIA GPU + driver
- CUDA Toolkit installed system-wide (`/usr/local/cuda`)
### **Build**
```bash
rm -rf build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```
### **Run**
```bash
./build/overlap_streams
```
---
## Verification Script
Included under `scripts/check_cuda_streams_status.sh`:
- Detects `nvcc`
- Detects GPU compute capability
- Confirms pinned memory support
- Prints all CUDA runtime library versions
- Warns if conda CUDA overrides system CUDA
Run:
```bash
bash scripts/check_cuda_streams_status.sh
```
---
## Tips for Success
### Avoid Conda CUDA Unless Needed
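A system-wide CUDA install is usually the safer choice, because a conda-provided toolkit can shadow it on `PATH`. Check which `nvcc` is picked up: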
```bash
which nvcc
# should be /usr/local/cuda/bin/nvcc
```
### Clear the Shell's Command Hash After PATH Changes
```bash
hash -r
```
### Measure Overlap Efficiency
Use:
```bash
nvprof ./build/overlap_streams
```
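or Nsight Systems (`nsys profile ./build/overlap_streams`). `nvprof` is deprecated and does not support GPUs with compute capability 8.0 or higher, so prefer Nsight Systems on Ampere and newer hardware.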
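For a quick number without a profiler, CUDA events can time the whole pipeline from inside the program. The sketch below is one possible way to do it. `time_gpu_ms` is a hypothetical helper, and it assumes the default (legacy) stream behavior: events recorded on stream 0 are ordered against streams created with `cudaStreamCreate`. That assumption does not hold for non-blocking streams or for builds using `--default-stream per-thread`.

```cpp
#include <cuda_runtime.h>

// Hypothetical helper: returns the GPU time in milliseconds spent on
// whatever work `enqueue` submits. Return codes are ignored for brevity.
template <typename F>
float time_gpu_ms(F enqueue) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);   // recorded on the legacy default stream
    enqueue();                   // e.g. queue all copies and kernels
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block the host until `stop` has been reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Timing the same workload once with a single stream and once with several gives a rough measure of how much the overlap saves.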
---
## References
- NVIDIA CUDA Programming Guide
- "Streams and Concurrency" (official CUDA samples)
- Nsight Systems Profiling Tutorials
---
## Author
**Samuel Huang**
GitHub: **FlosMume**
---
## License
MIT License