https://github.com/flosmume/cpp-cuda-streams-and-pinned-mem
A CUDA C++ demo showing how to overlap data transfer and kernel execution using multiple streams and pinned (page-locked) host memory. This project illustrates asynchronous memcpy, event timing, and performance benefits of concurrent GPU execution — essential for building high-throughput pipelines.
- Host: GitHub
- URL: https://github.com/flosmume/cpp-cuda-streams-and-pinned-mem
- Owner: FlosMume
- Created: 2025-10-28T06:52:25.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2025-10-29T03:57:20.000Z (5 months ago)
- Last Synced: 2025-10-29T05:44:12.528Z (5 months ago)
- Topics: asynchronous-execution, cuda, cuda-streams, gpu, parallel-programming, performance-optimization, pinned-memory
- Language: C++
- Homepage:
- Size: 1.03 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# streams-and-pinned-mem
Overlap host↔device memory copies with GPU compute using **CUDA streams** and **pinned (page-locked) host memory**.
## Highlights
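- Uses `cudaMallocHost` for pinned host buffers, which `cudaMemcpyAsync` needs for truly asynchronous H2D/D2H transfers (copies from pageable host memory are staged by the driver and generally do not overlap with compute).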
- Partitions a large vector into chunks and pipelines **H2D copy → kernel → D2H copy** across multiple streams.
- Simple compute kernel with extra FLOPs to make overlap visible.
- Measures timings with CUDA **events** and prints effective bandwidth & speedup vs. single-stream baseline.
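The chunking and the reported metrics can be sketched in plain host-side C++. This is a minimal illustration, not the code in `src/overlap_streams.cu`: the names `Chunk`, `make_chunks`, `bandwidth_gbps`, and `speedup` are hypothetical, the round-robin assignment of chunks to streams is an assumption about how the pipeline is scheduled, and "effective bandwidth" is assumed to mean bytes moved divided by elapsed time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One unit of pipelined work: a contiguous slice of the vector and the
// stream that would carry its H2D copy, kernel launch, and D2H copy.
struct Chunk {
    std::size_t offset;  // index of the first element in the slice
    std::size_t count;   // number of elements in the slice
    int stream;          // index of the stream that processes the slice
};

// Split n elements into at most n_chunks near-equal slices and assign them
// to n_streams streams round-robin. (Hypothetical helper, not the repo's code.)
std::vector<Chunk> make_chunks(std::size_t n, int n_chunks, int n_streams) {
    assert(n_chunks > 0 && n_streams > 0);
    const std::size_t per = (n + n_chunks - 1) / n_chunks;  // ceiling division
    std::vector<Chunk> chunks;
    for (int i = 0; i < n_chunks; ++i) {
        const std::size_t offset = std::min(n, per * i);
        const std::size_t count = std::min(per, n - offset);
        if (count == 0) break;  // no elements left for further chunks
        chunks.push_back({offset, count, i % n_streams});
    }
    return chunks;
}

// Effective bandwidth in GB/s for `bytes` moved in `ms` milliseconds
// (assumed definition: total bytes transferred / elapsed time).
double bandwidth_gbps(double bytes, double ms) {
    return bytes / (ms * 1.0e-3) / 1.0e9;
}

// Speedup of the multi-stream run over the single-stream baseline.
double speedup(double baseline_ms, double streams_ms) {
    return baseline_ms / streams_ms;
}
```

In a CUDA program, each `Chunk` would typically drive one `cudaMemcpyAsync` to the device, one kernel launch, and one `cudaMemcpyAsync` back, all issued on the stream selected by `stream`, so that work in different streams can overlap.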
## Build & Run (Linux / WSL / Windows + NVCC)
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/overlap_streams
```
On Windows (PowerShell), the executable is at `build\Release\overlap_streams.exe`.
## Tunables
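Use environment variables to set the problem size, the number of streams, and the per-element compute work:
- `N` (default `16777216`, i.e., 2^24 elements): number of vector elements
- `N_STREAMS` (default `4`): number of CUDA streams
- `FLOP_ITERS` (default `256`): floating-point iterations per element; higher values increase compute work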
Example:
```bash
N=8388608 N_STREAMS=8 FLOP_ITERS=512 ./build/overlap_streams
```
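One way a program can read such tunables is sketched below. The helpers `parse_positive` and `env_or` are hypothetical names chosen for illustration, and the fallback behavior for malformed values is an assumption; the demo's actual parsing may differ.

```cpp
#include <cassert>
#include <cstdlib>

// Parse `raw` as a positive base-10 integer. Return `fallback` if the
// string is null, empty, has trailing characters, or is not positive.
long long parse_positive(const char* raw, long long fallback) {
    assert(fallback > 0);
    if (raw == nullptr || *raw == '\0') return fallback;
    char* end = nullptr;
    const long long value = std::strtoll(raw, &end, 10);
    if (*end != '\0' || value <= 0) return fallback;
    return value;
}

// Read the environment variable `name`, falling back to `fallback`
// when it is unset or invalid.
long long env_or(const char* name, long long fallback) {
    return parse_positive(std::getenv(name), fallback);
}
```

With these helpers, `env_or("N", 16777216)`, `env_or("N_STREAMS", 4)`, and `env_or("FLOP_ITERS", 256)` would reproduce the defaults listed above when the variables are not set.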
## Files
- `src/overlap_streams.cu` – demo program
- `CMakeLists.txt` – CUDA 12+ project config (targets Ada, SM 89 by default)
- `scripts/check_streams_status.sh` – quick GPU + build status and micro-benchmark helper