https://github.com/flosmume/cpp-cuda-streams-and-pinned-mem

A CUDA C++ demo showing how to overlap data transfer and kernel execution using multiple streams and pinned (page-locked) host memory. This project illustrates asynchronous memcpy, event timing, and the performance benefits of concurrent GPU execution, all essential for building high-throughput pipelines.

# CUDA Streams & Pinned Memory: Overlap Compute & Transfers

## 🚀 Overview
This project demonstrates **how to overlap CUDA memory transfers and kernel execution** using:
- Multiple CUDA streams
- Pinned (page-locked) host memory
- Asynchronous `cudaMemcpyAsync`
- A simple SAXPY-like compute (`z = a*x + b`)

The goal is to show how **PCIe transfers** and **kernel compute** can run concurrently, with **host/device synchronization** kept to a minimum, to maximize GPU utilization.

---

## 📁 Project Structure
```
streams-and-pinned-mem/
│── CMakeLists.txt
│── overlap_streams.cu
│── README.md  ← (this file)
│── scripts/
│   └── check_cuda_streams_status.sh
│── build/  (generated)
```

---

## ✨ Key Concepts Demonstrated
### 1. CUDA Streams
Operations within a stream execute **in order**, while operations in different streams may run **concurrently**:
- Independent **compute** and **memcpy** paths
- Helps hide PCIe transfer latency
- Enables multi-chunk pipelining
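
Multi-chunk pipelining starts with splitting the input into per-stream ranges. The following is a minimal host-side sketch of one way to do that; the `Chunk` struct and `make_chunks` helper are hypothetical names for illustration and are not taken from `overlap_streams.cu`.

```cpp
#include <cstddef>
#include <vector>

// One contiguous slice of the input, intended for a single stream.
struct Chunk {
    std::size_t offset;  // index of the first element in the slice
    std::size_t count;   // number of elements in the slice
};

// Split `total` elements into `num_chunks` nearly equal slices.
// The first (total % num_chunks) slices receive one extra element.
std::vector<Chunk> make_chunks(std::size_t total, std::size_t num_chunks) {
    std::vector<Chunk> chunks;
    if (num_chunks == 0) return chunks;
    const std::size_t base = total / num_chunks;
    const std::size_t extra = total % num_chunks;
    std::size_t offset = 0;
    for (std::size_t i = 0; i < num_chunks; ++i) {
        const std::size_t count = base + (i < extra ? 1 : 0);
        chunks.push_back({offset, count});
        offset += count;
    }
    return chunks;
}
```

Each `Chunk` then supplies the pointer offset and element count for one stream's copies and kernel launch.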

### 2. Pinned (Page-Locked) Memory
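Pinned memory provides:
- Truly asynchronous DMA transfers (a copy from pageable memory is first staged through an internal pinned buffer)
- Higher effective host-device transfer bandwidth
- The precondition for overlapping copies with kernel execution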

Allocated using:
```cpp
cudaHostAlloc(&h_x, N*sizeof(float), cudaHostAllocDefault);
```
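
The allocation call above returns a `cudaError_t`, and pinned buffers need a matching `cudaFreeHost`. A minimal sketch of both follows; the helper names `alloc_pinned_floats` and `free_pinned` are hypothetical and only illustrate the pattern.

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Allocate `count` floats of page-locked host memory.
// Returns nullptr and prints the CUDA error string on failure.
float* alloc_pinned_floats(std::size_t count) {
    float* ptr = nullptr;
    cudaError_t err = cudaHostAlloc(reinterpret_cast<void**>(&ptr),
                                    count * sizeof(float),
                                    cudaHostAllocDefault);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaHostAlloc failed: %s\n",
                     cudaGetErrorString(err));
        return nullptr;
    }
    return ptr;
}

// Pinned memory is released with cudaFreeHost, not free() or delete[].
void free_pinned(float* ptr) {
    if (ptr != nullptr) cudaFreeHost(ptr);
}
```

Pinned pages cannot be swapped out by the operating system, so it is usually worth allocating only the buffers that take part in asynchronous copies.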

### 3. Overlapping Execution
The program uses **multiple streams**, each responsible for one chunk of the data:
```
H2D copy → Kernel → D2H copy
```
All streams operate concurrently, creating a pipeline.
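
The per-stream sequence can be sketched as below. This is an illustration under assumptions, not the code in `overlap_streams.cu`: the names `axpb` and `run_pipeline`, the block size of 256, and the equal-sized chunking are all hypothetical, and per-call error checks are omitted for brevity.

```cpp
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Element-wise z = a*x + b, the operation named in the Overview.
__global__ void axpb(const float* x, float* z, float a, float b, std::size_t n) {
    std::size_t i = blockIdx.x * static_cast<std::size_t>(blockDim.x) + threadIdx.x;
    if (i < n) z[i] = a * x[i] + b;
}

// h_x and h_z are assumed to be pinned host buffers of n floats;
// d_x and d_z are device buffers of n floats.
cudaError_t run_pipeline(const float* h_x, float* h_z, float* d_x, float* d_z,
                         float a, float b, std::size_t n, int num_streams) {
    if (num_streams < 1) return cudaErrorInvalidValue;
    std::vector<cudaStream_t> streams(num_streams);
    for (auto& s : streams) cudaStreamCreate(&s);

    const std::size_t chunk = (n + num_streams - 1) / num_streams;  // ceiling division
    const int threads = 256;
    for (int s = 0; s < num_streams; ++s) {
        const std::size_t off = static_cast<std::size_t>(s) * chunk;
        if (off >= n) break;
        const std::size_t cnt = (n - off < chunk) ? (n - off) : chunk;
        const std::size_t bytes = cnt * sizeof(float);
        const unsigned blocks = static_cast<unsigned>((cnt + threads - 1) / threads);

        // All three operations go to the same stream, so they run in order
        // for this chunk while work queued in other streams may overlap.
        cudaMemcpyAsync(d_x + off, h_x + off, bytes, cudaMemcpyHostToDevice, streams[s]);
        axpb<<<blocks, threads, 0, streams[s]>>>(d_x + off, d_z + off, a, b, cnt);
        cudaMemcpyAsync(h_z + off, d_z + off, bytes, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Block the host until every stream has drained, then clean up.
    cudaError_t status = cudaDeviceSynchronize();
    for (auto& s : streams) cudaStreamDestroy(s);
    return status;
}
```

The host thread only queues work inside the loop; the single `cudaDeviceSynchronize` at the end is the one point where it waits.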

---

## 📊 Timeline Diagram (Conceptual)

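```
Stream 0: [H2D][Compute][D2H]
Stream 1:      [H2D][Compute][D2H]
Stream 2:           [H2D][Compute][D2H]
Stream 3:                [H2D][Compute][D2H]
```

**Result:** PCIe transfers and kernels run **at the same time**, improving throughput. The stages are staggered because copies in the same direction share a copy engine and run one after another; the overlap comes from one chunk's kernel running while neighboring chunks are being copied.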
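
A rough feel for the benefit can be had from an idealized three-stage pipeline model. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes each stage (H2D copy, kernel, D2H copy) is a single resource that handles one chunk at a time, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <array>

// Idealized finish time for `chunks` chunks passing through three stages.
// Each chunk runs its stages in order; each stage serves one chunk at a time.
double pipelined_time(int chunks, double h2d, double kernel, double d2h) {
    const std::array<double, 3> cost = {h2d, kernel, d2h};
    std::array<double, 3> stage_free = {0.0, 0.0, 0.0};  // when each stage is next idle
    for (int c = 0; c < chunks; ++c) {
        double ready = 0.0;  // when this chunk's previous stage finished
        for (int s = 0; s < 3; ++s) {
            const double start = std::max(ready, stage_free[s]);
            ready = start + cost[s];
            stage_free[s] = ready;
        }
    }
    return stage_free[2];
}

// Time with no overlap: every stage of every chunk runs back to back.
double serial_time(int chunks, double h2d, double kernel, double d2h) {
    return chunks * (h2d + kernel + d2h);
}
```

With four chunks and each stage costing one time unit, the model gives 6 units pipelined against 12 units serial. Real speedups depend on the hardware's copy engines and on how the stage costs compare.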

---

## 🧮 Kernel Explanation
The compute is intentionally simple:
```cpp
z[i] = a * x[i] + b;
```
This allows the demo to focus on **stream behavior**, not algorithm complexity.
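
Because the compute is a single multiply-add per element, results copied back from the GPU are easy to check against a CPU loop. The sketch below shows one way to do that; `axpb_host` and `all_close` are hypothetical helper names, and the tolerance is an assumption that allows for the GPU fusing the multiply and add into one rounding step.

```cpp
#include <cmath>
#include <cstddef>

// CPU reference for the kernel's per-element operation.
void axpb_host(const float* x, float* z, float a, float b, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) z[i] = a * x[i] + b;
}

// True if every element of `got` is within `tol` of `want`,
// measured relative to |want| with a floor of 1.0.
bool all_close(const float* got, const float* want, std::size_t n,
               float tol = 1e-5f) {
    for (std::size_t i = 0; i < n; ++i) {
        const float scale = std::fmax(1.0f, std::fabs(want[i]));
        if (std::fabs(got[i] - want[i]) > tol * scale) return false;
    }
    return true;
}
```

A typical use is to fill a reference buffer with `axpb_host` and pass it as `want`, with the buffer copied back from the device as `got`.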

---

## 🛠 Build Instructions (Clean & Simple)

### **Prerequisites**
- Linux (WSL2 Ubuntu recommended)
- NVIDIA GPU + driver
- CUDA Toolkit installed system-wide (`/usr/local/cuda`)

### **Build**
```bash
rm -rf build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
```

### **Run**
```bash
./build/overlap_streams
```

---

## ✔ Verification Script
Included under `scripts/check_cuda_streams_status.sh`:

- Detects `nvcc`
- Detects GPU compute capability
- Confirms pinned memory support
- Prints all CUDA runtime library versions
- Warns if conda CUDA overrides system CUDA

Run:
```bash
bash scripts/check_cuda_streams_status.sh
```
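
The script's contents are not reproduced here. As a complement, a similar check can be made from inside a CUDA program with `cudaGetDeviceProperties`. The sketch below is an illustration only, and the function name `report_overlap_support` is hypothetical.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Print the properties of device `dev` that matter for copy/compute overlap.
// Returns false if the device cannot be queried.
bool report_overlap_support(int dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    std::printf("Device %d: %s (compute capability %d.%d)\n",
                dev, prop.name, prop.major, prop.minor);
    // asyncEngineCount: 0 means copies cannot overlap kernels, 1 means one
    // copy direction at a time can, 2 or more means both directions can.
    std::printf("Async copy engines: %d\n", prop.asyncEngineCount);
    std::printf("Concurrent kernels: %s\n", prop.concurrentKernels ? "yes" : "no");
    return true;
}
```

An `asyncEngineCount` of 0 would mean the overlap this demo targets is not available on that device.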

---

## 🧪 Tips for Success
### Avoid Conda CUDA Unless Needed
System CUDA is almost always safer:
```bash
which nvcc
# should be /usr/local/cuda/bin/nvcc
```

### Clear the Shell Hash Table After PATH Changes
```bash
hash -r
```

### Measure Overlap Efficiency
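Use Nsight Systems, which records a per-stream timeline of copies and kernels:
```bash
nsys profile --stats=true ./build/overlap_streams
```
`nvprof` still works on older GPUs, but it is deprecated and does not support devices of compute capability 8.0 or newer.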
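
Timing can also be taken inside the program with CUDA events. The sketch below is one possible pattern, not the project's own timing code: `elapsed_ms` is a hypothetical helper, and it assumes the legacy default stream, so the events recorded on stream 0 are ordered against work in streams created with `cudaStreamCreate`.

```cpp
#include <cuda_runtime.h>

// Measure the GPU-side time, in milliseconds, of the work that `enqueue`
// submits. Returns a negative value if event creation or waiting fails.
template <typename Enqueue>
float elapsed_ms(Enqueue enqueue) {
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);  // marker before the work
    enqueue();                  // caller queues copies and kernels here
    cudaEventRecord(stop, 0);   // marker after the work
    cudaError_t err = cudaEventSynchronize(stop);  // wait until `stop` is reached

    float ms = -1.0f;
    if (err == cudaSuccess) cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Timing the same workload once with a single stream and once with several gives a direct measure of how much the overlap helps.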

---

## 🔗 References
- NVIDIA CUDA Programming Guide
- "Streams and Concurrency" (official CUDA samples)
- Nsight Systems Profiling Tutorials

---

## 👤 Author
**Samuel Huang**
GitHub: **FlosMume**

---

## 📝 License
MIT License