https://github.com/flosmume/cpp-cuda-deepvision-rtx-starter
CUDA C++ practice project for the RTX 4070 SUPER: explore GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels for practicing optimization, stream overlap, and timing analysis, aimed at the NVIDIA Developer Technology Engineering skill set.
- Host: GitHub
- URL: https://github.com/flosmume/cpp-cuda-deepvision-rtx-starter
- Owner: FlosMume
- Created: 2025-10-29T03:43:40.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-10-29T08:05:50.000Z (3 months ago)
- Last Synced: 2025-10-29T10:12:29.852Z (3 months ago)
- Topics: cpp, cuda, cuda-kernels, cuda-streams, deep-learning-inference, gpu, gpu-optimization, gpu-profiling, high-performance-computing, nsight, nvidia, parrallel-computing, pinned-memory
- Language: C++
- Homepage:
- Size: 3.91 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# DeepVision-RTX (Starter)
A CUDA C++ practice project targeting the RTX 4070 SUPER (Ada Lovelace, compute capability 8.9). It demonstrates how to overlap data transfers with computation using streams and pinned memory, how to apply basic kernel optimizations with 1D and 2D grid configurations, and how to time GPU work with CUDA events alongside profiling in Nsight Systems and Nsight Compute.
## What’s here?
- **Pinned host memory** + `cudaMemcpyAsync` to demonstrate overlap
- **Multiple streams** for concurrent copy/compute
- **Timed sections** with `cudaEventRecord` / `cudaEventElapsedTime`
- **Kernels**: `saxpy` (1D), `blur3x3_naive` (2D)
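The host-side arithmetic behind the items above can be sketched in plain C++. Ceiling division sizes a 1D or 2D grid so that every element is covered by a thread, and a buffer is split into contiguous chunks assigned to streams round-robin, so that one chunk's copy can overlap another chunk's kernel. This is a minimal sketch, not the repository's code: `ceil_div`, `Grid2D`, `grid_2d`, `Chunk`, and `split_into_chunks` are hypothetical names introduced here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: blocks needed to cover n items with `block` threads per block.
inline std::size_t ceil_div(std::size_t n, std::size_t block) {
    assert(block > 0);
    return (n + block - 1) / block;
}

// Hypothetical 2D grid size, e.g. for a blur kernel over a width x height image.
struct Grid2D {
    std::size_t x;
    std::size_t y;
};

inline Grid2D grid_2d(std::size_t width, std::size_t height,
                      std::size_t block_x, std::size_t block_y) {
    return {ceil_div(width, block_x), ceil_div(height, block_y)};
}

// Hypothetical description of one slice of a buffer and the stream that handles it.
struct Chunk {
    std::size_t offset;  // index of the first element in the slice
    std::size_t count;   // number of elements in the slice
    int stream;          // index of the stream that copies and processes the slice
};

// Split n elements into chunks of at most chunk_elems elements,
// assigning chunks to streams round-robin. The last chunk may be shorter.
inline std::vector<Chunk> split_into_chunks(std::size_t n, std::size_t chunk_elems,
                                            int num_streams) {
    assert(chunk_elems > 0 && num_streams > 0);
    std::vector<Chunk> chunks;
    for (std::size_t off = 0; off < n; off += chunk_elems) {
        const std::size_t count = std::min(chunk_elems, n - off);
        const int stream =
            static_cast<int>(chunks.size() % static_cast<std::size_t>(num_streams));
        chunks.push_back({off, count, stream});
    }
    return chunks;
}
```

For example, `ceil_div(1000, 256)` gives 4 blocks, and `split_into_chunks(10, 4, 2)` yields chunks of 4, 4, and 2 elements on streams 0, 1, and 0. In a CUDA program, each chunk would typically get its own `cudaMemcpyAsync` calls and kernel launch issued on its stream, with threads past the end of the data guarded by a bounds check.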
## Build
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/deepvision_rtx
```
## Profile (examples)
```bash
# Nsight Systems CLI (requires the CUDA Toolkit on the host): record a timeline report to open in the GUI
nsys profile -o nsys_report ./build/deepvision_rtx
# Nsight Compute CLI: collect the full metric set for every kernel launch
ncu --set full --target-processes all ./build/deepvision_rtx
```
## Next steps
- Convert blur kernel to **shared-memory tiled** version
- Add **half-precision** path to prep for Tensor Cores
- Compare end-to-end with **cuDNN** and optionally TensorRT
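Each of these steps changes how the blur is computed or at what precision, so a fixed CPU reference to compare GPU output against can help catch regressions. The sketch below is one possible reference, not part of the repository: `blur3x3_reference` and `max_abs_diff` are hypothetical names, and the clamp-to-edge border handling is an assumption that may differ from what `blur3x3_naive` actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference: 3x3 box blur over a row-major float image.
// Assumption: neighbors outside the image are clamped to the nearest edge pixel.
inline std::vector<float> blur3x3_reference(const std::vector<float>& in,
                                            int width, int height) {
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));
    std::vector<float> out(in.size());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    // Clamp neighbor coordinates into the valid image range.
                    const int yy = std::min(std::max(y + dy, 0), height - 1);
                    const int xx = std::min(std::max(x + dx, 0), width - 1);
                    sum += in[static_cast<std::size_t>(yy) * width + xx];
                }
            }
            out[static_cast<std::size_t>(y) * width + x] = sum / 9.0f;
        }
    }
    return out;
}

// Largest absolute element-wise difference between two equally sized buffers.
inline float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

A tiled single-precision kernel would be expected to match this reference to within a small float tolerance, provided it uses the same border policy. A half-precision path will generally need a looser tolerance, chosen empirically for the input range.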