https://github.com/flosmume/cpp-cuda-deepvision-rtx-starter

CUDA C++ practice project for the RTX 4070 SUPER that explores GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels for practicing optimization, stream overlap, and timing analysis, skills targeted at NVIDIA Developer Technology Engineering roles.
Host: GitHub
URL: https://github.com/flosmume/cpp-cuda-deepvision-rtx-starter
Owner: FlosMume
Created: 2025-10-29T03:43:40.000Z (3 months ago)
Default Branch: main
Last Pushed: 2025-10-29T08:05:50.000Z (3 months ago)
Last Synced: 2025-10-29T10:12:29.852Z (3 months ago)
Topics: cpp, cuda, cuda-kernels, cuda-streams, deep-learning-inference, gpu, gpu-optimization, gpu-profiling, high-performance-computing, nsight, nvidia, parrallel-computing, pinned-memory
Language: C++
Homepage:
Size: 3.91 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# DeepVision-RTX (Starter)

A CUDA C++ practice project designed for the RTX 4070 SUPER (Ada Lovelace, compute capability 8.9). It demonstrates how to overlap data transfers with computation using streams and pinned memory, applies basic kernel optimizations with 1D and 2D grid configurations, and uses CUDA events for precise timing alongside profiling in Nsight Systems and Nsight Compute.
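
The overlap pattern described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the helper `overlapped_saxpy`, the `CUDA_CHECK` macro, the block size of 256, and the test values are all assumptions made for the example. Each stream gets one chunk of the arrays and queues copy-in, kernel, and copy-out, so work on different chunks is free to overlap.

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a readable message when a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                   cudaGetErrorString(err_));                         \
      std::exit(EXIT_FAILURE);                                        \
    }                                                                 \
  } while (0)

// y[i] = a * x[i] + y[i], one thread per element of a 1D grid.
__global__ void saxpy(int n, float a, const float* x, float* y) {
  const int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = a * x[i] + y[i];
}

// Hypothetical helper (not taken from the repository): runs SAXPY over n
// elements split across numStreams streams. Returns the elapsed time in
// milliseconds and stores the largest absolute error in *maxErr.
float overlapped_saxpy(int n, int numStreams, float* maxErr) {
  const float a = 2.0f;
  const size_t bytes = static_cast<size_t>(n) * sizeof(float);
  float *hx, *hy, *dx, *dy;
  CUDA_CHECK(cudaMallocHost(&hx, bytes));  // pinned host memory, so the
  CUDA_CHECK(cudaMallocHost(&hy, bytes));  // async copies can overlap
  CUDA_CHECK(cudaMalloc(&dx, bytes));
  CUDA_CHECK(cudaMalloc(&dy, bytes));
  for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

  std::vector<cudaStream_t> streams(numStreams);
  for (auto& s : streams) CUDA_CHECK(cudaStreamCreate(&s));
  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));

  // The events go on the legacy default stream, which orders itself against
  // the blocking streams created above, so the interval brackets all chunks.
  const int chunk = (n + numStreams - 1) / numStreams;
  CUDA_CHECK(cudaEventRecord(start, cudaStreamLegacy));
  for (int s = 0; s < numStreams; ++s) {
    const int off = s * chunk;
    const int len = std::min(chunk, n - off);
    if (len <= 0) break;
    const size_t chunkBytes = static_cast<size_t>(len) * sizeof(float);
    CUDA_CHECK(cudaMemcpyAsync(dx + off, hx + off, chunkBytes,
                               cudaMemcpyHostToDevice, streams[s]));
    CUDA_CHECK(cudaMemcpyAsync(dy + off, hy + off, chunkBytes,
                               cudaMemcpyHostToDevice, streams[s]));
    saxpy<<<(len + 255) / 256, 256, 0, streams[s]>>>(len, a, dx + off, dy + off);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(hy + off, dy + off, chunkBytes,
                               cudaMemcpyDeviceToHost, streams[s]));
  }
  CUDA_CHECK(cudaEventRecord(stop, cudaStreamLegacy));
  CUDA_CHECK(cudaEventSynchronize(stop));
  CUDA_CHECK(cudaDeviceSynchronize());  // all chunks are back on the host
  float ms = 0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

  // Every element should equal 2 * 1 + 2 = 4.
  *maxErr = 0.0f;
  for (int i = 0; i < n; ++i) *maxErr = std::max(*maxErr, std::fabs(hy[i] - 4.0f));

  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
  for (auto& s : streams) CUDA_CHECK(cudaStreamDestroy(s));
  CUDA_CHECK(cudaFreeHost(hx));
  CUDA_CHECK(cudaFreeHost(hy));
  CUDA_CHECK(cudaFree(dx));
  CUDA_CHECK(cudaFree(dy));
  return ms;
}
```

Comparing the returned time for `numStreams = 1` against a larger value gives a first impression of the benefit. Whether copies and kernels actually ran concurrently depends on the device and driver, and is easiest to confirm on the Nsight Systems timeline.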

## What’s here?
- **Pinned host memory** + `cudaMemcpyAsync` to demonstrate overlap
- **Multiple streams** for concurrent copy/compute
- **Timed sections** with `cudaEventRecord` / `cudaEventElapsedTime`
- **Kernels**: `saxpy` (1D), `blur3x3_naive` (2D)
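
For the 2D case, a naive 3x3 blur might look roughly like the sketch below. The repository's `blur3x3_naive` may differ. This version assumes a single-channel, row-major `float` image, a box filter (plain mean of nine pixels), and clamp-to-edge borders. The names `blur3x3_naive_sketch`, `blur3x3_on_gpu`, and `check` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Stop on the first CUDA runtime error.
static void check(cudaError_t err) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::exit(EXIT_FAILURE);
  }
}

// One thread per output pixel on a 2D grid. Each thread reads its nine
// neighbours straight from global memory (assumed conventions: box filter,
// clamp-to-edge borders, row-major single-channel image).
__global__ void blur3x3_naive_sketch(const float* in, float* out,
                                     int width, int height) {
  const int x = blockIdx.x * blockDim.x + threadIdx.x;
  const int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x >= width || y >= height) return;  // threads past the image edge

  float sum = 0.0f;
  for (int dy = -1; dy <= 1; ++dy) {
    for (int dx = -1; dx <= 1; ++dx) {
      const int xx = min(max(x + dx, 0), width - 1);
      const int yy = min(max(y + dy, 0), height - 1);
      sum += in[yy * width + xx];
    }
  }
  out[y * width + x] = sum / 9.0f;
}

// Hypothetical host wrapper: copies the image to the device, launches the
// kernel with 16x16 blocks, and returns the blurred image.
std::vector<float> blur3x3_on_gpu(const std::vector<float>& img,
                                  int width, int height) {
  const size_t bytes = img.size() * sizeof(float);
  float *dIn = nullptr, *dOut = nullptr;
  check(cudaMalloc(&dIn, bytes));
  check(cudaMalloc(&dOut, bytes));
  check(cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice));

  // Round the grid up so partial blocks cover the right and bottom edges.
  const dim3 block(16, 16);
  const dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
  blur3x3_naive_sketch<<<grid, block>>>(dIn, dOut, width, height);
  check(cudaGetLastError());

  std::vector<float> out(img.size());
  check(cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost));
  check(cudaFree(dIn));
  check(cudaFree(dOut));
  return out;
}
```

The rounded-up grid together with the early return is what lets image sizes that are not multiples of 16 work. A constant input image is a quick sanity check, since a box blur should leave it unchanged.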

## Build
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/deepvision_rtx
```

## Profile (examples)
```bash
# Nsight Systems: record a timeline report (on a host with the CUDA Toolkit installed), then open it in the GUI
nsys profile -o nsys_report ./build/deepvision_rtx

# Nsight Compute: collect the full metric set for every kernel launch (add --kernel-name <name> to restrict it to one kernel)
ncu --set full --target-processes all ./build/deepvision_rtx
```

## Next steps
- Convert blur kernel to **shared-memory tiled** version
- Add **half-precision** path to prep for Tensor Cores
- Compare end-to-end with **cuDNN** and optionally TensorRT
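
As a starting point for the first item, a shared-memory tiled 3x3 blur could be sketched as below. This is one possible design, not the repository's implementation. It assumes a single-channel, row-major `float` image, a box filter, clamp-to-edge borders, and 16x16 thread blocks. The names `blur3x3_tiled_sketch`, `blur3x3_tiled_on_gpu`, `check`, and `TILE` are hypothetical.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed block edge; the kernel expects TILE x TILE threads

// Stop on the first CUDA runtime error.
static void check(cudaError_t err) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    std::exit(EXIT_FAILURE);
  }
}

// Each block stages its TILE x TILE pixels plus a one-pixel halo in shared
// memory, so the nine reads per output pixel hit shared rather than global memory.
__global__ void blur3x3_tiled_sketch(const float* in, float* out,
                                     int width, int height) {
  __shared__ float tile[TILE + 2][TILE + 2];
  const int x0 = blockIdx.x * TILE - 1;  // image column of tile[..][0]
  const int y0 = blockIdx.y * TILE - 1;  // image row of tile[0][..]

  // Cooperative load: TILE*TILE threads stride over the (TILE+2)^2 cells,
  // clamping coordinates so halo cells outside the image repeat the edge.
  const int tid = threadIdx.y * TILE + threadIdx.x;
  for (int i = tid; i < (TILE + 2) * (TILE + 2); i += TILE * TILE) {
    const int ty = i / (TILE + 2);
    const int tx = i % (TILE + 2);
    const int gx = min(max(x0 + tx, 0), width - 1);
    const int gy = min(max(y0 + ty, 0), height - 1);
    tile[ty][tx] = in[gy * width + gx];
  }
  __syncthreads();  // the whole tile must be loaded before anyone reads it

  const int x = blockIdx.x * TILE + threadIdx.x;
  const int y = blockIdx.y * TILE + threadIdx.y;
  if (x >= width || y >= height) return;

  float sum = 0.0f;
  for (int dy = 0; dy < 3; ++dy)
    for (int dx = 0; dx < 3; ++dx)
      sum += tile[threadIdx.y + dy][threadIdx.x + dx];
  out[y * width + x] = sum / 9.0f;
}

// Hypothetical host wrapper mirroring a plain copy-launch-copy flow.
std::vector<float> blur3x3_tiled_on_gpu(const std::vector<float>& img,
                                        int width, int height) {
  const size_t bytes = img.size() * sizeof(float);
  float *dIn = nullptr, *dOut = nullptr;
  check(cudaMalloc(&dIn, bytes));
  check(cudaMalloc(&dOut, bytes));
  check(cudaMemcpy(dIn, img.data(), bytes, cudaMemcpyHostToDevice));

  const dim3 block(TILE, TILE);
  const dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE);
  blur3x3_tiled_sketch<<<grid, block>>>(dIn, dOut, width, height);
  check(cudaGetLastError());

  std::vector<float> out(img.size());
  check(cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost));
  check(cudaFree(dIn));
  check(cudaFree(dOut));
  return out;
}
```

Threads outside the image still take part in the tile load and only return after `__syncthreads()`, which keeps the barrier safe for partial blocks. Whether tiling pays off for a stencil this small is worth measuring in Nsight Compute rather than assuming, because the L1 and L2 caches already absorb much of the reuse.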
