https://github.com/flosmume/cpp-cuda-deepvision-rtx-starter

CUDA C++ practice project for the RTX 4070 SUPER: explore GPU concurrency, pinned memory, and Nsight profiling. Includes SAXPY and 2D blur kernels for practicing optimization, stream overlap, and timing analysis, aimed at the NVIDIA Developer Technology Engineering skillset.
cpp cuda cuda-kernels cuda-streams deep-learning-inference gpu gpu-optimization gpu-profiling high-performance-computing nsight nvidia parrallel-computing pinned-memory


# DeepVision-RTX Starter

This is an enhanced README combining:
- GitHub-friendly formatting
- Technical deep dive
- ASCII diagrams
- Rendered SVG diagrams
- Profiling workflow
- Architecture explanations
- GPU hardware reasoning

## Architecture Diagram (SVG)
![Architecture Diagram](docs/architecture.svg)

## Streams Overlap Diagram (SVG)
![Streams Timeline](docs/streams_overlap.svg)

## ASCII Architecture Diagram
```
+-------------------+     PCIe / DMA     +---------------------+
|     CPU Host      | -----------------> |     GPU Global      |
|  (Pinned Memory)  | <----------------- |       Memory        |
+-------------------+                    +---------------------+
          |                                         |
          |  Launch Kernels                         |
          v                                         v
   +--------------+                       +------------------+
   | CUDA Driver  |                       |  SMs (Ada 8.9)   |
   | Runtime API  |                       | Warps / Threads  |
   +--------------+                       +------------------+
```

## ASCII Streams Overlap Diagram
```
Time →
----------------------------------------------------------------------
H2D Stream 0:     [========== copy =========]
H2D Stream 1:               [========== copy =========]
Compute Stream 2:                     [==== kernel ====]
D2H Stream 0:                                     [==== copy ====]
----------------------------------------------------------------------
```
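
The overlap shown in the timeline can be sketched as a chunked pipeline. This is a minimal sketch under stated assumptions, not the repository's implementation: the names `saxpy_chunk`, `ceil_div`, `run_overlap_demo`, and `CUDA_TRY`, as well as the chunk size and stream count, are hypothetical. It also uses a common variant of the scheme in the diagram: each chunk gets its own stream that carries its H2D copy, kernel, and D2H copy, so ordering within a chunk is implicit while different chunks are free to overlap. Only the CUDA runtime calls themselves are real API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: return 1 from the enclosing function on failure.
// Sketch only: resources already allocated are not released on this path.
#define CUDA_TRY(call)                                                       \
    do {                                                                     \
        cudaError_t e_ = (call);                                             \
        if (e_ != cudaSuccess) {                                             \
            std::fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e_)); \
            return 1;                                                        \
        }                                                                    \
    } while (0)

// y = a * x + y over one chunk, one element per thread.
__global__ void saxpy_chunk(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Round-up integer division, used to size the grid.
inline int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Runs the pipeline once; returns 0 if every element has the expected value.
int run_overlap_demo() {
    const int kStreams = 4;        // assumption: one stream per chunk
    const int kChunk   = 1 << 20;  // assumption: elements per chunk
    const size_t total = (size_t)kStreams * kChunk;
    const size_t chunkBytes = kChunk * sizeof(float);

    float *hx = nullptr, *hy = nullptr, *dx = nullptr, *dy = nullptr;
    // Pinned (page-locked) host memory lets cudaMemcpyAsync run as a real
    // asynchronous DMA transfer instead of staging through pageable memory.
    CUDA_TRY(cudaMallocHost(&hx, total * sizeof(float)));
    CUDA_TRY(cudaMallocHost(&hy, total * sizeof(float)));
    CUDA_TRY(cudaMalloc(&dx, total * sizeof(float)));
    CUDA_TRY(cudaMalloc(&dy, total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    cudaStream_t streams[kStreams];
    for (int s = 0; s < kStreams; ++s) CUDA_TRY(cudaStreamCreate(&streams[s]));

    // Enqueue copy-in, kernel, copy-out for each chunk in its own stream.
    for (int s = 0; s < kStreams; ++s) {
        const size_t off = (size_t)s * kChunk;
        CUDA_TRY(cudaMemcpyAsync(dx + off, hx + off, chunkBytes,
                                 cudaMemcpyHostToDevice, streams[s]));
        CUDA_TRY(cudaMemcpyAsync(dy + off, hy + off, chunkBytes,
                                 cudaMemcpyHostToDevice, streams[s]));
        saxpy_chunk<<<ceil_div(kChunk, 256), 256, 0, streams[s]>>>(
            kChunk, 3.0f, dx + off, dy + off);
        CUDA_TRY(cudaGetLastError());
        CUDA_TRY(cudaMemcpyAsync(hy + off, dy + off, chunkBytes,
                                 cudaMemcpyDeviceToHost, streams[s]));
    }
    CUDA_TRY(cudaDeviceSynchronize());  // wait for all streams to drain

    // 3 * 1 + 2 = 5 is exactly representable, so compare exactly.
    size_t bad = 0;
    for (size_t i = 0; i < total; ++i) if (hy[i] != 5.0f) ++bad;

    for (int s = 0; s < kStreams; ++s) CUDA_TRY(cudaStreamDestroy(streams[s]));
    CUDA_TRY(cudaFree(dx));
    CUDA_TRY(cudaFree(dy));
    CUDA_TRY(cudaFreeHost(hx));
    CUDA_TRY(cudaFreeHost(hy));
    return bad == 0 ? 0 : 1;
}
```

Calling `run_overlap_demo()` from `main` under `nsys profile` should show whether the copies and kernels of different chunks actually overlap on a given machine; how much they overlap depends on chunk size and on the number of copy engines the device exposes.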

## Build Instructions
```bash
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j
./build/deepvision_rtx
```

## Profiling Instructions (Nsight Systems & Nsight Compute)
```bash
nsys profile -o nsys_report ./build/deepvision_rtx
ncu --set full ./build/deepvision_rtx
```
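
For quick in-program timing alongside the Nsight tools, CUDA events are a lightweight complement. The following is a sketch, assuming a plain SAXPY kernel; the names `saxpy`, `saxpy_bandwidth_gbs`, and `time_saxpy_ms` are hypothetical and not taken from the repository. It times one launch after a warm-up and converts the result to effective bandwidth.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// y = a * x + y, one element per thread.
__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Effective bandwidth in GB/s. SAXPY reads x and y and writes y,
// so it moves 3 floats per element.
inline double saxpy_bandwidth_gbs(int n, float ms) {
    return 3.0 * n * sizeof(float) / (ms * 1e-3) / 1e9;
}

// Times one SAXPY launch with CUDA events.
// Returns elapsed milliseconds, or -1.0f if any CUDA call fails.
float time_saxpy_ms(int n) {
    float *x = nullptr, *y = nullptr;
    cudaEvent_t start = nullptr, stop = nullptr;
    float ms = -1.0f;
    const int block = 256;
    const int grid = (n + block - 1) / block;
    const size_t bytes = (size_t)n * sizeof(float);

    bool ok = cudaMalloc(&x, bytes) == cudaSuccess &&
              cudaMalloc(&y, bytes) == cudaSuccess &&
              cudaMemset(x, 0, bytes) == cudaSuccess &&  // values do not
              cudaMemset(y, 0, bytes) == cudaSuccess &&  // matter for timing
              cudaEventCreate(&start) == cudaSuccess &&
              cudaEventCreate(&stop) == cudaSuccess;
    if (ok) {
        saxpy<<<grid, block>>>(n, 2.0f, x, y);  // warm-up launch, not timed
        cudaEventRecord(start);                 // marker before the kernel
        saxpy<<<grid, block>>>(n, 2.0f, x, y);  // timed launch
        cudaEventRecord(stop);                  // marker after the kernel
        ok = cudaEventSynchronize(stop) == cudaSuccess &&
             cudaEventElapsedTime(&ms, start, stop) == cudaSuccess &&
             cudaGetLastError() == cudaSuccess;
        if (!ok) ms = -1.0f;
    }
    if (start) cudaEventDestroy(start);
    if (stop) cudaEventDestroy(stop);
    cudaFree(x);  // cudaFree(nullptr) is a no-op
    cudaFree(y);
    return ms;
}
```

Comparing `saxpy_bandwidth_gbs(n, time_saxpy_ms(n))` against the device's peak memory bandwidth gives a first indication of how close a memory-bound kernel is to the hardware limit; Nsight Compute can then explain the gap.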

## Project Structure
```
src/
  main.cpp            # Entry point
  main.cu             # Separate experimental demo
  conv_kernels.cu     # SAXPY + blur3x3 kernels
  conv_kernels.cuh    # Kernel declarations
utils/
  check_cuda.hpp      # Error-checking utilities
```
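
To make the roles of these files concrete, here is a sketch of what an error-checking macro and a naive 3×3 blur kernel could look like. It is an illustration only: the names `CUDA_CHECK`, `clamp_index`, `blur3x3`, and `blur3x3_host`, the box-filter weights, and the clamp-to-edge border handling are all assumptions, and the actual contents of `check_cuda.hpp` and `conv_kernels.cu` may differ.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical macro in the spirit of an error-checking header:
// abort with file/line context when a runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n",             \
                         cudaGetErrorString(err_), __FILE__, __LINE__);   \
            std::exit(EXIT_FAILURE);                                      \
        }                                                                 \
    } while (0)

// Clamp v into [0, hi]. Assumption: borders repeat the edge pixel.
__host__ __device__ inline int clamp_index(int v, int hi) {
    return v < 0 ? 0 : (v > hi ? hi : v);
}

// Naive 3x3 box blur: one thread per output pixel, all reads from
// global memory (no shared-memory tiling).
__global__ void blur3x3(const float* in, float* out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
            sum += in[clamp_index(y + dy, h - 1) * w +
                      clamp_index(x + dx, w - 1)];
    out[y * w + x] = sum / 9.0f;
}

// Host wrapper: copy a row-major w*h image in, blur it, copy it back.
std::vector<float> blur3x3_host(const std::vector<float>& img, int w, int h) {
    const size_t bytes = img.size() * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_in, bytes));
    CUDA_CHECK(cudaMalloc(&d_out, bytes));
    CUDA_CHECK(cudaMemcpy(d_in, img.data(), bytes, cudaMemcpyHostToDevice));

    dim3 block(16, 16);
    dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
    blur3x3<<<grid, block>>>(d_in, d_out, w, h);
    CUDA_CHECK(cudaGetLastError());  // catches launch-configuration errors

    std::vector<float> out(img.size());
    // A blocking cudaMemcpy also waits for the kernel to finish.
    CUDA_CHECK(cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_in));
    CUDA_CHECK(cudaFree(d_out));
    return out;
}
```

A constant image makes a quick sanity check: with clamp-to-edge borders, blurring an image filled with `1.0f` should return `1.0f` at every pixel, including the corners.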

## GPU Architecture Notes (RTX 4070 SUPER - Ada Lovelace, compute capability 8.9)
- SM count: 56
- Warp size: 32
- Max threads/block: 1024
- Memory bandwidth: 504 GB/s (theoretical peak)
- Concurrent copy/compute supported
- Best performance is achieved when:
  - Host buffers use pinned memory
  - H2D and D2H copies overlap with compute
  - Kernels maintain good occupancy
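
The figures in this list can be checked on the machine at hand rather than taken on trust. The sketch below queries device 0 with `cudaDeviceGetAttribute` and asks the runtime for the occupancy of a placeholder kernel. The names `peak_bandwidth_gbs`, `scale_kernel`, `attr`, and `print_device_notes` are hypothetical, and the bandwidth formula is the conventional double-data-rate estimate, which is an assumption about how the driver reports the memory clock.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Conventional peak-bandwidth estimate in GB/s:
// 2 transfers per clock * memory clock (Hz) * bus width (bytes).
inline double peak_bandwidth_gbs(int memClockKHz, int busWidthBits) {
    return 2.0 * memClockKHz * 1e3 * (busWidthBits / 8.0) / 1e9;
}

// Placeholder kernel so the occupancy query has something to inspect.
__global__ void scale_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Reads one integer attribute of device 0; returns -1 on failure.
static int attr(cudaDeviceAttr a) {
    int v = -1;
    if (cudaDeviceGetAttribute(&v, a, 0) != cudaSuccess) return -1;
    return v;
}

// Prints the properties listed above; returns 0 on success.
int print_device_notes() {
    const int sms = attr(cudaDevAttrMultiProcessorCount);
    if (sms < 0) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    std::printf("Compute capability: %d.%d\n",
                attr(cudaDevAttrComputeCapabilityMajor),
                attr(cudaDevAttrComputeCapabilityMinor));
    std::printf("SM count: %d\n", sms);
    std::printf("Warp size: %d\n", attr(cudaDevAttrWarpSize));
    std::printf("Max threads/block: %d\n", attr(cudaDevAttrMaxThreadsPerBlock));
    // A non-zero engine count means copies can overlap kernel execution.
    std::printf("Async copy engines: %d\n", attr(cudaDevAttrAsyncEngineCount));
    std::printf("Peak bandwidth estimate: %.1f GB/s\n",
                peak_bandwidth_gbs(attr(cudaDevAttrMemoryClockRate),
                                   attr(cudaDevAttrGlobalMemoryBusWidth)));

    // Theoretical occupancy: resident threads per SM for this kernel and
    // block size, relative to the hardware maximum per SM.
    const int blockSize = 256;
    int blocksPerSM = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, scale_kernel, blockSize, 0) != cudaSuccess)
        return 1;
    const int maxThreadsPerSM = attr(cudaDevAttrMaxThreadsPerMultiProcessor);
    std::printf("Occupancy at %d threads/block: %.0f%%\n", blockSize,
                100.0 * blocksPerSM * blockSize / maxThreadsPerSM);
    return 0;
}
```

As a worked instance of the formula, a 192-bit bus with a reported memory clock of 10,500,000 kHz evaluates to 2 × 10.5 GHz × 24 bytes = 504 GB/s. Theoretical occupancy is only an upper bound; achieved occupancy is best read from Nsight Compute.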

## Roadmap
- Add shared-memory tiled blur
- Add constant-memory kernel variants
- Add half-precision path (FP16)
- Add Tensor Core WMMA version
- Add occupancy analysis + roofline plot
- Add Nsight Compute performance tables
- Add multi-kernel pipelines
- Compare against cuDNN for 3×3 conv

---