https://github.com/bolu-atx/cuda-dojo

Level up your CUDA skills - RPG style. Do or do not, there is no try.
Host: GitHub
URL: https://github.com/bolu-atx/cuda-dojo
Owner: bolu-atx
License: bsd-3-clause
Created: 2026-06-27T19:27:52.000Z (3 days ago)
Default Branch: main
Last Pushed: 2026-06-27T21:14:13.000Z (3 days ago)
Last Synced: 2026-06-27T21:15:00.645Z (3 days ago)
Topics: cuda, examples, learning, tutorial
Language: Cuda
Homepage:
Size: 1.34 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
- Agents: AGENTS.md

README

# CUDA Dojo

A level-by-level CUDA learning project, structured as a skill tree rather than a
textbook. Each level unlocks one mental model and one production skill. Host code
is **C++23**, device code is **C++20** (the highest CUDA reliably supports for
`__device__` code today).

> **Hardware:** CUDA requires an NVIDIA GPU. macOS/Apple Silicon cannot build or
> run this natively; develop on a Linux/Windows box with the CUDA Toolkit installed.

## How to use this

CUDA Dojo is a hands-on **learning template**, not a library to depend on. The
intended loop is: get your own copy, work a level until you can *explain* it, then
climb the tree. The [interactive guide](https://bolu.dev/cuda-dojo/) teaches the
mental model; the `levels/` code is where you prove you have it.

1. **Get your own copy.** Click **Use this template** on GitHub (or fork) to start
   a personal repo you can commit your solutions into, then clone it:

   ```bash
   git clone https://github.com/<your-username>/cuda-dojo.git
   cd cuda-dojo
   make dep-check        # confirm your CUDA toolchain is ready
   ```

2. **Read the level, then the code.** Open the matching page in the
   [interactive guide](https://bolu.dev/cuda-dojo/) (or `make docs-serve` for a
   local copy), play with the widgets until the concept clicks, then read the
   kernels in `levels/levelNN_<name>/`.

3. **Build and prove it.** `make levelNN-test` compiles the level and runs it
   against a CPU reference. A green test is the *floor*, not the goal: make sure
   you can predict the numbers (transactions, bandwidth, occupancy), not just pass.

4. **Extend, then continue.** Each level ends with "your reps": small variations
   to implement yourself. Scaffold a new level with `add_dojo_level(...)` (see
   [Anatomy of a level](#anatomy-of-a-level)) and work up the
   [skill tree](#the-skill-tree).
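
Step 3 checks each level against a CPU reference. As a rough illustration of that idea (not the repo's actual harness in `common/dojo/test.hpp`), the host-only sketch below assumes a SAXPY result has already been copied back from the device. The names `saxpy_reference` and `all_close`, and the tolerance values, are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for SAXPY: out[i] = a * x[i] + y[i].
// Assumes x and y have the same length.
std::vector<float> saxpy_reference(float a, const std::vector<float>& x,
                                   const std::vector<float>& y) {
    std::vector<float> out(y.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        out[i] = a * x[i] + y[i];
    }
    return out;
}

// Element-wise comparison with a mixed absolute/relative tolerance.
// Returns false on a size mismatch or if any element (including NaN) is off.
bool all_close(const std::vector<float>& got, const std::vector<float>& want,
               float rtol = 1e-5f, float atol = 1e-6f) {
    if (got.size() != want.size()) return false;
    for (std::size_t i = 0; i < got.size(); ++i) {
        const float tol = atol + rtol * std::fabs(want[i]);
        if (!(std::fabs(got[i] - want[i]) <= tol)) return false;
    }
    return true;
}
```

In a real test, the `got` vector would be the buffer copied back from the GPU, and `all_close(got, saxpy_reference(a, x, y))` would be the pass condition. A tolerance is used rather than exact equality because a GPU may fuse the multiply and add, which can change the last bits of the result.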

## Build & test

The `Makefile` is the front door: it wraps CMake/CTest and auto-discovers levels
from `levels/`, so per-level targets appear automatically as levels are added.

```bash
make dep-check        # verify toolchain (nvcc, cmake, generator, profilers)
make                  # configure + build everything
make test             # run all level tests (CTest)
make level01          # build one level's kernel lib + demo
make level01-test     # build + run just that level's tests
make help             # list every target, including per-level ones
make clean            # remove ./build (distclean also drops ./out, ./.venv)
```

Prefer raw CMake? It works the same:

```bash
cmake -B build -G Ninja                        # configure (targets local GPU arch)
cmake --build build -j                          # build everything
ctest --test-dir build --output-on-failure      # run all tests
./build/levels/level01_vector_add/level01_demo   # run a level's demo
```

Pin specific architectures instead of autodetecting (required when building
without a GPU present, e.g. in CI or a container):

```bash
cmake -B build -DCMAKE_CUDA_ARCHITECTURES="80;86;90"   # or: all-major
```

## Docker (self-contained toolchain)

The image encapsulates the **entire CUDA toolkit** (nvcc, cuBLAS, cuFFT, …) plus
build deps, so no host CUDA install is needed. The one thing it can't contain is the
NVIDIA **kernel driver**: that stays on the host and is bridged in at runtime by
the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
(`--gpus all`). So *compiling* needs no GPU, but *running kernels* needs a GPU host.

```bash
make docker-build     # build the toolchain image
make docker-compile   # compile in-container WITHOUT a GPU (all-major); CI/Mac friendly
make docker-test      # build + test in-container (needs a GPU host + --gpus all)
make docker-shell     # interactive shell, repo bind-mounted, --gpus all
```

VS Code users: **Reopen in Container** uses `.devcontainer/` (GPU optional, so it
opens and compiles even on machines without an NVIDIA GPU).

## Project layout

```
common/dojo/cuda_utils.cuh   CUDA_CHECK, GpuTimer, bandwidth/FLOP helpers, device info
common/dojo/test.hpp         zero-dependency micro test harness (CTest-backed)
cmake/Dojo.cmake             add_dojo_level(): one helper builds lib + demo + test
levels/levelNN_<name>/       each level: kernels (.cu), a demo, and a test
```

### Anatomy of a level

Kernels live in a small static library that **both** the demo and the test link
against, so the canonical host+kernel code is written once and exercised two
ways. Add a new level by creating `levels/levelNN_<name>/` with a
`CMakeLists.txt` calling `add_dojo_level(...)`, then `add_subdirectory(...)` it
from the top-level `CMakeLists.txt`.

## The skill tree

The numbering matches the [interactive guide](https://bolu.dev/cuda-dojo/). Status
reflects the **code** under `levels/`: ✅ implemented, ⬜ planned, 📖 guide-only
(conceptual, no kernel to write).

| Level | The one idea | Project | Status |
|------:|--------------|---------|:------:|
| 0 | A GPU trades latency for throughput | GPU mental model | 📖 |
| 1 | A kernel is one function run by a grid of threads | vector add / SAXPY / reduction | ✅ |
| 2 | You design the thread→data mapping | invert / threshold / crop | ✅ |
| 3 | CUDA has a logical machine and a physical machine | scope-mapping drills (grid/block/thread → SM/warp/lane) | ✅ |
| 4 | Where data lives dominates speed | tiled transpose | ✅ |
| 5 | Every kernel is memory- or compute-bound | optimize blur / Sobel (roofline) | ✅ |
| 6 | A block is a team with a shared scratchpad | box filter / separable blur | ✅ |
| 7 | Warp lanes cooperate through masks and registers | warp reduction / histogram | ✅ |
| 8 | Synchronization is a scope decision | warp / block / stream idioms | ⬜ |
| 9 | Don't hand-roll GEMM or FFT | cuBLAS / cuFFT pipeline | ⬜ |
| 10 | Overlap copy and compute | video pipeline (streams) | ⬜ |
| 11 | Real programs are kernel graphs | GEMM (multi-kernel) | ⬜ |
| 12 | Nsight tells you the truth | profile a kernel: 20 ms → 1 ms | ⬜ |
| 13 | Compose work into pipelines with streams, events, graphs | producer/consumer pipeline | ⬜ |
| 14 | Production = pools + graphs + streams | end-to-end image pipeline | ⬜ |
| 15 | Reformulate the algorithm for the hardware | your own | ⬜ |
| 16 | Pipeline tiles inside one kernel | `cp.async` GEMM tile loop | ⬜ |

Given an HPC/SIMD/OpenMP background, levels 0–2 should go fast; the real payoff
is levels 4–10 (memory hierarchy, warp programming, Nsight-driven perf analysis,
and stream/graph pipelines).
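
Levels 1 and 2 both come down to the arithmetic that maps a thread's coordinates to the data it owns. That arithmetic can be sketched on the host in plain C++ before any kernel is written. The function names below are hypothetical; their parameters stand in for CUDA's built-in `blockIdx.x`, `blockDim.x`, and `threadIdx.x`.

```cpp
#include <cstddef>

// Number of blocks needed to cover n elements (ceiling division).
constexpr std::size_t blocks_for(std::size_t n, std::size_t block_dim) {
    return (n + block_dim - 1) / block_dim;
}

// 1D global index, i.e. blockIdx.x * blockDim.x + threadIdx.x in a kernel.
constexpr std::size_t global_index(std::size_t block_idx,
                                   std::size_t block_dim,
                                   std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Row-major offset of pixel (x, y) in an image that is `width` pixels wide.
constexpr std::size_t pixel_offset(std::size_t x, std::size_t y,
                                   std::size_t width) {
    return y * width + x;
}

// Threads launched beyond the end of the data. In a kernel these are the
// threads that an `if (i < n)` bounds check must turn away.
constexpr std::size_t idle_threads(std::size_t n, std::size_t block_dim) {
    return blocks_for(n, block_dim) * block_dim - n;
}
```

For example, covering 1000 elements with 256-thread blocks takes 4 blocks and leaves 24 threads past the end of the array, which is why the bounds check matters.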

The docs also include three advanced tracks that stay in the same Feynman style:
architecture (now also covering multi-GPU scaling: NCCL/NVSHMEM, MIG), libraries,
and imaging/CV. Each one is built around prediction, interactive widgets, and concrete
CUDA reps rather than reference-manual lists.

## Profiling (from Level 5 onward)

Release builds compile with `-lineinfo` so the profilers map SASS back to source:

```bash
nsys profile ./build/levels/.../levelNN_demo     # timeline: transfers, kernels, gaps
ncu --set full ./build/levels/.../levelNN_demo    # per-kernel: occupancy, memory, roofline
```

## Interactive guide (docs)

A Feynman-style, level-by-level companion site with **interactive canvas widgets**
(thread indexing, SIMT divergence, coalescing, roofline, reduction, streams …)
lives in `docs/`, built with MkDocs Material.

```bash
make docs-serve       # live preview at http://127.0.0.1:9090
make docs             # build static site into ./out/docs
```

## Cheatsheets

To help with learning CUDA concepts (these are designed to be printed on standard Letter-sized paper):

![CUDA Cheatsheet](docs/assets/cuda-cheatsheet.png)

![CUDA Cheatsheet Page 2](docs/assets/cuda-cheatsheet-2.png)

## License

BSD 3-Clause. Redistribution must retain the copyright notice, license terms,
and disclaimer; see [LICENSE](LICENSE).
