https://github.com/ashvardanian/cuda-python-starter-kit
Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake using PyBind11
- Host: GitHub
- URL: https://github.com/ashvardanian/cuda-python-starter-kit
- Owner: ashvardanian
- License: apache-2.0
- Created: 2024-08-09T04:57:40.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-11T20:59:47.000Z (7 months ago)
- Last Synced: 2025-07-09T00:55:31.041Z (3 months ago)
- Topics: cmake, cuda, cuda-programming, hip, hpc, matrix-multiplication, openmp, parallel-computing, parallel-programming, pybind, pybind11, python, starter-kit, starter-template, tutorial
- Language: Cuda
- Homepage: https://ashvardanian.com/tags/less-slow
- Size: 238 KB
- Stars: 26
- Watchers: 1
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# C++ & CUDA Starter Kit for Python Developers

One of the most common workflows in high-performance computing is to 1️⃣ prototype algorithms in Python and then 2️⃣ port them to C++ and CUDA.
It's a simple way to prototype and test ideas quickly, but configuring the build tools for such heterogeneous code + heterogeneous hardware projects is a pain, often amplified by the error-prone syntax of CMake.
This project provides a pre-configured environment for such workflows:

1. using only `setup.py` and `requirements-{cpu,gpu}.txt` to manage the build process,
2. supporting OpenMP for parallelism on the CPU and CUDA for the GPU, and
3. including [CCCL](https://github.com/NVIDIA/cccl) libraries, like Thrust and CUB, to simplify the code.

As an example, the repository implements, tests, and benchmarks only two operations: array accumulation and matrix multiplication.
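
To give a sense of how much the CCCL libraries shorten such code, accumulating an array on the GPU can be a single `thrust::reduce` call. Below is a minimal standalone sketch, assuming `nvcc` is available; it is illustrative only, not code taken from the repository:

```cuda
// accumulate_thrust.cu: a hypothetical standalone example, not the repository's code.
// Build with: nvcc accumulate_thrust.cu -o accumulate_thrust
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    thrust::device_vector<float> data(1 << 20, 1.0f);           // 1M ones, already on the GPU
    float sum = thrust::reduce(data.begin(), data.end(), 0.0f); // launches a tuned reduction kernel
    std::printf("sum = %.1f\n", sum);                           // expect 1048576.0
    return 0;
}
```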
The baseline Python + Numba implementations are placed in `starter_kit_baseline.py`, and the optimized CUDA and OpenMP implementations are placed in `starter_kit.cu`.
If no CUDA-capable device is found, the file will be treated as a CPU-only C++ implementation.
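
To illustrate how a single `.cu` file can cover both targets and still be callable from Python, here is a hedged sketch of what such a binding might look like; the module name `starter_kit`, the function name, and the preprocessor switch are assumptions for illustration, not the repository's actual API:

```cuda
// Hypothetical sketch in the spirit of starter_kit.cu (names are illustrative).
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>

#if defined(__CUDACC__) // compiled by nvcc: take the GPU path via Thrust
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#endif

namespace py = pybind11;

// Sum a NumPy float32 array: on the GPU when CUDA is available, on the CPU otherwise.
static float accumulate(py::array_t<float, py::array::c_style | py::array::forcecast> data) {
    py::buffer_info buf = data.request();
    float const *ptr = static_cast<float const *>(buf.ptr);
#if defined(__CUDACC__)
    thrust::device_vector<float> device(ptr, ptr + buf.size);  // host -> device copy
    return thrust::reduce(device.begin(), device.end(), 0.0f); // GPU reduction
#else
    float sum = 0.0f;
#pragma omp parallel for reduction(+ : sum) // OpenMP fallback for CPU-only builds
    for (py::ssize_t i = 0; i < buf.size; ++i)
        sum += ptr[i];
    return sum;
#endif
}

PYBIND11_MODULE(starter_kit, m) {
    m.def("accumulate", &accumulate, "Sum a float32 array on the GPU, or the CPU as a fallback");
}
```

Once built with `pip install -e .`, such a function would be reachable from Python as `starter_kit.accumulate(array)`, with no CMake involved.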
If VSCode is used, the `tasks.json` file is configured with debuggers for both CPU and GPU code, both in Python and C++.
The `.clang-format` is configured with the LLVM base style, adjusted for wider screens, allowing 120 characters per line.

## Installation
I'd recommend forking the repository for your own projects, but you can also clone it directly:
```bash
git clone https://github.com/ashvardanian/cuda-python-starter-kit.git
cd cuda-python-starter-kit
```

Once pulled down, you can build and run the project with `uv`:
```bash
git submodule update --init --recursive # fetch CCCL libraries
uv pip install -e .[gpu] # or `.[cpu]` for non-CUDA devices
uv run pytest test.py -s -x # build and test until first failure
```

Or, using a conventional Python environment and dependency-management tooling:
```bash
git submodule update --init --recursive # fetch CCCL libraries
pip install -r requirements-gpu.txt # or requirements-cpu.txt
pip install -e . # compile for the current platform
pytest test.py -s -x # test until first failure
python bench.py # saves charts to disk
```

## Workflow
The project is designed to be as simple as possible, with the following workflow:
1. Fork or download the repository.
2. Implement your baseline algorithm in `starter_kit_baseline.py`.
3. Implement your optimized algorithm in `starter_kit.cu`; a kernel sketch follows below.
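
As a starting point for step 3, here is a hedged sketch of a naive CUDA matrix-multiplication kernel: deliberately unoptimized, the kind of first version you would then profile and tune. It assumes row-major `float` matrices already resident in device memory and may differ from the repository's own implementation:

```cuda
// Hypothetical naive kernel: one thread computes one element of C = A * B.
// A is m x k, B is k x n, C is m x n, all row-major and already on the device.
__global__ void matmul_kernel(float const *a, float const *b, float *c, int m, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < n) {
        float sum = 0.0f;
        for (int i = 0; i < k; ++i)
            sum += a[row * k + i] * b[i * n + col];
        c[row * n + col] = sum;
    }
}

// Host-side launcher: cover the output matrix with 16x16 thread blocks.
void matmul(float const *a, float const *b, float *c, int m, int n, int k) {
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    matmul_kernel<<<grid, block>>>(a, b, c, m, n, k);
    cudaDeviceSynchronize(); // keep it simple: block until the kernel finishes
}
```

Shared-memory tiling, and later Tensor Cores, are the natural next optimizations to benchmark against the Numba baseline once this version passes `test.py`.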
## Reading Materials

Beginner GPGPU:
- High-level concepts: [nvidia.com](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)
- Nvidia CuPy UDFs: [cupy.dev](https://docs.cupy.dev/en/stable/user_guide/kernel.html)
- CUDA in Python with Numba: [numba/nvidia-cuda-tutorial](https://github.com/numba/nvidia-cuda-tutorial)
- C++ STL Parallelism on GPUs: [nvidia.com](https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/)

Advanced GPGPU:
- CUDA math intrinsics: [nvidia.com](https://docs.nvidia.com/cuda/cuda-math-api/index.html)
- Troubleshooting Nvidia hardware: [stas00/ml-engineering](https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/nvidia/debug.md)
- Nvidia ISA Generator with SM89 and SM90 codes: [kuterd/nv_isa_solver](https://github.com/kuterd/nv_isa_solver)
- Multi GPU examples: [nvidia/multi-gpu-programming-models](https://github.com/NVIDIA/multi-gpu-programming-models)

Communities:
- CUDA MODE on [Discord](https://discord.com/invite/cudamode)
- r/CUDA on [Reddit](https://www.reddit.com/r/CUDA/)
- NVIDIA Developer Forums on [DevTalk](https://forums.developer.nvidia.com)