https://github.com/ashvardanian/cuda-python-starter-kit
Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake using PyBind11
- Host: GitHub
- URL: https://github.com/ashvardanian/cuda-python-starter-kit
- Owner: ashvardanian
- License: apache-2.0
- Created: 2024-08-09T04:57:40.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-11T20:59:47.000Z (7 months ago)
- Last Synced: 2025-07-09T00:55:31.041Z (3 months ago)
- Topics: cmake, cuda, cuda-programming, hip, hpc, matrix-multiplication, openmp, parallel-computing, parallel-programming, pybind, pybind11, python, starter-kit, starter-template, tutorial
- Language: Cuda
- Homepage: https://ashvardanian.com/tags/less-slow
- Size: 238 KB
- Stars: 26
- Watchers: 1
- Forks: 3
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# C++ & CUDA Starter Kit for Python Developers

One of the most common workflows in high-performance computing is to 1️⃣ prototype algorithms in Python and then 2️⃣ port them to C++ and CUDA.
It's a simple way to prototype and test ideas quickly, but configuring the build tools for such heterogeneous code + heterogeneous hardware projects is a pain, often amplified by the error-prone syntax of CMake.
This project provides a pre-configured environment for such workflows:

1. using only `setup.py` and `requirements-{cpu,gpu}.txt` to manage the build process,
2. supporting OpenMP for parallelism on the CPU and CUDA for the GPU, and
3. including [CCCL](https://github.com/NVIDIA/cccl) libraries, like Thrust and CUB, to simplify the code.

As an example, the repository implements, tests, and benchmarks only two operations: array accumulation and matrix multiplication.
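
To give a sense of how much the CCCL libraries shorten such code, accumulating an array on the GPU can be a single `thrust::reduce` call. Below is a minimal standalone sketch, assuming `nvcc` is available; it is illustrative only, not code taken from the repository:

```cuda
// accumulate_thrust.cu: a hypothetical standalone example, not the repository's code.
// Build with: nvcc accumulate_thrust.cu -o accumulate_thrust
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
    thrust::device_vector<float> data(1 << 20, 1.0f);           // 1M ones, already on the GPU
    float sum = thrust::reduce(data.begin(), data.end(), 0.0f); // launches a tuned reduction kernel
    std::printf("sum = %.1f\n", sum);                           // expect 1048576.0
    return 0;
}
```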
The baseline Python + Numba implementations are placed in `starter_kit_baseline.py`, and the optimized CUDA and OpenMP implementations are placed in `starter_kit.cu`.
If no CUDA-capable device is found, the file will be treated as a CPU-only C++ implementation.
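
To illustrate how a single `.cu` file can cover both targets and still be callable from Python, here is a hedged sketch of what such a binding might look like; the module name `starter_kit`, the function name, and the preprocessor switch are assumptions for illustration, not the repository's actual API:

```cuda
// Hypothetical sketch in the spirit of starter_kit.cu (names are illustrative).
#include <pybind11/numpy.h>
#include <pybind11/pybind11.h>

#if defined(__CUDACC__) // compiled by nvcc: take the GPU path via Thrust
#include <thrust/device_vector.h>
#include <thrust/reduce.h>
#endif

namespace py = pybind11;

// Sum a NumPy float32 array: on the GPU when CUDA is available, on the CPU otherwise.
static float accumulate(py::array_t<float, py::array::c_style | py::array::forcecast> data) {
    py::buffer_info buf = data.request();
    float const *ptr = static_cast<float const *>(buf.ptr);
#if defined(__CUDACC__)
    thrust::device_vector<float> device(ptr, ptr + buf.size);  // host -> device copy
    return thrust::reduce(device.begin(), device.end(), 0.0f); // GPU reduction
#else
    float sum = 0.0f;
#pragma omp parallel for reduction(+ : sum) // OpenMP fallback for CPU-only builds
    for (py::ssize_t i = 0; i < buf.size; ++i)
        sum += ptr[i];
    return sum;
#endif
}

PYBIND11_MODULE(starter_kit, m) {
    m.def("accumulate", &accumulate, "Sum a float32 array on the GPU, or the CPU as a fallback");
}
```

Once built with `pip install -e .`, such a function would be reachable from Python as `starter_kit.accumulate(array)`, with no CMake involved.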
If VSCode is used, the `tasks.json` file is configured with debuggers for both CPU and GPU code, both in Python and C++.
The `.clang-format` is configured with the LLVM base style, adjusted for wider screens, allowing 120 characters per line.

## Installation
I'd recommend forking the repository for your own projects, but you can also clone it directly:
```bash
git clone https://github.com/ashvardanian/cuda-python-starter-kit.git
cd cuda-python-starter-kit
```

Once pulled down, you can build and run the project with `uv`:
```bash
git submodule update --init --recursive # fetch CCCL libraries
uv pip install -e .[gpu] # or `.[cpu]` for non-CUDA devices
uv run pytest test.py -s -x # build and test until first failure
```

Or, using a conventional Python environment and dependency-management tooling:
```bash
git submodule update --init --recursive # fetch CCCL libraries
pip install -r requirements-gpu.txt # or requirements-cpu.txt
pip install -e . # compile for the current platform
pytest test.py -s -x # test until first failure
python bench.py # saves charts to disk
```

## Workflow
The project is designed to be as simple as possible, with the following workflow:
1. Fork or download the repository.
2. Implement your baseline algorithm in `starter_kit_baseline.py`.
3. Implement your optimized algorithm in `starter_kit.cu`; a kernel sketch follows below.
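
As a starting point for step 3, here is a hedged sketch of a naive CUDA matrix-multiplication kernel: deliberately unoptimized, the kind of first version you would then profile and tune. It assumes row-major `float` matrices already resident in device memory and may differ from the repository's own implementation:

```cuda
// Hypothetical naive kernel: one thread computes one element of C = A * B.
// A is m x k, B is k x n, C is m x n, all row-major and already on the device.
__global__ void matmul_kernel(float const *a, float const *b, float *c, int m, int n, int k) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < m && col < n) {
        float sum = 0.0f;
        for (int i = 0; i < k; ++i)
            sum += a[row * k + i] * b[i * n + col];
        c[row * n + col] = sum;
    }
}

// Host-side launcher: cover the output matrix with 16x16 thread blocks.
void matmul(float const *a, float const *b, float *c, int m, int n, int k) {
    dim3 block(16, 16);
    dim3 grid((n + block.x - 1) / block.x, (m + block.y - 1) / block.y);
    matmul_kernel<<<grid, block>>>(a, b, c, m, n, k);
    cudaDeviceSynchronize(); // keep it simple: block until the kernel finishes
}
```

Shared-memory tiling, and later Tensor Cores, are the natural next optimizations to benchmark against the Numba baseline once this version passes `test.py`.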
## Reading Materials

Beginner GPGPU:
- High-level concepts: [nvidia.com](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)
- Nvidia CuPy UDFs: [cupy.dev](https://docs.cupy.dev/en/stable/user_guide/kernel.html)
- CUDA in Python with Numba: [numba/nvidia-cuda-tutorial](https://github.com/numba/nvidia-cuda-tutorial)
- C++ STL Parallelism on GPUs: [nvidia.com](https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/)

Advanced GPGPU:
- CUDA math intrinsics: [nvidia.com](https://docs.nvidia.com/cuda/cuda-math-api/index.html)
- Troubleshooting Nvidia hardware: [stas00/ml-engineering](https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/nvidia/debug.md)
- Nvidia ISA Generator with SM89 and SM90 codes: [kuterd/nv_isa_solver](https://github.com/kuterd/nv_isa_solver)
- Multi GPU examples: [nvidia/multi-gpu-programming-models](https://github.com/NVIDIA/multi-gpu-programming-models)

Communities:
- CUDA MODE on [Discord](https://discord.com/invite/cudamode)
- r/CUDA on [Reddit](https://www.reddit.com/r/CUDA/)
- NVIDIA Developer Forums on [DevTalk](https://forums.developer.nvidia.com)