{"id":16976151,"url":"https://github.com/ashvardanian/cuda-python-starter-kit","last_synced_at":"2025-07-13T08:09:58.680Z","repository":{"id":253735728,"uuid":"840154846","full_name":"ashvardanian/cuda-python-starter-kit","owner":"ashvardanian","description":"Parallel Computing starter project to build GPU \u0026 CPU kernels in CUDA \u0026 C++ and call them from Python without a single line of CMake using PyBind11","archived":false,"fork":false,"pushed_at":"2025-03-11T20:59:47.000Z","size":244,"stargazers_count":26,"open_issues_count":3,"forks_count":3,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-09T00:55:31.041Z","etag":null,"topics":["cmake","cuda","cuda-programming","hip","hpc","matrix-multiplication","openmp","parallel-computing","parallel-programming","pybind","pybind11","python","starter-kit","starter-template","tutorial"],"latest_commit_sha":null,"homepage":"https://ashvardanian.com/tags/less-slow","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ashvardanian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-09T04:57:40.000Z","updated_at":"2025-05-28T20:25:02.000Z","dependencies_parsed_at":"2024-08-25T20:34:19.215Z","dependency_job_id":"53972780-9507-4fc7-947b-829922394fb8","html_url":"https://github.com/ashvardanian/cuda-python-starter-kit","commit_stats":null,"previous_names":["ashvardanian/cuda-python-starter-kit","ashvardanian/cpp-cuda-python-starter-kit"],"tags_count":0,"template":true,"template_full_name":null,"purl":"pkg:github/ashvardanian/cuda-python-starter-k
it","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fcuda-python-starter-kit","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fcuda-python-starter-kit/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fcuda-python-starter-kit/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fcuda-python-starter-kit/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ashvardanian","download_url":"https://codeload.github.com/ashvardanian/cuda-python-starter-kit/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ashvardanian%2Fcuda-python-starter-kit/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265108514,"owners_count":23712466,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cmake","cuda","cuda-programming","hip","hpc","matrix-multiplication","openmp","parallel-computing","parallel-programming","pybind","pybind11","python","starter-kit","starter-template","tutorial"],"created_at":"2024-10-14T01:25:06.753Z","updated_at":"2025-07-13T08:09:58.658Z","avatar_url":"https://github.com/ashvardanian.png","language":"Cuda","readme":"# C++ \u0026 CUDA Starter Kit for Python Developers\n\n![CUDA Python Starter Kit Thumbnail](https://github.com/ashvardanian/ashvardanian/blob/master/repositories/cuda-python-starter-kit.jpg?raw=true)\n\nOne of the most common workflows in high-performance computing is to 1️⃣ prototype 
algorithms in Python and then 2️⃣ port them to C++ and CUDA.\nIt's a simple way to test ideas quickly, but configuring the build tools for such heterogeneous code + heterogeneous hardware projects is a pain, often amplified by the error-prone syntax of CMake.\nThis project provides a pre-configured environment for such workflows:\n\n1. using only `setup.py` and `requirements-{cpu,gpu}.txt` to manage the build process,\n2. supporting OpenMP for parallelism on the CPU, and CUDA for the GPU, and\n3. including [CCCL](https://github.com/NVIDIA/cccl) libraries, like Thrust and CUB, to simplify the code.\n\nAs an example, the repository implements, tests, and benchmarks only two operations: array accumulation and matrix multiplication.\nThe baseline Python + Numba implementations are placed in `starter_kit_baseline.py`, and the optimized CUDA and OpenMP implementations are placed in `starter_kit.cu`.\nIf no CUDA-capable device is found, the file will be treated as a CPU-only C++ implementation.\nIf VSCode is used, the `tasks.json` file is configured with debuggers for both CPU and GPU code, both in Python and C++.\nThe `.clang-format` is configured with LLVM base style, adjusted for wider screens, allowing 120 characters per line.\n\n## Installation\n\nI'd recommend forking the repository for your own projects, but you can also clone it directly:\n\n```bash\ngit clone https://github.com/ashvardanian/cuda-python-starter-kit.git\ncd cuda-python-starter-kit\n```\n\nOnce pulled down, you can build and run the project with `uv`:\n\n```bash\ngit submodule update --init --recursive     # fetch CCCL libraries\nuv pip install -e .[gpu]                    # or `.[cpu]` for non-CUDA devices\nuv run pytest test.py -s -x                 # build and test until first failure\n```\n\nOr using a conventional Python environment and dependency management tooling:\n\n```bash\ngit submodule update --init --recursive     # fetch CCCL libraries\npip install -r 
requirements-gpu.txt         # or requirements-cpu.txt\npip install -e .                            # compile for the current platform\npytest test.py -s -x                        # test until first failure\npython bench.py                             # saves charts to disk\n```\n\n## Workflow\n\nThe project is designed to be as simple as possible, with the following workflow:\n\n1. Fork or download the repository.\n2. Implement your baseline algorithm in `starter_kit_baseline.py`.\n3. Implement your optimized algorithm in `starter_kit.cu`.\n\n## Reading Materials\n\nBeginner GPGPU:\n\n- High-level concepts: [nvidia.com](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)\n- Nvidia CuPy UDFs: [cupy.dev](https://docs.cupy.dev/en/stable/user_guide/kernel.html)\n- CUDA in Python with Numba: [numba/nvidia-cuda-tutorial](https://github.com/numba/nvidia-cuda-tutorial)\n- C++ STL Parallelism on GPUs: [nvidia.com](https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/)\n\nAdvanced GPGPU:\n\n- CUDA math intrinsics: [nvidia.com](https://docs.nvidia.com/cuda/cuda-math-api/index.html)\n- Troubleshooting Nvidia hardware: [stas00/ml-engineering](https://github.com/stas00/ml-engineering/blob/master/compute/accelerator/nvidia/debug.md)\n- Nvidia ISA Generator with SM89 and SM90 codes: [kuterd/nv_isa_solver](https://github.com/kuterd/nv_isa_solver)\n- Multi GPU examples: [nvidia/multi-gpu-programming-models](https://github.com/NVIDIA/multi-gpu-programming-models)\n\nCommunities:\n\n- CUDA MODE on [Discord](https://discord.com/invite/cudamode)\n- r/CUDA on [Reddit](https://www.reddit.com/r/CUDA/)\n- NVIDIA Developer Forums on 
[DevTalk](https://forums.developer.nvidia.com)\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fcuda-python-starter-kit","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fashvardanian%2Fcuda-python-starter-kit","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fashvardanian%2Fcuda-python-starter-kit/lists"}