https://github.com/gaurisharan/cuda-ml-kernels
Practice repo for CUDA C++ GPU kernels for ML and HPC.
- Host: GitHub
- URL: https://github.com/gaurisharan/cuda-ml-kernels
- Owner: gaurisharan
- Created: 2025-07-06T15:08:53.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-07-06T16:03:25.000Z (6 months ago)
- Last Synced: 2025-07-06T16:34:20.984Z (6 months ago)
- Topics: cpp, cuda, gpu, hpc, kernels, ml, parallel-computing, systems-ml
- Language: Cuda
- Homepage:
- Size: 190 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CUDA ML Kernels
## Motivation
This repository implements **custom CUDA kernels for common ML operations**, benchmarked against PyTorch's highly optimized cuBLAS/cuDNN kernels. The goal is to:
- Understand GPU parallelization patterns
- Compare naive kernel performance vs. library implementations
- Build intuition for ML Systems performance engineering
---
## Repository Structure
```
.
├── kernels/
│   ├── matrix_multiply.cu
│   ├── vector_add.cu
│   ├── relu.cu
│   ├── dot_product.cu
│   └── intro.cu
├── benchmarks/
│   └── benchmark.py
└── .gitignore
```
- `kernels/`: CUDA C++ kernel implementations
- `benchmarks/`: Python script to benchmark kernels vs. PyTorch
---
## Kernels Implemented
| Kernel | Description |
|-------------------|-----------------------------------|
| `matrix_multiply` | Matrix multiplication (1024x1024) |
| `vector_add` | Elementwise vector addition |
| `relu` | ReLU activation function |
| `dot_product` | Vector dot product reduction |
| `intro` | 4x4 matrix multiplication demo |
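To illustrate the one-thread-per-element parallelization pattern these kernels follow, here is a minimal elementwise vector-add kernel. This is a generic sketch of the technique, not necessarily the exact code in `vector_add.cu`:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// One thread per element: thread i computes c[i] = a[i] + b[i].
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    // Unified memory keeps the example short; a benchmark would use
    // cudaMalloc + explicit cudaMemcpy to account for transfers separately.
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The guard `if (i < n)` matters because the grid is rounded up to a whole number of blocks, so the last block may contain threads past the end of the array.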
---
## Build Instructions
1. **Ensure NVIDIA CUDA Toolkit is installed.**
2. **Compile each `.cu` file:**
```bash
cd kernels
nvcc -o matrix_multiply.exe matrix_multiply.cu
nvcc -o vector_add.exe vector_add.cu
nvcc -o relu.exe relu.cu
nvcc -o dot_product.exe dot_product.cu
nvcc -o intro.exe intro.cu
```
> On Linux/macOS, omit the `.exe` extension.
---
## Running Benchmarks
From the repo root:
```bash
cd benchmarks
python benchmark.py
```
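The exact contents of `benchmark.py` are not shown here, but a timing harness of the following shape (warmup runs plus averaged repeats; all names hypothetical) is a common way to produce numbers like those in the table below:

```python
import time

def time_callable(fn, warmup=3, repeats=10):
    """Return the mean wall-clock time of fn() in milliseconds.

    Warmup runs are discarded so one-time costs (allocation, JIT,
    kernel compilation) don't skew the average. For GPU work, the
    device should also be synchronized before each clock read so
    that asynchronous kernel launches are fully timed.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1000.0

# Example: time a cheap CPU-side operation.
ms = time_callable(lambda: sum(range(10_000)))
print(f"{ms:.3f} ms per call")
```

For the GPU side, PyTorch's `torch.cuda.Event(enable_timing=True)` pairs give more accurate device-side timings than host clocks.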
---
## Results Summary
| Kernel | PyTorch Time (ms) | Custom CUDA Time (ms) | Speedup |
|----------------------|-------------------|-----------------------|---------|
| `matrix_multiply` | 14.28 | 6.95 | 2.06x |
| `vector_add` | 2.39 | 1.35 | 1.77x |
| `relu` | 2.97 | 0.56 | 5.35x |
| `dot_product` | ~0 | 1.33 | Slower |
| `intro` (4x4 matmul) | 2.25 | 1.08 | 2.09x |
---
## Key Insights
* **Matrix multiplication and ReLU** kernels show significant speedups, demonstrating effective GPU thread parallelization.
* **Vector addition** gains are modest: elementwise addition is memory-bandwidth-bound, so PyTorch's kernels already run close to the hardware limit and leave little headroom.
* **Dot product** is slower due to naive reduction implementation vs. PyTorch's warp-level optimized reductions.
* **Small matmul (intro)** timings are dominated by kernel launch overhead rather than compute, so this comparison mostly measures per-launch cost.
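The warp-level reduction proposed under Future Improvements can be sketched with `__shfl_down_sync`, which exchanges registers directly between lanes of a warp without touching shared memory. This is the generic pattern, not code from this repo:

```cuda
#include <cuda_runtime.h>

// Reduce 32 per-lane values to a single sum; after the loop,
// lane 0 holds the warp's total.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

__global__ void dot_product(const float *a, const float *b,
                            float *out, int n) {
    float sum = 0.0f;
    // Grid-stride loop: each thread accumulates a partial product,
    // so the kernel works for any n regardless of grid size.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += a[i] * b[i];

    sum = warp_reduce_sum(sum);

    // One atomicAdd per warp instead of one per thread cuts
    // global-memory contention by a factor of 32.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, sum);
}
```

A further refinement, closer to what PyTorch's reductions do, is a two-stage scheme: warp reductions into shared memory, then a final warp reduction per block, with one atomic (or a second kernel pass) per block.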
---
## Future Improvements
* Implement **warp-level reductions** for dot product
* Integrate **unit tests** comparing kernel outputs with PyTorch for correctness validation
* Extend to **batched kernels** relevant for end-to-end ML pipeline acceleration
---
## Author
Gauri Sharan
---
## License
MIT License