https://github.com/gaurisharan/cuda-ml-kernels

Practice repo for CUDA C++ GPU kernels for ML and HPC.

# CUDA ML Kernels

## 🚀 Motivation

This repository implements **custom CUDA kernels for common ML operations**, benchmarked against PyTorch's highly optimized cuBLAS/cuDNN kernels. The goal is to:

- Understand GPU parallelization patterns
- Compare naive kernel performance vs. library implementations
- Build intuition for ML Systems performance engineering

---

## ๐Ÿ“ Repository Structure

```
.
├── kernels/
│   ├── matrix_multiply.cu
│   ├── vector_add.cu
│   ├── relu.cu
│   ├── dot_product.cu
│   └── intro.cu
├── benchmarks/
│   └── benchmark.py
└── .gitignore
```

- `kernels/`: CUDA C++ kernel implementations
- `benchmarks/`: Python script to benchmark kernels vs. PyTorch

---

## ⚡ Kernels Implemented

| Kernel | Description |
|-------------------|-----------------------------------|
| `matrix_multiply` | Matrix multiplication (1024x1024) |
| `vector_add` | Elementwise vector addition |
| `relu` | ReLU activation function |
| `dot_product` | Vector dot product reduction |
| `intro` | 4x4 matrix multiplication demo |
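
For orientation, an elementwise kernel such as `vector_add` typically follows the grid-stride pattern sketched below. This is an illustrative, self-contained sketch, not the repo's actual implementation, which may differ in launch configuration and memory management:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Elementwise vector addition: each thread processes one or more elements.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Grid-stride loop: covers all n elements even when n exceeds
    // the total number of launched threads.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy also works.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vector_add<<<256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same launch-and-index skeleton underlies `relu.cu` as well; only the per-element operation changes.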

---

## 🔧 Build Instructions

1. **Ensure NVIDIA CUDA Toolkit is installed.**

2. **Compile each `.cu` file:**

```bash
cd kernels

nvcc -o matrix_multiply.exe matrix_multiply.cu
nvcc -o vector_add.exe vector_add.cu
nvcc -o relu.exe relu.cu
nvcc -o dot_product.exe dot_product.cu
nvcc -o intro.exe intro.cu
```

> On Linux/macOS, omit the `.exe` extension from the output names.

---

## 🧪 Running Benchmarks

From the repo root:

```bash
cd benchmarks
python benchmark.py
```
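
For GPU-side timings independent of the Python harness, CUDA events are the standard tool. A minimal sketch follows; the kernel and sizes here are placeholders for illustration, not code from this repo:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload just to have something to time.
__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```

Events record timestamps on the GPU's own timeline, so they avoid counting host-side launch latency the way wall-clock timing around a launch would.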

---

## 📊 Results Summary

| Kernel | PyTorch Time (ms) | Custom CUDA Time (ms) | Speedup |
|----------------------|-------------------|-----------------------|---------|
| `matrix_multiply` | 14.28 | 6.95 | 2.06x |
| `vector_add` | 2.39 | 1.35 | 1.77x |
| `relu` | 2.97 | 0.56 | 5.35x |
| `dot_product` | ~0 | 1.33 | Slower |
| `intro` (4x4 matmul) | 2.25 | 1.08 | 2.09x |

---

## 💡 Key Insights

* **Matrix multiplication and ReLU** kernels show significant speedups, demonstrating effective GPU thread parallelization.
* **Vector addition** gains are modest: elementwise addition is memory-bandwidth bound, and PyTorch's elementwise kernels already run close to peak bandwidth.
* **Dot product** is slower due to naive reduction implementation vs. PyTorch's warp-level optimized reductions.
* **Small matmul (`intro`)** shows that at tiny problem sizes kernel launch overhead dominates, so a lightweight custom launch can edge out the library call.
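
The warp-level reduction that the naive dot product lacks can be sketched as follows. This is an illustrative sketch of the technique, not the repo's `dot_product.cu`:

```cuda
#include <cuda_runtime.h>

// Warp-level sum via shuffle intrinsics: no shared memory, no __syncthreads().
__inline__ __device__ float warp_reduce_sum(float val) {
    // Halving strides fold the warp's 32 partial sums into lane 0.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp's total
}

__global__ void dot_product(const float* a, const float* b, float* out, int n) {
    // Each thread accumulates a private partial sum over a grid-stride loop.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];

    sum = warp_reduce_sum(sum);
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, sum);  // one atomic per warp instead of per thread
}
```

Reducing within a warp first cuts global atomics by a factor of 32, which is the kind of optimization that closes the gap against PyTorch's reduction kernels.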

---

## ๐Ÿ“ Future Improvements

* Implement **warp-level reductions** for dot product
* Integrate **unit tests** comparing kernel outputs with PyTorch for correctness validation
* Extend to **batched kernels** relevant for end-to-end ML pipeline acceleration

---

## 👤 Author

Gauri Sharan

---

## 📜 License

MIT License