https://github.com/gaurisharan/cuda-ml-kernels

Practice repo for CUDA C++ GPU kernels for ML and HPC.

# CUDA ML Kernels

## 🚀 Motivation

This repository implements **custom CUDA kernels for common ML operations**, benchmarked against PyTorch's highly optimized cuBLAS/cuDNN kernels. The goal is to:

- Understand GPU parallelization patterns
- Compare naive kernel performance vs. library implementations
- Build intuition for ML Systems performance engineering

---

## ๐Ÿ“ Repository Structure

```
.
├── kernels/
│   ├── matrix_multiply.cu
│   ├── vector_add.cu
│   ├── relu.cu
│   ├── dot_product.cu
│   └── intro.cu
├── benchmarks/
│   └── benchmark.py
└── .gitignore
```

- `kernels/`: CUDA C++ kernel implementations
- `benchmarks/`: Python script to benchmark kernels vs. PyTorch

---

## ⚡ Kernels Implemented

| Kernel | Description |
|-------------------|-----------------------------------|
| `matrix_multiply` | Matrix multiplication (1024x1024) |
| `vector_add` | Elementwise vector addition |
| `relu` | ReLU activation function |
| `dot_product` | Vector dot product reduction |
| `intro` | 4x4 matrix multiplication demo |
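
For orientation, an elementwise kernel such as `vector_add` typically follows the grid-stride pattern sketched below. This is an illustrative, self-contained sketch, not the repo's actual implementation, which may differ in launch configuration and memory management:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Elementwise vector addition: each thread processes one or more elements.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    // Grid-stride loop: covers all n elements even when n exceeds
    // the total number of launched threads.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Unified memory keeps the sketch short; explicit cudaMemcpy also works.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    vector_add<<<256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The same launch-and-index skeleton underlies `relu.cu` as well; only the per-element operation changes.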

---

## 🔧 Build Instructions

1. **Ensure NVIDIA CUDA Toolkit is installed.**

2. **Compile each `.cu` file:**

```bash
cd kernels

nvcc -o matrix_multiply.exe matrix_multiply.cu
nvcc -o vector_add.exe vector_add.cu
nvcc -o relu.exe relu.cu
nvcc -o dot_product.exe dot_product.cu
nvcc -o intro.exe intro.cu
```

> On Linux/macOS, omit the `.exe` extension from the output names.

---

## 🧪 Running Benchmarks

From the repo root:

```bash
cd benchmarks
python benchmark.py
```
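
For GPU-side timings independent of the Python harness, CUDA events are the standard tool. A minimal sketch follows; the kernel and sizes here are placeholders for illustration, not code from this repo:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder workload just to have something to time.
__global__ void scale(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc(&x, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    scale<<<(n + 255) / 256, 256>>>(x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the stop event has occurred

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    return 0;
}
```

Events record timestamps on the GPU's own timeline, so they avoid counting host-side launch latency the way wall-clock timing around a launch would.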

---

## 📊 Results Summary

| Kernel | PyTorch Time (ms) | Custom CUDA Time (ms) | Speedup |
|----------------------|-------------------|-----------------------|---------|
| `matrix_multiply` | 14.28 | 6.95 | 2.06x |
| `vector_add` | 2.39 | 1.35 | 1.77x |
| `relu` | 2.97 | 0.56 | 5.35x |
| `dot_product` | ~0 | 1.33 | Slower |
| `intro` (4x4 matmul) | 2.25 | 1.08 | 2.09x |

---

## 💡 Key Insights

* **Matrix multiplication and ReLU** kernels show significant speedups, demonstrating effective GPU thread parallelization.
* **Vector addition** gains are modest: elementwise addition is memory-bandwidth bound, and PyTorch's elementwise kernels already run close to peak bandwidth.
* **Dot product** is slower due to naive reduction implementation vs. PyTorch's warp-level optimized reductions.
* **Small matmul (`intro`)** shows that at tiny problem sizes kernel launch overhead dominates, so a lightweight custom launch can edge out the library call.
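
The warp-level reduction that the naive dot product lacks can be sketched as follows. This is an illustrative sketch of the technique, not the repo's `dot_product.cu`:

```cuda
#include <cuda_runtime.h>

// Warp-level sum via shuffle intrinsics: no shared memory, no __syncthreads().
__inline__ __device__ float warp_reduce_sum(float val) {
    // Halving strides fold the warp's 32 partial sums into lane 0.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 holds the warp's total
}

__global__ void dot_product(const float* a, const float* b, float* out, int n) {
    // Each thread accumulates a private partial sum over a grid-stride loop.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];

    sum = warp_reduce_sum(sum);
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, sum);  // one atomic per warp instead of per thread
}
```

Reducing within a warp first cuts global atomics by a factor of 32, which is the kind of optimization that closes the gap against PyTorch's reduction kernels.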

---

## ๐Ÿ“ Future Improvements

* Implement **warp-level reductions** for dot product
* Integrate **unit tests** comparing kernel outputs with PyTorch for correctness validation
* Extend to **batched kernels** relevant for end-to-end ML pipeline acceleration

---

## 👤 Author

Gauri Sharan

---

## 📜 License

MIT License