https://github.com/vantaa89/cuda-matmul
Cuda Matrix Multiplication Optimization
- Host: GitHub
- URL: https://github.com/vantaa89/cuda-matmul
- Owner: vantaa89
- Created: 2024-12-29T02:32:28.000Z (over 1 year ago)
- Default Branch: master
- Last Pushed: 2026-04-28T17:51:19.000Z (about 2 months ago)
- Last Synced: 2026-04-28T19:35:09.755Z (about 2 months ago)
- Language: Cuda
- Size: 10.7 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CUDA Matrix Multiplication Optimization
This repository implements six matrix multiplication kernels, numbered 0 to 5, that apply incremental optimizations. A detailed explanation of each kernel is available [here](https://seojune.org/post/cuda-matmul).
* Kernel 0: A naive implementation.
* Kernel 1: Adds block tiling and shared memory.
* Kernel 2: Introduces thread tiling, increasing the workload per thread.
* Kernel 3: Implements 2D thread tiling.
* Kernel 4: Leverages **tensor cores** for computation.
* Kernel 5: Combines tensor cores with warp tiling for further optimization.
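To make the first two steps in the list concrete, here is a minimal sketch of a naive kernel and a block-tiled kernel that stages tiles in shared memory. It is an illustration, not the repository's code: the names `matmul_naive`, `matmul_tiled`, `matmul_gpu`, and the tile width `TILE = 32` are hypothetical, and the sketch assumes row-major `float` matrices with `A` of size M×K, `B` of size K×N, and `C` of size M×N.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 32;  // assumed tile width; the repository may use another size

// Naive: one thread per element of C, every operand read from global memory.
__global__ void matmul_naive(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Block tiling: each block loads a TILE x TILE tile of A and of B into shared
// memory, so each global element is read once per block instead of once per thread.
__global__ void matmul_tiled(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        int a_col = t + threadIdx.x;
        int b_row = t + threadIdx.y;
        // Out-of-range elements are padded with zeros so any M, N, K works.
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites it
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper: copies inputs to the GPU, runs one kernel, returns C.
// Error checking is omitted for brevity.
std::vector<float> matmul_gpu(const std::vector<float>& A, const std::vector<float>& B,
                              int M, int N, int K, bool tiled) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * M * K);
    cudaMalloc(&dB, sizeof(float) * K * N);
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, A.data(), sizeof(float) * M * K, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(float) * K * N, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    if (tiled) matmul_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);
    else       matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
    std::vector<float> C(static_cast<size_t>(M) * N);
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

Thread tiling (Kernels 2 and 3) would extend the tiled version by having each thread accumulate several elements of `C` in registers rather than one.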
The code was tested on an NVIDIA GeForce RTX 3060 (compute capability 8.6) with CUDA driver 12.6.
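Tensor cores, as used by Kernels 4 and 5, are exposed in CUDA C++ through the warp-level `nvcuda::wmma` API, which requires compute capability 7.0 or later. The sketch below shows the basic pattern under stated assumptions: one warp per 16×16 output tile, `half` inputs with `float` accumulation, row-major storage, and M, N, K all multiples of 16. Whether the repository uses this API, these types, or this tile shape is not stated here; `matmul_wmma` and `matmul_wmma_host` are hypothetical names.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <vector>

using namespace nvcuda;

// One warp (32 threads) computes one 16x16 tile of C.
// Launch with blockDim = 32 and gridDim = (N / 16, M / 16).
__global__ void matmul_wmma(const half* A, const half* B, float* C,
                            int M, int N, int K) {
    int tile_row = blockIdx.y * 16;
    int tile_col = blockIdx.x * 16;
    if (tile_row >= M || tile_col >= N) return;  // uniform across the warp

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk along K in steps of 16, accumulating A_tile * B_tile into c_frag.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + tile_row * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_col, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + tile_row * N + tile_col, c_frag, N, wmma::mem_row_major);
}

// Hypothetical host helper: converts float inputs to half, runs the kernel,
// and returns C. Assumes M, N, K are multiples of 16; error checking omitted.
std::vector<float> matmul_wmma_host(const std::vector<float>& A,
                                    const std::vector<float>& B,
                                    int M, int N, int K) {
    std::vector<half> hA(A.size()), hB(B.size());
    for (size_t i = 0; i < A.size(); ++i) hA[i] = __float2half(A[i]);
    for (size_t i = 0; i < B.size(); ++i) hB[i] = __float2half(B[i]);

    half *dA, *dB;
    float* dC;
    cudaMalloc(&dA, sizeof(half) * hA.size());
    cudaMalloc(&dB, sizeof(half) * hB.size());
    cudaMalloc(&dC, sizeof(float) * M * N);
    cudaMemcpy(dA, hA.data(), sizeof(half) * hA.size(), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), sizeof(half) * hB.size(), cudaMemcpyHostToDevice);

    matmul_wmma<<<dim3(N / 16, M / 16), 32>>>(dA, dB, dC, M, N, K);

    std::vector<float> C(static_cast<size_t>(M) * N);
    cudaMemcpy(C.data(), dC, sizeof(float) * M * N, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return C;
}
```

A file containing this sketch has to be compiled for an architecture with tensor cores, for example `nvcc -arch=sm_86` for the RTX 3060. Warp tiling (Kernel 5) would go further by letting each warp own several such fragments and reuse loaded tiles across them.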
## Usage
1. Clone the repository to your local machine.
```
git clone --depth=1 https://github.com/vantaa89/cuda-matmul/
```
1. Build the code using `make`.
1. Run the program with `./main [-k kernel_num] M N K`. For example, `./main -k 5 4096 4096 4096` runs kernel 5 (tensor cores + warp tiling) on matrices with $M=N=K=4096$.
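To compare kernels on a run like the one above, throughput for a matrix multiplication is conventionally computed as $2MNK$ floating-point operations (one multiply and one add per inner-product term) divided by the elapsed time. The sketch below shows one way to time GPU work with CUDA events and convert the result to GFLOP/s. It is an assumption about how such a measurement could be made, not a description of what `./main` does; `gflops` and `time_ms` are hypothetical helpers.

```cuda
#include <cuda_runtime.h>

// Converts an elapsed time in milliseconds to GFLOP/s, counting 2*M*N*K operations.
double gflops(long long M, long long N, long long K, float elapsed_ms) {
    double flops = 2.0 * static_cast<double>(M) * static_cast<double>(N) * static_cast<double>(K);
    return flops / (elapsed_ms * 1e-3) / 1e9;
}

// Times whatever GPU work `launch` enqueues on the default stream.
// Returns elapsed milliseconds as reported by CUDA events.
template <typename F>
float time_ms(F launch) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();                      // e.g. a kernel launch wrapped in a lambda
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);    // wait until the recorded work has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

For $M=N=K=4096$ the operation count is $2 \cdot 4096^3 \approx 1.37 \times 10^{11}$, so a kernel that finishes in 100 ms would be running at roughly 1374 GFLOP/s. In practice a warm-up launch and an average over several runs give more stable numbers.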
## Performance Results
Below is the throughput measured for $M=N=K=4096$ on an **NVIDIA GeForce RTX 3060**:
