https://github.com/vantaa89/cuda-matmul

Cuda Matrix Multiplication Optimization

Host: GitHub
URL: https://github.com/vantaa89/cuda-matmul
Owner: vantaa89
Created: 2024-12-29T02:32:28.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2026-04-28T17:51:19.000Z (about 2 months ago)
Last Synced: 2026-04-28T19:35:09.755Z (about 2 months ago)
Language: Cuda
Size: 10.7 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# CUDA Matrix Multiplication Optimization

This repository implements six matrix multiplication kernels, applying optimizations incrementally from one kernel to the next. A detailed explanation of each kernel is available [here](https://seojune.org/post/cuda-matmul).

* Kernel 0: A naive implementation.
* Kernel 1: Adds block tiling and shared memory.
* Kernel 2: Introduces thread tiling, increasing the workload per thread.
* Kernel 3: Implements 2D thread tiling.
* Kernel 4: Leverages **tensor cores** for computation.
* Kernel 5: Combines tensor cores with warp tiling for further optimization.
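
To make the first optimization step concrete, here is a minimal sketch of the ideas behind kernels 0 and 1. It is not the repository's code: the names `matmul_naive`, `matmul_tiled`, `max_diff_naive_vs_tiled`, and the tile width `TILE` are hypothetical, and row-major `float` matrices are assumed.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

constexpr int TILE = 16;  // hypothetical tile width, chosen for illustration

// Kernel 0 idea: one thread per element of C, reading A and B from global memory.
// Assumes row-major A (M x K), B (K x N), C (M x N).
__global__ void matmul_naive(const float* A, const float* B, float* C, int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Kernel 1 idea: each block stages TILE x TILE tiles of A and B in shared memory,
// so every global element is loaded once per block instead of once per thread.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        int a_col = t + threadIdx.x;
        int b_row = t + threadIdx.y;
        // Out-of-range elements are padded with zeros so any M, N, K works.
        As[threadIdx.y][threadIdx.x] = (row < M && a_col < K) ? A[row * K + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < K && col < N) ? B[b_row * N + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before anyone reads them
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next load overwrites
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper: runs both kernels on the same inputs and returns
// the largest absolute difference between their outputs.
float max_diff_naive_vs_tiled(int M, int N, int K) {
    std::vector<float> hA((size_t)M * K), hB((size_t)K * N), h0((size_t)M * N), h1((size_t)M * N);
    for (size_t i = 0; i < hA.size(); ++i) hA[i] = float(i % 7) * 0.25f;
    for (size_t i = 0; i < hB.size(); ++i) hB[i] = float(i % 5) - 2.0f;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, h0.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(h0.data(), dC, h0.size() * sizeof(float), cudaMemcpyDeviceToHost);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(h1.data(), dC, h1.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float worst = 0.0f;
    for (size_t i = 0; i < h0.size(); ++i) worst = std::fmax(worst, std::fabs(h0[i] - h1[i]));
    return worst;
}
```

Calling `max_diff_naive_vs_tiled(70, 50, 33)` from a `main` function exercises the boundary handling, since none of the dimensions is a multiple of `TILE`. Thread tiling (kernels 2 and 3) would extend the tiled kernel by letting each thread accumulate several elements of C in registers.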

The code was tested on an NVIDIA GeForce RTX 3060 (compute capability 8.6) with CUDA Driver 12.6.
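
The hardware matters mainly for the tensor core kernels. As a rough illustration of what tensor core code can look like, the sketch below uses CUDA's WMMA API, which requires compute capability 7.0 or newer (for example, compile with `nvcc -arch=sm_86` on the GPU above). Whether the repository uses WMMA or another interface is not stated here, so treat this as an assumption. The names `matmul_wmma` and `wmma_max_error` are hypothetical, and the sketch assumes row-major matrices whose dimensions are multiples of 16.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <cmath>
#include <vector>

using namespace nvcuda;

// One warp (32 threads) computes one 16x16 tile of C = A * B on tensor cores.
// Assumes row-major half A (M x K), half B (K x N), float C (M x N),
// with M, N, K all multiples of 16.
__global__ void matmul_wmma(const half* A, const half* B, float* C, int M, int N, int K) {
    int tile_row = blockIdx.y;  // index of the 16-row band of C
    int tile_col = blockIdx.x;  // index of the 16-column band of C
    if (tile_row * 16 >= M || tile_col * 16 >= N) return;  // uniform across the warp

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // The last argument is the leading dimension (row stride) in elements.
        wmma::load_matrix_sync(a_frag, A + tile_row * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + tile_col * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c_frag += a_frag * b_frag
    }
    wmma::store_matrix_sync(C + tile_row * 16 * N + tile_col * 16, c_frag, N, wmma::mem_row_major);
}

// Hypothetical host helper: runs the kernel and returns the largest absolute
// error against a CPU reference. Inputs are small integers, exact in half.
float wmma_max_error(int M, int N, int K) {
    std::vector<float> fA((size_t)M * K), fB((size_t)K * N), hC((size_t)M * N);
    std::vector<half> hA(fA.size()), hB(fB.size());
    for (size_t i = 0; i < fA.size(); ++i) { fA[i] = float(i % 3) - 1.0f; hA[i] = __float2half(fA[i]); }
    for (size_t i = 0; i < fB.size(); ++i) { fB[i] = float(i % 5) - 2.0f; hB[i] = __float2half(fB[i]); }
    half *dA, *dB;
    float* dC;
    cudaMalloc(&dA, hA.size() * sizeof(half));
    cudaMalloc(&dB, hB.size() * sizeof(half));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(half), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(half), cudaMemcpyHostToDevice);
    matmul_wmma<<<dim3(N / 16, M / 16), 32>>>(dA, dB, dC, M, N, K);
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    float worst = 0.0f;
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += fA[(size_t)r * K + k] * fB[(size_t)k * N + c];
            worst = std::fmax(worst, std::fabs(ref - hC[(size_t)r * N + c]));
        }
    return worst;
}
```

This version launches one warp per output tile and reads operands directly from global memory, so it leaves a lot of reuse on the table. Warp tiling addresses that by having each warp compute several 16x16 tiles and share loaded operands between them.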

## Usage
1. Clone the repository to your local machine.
```
git clone --depth=1 https://github.com/vantaa89/cuda-matmul/
```
1. Build the code using `make`.
1. Run the program with `./main [-k kernel_num] M N K`. For example, `./main -k 5 4096 4096 4096` runs kernel 5 (tensor cores + warp tiling) on matrices of size $M=N=K=4096$.

## Performance Results

Below is the throughput measured for $M=N=K=4096$ on an **NVIDIA GeForce RTX 3060**:

![image](https://github.com/user-attachments/assets/c74671f7-7168-4941-92eb-87284df1ca62)
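
The README does not say how throughput is computed. A common convention, assumed here, is to count $2MNK$ floating-point operations per multiplication (one multiply and one add per inner-product term) and divide by the measured kernel time. The sketch below times an arbitrary launch with CUDA events; `gemm_flops`, `tflops`, and `average_ms` are hypothetical helper names, not functions from the repository.

```cuda
#include <cuda_runtime.h>

// Floating-point operations in one M x N x K matrix multiplication,
// counting one multiply and one add per inner-product term.
inline double gemm_flops(int M, int N, int K) {
    return 2.0 * M * N * K;
}

// Converts an operation count and a time in milliseconds to TFLOP/s.
inline double tflops(double flops, double ms) {
    return flops / (ms * 1e-3) / 1e12;
}

// Returns the average time in milliseconds of `launch`, a callable that
// enqueues one kernel launch on the default stream.
template <typename Launch>
float average_ms(Launch launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();                  // warm-up run, excluded from the measurement
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches have finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / iters;
}
```

For $M=N=K=4096$, `gemm_flops` gives $2 \cdot 4096^3 \approx 1.37 \times 10^{11}$ operations, so a kernel averaging 10 ms per run would be reported at roughly 13.7 TFLOP/s under this convention.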
