https://github.com/siboehm/SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

# Fast CUDA SGEMM from Scratch

Step-by-step optimization of matrix multiplication, implemented in CUDA.
For an explanation of each kernel, see [siboehm.com/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM).

## Overview

Running the kernels on an NVIDIA A6000 (Ampere):

![](benchmark_results.png)

GFLOP/s at matrix size 4096x4096:

| Kernel | GFLOP/s | Performance relative to cuBLAS |
|:------------------------------------|----------:|:-------------------------------|
| 1: Naive | `309.0` | 1.3% |
| 2: GMEM Coalescing | `1986.5` | 8.5% |
| 3: SMEM Caching | `2980.3` | 12.8% |
| 4: 1D Blocktiling | `8474.7` | 36.5% |
| 5: 2D Blocktiling | `15971.7` | 68.7% |
| 7: Avoid Bank Conflicts (Linearize) | `16213.4` | 69.7% |
| 8: Avoid Bank Conflicts (Offset) | `16459.2` | 70.8% |
| 11: Double Buffering | `17278.3` | 74.3% |
| 6: Vectorized Mem Access | `18237.3` | 78.4% |
| 9: Autotuning | `19721.0` | 84.8% |
| 10: Warptiling | `21779.3` | 93.7% |
| 0: cuBLAS | `23249.6` | 100.0% |
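
To help read the table, the sketch below relates a GFLOP/s figure to a run time and to the "relative to cuBLAS" column. It assumes the conventional count of 2·N³ floating-point operations for an N×N matrix product and ignores the extra work for the `alpha`/`beta` scaling, so the benchmark harness may count slightly differently. The function names are hypothetical and not part of the repository.

```cpp
#include <cstdint>

// Assumption: one N x N SGEMM costs 2 * N^3 floating-point operations
// (one multiply and one add per inner-product term).
constexpr double sgemm_flops(std::int64_t n) {
    return 2.0 * static_cast<double>(n) * static_cast<double>(n) *
           static_cast<double>(n);
}

// Run time in milliseconds implied by a throughput given in GFLOP/s.
constexpr double runtime_ms(std::int64_t n, double gflops_per_s) {
    return sgemm_flops(n) / (gflops_per_s * 1e9) * 1e3;
}

// Throughput of a kernel as a percentage of a baseline (here: cuBLAS).
constexpr double percent_of_baseline(double kernel_gflops,
                                     double baseline_gflops) {
    return 100.0 * kernel_gflops / baseline_gflops;
}
```

Under these assumptions, `sgemm_flops(4096)` is about 137.4 GFLOP, so the cuBLAS row corresponds to roughly 5.9 ms per multiplication (`runtime_ms(4096, 23249.6)`) and the naive kernel to roughly 445 ms. `percent_of_baseline(309.0, 23249.6)` gives about 1.3%, matching the first row.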

## Setup

1. Install dependencies: CUDA toolkit 12, Python (+ Seaborn), CMake, Ninja. See [environment.yml](environment.yml).
1. Configure NVCC compilation parameters. Look up your GPU's compute
   capability [here](https://developer.nvidia.com/cuda-gpus). Then edit `CMakeLists.txt` and change:
   ```cmake
   set(CUDA_COMPUTE_CAPABILITY 80)
   ```
1. Build: `mkdir build && cd build && cmake .. && cmake --build .`
1. Run one of the kernels: `DEVICE= ./sgemm `
1. Profile with [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute) (ncu): `make profile KERNEL=`
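
The value `80` in the CMake snippet above appears to encode compute capability 8.0 as major times ten plus minor, the same convention as the `sm_80` architecture names used by `nvcc`. The sketch below shows that mapping. The helper name is hypothetical, and the assumption that `CUDA_COMPUTE_CAPABILITY` follows this convention is inferred from the default value rather than stated in the README.

```cpp
// Hypothetical helper: convert a compute capability "major.minor"
// into the integer form assumed for CUDA_COMPUTE_CAPABILITY.
// Assumption: the value is major * 10 + minor, as in nvcc's sm_XY names.
constexpr int cmake_compute_capability(int major, int minor) {
    return major * 10 + minor;
}
```

For example, a GPU listed with compute capability 8.6 would use `86`. On a machine with the CUDA runtime, the two numbers are also available as the `major` and `minor` fields of `cudaDeviceProp`, filled in by `cudaGetDeviceProperties`.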

Credit goes to [wangzyon/NVIDIA_SGEMM_PRACTICE](https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE) for the benchmarking setup.