https://github.com/siboehm/SGEMM_CUDA

Fast CUDA matrix multiplication from scratch

Host: GitHub
URL: https://github.com/siboehm/SGEMM_CUDA
Owner: siboehm
License: mit
Created: 2022-11-13T00:44:54.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2023-12-28T01:20:12.000Z (over 1 year ago)
Last Synced: 2025-03-28T12:06:26.474Z (4 months ago)
Language: Cuda
Homepage: https://siboehm.com/articles/22/CUDA-MMM
Size: 2.77 MB
Stars: 668
Watchers: 4
Forks: 93
Open Issues: 10
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-gemm - SGEMM_CUDA: Step-by-Step Optimization

README

# Fast CUDA SGEMM from Scratch

Step-by-step optimization of matrix multiplication, implemented in CUDA.

For an explanation of each kernel, see [siboehm.com/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM).
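For orientation, SGEMM is the single-precision operation `C = alpha * A * B + beta * C`. The following is a minimal CPU reference sketch, not code from this repository: the function name `sgemm_reference`, the row-major layout, and the `std::vector` interface are all assumptions made for illustration. A routine like this can serve as a slow baseline for checking a GPU kernel's output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical reference SGEMM: C = alpha * A * B + beta * C.
// A is M x K, B is K x N, C is M x N; all are assumed row-major.
inline void sgemm_reference(std::size_t m, std::size_t n, std::size_t k,
                            float alpha,
                            const std::vector<float>& a,
                            const std::vector<float>& b,
                            float beta,
                            std::vector<float>& c) {
    // Guard against mismatched buffer sizes.
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    assert(c.size() == m * n);

    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            // Inner product of row i of A with column j of B.
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

Each output element costs `K` multiply-add pairs, which is the work that the GPU kernels listed in this README distribute across threads.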

## Overview

Running the kernels on an NVIDIA A6000 (Ampere):

![](benchmark_results.png)

GFLOPs/s at matrix size 4096x4096:

| Kernel                              |  GFLOPs/s | Performance relative to cuBLAS |
|:------------------------------------|----------:|:-------------------------------|
| 1: Naive                            |   `309.0` | 1.3%                           |
| 2: GMEM Coalescing                  |  `1986.5` | 8.5%                           |
| 3: SMEM Caching                     |  `2980.3` | 12.8%                          |
| 4: 1D Blocktiling                   |  `8474.7` | 36.5%                          |
| 5: 2D Blocktiling                   | `15971.7` | 68.7%                          |
| 7: Avoid Bank Conflicts (Linearize) | `16213.4` | 69.7%                          |
| 8: Avoid Bank Conflicts (Offset)    | `16459.2` | 70.8%                          |
| 11: Double Buffering                | `17278.3` | 74.3%                          |
| 6: Vectorized Mem Access            | `18237.3` | 78.4%                          |
| 9: Autotuning                       | `19721.0` | 84.8%                          |
| 10: Warptiling                      | `21779.3` | 93.7%                          |
| 0: cuBLAS                           | `23249.6` | 100.0%                         |
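
As a rough guide to reading these figures, throughput numbers of this kind are conventionally derived from the `2 * M * N * K` floating-point operations in a matrix product (one multiply and one add per inner-product term) divided by the measured runtime. The sketch below assumes that convention; the helper names are hypothetical, and the repository's benchmark harness may count operations differently.

```cpp
#include <cassert>
#include <cstdint>

// FLOP count for one matrix product with A: M x K and B: K x N.
// Assumes the conventional 2*M*N*K count; the O(M*N) work for
// alpha/beta scaling is ignored.
inline double sgemm_flops(std::int64_t m, std::int64_t n, std::int64_t k) {
    return 2.0 * static_cast<double>(m) * static_cast<double>(n) *
           static_cast<double>(k);
}

// Throughput in GFLOPs/s for a run that took `seconds` of wall time.
inline double gflops_per_second(std::int64_t m, std::int64_t n,
                                std::int64_t k, double seconds) {
    assert(seconds > 0.0);
    return sgemm_flops(m, n, k) / seconds * 1e-9;
}

// Kernel throughput as a percentage of a baseline such as cuBLAS.
inline double percent_of_baseline(double kernel_gflops,
                                  double baseline_gflops) {
    assert(baseline_gflops > 0.0);
    return 100.0 * kernel_gflops / baseline_gflops;
}
```

For example, `percent_of_baseline(21779.3, 23249.6)` gives roughly 93.7, matching the Warptiling row above.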

## Setup

1. Install dependencies: CUDA toolkit 12, Python (+ Seaborn), CMake, Ninja. See [environment.yml](environment.yml).

1. Configure NVCC compilation parameters. Look up your GPU's compute capability [here](https://developer.nvidia.com/cuda-gpus). Then edit `CMakeLists.txt` and change:

    ```cmake
    set(CUDA_COMPUTE_CAPABILITY 80)
    ```

1. Build: `mkdir build && cd build && cmake .. && cmake --build .`

1. Run one of the kernels: `DEVICE= ./sgemm `

1. Profiling via [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute) (ncu): `make profile KERNEL=`

Credit goes to [wangzyon/NVIDIA_SGEMM_PRACTICE](https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE) for the benchmarking setup.
