Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/siboehm/SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
- Host: GitHub
- URL: https://github.com/siboehm/SGEMM_CUDA
- Owner: siboehm
- License: mit
- Created: 2022-11-13T00:44:54.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2023-12-28T01:20:12.000Z (12 months ago)
- Last Synced: 2024-08-02T14:05:13.884Z (4 months ago)
- Language: Cuda
- Homepage: https://siboehm.com/articles/22/CUDA-MMM
- Size: 2.77 MB
- Stars: 375
- Watchers: 3
- Forks: 49
- Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-gemm - SGEMM_CUDA: Step-by-Step Optimization
README
# Fast CUDA SGEMM from Scratch
Step-by-step optimization of matrix multiplication, implemented in CUDA.
For an explanation of each kernel, see [siboehm.com/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM).

## Overview
Running the kernels on an NVIDIA A6000 (Ampere):
![](benchmark_results.png)
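Throughput figures of this kind are typically derived from a timed kernel launch. The sketch below is one way to do that, not the repository's benchmarking code. It assumes the common convention of counting `2 * M * N * K` floating-point operations per SGEMM, times a deliberately simple one-thread-per-output kernel with CUDA events, and converts the result to GFLOPs/s. The names `sgemm_baseline_sketch`, `gflops_per_second`, `percent_of_baseline`, and `time_baseline_sgemm` are hypothetical and do not come from the project.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                            \
  do {                                                              \
    cudaError_t err_ = (call);                                      \
    if (err_ != cudaSuccess) {                                      \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                   cudaGetErrorString(err_), __FILE__, __LINE__);   \
      std::exit(EXIT_FAILURE);                                      \
    }                                                               \
  } while (0)

// Illustrative baseline (not the repository's kernel): one thread computes
// one element of C = alpha * A * B + beta * C. All matrices are row-major;
// A is MxK, B is KxN, C is MxN.
__global__ void sgemm_baseline_sketch(int M, int N, int K, float alpha,
                                      const float *A, const float *B,
                                      float beta, float *C) {
  const int row = blockIdx.y * blockDim.y + threadIdx.y;
  const int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < M && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < K; ++k) acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
  }
}

// Assumed FLOP count: one multiply and one add per inner-loop step.
double gflops_per_second(long long M, long long N, long long K,
                         double milliseconds) {
  const double flops = 2.0 * (double)M * (double)N * (double)K;
  return flops / (milliseconds * 1e-3) / 1e9;
}

// Throughput as a percentage of a baseline such as cuBLAS.
double percent_of_baseline(double gflops, double baseline_gflops) {
  return 100.0 * gflops / baseline_gflops;
}

// Time `repeats` launches on square n x n matrices; return mean GFLOPs/s.
double time_baseline_sgemm(int n, int repeats) {
  const size_t count = static_cast<size_t>(n) * n;
  const size_t bytes = count * sizeof(float);
  std::vector<float> ones(count, 1.0f);

  float *dA = nullptr, *dB = nullptr, *dC = nullptr;
  CUDA_CHECK(cudaMalloc(&dA, bytes));
  CUDA_CHECK(cudaMalloc(&dB, bytes));
  CUDA_CHECK(cudaMalloc(&dC, bytes));
  CUDA_CHECK(cudaMemcpy(dA, ones.data(), bytes, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemcpy(dB, ones.data(), bytes, cudaMemcpyHostToDevice));
  CUDA_CHECK(cudaMemset(dC, 0, bytes));

  const dim3 block(32, 32);
  const dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

  // Warm-up launch so one-time startup costs are not included in the timing.
  sgemm_baseline_sketch<<<grid, block>>>(n, n, n, 1.0f, dA, dB, 0.0f, dC);
  CUDA_CHECK(cudaGetLastError());
  CUDA_CHECK(cudaDeviceSynchronize());

  cudaEvent_t start, stop;
  CUDA_CHECK(cudaEventCreate(&start));
  CUDA_CHECK(cudaEventCreate(&stop));
  CUDA_CHECK(cudaEventRecord(start));
  for (int i = 0; i < repeats; ++i)
    sgemm_baseline_sketch<<<grid, block>>>(n, n, n, 1.0f, dA, dB, 0.0f, dC);
  CUDA_CHECK(cudaEventRecord(stop));
  CUDA_CHECK(cudaEventSynchronize(stop));
  CUDA_CHECK(cudaGetLastError());

  float total_ms = 0.0f;
  CUDA_CHECK(cudaEventElapsedTime(&total_ms, start, stop));

  CUDA_CHECK(cudaEventDestroy(start));
  CUDA_CHECK(cudaEventDestroy(stop));
  CUDA_CHECK(cudaFree(dA));
  CUDA_CHECK(cudaFree(dB));
  CUDA_CHECK(cudaFree(dC));
  return gflops_per_second(n, n, n, total_ms / repeats);
}
```

Calling `time_baseline_sgemm(4096, 10)` from a `main` function and compiling with `nvcc` gives a number that can be set against a reference with `percent_of_baseline`. Measured values depend on the GPU, clocks, and compiler flags, so they will not reproduce this project's figures exactly.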
GFLOPs/s at matrix size 4096x4096, sorted by throughput:
| Kernel | GFLOPs/s | Performance relative to cuBLAS |
|:------------------------------------|----------:|:-------------------------------|
| 1: Naive | `309.0` | 1.3% |
| 2: GMEM Coalescing | `1986.5` | 8.5% |
| 3: SMEM Caching | `2980.3` | 12.8% |
| 4: 1D Blocktiling | `8474.7` | 36.5% |
| 5: 2D Blocktiling | `15971.7` | 68.7% |
| 7: Avoid Bank Conflicts (Linearize) | `16213.4` | 69.7% |
| 8: Avoid Bank Conflicts (Offset) | `16459.2` | 70.8% |
| 11: Double Buffering | `17278.3` | 74.3% |
| 6: Vectorized Mem Access | `18237.3` | 78.4% |
| 9: Autotuning | `19721.0` | 84.8% |
| 10: Warptiling | `21779.3` | 93.7% |
| 0: cuBLAS | `23249.6` | 100.0% |

## Setup
1. Install dependencies: CUDA toolkit 12, Python (+ Seaborn), CMake, Ninja. See [environment.yml](environment.yml).
1. Configure NVCC compilation parameters. Look up your GPU's compute
capability [here](https://developer.nvidia.com/cuda-gpus). Then edit `CMakeLists.txt` and change:
```cmake
set(CUDA_COMPUTE_CAPABILITY 80)
```
1. Build: `mkdir build && cd build && cmake .. && cmake --build .`
1. Run one of the kernels: `DEVICE=<device_id> ./sgemm <kernel number>`
1. Profile with [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute) (ncu): `make profile KERNEL=<kernel number>`

Credit goes to [wangzyon/NVIDIA_SGEMM_PRACTICE](https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE) for the benchmarking setup.
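As an alternative to looking up the compute capability online, it can be read from the CUDA runtime on the machine itself. The sketch below lists each visible device with its index and compute capability. It rests on two assumptions not confirmed by this README: that the `CUDA_COMPUTE_CAPABILITY` value is the major and minor version written together (8.0 becomes `80`, as in the example above), and that the `DEVICE` variable expects a CUDA device index. The function names are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumed mapping: compute capability 8.6 is written as 86, 8.0 as 80.
int capability_as_cmake_value(int major, int minor) {
  return major * 10 + minor;
}

// Print index, name, and compute capability of every visible CUDA device.
// Returns the number of devices, or -1 if the runtime reports an error.
int list_cuda_devices() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                 cudaGetErrorString(err));
    return -1;
  }
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, i) != cudaSuccess) continue;
    std::printf("device %d: %s, compute capability %d.%d (value %d)\n", i,
                prop.name, prop.major, prop.minor,
                capability_as_cmake_value(prop.major, prop.minor));
  }
  return count;
}
```

Calling `list_cuda_devices()` from a `main` function and compiling with `nvcc` prints one line per GPU. The index in the first column is the candidate for `DEVICE`, and the final value is the candidate for `CUDA_COMPUTE_CAPABILITY`.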