Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/chanzhennan/cuda_gemm_benchmark
Based on gtest/benchmark; refer to https://github.com/Liu-xiandong/How_to_optimize_in_GPU
- Host: GitHub
- URL: https://github.com/chanzhennan/cuda_gemm_benchmark
- Owner: chanzhennan
- Created: 2023-01-22T23:29:12.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-09-09T07:50:07.000Z (5 months ago)
- Last Synced: 2024-09-09T09:28:21.767Z (5 months ago)
- Language: Cuda
- Size: 207 KB
- Stars: 5
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-cuda-triton-hpc - chanzhennan/cuda_gemm_benchmark - Based on gtest/benchmark, refer to [Liu-xiandong/How_to_optimize_in_GPU](https://github.com/Liu-xiandong/How_to_optimize_in_GPU). (Learning Resources)
README
## Introduction
This repository showcases a series of GEMM optimization techniques aimed at enhancing performance of the operation
```
C = alpha * A * B + beta * C
```
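As a point of reference, a minimal naive kernel for this operation might look as follows. This is an illustrative sketch (corresponding in spirit to MatrixMulCUDA0 below), not the repository's exact code:
```
// Naive GEMM sketch: each thread computes one element of C.
// A is M x K, B is K x N, C is M x N, all row-major.
__global__ void gemm_naive(int M, int N, int K, float alpha,
                           const float *A, const float *B,
                           float beta, float *C) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < M && col < N) {
    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
      acc += A[row * K + k] * B[k * N + col];
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
  }
}
```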
## Matrix Multiplication Algorithm Implementations

* MatrixMulCUDA0
* Naive GEMM implementation.
* MatrixMulCUDA1
* Utilizing warp/block for fused multiply-add (FMA) calculations.
* MatrixMulCUDA2
* Loading data with strides from global memory to shared memory.
* MatrixMulCUDA3
* Aligning shared memory for optimized memory access.
* MatrixMulCUDA4
* Loading data twice per thread for improved data reuse.
* MatrixMulCUDA5
* Minimizing bank conflicts in shared memory accesses.
* MatrixMulCUDA6
    * Using a ping-pong (double) buffering strategy; see the sketch after this list.
* MatrixMulCUDA7
    * Implementing a fast 128x128 block GEMM. (Note: a bug causing segmentation faults needs to be fixed.)
* MatrixMulCUDA8
    * Dense GEMM, referring to https://github.com/Cjkkkk/CUDA_gemm/blob/master/src/cuda/dense.cu
* MatrixMulCUDA9
* Implementation using cuBLAS.
* MatrixMulCUDA10
    * SGEMM from yzhaiustc, referring to https://github.com/yzhaiustc/Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
* MatrixMulCUDA11
    * SGEMM from Yinghan-Li, referring to https://github.com/Yinghan-Li/YHs_Sample/blob/master/cuda/gemm/sgemm.cu
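Several of these techniques compose naturally. Below is an illustrative sketch (not the repository's exact kernels) combining tiled global-to-shared loads, a padding column to reduce shared-memory bank conflicts, and a ping-pong (double) buffer:
```
#define TILE 32

// Tiled GEMM sketch: each TILE x TILE thread block computes one
// TILE x TILE tile of C. Launch with blockDim = (TILE, TILE).
__global__ void gemm_tiled(int M, int N, int K, float alpha,
                           const float *A, const float *B,
                           float beta, float *C) {
  // Two buffers per operand enable ping-pong: compute on one tile
  // while the next one is loaded. The +1 padding column offsets each
  // row's bank mapping so column accesses do not serialize.
  __shared__ float As[2][TILE][TILE + 1];
  __shared__ float Bs[2][TILE][TILE + 1];

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.0f;

  int numTiles = (K + TILE - 1) / TILE;
  int buf = 0;

  // Preload the first K-tile into buffer 0 (zero-fill out of range).
  As[buf][threadIdx.y][threadIdx.x] =
      (row < M && threadIdx.x < K) ? A[row * K + threadIdx.x] : 0.0f;
  Bs[buf][threadIdx.y][threadIdx.x] =
      (threadIdx.y < K && col < N) ? B[threadIdx.y * N + col] : 0.0f;
  __syncthreads();

  for (int t = 0; t < numTiles; ++t) {
    int next = buf ^ 1;
    // Load the next tile into the other buffer while computing.
    if (t + 1 < numTiles) {
      int k1 = (t + 1) * TILE;
      As[next][threadIdx.y][threadIdx.x] =
          (row < M && k1 + threadIdx.x < K) ? A[row * K + k1 + threadIdx.x] : 0.0f;
      Bs[next][threadIdx.y][threadIdx.x] =
          (k1 + threadIdx.y < K && col < N) ? B[(k1 + threadIdx.y) * N + col] : 0.0f;
    }
    for (int k = 0; k < TILE; ++k)
      acc += As[buf][threadIdx.y][k] * Bs[buf][k][threadIdx.x];
    __syncthreads();  // next tile fully loaded, current tile fully read
    buf = next;
  }

  if (row < M && col < N)
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```
The padding column (`TILE + 1`) shifts each shared-memory row onto a different bank so column-wise reads do not conflict, and the second buffer lets the next tile's loads overlap the current tile's FMAs with a single `__syncthreads()` per iteration.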
## Installation
* Edit the build.sh file and set the paths for your system:
  * `cmake -DCUDA_ARCH=/your/cuda/arch -DCUDA_TOOLKIT_ROOT_DIR=/local/cuda/path`
* Run `bash build.sh`
![image](https://github.com/chanzhennan/cuda_gemm_benchmark/assets/7290453/e879009a-475e-4f05-9e51-7771d3d5b765)
## Performance
Run on an RTX 4070 Ti (theoretical FP32 performance: 40.09 TFLOPS).
* Reference: [GeForce RTX 4070 Ti Specs](https://www.techpowerup.com/gpu-specs/geforce-rtx-4070-ti.c3950)
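The Time, CPU, Iterations, and UserCounters columns below come from Google Benchmark (the gtest/benchmark harness this repository is based on). A minimal sketch of how such a case with a TFlops rate counter might be registered; the launch call is a hypothetical placeholder, not the repository's actual harness:
```
#include <benchmark/benchmark.h>
#include <cuda_runtime.h>

// Sketch of a Google Benchmark GEMM case for an M x N x K problem.
static void BM_Gemm_float(benchmark::State &state) {
  const double M = state.range(0), N = state.range(1), K = state.range(2);
  for (auto _ : state) {
    // LaunchGemm(...);         // hypothetical launch of the kernel under test
    cudaDeviceSynchronize();    // time the full device-side execution
  }
  // A GEMM performs 2*M*N*K floating-point operations; kIsRate divides
  // the accumulated total by elapsed time, yielding TFLOPS.
  state.counters["TFlops"] = benchmark::Counter(
      2.0 * M * N * K * state.iterations() * 1e-12,
      benchmark::Counter::kIsRate);
}
BENCHMARK(BM_Gemm_float)->Args({5120, 4096, 4096})->Unit(benchmark::kMillisecond);
BENCHMARK_MAIN();
```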
| Benchmark | Time | CPU | Iterations | UserCounters |
|-------------------------------------------------|----------|----------|------------|------------------------------|
| Naive/Gemm_float/5120/4096/4096 | 1731 ms | 1731 ms | 1 | TFlops=0.099244/s, operation=171.799G |
| Blocker/Gemm_float/5120/4096/4096 | 103 ms | 103 ms | 6 | TFlops=1.66191/s, operation=1030.79G |
| Strider/Gemm_float/5120/4096/4096 | 19.9 ms | 19.9 ms | 30 | TFlops=8.62941/s, operation=5.15396T |
| Aligner/Gemm_float/5120/4096/4096 | 17.3 ms | 17.3 ms | 33 | TFlops=9.93519/s, operation=5.66936T |
| MultiLoader/Gemm_float/5120/4096/4096 | 19.8 ms | 19.8 ms | 31 | TFlops=8.67294/s, operation=5.32576T |
| BcAvoider/Gemm_float/5120/4096/4096 | 24.2 ms | 24.2 ms | 26 | TFlops=7.10627/s, operation=4.46677T |
| PpBuffer/Gemm_float/5120/4096/4096 | 20.9 ms | 20.9 ms | 28 | TFlops=8.2018/s, operation=4.81036T |
| Dense/Gemm_float/5120/4096/4096 | 11.0 ms | 11.0 ms | 61 | TFlops=15.5654/s, operation=10.4797T |
| Cublas/Gemm_float/5120/4096/4096 | 5.95 ms | 5.95 ms | 115 | TFlops=28.8656/s, operation=19.7568T |
| Yzaiustc/Gemm_float/5120/4096/4096 | 7.23 ms | 7.23 ms | 93 | TFlops=23.765/s, operation=15.9773T |
| Yhs/Gemm_float/5120/4096/4096                    | 6.78 ms  | 6.78 ms  | 100        | TFlops=25.3418/s, operation=17.1799T |

## Todo
* Address the bug causing a segment fault in MatrixMulCUDA7.
* Fix the issue where CUDA implementations 0 to 6 cannot handle cases where m = 8, n = 4096, k = 4096.