https://github.com/tgautam03/xgemm
Accelerated General (FP32) Matrix Multiplication from scratch in CUDA
- Host: GitHub
- URL: https://github.com/tgautam03/xgemm
- Owner: tgautam03
- License: mit
- Created: 2024-08-11T21:36:15.000Z (about 1 year ago)
- Default Branch: master
- Last Pushed: 2025-01-09T21:13:38.000Z (9 months ago)
- Last Synced: 2025-03-30T19:05:53.006Z (7 months ago)
- Topics: cuda-programming, gpu-programming, matrix-multiplication, sgemm
- Language: Cuda
- Size: 5.8 MB
- Stars: 111
- Watchers: 2
- Forks: 6
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# xGeMM
Accelerated General (FP32) Matrix Multiplication. Tested on an NVIDIA RTX 3090 using Ubuntu 24.04.1 LTS with nvidia-driver-550 and CUDA 12.4.

**[Watch the YouTube video](https://youtu.be/GetaI7KhbzM?si=i9sMAfGqO4zyJZhq)**
## Dependencies
- [Eigen 3.4.0](https://gitlab.com/libeigen/eigen/-/releases/3.4.0) (put it in `lib`)
## Running Benchmarks
### 1. Eigen (CPU) matrix multiplication:
**Compile**: `make 00a_benchmark_cpu.out`
**Execute**: `./00a_benchmark_cpu.out`
### 2. cuBLAS (GPU) matrix multiplication:
**Compile**: `make 00b_benchmark_cuBLAS.out`
**Execute**: `./00b_benchmark_cuBLAS.out`
### 3. Naive (GPU) matrix multiplication:
**Compile**: `make 01_benchmark_naive.out`
**Execute**: `./01_benchmark_naive.out`
### 4. Coalesced (GPU) matrix multiplication:
**Compile**: `make 02_benchmark_coalesced.out`
**Execute**: `./02_benchmark_coalesced.out`
### 5. Tiled (GPU) matrix multiplication:
**Compile**: `make 03_benchmark_tiled.out`
**Execute**: `./03_benchmark_tiled.out`
### 6. 1D thread coarsening (GPU) matrix multiplication:
**Compile**: `make 04_benchmark_coarse_1d.out`
**Execute**: `./04_benchmark_coarse_1d.out`
### 7. 2D thread coarsening (GPU) matrix multiplication:
**Compile**: `make 05_benchmark_coarse_2d.out`
**Execute**: `./05_benchmark_coarse_2d.out`
### 8. Vectorized memory accesses (GPU) matrix multiplication:
**Compile**: `make 06_benchmark_coarse_2d_vec.out`
**Execute**: `./06_benchmark_coarse_2d_vec.out`
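
To run all eight benchmarks in sequence, the compile and execute steps above can be wrapped in a small shell function. This is only a sketch and not part of the repository: the names `run_benchmarks`, `BENCHMARKS`, and `DRY_RUN` are hypothetical, and it assumes the function is called from the repository root (where the `Makefile` lives) with Eigen already placed in `lib`.

```shell
# Hypothetical helper (not part of the repository): builds and runs each
# benchmark target listed in this README, in order.
# Assumes the current directory is the repository root.

# Target base names, taken from the sections above.
BENCHMARKS="
00a_benchmark_cpu
00b_benchmark_cuBLAS
01_benchmark_naive
02_benchmark_coalesced
03_benchmark_tiled
04_benchmark_coarse_1d
05_benchmark_coarse_2d
06_benchmark_coarse_2d_vec
"

# Set DRY_RUN=1 to print the commands instead of executing them.
run_benchmarks() {
  for name in $BENCHMARKS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "make ${name}.out && ./${name}.out"
    else
      # Stop at the first target that fails to build or run.
      make "${name}.out" && "./${name}.out" || return 1
    fi
  done
}
```

Calling `DRY_RUN=1 run_benchmarks` prints the command pair for each target without running anything, which is a cheap way to check the list; plain `run_benchmarks` builds and executes each benchmark in turn and stops at the first failure.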