https://github.com/tgautam03/tgemm
General Matrix Multiplication using NVIDIA Tensor Cores
- Host: GitHub
- URL: https://github.com/tgautam03/tgemm
- Owner: tgautam03
- License: MIT
- Created: 2024-10-25T00:53:54.000Z (7 months ago)
- Default Branch: master
- Last Pushed: 2025-01-25T00:43:45.000Z (4 months ago)
- Last Synced: 2025-04-15T12:11:53.708Z (about 1 month ago)
- Topics: cuda-kernels, cuda-programming, gpu-computing, gpu-programming, matrix-multiplication, nvidia-cuda, nvidia-gpu, nvidia-tensor-cores, sgemm, tensor-cores
- Language: Cuda
- Homepage:
- Size: 47.9 KB
- Stars: 13
- Watchers: 1
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# tGeMM
General Matrix Multiplication using NVIDIA Tensor Cores. **Tested on an NVIDIA RTX 3090 running Ubuntu 24.04.1 LTS with nvidia-driver-550 and CUDA 12.4.**

Custom data structures `MatrixFP16` and `MatrixFP32` are defined (in *src*) to make working with matrices easy. Supported features are as follows:
1. Define a half precision `n x n` matrix `A_FP16` in RAM (host memory): `MatrixFP16 A_FP16 = MatrixFP16(n, n, false);`
2. Define a half precision `n x n` matrix `d_A_FP16` in VRAM (device global memory): `MatrixFP16 d_A_FP16 = MatrixFP16(n, n, true);`
3. Define a single precision `n x n` matrix `A_FP32` in RAM (host memory): `MatrixFP32 A_FP32 = MatrixFP32(n, n, false);`
4. Define a single precision `n x n` matrix `d_A_FP32` in VRAM (device global memory): `MatrixFP32 d_A_FP32 = MatrixFP32(n, n, true);`
5. Randomly initialize FP16 or FP32 matrices (here between -10 and 10): `random_init_mat(A_FP16, -10, 10);` or `random_init_mat(A_FP32, -10, 10);`
6. Move matrix data from RAM to VRAM: `A_FP16.copy_to_device(d_A_FP16);`
7. Move matrix data from VRAM to RAM: `d_A_FP16.copy_to_host(A_FP16);`
8. Free host/device memory: `A_FP16.free_mat();` and `d_A_FP16.free_mat();`
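
Putting those calls together, a minimal host-side sketch looks like the following. The calls are exactly the ones listed above; the header name/path is an assumption for illustration.

```cuda
#include "MatrixFP16.cuh" // assumed header name; the data structures live in src/

int main()
{
    int n = 1024; // square matrix dimension

    // Host (RAM) and device (VRAM) matrices; the boolean flag selects
    // where the storage is allocated
    MatrixFP16 A_FP16 = MatrixFP16(n, n, false);  // host
    MatrixFP16 d_A_FP16 = MatrixFP16(n, n, true); // device

    random_init_mat(A_FP16, -10, 10); // random values between -10 and 10
    A_FP16.copy_to_device(d_A_FP16);  // RAM -> VRAM

    // ... launch GEMM kernels on d_A_FP16 here ...

    d_A_FP16.copy_to_host(A_FP16); // VRAM -> RAM

    // Release both allocations
    A_FP16.free_mat();
    d_A_FP16.free_mat();
    return 0;
}
```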
## cuBLAS vs Custom Matrix Multiplication using Tensor Cores
- For the cuBLAS version, run: `make 00_benchmark_cuBLAS.out`
- For the custom version, run: `make 01_benchmark_naive.out`
The naive version is a fair bit slower than cuBLAS. However, my point (for now) is to show how Tensor Cores can be programmed: I've kept everything as simple as possible so that the workings of Tensor Cores are easy to understand.
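
For reference, a naive Tensor Core GEMM in this style can be sketched with CUDA's standard `nvcuda::wmma` API. This is an illustrative sketch, not the repository's exact kernel; the kernel name `wmma_gemm` and the one-warp-per-tile launch configuration are assumptions. Each warp computes one 16x16 tile of the FP32 output from FP16 inputs:

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// Illustrative naive WMMA GEMM: C (FP32) = A (FP16) * B (FP16).
// All matrices are row-major and n is a multiple of 16.
// Launched with one warp (32 threads) per block, one 16x16 C tile per block:
//   wmma_gemm<<<dim3(n / 16, n / 16), 32>>>(d_A, d_B, d_C, n);
__global__ void wmma_gemm(const half *A, const half *B, float *C, int n)
{
    int tile_row = blockIdx.y * 16; // first row of this warp's C tile
    int tile_col = blockIdx.x * 16; // first column of this warp's C tile

    // Fragments live in registers, distributed across the warp
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;
    wmma::fill_fragment(acc_frag, 0.0f);

    // Walk the shared dimension 16 columns at a time; each step is one
    // 16x16x16 matrix-multiply-accumulate executed on the Tensor Cores
    for (int k = 0; k < n; k += 16)
    {
        wmma::load_matrix_sync(a_frag, A + tile_row * n + k, n);
        wmma::load_matrix_sync(b_frag, B + k * n + tile_col, n);
        wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);
    }

    // Write the accumulated FP32 tile back to global memory
    wmma::store_matrix_sync(C + tile_row * n + tile_col, acc_frag, n,
                            wmma::mem_row_major);
}
```

WMMA requires compute capability 7.0 or newer (e.g. `nvcc -arch=sm_86` for the RTX 3090). The gap to cuBLAS comes from everything this sketch omits: shared-memory staging, larger per-warp tiles, and overlapping memory traffic with compute.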