https://github.com/mathiasotnes/gemm
General Matrix Multiplication (GEMM) optimization in Cuda.
- Host: GitHub
- URL: https://github.com/mathiasotnes/gemm
- Owner: Mathiasotnes
- Created: 2024-11-20T23:06:59.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-12-03T23:24:09.000Z (2 months ago)
- Last Synced: 2024-12-03T23:27:08.293Z (2 months ago)
- Topics: cuda, gpu
- Language: Cuda
- Homepage:
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# GEMM
General Matrix Multiplication (GEMM) optimization in CUDA.

### Notes
- I'm using square matrices.
- I'm setting alpha and beta to 1 and C to 0 to simplify.
- The functions include the memory allocation.
- When profiling stream shmem, I saw that it launched hundreds of different kernel instances. The other
  methods only had a single kernel instance.
- cuBLAS uses a 3D grid of (8, 16, 5) and a block size of (128, 1, 1). When I tried to use this in my shmem implementation,
  I got the wrong answer, but it reduced the number of cycles.

### Talking points:
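For reference, the standard GEMM operation and the simplified form described in the notes (alpha = beta = 1, C initialized to zero) can be written as below. The symbol $N$ for the side length of the square matrices is an assumption introduced here for illustration.

$$
C \leftarrow \alpha A B + \beta C, \qquad \text{with } \alpha = \beta = 1,\ C = 0: \quad C_{ij} = \sum_{k=0}^{N-1} A_{ik} B_{kj}
$$

With these choices the operation reduces to a plain matrix product.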
1. Introduce problem:
   - Simple matrix multiplication variation of GEMM.
2. Go through implementations:
   - CPU: To compare with the GPU implementations.
   - naive: Basic implementation of parallel matrix multiplication.
   - shmem: Utilizing shared memory in the same way as explained in lecture (tile-based).
   - stream: Tried to split the A matrix into different tiles. Unsuccessfully.
   - stream_shmem: Stream combined with shared memory.
   - cublas: cuBLAS library wrapper.
3. Go through results:
   - Results with tile/block size 16:
     - CPU was fastest on the small matrices because it doesn't have to copy memory.
     - Naive and shmem were close on all the sizes, but shmem turned out better when the size increased.
     - cuBLAS excelled when the sizes became large enough.
   - Results with tile/block size 32:
     - Naive and shmem were a lot faster on 2048, but slower on 1024. Kinda surprising since 32x32=1024.
   - shmem Nsight Compute analysis:
     - Close to zero bank conflicts.
     - LSU bottleneck (load and store operations).
   - cublas Nsight Compute analysis:
     - Not bottlenecked by the LSU in the same way as shmem.
   - Profile summary:
     - cuBLAS launch dimensions.
     - cuBLAS uses a lot fewer cycles.
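
The naive and shmem approaches listed above could look roughly like the sketch below. It is not the repository's code: it assumes row-major square `float` matrices and a tile size of 16, and the names `gemm_naive`, `gemm_shmem`, `max_abs_diff`, and the test size `N = 100` are hypothetical choices made for illustration. CUDA error checking is left out to keep it short.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // assumed tile/block size (matches the "size 16" runs)

// Naive kernel: one thread computes one element of C = A * B (row-major, N x N).
__global__ void gemm_naive(const float* A, const float* B, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;
    float acc = 0.0f;
    for (int k = 0; k < N; ++k) acc += A[row * N + k] * B[k * N + col];
    C[row * N + col] = acc;
}

// Tiled kernel: each block stages TILE x TILE pieces of A and B in shared memory,
// so every global value is loaded once per block instead of once per thread.
__global__ void gemm_shmem(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range threads load zeros so they can still reach __syncthreads().
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N) C[row * N + col] = acc;
}

// Largest absolute element-wise difference between two host buffers.
float max_abs_diff(const std::vector<float>& x, const std::vector<float>& y) {
    float m = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) m = std::fmax(m, std::fabs(x[i] - y[i]));
    return m;
}

int main() {
    const int N = 100;  // deliberately not a multiple of TILE to exercise the bounds checks
    const size_t bytes = size_t(N) * N * sizeof(float);
    std::vector<float> hA(N * N), hB(N * N), hNaive(N * N), hTiled(N * N);
    for (int i = 0; i < N * N; ++i) { hA[i] = (i % 7) * 0.5f; hB[i] = (i % 5) - 2.0f; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);

    gemm_naive<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hNaive.data(), dC, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel

    gemm_shmem<<<grid, block>>>(dA, dB, dC, N);
    cudaMemcpy(hTiled.data(), dC, bytes, cudaMemcpyDeviceToHost);

    printf("max |naive - shmem| = %g\n", max_abs_diff(hNaive, hTiled));
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

Comparing the two kernels against each other (or against a CPU loop) is a cheap way to catch indexing mistakes before profiling. Note that the tiled kernel assumes a `TILE x TILE` thread block; launching it with a different block shape, such as the (128, 1, 1) configuration seen for cuBLAS, would break the index mapping unless the kernel is rewritten for it.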