Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/Cjkkkk/CUDA_gemm
A simple high performance CUDA GEMM implementation.
- Host: GitHub
- URL: https://github.com/Cjkkkk/CUDA_gemm
- Owner: Cjkkkk
- Created: 2019-12-26T15:02:14.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2024-01-04T16:33:32.000Z (about 1 year ago)
- Last Synced: 2024-08-04T02:06:36.640Z (6 months ago)
- Language: Cuda
- Homepage:
- Size: 677 KB
- Stars: 304
- Watchers: 5
- Forks: 35
- Open Issues: 2
- Metadata Files:
  - Readme: README.md
Awesome Lists containing this project
- awesome-cuda-triton-hpc - Cjkkkk/CUDA_gemm
README
## introduction
A simple, high-performance CUDA GEMM, block-sparse GEMM, and non-uniform quantized GEMM implementation.
```
C = alpha * A * B + beta * C
```
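For reference, a minimal unoptimized kernel for this operation (one thread per element of C, as in the baseline MatrixMulCUDA below) might look like the following sketch; the names and the row-major layout are assumptions, not the repository's exact code.
```
// Sketch only: naive SGEMM, one thread per element of C, row-major layouts assumed.
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float* A,          // M x K
                            const float* B,          // K x N
                            float beta, float* C) {  // M x N
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```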
## algorithm
**located in src/cuda/**
* MatrixMulCUDA
    * one element of C is assigned to one thread
    * coalesced global memory loads of B
* MatrixMulCUDA1
    * texture load
* MatrixMulCUDA2
    * one 4 * 4 tile of C is assigned to one thread
* MatrixMulCUDA3
    * vectorized loads of A and B
* MatrixMulCUDA4
    * vectorized stores of C
* MatrixMulCUDA5
    * block sparse version
* MatrixMulCUDA6
    * vectorized, coalesced loads of A and B
* MatrixMulCUDA7
    * warp shuffle to enable coalesced stores of C
* MatrixMulCUDAQuantize8bit
    * 8 bit non-uniform quantized matmul
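To illustrate the vectorized-load idea used by MatrixMulCUDA3 and MatrixMulCUDA6, a row tile of A can be staged into shared memory with `float4` accesses. This is only a sketch with assumed tile sizes (8 x 128, K divisible by 4, blockDim = (32, 8)), not the repository's kernels.
```
// Sketch only: staging an 8 x 128 row tile of A into shared memory with
// vectorized float4 loads (32 threads cover 128 floats per row; assumes the
// tile start is 16-byte aligned and K is divisible by 4).
__global__ void load_tile_vectorized(const float* __restrict__ A, int K) {
    __shared__ float As[8][128];
    const float4* A4 = reinterpret_cast<const float4*>(A);
    int row = threadIdx.y;   // 0..7
    int vec = threadIdx.x;   // 0..31, each thread moves one float4 (4 floats)
    float4 v = A4[(blockIdx.y * 8 + row) * (K / 4) + blockIdx.x * 32 + vec];
    As[row][vec * 4 + 0] = v.x;
    As[row][vec * 4 + 1] = v.y;
    As[row][vec * 4 + 2] = v.z;
    As[row][vec * 4 + 3] = v.w;
    __syncthreads();
    // ... compute on As ...
}
```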
## experiments
**located in benchmark/**
* benchmark_dense
    * compare my GEMM with cuBLAS
* benchmark_sparse
    * compare my block sparse GEMM with cuSPARSE
* benchmark_quantization_8bit
    * compare my GEMM with cuBLAS
* benchmark_quantization
    * compare my GEMM with my non-uniform 8 bit quantized GEMM
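For context, the cuBLAS reference side of a dense benchmark is typically a single `cublasSgemm` call such as the host-side sketch below (column-major, no transpose); this is an assumed minimal form, not the repository's benchmark code.
```
// Sketch only: cuBLAS SGEMM reference, C = alpha * A * B + beta * C.
// cuBLAS expects column-major matrices; A is m x k, B is k x n, C is m x n.
#include <cublas_v2.h>

void cublas_reference(cublasHandle_t handle, int m, int n, int k,
                      const float* dA, const float* dB, float* dC,
                      float alpha, float beta) {
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k,
                &alpha, dA, m,   // lda = m
                        dB, k,   // ldb = k
                &beta,  dC, m);  // ldc = m
}
```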
## TODO
* (MatrixMulCUDA7) write back to the C matrix using warp shuffle so that global memory stores coalesce
* (MatrixMulCUDA8) double buffering
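Double buffering here refers to the usual ping-pong scheme: prefetch the next shared-memory tile into one buffer while computing on the other. The kernel below is a minimal sketch of that pattern with a simple TILE x TILE tiling and dimensions assumed divisible by TILE; it is not the planned MatrixMulCUDA8.
```
// Sketch only: double-buffered shared-memory tiles in a tiled SGEMM.
// The next tile is loaded into the other buffer while the current tile is
// consumed, so global loads can be in flight during computation.
#define TILE 16

__global__ void sgemm_double_buffered(int M, int N, int K, float alpha,
                                      const float* A, const float* B,
                                      float beta, float* C) {
    __shared__ float As[2][TILE][TILE];
    __shared__ float Bs[2][TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Preload the first tile into buffer 0.
    As[0][threadIdx.y][threadIdx.x] = A[row * K + threadIdx.x];
    Bs[0][threadIdx.y][threadIdx.x] = B[threadIdx.y * N + col];
    __syncthreads();

    int numTiles = K / TILE;
    for (int t = 0; t < numTiles; ++t) {
        int cur = t % 2, nxt = (t + 1) % 2;
        // Prefetch the next tile into the other buffer (if there is one).
        if (t + 1 < numTiles) {
            As[nxt][threadIdx.y][threadIdx.x] = A[row * K + (t + 1) * TILE + threadIdx.x];
            Bs[nxt][threadIdx.y][threadIdx.x] = B[((t + 1) * TILE + threadIdx.y) * N + col];
        }
        // Compute on the current buffer.
        for (int k = 0; k < TILE; ++k)
            acc += As[cur][threadIdx.y][k] * Bs[cur][k][threadIdx.x];
        __syncthreads();   // prefetched tile is now safe to read, old tile safe to overwrite
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}
```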
## run
```
mkdir builds
make benchmark_[experiment name]
bash scripts/benchmark_[experiment name].sh
```
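For example, assuming the experiment name for the dense benchmark is `dense`, the steps above would become:
```
mkdir builds
make benchmark_dense
bash scripts/benchmark_dense.sh
```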
## Note
* when sparsity is around 1%, cuSPARSE can outperform cuBLAS
* allocate registers sensibly; fix as many parameters as possible at compile time to save computation and registers
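One common way to fix parameters at compile time is to pass tile sizes as template arguments so the compiler can fully unroll loops and fold indexing into constants; a small assumed sketch:
```
// Sketch only: a tile size as a compile-time template parameter lets the
// compiler unroll the loop fully and fold indexing into constants, which
// saves instructions and registers compared with a runtime parameter.
template <int TILE_K>
__global__ void dot_tiles(const float* a, const float* b, float* out) {
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < TILE_K; ++k)   // trip count known at compile time
        acc += a[threadIdx.x * TILE_K + k] * b[threadIdx.x * TILE_K + k];
    out[threadIdx.x] = acc;
}

// launched with the constant fixed at compile time:
// dot_tiles<8><<<1, 32>>>(d_a, d_b, d_out);
```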