https://github.com/Cjkkkk/CUDA_gemm

A simple high performance CUDA GEMM implementation.

## introduction
A simple, high-performance CUDA implementation of GEMM, block-sparse GEMM, and non-uniform quantized GEMM.
```
C = alpha * A * B + beta * C
```
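
As a point of reference for the formula above, here is a minimal baseline sketch in which each thread computes one element of C. It is not the repository's kernel: the names `gemm_naive` and `gemm_naive_host`, the row-major layout, and the 16 x 16 block size are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Illustrative baseline, not the repository's kernel: one thread per element
// of C, with row-major A (M x K), B (K x N) and C (M x N).
__global__ void gemm_naive(const float* A, const float* B, float* C,
                           int M, int N, int K, float alpha, float beta) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < K; ++k) {
        // Threads with consecutive col read consecutive addresses of B.
        acc += A[row * K + k] * B[k * N + col];
    }
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}

// Hypothetical host helper (error checking omitted for brevity): copies the
// inputs to the device, launches the kernel, and copies C back.
void gemm_naive_host(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int M, int N, int K,
                     float alpha, float beta) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (M + block.y - 1) / block.y);
    gemm_naive<<<grid, block>>>(dA, dB, dC, M, N, K, alpha, beta);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

A kernel like this is mainly useful as a correctness reference when checking faster variants.
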
## algorithm
**located in src/cuda/**

* MatrixMulCUDA
  * each thread computes one element of C
  * coalesced global memory access to B
* MatrixMulCUDA1
  * texture load
* MatrixMulCUDA2
  * each thread computes one 4 * 4 tile of C
* MatrixMulCUDA3
  * vectorized load of A and B
* MatrixMulCUDA4
  * vectorized store of C
* MatrixMulCUDA5
  * block sparse version
* MatrixMulCUDA6
  * vectorized, coalesced load of A and B
* MatrixMulCUDA7
  * warp shuffle to enable coalesced stores to C
* MatrixMulCUDAQuantize8bit
  * 8-bit non-uniform quantized matmul
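
To make two of the ideas in the list concrete (a 4 * 4 tile of C per thread, plus vectorized loads and stores), the following is a rough sketch rather than the code in src/cuda/. The names `gemm_tile4x4` and `gemm_tile4x4_host` are hypothetical, and the sketch assumes row-major storage, M and N divisible by 4, and buffers allocated with `cudaMalloc` so that every `float4` access is 16-byte aligned.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Illustrative sketch: each thread accumulates a 4 x 4 tile of C in registers.
// Assumes M % 4 == 0, N % 4 == 0, and 16-byte aligned base pointers.
__global__ void gemm_tile4x4(const float* A, const float* B, float* C,
                             int M, int N, int K, float alpha, float beta) {
    int row0 = (blockIdx.y * blockDim.y + threadIdx.y) * 4;
    int col0 = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (row0 >= M || col0 >= N) return;

    float acc[4][4] = {};
    for (int k = 0; k < K; ++k) {
        // Vectorized load: four consecutive elements of row k of B.
        float4 b = *reinterpret_cast<const float4*>(&B[k * N + col0]);
        #pragma unroll
        for (int i = 0; i < 4; ++i) {
            float a = A[(row0 + i) * K + k];
            acc[i][0] += a * b.x;
            acc[i][1] += a * b.y;
            acc[i][2] += a * b.z;
            acc[i][3] += a * b.w;
        }
    }

    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        float4* out = reinterpret_cast<float4*>(&C[(row0 + i) * N + col0]);
        float4 c = *out;
        c.x = alpha * acc[i][0] + beta * c.x;
        c.y = alpha * acc[i][1] + beta * c.y;
        c.z = alpha * acc[i][2] + beta * c.z;
        c.w = alpha * acc[i][3] + beta * c.w;
        *out = c;  // vectorized store of one tile row
    }
}

// Hypothetical host helper (error checking omitted for brevity).
void gemm_tile4x4_host(const std::vector<float>& A, const std::vector<float>& B,
                       std::vector<float>& C, int M, int N, int K,
                       float alpha, float beta) {
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // each thread covers a 4 x 4 tile
    dim3 grid((N / 4 + block.x - 1) / block.x, (M / 4 + block.y - 1) / block.y);
    gemm_tile4x4<<<grid, block>>>(dA, dB, dC, M, N, K, alpha, beta);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

Holding the tile in registers lets each loaded value of A and B be reused four times, which is the usual motivation for assigning more than one output element to a thread.
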

## experiments
**located in benchmark/**
* benchmark_dense
  * compare my GEMM with cuBLAS
* benchmark_sparse
  * compare my block sparse GEMM with cuSPARSE
* benchmark_quantization_8bit
  * compare my GEMM with cuBLAS
* benchmark_quantization
  * compare my GEMM with my non-uniform quantized 8-bit GEMM
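
How the benchmarks measure time is not described here. One common approach, shown purely as an assumption, is to time repeated launches with CUDA events and convert the average to GFLOPS using the 2 * M * N * K operation count of a GEMM. The helpers `average_ms` and `gemm_gflops` and the placeholder `dummy_kernel` are hypothetical names.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel so the sketch is self-contained; a GEMM launch would
// normally be timed instead.
__global__ void dummy_kernel(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 2.0f * x[i] + 1.0f;
}

// Runs `launch` once as a warm-up, then `iters` times between two CUDA
// events, and returns the average time per launch in milliseconds.
template <typename Launch>
float average_ms(Launch launch, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();  // warm-up, excluded from the measurement
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all timed launches finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// A GEMM performs roughly 2 * M * N * K floating-point operations
// (one multiply and one add per inner-product term).
double gemm_gflops(int M, int N, int K, float ms) {
    return 2.0 * M * N * K / (ms * 1e-3) / 1e9;
}
```

A call might look like `average_ms([&] { dummy_kernel<<<4, 256>>>(d, 1024); }, 10)`, where `d` is a device buffer of at least 1024 floats.
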

## TODO
* (MatrixMulCUDA7) use warp shuffle when writing back to the C matrix so that global memory stores are coalesced
* (MatrixMulCUDA8) double buffering

## run
```
mkdir builds
make benchmark_[experiment name]
bash scripts/benchmark_[experiment name].sh
```

## Note
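* When sparsity is around 1%, cuSPARSE can outperform cuBLAS.
* Allocate registers carefully: wherever possible, make parameters known at compile time to save compute resources and registers.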
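
One way to fix a parameter at compile time is to pass it as a template argument. The sketch below is an assumption about how that could look, not the repository's code: the tile size `TILE` of a shared-memory tiled kernel is a template parameter, so the shared arrays are statically sized and the inner loop has a constant trip count that the compiler can unroll. The names `gemm_tiled` and `gemm_tiled_host` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Illustrative sketch: TILE is a compile-time constant. Expects a
// TILE x TILE thread block and row-major A (M x K), B (K x N), C (M x N).
template <int TILE>
__global__ void gemm_tiled(const float* A, const float* B, float* C,
                           int M, int N, int K, float alpha, float beta) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        // Out-of-range elements are loaded as zero so any M, N, K works.
        int ak = t + threadIdx.x;
        int bk = t + threadIdx.y;
        As[threadIdx.y][threadIdx.x] = (row < M && ak < K) ? A[row * K + ak] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bk < K && col < N) ? B[bk * N + col] : 0.0f;
        __syncthreads();

        #pragma unroll
        for (int k = 0; k < TILE; ++k) {  // constant trip count
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    if (row < M && col < N) {
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}

// Hypothetical host helper (error checking omitted for brevity).
void gemm_tiled_host(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, int M, int N, int K,
                     float alpha, float beta) {
    constexpr int TILE = 16;
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    gemm_tiled<TILE><<<grid, block>>>(dA, dB, dC, M, N, K, alpha, beta);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

The trade-off is that each tile size needs its own template instantiation, so the set of supported sizes has to be chosen at build time.
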