Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
https://github.com/tpoisonooo/how-to-optimize-gemm
arm64 armv7 cuda cuda-kernel gemm-optimization int4 ptx vulkan
Last synced: 1 day ago
JSON representation
row-major matmul optimization
- Host: GitHub
- URL: https://github.com/tpoisonooo/how-to-optimize-gemm
- Owner: tpoisonooo
- License: gpl-3.0
- Created: 2018-10-28T13:23:30.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2023-09-09T07:12:07.000Z (over 1 year ago)
- Last Synced: 2024-10-11T18:10:03.496Z (3 months ago)
- Topics: arm64, armv7, cuda, cuda-kernel, gemm-optimization, int4, ptx, vulkan
- Language: C++
- Homepage: https://zhuanlan.zhihu.com/p/65436463
- Size: 12.5 MB
- Stars: 589
- Watchers: 16
- Forks: 79
- Open Issues: 5
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-cuda-triton-hpc - tpoisonooo/how-to-optimize-gemm - to-optimize-gemm?style=social"/> : row-major matmul optimization. [zhuanlan.zhihu.com/p/65436463](https://zhuanlan.zhihu.com/p/65436463). (Learning Resources)
- awesome-cuda-triton-hpc - tpoisonooo/how-to-optimize-gemm - to-optimize-gemm?style=social"/> : row-major matmul optimization. [zhuanlan.zhihu.com/p/65436463](https://zhuanlan.zhihu.com/p/65436463). (Learning Resources)
- awesome-gemm - how-to-optimize-gemm (row-major matmul)
README
# how-to-optimize-gemm
English | [简体中文](README_ZH_CN.md)
## News
2023/08 aarch64 add cmake and mperf, try `-DMPERF_ENABLE=ON` !
## Introduction
row-major matmul optimization tutorial
| backend | armv7 | aarch64 | aarch64-int8 | cuda | cuda-int4 | vulkan | x86 |
| ----------- | ------- | -- | ---------- | ---------- | ---------- | --------- | --- |
| support | ✔️ | ✔️ | ✔️ | ✔️ | - | ✔️ | ✅ |All backends and corresponding tutorials
| backend | tutorial |
| ------- | -------- |
| aarch64 | [GEMM 入门](https://zhuanlan.zhihu.com/p/65436463) |
| aarch64 | [GEMM caching](https://zhuanlan.zhihu.com/p/69700540) |
| aarch64-int8 | - |
| armv7 | [ARMv7 4x4kernel 懒人优化小实践](https://zhuanlan.zhihu.com/p/333799799) |
| cuda | [cuda 入门的正确姿势:how-to-optimize-gemm](https://zhuanlan.zhihu.com/p/478846788) |
| cuda-int4 WIP | [int4 炼丹要术](https://zhuanlan.zhihu.com/p/580752390)
| vulkan | [如何火急火燎地上手 Vulkan](https://zhuanlan.zhihu.com/p/487583258) |## Build and run
Usage is similar for all backends:
1. Open the backend directory to be used, and change the `OLD` and `NEW` of `makefile` to the same implementation for the first run, for example
```bash
$ cd aarch64
$ cat makefile
OLD := MMult_4x4_10
NEW := MMult_4x4_10
..
```2. makefile` will compile and run the implementation which `NEW` point at, and copy `output_MMult_4x4_10.m` to `output_new.m`
```bash
$ make run
$ cat output_new.m
```3. It may not be intuitive to look at the numbers directly, so draw a line chart
```bash
$ python3 -m pip install -r ../requirements.txt
$ python3 plot.py
```## Differences between backends
Specific to each hardware, there are subtle differences:
* `NEW` may choose a different name
* vulkan/int4 needs prerequisitions## 1. armv7 and aarch64
A. Prepare armv7/aarch64 linux development environment, Raspberry Pi/rk3399/aws arm server are all fine.
B. By default `ARCH := native`, build and run directly
```
$ cd armv8 && make run
```## 2. aarch64 int8
[chgemm](https://github.com/tpoisonooo/chgemm) is an int8 gemm library.
* blue line is chgemm implementation
* orange line is aarch64 fp32 peak
![](./images/aarch64-fp32-peak-vs-int8.png)Compared to the code in this tutorial, the differences are:
1. Dealing with the boundary problem, unlike the tutorial where only multiples of 4 are considered;
2. Int8 reaches a maximum of 18.6 gflops (relative to the theoretical limit of fp32 is only 14.3 on RK3399, gemmlowp is about 12-14gflops);
3. Based on symmetric quantization, input value range must be in \[-127, +127\], and -128 cannot appear;
4. Built-in small example about how to integrate into android studiochgemm has been merged into [ncnn](https://github.com/tencent/ncnn) INT8 convolution implementation.
## 3. x86 original
[flame](https://github.com/flame/how-to-optimize-gemm/tree/4fcf39bd0963bca62f04bef2aeb49a06ee28508b) referenced by x86 is the original implementation, with some differences from this repo:1. The original is **column-major** `x86 SSE` version
2. Both are tutorials, and the `MMult_4x4_17.c` written now can reach 70% of the armv8.1 CPU peak
3. The boundary problem is not dealt with now, only the case where MNK is a multiple of 4 is considered; `sub_kernel` also only writes the simplest kind of assembly. Practical needs a simple adjustment;
4. In terms of drawing, `octave` was discarded (it is too troublesome to configure the environment once for embedded devices), and `python` was used instead.## 4. CUDA
This version is **faster than NVIDIA cuBLAS*** green line is MMult_cuda_12 without tensorcore
* blue line is cuBLAS without tensorcore![](images/cublas-vs-MMult_cuda_12.jpg)
1. Need to install cuda driver and nvcc by yourself
2. CPU OpenBLAS is required to be the baseline```bash
$ apt install libopenblas-dev
```## 5. Vulkan
1. vulkan build depends on kompute API packaging, see [vulkan build documentation](https://github.com/tpoisonooo/how-to-optimize-gemm/tree/master/vulkan) for details
2. More about how to learn compute shader
## 6. CUDA int4
WIP
## Some Tools
* [megpeak](https://github.com/MegEngine/MegPeak): For measuring hardware limit performance, support arm/x86/OCL..
* [perf](https://perf.wiki.kernel.org): Available in linux system tools, for system-level performance analysis and disassembly
* [YHs_Sample](https://github.com/Yinghan-Li/YHs_Sample): dalao 's implementation* [mperf](https://github.com/MegEngine/mperf): optimization tools
## License
[GPLv3](LICENSE)