{"id":28578402,"url":"https://github.com/xlite-dev/hgemm","last_synced_at":"2025-06-11T01:10:04.210Z","repository":{"id":265873685,"uuid":"896493975","full_name":"xlite-dev/HGEMM","owner":"xlite-dev","description":"⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.","archived":false,"fork":false,"pushed_at":"2025-05-10T06:36:17.000Z","size":2973,"stargazers_count":75,"open_issues_count":0,"forks_count":3,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-05-10T07:32:17.908Z","etag":null,"topics":["cuda","hgemm","tensor-cores"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/xlite-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-11-30T14:10:19.000Z","updated_at":"2025-05-10T06:36:20.000Z","dependencies_parsed_at":"2025-01-08T05:27:25.902Z","dependency_job_id":"bbeed59b-6691-485a-9181-23c41c0ab5d2","html_url":"https://github.com/xlite-dev/HGEMM","commit_stats":null,"previous_names":["deftruth/hgemm-tensorcores-mma","deftruth/cuhgemm-py","deftruth/hgemm-mma","xlite-dev/hgemm-mma","xlite-dev/hgemm-tensorcores-mma","xlite-dev/hgemm"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xlite-dev%2FHGEMM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xlite-dev%2FHGEMM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xlite-dev%2FHGEMM/releases","manife
sts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xlite-dev%2FHGEMM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/xlite-dev","download_url":"https://codeload.github.com/xlite-dev/HGEMM/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/xlite-dev%2FHGEMM/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259178542,"owners_count":22817389,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","hgemm","tensor-cores"],"created_at":"2025-06-11T01:10:03.548Z","updated_at":"2025-06-11T01:10:04.172Z","avatar_url":"https://github.com/xlite-dev.png","language":"Cuda","readme":"\n## ⚡️⚡️Toy-HGEMM: Achieve the 98%~100% TFLOPS of cuBLAS 🎉🎉\n\n![toy-hgemm-library](https://github.com/user-attachments/assets/962bda14-b494-4423-b8eb-775da9f5503d)\n\n[📖Toy-HGEMM Library⚡️⚡️](./kernels/hgemm) is a library of HGEMM kernels written from scratch using Tensor Cores with the WMMA, MMA PTX and CuTe APIs, and can therefore achieve `98%~100%` of **cuBLAS** performance. The code here is sourced from 📖[LeetCUDA](https://github.com/xlite-dev/LeetCUDA)  ![](https://img.shields.io/github/stars/xlite-dev/LeetCUDA.svg?style=social) and exported as a standalone library; please check out [LeetCUDA](https://github.com/xlite-dev/LeetCUDA) for the latest updates. 
Welcome to 🌟👆🏻star this repo to support me, many thanks ~ 🎉🎉\n\n\u003cdiv id=\"hgemm-sgemm\"\u003e\u003c/div\u003e  \n\n\u003cdiv align='center'\u003e\n  \u003cimg src='https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99' height=\"170px\" width=\"270px\"\u003e\n  \u003cimg src='https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85' height=\"170px\" width=\"270px\"\u003e\n  \u003cimg src='https://github.com/user-attachments/assets/9472e970-c083-4b31-9252-3eeecc761078' height=\"170px\" width=\"270px\"\u003e\n\u003c/div\u003e \n\n\nCurrently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores math algorithm `CUBLAS_GEMM_DEFAULT_TENSOR_OP`, the `HGEMM (WMMA/MMA/CuTe)` implemented in this repo (`blue`🔵) can achieve `98%~100%` of its (`orange`🟠) performance. Please check [toy-hgemm library⚡️⚡️](./kernels/hgemm) for more details.\n\n|📚Feature |📚Feature |📚Feature |📚Feature|\n|:---:|:---:|:---:|:---:|\n|✔️CUDA/**Tensor Cores**|✔️Loop over K|✔️Tile Block(BMxBK)|✔️Tile Threads(T 8x8)|\n|✔️WMMA(m16n16k16)|✔️MMA(m16n8k16)|✔️Pack LDST(128 bits)|✔️SMEM Padding|\n|✔️Copy Async|✔️Tile MMAs|✔️Tile Warps|✔️**Multi Stages(2~4)**|  \n|✔️Register Double Buffers|✔️**Block Swizzle**|✔️**Warp Swizzle**|✔️**SMEM Swizzle**(CuTe/MMA)|\n|✔️Collective Store(Shfl)|✔️Layout NN|✔️Layout TN|✔️SGEMM FP32/TF32|\n\n## ©️Citations🎉🎉\n\n```BibTeX\n@misc{HGEMM@2024,\n  title={HGEMM: Write HGEMM from scratch using Tensor Cores with WMMA, MMA PTX and CuTe API.},\n  url={https://github.com/xlite-dev/HGEMM},\n  note={Open-source software available at https://github.com/xlite-dev/HGEMM},\n  author={xlite-dev etc},\n  year={2024}\n}\n```\n\n## 📖 HGEMM CUDA Kernels in Toy-HGEMM Library 🎉🎉 \n\n\u003cdiv id=\"kernels\"\u003e\u003c/div\u003e  \n\n```C++  \nvoid hgemm_naive_f16(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_sliced_k_f16(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid 
hgemm_t_8x8_sliced_k_f16x4(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k_f16x4_pack(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k_f16x4_bcf(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k_f16x4_pack_bcf(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k_f16x8_pack_bcf(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k_f16x8_pack_bcf_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k16_f16x8_pack_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k16_f16x8_pack_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k32_f16x8_pack_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_8x8_sliced_k32_f16x8_pack_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_16x8_sliced_k32_f16x8_pack_dbuf(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_t_16x8_sliced_k32_f16x8_pack_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_cublas_tensor_op_nn(torch::Tensor a, torch::Tensor b, torch::Tensor c); \nvoid hgemm_cublas_tensor_op_tn(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_wmma_m16n16k16_naive(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_wmma_m16n16k16_mma4x2(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_wmma_m16n16k16_mma4x2_warp2x4(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_wmma_m16n16k16_mma4x2_warp2x4_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_wmma_m32n8k16_mma2x4_warp2x4_dbuf_async(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_wmma_m16n16k16_mma4x2_warp2x4_stages(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid 
hgemm_wmma_m16n16k16_mma4x2_warp2x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_wmma_m16n16k16_mma4x2_warp4x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);                                                        \nvoid hgemm_wmma_m16n16k16_mma4x4_warp4x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_naive(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4(torch::Tensor a, torch::Tensor b, torch::Tensor c);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4_stages(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_x4(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_rr(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4_stages_dsmem_tn(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_stages_block_swizzle_tn_cute(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_swizzle(torch::Tensor a, torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\nvoid hgemm_mma_m16n8k16_mma2x4_warp4x4x2_stages_dsmem_tn_swizzle_x4(torch::Tensor a, 
torch::Tensor b, torch::Tensor c, int stages, bool swizzle, int swizzle_stride);\n```\n\n## 📖 Contents\n\n- [📖 Prerequisites](#prerequisites)\n- [📖 Installation](#install)\n- [📖 Python Testing](#test)\n- [📖 C++ Testing](#test-cpp)\n- [📖 NVIDIA L20 bench](#perf-l20)\n- [📖 NVIDIA RTX 4090 bench](#perf-4090)\n- [📖 NVIDIA RTX 3080 Laptop bench](#perf-3080)\n- [📖 Docs](#opt-docs)\n- [📖 References](#ref)\n\n## 📖 Prerequisites\n\u003cdiv id=\"prerequisites\"\u003e\u003c/div\u003e  \n\n- PyTorch \u003e= 2.0, CUDA \u003e= 12.0\n- Recommended: PyTorch 2.5.1, CUDA 12.5\n\n## 📖 Installation  \n\n\u003cdiv id=\"install\"\u003e\u003c/div\u003e  \n\nThe HGEMM implemented in this repo can be install as a python library, namely, `toy-hgemm` library (optional). \n```bash\ncd kernels/hgemm\ngit submodule update --init --recursive --force # Fetch `CUTLASS` submodule， needed\npython3 setup.py bdist_wheel \u0026\u0026 cd dist \u0026\u0026 python3 -m pip install *.whl # pip uninstall toy-hgemm -y \n```\n\n## 📖 Python Testing\n\n\u003cdiv id=\"test\"\u003e\u003c/div\u003e  \n\n**CUTLASS**: Fetch `CUTLASS` submodule. 
Currently, `v3.5.1` is used for the HGEMM CuTe kernels.\n```bash\ngit submodule update --init --recursive --force\n```\n\nYou can test the many custom HGEMM kernels via the Python script and compare their performance.\n\n```bash\n# You can target Ada or Ampere only, or other archs: Volta, Ampere, Ada, Hopper, ...\nexport TORCH_CUDA_ARCH_LIST=Ada # for Ada only\nexport TORCH_CUDA_ARCH_LIST=Ampere # for Ampere only\npython3 hgemm.py --wmma # test default wmma kernels for all MNK\npython3 hgemm.py --mma  # test default mma kernels for all MNK\npython3 hgemm.py --M 16384 --N 16384 --K 8192 --i 10 --wmma # test default wmma kernels for specific MNK\npython3 hgemm.py --M 16384 --N 16384 --K 8192 --i 10 --mma # test default mma kernels for specific MNK\npython3 hgemm.py --wmma-all # test all wmma kernels for all MNK\npython3 hgemm.py --mma-all # test all mma kernels for all MNK\npython3 hgemm.py --cuda-all --wmma-all --mma-all # test all kernels for all MNK\npython3 hgemm.py --cute-tn --no-default # test cute hgemm kernels with smem swizzle for all MNK\n```\nIf you want to draw a TFLOPS curve, install `matplotlib` first and set the `--plot-flops` (or `--plot`) option.\n```bash\npython3 -m pip install matplotlib\n# Specify topk to plot only the top k kernels with the best performance.\npython3 hgemm.py --mma-all --plot --topk 8\n# test default mma kernels \u0026 cute hgemm kernels with smem swizzle for all MNK\npython3 hgemm.py --cute-tn --mma --plot \n```\n\n## 📖 C++ Testing\n\n\u003cdiv id=\"test-cpp\"\u003e\u003c/div\u003e  \n\nThe HGEMM benchmark also supports C++ testing. Currently, it supports comparisons between the following implementations:\n\n- MMA HGEMM NN implemented in this repository\n- CuTe HGEMM TN implemented in this repository\n- cuBLAS HGEMM TN using the default Tensor Cores math algorithm\n\nPerformance data obtained from C++ binary tests tend to be slightly better than those from Python tests. 
This difference may be attributed to additional overhead introduced by the PyTorch Python bindings.\n```bash\nmake\n./hgemm_mma_stage.bin\n# NVIDIA L20\nALGO = MMA16816 HGEMM NN MMA=2x4 WARP=4x4x2 STAGES=2 BLOCK SWIZZLE=2048\nM N K =  12544  12544  12544, Time =   0.03445555   0.03446098   0.03447399 s, AVG Performance =   114.5541 Tflops\nM N K =  15360  15360  15360, Time =   0.06307226   0.06307789   0.06308864 s, AVG Performance =   114.9017 Tflops\nM N K =  15616  15616  15616, Time =   0.06612480   0.06612798   0.06613094 s, AVG Performance =   115.1739 Tflops\nM N K =  15872  15872  15872, Time =   0.06969549   0.06970215   0.06971290 s, AVG Performance =   114.7305 Tflops\nM N K =  16128  16128  16128, Time =   0.07295078   0.07295406   0.07295693 s, AVG Performance =   115.0064 Tflops\nM N K =  16384  16384  16384, Time =   0.07663001   0.07663534   0.07664947 s, AVG Performance =   114.7785 Tflops\n\n./hgemm_cute.bin\n# NVIDIA L20\nALGO = CuTe HGEMM, TN, STAGES=2, SMEM SWIZZLE=\u003c3, 3, 3\u003e, BLOCK SWIZZLE=2048\nM N K =  12544  12544  12544, Time =   0.03413504   0.03414354   0.03415450 s, AVG Performance =   115.6191 Tflops\nM N K =  15360  15360  15360, Time =   0.06227354   0.06228111   0.06228992 s, AVG Performance =   116.3717 Tflops\nM N K =  15616  15616  15616, Time =   0.06492467   0.06493727   0.06496666 s, AVG Performance =   117.2858 Tflops\nM N K =  15872  15872  15872, Time =   0.06843085   0.06843873   0.06844723 s, AVG Performance =   116.8485 Tflops\nM N K =  16128  16128  16128, Time =   0.07200256   0.07200881   0.07201792 s, AVG Performance =   116.5161 Tflops\nM N K =  16384  16384  16384, Time =   0.07564493   0.07565752   0.07567462 s, AVG Performance =   116.2620 Tflops\n\n./hgemm_cublas.bin\n# NVIDIA L20\nALGO = cuBLAS CUBLAS_GEMM_DEFAULT_TENSOR_OP TN\nM N K =  12544  12544  12544, Time =   0.03472691   0.03472968   0.03473408 s, AVG Performance =   113.6678 Tflops\nM N K =  15360  15360  15360, Time =   0.06332416   
0.06333143   0.06334157 s, AVG Performance =   114.4417 Tflops\nM N K =  15616  15616  15616, Time =   0.06649446   0.06650184   0.06651699 s, AVG Performance =   114.5264 Tflops\nM N K =  15872  15872  15872, Time =   0.06977024   0.06977659   0.06978355 s, AVG Performance =   114.6081 Tflops\nM N K =  16128  16128  16128, Time =   0.07319142   0.07320709   0.07326925 s, AVG Performance =   114.6089 Tflops\nM N K =  16384  16384  16384, Time =   0.07668429   0.07669371   0.07670784 s, AVG Performance =   114.6912 Tflops\n```\n\n## 📖 Benchmark  \n\n\u003cdiv id=\"perf-l20\"\u003e\u003c/div\u003e  \n\n### 📖 NVIDIA L20  \n\u003c!--\n目前最优的实现，在L20上（理论Tensor Cores FP16算力为 119.5 TFLOPS），整体上能达到cuBLAS大概`99~100+%`左右的性能。使用WMMA API能达到cuBLAS大概`95%~98%`左右的性能(105-113 TFLOPS vs 105-115 TFLOPS)，使用MMA API能达到115 TFLOPS，部分 case 会超越 cuBLAS。CuTe 版本的 HGEMM 实现了 Block Swizzle（L2 Cache friendly）和 SMEM Swizzle（bank conflicts free），性能最优，大规模矩阵乘能达到 116-117 TFLOPS，是 cuBLAS 大概`98%~100%+`左右的性能，很多case会超越cuBLAS。目前通过 SMEM Padding 和 SMEM Swizzle 的方式缓解 bank conflicts。对于 NN layout，使用 SMEM Padding 缓解 bank conflicts；对于 TN layout，通过 CUTLASS/CuTe 的 SMEM Swizzle 消除 bank conflicts。\n--\u003e\nThe current best implementation, on the L20 (with a theoretical Tensor Cores FP16 performance of 119.5 TFLOPS), achieves performance that is approximately 99~100+% of cuBLAS.\n\n- Using the WMMA API, it can achieve around 95%~98% of cuBLAS performance (105-113 TFLOPS vs 105-115 TFLOPS).\n- Using the MMA API, it can reach 115 TFLOPS, surpassing cuBLAS in some cases.\n- The CuTe version of HGEMM implements Block Swizzle (L2 Cache friendly) and SMEM Swizzle (bank conflicts free), achieving the best performance. 
For large-scale matrix multiplication, it can reach 116-117 TFLOPS, which is approximately 98%~100%+ of cuBLAS performance, and it outperforms cuBLAS in many cases.\n\nCurrently, SMEM Padding and SMEM Swizzle are used to mitigate bank conflicts:\n\n- For the NN layout, SMEM Padding is used to alleviate bank conflicts.\n- For the TN layout, CUTLASS/CuTe's SMEM Swizzle is used to eliminate bank conflicts.\n\n\u003cdiv id=\"NV-L20\"\u003e\u003c/div\u003e\n\n\n![NVIDIA_L20_NN+TN+v2](https://github.com/user-attachments/assets/71927ac9-72b3-4ce9-b0e2-788b5885bc99)\n\n  \nThe command below tests all MNK setups (tip: performance data for each MNK tested individually is more accurate):\n```bash\npython3 hgemm.py --cute-tn --mma --plot\n```\n\n### 📖 NVIDIA GeForce RTX 4090\n\n\u003cdiv id=\"perf-4090\"\u003e\u003c/div\u003e  \n\n\u003c!--\n在NVIDIA RTX 4090上(FP16 Tensor Cores算力为330 TFLOPS)，WMMA(m16n16k16)性能表现比MMA(m16n8k16)要更好，大分部MNK下，本仓库的实现能达到cuBLAS 95%~99%的性能，某些case能超过cuBLAS。就本仓库的实现而言，在RTX 4090上，大规模矩阵乘(MNK\u003e=8192)，WMMA表现更优，小规模矩阵乘，MMA表现更优。\n--\u003e\n\nOn the NVIDIA RTX 4090 (with an FP16 Tensor Cores performance of 330 TFLOPS), the WMMA (m16n16k16) implementation shows better performance compared to MMA (m16n8k16). For most MNK configurations, this repository's implementation achieves 95%~99% of cuBLAS performance, and in certain cases, it can surpass cuBLAS. 
Specifically:\n\n- For large-scale matrix multiplications (MNK \u003e= 8192), the WMMA implementation performs better.\n- For small-scale matrix multiplications, the MMA implementation is more efficient.\n\n\n![NVIDIA_GeForce_RTX_4090_NN+TN+v4](https://github.com/user-attachments/assets/05ef4f5e-d999-48ea-b58e-782cffb24e85)\n\n```bash\npython3 hgemm.py --cute-tn --mma --wmma-all --plot\n```\n\n### 📖 NVIDIA GeForce RTX 3080 Laptop   \n\n\u003cdiv id=\"perf-3080\"\u003e\u003c/div\u003e  \n\n\u003c!--\n在NVIDIA GeForce RTX 3080 Laptop上测试，使用mma4x4_warp4x4（16 WMMA m16n16k16 ops, warp tile 64x64）以及Thread block swizzle，大部分case能持平甚至超过cuBLAS，使用Windows WSL2 + RTX 3080 Laptop进行测试。\n--\u003e\nTesting was conducted on an NVIDIA GeForce RTX 3080 Laptop using the mma4x4_warp4x4 configuration (16 WMMA m16n16k16 operations with a warp tile size of 64x64) along with Thread block swizzle. In most cases, this setup matches or even exceeds cuBLAS performance. The tests were performed under Windows WSL2.\n\n![image](https://github.com/user-attachments/assets/9472e970-c083-4b31-9252-3eeecc761078)\n\n```bash\npython3 hgemm.py --wmma-all --plot\n```\n\n\u003cdetails\u003e\n\u003csummary\u003e 🔑️ Performance Optimization Notes(TODO)\u003c/summary\u003e    \n\n## 📖 Performance Optimization Notes\n\n\u003cdiv id=\"opt-docs\"\u003e\u003c/div\u003e  \n\n### PyTorch HGEMM Profile\n\nOn the Ada architecture, a PyTorch 2.4 FP16 matmul dispatches to:\n```C++\nampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nn_kernel\n```\nwhich internally computes with HMMA (Tensor Cores) instructions. Profiling on the RTX 3080 shows it uses:\n```C++\nsm80_xmma_gemm_f16f16_f16f32_f32_nn_n_tilesize96x64x32_stage3_warpsize2x2x1_tensor16x8x16_kernel\n```\nTherefore, only an HGEMM implementation that uses Tensor Cores has a chance of approaching PyTorch/cuBLAS performance.\n```bash\nncu -o hgemm.prof -f python3 bench/prof.py\nnsys profile --stats=true -t cuda,osrt,nvtx -o hgemm.prof --force-overwrite true python3 prof.py\n```\n- SASS (L20)\n\n```C\n// 
ampere_fp16_s1688gemm_fp16_128x128_ldg8_f2f_stages_32x1_nn_kernel\n310\t00007f41 37d5b850\t      LDSM.16.M88.4 R192, [R169+UR8+0x2000] \n311\t00007f41 37d5b860\t      LDSM.16.M88.4 R196, [R169+UR8+0x2800]\n336\t00007f41 37d5b9f0\t      HMMA.1688.F32 R112, R182, R196, R112\n...\n```\n\n### SMEM Padding  \n\n#### How Bank Conflicts Arise\n  \nDefinition: when multiple threads access different data addresses that fall in the same bank of shared memory, the concurrent reads/writes to shared memory degrade into sequential ones; this phenomenon is called a Bank Conflict.\n\n![](https://github.com/PaddleJitLab/CUDATutorial/blob/develop/docs/09_optimize_reduce/02_bank_conflict/images/ef322be7c3e5b6b9be69d2b90e88083f50569a58a97129f348e483b946ab4edf.png)\n\nThe SM schedules in units of a warp (32 threads per warp). Shared memory can be accessed by all 32 threads of a warp and is mapped onto 32 equally sized banks, each with a read bandwidth of 32 bits (4 bytes) per cycle; the main concern is therefore bank conflicts among the 32 threads of a single warp when they access shared memory.\nWhen multiple threads read the same bank at different addresses, the hardware splits the memory request into conflict-free requests that are served sequentially, which triggers multiple memory transactions. In particular, when all threads of a warp access the same address, a broadcast mechanism kicks in and the accesses do not degrade into sequential ones. The condition usually stated for broadcast is that all threads access the same address, but the cuda-c-programming-guide and the latest [NVProfGuide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html) indicate that broadcast is triggered whenever multiple threads access the same address (it does not need to be all of them).\n  \n- When multiple threads read the same word, only one thread reads it and the value is broadcast to the other threads.\n- When multiple threads write the same word, only one thread's write succeeds.\n\nAn NVIDIA [article](https://developer.nvidia.com/blog/using-shared-memory-cuda-cc/) points out that bank conflicts can also be avoided by setting the default bank size (4 bytes by default) via `cudaDeviceSetSharedMemConfig()`, which accepts cudaSharedMemBankSizeFourByte or cudaSharedMemBankSizeEightByte. For some scenarios, such as when using the double data type, cudaSharedMemBankSizeEightByte may be the better fit. \n\n```C\ncudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);\n```\nCurrently, SMEM Padding and SMEM Swizzle are used to mitigate bank conflicts. For the NN layout, SMEM Padding alleviates bank conflicts; for the TN layout, CUTLASS/CuTe's SMEM Swizzle eliminates them.\n\n### Double Buffers\n\nThe HGEMM double-buffering strategy implemented in this repo is as follows: 1) the main loop starts from bk = 1; the first data load happens before the main loop and the last computation after it, which is inherent to the pipeline; 2) since the computation and the next load use different Shared Memory buffers, each iteration of the main loop needs only one __syncthreads(); compared with the non-double-buffered version, this saves a total of ((K + BK - 1) / BK) - 1 block-level synchronizations. For example, when bk=1, the HFMA computation uses s_a[0] and s_b[0], so it has no dependency on the loads into s_a[1] and s_b[1]: the loads from global memory into s_a[1] and s_b[1] can run in parallel with the HFMA computation, with s_a[1] and s_b[1] staging the data of the next BK tile into shared memory; 3) since a GPU cannot execute out of order the way a CPU can, the main loop first loads the data needed by the next iteration from Global Memory into registers, then performs the current computation, and only afterwards writes the registers to Shared Memory; this way the LDG loads from Global Memory do not stall the launch of the subsequent HFMA and other compute instructions, which achieves the goal of Double Buffers. See [hgemm.cu](./hgemm.cu) for the code.\n\n\n### Tile Block\n\nTODO\n\n### Tile Thread\n\nTODO\n\n### Pack LDST 128 bits\n\nTODO\n\n### Async Copy\n\nTODO\n\n### Multi Stages\n\nTODO\n\n### Tensor Cores(WMMA/MMA)\n\nTODO\n\n### Tile MMA/Warp\n\nTODO \n\n### Thread Block Swizzle \n\nTODO\n\n### Warp Swizzle\n\nTODO\n\n### Reg Double Buffers\n\nTODO\n\n### Collective Store(Reg Reuse\u0026Warp Shuffle)\n\nTODO\n\n### SMEM Swizzle/Permuted\n\nTODO\n\n\u003c/details\u003e\n\n## 📖 References \n\n\u003cdiv id=\"ref\"\u003e\u003c/div\u003e  \n\n- [flash-attention-minimal](https://github.com/tspeterkim/flash-attention-minimal)\n- [tiny-flash-attention](https://github.com/66RING/tiny-flash-attention)\n- [cute-gemm](https://github.com/reed-lau/cute-gemm)\n- [cutlass_flash_atten_fp8](https://github.com/weishengying/cutlass_flash_atten_fp8)\n- [cuda_learning](https://github.com/ifromeast/cuda_learning)\n- [cuda_hgemm](https://github.com/Bruce-Lee-LY/cuda_hgemm)\n- [cuda-tensorcore-hgemm](https://github.com/nicolaswilde/cuda-tensorcore-hgemm)\n- [How_to_optimize_in_GPU](https://github.com/Liu-xiandong/How_to_optimize_in_GPU/tree/master/sgemv)\n- [cute_gemm](https://github.com/weishengying/cute_gemm)\n- [cutlass](https://github.com/NVIDIA/cutlass)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxlite-dev%2Fhgemm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxlite-dev%2Fhgemm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxlite-dev%2Fhgemm/lists"}