{"id":17632662,"url":"https://github.com/han-minhee/sgemm_hip","last_synced_at":"2026-04-27T23:35:07.882Z","repository":{"id":257679475,"uuid":"858733730","full_name":"han-minhee/SGEMM_HIP","owner":"han-minhee","description":"SGEMM implementations in HIP for NVIDIA / AMD GPUs","archived":false,"fork":false,"pushed_at":"2024-09-20T04:58:59.000Z","size":997,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-30T03:31:44.526Z","etag":null,"topics":["cuda","gpgpu","gpu","hip","rocm"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/han-minhee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-17T12:49:12.000Z","updated_at":"2024-12-27T08:12:07.000Z","dependencies_parsed_at":"2024-09-18T04:04:36.782Z","dependency_job_id":"ff957250-5d01-471b-ab93-abc631633637","html_url":"https://github.com/han-minhee/SGEMM_HIP","commit_stats":null,"previous_names":["han-minhee/sgemm_hip"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/han-minhee%2FSGEMM_HIP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/han-minhee%2FSGEMM_HIP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/han-minhee%2FSGEMM_HIP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/han-minhee%2FSGEMM_HIP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/han-minhee","download_url":"https://codeload.github.com/han-minhee/SGEMM_HIP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246273533,"owners_count":20750904,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","gpgpu","gpu","hip","rocm"],"created_at":"2024-10-23T01:45:00.647Z","updated_at":"2026-04-27T23:35:07.846Z","avatar_url":"https://github.com/han-minhee.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# HIP SGEMM Kernels for NVIDIA/AMD GPUs\nA HIP port from the [CUDA version](https://github.com/siboehm/SGEMM_CUDA) and [the original repo](https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE) with some modifications. 
\n\n**The HIP kernel code is almost identical to the CUDA kernel code, so I barely touched the kernels.**\n\n## Overview\n\nGFLOPs at matrix size 4096x4096:\n\n#### Test Results on an AMD RX 7900 GRE (gfx1100):\n![](RX7900GRE_benchmark_results.png)\n| Kernel                              |   GFLOPs/s | Performance relative to hipBLAS   |\n|:------------------------------------|-----------:|:----------------------------------|\n| 1: Naive                            |      261.8 | 1.3%                              |\n| 2: GMEM Coalescing                  |      320.8 | 1.6%                              |\n| 3: SMEM Caching                     |     2733.9 | 13.9%                             |\n| 4: 1D Blocktiling                   |     5343   | 27.1%                             |\n| 5: 2D Blocktiling                   |    10269.3 | 52.0%                             |\n| 8: Avoid Bank Conflicts (Offset)    |    10695.9 | 54.2%                             |\n| 6: Vectorized Mem Access            |    14820.4 | 75.1%                             |\n| 7: Avoid Bank Conflicts (Linearize) |    17735.4 | 89.9%                             |\n| 9: Autotuning                       |    18650.8 | 94.5%                             |\n| 0: hipBLAS                          |    19731.5 | 100.0%                            |\n| 10: Warptiling                      |    22355   | 113.3%                            |\n\n#### Test Results on an NVIDIA RTX 4060 Laptop (sm89, TGP 55W):\n![](RTX4060Laptop_benchmark_results.png)\n| Kernel                              |   GFLOPs/s | Performance relative to hipBLAS   |\n|:------------------------------------|-----------:|:----------------------------------|\n| 1: Naive                            |      119.5 | 2.1%                              |\n| 2: GMEM Coalescing                  |      670.3 | 11.6%                             |\n| 3: SMEM Caching                     |     1011.9 | 17.6%                             |\n| 4: 1D 
Blocktiling                   |     2649.2 | 46.0%                             |\n| 5: 2D Blocktiling                   |     4848.7 | 84.1%                             |\n| 7: Avoid Bank Conflicts (Linearize) |     4938.3 | 85.7%                             |\n| 8: Avoid Bank Conflicts (Offset)    |     4973.6 | 86.3%                             |\n| 6: Vectorized Mem Access            |     5592.8 | 97.0%                             |\n| 0: hipBLAS                          |     5763.8 | 100.0%                            |\n| 9: Autotuning                       |     5868.7 | 101.8%                            |\n| 10: Warptiling                      |     6252.1 | 108.5%                            |\n\n#### Your Results\n![](benchmark_results.png)\n\u003c!-- benchmark_results --\u003e\n\n\u003c!-- benchmark_results --\u003e\n\n### Slight Modifications\n1. **Ported to HIP to work with AMD GPUs and NVIDIA GPUs**\n2. Changed the project structure slightly and added error checks on HIP function calls (which eliminates the compile-time warnings and is good practice)\n3. Changed the way auto-tuning works, with the tuning results integrated into the CMake build.\n\n### Requirements\n\n#### Hardware Requirements\nAny AMD or NVIDIA GPU system compatible with the HIP software stack (with the full ROCm for AMD GPUs, or CUDA for NVIDIA GPUs) should work. It was tested on a system with two RX 7900 GRE GPUs and on a system with an RTX 4060 Laptop GPU.\n\n#### Software Requirements\nIt was tested on Ubuntu 24.04 with ROCm 6.2. Other than ROCm, some basic packages from the Ubuntu repository are needed (`build-essential`, `cmake`, `bc`). To parse the run results and plot the graph, a conda environment is required. 
For NVIDIA GPUs, refer to the [note](HIP_NVIDIA.md).\n\n### Build\nTo build for an AMD GPU, for example when your target device is `gfx1100`:\n```\nmkdir build \u0026\u0026 cd build\ncmake -DROCM_TARGET_DEVICE=gfx1100 ..\ncmake --build .\n```\n\nTo build for an NVIDIA GPU, use `-DCUDA_TARGET_ARCH` instead of `-DROCM_TARGET_DEVICE`. For example, if your target device is `sm70`:\n```\ncmake -DCUDA_TARGET_ARCH=70 ..\n```\n\n### Run the Benchmark\nAfter building, you can run a kernel with `./sgemm [kernel_num]`. To run kernel 8:\n```\n./sgemm 8\n```\n\nTo run the benchmark for all kernels and plot the results to a PNG file:\n\n```\nconda env create -f environment.yml\nconda activate SGEMM_HIP\n./gen_benchmark_results.sh\n```\n\nAfter it finishes, the numbers in `README.md` will be updated and `benchmark_results.png` will be generated.\n\n### Auto-tuning for Kernel 9 and Kernel 10\n\n#### Disclaimer\nAs of now, the cleanup after each build isn't working as expected, so you should first build the program as described above and then run the tuning scripts. After one script has run, clean up the `./build` directory, rebuild, and then run the other script.\n\nRunning `./scripts/kernel_9_autotuner.sh` and `./scripts/kernel_10_autotuner.sh` will generate `./benchmark_results/best_kernel_9_config.cmake` and `./benchmark_results/best_kernel_10_config.cmake`, respectively. After running through the possible combinations, they save the best combination to CMake files that are used during the build process. 
They will be automatically included in the main build if they exist.\n\nIf compilation keeps failing during auto-tuning, try cleaning up the build directory and building the main program again.\n\n## Credits\n[CUDA version](https://github.com/siboehm/SGEMM_CUDA) for the source code base\n\n[wangzyon/NVIDIA_SGEMM_PRACTICE](https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE) for the original article and the implementation\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhan-minhee%2Fsgemm_hip","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhan-minhee%2Fsgemm_hip","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhan-minhee%2Fsgemm_hip/lists"}