{"id":13568597,"url":"https://github.com/siboehm/SGEMM_CUDA","last_synced_at":"2025-04-04T04:31:27.179Z","repository":{"id":118393841,"uuid":"565304906","full_name":"siboehm/SGEMM_CUDA","owner":"siboehm","description":"Fast CUDA matrix multiplication from scratch","archived":false,"fork":false,"pushed_at":"2023-12-28T01:20:12.000Z","size":2906,"stargazers_count":668,"open_issues_count":10,"forks_count":93,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-03-28T12:06:26.474Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://siboehm.com/articles/22/CUDA-MMM","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/siboehm.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2022-11-13T00:44:54.000Z","updated_at":"2025-03-27T09:45:18.000Z","dependencies_parsed_at":null,"dependency_job_id":"c71e83df-e3e8-4073-9a5c-7ace142e6ea0","html_url":"https://github.com/siboehm/SGEMM_CUDA","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siboehm%2FSGEMM_CUDA","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siboehm%2FSGEMM_CUDA/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siboehm%2FSGEMM_CUDA/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/siboehm%2FSGEMM_CUDA/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/siboehm","download_url":"https://codeload.github.com/siboehm/SGEMM_CUDA/tar.gz/refs/heads/maste
r","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247123072,"owners_count":20887259,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T14:00:28.885Z","updated_at":"2025-04-04T04:31:22.170Z","avatar_url":"https://github.com/siboehm.png","language":"Cuda","readme":"# Fast CUDA SGEMM from Scratch\n\nStep-by-step optimization of matrix multiplication, implemented in CUDA.\nFor an explanation of each kernel, see [siboehm.com/CUDA-MMM](https://siboehm.com/articles/22/CUDA-MMM).\n\n## Overview\n\nRunning the kernels on an NVIDIA A6000 (Ampere):\n\n![](benchmark_results.png)\n\nGFLOPs at matrix size 4096x4096:\n\u003c!-- benchmark_results --\u003e\n| Kernel                              |  GFLOPs/s | Performance relative to cuBLAS |\n|:------------------------------------|----------:|:-------------------------------|\n| 1: Naive                            |   `309.0` | 1.3%                           |\n| 2: GMEM Coalescing                  |  `1986.5` | 8.5%                           |\n| 3: SMEM Caching                     |  `2980.3` | 12.8%                          |\n| 4: 1D Blocktiling                   |  `8474.7` | 36.5%                          |\n| 5: 2D Blocktiling                   | `15971.7` | 68.7%                          |\n| 7: Avoid Bank Conflicts (Linearize) | `16213.4` | 69.7%                          |\n| 8: Avoid Bank Conflicts (Offset)    | `16459.2` | 70.8%                          |\n| 11: Double Buffering                | `17278.3` | 74.3%                          |\n| 6: Vectorized Mem Access  
          | `18237.3` | 78.4%                          |\n| 9: Autotuning                       | `19721.0` | 84.8%                          |\n| 10: Warptiling                      | `21779.3` | 93.7%                          |\n| 0: cuBLAS                           | `23249.6` | 100.0%                         |\n\u003c!-- benchmark_results --\u003e\n\n## Setup\n\n1. Install dependencies: CUDA toolkit 12, Python (+ Seaborn), CMake, Ninja. See [environment.yml](environment.yml).\n1. Configure NVCC compilation parameters. Look up your GPU's compute\n   capability [here](https://developer.nvidia.com/cuda-gpus). Then edit `CMakeLists.txt` and set:\n    ```cmake\n    set(CUDA_COMPUTE_CAPABILITY 80)\n    ```\n1. Build: `mkdir build \u0026\u0026 cd build \u0026\u0026 cmake .. \u0026\u0026 cmake --build .`\n1. Run one of the kernels: `DEVICE=\u003cdevice_id\u003e ./sgemm \u003ckernel number\u003e`\n1. Profile with [NVIDIA Nsight Compute](https://developer.nvidia.com/nsight-compute) (ncu): `make profile KERNEL=\u003ckernel number\u003e`\n\nCredit goes to [wangzyon/NVIDIA_SGEMM_PRACTICE](https://github.com/wangzyon/NVIDIA_SGEMM_PRACTICE) for the benchmarking setup.\n","funding_links":[],"categories":["Cuda","Example Implementations 💡"],"sub_categories":["Blogs 🖋️"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiboehm%2FSGEMM_CUDA","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsiboehm%2FSGEMM_CUDA","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsiboehm%2FSGEMM_CUDA/lists"}