{"id":18122952,"url":"https://github.com/lawmurray/gpu-gemm","last_synced_at":"2026-03-01T04:09:20.928Z","repository":{"id":260396740,"uuid":"865825218","full_name":"lawmurray/gpu-gemm","owner":"lawmurray","description":"CUDA kernel for matrix-matrix multiplication on Nvidia GPUs, using a Hilbert curve to improve L2 cache utilization.","archived":false,"fork":false,"pushed_at":"2024-11-04T04:47:42.000Z","size":35,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T22:52:17.646Z","etag":null,"topics":["cplusplus","cuda","cuda-kernels","cuda-programming","gpu","gpu-computing","gpu-programming","matrix-multiplication","numerical-methods","scientific-computing"],"latest_commit_sha":null,"homepage":"https://indii.org/blog/gpu-matrix-multiply","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lawmurray.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-10-01T07:43:01.000Z","updated_at":"2025-01-15T20:16:53.000Z","dependencies_parsed_at":"2024-10-31T04:25:29.126Z","dependency_job_id":"1ae70137-6c96-4040-ac55-f3bfccf7a83c","html_url":"https://github.com/lawmurray/gpu-gemm","commit_stats":null,"previous_names":["lawmurray/gpu-gemm"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lawmurray/gpu-gemm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lawmurray%2Fgpu-gemm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lawmurray%2Fgpu-gemm/tags",
"releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lawmurray%2Fgpu-gemm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lawmurray%2Fgpu-gemm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lawmurray","download_url":"https://codeload.github.com/lawmurray/gpu-gemm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lawmurray%2Fgpu-gemm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29960236,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T01:47:18.291Z","status":"online","status_checked_at":"2026-03-01T02:00:07.437Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cplusplus","cuda","cuda-kernels","cuda-programming","gpu","gpu-computing","gpu-programming","matrix-multiplication","numerical-methods","scientific-computing"],"created_at":"2024-11-01T07:07:22.116Z","updated_at":"2026-03-01T04:09:20.904Z","avatar_url":"https://github.com/lawmurray.png","language":"Cuda","readme":"# CUDA Kernel for Matrix-Matrix Multiplication on Nvidia GPUs\n\nThis code accompanies the blog post [Matrix Multiplication Faster Than Nvidia, Sometimes](https://indii.org/blog/gpu-matrix-multiply). 
It provides a CUDA kernel for single-precision matrix-matrix multiplication, with two notable features:\n\n* use of a Hilbert curve to improve L2 cache efficiency,\n* avoidance of synchronization across whole thread blocks, replacing it with synchronization across half and quarter blocks.\n\n\n## License\n\nThis is open source software. It is licensed under the Apache License,\nVersion 2.0 (the \"License\"); you may not use it except in compliance with the\nLicense. You may obtain a copy of the License at\n\u003chttp://www.apache.org/licenses/LICENSE-2.0\u003e.\n\n\n## Requirements\n\nYou will need:\n\n* an Nvidia graphics card,\n* a working [CUDA](https://developer.nvidia.com/cuda-downloads) installation,\n* [cmake](https://cmake.org) to build the code.\n\nThe code has been tested with an Nvidia GeForce RTX 4080 Laptop GPU, using CUDA 12.6.1 on a laptop running Ubuntu 24.04.\n\n\n## Building\n\nBuild with:\n\n    mkdir build\n    cd build\n    cmake -DCMAKE_BUILD_TYPE=Release ..\n    cmake --build .\n\n\n## Running\n\nFrom within that same `build` directory, run with:\n\n    ./gemm\n\nA table of results is output in Markdown format. You may want to tweak the actual tests that are run by editing `src/gemm.cu` (at the bottom) before compiling, especially if you need to reduce the number of trials or remove the larger matrix sizes to fit within memory constraints. Without changes, a GPU with at least 4 GB of device memory is ideal.\n\nThe code uses 32-bit array indexing, and will not work with matrices larger than 32768x32768 without modification to avoid integer overflow.\n\n\u003e Consider sharing your results as a [discussion](https://github.com/lawmurray/gpu-gemm/discussions). 
You can copy and paste the output table directly, as it is in GitHub-compatible Markdown.\n\n\n## Benchmarking\n\nFor the purposes of benchmarking, refer to the blog post [Matrix Multiplication Faster than Nvidia, Sometimes](https://indii.org/blog/gpu-matrix-multiply) for a discussion of some appropriate protocols.\n\nYou may wish to lock the clock and memory speed on your GPU for benchmarking purposes (or you may not; refer to the blog post). To do so, run the following commands:\n\n    sudo nvidia-smi --lock-gpu-clocks=1150\n    sudo nvidia-smi --lock-memory-clocks=6000\n\nChange the numbers as desired. Once benchmarking is complete, unlock them again with:\n\n    sudo nvidia-smi --reset-gpu-clocks\n    sudo nvidia-smi --reset-memory-clocks\n\n\n## Contributing\n\nContributions are welcome. This is prototype code, not production code, so most contributions would aim at improving understanding of matrix-matrix multiplication. That might include:\n\n* Running the code on your own system and reporting the results. The output of the program is a Markdown table that you can easily copy and paste into a [discussion](https://github.com/lawmurray/gpu-gemm/discussions).\n* Improving the performance of the code. Please send a [pull request](https://github.com/lawmurray/gpu-gemm/pulls), and perhaps consider writing a blog post or the like on the improvement.\n* Improving the benchmarking protocol. Perhaps you think the methodology can be improved for more accurate measurement, or there is an interesting scenario that is not currently considered. 
Again, please send a [pull request](https://github.com/lawmurray/gpu-gemm/pulls), or [start a discussion](https://github.com/lawmurray/gpu-gemm/discussions) if required.\n* Expanding to new use cases such as half precision or double precision.\n* Expanding to new hardware.\n* Fixing any bugs you find.\n\nOf course, these are just suggestions and not an exhaustive list.\n\n\n## Contact\n\nLawrence Murray, \u003chttps://indii.org\u003e, \u003clawrence@indii.org\u003e.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flawmurray%2Fgpu-gemm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flawmurray%2Fgpu-gemm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flawmurray%2Fgpu-gemm/lists"}