{"id":23534990,"url":"https://github.com/andreasholt/cuda-matmul-benchmarking","last_synced_at":"2025-11-01T00:30:27.357Z","repository":{"id":268142659,"uuid":"902079316","full_name":"AndreasHolt/cuda-matmul-benchmarking","owner":"AndreasHolt","description":"Implementing and benchmarking various matmul implementations in CUDA","archived":false,"fork":false,"pushed_at":"2024-12-29T23:58:37.000Z","size":35,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-17T06:13:07.412Z","etag":null,"topics":["cuda","matrix-multiplication"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AndreasHolt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-11T21:38:01.000Z","updated_at":"2024-12-29T23:58:40.000Z","dependencies_parsed_at":null,"dependency_job_id":"74003be5-cc48-4024-a97c-e5f17d6d0974","html_url":"https://github.com/AndreasHolt/cuda-matmul-benchmarking","commit_stats":null,"previous_names":["andreasholt/cuda-matmul-benchmarking"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndreasHolt%2Fcuda-matmul-benchmarking","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndreasHolt%2Fcuda-matmul-benchmarking/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndreasHolt%2Fcuda-matmul-benchmarking/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AndreasHolt%2Fcuda-matmul-benchmarking/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AndreasHolt","download_url":"https://codeload.github.com/AndreasHolt/cuda-matmul-benchmarking/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239242112,"owners_count":19605954,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","matrix-multiplication"],"created_at":"2024-12-26T01:14:07.885Z","updated_at":"2025-11-01T00:30:27.297Z","avatar_url":"https://github.com/AndreasHolt.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# CUDA Matrix Multiplication Benchmarking\n\nThis project implements and benchmarks different approaches to matrix multiplication using CUDA:\n- Sequential CPU implementation\n- Naive GPU implementation\n- GPU implementation with memory coalescing (via thread remapping)\n- Tiled GPU implementation using shared memory\n- Tiled GPU implementation using shared memory and memory coalescing (via matrix B transposition)\n\nFuture work:\n- More optimizations planned (vectorized memory access, register tiling, etc.)\n\nThe implementations and results are discussed in detail in these blog posts on my personal site:\n- [Part 1: Naive GPU Implementation, Explanation, and CPU vs naive GPU Benchmarking](https://andreasholt.com/posts/gpu-vs-cpu-matmul/)\n- [Part 2: Tiled Matrix Multiplication Explained and Implemented, Benchmarking against naive GPU, and Performance Analysis with Nsight Compute](https://andreasholt.com/posts/shared-tiled-matmul/)\n\n\n\n## Building the Project\n\n```bash\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake\n```\n\n## Usage\n\nThe executable supports different modes:\n\n### Run Full Benchmark Suite\n```bash\n./matmul\n```\nThis runs benchmarks for all implementations across matrix sizes: 32×32, 256×256, 1024×1024, and 2048×2048.\n\n### Profile Specific Implementation\n```bash\n./matmul profile \u003ctype\u003e \u003cdim\u003e\n```\n- `\u003ctype\u003e`: Implementation type (`naive_gpu`, `coalesced_gpu`, `tiled_gpu`, `tiled_coalesced_gpu`)\n- `\u003cdim\u003e`: Matrix dimension (creates dim×dim matrices)\n\nExample:\n```bash\n./matmul profile tiled_gpu 1024\n```\n\n### NVIDIA Nsight Compute Profiling\nFor detailed GPU metrics:\n```bash\nncu --set full -o naive_2048_full.ncu-rep ./matmul profile naive_gpu 2048\nncu --set full -o tiled_2048_full.ncu-rep ./matmul profile tiled_gpu 2048\nncu --set full -o tiled_coalesced_2048_full.ncu-rep ./matmul profile tiled_coalesced_gpu 2048\n```\n\n## Implementation Details\n\nThe project implements matrix multiplication using different approaches:\n- Each thread computes one element of the output matrix (`naive_gpu`)\n- Uses shared memory tiling to improve memory access patterns (`tiled_gpu`)\n- Basic CPU implementation for baseline comparison (`sequential_cpu`)\n\nEach implementation can be benchmarked and profiled independently to compare performance across different metrics. For now these metrics include GFLOPS and time (ms).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreasholt%2Fcuda-matmul-benchmarking","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandreasholt%2Fcuda-matmul-benchmarking","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandreasholt%2Fcuda-matmul-benchmarking/lists"}