{"id":22353452,"url":"https://github.com/mathiasotnes/gemm","last_synced_at":"2025-03-26T12:26:36.082Z","repository":{"id":263909764,"uuid":"891764466","full_name":"Mathiasotnes/GEMM","owner":"Mathiasotnes","description":"General Matrix Multiplication (GEMM) optimization in Cuda.","archived":false,"fork":false,"pushed_at":"2024-12-06T20:41:05.000Z","size":3711,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-31T13:34:08.395Z","etag":null,"topics":["cuda","gpu"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Mathiasotnes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-11-20T23:06:59.000Z","updated_at":"2024-12-06T20:41:09.000Z","dependencies_parsed_at":"2024-11-21T00:19:56.571Z","dependency_job_id":"0fdf954e-1f4c-44a9-887c-3f8757960578","html_url":"https://github.com/Mathiasotnes/GEMM","commit_stats":null,"previous_names":["mathiasotnes/gemm"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mathiasotnes%2FGEMM","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mathiasotnes%2FGEMM/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mathiasotnes%2FGEMM/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mathiasotnes%2FGEMM/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Mathiasotnes","download_url":"ht
tps://codeload.github.com/Mathiasotnes/GEMM/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245652317,"owners_count":20650462,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","gpu"],"created_at":"2024-12-04T13:08:32.938Z","updated_at":"2025-03-26T12:26:36.061Z","avatar_url":"https://github.com/Mathiasotnes.png","language":"Cuda","funding_links":[],"categories":[],"sub_categories":[],"readme":"# GEMM\nGeneral Matrix Multiplication (GEMM) optimization in CUDA.\n\n### Notes\n- I'm using square matrices\n- I'm setting alpha and beta to 1 and C to 0 to simplify\n- The functions include the memory allocation\n- When profiling stream_shmem, I saw that it launched hundreds of different kernel instances. The other\n  methods only had a single kernel instance.\n- CuBLAS uses a 3D grid (8, 16, 5) and a blockSize of (128,1,1). When I tried to use this in my shmem implementation\n  I got the wrong answer, but it reduced the number of cycles.\n\n### Talking points:\n\n1. Introduce problem:\n    - Simple matrix multiplication variation of GEMM.\n\n2. Go through implementations:\n    - CPU:              To compare with GPU implementation.\n    - naive:            Basic implementation of parallel matrix multiplication.\n    - shmem:            Utilizing shared memory in the same way as explained in lecture (tile-based).\n    - stream:           Tried to split the A-matrix into different tiles. 
Unsuccessfully.\n    - stream_shmem:     Stream combined with shared memory.\n    - cublas:           CuBLAS library wrapper.\n\n3. Go through results:\n    - results tile/block size 16:\n        - CPU was fastest on the small matrices because it doesn't have to copy memory.\n        - Naive and shmem were close on all the sizes, but shmem turned out better when the size increased.\n        - CuBLAS excelled when the sizes became large enough.\n    - results tile/block size 32:\n        - naive and shmem were a lot faster on 2048, but slower on 1024. Somewhat surprising, since 32x32=1024.\n    - shmem nsight compute analysis:\n        - Close to zero bank conflicts.\n        - LSU bottleneck (load and store operations).\n    - cublas nsight compute analysis:\n        - Not bottlenecked in the same way as shmem by LSU.\n    - Profile summary:\n        - CuBLAS dimension\n        - CuBLAS uses far fewer cycles.\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmathiasotnes%2Fgemm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmathiasotnes%2Fgemm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmathiasotnes%2Fgemm/lists"}