{"id":24827981,"url":"https://github.com/parxd/cuda-optim","last_synced_at":"2025-03-26T01:17:07.542Z","repository":{"id":270274749,"uuid":"909457163","full_name":"Parxd/cuda-optim","owner":"Parxd","description":"optimizing CUDA kernels","archived":false,"fork":false,"pushed_at":"2025-02-17T20:26:48.000Z","size":1236,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-17T21:29:29.516Z","etag":null,"topics":["cuda","machine-learning"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Parxd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-28T19:00:16.000Z","updated_at":"2025-02-17T20:26:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"19a36641-c0b5-4309-8542-2ce6593a912c","html_url":"https://github.com/Parxd/cuda-optim","commit_stats":null,"previous_names":["parxd/ml-cuda-kernels","parxd/cuda-optim"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Parxd%2Fcuda-optim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Parxd%2Fcuda-optim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Parxd%2Fcuda-optim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Parxd%2Fcuda-optim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Parxd","download_url":"https://codeload.github.com/Parxd/cuda-optim/tar.gz/
refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245568570,"owners_count":20636803,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","machine-learning"],"created_at":"2025-01-30T22:24:31.349Z","updated_at":"2025-03-26T01:17:07.531Z","avatar_url":"https://github.com/Parxd.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Optimizing CUDA Kernels\nA work-in-progress repo containing CUDA kernels optimized for various common machine learning algorithms.\n\n## SGEMM (single-precision general matrix-multiplication)\nKernel source code lives under `src/gemm/kernels`\n\nKernels 0 through 5 are written in pure CUDA C++, but later kernels use either NVIDIA's CUTLASS/CuTe or cuBLAS libraries, as copying/tiling/MMA indices quickly become too complex to track manually.\n\n*Note: Kernels 0 through 4 have lots of hard-coded kernel launch parameters and messy code, as they were mostly for demonstration and for learning general GEMM concepts (block/thread-tiling). As a result, there is also no bounds checking for kernels 2 through 4, so there are no guarantees of correctness when using non-square `M`, `N`, `K` dimensions that aren't multiples of `64` (i.e. 
not 64, 128, 192, etc.).*\n\nKernels 5 and above have far cleaner (and more performant) code.\nThe general approach used for these kernels (CTA-tiling, warp-tiling, thread-tiling) can be visualized with this diagram from NVIDIA's CUTLASS...\n![img](res/3.png)\n\nThe following performance tests were run on my RTX 3070 Mobile for `M = N = K = 512`.\n\n- Kernel 0: [Naive](src/gemm/kernel/0_naive.cuh)\n    - Time:\n    - GFLOPs:\n\n- Kernel 1: [SMEM Blocktiling](src/gemm/kernel/1_shared_mem.cuh)\n    - Time:\n    - GFLOPs:\n\n- Kernel 2: [SMEM Blocktiling + 1D Threadtiling](src/gemm/kernel/2_onedim_blocktile.cuh)\n    - Time:\n    - GFLOPs:\n\n- Kernel 3: [SMEM Blocktiling + 2D Threadtiling](src/gemm/kernel/3_twodim_blocktile.cuh)\n    - Time:\n    - GFLOPs:\n\n- Kernel 4: [SMEM Blocktiling + Threadtiling + Vectorized Transactions](src/gemm/kernel/4_twodim_blocktile_vectorized.cuh)\n    - Time:\n    - GFLOPs:\n\n- Kernel 5: [SMEM Blocktiling + Warptiling + Threadtiling + Vectorized Transactions](src/gemm/kernel/5_warptile.cuh)\n    - Time:\n    - GFLOPs:","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparxd%2Fcuda-optim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fparxd%2Fcuda-optim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fparxd%2Fcuda-optim/lists"}