https://github.com/parxd/cuda-optim
optimizing CUDA kernels
cuda machine-learning
Last synced: 14 days ago
- Host: GitHub
- URL: https://github.com/parxd/cuda-optim
- Owner: Parxd
- Created: 2024-12-28T19:00:16.000Z (about 2 months ago)
- Default Branch: main
- Last Pushed: 2025-01-28T01:50:35.000Z (16 days ago)
- Last Synced: 2025-01-28T02:36:12.073Z (16 days ago)
- Topics: cuda, machine-learning
- Language: C++
- Homepage:
- Size: 1.17 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# CUDA Kernels for ML
Currently a work-in-progress repo containing CUDA kernels optimized for various common machine learning algorithms.

## SGEMM (single-precision general matrix-multiplication)
Kernel source code lives under `src/gemm/kernels`.

Kernels 0 through 4 are written in pure CUDA C++, but later kernels use either NVIDIA's CUTLASS/CuTe or cuBLAS libraries, as copying/tiling/MMA indices quickly become too complex to track manually.
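
For checking any of these kernels, a plain CPU reference can be handy. The sketch below is not part of this repo: it assumes row-major storage and the standard BLAS definition `C = alpha * A * B + beta * C`, and the names `sgemm_reference` and `max_abs_diff` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Reference SGEMM on the CPU: C = alpha * A * B + beta * C.
// Assumption: row-major storage; A is M x K, B is K x N, C is M x N.
void sgemm_reference(int M, int N, int K, float alpha,
                     const std::vector<float>& A,
                     const std::vector<float>& B,
                     float beta, std::vector<float>& C) {
    assert(A.size() == static_cast<std::size_t>(M) * K);
    assert(B.size() == static_cast<std::size_t>(K) * N);
    assert(C.size() == static_cast<std::size_t>(M) * N);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;  // dot product of row i of A and column j of B
            for (int k = 0; k < K; ++k) {
                acc += A[i * K + k] * B[k * N + j];
            }
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
        }
    }
}

// Largest absolute element-wise difference between two equally sized buffers.
float max_abs_diff(const std::vector<float>& X, const std::vector<float>& Y) {
    assert(X.size() == Y.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < X.size(); ++i) {
        worst = std::max(worst, std::fabs(X[i] - Y[i]));
    }
    return worst;
}
```

A GPU result copied back to the host could then be compared against the reference with `max_abs_diff` and a small tolerance, since the accumulation order differs between CPU and GPU.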
*Note: Kernels 0 through 4 have tons of hard-coded values, especially for kernel launch parameters. These kernels were mostly for demonstration and for learning good general GEMM concepts (block/thread-tiling). As a result, there is also no bounds checking for kernels 2 through 4, so there are no guarantees of correctness when using non-square `M`, `N`, `K` dimensions that aren't multiples of `64` (i.e. 64, 128, 192, etc.).*
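
Since kernels 2 through 4 do no bounds checking, a host-side guard before launching them is one way to avoid silent out-of-bounds accesses. This is only a sketch: `dims_supported` and `is_multiple_of` are hypothetical helpers that do not exist in this repo, and the `require_square` flag reflects one reading of the note above (square problems whose side is a multiple of `64`).

```cpp
#include <cassert>

// True when value is a positive multiple of tile.
bool is_multiple_of(int value, int tile) {
    assert(tile > 0);
    return value > 0 && value % tile == 0;
}

// Hypothetical pre-launch check for the kernels without bounds checking.
// Assumption: every dimension must be a multiple of `tile` (64 in the note),
// and, if require_square is set, M, N and K must also be equal.
bool dims_supported(int M, int N, int K, int tile = 64,
                    bool require_square = true) {
    const bool tiled = is_multiple_of(M, tile) &&
                       is_multiple_of(N, tile) &&
                       is_multiple_of(K, tile);
    const bool square = (M == N) && (N == K);
    return tiled && (square || !require_square);
}
```

For example, `dims_supported(512, 512, 512)` passes, while `dims_supported(500, 500, 500)` does not because 500 is not a multiple of 64.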
The following performance tests were run on my RTX 3070 Mobile for `M = N = K = 512`.
- Kernel 0: [Naive](src/gemm/kernel/0_naive.cuh)
  - Time:
  - GFLOPS:
- Kernel 1: [SMEM Blocktiling](src/gemm/kernel/1_shared_mem.cuh)
  - Time:
  - GFLOPS:
- Kernel 2: [SMEM Blocktiling + 1D Threadtiling](src/gemm/kernel/2_onedim_blocktile.cuh)
  - Time:
  - GFLOPS:
- Kernel 3: [SMEM Blocktiling + 2D Threadtiling](src/gemm/kernel/3_twodim_blocktile.cuh)
  - Time:
  - GFLOPS:
- Kernel 4: [SMEM Blocktiling + 2D Threadtiling + Vectorized Transactions](src/gemm/kernel/4_twodim_blocktile_vectorized.cuh)
  - Time:
  - GFLOPS:
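
The Time and GFLOPS entries are related by the GEMM operation count. The sketch below assumes the conventional count of `2 * M * N * K` floating-point operations (one multiply and one add per inner-product term) and a time measured in milliseconds. `gemm_flops` and `gemm_gflops` are hypothetical helper names, and this may not be exactly how the numbers in this repo are computed.

```cpp
#include <cassert>

// Floating-point operations in one GEMM: each of the M * N outputs
// needs K multiplies and K adds.
double gemm_flops(int M, int N, int K) {
    return 2.0 * M * N * K;  // promoted to double first, so no int overflow
}

// Throughput in GFLOPS for a kernel that took elapsed_ms milliseconds.
double gemm_gflops(int M, int N, int K, double elapsed_ms) {
    assert(elapsed_ms > 0.0);
    const double seconds = elapsed_ms * 1e-3;
    return gemm_flops(M, N, K) / seconds / 1e9;
}
```

For `M = N = K = 512` the count is 268,435,456 operations, so a purely hypothetical 1 ms run would correspond to roughly 268 GFLOPS.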