https://github.com/aryagxr/cuda

100 Days of CUDA!!!
https://github.com/aryagxr/cuda

cuda gpu-programming kernels parallel-programming

Last synced: 9 months ago
JSON representation

100 Days of CUDA!!!

Host: GitHub
URL: https://github.com/aryagxr/cuda
Owner: aryagxr
License: mit
Created: 2025-01-30T03:27:43.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-14T01:17:44.000Z (about 1 year ago)
Last Synced: 2025-04-14T10:53:50.390Z (about 1 year ago)
Topics: cuda, gpu-programming, kernels, parallel-programming
Language: Cuda
Homepage:
Size: 120 KB
Stars: 5
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: readme.md
- License: LICENSE

Awesome Lists containing this project

README

          **CUDA Progress**

| **Day**    | **Code Summary**                                                   |

|------------|--------------------------------------------------------------------|

| Day 1      |  CUDA set up and kernel that prints "Hello World"                  |

| Day 2      |  CUDA kernel that adds two vectors                                 |

| Day 3      |  Adding matrices                                                   |

| Day 4      |  Vector addition using cuBLAS                                      |

| Day 5      |  Naive matmul                                                      |

| Day 6      |  Tiled matmul using shared memory                                  |

| Day 7      |  Naive 1D convolution with boundary checks                         |

| Day 8      |  Matrix multiplication using cuBLAS                                |

| Day 9      |  Matrix Transpose                                                  |

| Day 10 🥳  |  Naive Softmax                                                     |

| Day 11     |  Softmax using shared memory and reductions                        |

| Day 12     |  Softmax using warp shuffle functions                              |

| Day 13     |  1D complex-to-complex fourier transform using cuFFT               |

| Day 14     |  Naive layer normalization                                         |

| Day 15     |  Optimizing layer norm using shared memory                         |

| Day 16     |  Optimizing layer norm using warp shuffle functions                |

| Day 17     |  Optimizing layer norm using vectorized loads                      |

| Day 18     |  Tiled 1D convolution and halo cells                               |

| Day 19     |  1D convolution using L2 cache                                     |

| Day 20 🥳  |  [Blog Post: Optimizing Layer Normalization with CUDA](https://aryagxr.com/blogs/cuda-optimizing-layernorm) |

| Day 21     |  Simple self attention                                             |

| Day 22     |  Optimizing self attention                                         |

| Day 23     |  Causal attention with masking                                     |

| Day 24     |  Causal attention + torch binding                                  |

| Day 25     |  Multi-head attention                                              |

| Day 26     |  Parallel add using koggle stone algorithm                         |

| Day 27     |  MHA debug                                                         |

| Day 28     |  Flash Attention 1 (algorithm 1) Forward pass                      |

| Day 29     |  Flash Attention 1 (algorithm 1) Forward pass continued            |

| Day 30 🥳  |  Flash Attention 1 (algorithm 1) Forward pass                      |

| Day 31     |  HGEMV matvec using fp16                                           |

| Day 32     |  HGEMV matvec using Bfloat16                                       |

| Day 33     |  Matmul using Tensor cores                                         |

| Day 34     |  Swizzle patterns on matrix transpose                              |

| Day 35     |  Swizzled matrix transpose using Tensor Memory Accelerators        |

| Day 36     |  Brent Kung Parallel Scan algorithm                                |

| Day 37     |  Matvec using integer fixed point arithmetic                       |

| Day 38     |  Transfered 1D array from gmem->smem->gmem using TMA               |

| Day 39     |  Memory Coalesced layernorm + revisited Flash attention            |

| Day 40 🥳  |  revisited Flash Attention 1                                       |

| Day 41     |  Flash Attention 1                                                 |

| Day 42     |  Flash Attention 1                                                 |

| Day 43     |  ReLU Activation - FP32, FP32x4, FP16, FP16x2 vectorized           |

| Day 44     |  Overlapping data transfers using CUDA Streams (Vector add)        |

| Day 45     |  ReLU using CUDA Streams + benchmarked                             |

| Day 46     |  Packed 128 bit ReLU FP16x8 kernel                                 |

| Day 47     |  Sparse matrix-vector mul (spMV)                                   |

| Day 48     |  Sparse padded matrix-vector mul                                   |

| Day 49     |  RoPE Kernel: Rotary Position Embedding naive fp32                 |

| Day 50 🥳  |  Optimized RoPE using vectorized loads and half precision (18x)    |

| Day 51     |  Flash Attention 2 Forward                                         |

| Day 52     |  Flash Attention 2 Forward                                         |

| Day 53     |  Flash Attention 2 Forward                                         |

| Day 54     |  Gaussian Elimination                                              |

| Day 55     |  PTX vector add kernel                                             |

| Day 56     |  GELU activatation naive fp32 kernel                               |

| Day 57     |  GELU activation vectorized                                        |

| Day 58     |  Backward pass kernel for Relu activation                          |

| Day 59     |  Backward pass kernel for GELU activation                          |

| Day 60 🥳  |  LeetGPU challenge - reduction                                     |

| Day 61     |  Optimize + benchmarked gelu kernels                               |

| Day 62     |  Micrograd in CUDA                                                 |

| Day 63     |  Micrograd in CUDA                                                 |

| Day 64     |  Micrograd in CUDA                                                 |

| Day 65     |  Micrograd in CUDA                                                 |

| Day 66     |  Optimized Sigmoid activation                                      |

| Day 67 - Day 70  🥳  |  Micrograd in CUDA                                       |

| Day 71     |  Sigmoid with half precision                                       |

| Day 72     |  Sigmoid with fp16 vectorized                                      |

| Day 73     |  Swish kernel                                                      |

| Day 74     |  Swish kernel vectorized                                           |

| Day 75     |  AMD hip kernel intro + vector add kernel                          |

| Day 76     |  Revisiting gemm optimizations                                     |

| Day 77     |  Gemm coalesced                                                    |

| Day 78     |  fp16 swish                                                        |

| Day 79     |  AMD competition fp8 gemm & swish optimizations                    |

| Day 80 🥳  |  AMD competition fp8 gemm optimizations                            |

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/aryagxr/cuda

Awesome Lists containing this project

README