{"id":25411353,"url":"https://github.com/aryagxr/cuda","last_synced_at":"2025-10-05T17:55:28.570Z","repository":{"id":274908790,"uuid":"924459213","full_name":"aryagxr/cuda","owner":"aryagxr","description":"100 Days of CUDA!!!","archived":false,"fork":false,"pushed_at":"2025-04-14T01:17:44.000Z","size":123,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-14T10:53:50.390Z","etag":null,"topics":["cuda","gpu-programming","kernels","parallel-programming"],"latest_commit_sha":null,"homepage":"","language":"Cuda","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aryagxr.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-30T03:27:43.000Z","updated_at":"2025-04-14T01:17:47.000Z","dependencies_parsed_at":"2025-03-19T02:33:38.614Z","dependency_job_id":"7404759c-c1ba-4a2c-a9f2-c8f110413ab7","html_url":"https://github.com/aryagxr/cuda","commit_stats":null,"previous_names":["aryagxr/cuda"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aryagxr/cuda","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryagxr%2Fcuda","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryagxr%2Fcuda/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryagxr%2Fcuda/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryagxr%2Fcuda/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aryagxr","download_url":"https://codeload.github.com/aryagxr/cuda/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aryagxr%2Fcuda/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273213953,"owners_count":25065059,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-01T02:00:09.058Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cuda","gpu-programming","kernels","parallel-programming"],"created_at":"2025-02-16T10:17:14.120Z","updated_at":"2025-10-05T17:55:23.530Z","avatar_url":"https://github.com/aryagxr.png","language":"Cuda","readme":"**CUDA Progress**\n\n| **Day**    | **Code Summary**                                                   |\n|------------|--------------------------------------------------------------------|\n| Day 1      |  CUDA set up and kernel that prints \"Hello World\"                  |\n| Day 2      |  CUDA kernel that adds two vectors   
| Day 3      | Adding matrices                                                      |
| Day 4      | Vector addition using cuBLAS                                         |
| Day 5      | Naive matmul                                                         |
| Day 6      | Tiled matmul using shared memory (sketch below)                      |
| Day 7      | Naive 1D convolution with boundary checks (sketch below)             |
| Day 8      | Matrix multiplication using cuBLAS                                   |
| Day 9      | Matrix transpose                                                     |
| Day 10 🥳  | Naive softmax (sketch below)                                         |
| Day 11     | Softmax using shared memory and reductions                           |
| Day 12     | Softmax using warp shuffle functions (warp-reduction sketch below)   |
| Day 13     | 1D complex-to-complex Fourier transform using cuFFT                  |
| Day 14     | Naive layer normalization (sketch below)                             |
| Day 15     | Optimizing layer norm using shared memory                            |
| Day 16     | Optimizing layer norm using warp shuffle functions                   |
| Day 17     | Optimizing layer norm using vectorized loads                         |
| Day 18     | Tiled 1D convolution with halo cells                                 |
| Day 19     | 1D convolution using the L2 cache                                    |
| Day 20 🥳  | [Blog Post: Optimizing Layer Normalization with CUDA](https://aryagxr.com/blogs/cuda-optimizing-layernorm) |
| Day 21     | Simple self-attention                                                |
| Day 22     | Optimizing self-attention                                            |
| Day 23     | Causal attention with masking                                        |
| Day 24     | Causal attention + Torch binding                                     |
| Day 25     | Multi-head attention                                                 |
| Day 26     | Parallel add (prefix sum) using the Kogge-Stone algorithm (sketch below) |
| Day 27     | MHA debugging                                                        |
| Day 28     | Flash Attention 1 (algorithm 1) forward pass                         |
| Day 29     | Flash Attention 1 (algorithm 1) forward pass, continued              |
| Day 30 🥳  | Flash Attention 1 (algorithm 1) forward pass                         |
| Day 31     | HGEMV matvec using FP16                                              |
| Day 32     | HGEMV matvec using bfloat16                                          |
| Day 33     | Matmul using Tensor Cores                                            |
| Day 34     | Swizzle patterns on matrix transpose                                 |
| Day 35     | Swizzled matrix transpose using the Tensor Memory Accelerator (TMA)  |
| Day 36     | Brent-Kung parallel scan algorithm                                   |
| Day 37     | Matvec using integer fixed-point arithmetic                          |
| Day 38     | Transferred a 1D array gmem -> smem -> gmem using TMA                |
| Day 39     | Memory-coalesced layer norm + revisited Flash Attention              |
| Day 40 🥳  | Revisited Flash Attention 1                                          |
| Day 41     | Flash Attention 1                                                    |
| Day 42     | Flash Attention 1                                                    |
| Day 43     | ReLU activation: FP32, FP32x4, FP16, FP16x2 vectorized (FP32x4 sketch below) |
| Day 44     | Overlapping data transfers using CUDA streams (vector add)           |
| Day 45     | ReLU using CUDA streams, benchmarked                                  |
| Day 46     | Packed 128-bit ReLU FP16x8 kernel                                     |
| Day 47     | Sparse matrix-vector mul (SpMV)                                       |
| Day 48     | Sparse padded matrix-vector mul                                       |
| Day 49     | RoPE kernel: rotary position embedding, naive FP32                    |
| Day 50 🥳  | Optimized RoPE using vectorized loads and half precision (18x speedup)|
| Day 51     | Flash Attention 2 forward pass                                        |
| Day 52     | Flash Attention 2 forward pass                                        |
| Day 53     | Flash Attention 2 forward pass                                        |
| Day 54     | Gaussian elimination                                                  |
| Day 55     | PTX vector-add kernel                                                 |
| Day 56     | GELU activation, naive FP32 kernel                                    |
| Day 57     | GELU activation, vectorized                                           |
| Day 58     | Backward-pass kernel for the ReLU activation                          |
| Day 59     | Backward-pass kernel for the GELU activation                          |
| Day 60 🥳  | LeetGPU challenge: reduction                                          |
| Day 61     | Optimized + benchmarked GELU kernels                                  |
| Day 62     | Micrograd in CUDA                                                     |
| Day 63     | Micrograd in CUDA                                                     |
| Day 64     | Micrograd in CUDA                                                     |
| Day 65     | Micrograd in CUDA                                                     |
| Day 66     | Optimized sigmoid activation                                          |
| Day 67 - Day 70 🥳 | Micrograd in CUDA                                             |
| Day 71     | Sigmoid with half precision                                           |
| Day 72     | Sigmoid with FP16, vectorized                                         |
| Day 73     | Swish kernel                                                          |
| Day 74     | Swish kernel, vectorized                                              |
| Day 75     | AMD HIP kernel intro + vector-add kernel                              |
| Day 76     | Revisiting GEMM optimizations                                         |
| Day 77     | GEMM, coalesced                                                       |
| Day 78     | FP16 Swish                                                            |
| Day 79     | AMD competition FP8 GEMM & Swish optimizations                        |
| Day 80 🥳  | AMD competition FP8 GEMM optimizations                                |
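
**Kernel sketches**

The table above is only a log, so a few of the recurring kernels are sketched here for reference. These are minimal, illustrative versions written for this README, not copies of the day-by-day code; variable names, block sizes, and layout assumptions are illustrative choices.

Day 2's vector add is the canonical starting point: one thread per output element, with a bounds check because the grid may be larger than the array.

```cuda
#include <cuda_runtime.h>

// Minimal vector-add sketch: each thread computes one element of c = a + b.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {                                    // guard: grid may overshoot n
        c[i] = a[i] + b[i];
    }
}

// Example launch: 256 threads per block, enough blocks to cover n elements.
// vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```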
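
Day 6's tiled matmul stages TILE x TILE blocks of A and B through shared memory so that every value loaded from global memory is reused TILE times. The sketch assumes square row-major N x N matrices with N a multiple of TILE, launched with `dim3 block(TILE, TILE)` and `dim3 grid(N / TILE, N / TILE)`.

```cuda
#include <cuda_runtime.h>

#define TILE 16

// Shared-memory tiled matmul sketch: C = A * B for N x N row-major matrices.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Each thread stages one element of the current A and B tiles.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // wait before the tiles are overwritten
    }
    C[row * N + col] = acc;
}
```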
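
Day 7's naive 1D convolution applies a (2r+1)-tap filter per output element and treats out-of-range neighbours as zero; the boundary check is the `if` inside the loop.

```cuda
#include <cuda_runtime.h>

// Naive 1D convolution sketch with boundary checks; the filter has 2*r + 1 taps.
__global__ void conv1d_naive(const float* in, const float* filt, float* out,
                             int n, int r) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float acc = 0.0f;
    for (int j = -r; j <= r; ++j) {
        int idx = i + j;
        if (idx >= 0 && idx < n)       // boundary check: ghost cells contribute zero
            acc += in[idx] * filt[j + r];
    }
    out[i] = acc;
}
```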
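
The Day 10 naive softmax gives each row of a row-major `rows x cols` matrix to a single thread and uses the usual max-subtraction for numerical stability; Days 11 and 12 then move the max/sum reductions into shared memory and warp shuffles.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Naive row-wise softmax sketch: one thread per row, max-subtracted for stability.
__global__ void softmax_naive(const float* in, float* out, int rows, int cols) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const float* x = in + row * cols;
    float* y = out + row * cols;

    float m = -INFINITY;                               // row maximum
    for (int j = 0; j < cols; ++j) m = fmaxf(m, x[j]);

    float sum = 0.0f;                                  // normalizer
    for (int j = 0; j < cols; ++j) {
        y[j] = expf(x[j] - m);
        sum += y[j];
    }
    for (int j = 0; j < cols; ++j) y[j] /= sum;
}
```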
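
The warp-shuffle versions of softmax and layer norm (Days 12 and 16) are built on warp-level reductions: `__shfl_down_sync` folds the upper half of the active lanes onto the lower half at each step, so a 32-lane sum or max finishes in five steps with no shared memory. A generic pair of helpers, written as an illustration rather than the repo's own code, looks like this.

```cuda
#include <cuda_runtime.h>

// Warp-wide sum: after the loop, lane 0 holds the sum over all 32 lanes.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Warp-wide max: same pattern with fmaxf instead of addition.
__device__ float warp_reduce_max(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val = fmaxf(val, __shfl_down_sync(0xffffffff, val, offset));
    return val;
}
```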
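
Day 14's naive layer norm mirrors the naive softmax: one thread per row computes the mean and variance serially, then applies the scale-and-shift. `gamma`, `beta`, and `eps` follow the usual layer-norm convention; all names are illustrative.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Naive layer-norm sketch: one thread normalizes one row of a rows x cols matrix.
__global__ void layernorm_naive(const float* in, float* out,
                                const float* gamma, const float* beta,
                                int rows, int cols, float eps) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    const float* x = in + row * cols;
    float* y = out + row * cols;

    float mean = 0.0f;                 // serial mean over the row
    for (int j = 0; j < cols; ++j) mean += x[j];
    mean /= cols;

    float var = 0.0f;                  // serial variance over the row
    for (int j = 0; j < cols; ++j) {
        float d = x[j] - mean;
        var += d * d;
    }
    var /= cols;

    float inv_std = rsqrtf(var + eps);
    for (int j = 0; j < cols; ++j)
        y[j] = (x[j] - mean) * inv_std * gamma[j] + beta[j];
}
```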
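
Day 26's Kogge-Stone scan computes an inclusive prefix sum: at step `stride`, every element adds the value `stride` positions to its left, doubling the stride each iteration. The sketch scans one block's worth of data in shared memory; for a full-array scan the per-block results would still need to be combined.

```cuda
#include <cuda_runtime.h>

#define BLOCK 256  // assumed equal to blockDim.x

// Kogge-Stone inclusive scan sketch over one block of input.
__global__ void kogge_stone_scan(const float* in, float* out, int n) {
    __shared__ float buf[BLOCK];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        float v = 0.0f;
        if (tid >= stride) v = buf[tid - stride];  // read the left neighbour first...
        __syncthreads();                           // ...so no write races the read
        if (tid >= stride) buf[tid] += v;
        __syncthreads();
    }

    if (gid < n) out[gid] = buf[tid];              // per-block inclusive prefix sums
}
```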
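
The FP32x4 variant from Day 43 is a straightforward vectorization: each thread loads one `float4` (a 128-bit load), clamps the four lanes at zero, and stores the result. The sketch assumes the element count is a multiple of 4 and the pointers are 16-byte aligned (true for `cudaMalloc` allocations).

```cuda
#include <cuda_runtime.h>

// Vectorized ReLU sketch: one float4 (four elements) per thread.
__global__ void relu_fp32x4(const float4* in, float4* out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // index in units of float4
    if (i >= n4) return;

    float4 v = in[i];
    v.x = fmaxf(v.x, 0.0f);
    v.y = fmaxf(v.y, 0.0f);
    v.z = fmaxf(v.z, 0.0f);
    v.w = fmaxf(v.w, 0.0f);
    out[i] = v;
}

// Example launch for n elements (n divisible by 4):
// relu_fp32x4<<<(n / 4 + 255) / 256, 256>>>(d_in4, d_out4, n / 4);
```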