https://github.com/gpu-mode/lectures
Material for gpu-mode lectures
- Host: GitHub
- URL: https://github.com/gpu-mode/lectures
- Owner: gpu-mode
- License: apache-2.0
- Created: 2024-01-20T19:28:02.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-09T17:36:02.000Z (2 months ago)
- Last Synced: 2025-04-06T13:01:28.667Z (8 days ago)
- Language: Jupyter Notebook
- Homepage: https://www.youtube.com/@GPUMODE
- Size: 102 MB
- Stars: 4,181
- Watchers: 62
- Forks: 421
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-cuda-and-hpc - gpu-mode/lectures: Material for gpu-mode lectures. [www.youtube.com/@GPUMODE](https://www.youtube.com/@GPUMODE) (Learning Resources)
README
# Supplementary Material for Lectures
[Discord](https://discord.gg/gpumode) | [YouTube Channel](https://www.youtube.com/@GPUMODE)
The PMPP Book: [Programming Massively Parallel Processors: A Hands-on Approach](https://a.co/d/2S2fVzt) (Amazon link)
## Lecture 1: Profiling and Integrating CUDA kernels in PyTorch
- Speaker: [Mark Saroufim](https://twitter.com/marksaroufim)
- Notebook and slides in [lecture_001](./lecture_001/) folder
## Lecture 2: Recap Ch. 1-3 from the PMPP book
- Speaker: [Andreas Koepf](https://twitter.com/neurosp1ke)
- Slides: The PowerPoint file [lecture_002/cuda_mode_lecture2.pptx](./lecture_002/cuda_mode_lecture2.pptx) is in the `lecture_002` folder of this repository. Alternatively, see the [Google Slides presentation](https://docs.google.com/presentation/d/1deqvEHdqEC4LHUpStO6z3TT77Dt84fNAvTIAxBJgDck/edit#slide=id.g2b1444253e5_1_75).
## Lecture 3: Getting Started With CUDA
- Speaker: [Jeremy Howard](https://twitter.com/jeremyphoward)
- Notebook: See the [lecture_003](./lecture_003/) folder, or run the [Colab version](https://colab.research.google.com/drive/180uk6frvMBeT4tywhhYXmz3PJaCIA_uk?usp=sharing)
## Lecture 4: Intro to Compute and Memory Architecture
- Speaker: [Thomas Viehmann](https://lernapparat.de/)
- Notebook and slides in the [lecture_004](./lecture_004/) folder.
## Lecture 5: Going Further with CUDA for Python Programmers
- Speaker: [Jeremy Howard](https://twitter.com/jeremyphoward)
- Notebook in the [lecture_005](./lecture_005/) folder.
## Lecture 6: Optimizing PyTorch Optimizers
- Speaker: [Jane Xu](https://github.com/janeyx99)
- [Slides](https://docs.google.com/presentation/d/13WLCuxXzwu5JRZo0tAfW0hbKHQMvFw4O/edit#slide=id.p1)
## Lecture 7: Advanced Quantization
- Speaker: [Charles Hernandez](https://github.com/HDCharles)
- [Slides](https://www.dropbox.com/scl/fi/hzfx1l267m8gwyhcjvfk4/Quantization-Cuda-vs-Triton.pdf?rlkey=s4j64ivi2kpp2l0uq8xjdwbab&dl=0)
## Lecture 8: CUDA Performance Checklist
- Speaker: [Mark Saroufim](https://github.com/msaroufim)
- Code in the [lecture_008](./lecture_008/) folder
- [Slides](https://docs.google.com/presentation/d/1cvVpf3ChFFiY4Kf25S4e4sPY6Y5uRUO-X-A4nJ7IhFE/edit?usp=sharing)
## Lecture 9: Reductions
- Speaker: [Mark Saroufim](https://github.com/msaroufim)
- Code in the [lecture_009](./lecture_009/) folder
- [Slides](https://docs.google.com/presentation/d/1s8lRU8xuDn-R05p1aSP6P7T5kk9VYnDOCyN5bWKeg3U/edit?usp=drive_link)
## Lecture 10: Build a Prod Ready CUDA Library
- Speaker: [Oscar Amoros Huguet](https://github.com/morousg)
- [Slides](https://drive.google.com/drive/folders/158V8BzGj-IkdXXDAdHPNwUzDLNmr971_?usp=drive_link)
## Lecture 11: Sparsity
- Speaker: [Jesse Cai](https://github.com/jcaip)
- [Slides](./lecture_011/sparsity.pptx)
## Lecture 12: Flash Attention
- Speaker: [Thomas Viehmann](https://lernapparat.de/)
## Lecture 13: Ring Attention
- Speaker: [Andreas Koepf](https://twitter.com/neurosp1ke)
- [Slides](./lecture_013/ring_attention.pptx)
## Lecture 14: Practitioner's Guide to Triton
- Date: 2024-04-13, Speaker: [Umer Adil](https://twitter.com/UmerHAdil)
- [Notebook](./lecture_014/A_Practitioners_Guide_to_Triton.ipynb)
## Lecture 15: CUTLASS
- Speaker: [Eric Auld](https://github.com/ericauld)
## Lecture 16: Hands-On Profiling
- Speaker: [Taylor Robie](https://www.linkedin.com/in/taylor-robie/)
## Bonus Lecture: CUDA C++ llm.cpp
- Speaker: Jake Hemstad & Georgii Evtushenko
- [Slides](https://drive.google.com/drive/folders/1T-t0d_u0Xu8w_-1E5kAwmXNfF72x-HTA)
## Lecture 17: GPU Collective Communication (NCCL)
- Speaker: [Dan Johnson](https://physbam.stanford.edu/~dansj/)
- Code in the [lecture_017](./lecture_017/) folder
## Lecture 18: Fused Kernels
- Speaker: [Kapil Sharma](https://www.kapilsharma.dev/)
- Code in the [lecture_018](./lecture_018/) folder
## Lecture 19: Data Processing on GPUs
- Speaker: [Devavret Makkar](https://github.com/devavret)
## Lecture 20: Scan Algorithm
- Speaker: [Izzat El Hajj](https://ielhajj.github.io/)
- [Slides](https://docs.google.com/presentation/d/1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)
## Lecture 21: Scan Algorithm Part 2
- Speaker: [Izzat El Hajj](https://ielhajj.github.io/)
- [Slides](https://docs.google.com/presentation/d/1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)
## Lecture 22: Hacker's Guide to Speculative Decoding in vLLM
- Speaker: [Cade Daniel](https://x.com/cdnamz)
- [Slides](https://docs.google.com/presentation/d/1p1xE-EbSAnXpTSiSI0gmy_wdwxN5XaULO3AnCWWoRe4/edit#slide=id.p)
## Lecture 23: Tensor Cores
- Speaker: Vijay Thakkar & Pradeep Ramani
- [Slides](https://drive.google.com/file/d/18sthk6IUOKbdtFphpm_jZNXoJenbWR8m/view)
## Lecture 24: Scan at the Speed of Light
- Speaker: Jake Hemstad & Georgii Evtushenko
## Lecture 25: Speaking Composable Kernel
- Speaker: Haocong Wang
- [Slides](./lecture_025/AMD_ROCm_Speaking_Composable_Kernel_July_20_2024.pdf)
## Lecture 26: SYCL MODE (Intel GPU)
- Speaker: Patric Zhao
- [Slides](https://docs.google.com/presentation/d/1SW4XKomAJhhJSH5-jpZI9Qlwp7TEunbV/edit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)
## Lecture 27: gpu.cpp
- Speaker: [Austin Huang](https://x.com/austinvhuang)
- [Slides](https://gpucpp-presentation.answer.ai/)
## Lecture 28: Liger Kernel
- Speaker: [Byron Hsu](https://x.com/hsu_byron)
- [Slides](https://docs.google.com/presentation/d/1CGTV-uKw9crrBo13q1jAzAFCFzlpZFjeL4bnK67pTd8/edit?usp=sharing)
- Hands-on Notebooks
  1. [RMSNorm: Verifying Correctness and Performance](https://colab.research.google.com/drive/1CQYhul7MVG5F0gmqTBbx1O1HgolPgF0M?usp=sharing)
  2. [FusedLinearCrossEntropy: Verifying Memory Reduction](https://colab.research.google.com/drive/1Z2QtvaIiLm5MWOs7X6ZPS1MN3hcIJFbj?usp=sharing)
  3. [Convergence Comparison: Triton Kernel Patched vs. Original Model Layer-by-Layer](https://colab.research.google.com/drive/1e52FH0BcE739GZaVp-3_Dv7mc4jF1aif?usp=sharing)
  4. [Contiguity is the hidden killer](https://colab.research.google.com/drive/1llnAdo0hc9FpxYRRnjih0l066NCp7Ylu?usp=sharing)
  5. [Address int32 overflow](https://colab.research.google.com/drive/1WgaU_cmaxVzx8PcdKB5P9yHB6_WyGd4T?usp=sharing)
## Lecture 29: Triton Internals
- Speaker: [Kapil Sharma](https://www.kapilsharma.dev/)
- Code/presentation in the [lecture_029](./lecture_029/) folder
## Lecture 30: Quantized training
- Speaker: [Thien Tran](https://github.com/gau-nernst)
- Code/presentation in the [lecture_030](./lecture_030/) folder
## Lecture 31: Beginner's Guide to Metal Kernels
- Speaker: [Nikita Shulga](https://github.com/malfet)
- Code/presentation in the [lecture_031](./lecture_031/) folder
## Lecture 32: Unsloth - LLM Systems Engineering
- Speaker: [Daniel Han](https://x.com/danielhanchen)
- [Slides](https://docs.google.com/presentation/d/1BvgbDwvOY6Uy6jMuNXrmrz_6Km_CBW0f2espqeQaWfc/edit?usp=sharing)
## Lecture 33: BitBLAS
- Speaker: [Wang Lei](https://github.com/LeiWang1999)
- Code/presentation in the [lecture_033](./lecture_033/) folder
## Lecture 34: Low Bit Triton Kernels
- Speaker: [Hicham Badri](https://github.com/mobicham)
- [Slides](https://docs.google.com/presentation/d/1R9B6RLOlAblyVVFPk9FtAq6MXR1ufj1NaT0bjjib7Vc/edit)
## Lecture 35: SGLang Performance Optimization
- Speaker: [Yineng Zhang](https://linkedin.com/in/zhyncs)
- [Slides](https://github.com/zhyncs/lectures/blob/main/lecture_035/SGLang-Performance-Optimization-YinengZhang.pdf)
## Lecture 36: CUTLASS and Flash Attention 3
- Speaker: [Jay Shah](https://research.colfax-intl.com/blog/)
- [Slides](lecture_036/)
## Lecture 37: Introduction to SASS & GPU Microarchitecture
- Speaker: [Arun Demeure](https://github.com/ademeure)
- [Slides](lecture_037/)
## Lecture 38: Lowbit kernels for ARM CPU
- Speaker: [Scott Roy](https://github.com/metascroy)
- [Slides](lecture_038/)