{"id":13562517,"url":"https://github.com/gpu-mode/resource-stream","last_synced_at":"2025-05-14T20:06:02.520Z","repository":{"id":214334338,"uuid":"736261057","full_name":"gpu-mode/resource-stream","owner":"gpu-mode","description":"GPU programming related news and material links","archived":false,"fork":false,"pushed_at":"2025-01-06T10:23:10.000Z","size":118,"stargazers_count":1436,"open_issues_count":0,"forks_count":84,"subscribers_count":47,"default_branch":"main","last_synced_at":"2025-04-02T02:11:48.595Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://discord.gg/gpumode","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gpu-mode.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-27T12:21:19.000Z","updated_at":"2025-04-01T07:02:06.000Z","dependencies_parsed_at":"2024-03-06T07:30:38.740Z","dependency_job_id":"b83777ff-2f5b-4824-becf-9748f737a9ca","html_url":"https://github.com/gpu-mode/resource-stream","commit_stats":null,"previous_names":["cuda-mode/resource-stream"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpu-mode%2Fresource-stream","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpu-mode%2Fresource-stream/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpu-mode%2Fresource-stream/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gpu-mode%2Fresource-stream/manifests","owner_url":"https://repos.ecosyste.ms
/api/v1/hosts/GitHub/owners/gpu-mode","download_url":"https://codeload.github.com/gpu-mode/resource-stream/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247968373,"owners_count":21025823,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T13:01:09.474Z","updated_at":"2025-04-09T03:12:02.496Z","avatar_url":"https://github.com/gpu-mode.png","language":null,"funding_links":[],"categories":["Others","Learning Resources"],"sub_categories":[],"readme":"# GPU MODE Resource Stream\n[![](https://dcbadge.limes.pink/api/server/gpumode?style=flat)](https://discord.gg/gpumode)\n\n[https://discord.gg/gpumode](https://discord.gg/gpumode)\n\nHere you will find a collection of CUDA-related material (books, papers, blog posts, YouTube videos, tweets, implementations, etc.). We also collect information on higher-level tools for performance optimization and kernel development, like [Triton](https://triton-lang.org) and `torch.compile()` ... whatever makes the GPUs go brrrr.\n\nDo you know a great resource we should add? Please see [How to contribute](#how-to-contribute).\n\n\n## Lectures / Reading Group Live Sessions\n\nYou will find a list of upcoming lectures under Events in the channel list (sidebar) of our [Discord server](https://discord.gg/gpumode).\n\nRecordings of the weekly lectures are published on our [YouTube channel](https://www.youtube.com/@GPUMODE). 
Material (code, slides) for the individual lectures can be found in the [lectures](https://github.com/gpu-mode/lectures) repository.\n\n\n## 1st Contact with CUDA\n- [An Easy Introduction to CUDA C and C++](https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/)\n- [An Even Easier Introduction to CUDA](https://developer.nvidia.com/blog/even-easier-introduction-cuda/)\n- [CUDA Toolkit Documentation ](https://docs.nvidia.com/cuda/)\n- Basic terminology: Thread block, Warp, Streaming Multiprocessor: [Wiki: Thread Block](https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)), [A tour of CUDA](https://tbetcke.github.io/hpc_lecture_notes/cuda_introduction.html)\n- [GPU Performance Background User's Guide](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html)\n- [OLCF NVIDIA CUDA Training Series](https://www.olcf.ornl.gov/cuda-training-series/), talk recordings can be found under the presentation footer for each lecture; [exercises](https://github.com/olcf/cuda-training-series)\n- [GTC 2022 - CUDA: New Features and Beyond - Stephen Jones](https://www.youtube.com/watch?v=SAm4gwkj2Ko)\n- Intro video: [Writing Code That Runs FAST on a GPU](https://youtu.be/8sDg-lD1fZQ)\n- 12 hrs CUDA tutorial: [Introduction of CUDA and writing kernels in CUDA](https://www.youtube.com/watch?v=86FAWCzIe_4)\n\n\n## 2nd Contact\n- [CUDA Refresher](https://developer.nvidia.com/blog/tag/cuda-refresher/)\n\n## Hazy Research\n\nThe MLSys-oriented research group at Stanford led by Chris Re, with\nalumni Tri Dao, Dan Fu, and many others. 
A goldmine.\n\n- [Building Blocks for AI\n  Systems](https://github.com/HazyResearch/aisys-building-blocks):\n  Their collection of resources similar to this one, many great links\n- [Data-Centric AI](https://github.com/HazyResearch/data-centric-ai):\n  An older such collection\n- [Blog](https://hazyresearch.stanford.edu/blog)\n- [ThunderKittens](https://hazyresearch.stanford.edu/blog/2024-05-12-tk):\n  (May 2024) A DSL within CUDA; this blog post has good background on\n  getting good H100 performance\n- [Systems for Foundation Models, and Foundation Models for\n  Systems](https://neurips.cc/virtual/2023/invited-talk/73990): Chris\n  Re's keynote from NeurIPS Dec 2023\n\n## Papers, Case Studies\n- [A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library](https://arxiv.org/abs/2312.11918)\n- [How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog](https://siboehm.com/articles/22/CUDA-MMM)\n- [Anatomy of high-performance matrix multiplication](https://dl.acm.org/doi/10.1145/1356052.1356053)\n\n\n## Books\n- [Programming Massively Parallel Processors: A Hands-on Approach](https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311)\n- [CUDA by Example: An Introduction to General-Purpose GPU Programming](https://edoras.sdsu.edu/~mthomas/docs/cuda/cuda_by_example.book.pdf); [code](https://github.com/tpn/cuda-by-example)\n- [The CUDA Handbook](https://www.cudahandbook.com/)\n- [The Book of Shaders](https://thebookofshaders.com/), a guide through the abstract and complex universe of fragment shaders (not CUDA, but GPU-related)\n- [Art of HPC](https://theartofhpc.com/), four books on HPC more generally; they do not specifically cover GPUs, but the lessons broadly apply\n\n## CUDA Courses\n- [HetSys: Programming Heterogeneous Computing Systems with GPUs and other Accelerators](https://safari.ethz.ch/projects_and_seminars/fall2022/doku.php?id%253Dheterogeneous_systems)\n- 
[Heterogeneous Parallel Programming Class](https://www.youtube.com/playlist?list=PLzn6LN6WhlN06hIOA_ge6SrgdeSiuf9Tb) (YouTube playlist) Prof. Wen-mei Hwu, University of Illinois\n- [Official YouTube channel for \"Programming Massively Parallel Processors: A Hands-on Approach\"](https://www.youtube.com/@pmpp-book), course playlist: [Applied Parallel Programming](https://www.youtube.com/playlist?list=PLRRuQYjFhpmvu5ODQoY2l7D0ADgWEcYAX)\n- [Programming Parallel Computers](https://ppc-exercises.cs.aalto.fi/courses); covers both CUDA and CPU-parallelism. Use [Open Course Version](https://ppc-exercises.cs.aalto.fi/course/open2024a) and you can even submit your own solutions to the exercises for testing and benchmarking. \n\n\n## CUDA Grandmasters\n\n### Tri Dao\n- x: [@tri_dao](https://twitter.com/tri_dao), gh: [tridao](https://github.com/tridao)\n- [Dao-AILab/flash-attention](https://github.com/Dao-AILab/flash-attention), [paper](https://arxiv.org/abs/2205.14135)\n- [state-spaces/mamba](https://github.com/state-spaces/mamba), paper: [Mamba: Linear-Time Sequence Modeling with Selective State Spaces](https://arxiv.org/abs/2312.00752), minimal impl: [mamba-minimal](https://github.com/johnma2006/mamba-minimal)\n\n\n### Tim Dettmers\n- x: [@Tim_Dettmers](https://twitter.com/Tim_Dettmers), gh: [TimDettmers](https://github.com/TimDettmers)\n- [TimDettmers/bitsandbytes](https://github.com/TimDettmers/bitsandbytes), docs: [docs](https://bitsandbytes.readthedocs.io/en/latest/)\n- [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)\n\n\n### Sasha Rush\n- x: [@srush_nlp](https://twitter.com/srush_nlp), gh: [srush](https://github.com/srush)\n- [Sasha Rush's GPU Puzzles](https://github.com/srush/GPU-Puzzles), dshah3's [CUDA C++ version](https://github.com/dshah3/GPU-Puzzles) \u0026 [walkthrough video](https://www.youtube.com/watch?v=3frRR6fycgM)\n- [Mamba: The Hard Way](https://srush.github.io/annotated-mamba/hard.html), code: 
[srush/annotated-mamba](https://github.com/srush/annotated-mamba)\n\n\n## Practice\n- [Adnan Aziz and Anupam Bhatnagar GPU Puzzlers](http://www.gpupuzzlers.com/)\n\n\n## PyTorch Performance Optimization\n- [Accelerating Generative AI with PyTorch: Segment Anything, Fast](https://pytorch.org/blog/accelerating-generative-ai/)\n- [Accelerating Generative AI with PyTorch II: GPT, Fast](https://pytorch.org/blog/accelerating-generative-ai-2/)\n- [Speed, Python: Pick Two. How CUDA Graphs Enable Fast Python Code for Deep Learning](https://blog.fireworks.ai/speed-python-pick-two-how-cuda-graphs-enable-fast-python-code-for-deep-learning-353bf6241248)\n- [Performance Debugging of Production PyTorch Models at Meta](https://pytorch.org/blog/performance-debugging-of-production-pytorch-models-at-meta/)\n\n\n## PyTorch Internals \u0026 Debugging\n- [TorchDynamo Deep Dive](https://pytorch.org/docs/stable/torch.compiler_dynamo_overview.html)\n- [PyTorch Compiler Troubleshooting](https://github.com/pytorch/pytorch/blob/main/docs/source/torch.compiler_troubleshooting.rst)\n- [PyTorch internals](http://blog.ezyang.com/2019/05/pytorch-internals/)\n- [PyTorch 2 internals](https://drive.google.com/file/d/1XBox0G3FI-71efQQjmqGh0-VkCd-AHPL/view)\n- Understanding GPU memory: [1: Visualizing All Allocations over Time](https://pytorch.org/blog/understanding-gpu-memory-1/), [2: Finding and Removing Reference Cycles](https://pytorch.org/blog/understanding-gpu-memory-2/)\n- Debugging memory using snapshots: [Debugging PyTorch memory use with snapshots](https://zdevito.github.io/2022/08/16/memory-snapshots.html)\n- CUDA caching allocator: [https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html](https://zdevito.github.io/2022/08/04/cuda-caching-allocator.html)\n- Trace Analyzer: [PyTorch Trace Analysis for the Masses](https://pytorch.org/blog/trace-analysis-for-masses/)\n- [Holistic Trace Analysis (HTA)](https://hta.readthedocs.io/en/latest/), gh: 
[facebookresearch/HolisticTraceAnalysis](https://github.com/facebookresearch/HolisticTraceAnalysis)\n\n\n## Code / Libs\n- [NVIDIA/cutlass](https://github.com/NVIDIA/cutlass)\n\n\n## Essentials\n- [Triton compiler tutorials](https://triton-lang.org/main/getting-started/tutorials/index.html)\n- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/)\n- [PyTorch: Custom C++ and CUDA Extensions](https://pytorch.org/tutorials/advanced/cpp_extension.html), Code: [pytorch/extension-cpp](https://github.com/pytorch/extension-cpp/tree/master)\n- [PyTorch C++ API](https://pytorch.org/cppdocs/index.html)\n- [pybind11 documentation](https://pybind11.readthedocs.io/en/stable/)\n- [NVIDIA Tensor Core Programming](https://leimao.github.io/blog/NVIDIA-Tensor-Core-Programming/)\n- [GPU Programming: When, Why and How?](https://enccs.github.io/gpu-programming/#)\n- [How GPU Computing Works | GTC 2021](https://youtu.be/3l10o0DYJXg?si=t5FHswnibAbo3s0t) (more basic than the 2022 version)\n- [How CUDA Programming Works | GTC 2022](https://youtu.be/n6M8R8-PlnE?si=cJ4dWtpYaPoIuJ0q)\n- [CUDA Kernel optimization Part 1](https://www.youtube.com/watch?v=hOi3NWOPVR8) [Part 2](https://www.youtube.com/watch?v=NrWhZMHrP4w)\n- [PTX and ISA Programming Guide](https://docs.nvidia.com/cuda/parallel-thread-execution/index.html) (V8.3)\n- Compiler Explorer: Inspect PTX: [div 256 -\u003e shr 8 example](https://godbolt.org/z/odb3191vK)\n\n\n## Profiling\n- [Nsight Compute Profiling Guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html)\n- [mcarilli/nsight.sh](https://gist.github.com/mcarilli/376821aa1a7182dfcf59928a7cde3223) - Favorite Nsight Systems profiling commands for PyTorch scripts\n- [Profiling GPU Applications with Nsight Systems](https://www.youtube.com/watch?v=kKANP0kL_hk)\n\n\n## Python GPU Computing\n- [PyTorch](https://pytorch.org/)\n- [Triton](https://triton-lang.org/main/index.html), github: 
[openai/triton](https://github.com/openai/triton/)\n- [numba @cuda.jit](https://numba.readthedocs.io/en/stable/cuda/kernels.html)\n- [Apache TVM](https://tvm.apache.org/)\n- [JAX Pallas](https://jax.readthedocs.io/en/latest/pallas/index.html)\n- [CuPy](https://cupy.dev/) NumPy-compatible GPU computing\n- [NVIDIA Fuser](https://github.com/NVIDIA/Fuser/)\n- [Codon @gpu.kernel](https://docs.exaloop.io/codon/advanced/gpu), github: [exaloop/codon](https://github.com/exaloop/codon)\n- [Mojo](https://docs.modular.com/mojo/manual/) (part of the commercial [MAX Platform](https://www.modular.com/max) by [Modular](https://www.modular.com))\n- NVIDIA Python Bindings: [CUDA Python](https://github.com/NVIDIA/cuda-python) (calling NVRTC to compile kernels, malloc, copy, launching kernels, ..), [cuDNN FrontEnd (FE) API](https://github.com/NVIDIA/cudnn-frontend), [CUTLASS Python Interface](https://github.com/NVIDIA/cutlass/tree/main/python)\n\n\n## Advanced Topics, Research, Compilers\n- [TACO](http://tensor-compiler.org/): The Tensor Algebra Compiler, gh: [tensor-compiler/taco](https://github.com/tensor-compiler/taco)\n- [Mosaic compiler](https://github.com/manya-bansal/mosaic) C++ DSL for sparse and dense tensor algebra (built on top of TACO), [paper](https://dl.acm.org/doi/10.1145/3591236), [presentation](https://aha.stanford.edu/mosaic-interoperable-compiler-tensor-algebra)\n\n\n## News\n- [SemiAnalysis](https://www.semianalysis.com/)\n\n\n## Technical Blog Posts\n- [Cooperative Groups: Flexible CUDA Thread Programming](https://developer.nvidia.com/blog/cooperative-groups/) (Oct 04, 2017)\n- [A friendly introduction to machine learning compilers and optimizers](https://huyenchip.com/2021/09/07/a-friendly-introduction-to-machine-learning-compilers-and-optimizers.html) (Sep 7, 2021)\n\n\n## Hardware Architecture\n- [NVIDIA H100 Whitepaper](https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper)\n- [NVIDIA GH200 
Whitepaper](https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper)\n- [AMD CDNA 3 Whitepaper](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf)\n- [AMD MI300X Data Sheet](https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf)\n- Video: [Can SRAM Keep Shrinking?](https://youtu.be/2G4_RZo41Zw) (by [Asianometry](https://www.asianometry.com/))\n\n\n## GPU-MODE Community Projects\n\n## ring-attention\n- see our [ring-attention](https://github.com/gpu-mode/ring-attention) repo\n\n## pscan\n- GPU Gems: [Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), [PDF version (2007)](https://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/scan/doc/scan.pdf), impl: [stack overflow](https://stackoverflow.com/a/30835030/387870), nicer impl: [mattdean1/cuda](https://github.com/mattdean1/cuda)\n- [Accelerating Reduction and Scan Using Tensor Core Units](https://arxiv.org/abs/1811.09736)\n- Thrust: [Prefix Sums](https://docs.nvidia.com/cuda/thrust/index.html#prefix-sums), Reference: [scan variants](https://thrust.github.io/doc/group__prefixsums.html)\n- [CUB](https://nvlabs.github.io/cub/), part of cccl: [NVIDIA/cccl/tree/main/cub](https://github.com/NVIDIA/cccl/tree/main/cub)\n- SAM Algorithm: [Higher-Order and Tuple-Based Massively-Parallel Prefix Sums](https://userweb.cs.txstate.edu/~mb92/papers/pldi16.pdf) (licensed for non commercial use only)\n- CUB Algorithm: [Single-pass Parallel Prefix Scan with Decoupled Look-back](https://research.nvidia.com/publication/2016-03_single-pass-parallel-prefix-scan-decoupled-look-back)\n- Group Experiments: [johnryan465/pscan](https://github.com/johnryan465/pscan), [andreaskoepf/pscan_kernel](https://github.com/andreaskoepf/pscan_kernel)\n\n\n## Triton Kernels / Examples\n\n- 
[`unsloth`](https://github.com/unslothai/unsloth), which implements custom kernels in Triton for faster QLoRA training\n- Custom implementation of relative position attention ([link](https://github.com/pytorch-labs/segment-anything-fast/blob/main/segment_anything_fast/flash_4.py))\n- Tri Dao's Triton implementation of Flash Attention: [flash_attn_triton.py](https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_triton.py)\n- YouTube playlist: [Triton Conference 2023](https://www.youtube.com/watch?v=ZGU0Yw7mORE\u0026list=PLc_vA1r0qoiRZfUC3o4_yjj0FtWvodKAz)\n- [LightLLM](https://github.com/ModelTC/lightllm) with different Triton kernels for different LLMs\n\n\n## How to contribute\nTo share interesting CUDA-related links, please create a pull request for this file. See [editing files](https://docs.github.com/en/repositories/working-with-files/managing-files/editing-files) in the GitHub documentation.\n\nOr contact us on the **GPU MODE** Discord server: [https://discord.gg/gpumode](https://discord.gg/gpumode)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgpu-mode%2Fresource-stream","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgpu-mode%2Fresource-stream","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgpu-mode%2Fresource-stream/lists"}