{"id":97403,"url":"https://github.com/goabiaryan/awesome-gpu-engineering","name":"awesome-gpu-engineering","description":"GPU Engineering for AI Systems","projects_count":61,"last_synced_at":"2026-06-03T07:00:21.481Z","repository":{"id":318849222,"uuid":"1046725736","full_name":"goabiaryan/awesome-gpu-engineering","owner":"goabiaryan","description":"GPU Engineering for AI Systems","archived":false,"fork":false,"pushed_at":"2026-05-17T11:48:04.000Z","size":934,"stargazers_count":304,"open_issues_count":0,"forks_count":36,"subscribers_count":4,"default_branch":"main","last_synced_at":"2026-05-17T18:31:30.963Z","etag":null,"topics":["awesome","awesome-lists","cuda-programming","gpu-engineering","gpu-programming","kernels"],"latest_commit_sha":null,"homepage":"https://gpuengineering.com","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/goabiaryan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":null,"code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-29T06:03:38.000Z","updated_at":"2026-05-17T11:48:09.000Z","dependencies_parsed_at":"2025-10-16T19:05:37.852Z","dependency_job_id":"5a8b2030-6e85-4227-bf77-710ce03936ac","html_url":"https://github.com/goabiaryan/awesome-gpu-engineering","commit_stats":null,"previous_names":["goabiaryan/awesome-gpu-engineering"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/goabiaryan/awesome-gpu-engineering","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goabiaryan%2Fawesome-gpu-engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goabiaryan%2Fawesome-gpu-engineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goabiaryan%2Fawesome-gpu-engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goabiaryan%2Fawesome-gpu-engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/goabiaryan","download_url":"https://codeload.github.com/goabiaryan/awesome-gpu-engineering/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/goabiaryan%2Fawesome-gpu-engineering/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33852295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"created_at":"2025-10-29T22:02:12.661Z","updated_at":"2026-06-03T07:00:21.482Z","primary_language":null,"list_of_lists":false,"displayable":true,"categories":["🧾 License","💻 GPU Programming Frameworks","🧩 Optimization and Performance","⚙️ Systems and Multi-GPU Engineering","📘 Foundational Books","🧪 Tutorials and Courses","⭐ Acknowledgements","📄 Research Papers and Articles","🧰 Tools and Utilities","🧠 Architecture and Low-Level Design"],"sub_categories":["Learning Tools"],"readme":"# Awesome GPU Engineering [![Awesome](https://awesome.re/badge.svg)](https://awesome.re)\n\n\u003e A curated list of resources for mastering GPU engineering from architecture and kernel programming to large-scale distributed systems and AI acceleration.\n\n---\n\n## 📘 Foundational Books\n\n- **Programming Massively Parallel Processors: A Hands-on Approach** — *David B. Kirk \u0026 Wen-mei W. Hwu* \n  The canonical introduction to CUDA, memory hierarchies, and parallel patterns. *[Amazon](https://www.amazon.com/Programming-Massively-Parallel-Processors-Hands/dp/0323912311)* , notes: [*Abi's Concise Notes*](https://github.com/goabiaryan/awesome-gpu-engineering/blob/main/notes/Abi's%20PMPP%20Notes.pdf)\n- **CUDA by Example** — *Jason Sanders \u0026 Edward Kandrot*  \n  A practical introduction to CUDA for beginners. *[Amazon](https://www.amazon.com/CUDA-Example-Introduction-General-Purpose-Programming/dp/0131387685)*\n- **The Ultra-Scale Playbook: Training LLMs on GPU Clusters** - Hugging Face *[Web Version](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=high-level_overview)*\n\n\n## 💻 GPU Programming Frameworks\n\n- **[CUDA](https://developer.nvidia.com/cuda-toolkit)** — NVIDIA’s proprietary GPU programming platform.\n  - Libraries: [cuBLAS](https://developer.nvidia.com/cublas), [cuDNN](https://developer.nvidia.com/cudnn)\n- **[ROCm](https://github.com/RadeonOpenCompute/ROCm)** — AMD’s open compute stack.\n- **[OpenCL](https://www.khronos.org/opencl/)** — Cross-platform parallel computing standard.\n- **[SYCL / oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html)** — Intel’s C++ abstraction for heterogeneous compute.\n- **[Vulkan Compute](https://www.khronos.org/vulkan/)** — Low-level GPU compute API.\n- **[Kompute](github.com/komputeproject/kompute)** — Higher level general purpose GPU compute framework built on Vulkan.\n- **[Metal Performance Shaders](https://developer.apple.com/metal/)** — Apple’s GPU framework.\n- **[Mojo🔥](https://mojolang.org)** - Write like Python, run like C++.\n\n\n## 🧩 Optimization and Performance\n\n- **[NVIDIA Nsight Systems](https://developer.nvidia.com/nsight-systems)** — System-wide GPU profiler.\n- **[Nsight Compute](https://developer.nvidia.com/nsight-compute)** — Kernel-level performance analysis.\n- **Occupancy Calculator** — NVIDIA spreadsheet for kernel configuration.\n- **[CUTLASS](https://github.com/NVIDIA/cutlass)** — CUDA templates for linear algebra subroutines.\n- **[TensorRT](https://developer.nvidia.com/tensorrt)** — High-performance deep learning inference.\n- **[OpenAI Triton](https://triton-lang.org/)** — Python DSL for writing high-performance GPU kernels.\n- **[Helion](https://helionlang.com)** - A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.\n- **[Roofline Model](https://jax-ml.github.io/scaling-book/)** — Analytical model to reason about compute/memory bottlenecks.\n\n\n## 🧠 Architecture and Low-Level Design\n\n- **[NVIDIA Ampere Whitepaper](https://developer.nvidia.com/ampere-architecture)**\n- **[AMD RDNA \u0026 CDNA Architectures](https://gpuopen.com/learn/)**\n- SIMT execution and warp scheduling\n- Memory hierarchy and coalescing\n- Shared memory and cache optimization\n- Warp divergence and thread occupancy\n\n\n## ⚙️ Systems and Multi-GPU Engineering\n\n- **[NCCL](https://developer.nvidia.com/nccl)** — Multi-GPU communication primitives.\n- **[vLLM](https://github.com/vllm-project/vllm)** - Inference and serving engine for LLMs\n- **[Hugging Face Accelerate](https://github.com/huggingface/accelerate)** - Simplify abstractions for distributed training\n- **[SGLang](https://github.com/sgl-project/sglang)**\n- **[Prime Intellect](https://github.com/PrimeIntellect-ai/prime-cli)**\n- **[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)**\n- **[TGI by Hugging Face](https://huggingface.co/docs/text-generation-inference/en/index)**\n- **[Horovod](https://github.com/horovod/horovod)** — Distributed deep learning across GPUs.\n- **NVLink \u0026 PCIe Topology** — GPU interconnects and bandwidth optimization.\n- **[GPUDirect RDMA](https://developer.nvidia.com/gpudirect)** — Zero-copy GPU networking.\n- **[Ray Train](https://docs.ray.io/en/latest/train/index.html)**, **[DeepSpeed](https://github.com/microsoft/DeepSpeed)**, **[Megatron-LM](https://github.com/NVIDIA/Megatron-LM)** — Large-scale GPU orchestration frameworks.\n- **[Iris by AMD](https://github.com/ROCm/iris)** - open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.\n\n\n## 🧪 Tutorials and Courses\n\n- [CUDA C++ Programming Guide](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html)\n- [Triton Tutorials (OpenAI)](https://triton-lang.org/main/getting-started/tutorials/index.html)\n- [CUDA in 12 hours by FreeCodeCamp](https://www.youtube.com/watch?v=86FAWCzIe_4)  and [Video Repo](https://github.com/infatoshi/cuda-course)\n- [Stanford CS149, Fall 2025 Parallel Computing Course Fall 2025](https://gfxcourses.stanford.edu/cs149/fall25/)\n- [CMU 15-418/618: Parallel Computer Architecture \u0026 Programming](https://www.cs.cmu.edu/~418/)\n- [MIT 6.5940: TinyML and Efficient Deep Learning Computing](https://hanlab.mit.edu/courses/2024-fall-65940)\n- [GPU MODE video lecture series](https://www.youtube.com/@GPUMODE/videos)\n- [Red Hat vLLM Office Hours video series](https://www.youtube.com/playlist?list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3)\n- [The courses of the Programming Massively Parallel Processors book's authors](https://www.youtube.com/@pmpp-book/playlists)\n\n\n\n## 📄 Research Papers and Articles\n\n- *[Optimization techniques for GPU programming](https://dl.acm.org/doi/pdf/10.1145/3570638)* - Hijma, Pieter, et al.\n- *[Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026arnumber=11186485)* - Oden, Lena, and Klaus Nölp\n- *[Evolving GPU Architecture](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9623445\u0026casa_token=Zknb-Go77Y4AAAAA:03tRVI5oLoyDZMx-UZZiWp9h7JRTc-UHNmiHykq2MZWBKNFBwjxEUpuddkX54Z246I6gjDUpdw\u0026tag=1)* — Kirk \u0026 Hwu\n- *[Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision](https://arxiv.org/abs/2205.11913)* - Wei Gao et al\n- *[Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026arnumber=11064558)*  - Niteesh, L., and M. B. Ampareeshan\n- NVIDIA Research Papers on *[Model Parallelism](https://dl.acm.org/doi/pdf/10.1145/3458817.3476209?casa_token=p3epEa_Z4xEAAAAA:fZgVzYD2uMH5NcafdBN9g7EgIbESqB7WsHjL0X6LU2zdm6EdgQkMyIFk0yZAfWGl1o3PeUSB4xhg)* and *[Megatron-LM](https://arxiv.org/pdf/1909.08053)*\n- *[GPU Virtualization and Multi-Tenant Scheduling](https://dl.acm.org/doi/pdf/10.1145/3068281?casa_token=bbU9Dvrt3vsAAAAA:jxP-NNGr8GEmjOng-EFlb1Rd6wVSQAXg65GTK1jDPlGIkGjNIirMWkDZcjnTw0xDZmLGZ489LwHX)*\n- *[A Survey of Multi-Tenant Deep Learning Inference on GPU](https://arxiv.org/abs/2203.09040)*\n- *[Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception](https://www.youtube.com/watch?v=e54BVwcdJ4Y)*\n\n\n## 🧰 Tools and Utilities\n\n- **nvprof**, **nvvp**, **Nsight Systems / Compute** — NVIDIA profiling tools.\n- **cuda-memcheck**, **compute-sanitizer** — Memory and correctness tools.\n- **[GPGPU-Sim](https://github.com/gpgpu-sim/gpgpu-sim)**, **[Accel-Sim](https://accel-sim.github.io/)** — GPU simulation frameworks.\n- **[Ingero](https://github.com/ingero-io/ingero)** — eBPF-based GPU causal observability agent. Traces CUDA Runtime/Driver APIs and host kernel events to build causal chains explaining GPU latency. \u003c2% overhead, production-safe.\n- **Perfetto**, **Nsight UI** — Visual profilers for tracing GPU workloads.\n\n### Learning Tools\n\n- **[Tensara](https://gpuengineering.com)**\n- **[LeetGPU](https://leetgpu.com/)**\n- **[GPU MODE Discord](https://discord.gg/FnjEVAhW)**\n- **[GPU Glossary](https://modal.com/gpu-glossary)** - A dictionary of terms related to programming GPUs\n- **[Mojo🔥 GPU Puzzles](https://puzzles.modular.com)**\n\n\n## 🧑‍🔬 GPU for AI \u0026 ML\n\n- **PyTorch CUDA Extensions** — Custom kernels for PyTorch.\n- **JAX + XLA** — Compiler-based GPU vectorization.\n- **TensorFlow XLA Compiler** — Ahead-of-time GPU graph compilation.\n- **FlashAttention**, **FlashConv** — Kernel optimization techniques for transformers.\n- **DeepSpeed**, **FSDP**, **Megatron-LM** — Distributed training systems.\n\n## 🧱 GPU Systems Design Topics For Interview Prep\n\n- FlashAttention and PagedAttention\n- Matmul Operations\n- GPU scheduling algorithms and runtime systems.\n- Memory oversubscription and unified memory models.\n- Resource allocation in GPU clusters.\n- GPU virtualization\n- Kernel fusion and graph execution\n- Dataflow optimization\n- Persistent threads model\n\n---\n\n## 🧑‍💻 Contributors\n\nContributions welcome!\nPlease read the [contribution guidelines](CONTRIBUTING.md) before submitting a pull request.\n\n## 🧾 License\n\n[CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) — feel free to share and adapt with attribution.\n\n## ⭐ Acknowledgements\n\nInspired by:\n- [Awesome HPC](https://github.com/trevor-vincent/awesome-high-performance-computing)\n- [Awesome Computer Architecture](https://github.com/aalhour/awesome-computer-architecture)\n- [Awesome CUDA](https://github.com/coderonion/awesome-cuda-and-hpc)\n\n---\n\n\u003e “GPU engineering is not just about writing kernels. It’s about understanding how systems work.”  — [Model Craft](https://modelcraft.substack.com/p/fundamentals-of-gpu-engineering)\n\n","projects_url":"https://awesome.ecosyste.ms/api/v1/lists/goabiaryan%2Fawesome-gpu-engineering/projects"}