awesome-gpu-engineering
GPU Engineering for AI Systems
https://github.com/goabiaryan/awesome-gpu-engineering
Last synced: 8 days ago
Acknowledgements
-
🧠 Architecture and Low-Level Design
-
Foundational Books
- Amazon - Abi's PMPP Notes (gpu-engineering/blob/main/notes/Abi's%20PMPP%20Notes.pdf)
- Amazon
- Web Version
💻 GPU Programming Frameworks
- CUDA
- cuBLAS
- ROCm
- OpenCL - Cross-platform parallel computing standard.
- SYCL / oneAPI
- Vulkan Compute - Low-level GPU compute API.
- Metal Performance Shaders
- Mojo🔥 - Write like Python, run like C++.
🧾 License
-
🧩 Optimization and Performance
- NVIDIA Nsight Systems - System-wide GPU profiler.
- Nsight Compute - Kernel-level performance analysis.
- CUTLASS
- TensorRT - High-performance deep learning inference.
- OpenAI Triton - High-performance GPU kernels.
- Roofline Model
- Helion - A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
Research Papers and Articles
- Optimization techniques for GPU programming - Hijma, Pieter, et al.
- Efficient Multi-GPU Programming in Python: Reducing Synchronization and Access Overheads - Oden, Lena, and Klaus Nölp
- Evolving GPU Architecture
- Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision - Wei Gao et al.
- Optimizing Machine Learning Models with CUDA: A Comprehensive Performance Analysis - Niteesh, L., and M. B. Ampareeshan
- Model Parallelism - Megatron-LM (https://arxiv.org/pdf/1909.08053)
- GPU Virtualization and Multi-Tenant Scheduling
- A Survey of Multi-Tenant Deep Learning Inference on GPU
- Efficient Performance-Aware GPU Sharing with Compatibility and Isolation through Kernel Space Interception
⚙️ Systems and Multi-GPU Engineering
- NCCL - Multi-GPU communication primitives.
- vLLM - Inference and serving engine for LLMs
- Hugging Face Accelerate - Simplified abstractions for distributed training
- SGLang
- TensorRT-LLM
- TGI by Hugging Face
- Horovod
- GPUDirect RDMA - Zero-copy GPU networking.
- Iris by AMD - Open-source multi-GPU programming framework built for compiler-visible performance and optimized multi-GPU execution.
- Prime Intellect
- Ray Train / Megatron-LM (https://github.com/NVIDIA/Megatron-LM) - Large-scale GPU orchestration frameworks.
🧰 Tools and Utilities
Learning Tools
- LeetGPU
- GPU MODE Discord
- GPU Glossary - A dictionary of terms related to programming GPUs
- Tensara
- Mojo🔥 GPU Puzzles
🧪 Tutorials and Courses
- CUDA C++ Programming Guide
- Triton Tutorials (OpenAI)
- CUDA in 12 hours by FreeCodeCamp
- Stanford CS149: Parallel Computing, Fall 2025
- CMU 15-418/618: Parallel Computer Architecture & Programming
- MIT 6.5940: TinyML and Efficient Deep Learning Computing
- GPU MODE video lecture series
- Red Hat vLLM Office Hours video series
- Courses by the authors of the book Programming Massively Parallel Processors