{"id":31431069,"url":"https://github.com/adichat/ai-cheatsheet","last_synced_at":"2026-02-14T05:35:17.188Z","repository":{"id":316963892,"uuid":"1062750223","full_name":"AdiChat/AI-cheatsheet","owner":"AdiChat","description":"AI blogs you should read.","archived":false,"fork":false,"pushed_at":"2025-11-16T19:25:38.000Z","size":110,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-16T21:15:21.553Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AdiChat.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-09-23T17:14:05.000Z","updated_at":"2025-11-16T19:25:41.000Z","dependencies_parsed_at":"2025-09-28T16:16:55.945Z","dependency_job_id":null,"html_url":"https://github.com/AdiChat/AI-cheatsheet","commit_stats":null,"previous_names":["adichat/ai-cheatsheet"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/AdiChat/AI-cheatsheet","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdiChat%2FAI-cheatsheet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdiChat%2FAI-cheatsheet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdiChat%2FAI-cheatsheet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdiChat%2FAI-cheatsheet/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AdiChat","download_url":"https://codeload.github.com/AdiChat/AI-cheatsheet/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AdiChat%2FAI-cheatsheet/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29438609,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-14T05:24:35.651Z","status":"ssl_error","status_checked_at":"2026-02-14T05:24:34.830Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-30T09:22:39.757Z","updated_at":"2026-02-14T05:35:17.183Z","avatar_url":"https://github.com/AdiChat.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# AI Engineering cheatsheet\n*ai/ml resources to master state-of-the-art (SOTA) techniques from engineers and researchers* 🧠💻\n\n---\n\nContents:\n* **End to end free guides to follow**\n* Interesting papers you MUST read\n* Main AI blogs to read regularly (continuous learning)\n* **Deep dive into all core AI concepts** [Learn step-by-step]\n* MAYBE guides you may go through\n* Want to contribute in leading AI open-source projects?\n\n---\n\n## End to end free guides to follow\n\n\u003cimg height=\"50\" alt=\"image\" src=\"https://github.com/user-attachments/assets/82fdef14-cc94-4a78-bdd0-fd5e7d38bd0e\" /\u003e \u003cimg 
height=\"50\" alt=\"image\" src=\"https://github.com/user-attachments/assets/b7d30827-1b3d-4bb9-b792-8f47aa98e529\" /\u003e\n\nMUST:\n- [ ] [CSE234](https://hao-ai-lab.github.io/cse234-w25/): **ML Sys** course by Prof. Hao Zhang at UC San Diego (rating 10/10); covers core LLM serving engineering concepts\n- [ ] [AI Engineering Silicon Cheatsheet](https://amzn.to/3Wl5Tum): **Cheatsheet** covering all major concepts in modern AI; a must-have reference\n- [ ] [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook?section=first_steps:_training_on_one_gpu): Training LLMs on GPU Clusters\n- [ ] [Llama visualization](https://www.alphaxiv.org/labs/tensor-trace): step by step [analyze each tensor](https://www.alphaxiv.org/labs/fly-through-llama) as it is processed in Llama\n\n---\n\n## Core AI engineering papers you MUST read\n\n\u003cimg height=\"50\" alt=\"image\" src=\"https://github.com/user-attachments/assets/b0bcbfd3-5e89-4133-89a3-55e858fa82a5\" /\u003e \u003cimg height=\"50\" alt=\"image\" src=\"https://github.com/user-attachments/assets/50fbc127-b4b2-4bb9-8f02-cc72ad126da0\" /\u003e\n\n- [ ] [AI and memory wall](https://arxiv.org/pdf/2403.14123): why memory is the main bottleneck for LLMs\n- [ ] [Collective Communication for 100k+ GPUs](https://arxiv.org/abs/2510.20171) by Meta\n- [ ] [The Landscape of GPU-Centric Communication](https://arxiv.org/pdf/2409.09874v2)\n- [ ] [Pre-training under infinite compute](https://arxiv.org/pdf/2509.14786) by Stanford University\n- [ ] [Give Me BF16 or Give Me Death](https://arxiv.org/pdf/2411.02355) by Red Hat and [Give Me FP32 or Give Me Death?](https://arxiv.org/pdf/2506.09501v1)\n\nOthers:\n- [ ] [LLMs don't just memorize, they build a geometric map that helps them reason](https://arxiv.org/pdf/2510.26745) by Google\n- [ ] [Self-Adapting Language Models](https://arxiv.org/pdf/2506.10943) by MIT\n---\n\n## Main AI blogs to read regularly (continuous learning)\n\n- [ ] [NVIDIA Developer 
Blog](https://developer.nvidia.com/blog/): Deep dive into multiple AI topics.\n- [ ] [TensorRT LLM tech blogs](https://github.com/NVIDIA/TensorRT-LLM/tree/main/docs/source/blogs/tech_blog): Deep dive into techniques/optimizations in one of the leading LLM inference libraries. (13 posts as of now)\n- [ ] [SGLang tech blog](https://lmsys.org/blog/): SGLang is one of the leading LLM serving frameworks. Most posts are about SGLang but are rich in technical information.\n- [ ] [AI System co-design](https://aisystemcodesign.github.io/) at Meta\n\nYouTube channels to follow regularly:\n\n- [ ] [vLLM office hours](https://www.youtube.com/watch?v=uWQ489ONvng\u0026list=PLbMP1JcGBmSHxp4-lubU5WYmJ9YgAQcf3): Deep dive into various technical topics in vLLM\n- [ ] [GPU Mode](https://www.youtube.com/@GPUMODE/videos): Deep dive into various LLM topics from guests from the AI community\n- [ ] [PyTorch channel](https://www.youtube.com/@PyTorch/videos): videos of various PyTorch events, including keynotes on technical topics like torch.compile.\n\n---\n\n## Deep dive into AI concepts [Learn step-by-step]\n_Only high-quality resources are listed. No need to read 100s of posts to get an idea. Just one post should be enough._\n\n* **GPU architecture**\n\u003cbr\u003e Current SOTA AI/LLM workloads are possible only because of GPUs. 
Understanding GPU architecture gives you an engineering edge.\n- [ ] [Understanding GPU architecture with MatMul](https://www.aleksagordic.com/blog/matmul), [intuition about GPUs](https://jax-ml.github.io/scaling-book/gpus/)\n- [ ] [GPU Shared memory banks / microbenchmarks](https://feldmann.nyc/blog/smem-microbenchmarks)\nGPU programming concepts:\n- [ ] [CUDA programming model](https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/), [GPU memory management](https://www.nvidia.com/en-us/on-demand/session/gtc24-s62550/): Mark Harris's GTC Talk on Coalesced Memory Access, [Prefix Sum/ Scan in GPU](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda)\n- [ ] [Programming Massively Parallel Processors](https://www.youtube.com/playlist?list=PLRRuQYjFhpmubuwx-w8X964ofVkW1T8O4) series on YT\n\n* Performance\n- [ ] [Performance metrics](https://github.com/NVIDIA/nccl-tests/blob/master/doc/PERFORMANCE.md#performance-reported-by-nccl-tests) by nccl tests, [Profiling guide](https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#profiling-guide) by Nsight, [Understanding DL performance](https://horace.io/brrr_intro.html)\n\n* **Transformer**\n- [ ] [CME 295](https://cme295.stanford.edu/syllabus/): Basics of Transformer and LLM course by Stanford University\n- [ ] [Transformer overall](https://www.krupadave.com/articles/everything-about-transformers): Encoder-only and Decoder-only models\n- [ ] BERT (_insightful_): [BERT as text diffusion step](https://nathan.rs/posts/roberta-diffusion/)\n- [ ] [Memory requirements for LLM](https://themlsurgeon.substack.com/p/the-memory-anatomy-of-large-language). There are 4 parts: activation, parameter, gradient, optimizer states. 
[How LLM handle memory?](https://fastpaca.com/blog/llm-memory-systems-explained)\n\n* **Attention**\n\n\u003cimg height=\"100\" alt=\"image\" src=\"https://github.com/user-attachments/assets/610b462d-ae36-4b25-a657-fd05f210eb53\" /\u003e \u003cimg height=\"100\" alt=\"image\" src=\"https://github.com/user-attachments/assets/39dabb64-bd91-40a0-870c-d1218ac005c3\" /\u003e\n\n- [ ] [Self-attention / Multi-head attention](https://magazine.sebastianraschka.com/p/understanding-and-coding-self-attention) (MHA), Multi-Query attention (MQA), [Group Query Attention](https://www.ibm.com/think/topics/grouped-query-attention) (GQA), MLA (used in DeepSeek)\n- [ ] [FlashAttention](https://gordicaleksa.medium.com/eli5-flash-attention-5c44017022ad) ([paper1](https://arxiv.org/abs/2205.14135), paper for [v2](https://arxiv.org/abs/2307.08691), paper for [v3](https://tridao.me/blog/2024/flash3/), Online softmax, [Implementation](https://github.com/Dao-AILab/flash-attention) by Tri Dao)\n- [ ] [Ring Attention](https://christianjmills.com/posts/cuda-mode-notes/lecture-013/) (links to Context Parallelism CP): Handles large sequence length, [Flex Attention](https://arxiv.org/abs/2412.05496) by PyTorch\n- [ ] KV cache, FP8 KV cache, [Paged Attention](https://hamzaelshafie.bearblog.dev/paged-attention-from-first-principles-a-view-inside-vllm/)\n- [ ] [Data Parallel (DP) Attention](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models)\n\n**Core operations**\n\n- [ ] [GEMM](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html) / MatMul, [API of GEMM](https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-dpcpp/2025-0/gemm.html), [GEMM as core of AI](https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/), [W4A8 GEMM Kernel](https://arxiv.org/pdf/2509.01229)\n- [ ] [MoE](https://www.ibm.com/think/topics/mixture-of-experts) (Mixture of experts)\n- [ ] 
[Embedding](https://huggingface.co/spaces/hesamation/primer-llm-embedding?section=bert_(bidirectional_encoder_representations_from_transformers)) (deepdive), RoPE ([paper](https://arxiv.org/pdf/2104.09864))\n\n* **Quantization**\n\n\u003cimg height=\"20\" alt=\"image\" src=\"https://github.com/user-attachments/assets/ad731cfe-ede3-4e53-b28b-e48221aab6c9\" /\u003e\n\n- [ ] [Quantization basics](https://themlsurgeon.substack.com/p/the-machine-learning-surgeons-guide)\n- [ ] [Different data type simulations](https://www.quant.exposed/)\n- [ ] [INT8 quantization using QAT](https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/), [LLM quantization with PTQ](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/), [FP8 datatype](https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/), [AWQ](https://hamzaelshafie.bearblog.dev/awq-activation-aware-weight-quantisation/)\n- [ ] [Per-tensor and per-block scaling](https://developer.nvidia.com/blog/per-tensor-and-per-block-scaling-strategies-for-effective-fp8-training/)\n- [ ] [NVFP4 training](https://developer.nvidia.com/blog/nvfp4-trains-with-precision-of-16-bit-and-speed-and-efficiency-of-4-bit/), [Optimizing FP4 Mixed-Precision Inference on AMD GPUs](https://lmsys.org/blog/2025-09-21-petit-amdgpu/), Recent [LLM quantization progress](https://blog.openvino.ai/blog-posts/q325-technology-update---low-precision-and-model-optimization)\n- [ ] [Quantization on CPU (GGUF, AWQ, GPTQ)](https://www.ionio.ai/blog/llms-on-cpu-the-power-of-quantization-with-gguf-awq-gptq), [GGUF quantization method](https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overview_of_gguf_quantization_methods/), [GPTQ](https://arxiv.org/pdf/2210.17323): Post-training quantization for LLMs. 
OBQ: [Post-Training Quantization and Pruning](https://arxiv.org/pdf/2208.11580)\n- [ ] [Mixed precision training](https://docs.nvidia.com/deeplearning/performance/mixed-precision-training/index.html)\n- [ ] [NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/), [GEMV kernel](https://veitner.bearblog.dev/nvfp4-gemv/) for NVFP4\n- [ ] [Details on FP8 training](https://research.colfax-intl.com/deepseek-r1-and-fp8-mixed-precision-training/)\n---\n- [ ] [Pruning and distillation](https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/)\n\n* Post-training\n- [ ] [Post training concepts with SFT, RLHF, RLFR](https://tokens-for-thoughts.notion.site/post-training-101)\n- [ ] [Smol Training Playbook: The Secrets to Building World-Class LLMs](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#introduction)\n\n* Optimizations\n\n- [ ] [LLM Inference optimizations](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/); [optimizations v2](https://gaurigupta19.github.io/llms/distributed%20ml/optimization/2025/10/02/efficient-ml.html)\n- [ ] 5D parallelism [PP, SP, DP, TP, CP, EP](https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/features/parallelisms.html), [parallelism](https://themlsurgeon.substack.com/p/data-parallelism-scaling-llm-training) concept for LLM scaling. 
[Parallelism in PyTorch](https://ggrigorev.me/posts/introduction-to-parallelism/)\n- [ ] [Chunk prefill - SARATHI paper](https://arxiv.org/pdf/2308.16369), [dynamic and continuous batching](https://bentoml.com/llm/inference-optimization/static-dynamic-continuous-batching)\n- [ ] [KV cache offloading](https://bentoml.com/llm/inference-optimization/kv-cache-offloading), [KVcache early reuse](https://developer.nvidia.com/blog/5x-faster-time-to-first-token-with-nvidia-tensorrt-llm-kv-cache-early-reuse/)\n- [ ] [Speculative decoding](https://bentoml.com/llm/inference-optimization/speculative-decoding), ([basic introduction](https://github.com/zhaochenyang20/Awesome-ML-SYS-Tutorial/blob/main/sglang/speculative-decoding/speculative-decoding.md#speculative-decoding)), [Look-ahead reasoning](https://hao-ai-lab.github.io/blogs/lookaheadreasoning/), [Paper from Google](https://arxiv.org/pdf/2211.17192) and [DeepMind](https://arxiv.org/pdf/2302.01318)\n- [ ] [MoE using Wide Expert Parallelism EP](https://developer.nvidia.com/blog/scaling-large-moe-models-with-wide-expert-parallelism-on-nvl72-rack-scale-systems/)\n\n\u003cimg height=\"100\" alt=\"image\" src=\"https://github.com/user-attachments/assets/68297949-8f6f-41e3-aa43-a9cae8c52102\" /\u003e\n\n* Scheduling / Routing\n\n- [ ] [P/D disaggregation](https://hao-ai-lab.github.io/blogs/distserve-retro/), [DistServe P/D disaggregation paper](https://arxiv.org/pdf/2401.09670)\n- [ ] [KVCache-centric disaggregated architecture](https://arxiv.org/pdf/2407.00079) by MooncakeAI\n- [ ] [OverFill: Two-Stage Models for Efficient Language Model Decoding](https://arxiv.org/pdf/2508.08446) by Cornell University\n\n* Software tools AI\n- [ ] [vLLM arch](https://www.aleksagordic.com/blog/vllm): architecture of the leading LLM serving engine.\n\nInsights:\n\n- [ ] [MinMax M2 using Full Attention](https://x.com/zpysky1125/status/1983383094607347992): why full attention is better than masked attention?\n\nPractical:\n\n- [ ] [CUDA Compiler 
\u0026 PTX](https://blog.alpindale.net/posts/top_k_cuda/) with example\n- [ ] [CUTLASS](https://www.kapilsharma.dev/posts/learn-cutlass-the-hard-way/): template library to code in CUDA easily\n- [ ] [Matrix transpose using CUTLASS](https://research.colfax-intl.com/tutorial-matrix-transpose-in-cutlass/)\n- [ ] [SGLang inference engine architecture](https://github.com/sgl-project/sgl-learning-materials/blob/main/slides/lmsys_1st_meetup_sglang.pdf)\n- [ ] [FlexAttention using CuTe DSL](https://research.colfax-intl.com/a-users-guide-to-flexattention-in-flash-attention-cute-dsl/)\n- [ ] [MatMul using WGMMA](https://research.colfax-intl.com/cutlass-tutorial-wgmma-hopper/), [GEMM with pipelining in CUTLASS](https://research.colfax-intl.com/cutlass-tutorial-design-of-a-gemm-kernel/)\n\n\n## MAYBE guides you may go through\n\n- [ ] [Scaling a model](https://jax-ml.github.io/scaling-book/) by JAX (Google) (rating 7/10)\n- [ ] [Smol training playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook#introduction) by HuggingFace to train LLMs\n- [ ] [GPU Gems 3](https://developer.nvidia.com/gpugems/gpugems3): if you want to dive deep into GPU programming\n- [ ] (blog) [OpenVINO optimizations and engineering](https://blog.openvino.ai/) by Intel\n- [ ] (blog) [Engineering posts by Colfax Research](https://research.colfax-intl.com/blog/)\n- [ ] (blog) [GPU MODE lecture notes](https://christianjmills.com/series/notes/cuda-mode-notes.html)\n- [ ] (blog) [Connectionism: Thinking Machines Lab blog](https://thinkingmachines.ai/blog/): AI startup. Founded by Mira Murati, former CTO at OpenAI. 
Solved nondeterminism problem in LLM.\n\n\n## Want to contribute in leading AI open-source projects?\n\nGet started in these:\n\n- [ ] [SGLang](https://github.com/sgl-project/sglang): LLM serving engine originally from UC Berkeley.\n- [ ] [vLLM](https://github.com/vllm-project/vllm): LLM inference engine originally from UC Berkeley.\n- [ ] [PyTorch](https://github.com/pytorch/pytorch): Leading AI framework by Meta\n- [ ] [TensorFlow](https://github.com/tensorflow/tensorflow): AI framework by Google\n- [ ] [TensorRT](https://github.com/NVIDIA/TensorRT): High performance inference library by NVIDIA\n- [ ] [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM): LLM inference library by NVIDIA\n- [ ] [NCCL](https://github.com/NVIDIA/nccl): High performance GPU communication library by NVIDIA\n- [ ] See other [NVIDIA libraries](https://github.com/orgs/NVIDIA/repositories?language=\u0026q=\u0026sort=\u0026type=all).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadichat%2Fai-cheatsheet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadichat%2Fai-cheatsheet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadichat%2Fai-cheatsheet/lists"}