## ML Systems Onboarding Reading List

This is a reading list of papers, videos, and repos I've personally found useful while ramping up on ML Systems, and that I wish more people would just sit and study carefully during their work hours. If you're looking for more recommendations, go through the citations of the papers below and enjoy!

[Conferences](conferences.md) where MLSys papers get published

## Attention Mechanism
* [Attention is all you need](https://arxiv.org/abs/1706.03762): Start here; still one of the best intros
* [Online normalizer calculation for softmax](https://arxiv.org/abs/1805.02867): A must-read before reading Flash Attention;
it will help you get the main "trick"
* [Self Attention Does Not Need O(n^2) Memory](https://arxiv.org/abs/2112.05682)
* [Flash Attention 2](https://arxiv.org/abs/2307.08691): The diagrams here do a better job of explaining Flash Attention 1 as well
* [Llama 2 paper](https://arxiv.org/abs/2307.09288): Skim it for the model details
* [gpt-fast](https://github.com/pytorch-labs/gpt-fast): A great repo to come back to for minimal yet performant code
* [Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation](https://arxiv.org/abs/2108.12409): There are tons of papers on long context lengths, but I found this to be among the clearest
* Google the different kinds of attention: cosine, dot product, cross, local, sparse, convolutional

## Performance Optimizations
* [Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems](https://arxiv.org/abs/2312.15234): Wonderful survey; start here
* [Efficiently Scaling Transformer Inference](https://arxiv.org/abs/2211.05102): Introduced many ideas, most notably KV caches
* [Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html): One of the best intros to fusions and overhead
* [Fast Inference from Transformers via Speculative Decoding](https://arxiv.org/abs/2211.17192): This is the paper that helped me grok the difference in performance characteristics between prefill and autoregressive decoding
* [Group Query Attention](https://arxiv.org/pdf/2305.13245): KV caches can be chunky; this is how you fix it
* [Orca: A Distributed Serving System for Transformer-Based Generative Models](https://www.usenix.org/conference/osdi22/presentation/yu): Introduced continuous batching (great pre-read for the PagedAttention paper)
* [Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180): The most crucial optimization for high-throughput batch inference
* [Colfax Research Blog](https://research.colfax-intl.com/blog/): Excellent blog if you're interested in learning more about CUTLASS and modern GPU programming
* [Sarathi LLM](https://arxiv.org/abs/2308.16369): Introduces chunked prefill to make workloads more balanced between prefill and decode
* [Epilogue Visitor Tree](https://dl.acm.org/doi/10.1145/3620666.3651369): Fuse custom epilogues by adding more epilogues to the same class (visitor design pattern) and represent the whole epilogue as a tree

## Quantization
* [A White Paper on Neural Network Quantization](https://arxiv.org/abs/2106.08295): Start here; this will give you the foundation to quickly skim all the other papers
* [LLM.int8()](https://arxiv.org/abs/2208.07339): All of Dettmers' papers are great, but this is a natural intro
* [FP8 formats for deep learning](https://arxiv.org/abs/2209.05433): For a first-hand look at how new number formats come about
* [SmoothQuant](https://arxiv.org/abs/2211.10438): Balancing rounding errors between weights and activations
* [Mixed precision training](https://arxiv.org/abs/1710.03740): The OG paper describing mixed-precision training strategies for half precision

## Long context length
* [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864): The paper that introduced rotary positional embeddings
* [YaRN: Efficient Context Window Extension of Large Language Models](https://arxiv.org/abs/2309.00071): Extend base model context lengths with finetuning
* [Ring Attention with Blockwise Transformers for Near-Infinite Context](https://arxiv.org/abs/2310.01889): Scale to infinite context lengths as long as you can stack more GPUs

## Sparsity
* [VENOM](https://arxiv.org/pdf/2310.02065): Vectorized N:M format for sparse tensor cores when hardware only supports 2:4
* [MegaBlocks](https://arxiv.org/pdf/2211.15841): Efficient sparse training with mixture of experts
* [ReLU Strikes Back](https://openreview.net/pdf?id=osoWxY8q2E): Really enjoyed
this paper as an example of doing model surgery for more efficient inference

## Distributed
* [Singularity](https://arxiv.org/abs/2202.07848): Shows how to make jobs preemptible, migratable, and elastic
* [Local SGD](https://arxiv.org/abs/1805.09767): So hot right now
* [OpenDiLoCo](https://arxiv.org/abs/2407.07852): Asynchronous training for decentralized training
* [torchtitan](https://arxiv.org/abs/2410.06511): Minimal repository showing how to implement 4D parallelism in pure PyTorch
* [PipeDream](https://arxiv.org/abs/1806.03377): The pipeline parallelism paper
* [Just-in-time checkpointing](https://dl.acm.org/doi/pdf/10.1145/3627703.3650085): A very clever alternative to periodic checkpointing
* [Reducing Activation Recomputation in Large Transformer Models](https://arxiv.org/abs/2205.05198): The paper that introduced selective activation checkpointing and goes over activation recomputation strategies
* [Breaking the Computation and Communication Abstraction Barrier](https://arxiv.org/abs/2105.05720): God-tier paper that goes over research at the intersection of distributed computing and compilers to maximize comms overlap
* [ZeRO: Memory Optimizations Toward Training Trillion Parameter Models](https://arxiv.org/abs/1910.02054): The ZeRO algorithm behind FSDP and DeepSpeed, intelligently reducing memory usage for data parallelism
* [Megatron-LM](https://arxiv.org/abs/1909.08053): For an introduction to tensor parallelism
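The "main trick" from the online softmax paper in the Attention Mechanism section — maintaining a running max and rescaling the running normalizer in a single pass, which is what makes Flash Attention's tiling possible — can be sketched in a few lines. This is a minimal pure-Python illustration, not code from the paper; the function name is mine:

```python
import math

def online_softmax(xs):
    """Single-pass softmax: instead of one pass to find the max and a
    second to sum the exponentials, track a running max m and rescale
    the running normalizer d whenever m changes."""
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running normalizer: sum of exp(x_i - m)
    for x in xs:
        m_new = max(m, x)
        # rescale the old sum to the new max, then add the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

The rescale step `d * exp(m - m_new)` is exactly the correction Flash Attention applies to partial accumulators when a new tile raises the running max.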
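The KV cache idea highlighted in the Performance Optimizations section (from the "Efficiently Scaling Transformer Inference" paper) is simple to state: during autoregressive decoding, past tokens' keys and values never change, so append them once and reuse them every step instead of recomputing. A toy pure-Python sketch under my own naming, with lists standing in for tensors:

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    probs = [w / z for w in ws]
    dim = len(values[0])
    return [sum(p * v[i] for p, v in zip(probs, values)) for i in range(dim)]

class KVCache:
    """Each decode step appends the new token's key/value and attends
    over everything cached so far -- past K/V are never recomputed."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)
```

This is also why decode is memory-bandwidth-bound: each step streams the whole cache through the attention computation, which is the problem Group Query Attention and PagedAttention attack from different angles.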
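For the Quantization section, the foundational scheme the white paper builds on is symmetric absmax quantization: scale a tensor so its largest magnitude maps to the int8 extreme, then round. A minimal sketch with hypothetical function names, deliberately ignoring the per-channel and outlier handling that papers like LLM.int8() add on top:

```python
def quantize_absmax(xs):
    """Symmetric absmax int8 quantization: the largest-magnitude value
    maps to +/-127; everything else rounds to the nearest step."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most half a step (scale/2)."""
    return [qi * scale for qi in q]
```

The scale/2 worst-case rounding error is why a single outlier is so damaging: it inflates `scale` and coarsens every other value's grid, the failure mode LLM.int8() and SmoothQuant both address.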
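The rotary positional embeddings from the RoFormer paper in the Long context length section encode position by rotating consecutive dimension pairs of a query or key, with a frequency that decays across pairs. A pure-Python sketch (my own simplification; real implementations work on batched tensors and often use the half-split layout rather than adjacent pairs):

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotary positional embedding: rotate each consecutive pair of
    dimensions (x, y) by an angle pos * base^(-i/d), where i is the
    pair's starting index and d the vector dimension."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out
```

The property that makes this work for attention is that the dot product of two rotated vectors depends only on the *relative* offset between their positions, which is what methods like YaRN then stretch to extend the context window.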