{"id":43511,"url":"https://github.com/lambda7xx/awesome-AI-system","name":"awesome-AI-system","description":"paper and its code for AI System","projects_count":335,"last_synced_at":"2026-06-08T12:00:34.934Z","repository":{"id":152972786,"uuid":"598379257","full_name":"lambda7xx/awesome-AI-system","owner":"lambda7xx","description":"paper and its code for AI System","archived":false,"fork":false,"pushed_at":"2026-05-14T06:08:19.000Z","size":1030,"stargazers_count":362,"open_issues_count":2,"forks_count":23,"subscribers_count":11,"default_branch":"main","last_synced_at":"2026-05-22T22:24:58.538Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lambda7xx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2023-02-07T01:27:45.000Z","updated_at":"2026-05-14T06:08:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"3ff78e8f-e97f-4580-9356-f4be7706291a","html_url":"https://github.com/lambda7xx/awesome-AI-system","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lambda7xx/awesome-AI-system","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lambda7xx%2Fawesome-AI-system","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lambda7xx%2Fawesome-AI-system/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lambda7xx%2Fawesome-AI-s
ystem/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lambda7xx%2Fawesome-AI-system/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lambda7xx","download_url":"https://codeload.github.com/lambda7xx/awesome-AI-system/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lambda7xx%2Fawesome-AI-system/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34061123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"created_at":"2024-01-13T21:18:40.344Z","updated_at":"2026-06-08T12:00:34.934Z","primary_language":null,"list_of_lists":false,"displayable":true,"categories":["Paper-Code"],"sub_categories":["Serving-Inference","LLM Serving","Communication","GNN","LLM Serving Framework","Schedule and Resource Management","LLM FineTune","Fancy LLM","Framework","Parallellism Training","LoRA","Researcher","Training","MoE","GPU Cluster Management","Optimization","Fine-Tune","Energy","Misc","LLM Evaluation Platform","LLM Inference (System Side)","Attention","LLM Robustness and Debugging"],"readme":"# Awesome AI System\n\nThis repo is motivated by [awesome tensor compilers](https://github.com/merrymercy/awesome-tensor-compilers.git).\n## Contents\n\n- 
[Paper-Code](#paper-code)\n  - [Researcher](#researcher)\n  - [LLM Serving Framework](#llm-serving-framework)\n  - [LLM Evaluation Platform](#llm-evaluation-platform)\n  - [LLM Robustness and Debugging](#llm-robustness-and-debugging)\n  - [LLM Inference (System Side)](#llm-inference-system-side)\n  - [Compiler](#compiler)\n  - [Attention](#attention)\n  - [RAG And ANNS](#rag-and-anns)\n  - [RLHF](#rlhf)\n  - [Video](#video)\n  - [LLM Inference (AI Side)](#llm-inference-ai-side)\n  - [LLM MoE](#llm-moe)\n  - [LoRA](#lora)\n  - [Framework](#framework)\n  - [Parallellism Training](#parallellism-training)\n  - [Training](#training)\n  - [Communication](#communication)\n  - [Serving-Inference](#Serving-Inference)\n  - [MoE](#MoE)\n  - [GPU Cluster Management](#gpu-cluster-management)\n  - [Schedule and Resource Management](#schedule)\n  - [Optimization](#optimzation)\n  - [GNN](#GNN)\n  - [Fine-Tune](#Fine-Tune)\n  - [Energy](#energy)\n  - [Misc](#Misc)\n- [Contribute](#contribute)\n\n\n## Paper-Code\n\n### Researcher \n\n\n\n| Name | University | Homepage | \n|:-----:|:-----:|:-----:|\n| Ion Stoica | UC Berkeley | [![Website](https://img.shields.io/badge/Website-9cf)](https://people.eecs.berkeley.edu/~istoica/) |\n| Joseph E. 
Gonzalez | UC Berkeley | [![Website](https://img.shields.io/badge/Website-9cf)](https://people.eecs.berkeley.edu/~jegonzal/) |\n| Matei Zaharia | UC Berkeley | [![Website](https://img.shields.io/badge/Website-9cf)](https://people.eecs.berkeley.edu/~matei/) |\n| Zhihao Jia| CMU | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.cs.cmu.edu/~zhihaoj2/) |\n| Tianqi Chen| CMU | [![Website](https://img.shields.io/badge/Website-9cf)](https://tqchen.com/) |\n| Stephanie Wang | UW | [![Website](https://img.shields.io/badge/Website-9cf)](https://stephanie-wang.github.io/) |\n| Xingda Wei| SJTU | [![Website](https://img.shields.io/badge/Website-9cf)](https://ipads.se.sjtu.edu.cn/pub/members/xingda_wei) |\n| Zeyu Mi| SJTU | [![Website](https://img.shields.io/badge/Website-9cf)](https://ipads.se.sjtu.edu.cn/pub/members/zeyu_mi) |\n| Xin Jin | PKU | [![Website](https://img.shields.io/badge/Website-9cf)](https://xinjin.github.io/) |\n| Harry Xu | UCLA | [![Website](https://img.shields.io/badge/Website-9cf)](https://web.cs.ucla.edu/~harryxu/) |\n| Anand Iyer | Georgia Tech | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.anand-iyer.com/) |\n| Ravi Netravali| Princeton | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.cs.princeton.edu/~ravian/) |\n| Christos Kozyrakis | Stanford | [![Website](https://img.shields.io/badge/Website-9cf)](https://web.stanford.edu/~kozyraki/) |\n| Christopher Ré | Stanford | [![Website](https://img.shields.io/badge/Website-9cf)](https://cs.stanford.edu/people/chrismre/) |\n| Tri Dao| Princeton | [![Website](https://img.shields.io/badge/Website-9cf)](https://tridao.me/) |\n| Mosharaf Chowdhury| UMich | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.mosharaf.com/) |\n| Shivaram Venkataraman| Wisc | [![Website](https://img.shields.io/badge/Website-9cf)](https://shivaram.org/) |\n| Hao Zhang| UCSD | 
[![Website](https://img.shields.io/badge/Website-9cf)](https://cseweb.ucsd.edu/~haozhang/) |\n| Yiying Zhang| UCSD | [![Website](https://img.shields.io/badge/Website-9cf)](https://cseweb.ucsd.edu/~yiying/) |\n| Ana Klimovic | ETH | [![Website](https://img.shields.io/badge/Website-9cf)](https://anakli.inf.ethz.ch/) |\n| Fan Lai | UIUC | [![Website](https://img.shields.io/badge/Website-9cf)](https://www.fanlai.me/) |\n| Lianmin Zheng | UC Berkeley | [![Website](https://img.shields.io/badge/Website-9cf)](https://lmzheng.net/) |\n| Ying Sheng  | Stanford | [![Website](https://img.shields.io/badge/Website-9cf)](https://sites.google.com/view/yingsheng/) |\n| Zhuohan Li | UC Berkeley | [![Website](https://img.shields.io/badge/Website-9cf)](https://people.eecs.berkeley.edu/~zhuohan/) |\n| Woosuk Kwon| UC Berkeley | [![Website](https://img.shields.io/badge/Website-9cf)](https://woosuk.me/) |\n| Zihao Ye | University of Washington  | [![Website](https://img.shields.io/badge/Website-9cf)](https://homes.cs.washington.edu/~zhye/) |\n| Amey Agrawal | Georgia Tech | [![Website](https://img.shields.io/badge/Website-9cf)](https://ameya.info/) |\n\n\n### LLM Serving Framework\n\n| Title | Github|\n|:-----:|:-----:|\n| MLC LLM| [![Star](https://img.shields.io/github/stars/mlc-ai/mlc-llm.svg)](https://github.com/mlc-ai/mlc-llm/) |\n| TensorRT-LLM | [![Star](https://img.shields.io/github/stars/NVIDIA/TensorRT-LLM.svg)](https://github.com/NVIDIA/TensorRT-LLM.git) |\n| xFasterTransformer |  [![Star](https://img.shields.io/github/stars/intel/xFasterTransformer.svg)](https://github.com/intel/xFasterTransformer)|\n| CTranslate2(low latency) | [![Star](https://img.shields.io/github/stars/OpenNMT/CTranslate2.svg)](https://github.com/OpenNMT/CTranslate2.git)|\n| llama2.c| [![Star](https://img.shields.io/github/stars/karpathy/llama2.c.svg)](https://github.com/karpathy/llama2.c) |\n\n\n### LLM Evaluation Platform\n\n| Title | Github| Website\n|:-----:|:-----:|:-----:|\n| FastChat | 
[![Star](https://img.shields.io/github/stars/lm-sys/FastChat.svg)](https://github.com/lm-sys/FastChat.git)| [![Website](https://img.shields.io/badge/Website-9cf)](https://chat.lmsys.org/) |\n\n### LLM Robustness and Debugging\n\n| Title | Paper | Github | Pub. \u0026 Date |\n|:-----:|:-----:|:------:|:-----------:|\n| WFGY 1.0: Self-healing LLM Systems Framework | [![DOI](https://img.shields.io/badge/DOI-10.6084%2Fm9.figshare.30338884-9cf)](https://doi.org/10.6084/m9.figshare.30338884) \u003cbr\u003e [PDF](https://github.com/onestardao/WFGY/blob/main/I_am_not_lizardman/WFGY_All_Principles_Return_to_One_v1.0_PSBigBig_Public.pdf) | [![Star](https://img.shields.io/github/stars/onestardao/WFGY.svg)](https://github.com/onestardao/WFGY) | Tech report, Oct 13 2025 |\n\n\n### LLM Inference (System Side)\n| Title | Paper | Github| WebSite | Pub. \u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| SageServe: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2502.14617) | [![Star](https://img.shields.io/github/stars/shashwatj07/SageServe)](https://github.com/shashwatj07/SageServe) | - | SIGMETRICS'26|\n| HydraServe: Minimizing Cold Start Latency for Serverless LLM Serving in Public Clouds | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/conference/nsdi26/presentation/lou) | [![Star](https://img.shields.io/github/stars/LLMServe/hydraserve)](https://github.com/LLMServe/hydraserve) | - | NSDI'26|\n| BulletServe: Boosting LLM Serving through Spatial-Temporal GPU Resource Sharing | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2504.19516) | [![Star](https://img.shields.io/github/stars/zejia-lin/BulletServe)](https://github.com/zejia-lin/BulletServe) | - | ASPLOS'26|\n| Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market | 
[![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://ennanzhai.github.io/pub/sosp25-aegaeon.pdf) |  | - | SOSP'25|\n| DynaPipe: Dynamic Layer Redistribution for Efficient Serving of LLMs with Pipeline Parallelism| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://openreview.net/pdf?id=D6w7wIN360) | [![Star](https://img.shields.io/github/stars/xhx1022/DynaPipe)](https://github.com/xhx1022/DynaPipe) | - | NeurIPS'25|\n| DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/pdf/10.1145/3731569.3764810) | [![Star](https://img.shields.io/github/stars/zyqCSL/DiffKV.svg)](https://github.com/zyqCSL/DiffKV.git) | - | SOSP'25|\n| Pie: A Programmable Serving System for Emerging LLM Applications | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://pie-project.org/assets/files/gim2025pie-205fb6aa1c1b3c9e172dd1db182db8e5.pdf) | [![Star](https://img.shields.io/github/stars/pie-project/pie.svg)](https://github.com/pie-project/pie.git) | - | SOSP'25|\n| KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://madsys.cs.tsinghua.edu.cn/publication/ktransformers-unleashing-the-full-potential-of-cpu/gpu-hybrid-inference-for-moe-models/SOSP25-chen.pdf) | [![Star](https://img.shields.io/github/stars/kvcache-ai/ktransformers.svg)](https://github.com/kvcache-ai/ktransformers) | - | SOSP'25|\n| XSched: Preemptive Scheduling for Diverse XPUs| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/osdi25-shen-weihang.pdf) | [![Star](https://img.shields.io/github/stars/XpuOS/xsched.svg)](https://github.com/XpuOS/xsched.git) | - | OSDI 25|\n| TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference| 
[![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2505.11329) | [![Star](https://img.shields.io/github/stars/microsoft/tokenweave.svg)](https://github.com/microsoft/tokenweave.git) | - | Arxiv 25|\n|  ServeGen: Workload Characterization and Generation of Large Language Model Serving in Production| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2505.09999) | [![Star](https://img.shields.io/github/stars/alibaba/ServeGen.svg)](https://github.com/alibaba/ServeGen.git) | - | Arxiv 25|\n|  Resource Multiplexing in Tuning and Serving Large Language Models | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/atc25-he-yongjun.pdf) | [![Star](https://img.shields.io/github/stars/llm-db/llmstation.svg)](https://github.com/llm-db/llmstation) | - | ATC'25|\n|  RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2505.02922) | [![Star](https://img.shields.io/github/stars/microsoft/RetrievalAttention.svg)](https://github.com/microsoft/RetrievalAttention.git) | - | Arxiv May 2025 |\n|  SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2504.08850) | [![Star](https://img.shields.io/github/stars/infinigence/SpecEE.svg)](https://github.com/infinigence/SpecEE) | - | ISCA'25 |\n|   LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/pdf/10.1145/3695053.3731092) | [![Star](https://img.shields.io/github/stars/hyungyokim/LIA_AMXGPU.svg)](https://github.com/hyungyokim/LIA_AMXGPU) | - | ISCA'25 |\n|  Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving | 
[![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2504.07494) | [![Star](https://img.shields.io/github/stars/eddiegaoo/Apt-Serve.svg)](https://github.com/eddiegaoo/Apt-Serve) | - | SIGMOD'25 |\n|  Marconi: Prefix Caching for the Era of Hybrid LLMs | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2411.19379) | [![Star](https://img.shields.io/github/stars/ruipeterpan/marconi.svg)](https://github.com/ruipeterpan/marconi) | - | MLSys'25 |\n|  SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/10.1145/3689031.3717481) | [![Star](https://img.shields.io/github/stars/MachineLearningSystem/25Eurosys-SpInfer.svg)](https://github.com/MachineLearningSystem/25Eurosys-SpInfer) | - | Eurosys'25 Best Paper |\n|  NeuStream: Bridging Deep Learning Serving and Stream Processing | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/10.1145/3689031.3717489) | [![Star](https://img.shields.io/github/stars/MachineLearningSystem/25Eurosys-NeuStream-AE.svg)](https://github.com/MachineLearningSystem/25Eurosys-NeuStream-AE) | - | Eurosys'25 |\n|  Towards End-to-End Optimization of LLM-based Applications with Ayo | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2407.00326) | [![Star](https://img.shields.io/github/stars/MachineLearningSystem/25ASPLOS-Ayo.svg)](https://github.com/MachineLearningSystem/25ASPLOS-Ayo) | - | ASPLOS'25 |\n|  NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.01142) | [![Star](https://img.shields.io/github/stars/MachineLearningSystem/25MLSYS-NEO.svg)](https://github.com/MachineLearningSystem/25MLSYS-NEO) | - | MLSYS'25 |\n| CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion| 
[![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2405.16444) | [![Star](https://img.shields.io/github/stars/YaoJiayi/CacheBlend.svg)](https://github.com/YaoJiayi/CacheBlend) | - | Eurosys'25 Best Paper|\n| Helix: Serving Large Language Models over Heterogeneous GPUs and Network via Max-Flow | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.01566) | [![Star](https://img.shields.io/github/stars/Thesys-lab/Helix-ASPLOS25.svg)](https://github.com/Thesys-lab/Helix-ASPLOS25.git) | - | ASPLOS'25 |\n|GLINTHAWK: A Two-Tiered Architecture for High-Throughput LLM Inference  | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2501.11779) | [![Star](https://img.shields.io/github/stars/microsoft/glinthawk.svg)](https://github.com/microsoft/glinthawk) | - | Arxiv'25,Jan |\n| Queue Management for SLO-Oriented Large Language Model Serving | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://haoran-qiu.com/pdf/socc24-qlm.pdf) | [![Star](https://img.shields.io/github/stars/QLM-project/QLM.svg)](https://github.com/QLM-project/QLM.git) | - | SOCC'24 |\n|NanoFlow: Towards Optimal Large Language Model Serving Throughput | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2408.12757) | [![Star](https://img.shields.io/github/stars/efeslab/Nanoflow.svg)](https://github.com/efeslab/Nanoflow.git) | - | OSDI'25 |\n| PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://ipads.se.sjtu.edu.cn/_media/publications/powerinfer-20231219.pdf) | [![Star](https://img.shields.io/github/stars/SJTU-IPADS/PowerInfer.svg)](https://github.com/SJTU-IPADS/PowerInfer) | - | SOSP'24 |\n|LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2404.09526) | 
[![Star](https://img.shields.io/github/stars/LoongServe/LoongServe.svg)](https://github.com/LoongServe/LoongServe) | - | SOSP'24 |\n|Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2403.09054) | [![Star](https://img.shields.io/github/stars/d-matrix-ai/keyformer-llm.svg)](https://github.com/d-matrix-ai/keyformer-llm) | - | MLSYS'24 |\n|LLMCompass: Enabling Efficient Hardware Design for Large Language Model Inference | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2308.12066) | [![Star](https://img.shields.io/github/stars/PrincetonUniversity/LLMCompass.svg)](https://github.com/PrincetonUniversity/LLMCompass) | - | ISCA'24 |\n|Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2308.12066) | [![Star](https://img.shields.io/github/stars/ranggihwang/Pregated_MoE.svg)](https://github.com/ranggihwang/Pregated_MoE) | - | ISCA'24 |\n|Prompt Cache: Modular Attention Reuse for Low-Latency Inference| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2311.04934) | [![Star](https://img.shields.io/github/stars/yale-sys/prompt-cache.svg)](https://github.com/yale-sys/prompt-cache) | - | MLSYS'24 |\n|Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2403.02310) | [![Star](https://img.shields.io/github/stars/microsoft/sarathi-serve.svg)](https://github.com/microsoft/sarathi-serve) | - | OSDI'24 |\n| DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2401.09670) | 
[![Star](https://img.shields.io/github/stars/LLMServe/DistServe.svg)](https://github.com/LLMServe/DistServe) | - | OSDI'24 |\n| Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2407.00079) | [![Star](https://img.shields.io/github/stars/kvcache-ai/Mooncake.svg)](https://github.com/kvcache-ai/Mooncake.git) | - | July'24 |\n|Llumnix: Dynamic Scheduling for Large Language Model Serving | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.03243) | [![Star](https://img.shields.io/github/stars/AlibabaPAI/llumnix.svg)](https://github.com/AlibabaPAI/llumnix/tree/osdi24ae) | - | OSDI'24 |\n| Parrot: Efficient Serving of LLM-based Application with Semantic Variables| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2310.07240) | [![Star](https://img.shields.io/github/stars/MachineLearningSystem/24OSDI-ParrotServe.svg)](https://github.com/MachineLearningSystem/24OSDI-ParrotServe) | - | OSDI'24 |\n| CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2310.07240) | [![Star](https://img.shields.io/github/stars/UChi-JCL/CacheGen.svg)](https://github.com/UChi-JCL/CacheGen) | - | SIGCOMM'24 |\n| Efficiently Programming Large Language Models using SGLang| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2312.07104) | [![Star](https://img.shields.io/github/stars/sgl-project/sglang.svg)](https://github.com/sgl-project/sglang.git) | - | Jan, 2024 |\n| Efficient Memory Management for Large Language Model Serving with PagedAttention| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2309.06180.pdf) | [![Star](https://img.shields.io/github/stars/vllm-project/vllm.svg)](https://github.com/vllm-project/vllm.git) | - | SOSP'23 |\n| SpecInfer: 
Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2305.09781.pdf) | [![Star](https://img.shields.io/github/stars/flexflow/FlexFlow.svg)](https://github.com/flexflow/FlexFlow) | - | Dec,2023 |\n|Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference| - | [![Star](https://img.shields.io/github/stars/MachineLearningSystem/24PPOPP-Liger.svg)](https://github.com/MachineLearningSystem/24PPOPP-Liger) | - | PPOPP'24 |\n|Efficiently Programming Large Language Models using SGLang| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2312.07104.pdf)| [![Star](https://img.shields.io/github/stars/sgl-project/sglang.svg)](https://github.com/sgl-project/sglang.git) | - | NeurIPS'24 | \n| Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.vldb.org/pvldb/vol17/p211-xia.pdf) | [![Star](https://img.shields.io/github/stars/AlibabaResearch/flash-llm.svg)](https://github.com/AlibabaResearch/flash-llm) | - | VLDB'24 |\n\n### Compiler\n| Title | Paper | Github| WebSite | Pub. 
\u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/pdf/10.1145/3731569.3764798) | [![Star](https://img.shields.io/github/stars/ChandlerGuan/mercury_artifact.svg)](https://github.com/ChandlerGuan/mercury_artifact.git) | - | SOSP'25|\n| Mirage: A Multi-Level Superoptimizer  for Tensor Programs| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/osdi25-wu-mengdi.pdf) | [![Star](https://img.shields.io/github/stars/mirage-project/mirage.svg)](https://github.com/mirage-project/mirage.git) | - | OSDI'25|\n\n### Attention\n| Title | Paper | Github| WebSite | Pub. \u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| UltraAttn: Efficiently Parallelizing Attention through Hierarchical Context-Tiling | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/pdf/10.1145/3712285.3759894) | [![Star](https://img.shields.io/github/stars/oliverYoung2001/UltraAttn.svg)](https://github.com/oliverYoung2001/UltraAttn.git) | - | SC'25|\n| TASP: Topology-aware Sequence Parallelism | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2509.26541) | [![Star](https://img.shields.io/github/stars/infinigence/HamiltonAttention.svg)](https://github.com/infinigence/HamiltonAttention.git) | - | Arxiv'25|\n| Ring Attn | | [![Star](https://img.shields.io/github/stars/zhuzilin/ring-flash-attention.svg)](https://github.com/zhuzilin/ring-flash-attention.git) | - | |\n\n\n\n### RAG And ANNS\n| Title | Paper | Github| WebSite | Pub. 
\u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/pdf/10.1145/3731569.3764806) | [![Star](https://img.shields.io/github/stars/Leo9660/HedraRAG_AE.svg)](https://github.com/Leo9660/HedraRAG_AE.git) | - | SOSP'25|\n| LEANN: A Low-Storage Vector Index | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2506.08276) | [![Star](https://img.shields.io/github/stars/yichuan-w/LEANN.svg)](https://github.com/yichuan-w/LEANN.git) | - | Arxiv 25 |\n| OdinANN: Direct Insert for Consistently Stable Performance in Billion-Scale Graph-Based Vector Search | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/osdi25-guo.pdf) | [![Star](https://img.shields.io/github/stars/thustorage/PipeANN.svg)](https://github.com/thustorage/PipeANN) | - | FAST'26 |\n| Achieving Low-Latency Graph-Based Vector Search via Aligning Best-First Search Algorithm with SSD | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/osdi25-guo.pdf) | [![Star](https://img.shields.io/github/stars/thustorage/PipeANN.svg)](https://github.com/thustorage/PipeANN) | - | OSDI'25 |\n| Quake: Adaptive Indexing for Vector Search| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/osdi25-mohoney.pdf) | [![Star](https://img.shields.io/github/stars/marius-team/quake.svg)](https://github.com/marius-team/quake) | - | OSDI'25 |\n| Hermes: Algorithm-System Co-design for Efficient Retrieval Augmented Generation At-Scale | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/pdf/10.1145/3695053.3731076) | [![Star](https://img.shields.io/github/stars/S4AI-CornellTech/Hermes.svg)](https://github.com/S4AI-CornellTech/Hermes) | - | ISCA'25 |\n| PathWeaver: A High-Throughput Multi-GPU 
System for Graph-Based Approximate Nearest Neighbor Search | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/atc25-kim.pdf) | [![Star](https://img.shields.io/github/stars/AIS-SNU/PathWeaver.svg)](https://github.com/AIS-SNU/PathWeaver.git) | - | ATC'25 |\n| In-Storage Acceleration of Retrieval Augmented Generation as a Service | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/pdf/10.1145/3695053.3731032) | [![Star](https://img.shields.io/github/stars/he-actlab/ragx.svg)](https://github.com/he-actlab/ragx) | - | ISCA'25 |\n| RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2503.14649) | [![Star](https://img.shields.io/github/stars/google/rago.svg)](https://github.com/google/rago.git) | - | ISCA'25 |\n\n### RLHF\n| Title | Paper | Github| WebSite | Pub. \u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| Optimizing RLHF Training for Large Language Models with Stage Fusion | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2409.13221) | [![Star](https://img.shields.io/github/stars/FlexFusion/FlexFusion.svg)](https://github.com/FlexFusion/FlexFusion.git) | - | NSDI'25 |\n| HybridFlow: A Flexible and Efficient RLHF Framework | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2409.19256v2) | [![Star](https://img.shields.io/github/stars/volcengine/verl.svg)](https://github.com/volcengine/verl.git) | - | Eurosys'25 |\n| ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2406.14088) | [![Star](https://img.shields.io/github/stars/openpsi-project/ReaLHF.svg)](https://github.com/openpsi-project/ReaLHF.git)| - | June. 
2024 |\n| OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2405.11143) | [![Star](https://img.shields.io/github/stars/OpenRLHF/OpenRLHF.svg)](https://github.com/OpenRLHF/OpenRLHF)| - | May. 2024 |\n\n### Video\n| Title | Paper | Github| WebSite | Pub. \u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| Katz: Efficient Workflow Serving for Diffusion Models with Many Adapters | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/atc25-li-suyi-katz.pdf) | [![Star](https://img.shields.io/github/stars/modelscope/Katz.svg)](https://github.com/modelscope/Katz) | - | ATC'25 |\n| PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/system/files/atc25-kong.pdf) | [![Star](https://img.shields.io/github/stars/JonnyKong/PPipe.svg)](https://github.com/JonnyKong/PPipe.git) | - | Nov. 2024 |\n| xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.01738) | [![Star](https://img.shields.io/github/stars/xdit-project/xDiT.svg)](https://github.com/xdit-project/xDiT.git) | - | Nov. 2024 |\n| FastVideo | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2411.01738) | [![Star](https://img.shields.io/github/stars/hao-ai-lab/FastVideo.svg)](https://github.com/hao-ai-lab/FastVideo.git) | - | Dec. 2024 |\n\n\n\n\n### LLM Inference (AI Side)\n| Title | Paper | Github| WebSite | Pub. 
\u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| InferCept: Efficient Intercept Support for Augmented Large Language Model Inference | [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.01869) | [![Star](https://img.shields.io/github/stars/WukLab/InferCept.svg)](https://github.com/WukLab/InferCept) | - | ICML'24 |\n| Online Speculative Decoding| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2310.07177) | [![Star](https://img.shields.io/github/stars/LiuXiaoxuanPKU/OSD.svg)](https://github.com/LiuXiaoxuanPKU/OSD) | - | ICML'24 |\n| MuxServe: Flexible Spatial-Temporal Multiplexing for LLM Serving| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2404.02015) | [![Star](https://img.shields.io/github/stars/EfficientLLMSys/MuxServe.svg)](https://github.com/EfficientLLMSys/MuxServe) | - | ICML'24 |\n| BitDelta: Your Fine-Tune May Only Be Worth One Bit| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.10193) | [![Star](https://img.shields.io/github/stars/FasterDecoding/BitDelta.svg)](https://github.com/FasterDecoding/BitDelta.git) | - | Feb,2024 |\n| Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2401.10774) | [![Star](https://img.shields.io/github/stars/FasterDecoding/Medusa.svg)](https://github.com/FasterDecoding/Medusa.git) | - | Jan,2024 |\n| LLMCompiler: An LLM Compiler for Parallel Function Calling| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2312.04511.pdf) | [![Star](https://img.shields.io/github/stars/SqueezeAILab/LLMCompiler.svg)](https://github.com/SqueezeAILab/LLMCompiler.git) | - | Dec,2023 |\n| Mamba: Linear-Time Sequence Modeling with Selective State Spaces| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2312.00752.pdf) | 
[![Star](https://img.shields.io/github/stars/state-spaces/mamba.svg)](https://github.com/state-spaces/mamba.git) | - | Dec,2023 |\n| Teaching LLMs memory management for unbounded context| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2310.08560) | [![Star](https://img.shields.io/github/stars/cpacker/MemGPT.svg)](https://github.com/cpacker/MemGPT.git) | - | Oct,2023 |\n| Break the Sequential Dependency of LLM Inference Using Lookahead Decoding| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2402.02057) | [![Star](https://img.shields.io/github/stars/hao-ai-lab/LookaheadDecoding.svg)](https://github.com/hao-ai-lab/LookaheadDecoding.git) | - | Feb,2024 |\n| EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2401.15077.pdf) | [![Star](https://img.shields.io/github/stars/SafeAILab/EAGLE.svg)](https://github.com/SafeAILab/EAGLE.git) | - | Jan,2024 |\n\n### LLM MoE\n| Title | Paper | Github| WebSite | Pub. 
\u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2308.12066) | [![Star](https://img.shields.io/github/stars/ranggihwang/Pregated_MoE.svg)](https://github.com/ranggihwang/Pregated_MoE.git) | - | ISCA'24 |\n| SIDA-MOE: SPARSITY-INSPIRED DATA-AWARE SERVING FOR EFFICIENT AND SCALABLE LARGE MIXTURE-OF-EXPERTS MODELS| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2310.18859) | [![Star](https://img.shields.io/github/stars/timlee0212/SiDA-MoE.svg)](https://github.com/timlee0212/SiDA-MoE) | - | MLSYS'24 |\n| ScheMoE: An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://dl.acm.org/doi/10.1145/3627703.3650083) | [![Star](https://img.shields.io/github/stars/Fragile-azalea/ScheMoE.svg)](https://github.com/Fragile-azalea/ScheMoE.git) | - | Eurosys'24 |\n\n\n### LoRA\n\n| Title | Paper | Github| WebSite | Pub. 
\u0026 Date\n|:-----:|:-----:|:-----:|:-----:|:-----:|\n| LoRAFusion: Efficient LoRA Fine-Tuning for LLMs| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/abs/2510.00206) | [![Star](https://img.shields.io/github/stars/CentML/lorafusion)](https://github.com/CentML/lorafusion.git) | - | Eurosys'26 |\n| dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://www.usenix.org/conference/osdi24/presentation/wu-bingyang) | [![Star](https://img.shields.io/github/stars/LLMServe/dLoRA-artifact.svg)](https://github.com/LLMServe/dLoRA-artifact.git) | - | OSDI'24 |\n| S-LoRA: Serving Thousands of Concurrent LoRA Adapters| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2311.03285.pdf) | [![Star](https://img.shields.io/github/stars/S-LoRA/S-LoRA.svg)](https://github.com/S-LoRA/S-LoRA.git) | - | Nov,2023 |\n| Punica: Serving multiple LoRA finetuned LLM as one| [![arXiv](https://img.shields.io/badge/arXiv-b31b1b.svg)](https://arxiv.org/pdf/2310.18547.pdf) | [![Star](https://img.shields.io/github/stars/punica-ai/punica.svg)](https://github.com/punica-ai/punica.git) | - | Oct,2023 |\n\n\n\n\n\n### Framework\n- code [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI'22](https://github.com/alpa-projects/alpa.git) \n\n  paper [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI'22](https://arxiv.org/pdf/2201.12023.pdf)\n\n- code [Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization OSDI'22](https://github.com/flexflow/FlexFlow) \n   \n  paper [Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization OSDI'22](https://www.usenix.org/system/files/osdi22-unger.pdf)\n\n- code [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM 
SC21 ](https://github.com/NVIDIA/Megatron-LM.git) \n\n  paper [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM SC21 ](https://people.eecs.berkeley.edu/~matei/papers/2021/sc_megatron_lm.pdf)\n\n- code [A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters OSDI'20](https://github.com/bytedance/byteps) \n  \n  paper [A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters OSDI'20](https://www.usenix.org/system/files/osdi20-jiang.pdf)\n\n- code [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training ICPP'23](https://github.com/hpcaitech/ColossalAI)\n\n  paper [Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training ICPP'23](https://dl.acm.org/doi/pdf/10.1145/3605573.3605613)\n\n- code [HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework VLDB'22](https://github.com/MachineLearningSystem/Hetu) \n \n  paper [HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework VLDB'22](https://www.vldb.org/pvldb/vol15/p312-miao.pdf)\n\n### Parallelism Training\n- code [zero-bubble-pipeline-parallelism](https://github.com/sail-sg/zero-bubble-pipeline-parallelism)\n  \n  paper [NEAR ZERO BUBBLE PIPELINE PARALLELISM ICLR'24](https://openreview.net/pdf?id=tuzTN0eIO5)\n- code [DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines Eurosys'24](https://github.com/MachineLearningSystem/24Eurosys-optimizing-multitask-training-through-dynamic-pipelines)\n  \n  paper [DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines Eurosys'24](https://assets.amazon.science/33/e5/023653cb46d9abb4baa576c571b3/dynapipe-optimizing-multi-task-training-through-dynamic-pipelines.pdf)\n- code [Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation Eurosys'24](https://github.com/MachineLearningSystem/24Eurosys-Aceso)\n- code [HAP: SPMD DNN 
Training on Heterogeneous GPU Clusters with Automated Program Synthesis Eurosys'24](https://github.com/MachineLearningSystem/24Eurosys-hap)\n  \n  paper [HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis Eurosys'24](https://i.cs.hku.hk/~cwu/papers/swzhang-eurosys24.pdf)\n\n- code [Calculon: A Methodology and Tool for High-Level Co-Design of Systems and Large Language Models SC'23](https://github.com/MachineLearningSystem/23sc-calculon)\n\n  paper [Calculon: A Methodology and Tool for High-Level Co-Design of Systems and Large Language Models SC'23](https://dl.acm.org/doi/pdf/10.1145/3581784.3607102)\n\n- code [PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices  MLSYS'23](https://github.com/MachineLearningSystem/23MLSYS-pipe-fisher)\n\n \n  paper [PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices  MLSYS'23](https://arxiv.org/pdf/2211.14133.pdf)\n\n- code [Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs NSDI'23 ](https://github.com/MachineLearningSystem/bamboo)  \n \n  paper [Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs NSDI'23 ](https://www.usenix.org/system/files/nsdi23-thorpe.pdf)\n\n- code [MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism HPCA'23 ](https://github.com/MachineLearningSystem/HPCA23-mpress) \n \n  paper [MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism HPCA'23 ](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026arnumber=10071077)\n\n- code [Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression ASPLOS'23](https://github.com/MachineLearningSystem/Optimus-CC) \n \n  paper [Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware 
Communication Compression ASPLOS'23](https://arxiv.org/pdf/2301.09830.pdf)\n- code [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI'22](https://github.com/alpa-projects/alpa.git) \n\n  paper [Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI'22](https://www.usenix.org/system/files/osdi22-zheng-lianmin.pdf)\n\n- code [AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness NeurIPS '22 ](https://github.com/MachineLearningSystem/AMP) \n\n  paper [AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness NeurIPS '22 ](https://arxiv.org/pdf/2210.07297.pdf)\n\n- code [Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization OSDI'22](https://github.com/flexflow/FlexFlow) \n\n  paper [Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization OSDI'22](https://www.usenix.org/system/files/osdi22-unger.pdf)\n\n- code [NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism](https://github.com/MachineLearningSystem/naspipe) ASPLOS'22\n\n  paper [NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism](https://dl.acm.org/doi/pdf/10.1145/3503222.3507735)\n- code [Varuna: Scalable, Low-cost Training of Massive Deep Learning Models](https://github.com/MachineLearningSystem/varuna) Eurosys'22 \n\n  paper [Varuna: Scalable, Low-cost Training of Massive Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3492321.3519584)\n\n- code [Chimera: efficiently training large-scale neural networks with bidirectional pipelines SC'21 ](https://github.com/MachineLearningSystem/Chimera) \n \n  paper [Chimera: efficiently training large-scale neural networks with bidirectional pipelines SC'21 
](https://dl.acm.org/doi/pdf/10.1145/3458817.3476145)\n\n- code [Piper: Multidimensional Planner for DNN Parallelization NeurIPS'21](https://github.com/MachineLearningSystem/piper) \n\n  paper [Piper: Multidimensional Planner for DNN Parallelization NeurIPS'21](https://proceedings.neurips.cc/paper_files/paper/2021/file/d01eeca8b24321cd2fe89dd85b9beb51-Paper.pdf)\n\n- code [PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models  ICML'21](https://github.com/MachineLearningSystem/PipeTransformer.git)\n\n  paper [PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models  ICML'21](http://proceedings.mlr.press/v139/he21a/he21a.pdf)\n\n- code [DAPPLE: An Efficient Pipelined Data Parallel Approach for Large Models Training PPOPP'21](https://github.com/MachineLearningSystem/DAPPLE)\n\n  paper [DAPPLE: An Efficient Pipelined Data Parallel Approach for Large Models Training PPOPP'21](https://dl.acm.org/doi/pdf/10.1145/3437801.3441593)\n\n- code [TeraPipe:Large-Scale Language Modeling with Pipeline Parallelism ICML'21 ](https://github.com/MachineLearningSystem/terapipe) \n\n  paper [TeraPipe:Large-Scale Language Modeling with Pipeline Parallelism ICML'21 ](https://danyangzhuo.com/papers/ICML21-TeraPipe.pdf)\n\n- code [PipeDream: Pipeline Parallelism for DNN Training SOSP'19 ](https://github.com/MachineLearningSystem/pipedream.git) \n\n  paper [PipeDream: Pipeline Parallelism for DNN Training SOSP'19 ](https://people.eecs.berkeley.edu/~matei/papers/2019/sosp_pipedream.pdf)\n\n- code [SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient](https://github.com/MachineLearningSystem/swarm)\n \n\n  paper [SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient](https://proceedings.mlr.press/v202/ryabinin23a/ryabinin23a.pdf)\n\n- code [Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation 
Models](https://github.com/MachineLearningSystem/Merak)\n\n  paper [Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026arnumber=10049507)\n\n- [awesome distributed deep learning](https://github.com/MachineLearningSystem/Awesome-Distributed-Deep-Learning.git)\n- [awesome parallelism](https://github.com/MachineLearningSystem/awesome-Auto-Parallelism)\n\n### Training \n\n- code [ModelKeeper: Accelerating DNN Training via Automated Training Warmup NSDI'23](https://github.com/MachineLearningSystem/ModelKeeper) \n\n  paper [ModelKeeper: Accelerating DNN Training via Automated Training Warmup NSDI'23](https://www.usenix.org/system/files/nsdi23-lai-fan.pdf)\n\n\n- code [STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training SC'22](https://github.com/MachineLearningSystem/sc22-ae-big_model) \n\n  paper  [STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training SC'22](https://dl.acm.org/doi/pdf/10.5555/3571885.3571979)\n\n- code [Whale: Efficient Giant Model Training over Heterogeneous GPUs ATC'22 ](https://github.com/MachineLearningSystem/EasyParallelLibrary) \n\n  paper [Whale: Efficient Giant Model Training over Heterogeneous GPUs ATC'22 ](https://www.usenix.org/system/files/atc22-jia-xianyan.pdf)\n\n- code [GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server Eurosys'16](https://github.com/MachineLearningSystem/geeps) \n\n  paper [GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server Eurosys'16](https://www.pdl.cmu.edu/PDL-FTP/CloudComputing/GeePS-cui-eurosys16.pdf)\n\n\n### Communication\n- code [ARK: GPU-driven Code Execution for Distributed Deep Learning NSDI'23](https://github.com/MachineLearningSystem/23NSDI-arkwo)\n\n  paper [ARK: GPU-driven Code Execution for Distributed Deep Learning 
NSDI'23](https://www.usenix.org/system/files/nsdi23-hwang.pdf)\n\n- code [TopoOpt: Optimizing the Network Topology for Distributed DNN Training NSDI'23 ](https://github.com/MachineLearningSystem/TopoOpt) \n\n   paper [TopoOpt: Optimizing the Network Topology for Distributed DNN Training NSDI'23 ](https://www.usenix.org/system/files/nsdi23-wang-weiyang.pdf)\n\n- code [Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads ASPLOS'22 ](https://github.com/parasailteam/coconet.git) \n\n  paper [Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads ASPLOS'22 ](https://dl.acm.org/doi/pdf/10.1145/3503222.3507778)\n\n- code [Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning SIGCOMM'21 ](https://github.com/MachineLearningSystem/omnireduce.git) \n\n  paper [Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning SIGCOMM'21 ](https://dl.acm.org/doi/pdf/10.1145/3452296.3472904)\n\n### Serving-Inference\n\n- code [Paella: Low-latency Model Serving with Virtualized GPU Scheduling SOSP'23](https://github.com/MachineLearningSystem/23sosp-paella)\n\n  paper [Paella: Low-latency Model Serving with Virtualized GPU Scheduling SOSP'23](https://dl.acm.org/doi/pdf/10.1145/3600006.3613163)\n\n- code [AlpaServe: Statistical Multiplexing with Model  Parallelism for Deep Learning Serving OSDI'23](https://github.com/MachineLearningSystem/OSDI23-mms)\n  \n  paper [AlpaServe: Statistical Multiplexing with Model  Parallelism for Deep Learning Serving OSDI'23](https://www.usenix.org/system/files/osdi23-li-zhuohan.pdf)\n- code [Optimizing Dynamic Neural Networks with Brainstorm OSDI'23](https://github.com/MachineLearningSystem/23OSDI-brainstorm)\n\n  paper [Optimizing Dynamic Neural Networks with Brainstorm OSDI'23](https://www.usenix.org/system/files/osdi23-cui.pdf)\n\n- code [Fast and Efficient Model 
Serving Using Multi-GPUs with Direct-Host-Access Eurosys'23](https://github.com/MachineLearningSystem/DeepPlan.git) \n\n  paper [Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access Eurosys'23](https://jeongseob.github.io/papers/jeong_eurosys23.pdf)\n\n\n- code [Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs ASPLOS'23](https://github.com/MachineLearningSystem/hidet)\n\n  paper [Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs ASPLOS'23](https://arxiv.org/pdf/2210.09603.pdf)\n- code [MPCFormer: fast, performant, and private transformer inference with MPC ICLR'23](https://github.com/DachengLi1/MPCFormer)  \n\n  paper [MPCFormer: fast, performant, and private transformer inference with MPC ICLR'23](https://arxiv.org/pdf/2211.01452.pdf)\n\n- code [High-throughput Generative Inference of Large Language Models with a Single GPU ICML'23](https://github.com/MachineLearningSystem/FlexGen) \n \n  paper [High-throughput Generative Inference of Large Language Models with a Single GPU ICML'23](https://arxiv.org/pdf/2303.06865.pdf)\n\n- code [VELTAIR: Towards High-Performance Multi-Tenant Deep Learning Serving via Adaptive Compilation and Scheduling ASPLOS'22](https://github.com/MachineLearningSystem/VELTAIR_ASPLOS22) \n\n  paper [VELTAIR: Towards High-Performance Multi-Tenant Deep Learning Serving via Adaptive Compilation and Scheduling ASPLOS'22](https://arxiv.org/pdf/2201.06212.pdf)\n\n- code [DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs ATC'22 ](https://github.com/MachineLearningSystem/DVABatch)  \n\n\n  paper [DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs ATC'22 ](https://www.usenix.org/system/files/atc22-cui.pdf)\n\n- code [Cocktail: A Multidimensional Optimization for Model Serving in Cloud NSDI'22 ](https://github.com/MachineLearningSystem/cocktail) \n\n  paper [Cocktail: A 
Multidimensional Optimization for Model Serving in Cloud NSDI'22 ](https://www.usenix.org/system/files/nsdi22-paper-gunasekaran.pdf)\n\n- code [Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing ATC'22](https://github.com/MachineLearningSystem/glet) \n\n  paper [Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing ATC'22](https://www.usenix.org/system/files/atc22-choi-seungbeom.pdf)\n\n- code [RIBBON: cost-effective and qos-aware deep learning model inference using a diverse pool of cloud computing instances SC'21](https://github.com/MachineLearningSystem/SC21_Ribbon) \n\n  paper [RIBBON: cost-effective and qos-aware deep learning model inference using a diverse pool of cloud computing instances SC'21](https://dl.acm.org/doi/pdf/10.1145/3458817.3476168)\n\n- code [INFaaS: Automated Model-less Inference Serving ATC'21 ](https://github.com/MachineLearningSystem/INFaaS.git)\n\n  paper  [INFaaS: Automated Model-less Inference Serving ATC'21 ](https://www.usenix.org/system/files/atc21-romero.pdf)\n\n- code [Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction SC'21](https://github.com/MachineLearningSystem/Abacus) \n\n  paper [Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction SC'21](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026arnumber=9910118)\n\n\n- code [Serving DNNs like Clockwork: Performance Predictability from the Bottom Up OSDI'20](https://github.com/MachineLearningSystem/clockwork) \n\n  paper [Serving DNNs like Clockwork: Performance Predictability from the Bottom Up OSDI'20](https://www.usenix.org/system/files/osdi20-gujarati.pdf)\n\n- code [Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving ATC'19 ](https://github.com/MachineLearningSystem/MArk-Project) \n\n  paper [Exploiting Cloud Services for Cost-Effective, SLO-Aware 
Machine Learning Inference Serving ATC'19 ](https://www.usenix.org/system/files/atc19-zhang-chengliang.pdf)\n\n- code [Nexus: a GPU cluster engine for accelerating DNN-based video analysis SOSP'19 ](https://github.com/MachineLearningSystem/nexus) \n\n  paper [Nexus: a GPU cluster engine for accelerating DNN-based video analysis SOSP'19 ](https://homes.cs.washington.edu/~arvind/papers/nexus.pdf)\n\n- code [Clipper:A low-latency prediction-serving system NSDI'17](https://github.com/ucbrise/clipper) \n\n  paper [Clipper:A low-latency prediction-serving system NSDI'17](https://www.usenix.org/system/files/conference/nsdi17/nsdi17-crankshaw.pdf)\n\n\n### MoE\n- code [SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization ATC'23](https://github.com/MachineLearningSystem/23ATC-SmartMoE-AE)\n\n  paper [SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization ATC'23](https://www.usenix.org/system/files/atc23-zhai.pdf)\n- code [MegaBlocks: Efficient Sparse Training with Mixture-of-Experts MLSYS'23 ](https://github.com/stanford-futuredata/megablocks) \n\n  paper [MegaBlocks: Efficient Sparse Training with Mixture-of-Experts MLSYS'23 ](https://arxiv.org/pdf/2211.15841.pdf)\n  \n- code [Tutel: Adaptive Mixture-of-Experts at Scale MLSYS'23](https://github.com/MachineLearningSystem/tutel-MOE) \n\n  paper [Tutel: Adaptive Mixture-of-Experts at Scale MLSYS'23](https://arxiv.org/pdf/2206.03382.pdf)\n\n- code [FastMoE: A Fast Mixture-of-Expert Training System PPOPP'22](https://github.com/MachineLearningSystem/fastmoe-thu) \n\n  paper [FastMoE: A Fast Mixture-of-Expert Training System PPOPP'22](https://dl.acm.org/doi/pdf/10.1145/3503221.3508418)\n\n- code [AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers ICLR'23](https://github.com/MachineLearningSystem/AutoMoE) \n\n\n  paper [AutoMoE: Neural Architecture Search for Efficient Sparsely Activated 
Transformers ICLR'23](https://openreview.net/pdf?id=3yEIFSMwKBC)\n\n- [awesome MoE](https://github.com/MachineLearningSystem/awesome-mixture-of-experts)\n\n- [MoE Paper](https://github.com/MachineLearningSystem/Awesome-Mixture-of-Experts-Papers)\n\n\n\n### GPU Cluster Management\n- code [Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs ASPLOS'23](https://github.com/MachineLearningSystem/Lucid) \n\n  paper [Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs ASPLOS'23](https://dl.acm.org/doi/pdf/10.1145/3575693.3575705)\n\n- code [Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning NSDI'23](https://github.com/MachineLearningSystem/shockwave)  \n\n  paper [Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning NSDI'23](https://www.usenix.org/system/files/nsdi23-zheng.pdf)\n\n- code [Synergy : Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters OSDI'22](https://github.com/MachineLearningSystem/synergy.git) \n\n  paper [Synergy : Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters OSDI'22](https://www.usenix.org/system/files/osdi22-mohan.pdf)\n\n- code [Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning OSDI'21](https://github.com/MachineLearningSystem/adaptdl) \n\n  paper [Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning OSDI'21](https://www.usenix.org/system/files/osdi21-qiao.pdf)\n\n- code [Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads OSDI'20](https://github.com/MachineLearningSystem/gavel)\n\n  paper [Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads OSDI'20](https://www.usenix.org/system/files/osdi20-narayanan_deepak.pdf)\n\n- code [Tiresias -- A GPU Cluster Manager for Distributed Deep Learning Training without complete job information  
NSDI'19](https://github.com/MachineLearningSystem/Tiresias)\n\n\n  paper [Tiresias -- A GPU Cluster Manager for Distributed Deep Learning Training without complete job information  NSDI'19](https://www.usenix.org/system/files/nsdi19-gu.pdf)\n\n- code [Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs SOCC'21 ](https://github.com/MachineLearningSystem/ChronusArtifact) \n\n  paper [Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs SOCC'21 ](https://yezhisheng.me/publication/chronus/chronus_preprint.pdf)\n\n- [awesome DL scheduler](https://github.com/MachineLearningSystem/Awesome-DL-Scheduling-Papers.git)\n\n### Schedule and Resource Management\n- code [An interference-aware scheduler for fine-grained GPU sharing Resources Eurosys'24](https://github.com/MachineLearningSystem/24Eurosys-orion.git)\n\n  paper [An interference-aware scheduler for fine-grained GPU sharing Resources Eurosys'24](https://anakli.inf.ethz.ch/papers/orion_eurosys24.pdf)\n\n\n- code [ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning ASPLOS'23](https://github.com/MachineLearningSystem/ElasticFlow-ASPLOS23) \n\n  paper [ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning ASPLOS'23](https://cp5555.github.io/publications/elasticflow-asplos23.pdf)\n\n- code [Multi-Resource Interleaving for Deep Learning Training SIGCOMM'22](https://github.com/MachineLearningSystem/Muri) \n  \n  paper [Multi-Resource Interleaving for Deep Learning Training SIGCOMM'22](https://xinjin.github.io/files/SIGCOMM22_Muri.pdf)\n\n- code [Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training ASPLOS'24](https://github.com/MachineLearningSystem/slapo)  \n\n  paper [Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training ASPLOS'24](https://arxiv.org/pdf/2302.08005.pdf)\n\n- code [Out-of-order backprop: an effective scheduling technique for deep 
learning Eurosys'22 ](https://github.com/MachineLearningSystem/ooo-backprop) \n\n  paper [Out-of-order backprop: an effective scheduling technique for deep learning Eurosys'22 ](https://dl.acm.org/doi/pdf/10.1145/3492321.3519563)\n\n- code [KungFu: Making Training in Distributed Machine Learning Adaptive OSDI'20](https://github.com/MachineLearningSystem/KungFu) \n \n  paper [KungFu: Making Training in Distributed Machine Learning Adaptive OSDI'20](https://www.usenix.org/system/files/osdi20-mai.pdf)\n\n- code [PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications OSDI'20 ](https://github.com/MachineLearningSystem/PipeSwitch) \n\n  paper [PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications OSDI'20 ](https://www.usenix.org/system/files/osdi20-bai.pdf)\n\n\n### Optimization\n- code [GLake: optimizing GPU memory management and IO transmission ASPLOS'24](https://github.com/MachineLearningSystem/24ASPLOS-glake)\n- code [Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow ASPLOS'23 ](https://github.com/MachineLearningSystem/spada-sim) \n\n  paper [Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow ASPLOS'23 ](https://dl.acm.org/doi/pdf/10.1145/3575693.3575706)\n\n- code [MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant GPU Clusters SOCC'22 ](https://github.com/MachineLearningSystem/socc22-miso) \n  \n  paper [MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant GPU Clusters SOCC'22 ](https://dspace.mit.edu/bitstream/handle/1721.1/147687/3542929.3563510.pdf?sequence=1\u0026isAllowed=y)\n\n- code [Accpar: Tensor partitioning for heterogeneous deep learning accelerators HPCA'20 ](https://github.com/MachineLearningSystem/AccPar) \n\n  paper [Accpar: Tensor partitioning for heterogeneous deep learning accelerators HPCA'20 ](http://alchem.usc.edu/portal/static/download/accpar.pdf)\n\n\n- code [Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs 
ASPLOS'23](https://github.com/MachineLearningSystem/hidet) \n  \n  paper [Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs ASPLOS'23](https://dl.acm.org/doi/pdf/10.1145/3575693.3575702)\n\n- code [iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud TPDS'22 ](https://github.com/MachineLearningSystem/igniter) \n\n   paper [iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud TPDS'22 ](https://arxiv.org/pdf/2211.01713.pdf)\n\n- code [CheckFreq: Frequent, Fine-Grained DNN Checkpointing FAST'21](https://github.com/MachineLearningSystem/CheckFreq) \n\n  paper [CheckFreq: Frequent, Fine-Grained DNN Checkpointing FAST'21](https://www.usenix.org/system/files/fast21-mohan.pdf)\n\n- code [Efficient Quantized Sparse Matrix Operations on Tensor Cores  SC'22](https://github.com/MachineLearningSystem/Magicube)\n\n  paper [Efficient Quantized Sparse Matrix Operations on Tensor Cores  SC'22](https://dl.acm.org/doi/pdf/10.5555/3571885.3571934)\n\n- code [Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers VLDB'22](https://github.com/MachineLearningSystem/harmony) \n\n   paper [Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers VLDB'22](https://vldb.org/pvldb/vol15/p2747-li.pdf)\n\n- code [PetS: A Unified Framework for Parameter-Efficient Transformers Serving ATC'22  ](https://github.com/MachineLearningSystem/PetS-ATC-2022)\n\n  paper [PetS: A Unified Framework for Parameter-Efficient Transformers Serving ATC'22  ](https://www.usenix.org/system/files/atc22-zhou-zhe.pdf)\n\n- code [PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections OSDI'21](https://github.com/MachineLearningSystem/pet-osdi21-ae) \n\n  paper [PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections 
OSDI'21](https://www.usenix.org/system/files/osdi21-wang-haojie.pdf)\n\n- code [APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Core SC'21](https://github.com/MachineLearningSystem/APNN-TC) \n\n  paper [APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Core SC'21](https://dl.acm.org/doi/pdf/10.1145/3458817.3476157)\n\n- code [iGUARD: In-GPU Advanced Race Detection SOSP'21](https://github.com/MachineLearningSystem/iGUARD.git) \n\n  paper [iGUARD: In-GPU Advanced Race Detection SOSP'21](https://dl.acm.org/doi/pdf/10.1145/3477132.3483545)\n\n- code [Fluid: Resource-Aware Hyperparameter Tuning Engine MLSYS'21](https://github.com/MachineLearningSystem/Fluid) \n\n  paper [Fluid: Resource-Aware Hyperparameter Tuning Engine MLSYS'21](https://www.mosharaf.com/wp-content/uploads/fluid-mlsys21.pdf)\n\n- code [Baechi: Fast Device Placement on Machine Learning Graphs SOCC'20 ](https://github.com/MachineLearningSystem/baechi)\n\n  paper [Baechi: Fast Device Placement on Machine Learning Graphs SOCC'20 ](https://dprg.cs.uiuc.edu/data/files/2020/socc20-final352.pdf)\n\n- code [Dynamic Parameter Allocation in Parameter Servers VLDB'20 ](https://github.com/MachineLearningSystem/AdaPS) \n\n  paper [Dynamic Parameter Allocation in Parameter Servers VLDB'20 ](https://www.vldb.org/pvldb/vol13/p1877-renz-wieland.pdf)\n\n- code [Data Movement Is All You Need: A Case Study on Optimizing Transformers](https://github.com/MachineLearningSystem/substation) \n paper [Data Movement Is All You Need: A Case Study on Optimizing Transformers](https://htor.inf.ethz.ch/publications/img/data_movement_is_all_you_need.pdf)\n\n### GNN\n- code [gSampler: Efficient GPU-Based Graph Sampling for Graph Learning SOSP'23](https://github.com/MachineLearningSystem/23SOSP-gSampler)\n   \n   paper [gSampler: Efficient GPU-Based Graph Sampling for Graph Learning SOSP'23](https://dl.acm.org/doi/pdf/10.1145/3600006.3613168)\n\n- code [Legion: Automatically 
Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training ATC'23](https://github.com/MachineLearningSystem/ATC23-Legion)\n\n  paper [Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training ATC'23](https://www.usenix.org/system/files/atc23-sun.pdf)\n\n- code [TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs ATC'23](https://github.com/MachineLearningSystem/ATC23-TCGNN-Pytorch)\n\n  paper [TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs ATC'23](https://www.usenix.org/system/files/atc23-wang-yuke.pdf)\n\n- code [Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms OSDI'23](https://github.com/MachineLearningSystem/MGG-OSDI23-AE) \n\n  paper [Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms OSDI'23](https://www.usenix.org/system/files/osdi23-wang-yuke.pdf)\n\n- code [CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs SC'22](https://github.com/MachineLearningSystem/CoGNN_info_for_SC22.git) \n\n  paper [CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs SC'22](https://dl.acm.org/doi/pdf/10.5555/3571885.3571936)\n\n- code [GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs OSDI'21](https://github.com/MachineLearningSystem/OSDI21_AE-GNN) \n \n  paper  [GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs OSDI'21](https://www.usenix.org/system/files/osdi21-wang-yuke.pdf)\n\n- code [Marius: Learning Massive Graph Embeddings on a Single Machine OSDI'21](https://github.com/MachineLearningSystem/marius) \n  \n  paper [Marius: Learning Massive Graph Embeddings on a Single Machine OSDI'21](https://www.usenix.org/system/files/osdi21-mohoney.pdf)\n\n- code [Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers 
and Serverless Threads OSDI'21](https://github.com/MachineLearningSystem/dorylus)  \n \n  paper [Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads OSDI'21](https://www.usenix.org/system/files/osdi21-thorpe.pdf)\n\n- code [BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling MLSYS'22](https://github.com/MachineLearningSystem/BNS-GCN)\n\n  paper [BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling MLSYS'22](https://arxiv.org/pdf/2203.10983.pdf)\n\n- code [Accelerating Large Scale Real-Time GNN Inference Using Channel Pruning VLDB'21](https://github.com/MachineLearningSystem/GCNP)\n\n  paper [Accelerating Large Scale Real-Time GNN Inference Using Channel Pruning VLDB'21](http://vldb.org/pvldb/vol14/p1597-zhou.pdf)\n\n- code [Reducing Communication in Graph Neural Network Training SC'20](https://github.com/MachineLearningSystem/CAGNET) \n\n  paper [Reducing Communication in Graph Neural Network Training SC'20](https://dl.acm.org/doi/pdf/10.5555/3433701.3433794)\n\n- [awesome GNN](https://github.com/chwan1016/awesome-gnn-systems)\n\n### Fine-Tune\n- code [Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism ATC'21](https://github.com/MachineLearningSystem/FTPipe-ATC21-Finetune.git) \n\n  paper [Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism ATC'21](https://www.usenix.org/system/files/atc21-eliad.pdf)\n\n### Energy\n\n- code [Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training NSDI'23](https://github.com/MachineLearningSystem/Zeus)\n\n  paper [Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training NSDI'23](https://www.usenix.org/system/files/nsdi23-you.pdf)\n\n- code [EnvPipe: Performance-preserving DNN Training Framework for Saving Energy ATC'23](https://github.com/MachineLearningSystem/23ATC-EnvPipe)\n\n  paper [EnvPipe: Performance-preserving DNN Training Framework for Saving Energy ATC'23](https://www.usenix.org/system/files/atc23-choi.pdf)\n\n### Misc\n- code [Characterizing Variability in Large-Scale, Accelerator-Rich Systems SC'22](https://github.com/MachineLearningSystem/gpu_variability_sc22_artifact)\n\n  paper [Characterizing Variability in Large-Scale, Accelerator-Rich Systems SC'22](https://dl.acm.org/doi/pdf/10.5555/3571885.3571971)\n\n- code [Prediction of the Resource Consumption of Distributed Deep Learning Systems SIGMETRICS'22](https://github.com/MachineLearningSystem/Driple) \n\n  paper [Prediction of the Resource Consumption of Distributed Deep Learning Systems SIGMETRICS'22](https://dl.acm.org/doi/pdf/10.1145/3530895)\n\n- code [AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications HPCA'22](https://github.com/MachineLearningSystem/HPCA22_SuperCloud)  \n\n  paper [AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications HPCA'22](https://baolin-li.netlify.app/uploads/HPCA_2022_MIT_SuperCloud.pdf)\n\n\n\n## Contribute\nWe encourage all contributions to this repository. Open an [issue](https://github.com/lambda7xx/awesome-AI-system/issues) or send a [pull request](https://github.com/lambda7xx/awesome-AI-system/pulls).\n","projects_url":"https://awesome.ecosyste.ms/api/v1/lists/lambda7xx%2Fawesome-ai-system/projects"}