Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-AI-system
Papers and their code for AI systems.
https://github.com/lambda7xx/awesome-AI-system
Last synced: 3 days ago
-
Paper-Code
-
Parallelism Training
- PipeFisher: Efficient Training of Large Language Models Using Pipelining and Fisher Information Matrices MLSYS'23
- zero-bubble-pipeline-parallelism
- Aceso: Efficient Parallel DNN Training through Iterative Bottleneck Alleviation Eurosys'24
- HAP: SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis Eurosys'24
- Calculon: A Methodology and Tool for High-Level Co-Design of Systems and Large Language Models SC'23
- Bamboo: Making Preemptible Instances Resilient for Affordable Training of Large DNNs NSDI'23
- Optimus-CC: Efficient Large NLP Model Training with 3D Parallelism Aware Communication Compression ASPLOS'23
- AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness NeurIPS '22
- NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism
- Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
- Chimera: efficiently training large-scale neural networks with bidirectional pipelines SC'21
- Piper: Multidimensional Planner for DNN Parallelization NeurIPS'21
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models ICML'21
- DAPPLE: An Efficient Pipelined Data Parallel Approach for Large Models Training PPOPP'21
- TeraPipe: Large-Scale Language Modeling with Pipeline Parallelism ICML'21
- PipeDream: Pipeline Parallelism for DNN Training SOSP'19
- SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models
- awesome distributed deep learning
- awesome parallelism
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI'22
- DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines Eurosys'24
- MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism HPCA'23
- Near Zero Bubble Pipeline Parallelism ICLR'24
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization OSDI'22
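Most of the pipeline-parallelism papers above (PipeDream, Chimera, DAPPLE, the zero-bubble schedules) differ mainly in how micro-batches are ordered across stages. Below is a minimal, framework-free sketch of a GPipe-style schedule simulator, meant only to build intuition about pipeline bubbles; the unit-time forward/backward costs and the greedy forward-first policy are simplifying assumptions, not any listed paper's algorithm.

```python
def gpipe_schedule(num_stages, num_microbatches):
    """Simulate unit-time forward/backward ops and return per-stage timelines."""
    done = set()                              # finished ops: ("F"|"B", stage, microbatch)
    timelines = [[] for _ in range(num_stages)]
    pending = {(kind, s, m) for kind in "FB"
               for s in range(num_stages) for m in range(num_microbatches)}

    def ready(op, finished):
        kind, s, m = op
        if kind == "F":                       # forward needs the previous stage's output
            return s == 0 or ("F", s - 1, m) in finished
        if s == num_stages - 1:               # last stage can go backward after its own forward
            return ("F", s, m) in finished
        return ("B", s + 1, m) in finished    # otherwise it needs the next stage's gradient

    while pending:
        snapshot = set(done)                  # ops finished before this tick starts
        for s in range(num_stages):
            # GPipe-style priority: ready forwards (in micro-batch order) before backwards.
            ops = sorted((op for op in pending if op[1] == s and ready(op, snapshot)),
                         key=lambda op: (op[0] == "B", op[2]))
            if ops:
                pending.remove(ops[0])
                done.add(ops[0])
                timelines[s].append(f"{ops[0][0]}{ops[0][2]}")
            else:
                timelines[s].append("..")     # idle slot: a pipeline bubble
    return timelines

if __name__ == "__main__":
    S, M = 4, 8
    for s, row in enumerate(gpipe_schedule(S, M)):
        print(f"stage {s}: " + " ".join(row))
    print(f"ideal GPipe bubble fraction: {(S - 1) / (M + S - 1):.2f}")
```

With 4 stages and 8 micro-batches the printed timeline makes the idle slots visible; that (S-1)/(M+S-1) bubble fraction is exactly what schedules such as Chimera and zero-bubble pipelining try to shrink.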
-
Framework
- A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters OSDI'20
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training ICPP'23
- HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework VLDB'22
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning OSDI'22
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM SC'21
-
Training
- ModelKeeper: Accelerating DNN Training via Automated Training Warmup NSDI'23
- STRONGHOLD: Fast and Affordable Billion-scale Deep Learning Model Training SC'22
- Whale: Efficient Giant Model Training over Heterogeneous GPUs ATC'22
- GeePS: Scalable Deep Learning on Distributed GPUs with a GPU-Specialized Parameter Server Eurosys'16
-
Communication
- TopoOpt: Optimizing the Network Topology for Distributed DNN Training NSDI'23
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads ASPLOS'22
- Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning SIGCOMM'21
- ARK: GPU-driven Code Execution for Distributed Deep Learning NSDI'23
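Most of the communication papers above start from ring all-reduce and then reshape the topology, sparsify the payload, or overlap it with compute. Here is a single-process toy that walks through the reduce-scatter and all-gather phases, purely to make the chunk rotation concrete; no real networking is involved.

```python
def ring_allreduce(per_rank_vectors):
    """Reduce-scatter then all-gather over a logical ring of ranks."""
    p = len(per_rank_vectors)
    n = len(per_rank_vectors[0])
    assert n % p == 0, "for simplicity, the vector length must divide the ring size"
    chunk = n // p
    bufs = [list(v) for v in per_rank_vectors]

    def sl(i):
        return slice(i * chunk, (i + 1) * chunk)

    # Phase 1, reduce-scatter: after p-1 steps rank r holds the full sum of chunk (r+1) % p.
    for step in range(p - 1):
        sends = [((r + 1) % p, (r - step) % p, bufs[r][sl((r - step) % p)])
                 for r in range(p)]
        for dst, i, data in sends:            # all ranks exchange "simultaneously"
            bufs[dst][sl(i)] = [a + b for a, b in zip(bufs[dst][sl(i)], data)]

    # Phase 2, all-gather: circulate the completed chunks for another p-1 steps.
    for step in range(p - 1):
        sends = [((r + 1) % p, (r + 1 - step) % p, bufs[r][sl((r + 1 - step) % p)])
                 for r in range(p)]
        for dst, i, data in sends:
            bufs[dst][sl(i)] = data
    return bufs

if __name__ == "__main__":
    vectors = [[float(r)] * 8 for r in range(4)]   # 4 ranks, 8 gradient elements each
    result = ring_allreduce(vectors)
    assert all(v == [6.0] * 8 for v in result)     # 0 + 1 + 2 + 3 everywhere
    print("every rank ends with:", result[0])
```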
-
Serving-Inference
- Paella: Low-latency Model Serving with Virtualized GPU Scheduling SOSP'23
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving OSDI'23
- Optimizing Dynamic Neural Networks with Brainstorm OSDI'23
- Fast and Efficient Model Serving Using Multi-GPUs with Direct-Host-Access Eurosys'23
- MPCFormer: fast, performant, and private transformer inference with MPC ICLR'23
- High-throughput Generative Inference of Large Language Models with a Single GPU ICML'23
- VELTAIR: Towards High-Performance Multi-Tenant Deep Learning Serving via Adaptive Compilation and Scheduling ASPLOS'22
- DVABatch: Diversity-aware Multi-Entry Multi-Exit Batching for Efficient Processing of DNN Services on GPUs ATC'22
- Cocktail: A Multidimensional Optimization for Model Serving in Cloud NSDI'22
- Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing ATC'22
- RIBBON: cost-effective and qos-aware deep learning model inference using a diverse pool of cloud computing instances SC'21
- INFaaS: Automated Model-less Inference Serving ATC'21
- Enable Simultaneous DNN Services Based on Deterministic Operator Overlap and Precise Latency Prediction SC'21
- Serving DNNs like Clockwork: Performance Predictability from the Bottom Up OSDI'20
- Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving ATC'19
- Nexus: a GPU cluster engine for accelerating DNN-based video analysis SOSP'19
- Clipper: A Low-Latency Prediction-Serving System NSDI'17
- Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs ASPLOS'23
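A recurring ingredient across these serving systems, from Clipper's adaptive batching to DVABatch and Clockwork, is a queue that trades a little waiting time for batched throughput. The sketch below simulates that loop on a single device; the arrival times, batch-size cap, wait budget, and latency model are all made up for illustration.

```python
def batched_serving_sim(arrivals, max_batch=8, wait_budget=2.0,
                        latency=lambda b: 5.0 + 0.5 * b):
    """Greedy dynamic batching on one device: launch a batch when the queue is
    full, the oldest request has waited `wait_budget`, or no more requests are
    coming. Returns request -> completion time."""
    pending = sorted(arrivals.items(), key=lambda kv: kv[1])   # (req_id, arrival time)
    queue, done, now = [], {}, 0.0
    while pending or queue:
        while pending and pending[0][1] <= now:                # admit arrived requests
            req, t = pending.pop(0)
            queue.append((t, req))
        full = len(queue) >= max_batch
        expired = bool(queue) and now - queue[0][0] >= wait_budget
        drained = bool(queue) and not pending                  # nothing more will arrive
        if full or expired or drained:
            batch, queue = queue[:max_batch], queue[max_batch:]
            finish = now + latency(len(batch))
            for _, req in batch:
                done[req] = finish
            now = finish                       # single device, batch runs to completion
        else:
            # jump to the next arrival or to the oldest request's wait deadline
            events = [pending[0][1]] if pending else []
            if queue:
                events.append(queue[0][0] + wait_budget)
            now = max(now, min(events))
    return done

if __name__ == "__main__":
    arrivals = {f"req{i}": 0.4 * i for i in range(20)}         # hypothetical arrival times
    completions = batched_serving_sim(arrivals)
    for req, t in sorted(completions.items(), key=lambda kv: kv[1])[:5]:
        print(f"{req} completed at t={t:.1f}")
```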
-
MoE
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization ATC'23
- Tutel: Adaptive Mixture-of-Experts at Scale MLSYS'23
- FastMoE: A Fast Mixture-of-Expert Training System PPOPP'22
- AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers ICLR'23
- awesome MoE
- MoE Paper
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts MLSYS'23
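The MoE systems above largely optimize the same routing-and-dispatch step that this framework-free sketch spells out: score each token against a gate, keep the top-k experts, and mix their outputs. The gate weights, the scalar "experts", and the dimensions are placeholders, not any listed system's real kernels.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_layer(tokens, gate_weights, experts, k=2):
    """Route each token to its top-k experts and mix their outputs by gate score."""
    outputs, assignments = [], []
    for tok in tokens:
        logits = [sum(w * x for w, x in zip(row, tok)) for row in gate_weights]
        probs = softmax(logits)
        topk = sorted(range(len(experts)), key=lambda e: -probs[e])[:k]
        norm = sum(probs[e] for e in topk)
        mixed = [0.0] * len(tok)
        for e in topk:
            expert_out = experts[e](tok)              # only the chosen experts run
            scale = probs[e] / norm
            mixed = [m + scale * y for m, y in zip(mixed, expert_out)]
        outputs.append(mixed)
        assignments.append(topk)
    return outputs, assignments

if __name__ == "__main__":
    random.seed(0)
    dim, num_experts = 4, 8
    tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
    gate_weights = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
    # toy "experts": each just scales its input by a different constant
    experts = [lambda x, s=e: [v * (s + 1) for v in x] for e in range(num_experts)]
    _, routes = moe_layer(tokens, gate_weights, experts, k=2)
    print("expert assignments per token:", routes)
```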
-
GPU Cluster Management
- Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs ASPLOS'23
- Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning NSDI'23
- Synergy: Looking Beyond GPUs for DNN Scheduling on Multi-Tenant Clusters OSDI'22
- Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning OSDI'21
- Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads OSDI'20
- Tiresias: A GPU Cluster Manager for Distributed Deep Learning Training without complete job information NSDI'19
- Chronus: A Novel Deadline-aware Scheduler for Deep Learning Training Jobs SOCC'21
- awesome DL scheduler
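Several of these schedulers, Tiresias most directly, prioritize jobs by how much GPU service they have already received. The toy loop below applies a plain least-attained-service policy to hypothetical jobs; real systems add discretized priority queues, placement constraints, and preemption costs on top.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    attained: float                          # GPU-seconds received so far (priority key)
    name: str = field(compare=False)
    remaining: float = field(compare=False)  # GPU-seconds still needed

def las_schedule(jobs, num_gpus, quantum=1.0):
    """Each round, the `num_gpus` jobs with the least attained service each get
    one GPU for `quantum` seconds; returns (job name, finish time) pairs."""
    heap = list(jobs)
    heapq.heapify(heap)
    finished, clock = [], 0.0
    while heap:
        running = [heapq.heappop(heap) for _ in range(min(num_gpus, len(heap)))]
        clock += quantum
        for job in running:
            job.attained += quantum
            job.remaining -= quantum
            if job.remaining <= 0:
                finished.append((job.name, clock))
            else:
                heapq.heappush(heap, job)    # re-queue with updated priority
    return finished

if __name__ == "__main__":
    jobs = [Job(0.0, "bert-ft", 3.0), Job(0.0, "resnet", 6.0), Job(0.0, "gpt-pretrain", 12.0)]
    for name, t in las_schedule(jobs, num_gpus=2):
        print(f"{name} finished at t={t:.0f}")
```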
-
Schedule and Resource Management
- An interference-aware scheduler for fine-grained GPU sharing Resources Eurosys'24
- ElasticFlow: An Elastic Serverless Training Platform for Distributed Deep Learning ASPLOS'23
- Multi-Resource Interleaving for Deep Learning Training SIGCOMM'22
- Slapo: A Schedule Language for Progressive Optimization of Large Deep Learning Model Training ASPLOS'24
- Out-of-order backprop: an effective scheduling technique for deep learning Eurosys'22
- KungFu: Making Training in Distributed Machine Learning Adaptive OSDI'20
- PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications OSDI'20
-
Optimization
- GLake: optimizing GPU memory management and IO transmission ASPLOS'24
- Spada: Accelerating Sparse Matrix Multiplication with Adaptive Dataflow ASPLOS'23
- MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant GPU Clusters SOCC'22
- Accpar: Tensor partitioning for heterogeneous deep learning accelerators HPCA'20
- iGniter: Interference-Aware GPU Resource Provisioning for Predictable DNN Inference in the Cloud TPDS'22
- CheckFreq: Frequent, Fine-Grained DNN Checkpointing FAST'21
- Efficient Quantized Sparse Matrix Operations on Tensor Cores SC'22
- Harmony: Overcoming the hurdles of GPU memory capacity to train massive DNN models on commodity servers VLDB'22
- PetS: A Unified Framework for Parameter-Efficient Transformers Serving ATC'22
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections OSDI'21
- APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Core SC'21
- iGUARD: In-GPU Advanced Race Detection SOSP'21
- Fluid: Resource-Aware Hyperparameter Tuning Engine MLSYS'21
- Baechi: Fast Device Placement on Machine Learning Graphs SOCC'20
- Dynamic Parameter Allocation in Parameter Servers VLDB'20
- Data Movement Is All You Need: A Case Study on Optimizing Transformers
- Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs ASPLOS'23
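Much of the kernel-level work in this section (Spada, the quantized sparse tensor-core papers, APNN-TC) manipulates compressed sparse layouts. Below is a plain-Python CSR sparse-times-dense multiply, just to pin down the data structure being discussed; the tiny matrices are arbitrary.

```python
def dense_to_csr(mat):
    """Convert a dense row-major matrix into (values, col_idx, row_ptr)."""
    values, col_idx, row_ptr = [], [], [0]
    for row in mat:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_matmul(values, col_idx, row_ptr, dense):
    """Compute (sparse A) @ (dense B), where `dense` is a list of rows of B."""
    n_rows, n_cols = len(row_ptr) - 1, len(dense[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):   # only stored nonzeros of row i
            a, j = values[k], col_idx[k]
            for c in range(n_cols):
                out[i][c] += a * dense[j][c]
    return out

if __name__ == "__main__":
    A = [[0, 2, 0],
         [1, 0, 3]]
    B = [[1, 1],
         [2, 0],
         [0, 4]]
    print(csr_matmul(*dense_to_csr(A), B))   # [[4.0, 0.0], [1.0, 13.0]]
```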
-
GNN
- gSampler: Efficient GPU-Based Graph Sampling for Graph Learning SOSP'23
- Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training ATC'23
- TC-GNN: Accelerating Sparse Graph Neural Network Computation Via Dense Tensor Core on GPUs ATC'23
- Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Multi-GPU Platforms OSDI'23
- CoGNN: Efficient Scheduling for Concurrent GNN Training on GPUs SC'22
- GNNAdvisor: An Efficient Runtime System for GNN Acceleration on GPUs OSDI'21
- Marius: Learning Massive Graph Embeddings on a Single Machine OSDI'21
- Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads OSDI'21
- BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling MLSYS'22
- Accelerating Large Scale Real-Time GNN Inference Using Channel Pruning VLDB'21
- Reducing Communication in Graph Neural Network Training SC'20
- awesome GNN
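The GNN systems above mostly accelerate one kernel: gathering and reducing neighbor features over a sparse graph. A toy mean aggregation over an edge list makes that kernel concrete; the graph and features below are invented.

```python
def mean_aggregate(num_nodes, edges, features):
    """For each destination node, average the features of its source neighbors."""
    dim = len(features[0])
    acc = [[0.0] * dim for _ in range(num_nodes)]
    degree = [0] * num_nodes
    for src, dst in edges:                    # scatter-add along edges
        degree[dst] += 1
        for d in range(dim):
            acc[dst][d] += features[src][d]
    return [[x / degree[v] if degree[v] else 0.0 for x in acc[v]]
            for v in range(num_nodes)]

if __name__ == "__main__":
    edges = [(0, 1), (2, 1), (1, 2), (3, 2)]  # directed src -> dst
    feats = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
    for node, h in enumerate(mean_aggregate(4, edges, feats)):
        print(f"node {node}: {h}")
```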
-
Fine-Tune
-
Energy
- Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training NSDI'23
- EnvPipe: Performance-preserving DNN Training Framework for Saving Energy ATC'23
-
Misc
- Characterizing Variability in Large-Scale, Accelerator-Rich Systems SC'22
- Prediction of the Resource Consumption of Distributed Deep Learning Systems SIGMETRICS'22
- AI-Enabling Workloads on Large-Scale GPU-Accelerated System: Characterization, Opportunities, and Implications HPCA'22
-
LLM Serving
- Efficiently Programming Large Language Models using SGLang
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity VLDB'24
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU
- Efficient Memory Management for Large Language Model Serving with PagedAttention SOSP'23
- SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification arXiv'23
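The PagedAttention paper above is, at its core, about this bookkeeping: the KV cache is carved into fixed-size blocks, and each sequence maps logical token positions to physical blocks through a block table, so memory can be allocated non-contiguously and reclaimed exactly. The sketch below captures only that allocator logic, with arbitrary block and pool sizes and none of the attention math.

```python
class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))   # pool of physical block ids
        self.block_tables = {}                        # seq_id -> list of physical blocks
        self.lengths = {}                             # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve space for one more token of `seq_id`, allocating a block on demand;
        returns the (physical block, slot) holding this token's KV vectors."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:             # current block full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted: preempt or swap a sequence")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=4, block_size=2)
    for step in range(3):
        for seq in ("A", "B"):
            block, slot = cache.append_token(seq)
            print(f"seq {seq} token {step} -> block {block}, slot {slot}")
    cache.free("A")
    print("free blocks after finishing A:", cache.free_blocks)
```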
-
LLM FineTune
-
Fancy LLM
- LLMCompiler: An LLM Compiler for Parallel Function Calling arXiv'23
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
- EAGLE: Lossless Acceleration of LLM Decoding by Feature Extrapolation
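Lookahead Decoding, EAGLE, and SpecInfer all build on a draft-then-verify loop: a cheap drafter proposes several tokens and the target model checks them in one pass. The toy below uses fake next-token functions over a tiny vocabulary and simplifies verification to greedy prefix matching rather than the rejection-sampling rule the papers use.

```python
def target_model(context):
    """Stand-in for the large model: a deterministic next token."""
    return (sum(context) * 7 + 3) % 50

def draft_model(context):
    """Stand-in for the cheap draft model: right most of the time."""
    guess = target_model(context)
    return guess if sum(context) % 5 else (guess + 1) % 50   # inject occasional misses

def speculative_decode(prompt, num_tokens, draft_len=4):
    out = list(prompt)
    target_calls = 0
    while len(out) < len(prompt) + num_tokens:
        # 1) the draft model proposes `draft_len` tokens autoregressively
        draft, ctx = [], list(out)
        for _ in range(draft_len):
            tok = draft_model(ctx)
            draft.append(tok)
            ctx.append(tok)
        # 2) the target model verifies all positions in one pass (one call here),
        #    keeping the longest matching prefix plus one corrected or bonus token
        target_calls += 1
        accepted, ctx = [], list(out)
        for tok in draft:
            want = target_model(ctx)
            if tok != want:
                accepted.append(want)                 # correction token, then stop
                break
            accepted.append(tok)
            ctx.append(tok)
        else:
            accepted.append(target_model(ctx))        # bonus token when all accepted
        out.extend(accepted)
    return out[len(prompt):len(prompt) + num_tokens], target_calls

if __name__ == "__main__":
    tokens, calls = speculative_decode([1, 2, 3], num_tokens=20)
    print("decoded:", tokens)
    print(f"target-model calls: {calls} (vs 20 for plain autoregressive decoding)")
```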
-
LLM Serving Framework
-
LLM Evaluation Platform
- FastChat | Website: https://chat.lmsys.org/
-
LLM Inference (System Side)
- Liger PPOPP'24
-
Researcher
-
LoRA
- Punica (https://github.com/punica-ai/punica) Oct 2023
-
Categories
Sub Categories
- Parallelism Training (54)
- Serving-Inference (43)
- Optimization (31)
- GNN (21)
- GPU Cluster Management (14)
- Schedule and Resource Management (14)
- MoE (12)
- Communication (8)
- Framework (8)
- Training (7)
- LLM Serving (6)
- LLM Serving Framework (5)
- Fancy LLM (5)
- Misc (4)
- Energy (4)
- Fine-Tune (2)
- LLM FineTune (2)
- LLM Evaluation Platform (1)
- Researcher (1)
- LLM Inference (System Side) (1)
- LoRA (1)
Keywords
- deep-learning (2)
- distributed-training (1)
- keras (1)
- machine-learning (1)
- mxnet (1)
- pytorch (1)
- tensorflow (1)
- ai (1)
- big-model (1)
- data-parallelism (1)
- distributed-computing (1)
- foundation-models (1)
- heterogeneous-training (1)
- hpc (1)
- inference (1)
- large-scale (1)
- model-parallelism (1)
- pipeline-parallelism (1)
- awesome-list (1)
- awesome-papers (1)
- graph-neural-networks (1)
- graph-systems (1)