Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-distributed-ml
A curated list of awesome projects and papers for distributed training or inference
https://github.com/Shenggan/awesome-distributed-ml
Last synced: 6 days ago
-
Open Source Projects
- Megatron-LM: Ongoing Research Training Transformer Models at Scale
- Mesh TensorFlow: Model Parallelism Made Easier
- FlexFlow: A Distributed Deep Learning Framework that Supports Flexible Parallelization Strategies.
- Alpa: Auto Parallelization for Large-Scale Neural Networks
- Easy Parallel Library: A General and Efficient Deep Learning Framework for Distributed Model Training
- FairScale: PyTorch Extensions for High Performance and Large Scale Training
- TePDist: an HLO-level automatic distributed system for DL models
- EasyDist: Automated Parallelization System and Infrastructure
- exo: Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚
- DeepSpeed: A Deep Learning Optimization Library that Makes Distributed Training and Inference Easy, Efficient, and Effective.
- ColossalAI: A Unified Deep Learning System for Large-Scale Parallel Training
- Nerlnet: A framework for research and deployment of distributed machine learning algorithms on IoT devices
- veScale: A PyTorch Native LLM Training Framework
-
Papers
-
Survey
-
Pipeline Parallelism
- GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
- Memory-Efficient Pipeline-Parallel DNN Training
- Hanayo: Harnessing Wave-like Pipeline Parallelism for Enhanced Large Model Training Efficiency
- PipeDream: generalized pipeline parallelism for DNN training
- DAPPLE: a pipelined data parallel approach for training large models
- Mobius: Fine Tuning Large-Scale Models on Commodity GPU Servers
- Chimera: efficiently training large-scale neural networks with bidirectional pipelines
- Zero Bubble Pipeline Parallelism
- Elastic Averaging for Efficient Pipelined DNN Training
-
Mixture-of-Experts System
- GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- Tutel: Adaptive Mixture-of-Experts at Scale
- Accelerating Distributed MoE Training and Inference with Lina
- SmartMoE: Efficiently Training Sparsely-Activated Models through Combining Static and Dynamic Parallelization
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models
- BaGuaLu: targeting brain scale pretrained models with over 37 million cores
-
Graph Neural Networks System
-
Hybrid Parallelism & Framework
- GEMS: GPU-Enabled Memory-Aware Model-Parallelism System for Distributed DNN Training
- Amazon SageMaker Model Parallelism: A General and Flexible Framework for Large Model Training
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training
- Efficient large-scale language model training on GPU clusters using Megatron-LM
-
Memory Efficient Training
- Training deep nets with sublinear memory cost
- Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
- Dynamic Tensor Rematerialization
- ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- GACT: Activation Compressed Training for Generic Network Architectures
- ZeRO: memory optimizations toward training trillion parameter models
-
Tensor Movement
- ZeRO-Offload: Democratizing Billion-Scale Model Training
- PatrickStar: Parallel Training of Pre-trained Models via Chunk-based Memory Management
- MegTaiChi: dynamic tensor-based memory management optimization for DNN training
- Tensor Movement Orchestration In Multi-GPU Training Systems - Shao-Fu Lin et al., HPCA 2023
- SwapAdvisor: Pushing Deep Learning Beyond the GPU Memory Limit via Smart Swapping - Chien-Chin Huang et al., ASPLOS 2020
- Capuchin: Tensor-based GPU Memory Management for Deep Learning
- ZeRO-Infinity: breaking the GPU memory wall for extreme scale deep learning
- Superneurons: dynamic GPU memory management for training deep neural networks
-
Auto Parallelization
- Mesh-TensorFlow: Deep learning for supercomputers
- Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
- Beyond Data and Model Parallelism for Deep Neural Networks
- GSPMD: General and Scalable Parallelization for ML Computation Graphs
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization
- Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism
- Auto-Parallelizing Large Models with Rhino: A Systematic Approach on Production AI Platform
- nnScaler: Constraint-Guided Parallelization Plan Generation for Deep Learning Training
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning
-
Communication Optimization
- Blink: Fast and Generic Collectives for Distributed ML
- Logical/Physical Topology-Aware Collective Communication in Deep Learning Training
- MSCCLang: Microsoft Collective Communication Language
- Breaking the computation and communication abstraction barrier in distributed machine learning workloads
- Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning
- Synthesizing optimal collective algorithms
-
Fault-tolerant Training
-
Inference and Serving
- EnergonAI: An Inference System for 10-100 Billion Parameter Transformer Models
- Efficiently Scaling Transformer Inference
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- Fast inference from transformers via speculative decoding
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
- Liger: Interleaving Intra- and Inter-Operator Parallelism for Distributed Large Model Inference
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
- Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
-
Applications
- MSRL: Distributed Reinforcement Learning with Dataflow Fragments
- Hydro: Surrogate-Based Hyperparameter Tuning Service in the Datacenter
- FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters
- NASPipe: High Performance and Reproducible Pipeline Parallel Supernet Training via Causal Synchronous Parallelism
- HybridFlow: A Flexible and Efficient RLHF Framework
-
Sequence Parallelism
- Long Sequence Training from System Perspective
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
- Ring Attention with Blockwise Transformers for Near-Infinite Context
- USP: A Unified Sequence Parallelism Approach for Long Context Generative AI
- LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism
-
Contribute
- Contributions are welcome: open an issue or send a pull request.
-
Categories
Sub Categories
- Auto Parallelization: 10
- Pipeline Parallelism: 9
- Inference and Serving: 9
- Mixture-of-Experts System: 8
- Tensor Movement: 8
- Communication Optimization: 6
- Applications: 6
- Memory Efficient Training: 6
- Hybrid Parallelism & Framework: 5
- Sequence Parallelism: 5
- Survey: 3
- Fault-tolerant Training: 3
- Graph Neural Networks System: 3
Keywords
- deep-learning: 4
- machine-learning: 4
- distributed-computing: 3
- distributed-training: 3
- high-performance-computing: 3
- large-language-models: 2
- distributed-systems: 2
- auto-parallelization: 2
- compiler: 2
- llm: 2
- rhino: 1
- disthlo: 1
- pipeline-parallelism: 1
- model-parallelism: 1
- memory-efficient: 1
- gpu: 1
- data-parallelism: 1
- jax: 1
- alpa: 1
- transformers: 1
- model-para: 1
- pytorch: 1
- llm-training: 1
- python: 1
- neural-network: 1
- nerlnet: 1
- ml: 1
- iot: 1
- federated-learning-framework: 1
- federated-learning: 1
- federated: 1
- fault-tolerance: 1
- erlang: 1
- distributed-ml: 1
- distributed-machine-learning: 1
- cowboy: 1
- artificial-intelligence-projects: 1
- ai: 1
- llama: 1
- graph-neural-networks: 1
- diffusion-models: 1
- automatic-parallelization: 1