{"id":13604254,"url":"https://github.com/MachineLearningSystem/awesome-Auto-Parallelism","last_synced_at":"2025-04-11T23:32:02.310Z","repository":{"id":185461618,"uuid":"558353563","full_name":"MachineLearningSystem/awesome-Auto-Parallelism","owner":"MachineLearningSystem","description":"A baseline repository of Auto-Parallelism in Training Neural Networks","archived":false,"fork":true,"pushed_at":"2022-06-25T03:19:28.000Z","size":832,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-05-23T02:01:08.916Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"ConnollyLeon/awesome-Auto-Parallelism","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-10-27T11:36:50.000Z","updated_at":"2022-10-25T15:53:46.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/awesome-Auto-Parallelism","commit_stats":null,"previous_names":["machinelearningsystem/awesome-auto-parallelism"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fawesome-Auto-Parallelism","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fawesome-Auto-Parallelism/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2Fawesome-Auto-Parallelism/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Machin
eLearningSystem%2Fawesome-Auto-Parallelism/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"https://codeload.github.com/MachineLearningSystem/awesome-Auto-Parallelism/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248161270,"owners_count":21057554,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:42.346Z","updated_at":"2025-04-11T23:32:02.277Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# Concept Explanation\n\n## Data Parallelism (DP)\n\n## Model Parallelism\n\nModel parallelism has two types: inter-layer and intra-layer. We denote inter-layer model parallelism as MP and\nintra-layer model parallelism as TP (tensor parallelism).\n\nSome researchers call TP parameter parallelism or intra-layer model parallelism.\n\nPopular intra-layer parallelism methods include 2D, 2.5D, and 3D model parallelism as well as Megatron-style 1D. There is\nstill little work on 2D, 2.5D, and 3D (only Colossal-AI).\n\n## Pipeline Parallelism\n\nPP and MP partition models similarly but differ in execution behavior. 
Pipeline parallelism basically has two\nfamilies: the PipeDream family and the GPipe family.\n\n# Published methods of auto-parallelism\n\nI classify parallelism methods according to how they partition the model.\n\n## Pipeline Parallelism or Inter-layer Model Parallelism only:\n\n|  Name  | Description | Organization or author | Paper| Framework| Year | Auto Methods |\n| --- | --- | --- | ---  | --- | --- | --- |\n| ColocRL(REINFORCE) | Uses reinforcement learning to discover model partitions | Google Brain | [mlr.press](http://proceedings.mlr.press/v70/mirhoseini17a/mirhoseini17a.pdf) | Tensorflow | PMLR 70, 2017 | Reinforce\n| A hierarchical model for device placement (HDP)| Uses Scotch for graph partitioning | Google |[link](https://openreview.net/pdf?id=Hkc-TeZ0W) | Tensorflow | ICLR 2018 | Reinforce LSTM\n| GPipe| No implementation, see torchgpipe | Google | [arxiv](https://arxiv.org/abs/1811.06965) | None| 2018 on arxiv, NIPS2019 | even partitioning or manual\n|[torchgpipe](https://github.com/kakaobrain/torchgpipe)| A GPipe implementation in PyTorch |  UNIST | [arxiv](https://arxiv.org/pdf/2004.09910.pdf) | pytorch | 2020 on arxiv | balance stages by profiling\n| GDP | A general deep RL method for automating device placements on arbitrary graphs. Orthogonal to DP, MP, PP | Google| [arxiv](https://export.arxiv.org/pdf/1910.01578.pdf) | Unknown | 2019 on arxiv | Reinforce Transformer\n| Pesto | Partitions the model based on inter-layer model parallelism | Stony Brook University | [acm](https://www3.cs.stonybrook.edu/~anshul/middleware21_pesto.pdf) | Tensorflow | Middleware '21 | integer linear program\n| [vPipe](https://github.com/hku-systems/vpipe) | A pipeline-only system designed for NAS networks. Complementary to hybrid parallelism| HKU | [ieee](https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=\u0026arnumber=9472938) | PyTorch | TPDS vol.33 no.3 2022 |Swap, Recompute, Partition (SRP) planner. 
Partition: Kernighan-Lin algorithm\n\n## Data Parallelism + Pipeline Parallelism (or Inter-layer Model Parallelism):\n\n|  Name  | Description | Organization or author | Paper| Framework| Year | Auto Methods |\n| --- | --- | --- | ---  | --- | --- | --- |\n| Spotlight| Models device placement as a Markov decision process (MDP). | University of Toronto | [mlr.press](http://proceedings.mlr.press/v80/gao18a/gao18a.pdf) | Unknown|PMLR 80, 2018 | Reinforce LSTM\n| Placeto | Similar to Spotlight with an MDP, but with a different policy. | MIT |[nips](https://proceedings.neurips.cc/paper/2019/file/71560ce98c8250ce57a6a970c9991a5f-Paper.pdf) | Tensorflow | NIPS 2019 |  Reinforce\n|[REGAL](https://github.com/deepmind/deepmind-research/tree/master/regal)|A deep reinforcement learning approach to minimizing the execution cost of neural network computation graphs in an optimizing compiler. |Google|[openreview](https://openreview.net/pdf?id=rkxDoJBYPB) | Unknown |ICLR 2020 |RL with Genetic Algorithm\n|[PipeDream](https://github.com/msr-fiddle/pipedream) |This repository contains the source code implementation of PipeDream and PipeDream-2BW | Microsoft Fiddle| [arxiv](https://arxiv.org/pdf/1806.03377.pdf) | PyTorch | 2018 on arxiv, SOSP 2019 | Dynamic Programming with Profile\n|PipeDream-2BW | See the entry above | Microsoft |[arxiv](https://arxiv.org/pdf/2006.09503.pdf), [mlr.press](http://proceedings.mlr.press/v139/narayanan21a/narayanan21a.pdf)  | PyTorch | PMLR 139, 2021 | Dynamic Programming with Profile\n|[DNN-partitioning](https://github.com/msr-fiddle/dnn-partitioning)| Published at NeurIPS 2020. 
| Microsoft Fiddle| [arxiv](https://arxiv.org/pdf/2006.16423.pdf) | proof-of-concept implementation | NIPS 2020 |Dynamic Programming and Integer Programming\n|HetPipe| Enabling Large DNN Training on (Whimpy) Heterogeneous GPU Clusters through Integration of Pipelined Model Parallelism and Data Parallelism | UNIST | [usenix](https://www.usenix.org/system/files/atc20-park.pdf) | PyTorch (not open sourced) | USENIX ATC 2020 | Uses CPLEX to solve a linear programming problem\n|[DAPPLE](https://github.com/AlibabaPAI/DAPPLE) | An Efficient Pipelined Data Parallel Approach for Training Large Models. Succeeds GPipe | Alibaba | [arxiv](https://arxiv.org/pdf/2007.01045.pdf) | DAPPLE | 2020 on arxiv; PPoPP 21 | Dynamic Programming\n|[PipeTransformer](https://github.com/Distributed-AI/PipeTransformer) |Automated Elastic Pipelining for Distributed Training of Transformers | University of Southern California | [arxiv](https://arxiv.org/pdf/2102.03161.pdf) |PyTorch |  ICML 21 | Dynamic Programming\n|[Chimera](https://github.com/Shigangli/Chimera) | Efficiently training large-scale neural networks with bidirectional pipelines | Department of Computer Science, ETH Zurich, Switzerland | [dl.acm](https://dl.acm.org/doi/pdf/10.1145/3458817.3476145) | PyTorch | SC 2021 | Performance model with brute force\n| TAPP | Uses an attention-based Seq2Seq model to predict the stage for each layer. | Hohai University | [mdpi](https://www.mdpi.com/2076-3417/11/11/4785/pdf) | Unknown |Appl.sci. 2021, 11 | Reinforce Seq2Seq based on attention\n|[RaNNC](https://github.com/nict-wisdom/rannc/tree/main) | RaNNC is an automatic parallelization middleware used to train very large-scale neural networks. | DIRECT and University of Tokyo | [arxiv](http://arxiv.org/abs/2103.16063) | PyTorch | IPDPS 2021 | dynamic programming\n|[HeterPS](https://github.com/PaddlePaddle/Paddle/tree/develop/paddle/fluid/framework/fleet/heter_ps)| Distributed deep learning with RL-based scheduling in heterogeneous environments. 
| Baidu |  [arxiv](https://arxiv.org/pdf/2111.10635.pdf) | Paddle | 2021 | Reinforcement learning based\n|[FTPipe](https://github.com/saareliad/FTPipe) | FTPipe can automatically transform a sequential implementation into a multi-GPU one. | Technion-Israel Institute of Technology | [usenix](https://usenix.org/system/files/atc21-eliad.pdf) | PyTorch | 2021 | multiprocessor scheduling problem with profiling.\n\n## Data Parallelism + Intra-layer Model Parallelism (or Tensor Parallelism):\n\n|  Name  | Description | Organization or author | Paper| Framework| Year | Auto Methods |\n| --- | --- | --- | ---  | --- | --- | --- |\n|[OptCNN](https://github.com/flexflow/FlexFlow) | Auto-parallelism method for CNNs |Zhihao Jia | [mlr.press](http://proceedings.mlr.press/v80/jia18a/jia18a.pdf) | FlexFlow | PMLR 80, 2018 | Dynamic Programming based graph search algorithm\n|[FlexFlow](https://github.com/flexflow/FlexFlow) | A deep learning framework that accelerates distributed DNN training by automatically searching for efficient parallelization strategies | Zhihao Jia | [stanford](https://cs.stanford.edu/~zhihao/papers/sysml19a.pdf) |FlexFlow, compatible with PyTorch, Keras | SysML 2019 | MCMC\n|Tofu| Supporting Very Large Models using Automatic Dataflow Graph Partitioning | New York University | [dl.acm](https://dl.acm.org/doi/pdf/10.1145/3302424.3303953) | Not open-sourced | EuroSys 2019 | same as OptCNN\n|[AccPar](https://github.com/linghaosong/AccPar) |Tensor partitioning for heterogeneous deep learning accelerators. 
| Linghao Song from USC| [usc.edu](http://alchem.usc.edu/portal/static/download/accpar.pdf) | Needs manual deployment | 2019 on arxiv, HPCA 2020 | Dynamic Programming\n|[TensorOpt](https://github.com/mindspore-ai/mindspore/tree/master/mindspore/ccsrc/frontend/parallel/auto_parallel) | Exploring the Tradeoffs in Distributed DNN Training with Auto-Parallelism | CUHK \u0026 Huawei |  [arxiv](https://arxiv.org/pdf/2004.10856.pdf) | MindSpore | 2020 on arxiv | Dynamic Programming based graph search algorithm\n|[ROC](https://github.com/jiazhihao/ROC) | Another paper from Zhihao Jia. Designed for GNNs | Zhihao Jia | [mlsys](https://proceedings.mlsys.org/paper/2020/file/fe9fc289c3ff0af142b6d3bead98a923-Paper.pdf) | On top of Flexflow  | MLSys 2020 | Uses a novel online linear regression model for efficient graph partitioning and a dynamic programming algorithm to minimize data transfer cost.\n|[Double Recursive](https://github.com/mindspore-ai/mindspore/tree/master/mindspore/ccsrc/frontend/parallel/auto_parallel/rec_core) | A double-recursive algorithm to search strategies | Huawei | [link](https://link.springer.com/chapter/10.1007/978-3-030-85665-6_13) | MindSpore | Euro-Par 2021 | Double Recursive\n|[PaSE](https://github.com/baidu-research/PaSE) |PaSE uses a dynamic programming based approach to find an efficient strategy within a reasonable time. 
| Baidu Research | [ieee](https://github.com/baidu-research/PaSE/raw/master/docs/PaSE_ipdps2021.pdf) | prototype | IPDPS 2021 | Dynamic Programming\n|P^2| Offers a novel syntax-guided program synthesis framework that decomposes reductions over one or more parallelism axes into sequences of collectives in a hierarchy- and mapping-aware way |University of Cambridge \u0026 DeepMind | [arxiv](https://arxiv.org/pdf/2110.10548.pdf) | Simulation Experiment |2021 on arxiv, MLSys 2022 | Synthesis tool with simulation\n|AutoMap| Uses search and learning to find Megatron-like strategies | DeepMind | [arxiv](https://arxiv.org/pdf/2112.02958.pdf) | JAX python API, XLA backend | 2021 on arxiv, NIPS 2021 | Search: Monte Carlo Tree Search; Learn: Interactive Network\n\n## Data Parallelism + Model Parallelism (or Tensor Parallelism) + Pipeline Parallelism:\n\n|  Name  | Description | Organization or author | Paper| Framework| Year | Auto Methods|\n| --- | --- | --- | ---  | --- | --- | --- |\n|Auto-MAP| Works on HLO IR. Uses linkage groups to prune the search space and DQN RL to search DP, MP, and PP strategies. | Alibaba | [arxiv](https://arxiv.org/pdf/2007.04069.pdf) | RAINBOW DQN | 2020 | Reinforcement Learning\n|[Piper](https://github.com/msr-fiddle/piper) | This code package contains algorithms (proof-of-concept implementation) and input files (profiled DNN models / workloads) from the paper \"Piper: Multidimensional Planner for DNN Parallelization\" published at NeurIPS 2021. 
An extension of DNN partitioning| Microsoft Fiddle| [link](https://www.microsoft.com/en-us/research/publication/piper-multidimensional-planner-for-dnn-parallelization/) | proof-of-concept implementation | NIPS 2021 | two-level dynamic programming\n|[GSPMD](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/compiler/xla) |A system that uses simple tensor sharding annotations to achieve different parallelism paradigms in a unified way | Google | [arxiv](https://arxiv.org/pdf/2105.04663.pdf) | Tensorflow XLA | 2021 | sharding propagation\n|[DistIR](https://github.com/microsoft/dist-ir) | Horizontal TP. An intermediate representation and simulator for efficient neural network distribution | Stanford University \u0026 Microsoft Fiddle| [arxiv](https://arxiv.org/abs/2111.05426) | PyTorch | MLSys 2021 | Grid-Search Simulator\n|Neo | A software-hardware co-designed system for high-performance distributed training of large-scale DLRMs. | Facebook | [arxiv](https://export.arxiv.org/pdf/2104.05158.pdf) | PyTorch | 2021 | 1. Greedy 2. Karmarkar-Karp algorithm\n|Adaptive Paddle| Elastic training, fault tolerance, and cost-model-based sharding propagation |Baidu | [arxiv](https://arxiv.org/pdf/2112.02752.pdf) | Paddle | 2021 | Cost model based. Details not given.\n|[Alpa](https://github.com/alpa-projects/alpa) | Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | UC Berkeley, Google, etc. 
| [arxiv](https://arxiv.org/pdf/2201.12023.pdf) | Jax, XLA | 2022 | Integer linear programming for intra-op, dynamic programming for inter-op\n\n## Other interesting automatic work\n\n|  Name  | Description | Organization or author | Paper| Framework| Year | Auto Methods|\n| --- | --- | --- | ---  | --- | --- | --- |\n|[TASO](https://github.com/jiazhihao/TASO) | Automatically optimizes DNN computation with graph substitutions |  Zhihao Jia |\n\n---\n\n# Classification: Machine-Learning Based Methods and Classic Algorithm Based Methods\n\n## Machine-Learning Based Methods\n\n| Name | Method Type | Parallelism | Year |\n| --- | --- | --- | ---  |\n| ColocRL | Reinforcement | MP | 2017 |\n| HDP | Reinforcement | MP | 2018 |\n| GDP | Reinforcement | MP | 2019 |\n| REGAL | Reinforcement | MP | 2020 |\n| TAPP | Reinforcement | DP+PP | 2021 |\n| Spotlight | Reinforcement | DP+MP | 2018 |\n| Placeto | Reinforcement | DP+MP | 2019 |\n| HeterPS | Reinforcement | DP+PP | 2021 |\n| AutoMap | Deep learning to predict rank | DP+TP | 2021 |\n| Auto-MAP | Reinforcement | DP or TP or PP | 2020 |\n| FlexFlow | MCMC | DP+TP | 2019 |\n| ROC | Online linear regression + dynamic programming 
| DP+TP | 2020 |\n\n## Classic Algorithm Based Methods\n\n| Name | Method Type | Parallelism | Year |\n| --- | --- | --- | ---  |\n| Pesto | integer linear programming | MP | 2021 |\n| vPipe | SRP algorithm + KL (DP) | PP | 2022 |\n| PipeDream | dynamic programming | DP+PP | 2019 |\n| DNN-partitioning | dynamic programming + integer programming | DP+PP | 2020 |\n| PipeDream-2BW | dynamic programming | DP+PP | 2021 |\n| HetPipe | dynamic programming | DP+PP | 2020 |\n| DAPPLE | dynamic programming | DP+PP | 2021 |\n| PipeTransformer | dynamic programming | DP+PP | 2021 |\n| Chimera | Grid-Search | DP+PP | 2021 |\n| RaNNC | dynamic programming | DP+PP | 2021 |\n| FTPipe | Multiprocessor scheduling problem with profiling | DP+PP | 2021 |\n| OptCNN | dynamic programming | DP+TP | 2018 |\n| Tofu | dynamic programming | DP+TP | 2019 |\n| AccPar | dynamic programming | DP+TP | 2020 |\n| TensorOpt | dynamic programming | DP+TP | 2020 |\n| Double Recursive | Double recursive | DP+TP | 2021 |\n| PaSE | dynamic programming | DP+TP | 2021 |\n| P^2 | Synthesis tool with simulation | DP+TP | 2021 |\n| Piper | two-level dynamic programming | DP+TP+PP | 2021 |\n| GSPMD | heuristic propagation | DP+TP+PP | 2021 |\n| DistIR | grid search | DP+TP+PP | 2021 |\n| Neo | Greedy + Karmarkar-Karp algorithm | DP+TP+PP | 2021 |\n| Alpa | Integer programming + Dynamic Programming | DP+TP+PP | 2022 |\n\n---\n\n## Pictures\n\n### REINFORCE\n\n![img.png](Image/overall/reinforce.png)\n\n### Spotlight\n\n![img.png](Image/overall/spotlight.png)\n\n### GPipe\n\n![img.png](Image/overall/gpipe.png)\n\n### GDP\n\n![img.png](Image/overall/gdp.png)\n\n### Placeto\n\n![img.png](Image/overall/placeto.png)\n\n### REGAL\n\n![img.png](Image/overall/REGAL.png)\n\n# News\n\n2021.12.9 DeepMind proposes Gopher, a 280 billion parameter transformer language model. Trained on 4096 16GB\nTPUv3 chips. 
[link](https://deepmind.com/blog/article/language-modelling-at-scale)\n\n2021.12.8 Baidu and Peng Cheng Laboratory propose Wenxin (文心), a 260 billion parameter knowledge-aware pretrained model (a.k.a.\nERNIE 3.0 Titan). Trained with Adaptive Paddle from the table above.\n\n2021.10.26 Inspur formally announces a 245.7 billion parameter model at AICC 2021.","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Parallellism Training"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fawesome-Auto-Parallelism","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2Fawesome-Auto-Parallelism","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2Fawesome-Auto-Parallelism/lists"}