{"id":13604177,"url":"https://github.com/MachineLearningSystem/AMP","last_synced_at":"2025-04-11T23:32:05.275Z","repository":{"id":185461615,"uuid":"567154753","full_name":"MachineLearningSystem/AMP","owner":"MachineLearningSystem","description":"Automatically finding good model-parallel strategies, especially for complex models and clusters.","archived":false,"fork":true,"pushed_at":"2022-11-04T21:58:41.000Z","size":98703,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-11-07T08:42:29.222Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":"DachengLi1/AMP","license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/MachineLearningSystem.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2022-11-17T07:25:08.000Z","updated_at":"2022-11-16T10:53:14.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/MachineLearningSystem/AMP","commit_stats":null,"previous_names":["machinelearningsystem/amp"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FAMP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FAMP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FAMP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/MachineLearningSystem%2FAMP/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/MachineLearningSystem","download_url":"http
s://codeload.github.com/MachineLearningSystem/AMP/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248495061,"owners_count":21113559,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T19:00:41.118Z","updated_at":"2025-04-11T23:32:00.267Z","avatar_url":"https://github.com/MachineLearningSystem.png","language":null,"readme":"# AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness (NeurIPS 2022) \n[**Paper**](https://arxiv.org/pdf/2210.07297.pdf) | \n[**Usage**](#usage) |\n[**Citation**](#citation) |\n[**Presentation**](https://recorder-v3.slideslive.com/?share=74667\u0026s=aa9ce793-0697-43bc-9d8f-f7b139471f95) \n\nThis repository contains the official code for our NeurIPS 2022 paper **AMP**. AMP is an **automatic** approach to finding fast model-parallel strategies for training large Deep Learning models. We design AMP to tackle real-world scenarios where users train **heterogeneous** models with uneven layers on **heterogeneous** clusters with mixed generations of GPUs. 
Concretely, it contributes\n- A valid **representation** of model-parallelism strategies.\n- A **cost model** that accurately predicts the running time of a strategy without launching expensive real trials.\n- An **automatic optimization** procedure that uses the cost model and a dynamic programming algorithm to efficiently find fast strategies.\n\n\u003cimg src=\"figures/workflow.png\" width=\"600\"\u003e\n\n## Performance \nAMP finds strategies with performance similar to the state-of-the-art strategy finder [[1]](#1) when there is no heterogeneity in the model or the cluster. AMP finds strategies that are **1.54x** better than the SOTA when heterogeneity exists in the cluster, and **1.77x** better when heterogeneity exists in the model. In particular, the cost model in AMP accurately predicts low costs for top strategies. \n\n\u003cimg src=\"figures/speedup.PNG\" width=\"600\"\u003e \u003cimg src=\"figures/cost_vs_real.png\" width=\"600\" \u003e\n\n## Usage\nWe provide two settings: (1) use AMP to predict top strategies; (2) additionally launch real trials with DeepSpeed to validate the ground-truth runtime. Setting 1 requires a single CPU, while Setting 2 requires 16 GPUs in AWS EC2 (we provide the instance details in the paper). 
We have installed the environment and prepared the necessary intermediate results for Setting 2 in an AMI for ease of setup.\n\n#### Set up environment for setting 1\n````\ncd ~\ngit clone https://github.com/MccRee17/AMP\nconda create -n amp python=3.7.3\nconda activate amp\nconda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests\npip install tqdm spur torch==1.7.1+cu110 torchvision==0.8.2+cu110 -f https://download.pytorch.org/whl/torch_stable.html\n````\n\n#### Set up environment for setting 2\nUse our AWS AMI:\n\n| AMI Name       | AMI ID                | Region    | Instances in the paper               |\n|----------------|-----------------------|-----------|--------------------------------------|\n| AMP-Oct-31     | ami-011d5dd7f6fe79d32 | us-west-2 | G4dn.12xlarge, P3.8xlarge            |\n\nLaunch the instances specified in the paper, e.g., 4 G4dn.12xlarge instances for the homogeneous experiment. Launch them within the **same** placement group using the **cluster** strategy in EC2 so that they have the maximum bandwidth. Assume that AWS assigns 4 machines with IP addresses IP1, IP2, IP3, and IP4. Then perform several steps on the master machine IP1:\n- SSH into IP[1-4] and exit, so that the public SSH keys of all machines are stored on IP1 and SSH verification does not prompt during trials.\n- Add IP[1-4] to ``~/hostfile`` and state the number of GPUs in each machine ([DeepSpeed Tutorial](https://www.deepspeed.ai/getting-started/)). For instance, all 4x4 clusters in the paper are specified by: \n````\nIP1 slots=4\nIP2 slots=4\nIP3 slots=4\nIP4 slots=4\n````\n- Activate our environment with ``source anaconda3/bin/activate; conda activate amp``.\n\nSuggestions: (1) Warm up **each** AWS machine before running; otherwise trials may get terminated by timeout. A simple warmup is ``python; import torch; a = torch.randn(100,100).cuda()``. (2) If some trials hang, one can manually log in to each machine and kill the GPU processes. 
The optimization algorithm runs on the CPU and will not be affected. A useful command to check processes on the GPU: ````sudo fuser -v /dev/nvidia*````. (3) If processes constantly get stuck, try removing all caches with ``rm -rf ~/amp_simulate; rm -rf ~/tmp``. If there are other blockers in launching distributed experiments, please open an issue here or send the author an [email](dacheng2@andrew.cmu.edu).\n\n### Experiment 1: Homogeneous\nWith Setting 1:\n````\ncd ~/AMP/src\npython homogeneous.py \n````\nThis will finish in around 500 seconds and store the result in ~/amp_main_logs/homogeneous_[time_stamp].txt.\n\nWith Setting 2:\n```` \ncd ~/AMP/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism\npython homogeneous.py --full --budget 10\n````\nThis will run the prediction and launch the top 10 predicted strategies. It will finish in around 1500 seconds and store the result in ~/amp_main_logs/homogeneous_[time_stamp].txt. To run x real trials, use the argument ````--budget x````. The raw log from our modified DeepSpeed contains many details, such as the pipeline schedule; we recommend redirecting it into a separate log.txt for further interpretation.\n\nCached results with 53 real trials are in AMP/src/results/homogeneous_results.txt and logs in AMP/src/results/homogeneous_log.txt.\n\n### Experiment 2: Heterogeneous cluster\nWith Setting 1:\n````\ncd ~/AMP/src\npython het_cluster.py \n````\nThis will finish in around 500 seconds and store the result in ~/amp_main_logs/het_cluster_[time_stamp].txt.\n\nWith Setting 2:\n```` \ncd ~/AMP/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism\npython het_cluster.py --full --budget 10\n````\nPrediction plus 10 real trials takes around 1600 seconds. 
Cached results with 53 real trials are in AMP/src/results/het_cluster_results.txt and logs in AMP/src/results/het_cluster_log.txt.\n\n### Experiment 3: Heterogeneous model\nWith Setting 1:\n````\ncd ~/AMP/src\npython het_model.py \n````\nThis will finish in around 200 seconds and store the result in ~/amp_main_logs/het_model_[time_stamp].txt.\n\nWith Setting 2:\n```` \ncd ~/AMP/DeepSpeed/DeepSpeedExamples/Megatron-LM-v1.1.5-3D_parallelism\npython het_model.py --full --budget 10\n````\nPrediction plus 10 real trials takes around 1300 seconds. Cached results with 65 real trials are in AMP/src/results/het_model_results.txt and logs in AMP/src/results/het_model_log.txt.\n\n## Code Logic\nThe basic logic of AMP is implemented in several files:\n- The main functions (homogeneous.py, het_cluster.py, het_model.py) iteratively apply the cost model and optimize. \n- cost_xxx.py implements the cost model.\n- pipe.py implements the dynamic programming algorithm.\n- sa.py provides candidate strategies for the main function. \n- amp_utils.py implements other utilities, such as launching real trials with given configurations.\n\n## Citation\nIf you find this repository useful, please cite our paper using\n````\n@article{li2022amp,\n  title={AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness},\n  author={Li, Dacheng and Wang, Hongyi and Xing, Eric and Zhang, Hao},\n  journal={arXiv preprint arXiv:2210.07297},\n  year={2022}\n}\n````\n## References\n\u003ca id=\"1\"\u003e[1]\u003c/a\u003e \nNarayanan, Deepak, et al. \"Efficient large-scale language model training on GPU clusters using Megatron-LM.\" Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 
2021.\n","funding_links":[],"categories":["Paper-Code"],"sub_categories":["Parallellism Training"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FAMP","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FMachineLearningSystem%2FAMP","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FMachineLearningSystem%2FAMP/lists"}