# PipeDream: Pipeline Parallelism for DNN Training

This repository contains the source code implementation of the following
papers:
- "[PipeDream: Generalized Pipeline Parallelism for DNN Training](https://www.microsoft.com/en-us/research/publication/pipedream-generalized-pipeline-parallelism-for-dnn-training/)",
  which appeared at SOSP 2019 (`pipedream` branch).
- "[Memory-Efficient Pipeline-Parallel DNN Training](https://www.microsoft.com/en-us/research/publication/memory-efficient-pipeline-parallel-dnn-training/)",
  which appeared at ICML 2021 (`pipedream_2bw` branch).

This work was done as part of Microsoft Research's
[Project Fiddle](https://aka.ms/msr-fiddle). This source code is available
under the [MIT License](LICENSE.txt).

## Directory Structure

### `graph`

This contains a Python implementation of a graph, used by the PipeDream profiler
and optimizer.
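As a rough illustration of what such a profile graph might hold (the `Node` and `Graph` classes and their field names below are assumptions for exposition, not the actual representation in `graph/` or the `graph.txt` format):

```python
# Illustrative sketch only: each profiled layer becomes a node annotated with
# measurements the optimizer needs to place it into a pipeline stage.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    compute_time_ms: float      # measured forward + backward time
    activation_size_bytes: int  # output activation shipped to the next stage
    parameter_size_bytes: int   # weights held by this layer

@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)  # node_id -> successor ids

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def total_compute_time_ms(self):
        return sum(n.compute_time_ms for n in self.nodes.values())

# A toy three-layer chain, as a profiler run might record for a small model:
g = Graph()
g.add_node(Node("conv1", 4.0, 1 << 20, 1 << 16))
g.add_node(Node("conv2", 6.0, 1 << 19, 1 << 18))
g.add_node(Node("fc",    2.0, 1 << 12, 1 << 22))
g.add_edge("conv1", "conv2")
g.add_edge("conv2", "fc")
print(g.total_compute_time_ms())  # 12.0
```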
Profiling scripts in `profiler` generate graph profiles that can
then be ingested by the optimizer located in `optimizer` to generate a partitioned
model, which can then be fed to the PipeDream runtime.

### `profiler`

Instrumented PyTorch applications that return profiles that can be ingested by
the optimizer.

### `optimizer`

A Python implementation of PipeDream's optimizer.

### `runtime`

PipeDream's runtime, which implements model parallelism as well as input
pipelining in PyTorch. This can be combined with data parallelism to give hybrid
model and data parallelism, and input pipelining.

## Setup

### Software Dependencies

To run PipeDream, you will need an NVIDIA GPU with CUDA 10.0, GPU driver version 418.56, nvidia-docker2,
and Python 3. On a Linux server with NVIDIA GPU(s) and Ubuntu 16.04, these dependencies can be installed
using:

```bash
bash setup.sh
```

All dependencies are included in the `nvcr.io/nvidia/pytorch:19.05-py3` container, which can be downloaded using:

```bash
nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3
```

To run the PipeDream profiler, you will need to build a new Docker image, which can be done using the
Dockerfile in this directory. Note that the Dockerfile depends on the `pre_hook.patch` and
`requirements.txt` files in this directory.
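As background for the profiler mentioned above: its core job is to record per-layer compute times. A torch-free sketch of that idea (illustrative only; the real profiler instruments PyTorch itself, which is what `pre_hook.patch` is for, and the layer names and `profile_layers` helper here are invented for exposition):

```python
import time

def profile_layers(layers, x, trials=10):
    """Time each layer's forward pass averaged over `trials` runs.
    `layers` is a list of (name, fn) pairs forming a chain: each layer's
    output is fed to the next, mimicking a sequential model."""
    times = {}
    for name, fn in layers:
        start = time.perf_counter()
        for _ in range(trials):
            y = fn(x)
        times[name] = (time.perf_counter() - start) / trials
        x = y  # feed this layer's output forward
    return times

# Two stand-in "layers" over a plain list instead of real PyTorch modules:
layers = [
    ("double", lambda v: [2 * e for e in v]),
    ("square", lambda v: [e * e for e in v]),
]
profile = profile_layers(layers, list(range(1000)))
```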
This container can be built using:

```bash
docker build --tag <CONTAINER_NAME> .
```

The PyTorch Docker container can then be run using:

```bash
nvidia-docker run -it -v /mnt:/mnt --ipc=host --net=host <CONTAINER_NAME> /bin/bash
```

### Data

#### Image Classification
All image classification experiments are run using the ImageNet ILSVRC 2012 dataset.
This can be downloaded using the following command (within the Docker container above):

```bash
cd scripts; python download_imagenet.py --data_dir <DATASET_DIR>
```

Note that the ImageNet dataset is about 145GB, so this download script can take some time.

#### Translation
All translation experiments are run using the WMT En-De dataset, also used for the MLPerf
translation (RNN) task. This can be downloaded using the instructions in [the MLPerf
repository](https://github.com/mlperf/training_results_v0.5/tree/master/v0.5.0/nvidia/submission/code/translation/pytorch#2-directions).


## End-to-end Workflow

To run a demo, run the following commands (the optimizer and runtime have been verified to work unchanged in `nvcr.io/nvidia/pytorch:19.05-py3`).
More detailed instructions for each of the individual components are in the corresponding directory READMEs,
and more detailed instructions on how to run the main experiments in the SOSP paper are in [`EXPERIMENTS.md`](EXPERIMENTS.md).

[from `pipedream/profiler/image_classification`; you will need the changes to PyTorch listed above]
Note that the profiling step must be run with only a single GPU (hence the `CUDA_VISIBLE_DEVICES=0` before the command).

```bash
CUDA_VISIBLE_DEVICES=0 python main.py -a vgg16 -b 64 --data_dir <path to ImageNet directory>
```

[from `pipedream/optimizer`]

```bash
python optimizer_graph_hierarchical.py -f ../profiler/image_classification/profiles/vgg16/graph.txt -n 4 --activation_compression_ratio 1 -o vgg16_partitioned
```

[from `pipedream/optimizer`]

```bash
python convert_graph_to_model.py -f vgg16_partitioned/gpus=4.txt -n VGG16Partitioned -a vgg16 -o ../runtime/image_classification/models/vgg16/gpus=4 --stage_to_num_ranks 0:3,1:1
```

[from `pipedream/runtime/image_classification`; run on 4 GPUs (for example, a single server with 4 GPUs)]

```bash
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 0 --local_rank 0 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 1 --local_rank 1 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 2 --local_rank 2 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo
python main_with_runtime.py --module models.vgg16.gpus=4 -b 64 --data_dir <path to ImageNet> --rank 3 --local_rank 3 --master_addr <master IP address> --config_path models/vgg16/gpus=4/hybrid_conf.json --distributed_backend gloo
```

`<master IP address>` here is the IP address of the rank 0 process. On a single server with 4 GPUs, `localhost` can be specified.

When running data-parallel (DP) setups, please use the `nccl` backend for optimal performance. When running hybrid setups, please use
the `gloo` backend.


## Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.


## License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the [MIT](LICENSE.txt) license.
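For intuition about what the optimizer step in the workflow above computes: given per-layer compute times from the profiler, it chooses stage boundaries so that the slowest stage (the pipeline bottleneck) is as fast as possible. A deliberately simplified sketch under the assumption of contiguous stages and zero communication cost (the real `optimizer_graph_hierarchical.py` also models activation transfer and stage replication, and uses dynamic programming rather than brute force):

```python
import itertools

def best_partition(layer_times, num_stages):
    """Exhaustively split a chain of layers into contiguous stages,
    minimizing the slowest stage's total compute time (the bottleneck)."""
    n = len(layer_times)
    best, best_cuts = float("inf"), None
    # Each choice of num_stages - 1 cut points defines one partition.
    for cuts in itertools.combinations(range(1, n), num_stages - 1):
        bounds = (0, *cuts, n)
        bottleneck = max(sum(layer_times[a:b])
                         for a, b in zip(bounds, bounds[1:]))
        if bottleneck < best:
            best, best_cuts = bottleneck, bounds
    stages = [layer_times[a:b] for a, b in zip(best_cuts, best_cuts[1:])]
    return best, stages

# Toy example: 6 layers split across 3 pipeline stages.
bottleneck, stages = best_partition([4, 2, 3, 3, 2, 4], 3)
print(bottleneck, stages)  # bottleneck == 6, a perfectly balanced split
```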