{"id":18277115,"url":"https://github.com/js2hou/pytorch-ddp-analysis","last_synced_at":"2025-09-05T06:34:56.146Z","repository":{"id":179869100,"uuid":"462185665","full_name":"Js2Hou/Pytorch-DDP-Analysis","owner":"Js2Hou","description":null,"archived":false,"fork":false,"pushed_at":"2023-07-05T08:45:48.000Z","size":11,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T04:31:43.360Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Js2Hou.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-02-22T07:30:54.000Z","updated_at":"2024-08-12T12:52:05.000Z","dependencies_parsed_at":null,"dependency_job_id":"35be5d07-a94a-4e14-8af9-bbb39bab85ae","html_url":"https://github.com/Js2Hou/Pytorch-DDP-Analysis","commit_stats":null,"previous_names":["js2hou/pytorch-ddp-analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Js2Hou/Pytorch-DDP-Analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Js2Hou%2FPytorch-DDP-Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Js2Hou%2FPytorch-DDP-Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Js2Hou%2FPytorch-DDP-Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Js2Hou%2FPytorch-DDP-Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Js2H
ou","download_url":"https://codeload.github.com/Js2Hou/Pytorch-DDP-Analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Js2Hou%2FPytorch-DDP-Analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273723020,"owners_count":25156300,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-05T12:18:08.996Z","updated_at":"2025-09-05T06:34:56.136Z","avatar_url":"https://github.com/Js2Hou.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pytorch DDP Analysis\r\n\r\n## Introduction\r\n\r\nThis project studies distributed model training in `PyTorch` and analyzes its efficiency. We run the same code on one GPU, two GPUs, and four GPUs (two GPUs per node) and record the runtime of each. \r\n\r\nExperiments show that training on 2 GPUs saves roughly 42% to 46% of the time compared with a single GPU, but training on 4 GPUs (2 GPUs per node) takes longer than on 2 GPUs and is only marginally faster than on a single GPU. One possible reason is that our dataset (cifar100) is too small. 
Nonetheless, it suggests that we should run code on a single node instead of multiple nodes to achieve better performance.\r\n\r\nIn addition, we provide a template for `PyTorch` distributed training.\r\n\r\n## Structure\r\n\r\n```\r\nTemplate/\r\n|-- data/\r\n|\r\n|-- models/\r\n|   |-- __init__.py\r\n|   |-- models.py\r\n|\r\n|-- scripts/\r\n|   |-- train_single_gpu.sh  # script for training on a single GPU\r\n|   |-- train_single_node.sh  # script for training on a single node with multiple GPUs\r\n|   |-- train_multi_nodes.sh  # script for training on multiple nodes\r\n|\r\n|-- dataset.py\r\n|-- main.py\r\n|-- metrics.py\r\n|-- engine.py\r\n|-- utils.py\r\n|-- requirements.txt\r\n|-- README\r\n```\r\n\r\n- `dataset.py`: loads data and applies data augmentation\r\n- `models/models.py`: implement your model here\r\n- `scripts`: scripts for starting training\r\n    - `train_single_gpu.sh`: trains the model on a single GPU\r\n    - `train_single_node.sh`: trains the model on a single node with multiple GPUs\r\n    - `train_multi_nodes.sh`: trains the model on multiple nodes. 
Execute this script on each node to start distributed training.\r\n\r\n## Experiment on DDP\r\n\r\n### Setting\r\n\r\n- device: TITAN RTX\r\n- model: resnet18\r\n- dataset: cifar100\r\n- epochs: 10\r\n- batch size: 128\r\n- data augmentation:\r\n    ```python\r\n    transform = create_transform(\r\n        input_size=input_size,\r\n        is_training=True,\r\n        color_jitter=0.4,\r\n        auto_augment='rand-m9-mstd0.5-inc1',\r\n        interpolation='bicubic',\r\n        re_prob=0.25,\r\n        re_mode='pixel',\r\n        re_count=1,\r\n    )\r\n    ```\r\n\r\n### Results\r\n\r\n- single GPU: 555.0834 s\r\n- 2 GPUs on a single node (launched by `torch.distributed.launch`, which is deprecated in favor of `torchrun`): 301.9986 s\r\n- 2 GPUs on a single node (launched by `torchrun`): 324.2550 s\r\n- 4 GPUs on two nodes (launched by `torchrun`): 549.2544 s\r\n\r\n## Usage\r\n\r\nFirst, clone the repository locally:\r\n```\r\ngit clone https://github.com/js2hou/Pytorch-DDP-Analysis.git\r\n```\r\nThen, install the requirements:\r\n```\r\npip install -r requirements.txt\r\n```\r\nLoad your dataset in `dataset.py` and implement your models in `models/models.py`. Remember to update the code that instantiates the model in `main.py`. Finally:\r\n\r\n- run `./scripts/train_single_gpu.sh` to train on a single GPU\r\n- run `./scripts/train_single_node.sh` to train on multiple GPUs on a single node (recommended)\r\n- run `./scripts/train_multi_nodes.sh` on each node to train on multiple nodes\r\n\r\n\r\n## License\r\n\r\nThis repository is released under the Apache 2.0 license as found in the [LICENSE](LICENSE) file.\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjs2hou%2Fpytorch-ddp-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjs2hou%2Fpytorch-ddp-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjs2hou%2Fpytorch-ddp-analysis/lists"}