{"id":18077559,"url":"https://github.com/rentainhe/pytorch-distributed-training","last_synced_at":"2026-03-06T05:12:33.699Z","repository":{"id":38415082,"uuid":"326679377","full_name":"rentainhe/pytorch-distributed-training","owner":"rentainhe","description":"Simple tutorials on Pytorch DDP training","archived":false,"fork":false,"pushed_at":"2022-08-19T07:38:22.000Z","size":348,"stargazers_count":276,"open_issues_count":0,"forks_count":49,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-04-09T20:10:34.112Z","etag":null,"topics":["apex","cuda","ddp-training","deep-learning","pytorch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rentainhe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-01-04T12:38:53.000Z","updated_at":"2025-04-07T00:58:50.000Z","dependencies_parsed_at":"2022-07-12T17:29:04.004Z","dependency_job_id":null,"html_url":"https://github.com/rentainhe/pytorch-distributed-training","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rentainhe%2Fpytorch-distributed-training","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rentainhe%2Fpytorch-distributed-training/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rentainhe%2Fpytorch-distributed-training/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rentainhe%2Fpytorch-distributed-training/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rentainh
e","download_url":"https://codeload.github.com/rentainhe/pytorch-distributed-training/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248103872,"owners_count":21048245,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apex","cuda","ddp-training","deep-learning","pytorch"],"created_at":"2024-10-31T11:45:33.572Z","updated_at":"2025-10-14T12:10:52.673Z","avatar_url":"https://github.com/rentainhe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## pytorch-distributed-training\nDistributed Data Parallel (DDP) training in PyTorch\n\n### Features\n* An easy way to study DDP training\n* You can copy this code directly for a quick start\n* Learning notes (a checked box means finished):\n  - [x] [Basic Theory](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/0.%20Basic%20Theory.md)\n  - [x] [PyTorch Gradient Accumulation](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/1.%20Gradient%20Accumulation.md)\n  - [x] [More Details of DDP Training](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/2.%20DDP%20Training%20Details.md)\n  - [x] [DDP training with apex](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/4.%20DDP%20with%20apex.md)\n  - [x] [Accelerate-on-Accelerate DDP Training Tricks](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/3.%20DDP%20Training%20Tricks.md)\n  - [x] [DP and DDP 
source code walkthrough](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/5.%20DP%20and%20DDP.md)\n\n### Good Notes\nSome high-quality notes found online:\n- [分布式训练（理论篇）](https://zhuanlan.zhihu.com/p/129912419) (Distributed training: theory)\n- [当代研究生应当掌握的并行训练方法（单机多卡）](https://zhuanlan.zhihu.com/p/98535650) (Parallel training methods every graduate student should know: single machine, multiple GPUs)\n\n### TODO\n- [ ] Finish the DP and DDP source code walkthrough notes (currently 50% done)\n- [ ] Polish code details and reproduce the experimental results\n\n### Quick start\nTo run the code and see results directly, use the commands below. Be sure to pass `--ip` and `--port` to specify the master host's IP address and a free port; otherwise the scripts may fail to start.\n- [dataparallel.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/dataparallel.py)\n```bash\n$ python dataparallel.py --gpu 0,1,2,3\n```\n\n- [distributed.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed.py)\n```bash\n$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py\n```\n\n- [distributed_mp.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_mp.py)\n```bash\n$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py\n```\n\n- [distributed_apex.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_apex.py)\n```bash\n$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_apex.py\n```\n\n- `--ip=str`, e.g. `--ip='10.24.82.10'`, sets the IP address of the master process\n- `--port=int`, e.g. `--port=23456`, sets the port used for startup\n- `--batch_size=int`, e.g. `--batch_size=128`, sets the training batch size\n\n- [distributed_gradient_accumulation.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_gradient_accumulation.py)\n```bash\n$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_gradient_accumulation.py\n```\n- `--ip=str`, e.g. `--ip='10.24.82.10'`, sets the IP address of the master process\n- `--port=int`, e.g. `--port=23456`, sets the port used for startup\n- `--grad_accu_steps=int`, e.g. `--grad_accu_steps=4`, sets the number of gradient accumulation steps\n\n\n### Comparison\nThese results are approximate; they can vary considerably with GPU state.\n\n`SyncBatchNorm` is used by default in all runs. It slows execution somewhat, because computing `BatchNorm` statistics requires extra inter-process communication, but it helps preserve accuracy.\n\nConcepts\n- [apex](https://github.com/NVIDIA/apex)\n- DP: `DataParallel`\n- DDP: `DistributedDataParallel`\n\nEnvironments\n- 4 × 
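2080Ti\n\n|model|dataset|training method|time (seconds/epoch)|Top-1 accuracy|\n|:---:|:---:|:---:|:---:|:---:|\n|resnet18|cifar100|DP|20s| |\n|resnet18|cifar100|DP+apex|18s| |\n|resnet18|cifar100|DDP|16s| |\n|resnet18|cifar100|DDP+apex|14.5s| |\n\n### Basic Concept\n- group: the process group; by default there is only one.\n- world size: the total number of processes\n  - e.g. 16 GPUs with `one process per GPU`: world size = 16\n  - `one process driving 8 GPUs`: world size = 1\n  - the program proceeds only once the number of connected processes equals the world size\n- rank: the process index, used for inter-process communication; it also indicates process priority, and `rank=0` is the `master process`\n- local_rank: the `GPU` index within a process. It is not an explicit argument; `torch.distributed.launch` sets it internally. `rank=3, local_rank=0` refers to the first `GPU` of the process with rank `3`\n\n\n### Usage: single machine, multiple GPUs\n#### 1. Get the index of the current process\nPyTorch's `torch.distributed.launch` launcher runs a .py file as multiple distributed processes from the command line, and passes each process its index through the `--local_rank` argument.\n```python\nimport argparse\nparser = argparse.ArgumentParser()\nparser.add_argument('--local_rank', default=-1, type=int,\n                    help='node rank for distributed training')\nargs = parser.parse_args()\nprint(args.local_rank)\n```\n#### 2. Define the main_worker function\nThe main training flow lives in `main_worker`, which takes three arguments (the last one is optional):\n```python\ndef main_worker(local_rank, nprocs, args):\n    ...  # training code goes here\n```\n- local_rank: receives the rank of the current process; on a single machine with multiple GPUs it is also the index of the GPU to use\n- nprocs: the number of processes\n- args: any extra user-defined arguments\n\n`main_worker` is the function each process runs: every process executes the same code and differs only in the `local_rank` it receives.\n\n#### 3. 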
Overall flow inside main_worker\nThe complete training flow inside `main_worker`:\n```python\nimport torch\nimport torch.distributed as dist\nimport torch.backends.cudnn as cudnn\n\ndef main_worker(local_rank, nprocs, args):\n    args.local_rank = local_rank\n    # Distributed initialization; every process must do this\n    cudnn.benchmark = True\n    # Replace ip:port with the master address and a free port\n    dist.init_process_group(backend='nccl', init_method='tcp://ip:port', world_size=nprocs, rank=local_rank)\n    # Define the model, loss function, and optimizer\n    model = ...\n    criterion = ...\n    optimizer = ...\n    # Bind this process to its GPU\n    torch.cuda.set_device(local_rank)\n    model.cuda(local_rank)\n    # Wrap the model for distributed training\n    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])\n\n    # Define the datasets, using DistributedSampler\n    mini_batch_size = args.batch_size // nprocs  # manually split batch_size into per-process mini-batches\n    train_dataset = ...\n    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)\n    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=...,\n                                              sampler=train_sampler)\n\n    test_dataset = ...\n    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)\n    testloader = torch.utils.data.DataLoader(test_dataset, batch_size=mini_batch_size, num_workers=..., pin_memory=...,\n                                             sampler=test_sampler)\n\n    # The usual training loop\n    for epoch in range(300):\n        train_sampler.set_epoch(epoch)  # reshuffle the data differently each epoch\n        model.train()\n        for batch_idx, (images, target) in enumerate(trainloader):\n            images = images.cuda(non_blocking=True)\n            target = target.cuda(non_blocking=True)\n            ...\n            pred = model(images)\n            loss = criterion(pred, target)\n            ...\n            optimizer.zero_grad()\n            loss.backward()\n            optimizer.step()\n```\n\n#### 4. 
Define the main function\n```python\nimport argparse\nimport torch\nparser = argparse.ArgumentParser(description='PyTorch ImageNet Training')\nparser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')\nparser.add_argument('--batch_size','--batch-size', default=256, type=int)\nparser.add_argument('--lr', default=0.1, type=float)\n\ndef main_worker(local_rank, nprocs, args):\n    ...\n\ndef main():\n    args = parser.parse_args()\n    args.nprocs = torch.cuda.device_count()\n    # Run main_worker in this process\n    main_worker(args.local_rank, args.nprocs, args)\n\nif __name__ == '__main__':\n    main()\n```\n\n#### 5. Launch from the command line\n```bash\n$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py\n```\n\n- `--ip=str`, e.g. `--ip='10.24.82.10'`, sets the IP address of the master process\n- `--port=int`, e.g. `--port=23456`, sets the port used for startup\n\nLauncher arguments:\n- --nnodes: the number of machines\n- --node_rank: the index of the current machine\n- --nproc_per_node: the number of processes on each machine\n\nSee [distributed.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed.py)\n\n#### 6. 
torch.multiprocessing\nUsing `torch.multiprocessing` to spawn the worker processes avoids problems that can arise from uncoordinated process control. This approach is fairly stable and is recommended.\n```python\nimport argparse\nimport torch\nimport torch.multiprocessing as mp\n\nparser = argparse.ArgumentParser(description='PyTorch ImageNet Training')\nparser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')\nparser.add_argument('--batch_size','--batch-size', default=256, type=int)\nparser.add_argument('--lr', default=0.1, type=float)\n\ndef main_worker(local_rank, nprocs, args):\n    ...\n\ndef main():\n    args = parser.parse_args()\n    args.nprocs = torch.cuda.device_count()\n    # Hand main_worker to mp.spawn, which passes the process index as the first argument (local_rank)\n    mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))\n\nif __name__ == '__main__':\n    main()\n```\n\n\nSee [distributed_mp.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_mp.py). Launch it as follows:\n```bash\n$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py\n```\n\n- `--ip=str`, e.g. `--ip='10.24.82.10'`, sets the IP address of the master process\n- `--port=int`, e.g. `--port=23456`, sets the port used for startup\n\n\n## Reference\nThis tutorial draws on the following articles (if an article with similar content is not credited here, please open an issue and I will add it; my apologies):\n- [Pytorch: DDP系列](https://zhuanlan.zhihu.com/p/178402798) (PyTorch: DDP series)\n- [分布式训练](https://zhuanlan.zhihu.com/p/98535650) (Distributed training)\n- [分布式训练（理论篇）](https://zhuanlan.zhihu.com/p/129912419) (Distributed training: theory)\n- [DistributedSampler的问题](https://www.zhihu.com/question/67209417/answer/1017851899) (Questions about DistributedSampler)\n- I learned from this [repo](https://github.com/tczhangzhi/pytorch-distributed) and wanted to make it simpler and cleaner.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frentainhe%2Fpytorch-distributed-training","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frentainhe%2Fpytorch-distributed-training","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frentainhe%2Fpytorch-distributed-training/lists"}