## pytorch-distributed-training
Distributed DataParallel (DDP) training with PyTorch

### Features
* Easy-to-study DDP training examples
* You can copy the code directly for a quick start
* Shared learning notes (`√` means finished):
- [x] [Basic Theory](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/0.%20Basic%20Theory.md)
- [x] [Pytorch Gradient Accumulation](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/1.%20Gradient%20Accumulation.md)
- [x] [More Details of DDP Training](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/2.%20DDP%20Training%20Details.md)
- [x] [DDP training with apex](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/4.%20DDP%20with%20apex.md)
- [x] [Accelerate-on-Accelerate DDP Training Tricks](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/3.%20DDP%20Training%20Tricks.md)
- [x] [DP and DDP Source Code Walkthrough](https://github.com/rentainhe/pytorch-distributed-training/blob/master/tutorials/5.%20DP%20and%20DDP.md)

### Good Notes
Some high-quality notes shared online:
- [Distributed Training (Theory)](https://zhuanlan.zhihu.com/p/129912419)
- [Parallel Training Methods Every Graduate Student Should Know (Single Machine, Multi-GPU)](https://zhuanlan.zhihu.com/p/98535650)

### TODO
- [ ] Finish the DP and DDP source code walkthrough notes (currently about 50% done)
- [ ] Polish the code details and reproduce the experimental results

### Quick start
If you just want to run the scripts and look at the results, use the commands below. Be sure to use `--ip` and `--port` to specify the master node's IP address and a free port, otherwise the scripts may fail to run (see the sketch after the command list for how these flags are presumably used).
- [dataparallel.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/dataparallel.py)
```bash
$ python dataparallel.py --gpu 0,1,2,3
```

- [distributed.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
```

- [distributed_mp.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_mp.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py
```

- [distributed_apex.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_apex.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_apex.py
```

- `--ip=str`, e.g. `--ip='10.24.82.10'`, to specify the master process's IP address
- `--port=int`, e.g. `--port=23456`, to specify the port to launch on
- `--batch_size=int`, e.g. `--batch_size=128`, to set the training batch size

- [distributed_gradient_accumulation.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_gradient_accumulation.py)
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_gradient_accumulation.py
```
- `--ip=str`, e.g. `--ip='10.24.82.10'`, to specify the master process's IP address
- `--port=int`, e.g. `--port=23456`, to specify the port to launch on
- `--grad_accu_steps=int`, e.g. `--grad_accu_steps=4`, to set the number of gradient accumulation steps
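
These flags are presumably combined into the TCP rendezvous address passed to `init_process_group` (see step 3 in the Usage section below); a minimal sketch with assumed argument names, not code taken from this repo:
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--ip', default='127.0.0.1', type=str)   # master node IP
parser.add_argument('--port', default=23456, type=int)       # a free port on the master node
parser.add_argument('--batch_size', default=128, type=int)
args = parser.parse_args()

# Assumed pattern: join the two flags into a TCP init method, e.g.
#   torch.distributed.init_process_group(backend='nccl',
#                                        init_method=f'tcp://{args.ip}:{args.port}',
#                                        world_size=nprocs, rank=local_rank)
init_method = f'tcp://{args.ip}:{args.port}'
print(init_method)
```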

### Comparison
These results are rough; they can vary quite a bit depending on the state of the GPUs.

`SyncBatchNorm` is used by default in every run. It slows training down somewhat, because extra inter-process communication is needed to compute the `BatchNorm` statistics, but it helps keep the accuracy consistent.
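
The exact call is not shown in this README, but in PyTorch the usual way to get `SyncBatchNorm` is to convert an ordinary model before wrapping it in DDP. A sketch, assuming the process group is already initialized as in the Usage section below and that `local_rank` is the current process's GPU:
```python
import torch
import torchvision

local_rank = 0  # in real code: the GPU index of the current process
model = torchvision.models.resnet18(num_classes=100)

# Replace every BatchNorm layer with SyncBatchNorm so that statistics are
# computed across all processes (the extra communication mentioned above)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

model = model.cuda(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```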

Concepts
- [apex](https://github.com/NVIDIA/apex)
- DP: `DataParallel`
- DDP: `DistributedDataParallel`

Environments
- 4 × 2080Ti

|model|dataset|training method|time (seconds/epoch)|Top-1 accuracy|
|:---:|:---:|:---:|:---:|:---:|
|resnet18|cifar100|DP|20s| |
|resnet18|cifar100|DP+apex|18s| |
|resnet18|cifar100|DDP|16s| |
|resnet18|cifar100|DDP+apex|14.5s| |
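
For reference, the DP and DDP rows above correspond to the following two wrapping styles (a minimal sketch, not taken from this repo's scripts; the single-process group here is only to make the DDP wrap runnable on one machine):
```python
import os
import torch
import torch.nn as nn
import torch.distributed as dist

model = nn.Linear(10, 10).cuda()

# DP: a single process drives all visible GPUs; the default GPU gathers
# the outputs and can become a bottleneck.
dp_model = nn.DataParallel(model)

# DDP: one process per GPU, gradients synchronized with all-reduce.
# It needs an initialized process group; a 1-process group suffices to illustrate the wrap.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
os.environ.setdefault('MASTER_PORT', '23456')
dist.init_process_group(backend='nccl', world_size=1, rank=0)
ddp_model = nn.parallel.DistributedDataParallel(model, device_ids=[0])
```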

### Basic Concepts
- group: the process group; by default there is only one group.
- world size: the total number of processes.
  - e.g. 16 GPUs with `one process per GPU`: world size = 16
  - 8 GPUs driven by `a single process`: world size = 1
  - the program only starts running once the number of connected processes reaches the world size
- rank: the index of a process, used for inter-process communication; it also reflects the process's priority, and `rank=0` is the `master process`.
- local_rank: the `GPU` index used inside a process; it is not passed explicitly but set internally by `torch.distributed.launch`. `rank=3, local_rank=0` means the first `GPU` of the process with rank `3`.
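
Once `init_process_group` has been called (see the Usage section below), these values can be read at runtime. A small sketch; note that the `LOCAL_RANK` environment variable is set by newer launchers such as `torchrun`, while the older `torch.distributed.launch` passes `--local_rank` as a command-line argument instead, as shown in the next section:
```python
import os
import torch.distributed as dist

def describe_process():
    # Assumes dist.init_process_group(...) has already been called in this process
    rank = dist.get_rank()              # global process index: 0 .. world_size - 1
    world_size = dist.get_world_size()  # total number of processes in the default group
    local_rank = int(os.environ.get('LOCAL_RANK', 0))  # GPU index on this machine
    print(f'rank {rank} of {world_size}, using local GPU {local_rank}')
```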

### Usage (single machine, multiple GPUs)
#### 1. Get the index of the current process
PyTorch's `torch.distributed.launch` launcher runs a `.py` file as several distributed processes from the command line and passes each process's index to the Python script as an argument:
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', default=-1, type=int,
                    help='node rank for distributed training')
args = parser.parse_args()
print(args.local_rank)
```
#### 2. Define the main_worker function
The main training flow lives in main_worker, which takes three arguments (the last one is optional):
```python
def main_worker(local_rank, nprocs, args):
    # training ...
```
- local_rank: the rank of the current process; in the single-machine multi-GPU case it is also the index of the GPU to use
- nprocs: the number of processes
- args: any extra arguments you define yourself

main_worker is the function that every process runs: each process executes exactly the same function body, only the local_rank passed in differs, as the small sketch below illustrates.
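
A tiny standalone sketch (CPU-only, no process group needed) that shows each spawned process running the same function with a different index; the toy arguments are made up for illustration:
```python
import torch.multiprocessing as mp

def main_worker(local_rank, nprocs, args):
    # Every process executes this same body; only local_rank differs
    print(f'worker {local_rank} of {nprocs}, extra args: {args}')

if __name__ == '__main__':
    # mp.spawn calls main_worker(i, *args) for i in range(nprocs)
    mp.spawn(main_worker, nprocs=4, args=(4, {'lr': 0.1}))
```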

#### 3. The overall flow inside main_worker
The complete training flow inside main_worker:
```python
import torch
import torch.distributed as dist
import torch.backends.cudnn as cudnn

def main_worker(local_rank, nprocs, args):
    args.local_rank = local_rank
    # Distributed initialization: every process must call init_process_group
    cudnn.benchmark = True
    dist.init_process_group(backend='nccl', init_method='tcp://ip:port',  # use the master node's real ip/port (the repo's --ip/--port)
                            world_size=nprocs, rank=local_rank)
    # Define the model, loss function and optimizer
    model = ...
    criterion = ...
    optimizer = ...
    # Bind this process to its GPU
    torch.cuda.set_device(local_rank)
    model.cuda(local_rank)
    # Wrap the model for distributed training
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])

    # Datasets: use DistributedSampler so each process gets its own shard
    mini_batch_size = args.batch_size // nprocs  # split the global batch_size into per-process mini-batches
    train_dataset = ...
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    trainloader = torch.utils.data.DataLoader(train_dataset, batch_size=mini_batch_size,
                                              num_workers=..., pin_memory=...,
                                              sampler=train_sampler)

    test_dataset = ...
    test_sampler = torch.utils.data.distributed.DistributedSampler(test_dataset)
    testloader = torch.utils.data.DataLoader(test_dataset, batch_size=mini_batch_size,
                                             num_workers=..., pin_memory=...,
                                             sampler=test_sampler)

    # Regular training loop
    for epoch in range(300):
        model.train()
        train_sampler.set_epoch(epoch)  # reshuffle the shards every epoch
        for batch_idx, (images, target) in enumerate(trainloader):
            images = images.cuda(non_blocking=True)
            target = target.cuda(non_blocking=True)
            ...
            pred = model(images)
            loss = criterion(pred, target)
            ...
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
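
One detail the loop above glosses over: each process only sees its own shard of the data, so per-process metrics such as the loss are usually averaged across processes before logging. A helper along these lines (not part of this repo's code) is common:
```python
import torch.distributed as dist

def reduce_mean(tensor, nprocs):
    # Average a metric tensor across all processes in the default group
    rt = tensor.clone()
    dist.all_reduce(rt, op=dist.ReduceOp.SUM)
    rt /= nprocs
    return rt

# inside the loop, after computing the loss:
# avg_loss = reduce_mean(loss.detach(), nprocs)
```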

#### 4. Define the main function
```python
import argparse
import torch

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size', '--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # Run main_worker; torch.distributed.launch starts one such process per GPU
    main_worker(args.local_rank, args.nprocs, args)

if __name__ == '__main__':
    main()
```

#### 5. Command-line launch
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 distributed.py
```

- `--ip=str`, e.g. `--ip='10.24.82.10'`, to specify the master process's IP address
- `--port=int`, e.g. `--port=23456`, to specify the port to launch on

Argument descriptions:
- `--nnodes`: the number of machines
- `--node_rank`: the index of the current machine
- `--nproc_per_node`: the number of processes on each machine

See [distributed.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed.py).

#### 6. torch.multiprocessing
Use `torch.multiprocessing` so that the script spawns and manages its own processes, which avoids the problems that can arise when process control is left to the launcher; this approach is more stable and is the recommended one.
```python
import argparse
import torch
import torch.multiprocessing as mp

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('--local_rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--batch_size', '--batch-size', default=256, type=int)
parser.add_argument('--lr', default=0.1, type=float)

def main_worker(local_rank, nprocs, args):
    ...

def main():
    args = parser.parse_args()
    args.nprocs = torch.cuda.device_count()
    # Hand main_worker to mp.spawn: it starts nprocs processes and passes
    # each one its index as the first argument
    mp.spawn(main_worker, nprocs=args.nprocs, args=(args.nprocs, args))

if __name__ == '__main__':
    main()
```

See [distributed_mp.py](https://github.com/rentainhe/pytorch-distributed-training/blob/master/distributed_mp.py); launch it as follows:
```bash
$ CUDA_VISIBLE_DEVICES=0,1,2,3 python distributed_mp.py
```

- `--ip=str`, e.g. `--ip='10.24.82.10'`, to specify the master process's IP address
- `--port=int`, e.g. `--port=23456`, to specify the port to launch on

## Reference
The referenced articles are listed below (if an article with similar content is missing from the list, please open an issue and I will add it; my apologies):
- [PyTorch DDP series](https://zhuanlan.zhihu.com/p/178402798)
- [Distributed training](https://zhuanlan.zhihu.com/p/98535650)
- [Distributed training (theory)](https://zhuanlan.zhihu.com/p/129912419)
- [Problems with DistributedSampler](https://www.zhihu.com/question/67209417/answer/1017851899)
- I learned from this [repo](https://github.com/tczhangzhi/pytorch-distributed), and want to make it easier and cleaner.