https://github.com/udthegooner/distributed-ml
Distributed training of a VGG-11 network on the CIFAR-10 dataset
allreduce ddp gloo pytorch vgg-11
- Host: GitHub
- URL: https://github.com/udthegooner/distributed-ml
- Owner: udthegooner
- Created: 2025-04-17T04:10:32.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-17T04:53:56.000Z (about 1 year ago)
- Last Synced: 2025-04-23T23:47:02.186Z (about 1 year ago)
- Topics: allreduce, ddp, gloo, pytorch, vgg-11
- Language: Python
- Homepage:
- Size: 4.88 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Distributed Training with AllReduce, DDP, and Custom Gather/Scatter
This repository implements three distributed training techniques in PyTorch: **AllReduce**, **Distributed Data Parallel (DDP)**, and custom **Gather/Scatter** operations. The code trains a **VGG11** model on the CIFAR-10 dataset and is intended for **CPU-only** execution.
## Files
### 1. `allreduce.py`
Contains the implementation of the **AllReduce** technique for averaging gradients across multiple nodes in a distributed setup. Because every node ends up with the same averaged gradients, every node applies the same update and the model weights stay synchronized during training.
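As a minimal sketch of the idea (not the repository's actual code), gradient averaging with `torch.distributed.all_reduce` can look like the following. The function names `average_gradients` and `init_local_group` are hypothetical, and the one-process Gloo group exists only so the example runs without a cluster.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


def init_local_group() -> None:
    """Start a one-process Gloo group so this sketch runs without a cluster."""
    rendezvous = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        "gloo", init_method=f"file://{rendezvous}", rank=0, world_size=1
    )


def average_gradients(model: nn.Module) -> None:
    """Replace each parameter's gradient with its mean over all ranks."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum the gradient over all ranks, then divide to obtain the mean.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size


if __name__ == "__main__":
    init_local_group()
    try:
        model = nn.Linear(4, 2)  # small stand-in for the VGG11 model
        model(torch.randn(8, 4)).sum().backward()
        average_gradients(model)  # call between backward() and optimizer.step()
    finally:
        dist.destroy_process_group()
```

In a real multi-node run the call sits between `loss.backward()` and `optimizer.step()`, so every rank steps with identical gradients.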
### 2. `ddp.py`
Implements training with **Distributed Data Parallel (DDP)**, a PyTorch wrapper for data-parallel training. Each process keeps a full replica of the model and trains on its own share of the data, and DDP averages the gradients across processes during the backward pass. This file sets up the environment for DDP training.
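The following is a hedged sketch of wrapping a CPU model in DDP, not the contents of `ddp.py`. The helpers `init_local_group` and `wrap_for_ddp` are hypothetical names, and a small linear layer stands in for VGG11 so the example runs quickly in a single process.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def init_local_group() -> None:
    """Start a one-process Gloo group so this sketch runs without a cluster."""
    rendezvous = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        "gloo", init_method=f"file://{rendezvous}", rank=0, world_size=1
    )


def wrap_for_ddp(model: nn.Module) -> DDP:
    """Wrap a CPU model; device_ids is omitted because no GPU is involved."""
    return DDP(model)


if __name__ == "__main__":
    init_local_group()
    try:
        ddp_model = wrap_for_ddp(nn.Linear(8, 10))
        criterion = nn.CrossEntropyLoss()
        loss = criterion(ddp_model(torch.randn(4, 8)), torch.randint(0, 10, (4,)))
        loss.backward()  # DDP all-reduces the gradients during this call
        print("gradient norm:", ddp_model.module.weight.grad.norm().item())
        del ddp_model
    finally:
        dist.destroy_process_group()
```

The process group must exist before the wrapper is constructed, which is why the initialization call comes first.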
### 3. `gather_scatter.py`
Includes custom **Gather** and **Scatter** operations, which collect model gradients from the nodes and distribute them back, so that every node applies a synchronized update to the model parameters during training.
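One way such an exchange could be built from `torch.distributed.gather` and `torch.distributed.scatter` is sketched below. It assumes rank 0 acts as the root that averages the gradients; the README does not state how `gather_scatter.py` does it, and the function names here are hypothetical.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn


def init_local_group() -> None:
    """Start a one-process Gloo group so this sketch runs without a cluster."""
    rendezvous = os.path.join(tempfile.mkdtemp(), "rendezvous")
    dist.init_process_group(
        "gloo", init_method=f"file://{rendezvous}", rank=0, world_size=1
    )


def gather_scatter_average(model: nn.Module, root: int = 0) -> None:
    """Average gradients on the root rank, then send the mean back to all ranks."""
    rank = dist.get_rank()
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        grad = param.grad
        # Only the root allocates receive buffers; other ranks pass None.
        gathered = (
            [torch.zeros_like(grad) for _ in range(world_size)]
            if rank == root
            else None
        )
        dist.gather(grad, gather_list=gathered, dst=root)
        if rank == root:
            mean = torch.stack(gathered).mean(dim=0)
            outgoing = [mean.clone() for _ in range(world_size)]
        else:
            outgoing = None
        # Each rank receives the averaged gradient into its own grad tensor.
        dist.scatter(grad, scatter_list=outgoing, src=root)


if __name__ == "__main__":
    init_local_group()
    try:
        model = nn.Linear(4, 2)  # small stand-in for the VGG11 model
        model(torch.randn(8, 4)).sum().backward()
        gather_scatter_average(model)
    finally:
        dist.destroy_process_group()
```

Compared with all-reduce, this design routes all traffic through one root node, which is simpler to reason about but makes the root a bottleneck as the node count grows.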
### 4. `model.py`
Defines the **VGG11** model architecture, which is used for training on the CIFAR-10 dataset. The model consists of several convolutional layers followed by fully connected layers for classification.
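For orientation, here is a sketch of a VGG11-style network sized for 32x32 CIFAR-10 images. The convolutional layout is the standard VGG-11 configuration; the classifier width of 512 and the class name `VGG11Sketch` are assumptions, and `model.py` may differ.

```python
import torch
import torch.nn as nn

# Standard VGG-11 layout: numbers are conv output channels, "M" is 2x2 max pooling.
VGG11_CFG = [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"]


class VGG11Sketch(nn.Module):
    def __init__(self, num_classes: int = 10) -> None:
        super().__init__()
        layers = []
        in_channels = 3  # RGB input
        for item in VGG11_CFG:
            if item == "M":
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
            else:
                layers += [
                    nn.Conv2d(in_channels, item, kernel_size=3, padding=1),
                    nn.BatchNorm2d(item),
                    nn.ReLU(inplace=True),
                ]
                in_channels = item
        self.features = nn.Sequential(*layers)
        # Five poolings shrink a 32x32 input to 1x1, leaving 512 features.
        # Hidden width 512 is an assumption chosen for small images.
        self.classifier = nn.Sequential(
            nn.Linear(512, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, 512),
            nn.ReLU(inplace=True),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))


if __name__ == "__main__":
    logits = VGG11Sketch()(torch.randn(2, 3, 32, 32))
    print(logits.shape)
```

The eight convolutional layers plus three linear layers give the eleven weight layers the name refers to.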
## Requirements
- Python 3.x
- PyTorch
- torchvision
- NumPy
- CIFAR-10 dataset (automatically downloaded)
## Setup
To set up and run the code:
1. Clone the repository:
```bash
git clone https://github.com/udthegooner/distributed-ml.git
cd distributed-ml
```
2. Install the required dependencies:
```bash
pip install torch torchvision numpy
```
## Running the Code
### Starting the Distributed Training
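The code uses **PyTorch's Distributed** framework (`torch.distributed`). To train on multiple nodes, run the `ddp.py` script once on every node, giving each node the same `--master-ip` and `--num-nodes` but its own `--rank`. The example below starts rank 0 of a 2-node job; the second node runs the same command with `--rank 1`: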
```bash
python ddp.py --master-ip "127.0.0.1" --num-nodes 2 --rank 0
```
- `--master-ip` is the IP address of the master node.
- `--num-nodes` specifies the number of nodes in the distributed setup.
- `--rank` is the rank of the current node (0 for the master node).
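The three flags above map naturally onto the arguments of `torch.distributed.init_process_group`. The sketch below shows one plausible mapping; the function names are hypothetical, the port `29500` is an assumed value that the README does not specify, and the `gloo` backend is taken from the repository's topics and CPU-only setup.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the launcher flags described in the list above."""
    parser = argparse.ArgumentParser(description="Distributed training launcher (sketch)")
    parser.add_argument("--master-ip", required=True, help="IP address of the rank-0 node")
    parser.add_argument("--num-nodes", type=int, required=True, help="total number of nodes")
    parser.add_argument("--rank", type=int, required=True, help="rank of this node")
    args = parser.parse_args(argv)
    if not 0 <= args.rank < args.num_nodes:
        parser.error("--rank must satisfy 0 <= rank < num-nodes")
    return args


def process_group_kwargs(args: argparse.Namespace, port: int = 29500) -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    The port is an assumption; every node must use the same value.
    """
    return {
        "backend": "gloo",  # CPU-friendly backend
        "init_method": f"tcp://{args.master_ip}:{port}",
        "rank": args.rank,
        "world_size": args.num_nodes,
    }


if __name__ == "__main__":
    demo = parse_args(["--master-ip", "127.0.0.1", "--num-nodes", "2", "--rank", "0"])
    print(process_group_kwargs(demo))
```

A script would then call `torch.distributed.init_process_group(**process_group_kwargs(args))`, which blocks until all `--num-nodes` processes have connected to the master.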
### Training Flow
1. **Gather/Scatter Operations**: The model's parameters are updated during training using custom gather and scatter functions, ensuring that all gradients are synchronized across nodes.
2. **DDP Setup**: With Distributed Data Parallel, each process trains its own replica of the model, and PyTorch averages the gradients across processes during the backward pass.
3. **Model**: The VGG11 model is trained on the CIFAR-10 dataset using the **CrossEntropyLoss** criterion and **SGD optimizer**.
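The flow above can be sketched as a single-epoch loop that reports loss and accuracy. This is an illustration under assumptions, not the repository's loop: the function name `train_one_epoch`, the learning rate of 0.1, the random stand-in data, and the small linear model are all hypothetical.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, optimizer, criterion):
    """Run one pass over the loader and return (mean loss, accuracy)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for inputs, targets in loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        # In the manual variants, gradient synchronization
        # (all-reduce or gather/scatter) would be called here.
        optimizer.step()
        total_loss += loss.item() * targets.size(0)
        correct += (outputs.argmax(dim=1) == targets).sum().item()
        seen += targets.size(0)
    return total_loss / seen, correct / seen


if __name__ == "__main__":
    # Random tensors shaped like CIFAR-10 images stand in for the real dataset.
    data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
    loader = DataLoader(data, batch_size=16)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # lr is an assumption
    criterion = nn.CrossEntropyLoss()
    for epoch in range(2):
        loss, acc = train_one_epoch(model, loader, optimizer, criterion)
        print(f"epoch {epoch}: loss={loss:.4f} accuracy={acc:.2%}")
```

Weighting each batch loss by its size keeps the epoch average correct even when the last batch is smaller than the others.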
### Model and Dataset
- **Model**: VGG11 - A deep neural network with convolutional layers, batch normalization, and ReLU activation.
- **Dataset**: CIFAR-10 - A benchmark dataset for image classification with 60,000 32x32 color images in 10 classes.
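The README does not say how the dataset is divided among nodes. A common approach in data-parallel training, shown here purely as an assumption, is `DistributedSampler`, which gives each rank a disjoint slice of the data. The helper `make_loader` is a hypothetical name, and a small tensor dataset stands in for CIFAR-10.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def make_loader(dataset, num_nodes: int, rank: int, batch_size: int = 64, seed: int = 0):
    """Build a loader that yields only this rank's share of the dataset."""
    # Passing num_replicas and rank explicitly means no process group is required.
    sampler = DistributedSampler(
        dataset, num_replicas=num_nodes, rank=rank, shuffle=True, seed=seed
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)


if __name__ == "__main__":
    dataset = TensorDataset(torch.arange(10))  # stand-in for CIFAR-10
    for rank in range(2):
        loader = make_loader(dataset, num_nodes=2, rank=rank)
        print(f"rank {rank} sees indices {sorted(loader.sampler)}")
```

When training for several epochs, calling `loader.sampler.set_epoch(epoch)` at the start of each epoch reshuffles the split consistently on all ranks.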
## Output
- The code outputs the training loss and accuracy after each epoch.
- Each node synchronizes its gradients with the other nodes using AllReduce or the custom gather/scatter operations.