https://github.com/MachineLearningSystem/CAGNET

Host: GitHub
URL: https://github.com/MachineLearningSystem/CAGNET
Owner: MachineLearningSystem
License: other
Fork: true (PASSIONLab/CAGNET)
Created: 2022-07-18T10:32:30.000Z (almost 3 years ago)
Default Branch: master
Last Pushed: 2022-07-14T21:15:34.000Z (almost 3 years ago)
Last Synced: 2024-11-07T10:41:27.107Z (8 months ago)
Size: 19.8 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-AI-system - Reducing Communication in Graph Neural Network Training SC'20

README

# CAGNET: Communication-Avoiding Graph Neural nETworks

## Description

CAGNET is a family of parallel algorithms for training GNNs that can asymptotically reduce communication compared to previous parallel GNN training methods. CAGNET algorithms are based on 1D, 1.5D, 2D, and 3D sparse-dense matrix multiplication, and are implemented with `torch.distributed` on GPU-equipped clusters. We also implement these parallel algorithms on a 2-layer GCN.

For more information, please read our ACM/IEEE SC'20 paper [Reducing Communication in Graph Neural Network Training](https://arxiv.org/pdf/2005.03300.pdf).
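
To make the 1D case concrete, here is a minimal single-process sketch of a block-row sparse-dense multiplication. It is an illustration under stated assumptions, not the repository's `torch.distributed` code. The adjacency matrix and the feature matrix are both split into contiguous row blocks, and each simulated process receives every feature block in turn (standing in for a broadcast) and multiplies it by the matching columns of its local adjacency rows. The function name `one_d_spmm` and the dict-per-row sparse format are hypothetical.

```python
def one_d_spmm(adj_rows, features, num_procs):
    """Simulate a 1D (block-row) sparse-dense product A @ H on one process.

    adj_rows: list of {column: value} dicts, one per row of the sparse matrix A.
    features: dense matrix H as a list of rows.
    num_procs: number of simulated processes (hypothetical parameter).
    """
    n = len(adj_rows)
    feat_dim = len(features[0])
    # Row boundaries of the block owned by each simulated process.
    bounds = [(n * p) // num_procs for p in range(num_procs + 1)]
    local_adj = [adj_rows[bounds[p]:bounds[p + 1]] for p in range(num_procs)]
    local_feat = [features[bounds[p]:bounds[p + 1]] for p in range(num_procs)]
    local_out = [[[0.0] * feat_dim for _ in block] for block in local_adj]

    for src in range(num_procs):
        # In a real run this block would be broadcast from process `src`.
        received = local_feat[src]
        lo, hi = bounds[src], bounds[src + 1]
        for rank in range(num_procs):
            for r, row in enumerate(local_adj[rank]):
                for col, val in row.items():
                    if lo <= col < hi:  # only columns covered by this block
                        for k in range(feat_dim):
                            local_out[rank][r][k] += val * received[col - lo][k]

    # Concatenate the per-process row blocks back into one matrix.
    return [row for block in local_out for row in block]


# Tiny 4-node example with 2 simulated processes.
adj = [{0: 1.0, 1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0, 3: 1.0}, {2: 1.0, 3: 1.0}]
feat = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
result = one_d_spmm(adj, feat, num_procs=2)
```

In this layout every process ends up receiving the whole feature matrix over the course of the loop, which is the communication cost that the 1.5D, 2D, and 3D variants aim to reduce.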

**Contact:** Alok Tripathy

## Dependencies
- Python 3.6.10
- PyTorch 1.3.1
- PyTorch Geometric (PyG) 1.3.2
- CUDA 10.1
- GCC 6.4.0

On OLCF Summit, all of these dependencies can be loaded with the following commands:
```bash
module load cuda # CUDA 10.1
module load gcc # GCC 6.4.0
module load ibm-wml-ce/1.7.0-3 # PyTorch 1.3.1, Python 3.6.10

# PyG and its dependencies
conda create --name gnn --clone ibm-wml-ce-1.7.0-3
conda activate gnn
pip install --no-cache-dir torch-scatter==1.4.0
pip install --no-cache-dir torch-sparse==0.4.3
pip install --no-cache-dir torch-cluster==1.4.5
pip install --no-cache-dir torch-geometric==1.3.2
```

## Compiling

This code uses C++ extensions. To compile these, run

```bash
cd sparse-extension
python setup.py install
```

## Documentation

Each algorithm in CAGNET is implemented in a separate file.
- `gcn_distr.py` : 1D algorithm
- `gcn_distr_15d.py` : 1.5D algorithm
- `gcn_distr_2d.py` : 2D algorithm
- `gcn_distr_3d.py` : 3D algorithm

Each file also has the following flags:

- `--accperrank ` : Number of GPUs on each node
- `--epochs ` : Number of epochs to run training
- `--graphname ` : Graph dataset to run training on
- `--timing ` : Enable timing barriers to time phases in training
- `--midlayer ` : Number of activations in the hidden layer
- `--runcount ` : Number of times to run training
- `--normalization ` : Normalize adjacency matrix in preprocessing
- `--activations ` : Enable activation functions between layers
- `--accuracy ` : Compute and print accuracy metrics (Reddit only)
- `--replication ` : Replication factor (1.5D algorithm only)
- `--download ` : Download the Reddit dataset

Some of these flags do not currently exist for the 3D algorithm.
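
The flags above are passed in `--name=value` form, including booleans such as `--timing=False`. The sketch below shows one way such flags could be parsed with `argparse`. It is not taken from the CAGNET scripts: the `str2bool` helper, the `build_parser` function, and every default value are assumptions made for illustration.

```python
import argparse


def str2bool(value):
    """Interpret 'True'/'False'-style strings, as in `--timing=False`."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected a boolean, got %r" % value)


def build_parser():
    """Hypothetical parser mirroring the documented flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="CAGNET-style flag parsing sketch")
    parser.add_argument("--accperrank", type=int, default=1)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--graphname", type=str, default="Reddit")
    parser.add_argument("--timing", type=str2bool, default=False)
    parser.add_argument("--midlayer", type=int, default=16)
    parser.add_argument("--runcount", type=int, default=1)
    parser.add_argument("--normalization", type=str2bool, default=False)
    parser.add_argument("--activations", type=str2bool, default=False)
    parser.add_argument("--accuracy", type=str2bool, default=False)
    parser.add_argument("--replication", type=int, default=1)
    parser.add_argument("--download", type=str2bool, default=False)
    return parser


# Parse a flag list in the same style as the commands in this README.
args = build_parser().parse_args(
    ["--accperrank=6", "--epochs=100", "--graphname=Reddit",
     "--timing=False", "--midlayer=16", "--runcount=1", "--replication=2"]
)
```

Using an explicit converter matters here because `type=bool` would treat the string `"False"` as true.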

Amazon/Protein datasets must exist as COO files in `../data//processed/`, serialized with pickle.
For Reddit, PyG handles downloading and accessing the dataset (see below).
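
As a rough illustration of what a pickled COO file can look like, the sketch below writes and reads an edge list with Python's `pickle` module. The exact object layout and file name that CAGNET expects are not specified here, so the dictionary keys (`rows`, `cols`, `num_nodes`), the file name `example_coo.pkl`, and the temporary directory are all hypothetical.

```python
import os
import pickle
import tempfile


def save_coo(path, rows, cols, num_nodes):
    """Pickle a graph in COO form: parallel lists of row and column indices."""
    with open(path, "wb") as handle:
        pickle.dump({"rows": rows, "cols": cols, "num_nodes": num_nodes}, handle)


def load_coo(path):
    """Load a COO graph written by save_coo."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Write a 3-node directed cycle into a throwaway `processed` directory.
with tempfile.TemporaryDirectory() as root:
    processed_dir = os.path.join(root, "processed")
    os.makedirs(processed_dir)
    coo_path = os.path.join(processed_dir, "example_coo.pkl")  # hypothetical name
    save_coo(coo_path, rows=[0, 1, 2], cols=[1, 2, 0], num_nodes=3)
    graph = load_coo(coo_path)
```

Check the dataset-loading code in the script you plan to run for the real file name and object type before preparing your own data.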

## Running on OLCF Summit (example)

To run the CAGNET 1.5D algorithm on Reddit with
- 16 processes
- 100 epochs
- 16 hidden layer activations
- 2-factor replication

first run the following command to download the Reddit dataset:

`python gcn_distr_15d.py --graphname=Reddit --download=True`

This downloads Reddit into `../data`. After the download completes, run the following command to start training:

`ddlrun -x WORLD_SIZE=16 -x MASTER_ADDR=$(echo $LSB_MCPU_HOSTS | cut -d " " -f 3) -x MASTER_PORT=1234 -accelerators 6 python gcn_distr_15d.py --accperrank=6 --epochs=100 --graphname=Reddit --timing=False --midlayer=16 --runcount=1 --replication=2`
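
The command above passes `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT` to every process as environment variables, and takes the master address from the third space-separated field of `$LSB_MCPU_HOSTS`. The sketch below shows how a script might consume these values. It is not the repository's code: the `RANK` variable, the `rank % accperrank` GPU mapping, and both helper names are assumptions, and the host string in the example is made up.

```python
import os


def first_compute_host(lsb_mcpu_hosts):
    """Mirror `cut -d " " -f 3`: return the third space-separated field."""
    return lsb_mcpu_hosts.split(" ")[2]


def read_launch_env(env, accperrank):
    """Collect launch settings from an environment mapping.

    `RANK` and the rank-to-GPU mapping are assumptions for this sketch.
    """
    rank = int(env.get("RANK", "0"))
    return {
        "world_size": int(env["WORLD_SIZE"]),
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "rank": rank,
        # Assumed mapping: spread consecutive ranks over the GPUs of a node.
        "local_device": rank % accperrank,
    }


# Example with made-up values; a real job would pass os.environ instead.
example_env = {
    "WORLD_SIZE": "16",
    "MASTER_ADDR": first_compute_host("batch1 1 nodeA 42 nodeB 42"),
    "MASTER_PORT": "1234",
    "RANK": "7",
}
config = read_launch_env(example_env, accperrank=6)
# A script would typically pass these settings on to
# torch.distributed.init_process_group(...) before training starts.
```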

## Citation

To cite CAGNET, please refer to:

> Alok Tripathy, Katherine Yelick, Aydın Buluç. Reducing Communication in Graph Neural Network Training. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’20), 2020.
