Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MachineLearningSystem/BNS-GCN
[MLSys 2022] "BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling" by Cheng Wan, Youjie Li, Ang Li, Nam Sung Kim, Yingyan Lin
- Host: GitHub
- URL: https://github.com/MachineLearningSystem/BNS-GCN
- Owner: MachineLearningSystem
- License: mit
- Fork: true (GATECH-EIC/BNS-GCN)
- Created: 2022-10-25T03:35:41.000Z (about 2 years ago)
- Default Branch: master
- Last Pushed: 2022-10-17T06:30:28.000Z (about 2 years ago)
- Last Synced: 2024-08-02T19:37:21.193Z (5 months ago)
- Homepage:
- Size: 46.9 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-AI-system - BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling MLSYS'22
README
# BNS-GCN: Efficient Full-Graph Training of Graph Convolutional Networks with Partition-Parallelism and Random Boundary Node Sampling
Cheng Wan\* (Rice University), Youjie Li\* (UIUC), Ang Li (PNNL), Nam Sung Kim (UIUC), Yingyan Lin (Rice University)
(\*Equal contribution)
Accepted at MLSys 2022 [[Paper](https://arxiv.org/abs/2203.10983) | [Video](https://youtu.be/kzI0ksASFQY) | [Slide](https://mlsys.org/media/mlsys-2022/Slides/2178.pdf) | [Docker](https://hub.docker.com/r/cheng1016/bns-gcn) | [Sibling](https://github.com/RICE-EIC/PipeGCN)]
## Directory Structure
```
|-- checkpoint # model checkpoints
|-- dataset
|-- helper # auxiliary codes
| `-- timer
|-- module # PyTorch modules
|-- partitions # partitions of input graphs
|-- results # experiment outputs
`-- scripts # example scripts
```
Note that `./checkpoint/`, `./dataset/`, `./partitions/` and `./results/` start out empty and are created and populated automatically once BNS-GCN is launched.
## Setup
### Environment
#### Hardware Dependencies
- An x86 CPU machine with at least 120 GB of host memory
- At least five NVIDIA GPUs with at least 11 GB of memory each

#### Software Dependencies
- Ubuntu 18.04
- Python 3.8
- CUDA 11.1
- [PyTorch 1.8.0](https://github.com/pytorch/pytorch)
- [customized DGL 0.8.0](https://github.com/chwan-rice/dgl)
- [OGB 1.3.2](https://ogb.stanford.edu/docs/home/)

### Installation
#### Option 1: Run with Docker
We have prepared a [Docker package](https://hub.docker.com/r/cheng1016/bns-gcn) for BNS-GCN.
```bash
docker pull cheng1016/bns-gcn
docker run --gpus all -it cheng1016/bns-gcn
```

#### Option 2: Install with Conda
Running the following command will install DGL from source and other prerequisites from conda.
```bash
bash setup.sh
```

#### Option 3: Do It Yourself
Please follow the official guides ([[1]](https://github.com/pytorch/pytorch), [[2]](https://ogb.stanford.edu/docs/home/)) to install PyTorch and OGB. For DGL, please follow the [official guide](https://docs.dgl.ai/install/index.html#install-from-source) to install our customized DGL **from source** (do NOT forget to point the first `git clone` command at [our customized repo](https://github.com/chwan-rice/dgl)). We are in contact with the DGL team to integrate our modification, which minimizes communication volume during graph partitioning.
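For reference, a from-source build of the customized DGL typically follows the steps below. This is only a sketch based on the generic DGL from-source procedure; the CMake flags and parallelism level are assumptions, so consult the official guide for your platform:

```bash
# Sketch only -- follow the official DGL from-source guide for your platform.
# Clone the customized DGL repo (not the upstream one) with its submodules.
git clone --recurse-submodules https://github.com/chwan-rice/dgl.git
cd dgl

# Build the C++ core with CUDA support enabled.
mkdir build && cd build
cmake -DUSE_CUDA=ON ..
make -j4

# Install the Python bindings into the current environment.
cd ../python
python setup.py install
```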
### Datasets
We use Reddit, ogbn-products, Yelp and ogbn-papers100M to evaluate BNS-GCN. All datasets are stored in `./dataset/` by default. Reddit, ogbn-products and ogbn-papers100M are downloaded automatically by DGL or OGB. Yelp is preloaded in the Docker environment; if you set up the environment yourself, it is available [here](https://drive.google.com/open?id=1zycmmDES39zVlbVCYs88JTJ1Wm5FbfLz) or [here](https://pan.baidu.com/s/1SOb0SiSAXavwAcNqkttwcg) (with passcode f1ao).
## Basic Usage
### Core Training Options
- `--dataset`: the dataset you want to use
- `--model`: the GCN model (only GraphSAGE and GAT are supported at this moment)
- `--lr`: learning rate
- `--sampling-rate`: the sampling rate of BNS-GCN
- `--n-epochs`: the number of training epochs
- `--n-partitions`: the number of partitions
- `--n-hidden`: the number of hidden units
- `--n-layers`: the number of GCN layers
- `--partition-method`: the method for graph partition ('metis' or 'random')
- `--port`: the network port for communication
- `--no-eval`: disable the evaluation process

### Run Example Scripts
Simply running `scripts/reddit.sh`, `scripts/ogbn-products.sh` or `scripts/yelp.sh` reproduces BNS-GCN under the default settings. For example, after running `bash scripts/reddit.sh`, you will see output like the following:
```
...
Process 000 | Epoch 02999 | Time(s) 0.3578 | Comm(s) 0.2267 | Reduce(s) 0.0108 | Loss 0.0716
Process 001 | Epoch 02999 | Time(s) 0.3600 | Comm(s) 0.2314 | Reduce(s) 0.0136 | Loss 0.0867
(rank 1) memory stats: current 562.96MB, peak 1997.89MB, reserved 2320.00MB
(rank 0) memory stats: current 557.01MB, peak 2087.31MB, reserved 2296.00MB
Epoch 02999 | Accuracy 96.55%
model saved
Max Validation Accuracy 96.68%
Test Result | Accuracy 97.21%
```

### Run Full Experiments
If you want to reproduce the core experiments of our paper (e.g., accuracy in Table 4, throughput in Figure 4, time breakdown in Figure 5, peak memory in Figure 6), please run `scripts/reddit_full.sh`, `scripts/ogbn-products_full.sh` or `scripts/yelp_full.sh`; the outputs will be saved to the `./results/` directory. Note that the throughput of these experiments will be significantly lower than the results in our paper because training is performed alongside validation.
### Run Customized Settings
You may adjust `--n-partitions` and `--sampling-rate` to reproduce the results of BNS-GCN under other settings. To verify the exact throughput or time breakdown of BNS-GCN, please add `--no-eval` argument to skip the evaluation step. You may also use the argument `--partition-method=random` to explore the performance of BNS-GCN with random partition.
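As an illustration, the options above compose as ordinary command-line flags. The snippet below assembles such an invocation as a string; the `main.py` entry-point name is an assumption for illustration only (see `scripts/reddit.sh` for the actual launch command):

```shell
#!/bin/sh
# Build a customized BNS-GCN invocation. The entry-point name "main.py"
# is hypothetical; see scripts/reddit.sh for the real launcher.
build_cmd() {
  # $1 = dataset, $2 = number of partitions, $3 = boundary sampling rate
  printf 'python main.py --dataset %s --n-partitions %s --sampling-rate %s --partition-method random --no-eval\n' \
    "$1" "$2" "$3"
}

# 8 random partitions, keep 10% of boundary nodes, skip evaluation:
build_cmd reddit 8 0.1
```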
### Run with Multiple Compute Nodes
Our code base also supports distributed GCN training with multiple compute nodes. To achieve this, you should specify `--master-addr`, `--node-rank` and `--parts-per-node` for each compute node. An example is provided in `scripts/reddit_multi_node.sh` where we train the Reddit graph over 4 compute nodes, each of which contains 10 GPUs, with 40 partitions in total. You should run the command on each node and specify the corresponding node rank. **Please turn on `--fix-seed` argument** so that all nodes initialize the same model weights.
If the compute nodes do not share storage, you should first partition the graph on a single machine and manually distribute the partitions to the other compute nodes. When running the training script, please enable the `--skip-partition` argument.
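For concreteness, the per-node commands for the 4-node, 40-partition Reddit example can be sketched as below. The `main.py` entry point and the master address are hypothetical placeholders; `scripts/reddit_multi_node.sh` shows the actual invocation:

```shell
#!/bin/sh
# Sketch: the command each of the 4 nodes would run for a 40-partition job
# (10 partitions per node). "main.py" and 192.168.0.1 are placeholders.
node_cmds() {
  # Emit the launch command for each node rank 0..3; run the matching
  # command on the corresponding node, keeping --fix-seed on everywhere.
  for rank in 0 1 2 3; do
    echo "node ${rank}: python main.py --dataset reddit --n-partitions 40 --parts-per-node 10 --master-addr 192.168.0.1 --node-rank ${rank} --fix-seed"
  done
}

node_cmds
```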
## License
Copyright (c) 2022 RICE-EIC. All rights reserved.
Licensed under the [MIT](https://github.com/RICE-EIC/BNS-GCN/blob/master/LICENSE) license.