# Decoupling the Depth and Scope of Graph Neural Networks

Hanqing Zeng, Muhan Zhang, Yinglong Xia, Ajitesh Srivastava, Andrey Malevich, Rajgopal Kannan, Viktor Prasanna, Long Jin, Ren Chen

**Contact**: Hanqing Zeng ([email protected])

[Latest version of the paper](https://arxiv.org/abs/2201.07858)

(Note: there is an [old version](https://arxiv.org/abs/2012.01380) named "Deep Graph Neural Networks with Shallow Subgraph Samplers". Please refer only to the new version and discard the old one.)

## News
* Major updates (code refactoring; link prediction support; all training configs) released in Jan 2022.
* We thank the **DGL** team for including the shaDow [k-hop sampler](https://docs.dgl.ai/en/latest/api/python/dgl.dataloading.html#dgl.dataloading.shadow.ShaDowKHopSampler) in their library.
* shaDow-GNN paper accepted to **NeurIPS'21**!
* We thank the **PyTorch Geometric** team for including the shaDow [k-hop sampler](https://pytorch-geometric.readthedocs.io/en/2.0.0/modules/loader.html?#torch_geometric.loader.ShaDowKHopSampler) in their library.

## Overview

We propose a design principle of "decoupling the depth and scope" when constructing GNN models. This is a simple way to **surpass 1-WL**, **overcome oversmoothing** and **avoid neighborhood explosion** at the same time.

We call the practical implementation of our design principle **shaDow-GNN** (**D**eep GNNs on **sha**ll**ow** subgraphs).

This repo implements:
* 6 backbone message passing layers (GCN, GraphSAGE, GIN, GAT, JK-Net, SGC)
* 4 pooling layers (sort, max, mean, sum)
* 4 subgraph extractors / samplers (IID node, k-hop, PPR, stochastic PPR); see the toy extractor sketch after this list
* Pre-processing (feature smoothing; label propagation)
* Post-processing (C&S)
* Subgraph ensemble (either during training or post-processing)
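
To make the extractor idea concrete, here is a minimal Python sketch of a k-hop extractor on a `scipy.sparse` adjacency matrix. The function `khop_subgraph` and its interface are illustrative only, not the repo's C++ sampler API:

```
import numpy as np
import scipy.sparse as sp

def khop_subgraph(adj: sp.csr_matrix, target: int, k: int, budget: int):
    """Hypothetical k-hop extractor: BFS out to k hops from `target`,
    keeping at most `budget` neighbors per node, then return the
    node-induced subgraph on which a deep GNN can be built."""
    rng = np.random.default_rng()
    frontier, nodes = [target], {target}
    for _ in range(k):
        nxt = []
        for u in frontier:
            neigh = adj.indices[adj.indptr[u]:adj.indptr[u + 1]]
            if len(neigh) > budget:              # sample to bound the scope
                neigh = rng.choice(neigh, budget, replace=False)
            nxt.extend(int(n) for n in neigh if n not in nodes)
        nodes.update(nxt)
        frontier = nxt
    idx = np.array(sorted(nodes), dtype=np.int64)
    return idx, adj[idx][:, idx]                  # node-induced subgraph

# Toy usage on a 5-node path graph: node 2's 2-hop subgraph is the whole path.
A = sp.csr_matrix(np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1))
nodes, sub_adj = khop_subgraph(A, target=2, k=2, budget=2)
```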

This repo supports:
* Inductive node classification (`Flickr`, `Reddit`, `Yelp`)
* Transductive node classification (`ogbn-arxiv`, `ogbn-products`, `ogbn-papers100M`)
* Link prediction (`ogbl-collab`)

The training pipeline of shaDow-GNN can be abstracted as three major steps:

### Preprocessing (optional)

The preprocessing steps may augment the input node features with
* Smoothed node features
* Ground-truth labels of the training-set nodes

The first point is similar to what SGC and SIGN do (we simply convert the original algorithms into their shaDow versions). The second is inspired by methods on the OGB leaderboard and is only applicable in the transductive setting.
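
For intuition on the first point, SGC-style smoothing replaces the raw features `X` with `(D^-1/2 (A + I) D^-1/2)^k X`. Below is a minimal sketch of the original (non-shaDow) algorithm, for illustration only; the shaDow version applies the same idea per subgraph:

```
import numpy as np
import scipy.sparse as sp

def smooth_features(adj: sp.csr_matrix, feat: np.ndarray, k: int) -> np.ndarray:
    """SGC-style smoothing: compute (D^-1/2 (A + I) D^-1/2)^k X."""
    a_hat = adj + sp.eye(adj.shape[0], format="csr")   # add self-loops
    deg = np.asarray(a_hat.sum(axis=1)).ravel()
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    for _ in range(k):                                 # k rounds of propagation
        feat = a_norm @ feat
    return feat
```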

**Note**: preprocessing is turned off in all experiments in our main paper.

### Training

All shaDow-GNN models are trained in minibatch fashion. For each training batch, we first perform subgraph extraction, and then build a multi-layer GNN on each subgraph to perform message passing.

For any two nodes `u` and `v` in the same batch, we treat their subgraphs as completely isolated: if a node `w` of the original graph is included in both subgraphs, we rename the copy in `u`'s subgraph `w1` and the copy in `v`'s subgraph `w2`, so that the two subgraphs never exchange messages (sketched in Python below). See the `_node_induced_subgraph()` function in `para_graph_sampler/graph_engine/backend/ParallelSampler.cpp`.

**Note**: unlike other graph-sampling-based methods, shaDow-GNN allows a much smaller batch size (as small as 1), since the subgraph degree of shaDow-GNN does not drop with batch size. This property makes shaDow-GNN easy to run on GPUs with limited memory.
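
The isolation can be pictured as stacking the per-target subgraphs into one block-diagonal batch graph, so a shared original node gets distinct batch-local IDs. A minimal Python sketch of this bookkeeping (the actual work happens in the C++ sampler; `batch_isolated` is a hypothetical helper):

```
import numpy as np
import scipy.sparse as sp

def batch_isolated(subgraphs):
    """Stack (orig_node_ids, sub_adj) pairs block-diagonally. A node of the
    original graph that appears in two subgraphs gets two distinct rows, so
    message passing never crosses subgraph boundaries."""
    batch_adj = sp.block_diag([a for _, a in subgraphs], format="csr")
    # Map each batch-local row back to its original node id (to gather features).
    orig_ids = np.concatenate([ids for ids, _ in subgraphs])
    return batch_adj, orig_ids
```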

### Postprocessing (optional)

After the training is finished, we can reload the stored checkpoint to perform the following post-processing steps:
* *C&S* (transductive only): we borrow the DGL implementation of C&S to smooth the predictions generated by shaDow-GNN.
* *Ensemble*: ensemble can be done either "end-to-end" during the training step above, or as a postprocessing step (a rough sketch follows).
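
As a rough picture of ensemble-as-postprocessing, one can average the class probabilities predicted by several trained models. A minimal sketch, assuming each checkpoint yields logits of shape `[num_nodes, num_classes]` (the repo's actual ensemble is configured through the `*.yml` files):

```
import torch

def ensemble_predictions(logits_list):
    """Average the softmax outputs of several trained shaDow-GNNs
    (e.g., one per sampler) and take the argmax as the final label."""
    probs = torch.stack([torch.softmax(l, dim=-1) for l in logits_list])
    return probs.mean(dim=0).argmax(dim=-1)
```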


## Hardware requirements

Thanks to its flexible minibatching, shaDow-GNN has minimal hardware requirements for training and inference. Most of our experiments can be run on a desktop machine. Even the largest graph, with 111 million nodes, can be trained on a low-end server.

The main computation operations include:
* Subgraph extraction / sampling: parallelized on CPU by C++ and OpenMP.
* GNN model propagation: accelerated on GPU via PyTorch.

We summarize the recommended *minimum* hardware spec for the three OGB graphs:

| Graph | Num. nodes | CPU cores | CPU RAM | GPU memory |
|:-----:|:----------:|:---------:|:-------:|:----------:|
| ogbn-arxiv | 0.2M | 4 | 8GB | 4GB |
| ogbn-products | 2.4M | 4 | 32GB | 4GB |
| ogbn-papers100M | 111.1M | 4 | 128GB | 4GB |

## Data format

When you run shaDow-GNN for the first time, we will convert the graph data from the OGB or GraphSAINT format into the shaDow-GNN format.
The converted data files are (by default) stored in the `./data/` directory.

**NOTE**: the initial data conversion may take a while for large graphs (e.g., for ogbn-papers100M). Please be patient.

### General shaDow format

We briefly describe the shaDow data format. You should not need to worry about these details unless you want to prepare your own dataset. Each graph is defined by the following files (a toy writer script follows the list):
* `adj_full_raw.npz` / `adj_full_raw.npy`: The adjacency matrix of the full graph (consisting of all the train / valid / test nodes). It can either be a `*.npz` file of type `scipy.sparse.csr_matrix`, or a `*.npy` file containing the dictionary `{'indptr': numpy.ndarray, 'indices': numpy.ndarray, 'data': numpy.ndarray}`.
* `adj_train_raw.npz` / `adj_train_raw.npy`: The adjacency matrix induced by all training nodes (ONLY used in inductive learning).
* `label_full.npy`: The `numpy.ndarray` representing the labels of all the train / valid / test nodes. If this matrix is 2D, then a row is a one-hot encoding of the label(s) of a node. If this is 1D, then an element is the label index of a node. In any case, the first dimension equals the total number of nodes.
* `feat_full.npy`: The `numpy.ndarray` representing the node features. The first dimension of the matrix equals the total number of nodes.
* `split.npy`: The file stores a dictionary representing the train / valid / test split. The keys are `train` / `valid` / `test`. The values are `numpy` arrays of the node indices for the corresponding split.
* (Optional) `adj_full_undirected.npy`: A cache file storing the graph after converting `adj_full_raw` to undirected (e.g., the raw graph of `ogbn-arxiv` is directed).
* (Optional) `adj_train_undirected.npy`: Similar to the above, converted from `adj_train_raw`.
* (Optional) `cpp/adj__.bin`: Cache files for the C++ sampler. We store the corresponding `*.npy` / `*.npz` files as binary so that the C++ sampler can load the graph directly without going through the layer of PyBind11 (see below). For gigantic graphs such as `ogbn-papers100M`, the conversion from `numpy.ndarray` to C++ `vector` seems to be slow (possibly a PyBind11 issue).
* (Optional) `ppr_float/__.bin`: Cache files for the C++ PPR sampler. We store the PPR values and node indices of each target's close neighbors as external binary files, so we do not need to re-run PPR during parameter tuning (even though running PPR from scratch is still much cheaper than model training).
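
To make the format concrete, here is a toy script that writes a random graph in the layout above. The graph and the `data/toygraph/` directory name are fabricated for illustration; only the file names and array conventions follow the list:

```
import os
import numpy as np
import scipy.sparse as sp

os.makedirs("data/toygraph", exist_ok=True)
n, f, c = 100, 16, 4                                # nodes, feature dim, classes

adj = sp.random(n, n, density=0.05, format="csr")
adj.data[:] = 1.0                                   # unweighted edges
sp.save_npz("data/toygraph/adj_full_raw.npz", adj)

np.save("data/toygraph/feat_full.npy",
        np.random.randn(n, f).astype(np.float32))
np.save("data/toygraph/label_full.npy",
        np.random.randint(0, c, size=n))            # 1D array of label indices
idx = np.random.permutation(n)
np.save("data/toygraph/split.npy",
        {"train": idx[:60], "valid": idx[60:80], "test": idx[80:]})
```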


### Graphs tested

To train shaDow-GNN on the 7 graphs evaluated in the paper:
* For the 4 OGB graphs (i.e., `ogbn-arxiv`, `ogbn-products`, `ogbn-papers100M`, `ogbl-collab`), you don't need to download anything manually. Just execute the training command (see below).
* For the 3 other graphs (i.e., `Flickr`, `Reddit`, `Yelp`), the source data files are listed in the official [GraphSAINT repo](https://github.com/GraphSAINT/GraphSAINT). Please download them manually from the [link provided by GraphSAINT](https://drive.google.com/open?id=1zycmmDES39zVlbVCYs88JTJ1Wm5FbfLz), and place all the downloaded files under the `./data/saint/<graph name>/` directory.
    * E.g., for `Flickr`, the directory should look like this (note the **lowercase** graph name):

```
data/
└───saint/
    └───flickr/
        └───adj_full.npz
            class_map.json
            ...
```

The script for converting from OGB / SAINT into shaDow format is `./para_graph_sampler/graph_engine/frontend/data_converter.py`. It is automatically invoked when you run training for the first time.

## Build and Run

Clone the repo (you need the `--recursive` flag to download `pybind11` as a submodule):

```
git clone --recursive https://github.com/facebookresearch/shaDow_GNN.git
```

**Step 0**: Make sure you create a virtual environment with Python 3.8 (lower versions of Python may not work; the version we use is 3.8.5).

**Step 1**: We need PyBind11 to link the C++ based sampler with the PyTorch based trainer. The `./para_graph_sampler/graph_engine/backend/ParallelSampler.*` files contain the C++ code for the PPR and k-hop samplers. The `./para_graph_sampler/graph_engine/backend/pybind11/` directory contains a [copy of PyBind11](https://github.com/pybind/pybind11).

Before training, we need to build the C++ sampler as a Python package so that it can be imported directly by the PyTorch trainer (just like any other Python module). To do so, install the following:

* `cmake` (our version is 3.18.2. Can be installed by `conda install -c anaconda cmake`)
* `ninja` (our version is 1.10.2. Can be installed by `conda install -c conda-forge ninja`)
* `pybind11` (our version is 2.6.2. Can be installed by `pip install pybind11`)
* `OpenMP`: normally OpenMP is already included in your C++ compiler. If not, you may need to install it manually for your compiler version.

Then build the sampler by running the following in your terminal:

```
cd para_graph_sampler
bash install.sh
cd ..
```

On a Windows machine, replace the `bash install.sh` command with `.\install.bat`.

**Step 2**: Install all the other Python packages in your virtual environment.

* `pytorch==1.7.1` (CUDA 11)
* PyTorch Geometric and its dependencies (`torch-scatter`, `torch-sparse`, etc.)
    * Follow the [official instructions](https://pytorch-geometric.readthedocs.io/en/latest/notes/installation.html) (see the "Installation via Binaries" section)
    * We also use the `torch_scatter` functions explicitly to perform some graph operations for shaDow.
* `ogb>=1.2.4`
* `dgl>=0.5.3` (only used by the C&S postprocessing). Can be installed by `pip` or `conda`; see the [official instructions](https://docs.dgl.ai/install/index.html)
* `numpy>=1.19.2`
* `scipy>=1.6.0`
* `scikit-learn>=0.24.0`
* `pyyaml>=5.4.1`
* `argparse`
* `tqdm`

**(Optional) Step 3**: Record your system information. We use the `CONFIG.yml` file to keep track of the meta information of your hardware / software system. Copy `CONFIG_TEMPLATE.yml`, name it `CONFIG.yml`, and edit the fields based on your machine specs.

In most cases, the only field you need to overwrite is `max_threads`, which controls the parallelism of the C++ sampler. You can also set it to `-1` to let OpenMP decide the number of threads automatically.

**Step 4**: Now you should be able to run the training / inference. In general, just type:

```
python -m shaDow.main --configs <config file> --dataset <dataset name> --gpu <GPU index>
```

where the `*.yml` file specifies all the hyperparameters (e.g., GNN architecture, sampler, etc.). The dataset name should correspond to the sub-directory name under `./data/` (we use all **lowercase** and omit the `ogbn-` / `ogbl-` prefix).

**Step 5**: Check the training logs. We use the following logging protocol; our principle is to enable complete reproducibility of previous runs.

* Each run gets its own subdirectory in the format of `.////-/...`, where the status directory indicates the state of the run:
    * `running/`: the training is still in progress.
    * `finished/`: the training finished normally; the logs are moved from `running/` to `finished/`.
    * `killed/`: the training was killed (e.g., by CTRL-C).
    * `crashed/`: the training crashed (e.g., bugs in the code, GPU / CPU out-of-memory).
* In the subdirectory you should find the following files:
    * `*.yml`: a copy of the `*.yml` file used to launch the training
    * `epoch_.csv`: CSV file logging the accuracy and loss of each epoch
    * `final.csv`: CSV file logging the final accuracy on the full train / valid / test sets
    * PyTorch checkpoint: the model weights and optimizer states

## Reproducing the paper results

We first describe the command for a single run. At the end of this section, we show the wrapper script for repeating the same configuration multiple times.

### Table 1

The configs are under `./config_train/<dataset>/<vanilla or pool>/<arch>_<depth>_<sampler>.yml`, where
* `dataset` corresponds to the 5 graphs in Table 1: `flickr`, `reddit`, `yelp`, `arxiv`, `products`.
* `vanilla` means no subgraph pooling is performed: we take the target node's embedding for classification and discard the embeddings of all other subgraph nodes.
* `pool` means we add an extra subgraph pooling layer on top of the vanilla architecture. Table 1 only evaluates `mean` and `max` pooling; you can also try `sort` and `sum`.
* `arch` in Table 1 is restricted to `gcn`, `sage` and `gat`. Figure 3 corresponds to `sgc` and Table 12 corresponds to `gin`.
* `depth` in Table 1 is either 3 or 5.
* `sampler` is chosen from `ppr` and `khop`.

**Note**: for `ogbn-products`, since its test set is especially large, you can skip evaluating test accuracy during training via additional flags, e.g.,

```
python -m shaDow.main --configs config_train/products/pool/gat_3_ppr.yml --dataset products --gpu <GPU index> --log_test_convergence -1 --nocache test
```

### Table 2

Run:

```
python -m shaDow.main --configs config_train/papers100M/leaderboard/gat_ppr.yml --dataset papers100M --gpu <GPU index>
```

### Table 3

Run:

```
python -m shaDow.main --configs config_train/collab/leaderboard/sage_ppr.yml --dataset collab --gpu <GPU index>
```

### Repeat the same configuration multiple times

Table 1 results are repeated 5 times; Table 2 and Table 3 results are repeated 10 times, all without fixing random seeds. In the C++ sampler, the seed is left unfixed via `std::srand(std::time(0))` in `para_graph_sampler/graph_engine/backend/ParallelSampler.h`.

We also provide a wrapper script to repeat the training; see `./scripts/train_multiple_runs.py`.

General command:

```
python scripts/train_multiple_runs.py --dataset <dataset name> --configs <config file> --gpu <GPU index> --repetition 10
```

where all the command line arguments of `train_multiple_runs.py` are the same as those of the original training script (i.e., the `shaDow.main` module). The only additional flag is `--repetition`.

**NOTE**: the wrapper script uses a Python subprocess to launch multiple runs. There seems to be an issue with redirecting the print-out messages of the training subprocess: the program may appear to be stuck without producing any output. This is due to output buffering; the training is actually running **in the background**. You can check the corresponding log files in the `running/` directory to see the per-epoch accuracy being updated.



## License

shaDow-GNN is released under an MIT license. Find out more about it [here](https://github.com/facebookresearch/shaDow_GNN/blob/master/LICENSE).

## Citation

NeurIPS 2021
```
@inproceedings{shaDow,
  title={Decoupling the Depth and Scope of Graph Neural Networks},
  author={Hanqing Zeng and Muhan Zhang and Yinglong Xia and Ajitesh Srivastava and Andrey Malevich and Rajgopal Kannan and Viktor Prasanna and Long Jin and Ren Chen},
  booktitle={Advances in Neural Information Processing Systems},
  editor={A. Beygelzimer and Y. Dauphin and P. Liang and J. Wortman Vaughan},
  year={2021},
  url={https://openreview.net/forum?id=d0MtHWY0NZ}
}
```