Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/krishnanlab/PecanPy_benchmarks


https://github.com/krishnanlab/PecanPy_benchmarks

Last synced: about 2 months ago
JSON representation

Awesome Lists containing this project

README

        

# PecanPy Benchmark

This reposotory provides scripts for reproducing the benchmarking results of several implementations of node2vec
as presented in [PecanPy](https://github.com/krishnanlab/PecanPy).

***Note**: all test scripts provided use the [SLURM workload manager](https://slurm.schedmd.com/documentation.html),
and will **NOT** run on a personal computer; this package uses [Anaconda](https://www.anaconda.com/products/individual)
to manage environments through conda environments, so that different software packages can be run despite they might
have different dependency requirements.*

## 1. Quick start

To get started and perform all tests, run the following commands which will first setup the working environment
and then submit all job scripts for testing.

***WARNING**: the whole dataset takes up 11GB of sapce, make sure you have enough space before setting up.*

```bash
sh setup.sh # setup working environment
sh SLURM/submit_all.sh # submitting test job scripts
```

***Note**: it might take up to 30 minutes to fully setup, mainly due to download of large data files*

After all jobs are done (***one might encounter memory errors due to limitations of some implementation***), two
files will be created under the `result/` directory:
* `stat_summary.tsv` - summary table consisting of runtime statistics of different implementations for different
networks. More details about the measurements could be found in
[section 3](https://github.com/krishnanlab/PecanPy_benchmarks#3-submitting-benchmark-jobs)
* `evaluation_summary.tsv` - summary table for the classification evaluation of the resulting embeddings from
different implementations. More details about the eavluation setup could be found in
[section 4](https://github.com/krishnanlab/PecanPy_benchmarks#4-classification-evaluation)

To generate the runtime stats and evaluation figures using the test data collected, open and run the two jupyter
notebooks (`runtime_stats.ipynb` and `classification_evaluation.ipynb`) under the `jup/` directory. The figures
should be produced at the end of each notebook.

## 2. Setting up

This section describes in more detail the steps that are done when a user runs `setup.sh`, which includes
* setting up directory structure for saving results and history files
* building environments for different implementataions
* downloading and processing data

### 2.1 Directory

The `script/init_setup/setup_dir.sh` script will setup the `SLURM_history/` directory for holding SLURM job history
files and the `result/` with the following
```txt
result
|-- emb
| |-- nodevectors
| |-- orig-cpp
| |-- orig-py
| |-- pecanpy-DenseOTF
| |-- pecanpy-PreComp
| `-- pecanpy-SparseOTF
`-- stat
|-- nodevectors
|-- orig-cpp
|-- orig-py
|-- pecanpy-DenseOTF
|-- pecanpy-PreComp
`-- pecanpy-SparseOTF
```

where `emb/` holds the embedding files generated by different implementations, and `stat/` holds the runtime stat
data (execution time and physical memory usage) profiled by GNU `time` command. Each of the directory above is
further subdivided with one folder per implementations

### 2.2 Environments

We tested 6 different implementations of the node2vec algorithm with 3 implementations coming from PecanPy and 3
other from alternative software packages.

* [Original node2vec (Python)](https://github.com/aditya-grover/node2vec) - `orig-py`
* [Original node2vec (C++)](https://github.com/snap-stanford/snap/tree/master/examples/node2vec) - `orig-cpp`
* [PecanPy](https://github.com/krishnanlab/PecanPy)
* PreComp - `pecanpy-PreComp`
* SparseOTF - `pecanpy-SparseOTF`
* DenseOTF - `pecanpy-DenseOTF`
* [NodeVectors](https://github.com/VHRanger/nodevectors) - `nodevectors`

Since different libraries require different dependencies that are not compatible with each other, we set up conda
conda environments for each of the libraries (except for `orig-cpp`, which requires building cpp source code
instead). Information about the conda environments is provided in the `env/` directory as `.yml` files. The
`script/init_setup/setup_envs.sh` script will use these files to setup the three conda environments to be used
later by different libraries
* `pecanpy-bench_node2vec`
* `pecanpy-bench_nodevectors`
* `pecanpy-bench_pecanpy`

### 2.3 Data

Various networks with a wide range of sizes and densities are used for benchmarking different implementations of
the node2vec algorithm. The relatively small networks (BlogCatalog, PPI, Wikipedia) are provided in this repository
along with node labels. They are originally downloaded from the [node2vec](https://snap.stanford.edu/node2vec/)
webpage, and converted to `.txt` files from `.mat` files so that it is easier to load with Python. The rest of
the netowrks will be downloaded from other repositories, which will be automatically done by the
`script/init_setup/setup_data.sh` script. The following table summaries some information about the networks tested.

|Network|Weighted|# nodes|# edges|Density (unweighted)|File size|
|:-|:-|-:|-:|-:|-:|
|BioGRID|No|20,558|238,474|1.13E-03|2.5M|
|STRING|Yes|17,352|3,640,737|2.42E-02|60M|
|GIANT-TN-c01|Yes|25,689|38,904,929|1.18E-01|1.1G|
|GIANT-TN|Yes|25,825|333,452,400|1.00E+00|7.2G|
|SSN200|Yes|814,731|72,618,574|2.19E-04|2.0G|
|BlogCatalog|No|10,312|333,983|6.28E-03|3.2M|
|PPI|No|3,852|38,273|5.16E-03|707K|
|Wikipedia|Yes|4,777|92,406|8.10E-03|2.0M|

## 3. Submitting benchmark jobs

Each implementation will be tested using two different resource configurations (*multi* and *single*).
The *multi* setup aims to test the capability of the implementation to make use of large computational
resources, while the *single* setup tests the ability of the implementation run on a smaller amount of
computational resources (e.g. what might be found on a laptop). The configuration details are the following

|Configuration|Core count|Memory (GB)|Time limit (hr)|
|:-|:-|:-|:-|
|Multi|28|200|24|
|Single|1|32|8|

As mentioned earlier, `SLURM/submit_all.sh` will submit all test jobs at once. Each implementation and configuration
has its own test script, e.g. `SLURM/test_pecanpy-PreComp_single.sb` is the batch scrip for testing PecanPy-PreComp
with single-core resource configuration.

***A note on testing `nodevectors` implementation:*** for all other implementation besides `nodevectors`, the `p` and
`q` parameters are set to `1` by default, but for `nodevectors` they are set to `1.001`. If `p` and `q` are set to `1`,
`nodevectors` will automatically perform 1st order random walk as a shortcut
([link to source](https://github.com/VHRanger/CSRGraph/blob/8fb8f0e44aba1f147272bd8db19875756fde999f/csrgraph/graph.py#L241-L251))
other than 2nd order walk as required by the node2vec. Performing 1st order walk is significantly faster than performing
2nd order random walk as all other implementations do, and hence providing biased testing results. Setting `p` and `q`
to `1.001` disable the automatic fall back to 1st order random walk and at the same time keeps the results close to when
`p` and `q` are set to `1`.

After all test jobs are finished, the following Python script `~/script get_test_info.py` will be executed
to extract test information from logging files and summarize into a single table to `~/result/stat_summary/summary.txt`.
The summary file can be renamed to specify specific benchmark conditions if needed.
The table consists of the following columns:
* **Network** - name of the network embedded
* **Method** - name of the implementation of node2vec
* **Setup** - computational resource setting of the test
* **Loading time** - (stage1) time used to load network into memory in desired format
* **Preprocessing time** - (stage2) time used to pre-compute transition probability table
* **Walking time** - (stage3) time used to generate random walks
* **Training time** - (stage4) time used to train word2vec model using the random walks
* **Total time** - total run time of the program (from starting Python, includes time to load packages, etc.)
* **Total time in second** - same as total time, but converted to seconds
* **Maximum resident size** - maximum physical memory used

## 4. Classification Evaluation

To ensure the quality of the embeddings, we perform node classification tasks as presented in
[node2vec](https://arxiv.org/abs/1607.00653) for BlogCatalog, PPI, and Wikipedia. Slight modification to the labels for PPI
was made to remove labelsets with less than 10 positives to ensure meaningful evaluation metrics. There are 38 node classes
in BlogCatalog, 50 node classes in PPI, and 21 node classes in Wikipedia. For each node class in a network, a one vs rest l2
regularized [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
model is trained and evaluated through 5-fold cross validation. Each test fold is scored using auROC separately, and the mean
auROC score across 5 folds is reported. This evaluation is repeated 10 times and the mean reported scores are taken as the
final evaluation score the the class.

After the evaluation, for each network, there should be a list of scores for each implementation, where the entries correspond
to evaluation score for a particular node class. A wilcoxon paired test is then applied for each implementation against the
original Python implementation to see if there is any significant performance drop using the new implementation.

## 5. Extending Benchmarks

We welcome any researchers to use this repository to benchmark their own network of interest or even embedding programs.
This section provides guidelines for creating new tests and requirements for contributing to the repository by compiling
more implementations.

### 5.1 Procedure for adding new implementation
1. Add new implementation name to `data/implementation_list.txt` (see [section 5.3]() for more requirements of implementation)
2. Create test job script using the templates provided (`SLURM/_test_template.sb` and `SLURM/_test_template_single.sb`),
and follow the modifaction tag `### MODIFY` to modify the lines as needed
* `### MODIFY1` - sbatch job name (ends with `_s` for single-core configuration setup)
* `### MODIFY2` - name of implementation, need to match that added to `data/implementation_list.txt`
* `### MODIFY3` - change directory to source code for new implementation if needed (not required if setup through env)
* `### MODIFY4` - activte vritual environment for the new implementation if needed
* `### MODIFY5` - modify command for calling the program to embed network

### 5.2 Procedure for adding new network
1. Add new network as edgelist file (`.edg`) and add the network name (without file extension) to `data/networks.txt`.
2. Add `true` (or `false`) to `data/weighted.txt` depending on whether the new network is weighted (or not). The order
of network names in `data/networks.txt` and the corresponding weighted information in `data/weighted.txt` must match.

### 5.3 Contributing
* Prepare the `.yml` environment file with the naming convention of `pecanpy-bench_new-method` where `new-method` should
be replaced by the method's name, and modify `script/init_setup/setup_envs.sh` to setup environment for the new implementation.
* Provide interfacing script to for bash to communicate with python if needed, the following inputs are required
* `--input` - input graph path
* `--output` - output embedding path
* `--dimension` - embedding dimension
* `--walk-length` - length of each walk
* `--num-walks` - number of walks per node
* `--workers` - number of workers
* `--p` - return parameter
* `--q` - inout parameter
* For proper runtime stat retrieval, add the following printing statements to report execution time for each corresponding stage
```python
print("Took %02d:%02d:%05.2f to load graph"%(hrs, mins, secs))
print("Took %02d:%02d:%05.2f to pre-compute transition probabilities"%(hrs, mins, secs))
print("Took %02d:%02d:%05.2f to generate walks"%(hrs, mins, secs))
print("Took %02d:%02d:%05.2f to train embeddings"%(hrs, mins, secs))
```
* Add data retrieval scripts and any necessary preprocessing scripts for new networks (as edgelist files, with `.edg` extension)