{"id":13680101,"url":"https://github.com/krishnanlab/PecanPy_benchmarks","last_synced_at":"2025-04-29T19:32:43.004Z","repository":{"id":105006435,"uuid":"310495466","full_name":"krishnanlab/PecanPy_benchmarks","owner":"krishnanlab","description":null,"archived":false,"fork":false,"pushed_at":"2022-05-09T15:14:01.000Z","size":3280,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-11-11T22:35:57.712Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krishnanlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-11-06T04:53:15.000Z","updated_at":"2024-07-25T02:22:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"87b26916-c834-4917-a900-5c68ca4a45e5","html_url":"https://github.com/krishnanlab/PecanPy_benchmarks","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy_benchmarks","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy_benchmarks/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy_benchmarks/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy_benchmarks/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krishnanlab","download_url":"https://codeload.github.com/krishnanlab/PecanPy_benchmarks/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251569603,"owners_count":21610587,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:01:13.032Z","updated_at":"2025-04-29T19:32:38.538Z","avatar_url":"https://github.com/krishnanlab.png","language":"Shell","readme":"# PecanPy Benchmark\n\nThis reposotory provides scripts for reproducing the benchmarking results of several implementations of node2vec \nas presented in [PecanPy](https://github.com/krishnanlab/PecanPy). \n\n***Note**: all test scripts provided use the [SLURM workload manager](https://slurm.schedmd.com/documentation.html), \nand will **NOT** run on a personal computer; this package uses [Anaconda](https://www.anaconda.com/products/individual) \nto manage environments through conda environments, so that different software packages can be run despite they might \nhave different dependency requirements.*\n\n## 1. Quick start\n\nTo get started and perform all tests, run the following commands which will first setup the working environment \nand then submit all job scripts for testing.\n\n***WARNING**: the whole dataset takes up 11GB of sapce, make sure you have enough space before setting up.*\n\n```bash\nsh setup.sh # setup working environment\nsh SLURM/submit_all.sh # submitting test job scripts\n```\n\n***Note**: it might take up to 30 minutes to fully setup, mainly due to download of large data files*\n\nAfter all jobs are done (***one might encounter memory errors due to limitations of some implementation***), two \nfiles will be created under the `result/` directory:\n* `stat_summary.tsv` - summary table consisting of runtime statistics of different implementations for different \nnetworks. More details about the measurements could be found in \n[section 3](https://github.com/krishnanlab/PecanPy_benchmarks#3-submitting-benchmark-jobs)\n* `evaluation_summary.tsv` - summary table for the classification evaluation of the resulting embeddings from \ndifferent implementations. More details about the eavluation setup could be found in \n[section 4](https://github.com/krishnanlab/PecanPy_benchmarks#4-classification-evaluation)\n\nTo generate the runtime stats and evaluation figures using the test data collected, open and run the two jupyter \nnotebooks (`runtime_stats.ipynb` and `classification_evaluation.ipynb`) under the `jup/` directory. The figures \nshould be produced at the end of each notebook.\n\n\n## 2. Setting up\n\nThis section describes in more detail the steps that are done when a user runs `setup.sh`, which includes \n* setting up directory structure for saving results and history files\n* building environments for different implementataions\n* downloading and processing data\n\n### 2.1 Directory\n\nThe `script/init_setup/setup_dir.sh` script will setup the `SLURM_history/` directory for holding SLURM job history \nfiles and the `result/` with the following \n```txt\nresult\n|-- emb\n|    |-- nodevectors\n|    |-- orig-cpp\n|    |-- orig-py\n|    |-- pecanpy-DenseOTF\n|    |-- pecanpy-PreComp\n|    `-- pecanpy-SparseOTF\n`-- stat\n     |-- nodevectors\n     |-- orig-cpp\n     |-- orig-py\n     |-- pecanpy-DenseOTF\n     |-- pecanpy-PreComp\n     `-- pecanpy-SparseOTF\n```\n\nwhere `emb/` holds the embedding files generated by different implementations, and `stat/` holds the runtime stat\ndata (execution time and physical memory usage) profiled by GNU `time` command. Each of the directory above is \nfurther subdivided with one folder per implementations  \n\n\n### 2.2 Environments\n\nWe tested 6 different implementations of the node2vec algorithm with 3 implementations coming from PecanPy and 3 \nother from alternative software packages. \n\n* [Original node2vec (Python)](https://github.com/aditya-grover/node2vec) - `orig-py`\n* [Original node2vec (C++)](https://github.com/snap-stanford/snap/tree/master/examples/node2vec) - `orig-cpp`\n* [PecanPy](https://github.com/krishnanlab/PecanPy)\n  * PreComp - `pecanpy-PreComp`\n  * SparseOTF - `pecanpy-SparseOTF`\n  * DenseOTF - `pecanpy-DenseOTF`\n* [NodeVectors](https://github.com/VHRanger/nodevectors) - `nodevectors`\n\nSince different libraries require different dependencies that are not compatible with each other, we set up conda \nconda environments for each of the libraries (except for `orig-cpp`, which requires building cpp source code \ninstead). Information about the conda environments is provided in the `env/` directory as `.yml` files. The \n`script/init_setup/setup_envs.sh` script will use these files to setup the three conda environments to be used \nlater by different libraries\n* `pecanpy-bench_node2vec`\n* `pecanpy-bench_nodevectors`\n* `pecanpy-bench_pecanpy`\n\n### 2.3 Data\n\nVarious networks with a wide range of sizes and densities are used for benchmarking different implementations of \nthe node2vec algorithm. The relatively small networks (BlogCatalog, PPI, Wikipedia) are provided in this repository \nalong with node labels. They are originally downloaded from the [node2vec](https://snap.stanford.edu/node2vec/) \nwebpage, and converted to `.txt` files from `.mat` files so that it is easier to load with Python. The rest of \nthe netowrks will be downloaded from other repositories, which will be automatically done by the \n`script/init_setup/setup_data.sh` script. The following table summaries some information about the networks tested.\n\n|Network|Weighted|# nodes|# edges|Density (unweighted)|File size|\n|:-|:-|-:|-:|-:|-:|\n|BioGRID|No|20,558|238,474|1.13E-03|2.5M|\n|STRING|Yes|17,352|3,640,737|2.42E-02|60M|\n|GIANT-TN-c01|Yes|25,689|38,904,929|1.18E-01|1.1G|\n|GIANT-TN|Yes|25,825|333,452,400|1.00E+00|7.2G|\n|SSN200|Yes|814,731|72,618,574|2.19E-04|2.0G|\n|BlogCatalog|No|10,312|333,983|6.28E-03|3.2M|\n|PPI|No|3,852|38,273|5.16E-03|707K|\n|Wikipedia|Yes|4,777|92,406|8.10E-03|2.0M|\n\n\n## 3. Submitting benchmark jobs\n\nEach implementation will be tested using two different resource configurations (*multi* and *single*). \nThe *multi* setup aims to test the capability of the implementation to make use of large computational \nresources, while the *single* setup tests the ability of the implementation run on a smaller amount of \ncomputational resources (e.g. what might be found on a laptop). The configuration details are the following  \n\n|Configuration|Core count|Memory (GB)|Time limit (hr)|\n|:-|:-|:-|:-|\n|Multi|28|200|24|\n|Single|1|32|8|\n\nAs mentioned earlier, `SLURM/submit_all.sh` will submit all test jobs at once. Each implementation and configuration \nhas its own test script, e.g. `SLURM/test_pecanpy-PreComp_single.sb` is the batch scrip for testing PecanPy-PreComp \nwith single-core resource configuration.\n\n***A note on testing `nodevectors` implementation:*** for all other implementation besides `nodevectors`, the `p` and \n`q` parameters are set to `1` by default, but for `nodevectors` they are set to `1.001`. If `p` and `q` are set to `1`, \n`nodevectors` will automatically perform 1st order random walk as a shortcut \n([link to source](https://github.com/VHRanger/CSRGraph/blob/8fb8f0e44aba1f147272bd8db19875756fde999f/csrgraph/graph.py#L241-L251)) \nother than 2nd order walk as required by the node2vec. Performing 1st order walk is significantly faster than performing \n2nd order random walk as all other implementations do, and hence providing biased testing results. Setting `p` and `q` \nto `1.001` disable the automatic fall back to 1st order random walk and at the same time keeps the results close to when \n`p` and `q` are set to `1`.\n\nAfter all test jobs are finished, the following Python script `~/script get_test_info.py` will be executed \nto extract test information from logging files and summarize into a single table to `~/result/stat_summary/summary.txt`. \nThe summary file can be renamed to specify specific benchmark conditions if needed. \nThe table consists of the following columns: \n* **Network** - name of the network embedded\n* **Method** - name of the implementation of node2vec\n* **Setup** - computational resource setting of the test\n* **Loading time** - (stage1) time used to load network into memory in desired format\n* **Preprocessing time** - (stage2) time used to pre-compute transition probability table\n* **Walking time** - (stage3) time used to generate random walks\n* **Training time** - (stage4) time used to train word2vec model using the random walks\n* **Total time** - total run time of the program (from starting Python, includes time to load packages, etc.)\n* **Total time in second** - same as total time, but converted to seconds\n* **Maximum resident size** - maximum physical memory used\n\n\n## 4. Classification Evaluation\n\nTo ensure the quality of the embeddings, we perform node classification tasks as presented in \n[node2vec](https://arxiv.org/abs/1607.00653) for BlogCatalog, PPI, and Wikipedia. Slight modification to the labels for PPI \nwas made to remove labelsets with less than 10 positives to ensure meaningful evaluation metrics. There are 38 node classes \nin BlogCatalog, 50 node classes in PPI, and 21 node classes in Wikipedia. For each node class in a network, a one vs rest l2 \nregularized [logistic regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) \nmodel is trained and evaluated through 5-fold cross validation. Each test fold is scored using auROC separately, and the mean \nauROC score across 5 folds is reported. This evaluation is repeated 10 times and the mean reported scores are taken as the \nfinal evaluation score the the class. \n\nAfter the evaluation, for each network, there should be a list of scores for each implementation, where the entries correspond \nto evaluation score for a particular node class. A wilcoxon paired test is then applied for each implementation against the \noriginal Python implementation to see if there is any significant performance drop using the new implementation.\n\n\n## 5. Extending Benchmarks\n\nWe welcome any researchers to use this repository to benchmark their own network of interest or even embedding programs. \nThis section provides guidelines for creating new tests and requirements for contributing to the repository by compiling \nmore implementations.\n\n### 5.1 Procedure for adding new implementation\n1. Add new implementation name to `data/implementation_list.txt` (see [section 5.3]() for more requirements of implementation)\n2. Create test job script using the templates provided (`SLURM/_test_template.sb` and `SLURM/_test_template_single.sb`), \nand follow the modifaction tag `### MODIFY` to modify the lines as needed\n  * `### MODIFY1` - sbatch job name (ends with `_s` for single-core configuration setup)\n  * `### MODIFY2` - name of implementation, need to match that added to `data/implementation_list.txt`\n  * `### MODIFY3` - change directory to source code for new implementation if needed (not required if setup through env)\n  * `### MODIFY4` - activte vritual environment for the new implementation if needed\n  * `### MODIFY5` - modify command for calling the program to embed network\n\n### 5.2 Procedure for adding new network\n1. Add new network as edgelist file (`.edg`) and add the network name (without file extension) to `data/networks.txt`.\n2. Add `true` (or `false`) to `data/weighted.txt` depending on whether the new network is weighted (or not). The order \nof network names in `data/networks.txt` and the corresponding weighted information in `data/weighted.txt` must match.\n\n### 5.3 Contributing\n* Prepare the `.yml` environment file with the naming convention of `pecanpy-bench_new-method` where `new-method` should \nbe replaced by the method's name, and modify `script/init_setup/setup_envs.sh` to setup environment for the new implementation.\n* Provide interfacing script to for bash to communicate with python if needed, the following inputs are required\n  * `--input` - input graph path\n  * `--output` - output embedding path\n  * `--dimension` - embedding dimension\n  * `--walk-length` - length of each walk\n  * `--num-walks` - number of walks per node\n  * `--workers` - number of workers\n  * `--p` - return parameter\n  * `--q` - inout parameter\n* For proper runtime stat retrieval, add the following printing statements to report execution time for each corresponding stage\n  ```python\n  print(\"Took %02d:%02d:%05.2f to load graph\"%(hrs, mins, secs))\n  print(\"Took %02d:%02d:%05.2f to pre-compute transition probabilities\"%(hrs, mins, secs))\n  print(\"Took %02d:%02d:%05.2f to generate walks\"%(hrs, mins, secs))\n  print(\"Took %02d:%02d:%05.2f to train embeddings\"%(hrs, mins, secs))\n  ```\n* Add data retrieval scripts and any necessary preprocessing scripts for new networks (as edgelist files, with `.edg` extension)\n","funding_links":[],"categories":["Shell"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnanlab%2FPecanPy_benchmarks","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrishnanlab%2FPecanPy_benchmarks","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnanlab%2FPecanPy_benchmarks/lists"}