{"id":13678524,"url":"https://github.com/krishnanlab/PecanPy","last_synced_at":"2025-04-29T15:31:39.731Z","repository":{"id":38187291,"uuid":"247705885","full_name":"krishnanlab/PecanPy","owner":"krishnanlab","description":"A fast, parallelized, memory efficient, and cache-optimized Python implementation of node2vec","archived":false,"fork":false,"pushed_at":"2024-10-29T00:27:58.000Z","size":1202,"stargazers_count":155,"open_issues_count":21,"forks_count":22,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-10-29T01:25:29.295Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krishnanlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-16T13:13:00.000Z","updated_at":"2024-10-02T14:26:13.000Z","dependencies_parsed_at":"2023-02-18T10:16:38.576Z","dependency_job_id":"bfd5a06b-b4e1-4036-afca-19ad314184ef","html_url":"https://github.com/krishnanlab/PecanPy","commit_stats":{"total_commits":238,"total_committers":10,"mean_commits":23.8,"dds":0.7100840336134453,"last_synced_commit":"297f89d4721664c0e00ad1978a9c76d5e0eede32"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishnanlab%2FPecanPy/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krishnanlab","download_url":"https://codeload.github.com/krishnanlab/PecanPy/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224178905,"owners_count":17268967,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T13:00:54.551Z","updated_at":"2024-11-11T21:30:30.476Z","avatar_url":"https://github.com/krishnanlab.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.6386437.svg)](https://doi.org/10.5281/zenodo.6386437)\n[![Documentation Status](https://readthedocs.org/projects/pecanpy/badge/?version=latest)](https://pecanpy.readthedocs.io/en/latest/?badge=latest)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n[![Tests](https://github.com/krishnanlab/PecanPy/actions/workflows/tests.yml/badge.svg)](https://github.com/krishnanlab/PecanPy/actions/workflows/tests.yml)\n\n# PecanPy: A parallelized, efficient, and accelerated _node2vec(+)_ in Python\n\nLearning low-dimensional representations (embeddings) of nodes in large graphs is key to applying machine learning on massive biological networks. _Node2vec_ is the most widely used method for node embedding. 
PecanPy is a fast, parallelized, memory-efficient, and cache-optimized Python implementation of [_node2vec_](https://github.com/aditya-grover/node2vec). It uses cache-optimized compact graph data structures together with precomputation and parallelization to produce fast, high-quality node embeddings for biological networks of all sizes and densities. Detailed source code documentation can be found [here](https://pecanpy.readthedocs.io/).\n\nThe implementation details and optimizations, along with benchmarks, are described in the application note [_PecanPy: a fast, efficient and parallelized Python implementation of node2vec_](https://doi.org/10.1093/bioinformatics/btab202), which is published in _Bioinformatics_. The benchmarking results presented in the preprint can be reproduced using the test scripts provided in the companion [benchmarks repo](https://github.com/krishnanlab/PecanPy_benchmarks).\n\n**v2 update**: PecanPy is now equipped with _node2vec+_, which is a natural extension of _node2vec_ and handles weighted graphs more effectively. For more information, see [*Accurately Modeling Biased Random Walks on Weighted Graphs Using Node2vec+*](https://arxiv.org/abs/2109.08031). 
The datasets and test scripts for reproducing the presented results are available in the [node2vec+ benchmarks repo](https://github.com/krishnanlab/node2vecplus_benchmarks).\n\n## Installation\n\nInstall from the latest release with:\n\n```bash\n$ pip install pecanpy\n```\n\nInstall the latest (unreleased) version in development mode with:\n\n```bash\n$ git clone https://github.com/krishnanlab/pecanpy.git\n$ cd pecanpy\n$ pip install -e .\n```\n\nwhere `-e` means \"editable\" mode, so you don't have to reinstall every time you make changes.\n\nPecanPy installs a command-line utility `pecanpy` that can be used directly.\n\n## Usage\n\nPecanPy operates in three different modes – `PreComp`, `SparseOTF`, and `DenseOTF` – that are optimized for networks of different sizes and densities: `PreComp` for networks that are small (≤10k nodes; any density), `SparseOTF` for networks that are large and sparse (\u003e10k nodes; ≤10% of edges), and `DenseOTF` for networks that are large and dense (\u003e10k nodes; \u003e10% of edges). These modes appropriately take advantage of compact/dense graph data structures, precomputing transition probabilities, and computing 2nd-order transition probabilities during walk generation to achieve significant improvements in performance.\n\n### Example\n\nTo run *node2vec* on Zachary's karate club network using `SparseOTF` mode, execute the following command from the project home directory:\n\n```bash\npecanpy --input demo/karate.edg --output demo/karate.emb --mode SparseOTF\n```\n\n### Node2vec+\n\nTo enable _node2vec+_, specify the `--extend` option.\n\n```bash\npecanpy --input demo/karate.edg --output demo/karate_n2vplus.emb --mode SparseOTF --extend\n```\n\n**Note**: _node2vec+_ is only beneficial for embedding _weighted_ graphs. For unweighted graphs, _node2vec+_ is equivalent to _node2vec_. 
The above example only serves as a demonstration of enabling _node2vec+_.\n\n### Demo\n\nExecute the following command for a full demonstration:\n\n```bash\nsh demo/run_pecanpy\n```\n\n### Mode\n\nAs mentioned above, PecanPy contains three main modes for generating node2vec random walks,\neach of which is better optimized for different network sizes/densities:\n\n| Mode | Network size/density | Optimization |\n|:-----|:---------------------|:-------------|\n| `PreComp` | \u003c10k nodes, \u003c0.1% edges | Precompute second-order transition probabilities, using CSR graph |\n| `SparseOTF` (default) | (≥10k nodes, ≥0.1% and \u003c20% of edges) or (\u003c10k nodes, ≥0.1% edges) | Transition probabilities computed on-the-fly, using CSR graph |\n| `DenseOTF` | \u003e20% of edges | Transition probabilities computed on-the-fly, using dense matrix |\n\n#### Compatibility and recommendations\n\n| Mode | Weighted | ``p,q!=1`` | Node2vec+ | Speed | Use this if |\n|:-----|----------------|---------------|-----------|:------------|:--------|\n|``PreComp``|:white_check_mark:|:white_check_mark:|:white_check_mark:|:dash::dash:|The graph is small and sparse|\n|``SparseOTF``|:white_check_mark:|:white_check_mark:|:white_check_mark:|:dash:|The graph is sparse but not necessarily small|\n|``DenseOTF``|:white_check_mark:|:white_check_mark:|:white_check_mark:|:dash:|The graph is extremely dense|\n|``PreCompFirstOrder``|:white_check_mark:|:x:|:x:|:dash::dash:|Run with ``p = q = 1`` on weighted graph|\n|``FirstOrderUnweighted``|:x:|:x:|:x:|:dash::dash::dash:|Run with ``p = q = 1`` on unweighted graph|\n\n### Options\n\nCheck out the full list of options available using:\n```bash\npecanpy --help\n```\n\n### Input\n\nThe supported input is a network edge list stored as an `.edg` file (node IDs can be integers or strings):\n\n```\nnode1_id node2_id \u003cweight_float, optional\u003e\n```\n\nAnother supported input format (only for `DenseOTF`) is the numpy array `.npz` file. 
Run the following command to prepare a `.npz` file from a `.edg` file.\n\n```bash\npecanpy --input $input_edgelist --output $output_npz --task todense\n```\n\nThe default delimiter for `.edg` files is a tab (`\\t`); you may change this by passing the `--delimiter` option.\n\n### Output\n\nThe output file has *n+1* lines for a graph with *n* vertices, with a header line of the following format:\n\n```\nnum_of_nodes dim_of_representation\n```\n\nEach of the next *n* lines contains a node ID followed by its representation of dimension *d*:\n\n```\nnode_id dim_1 dim_2 ... dim_d\n```\n\n### Development Note\n\nRun `black src/pecanpy/` to automatically apply black code formatting.\nRun `tox -e flake8` and resolve suggestions before committing to ensure consistent code style.\n\n## Additional Information\n### Documentation\nDetailed documentation for PecanPy is available [here](https://pecanpy.readthedocs.io/).\n\n### Support\nFor support, please consider opening a GitHub issue and we will do our best to reply in a timely manner.\nAlternatively, if you would like to keep the conversation private, feel free to contact [Remy Liu](https://twitter.com/RemyLau3) at liurenmi@msu.edu.\n\n### License\nThis repository and all its contents are released under the [BSD 3-Clause License](https://opensource.org/licenses/BSD-3-Clause); see [LICENSE.md](https://github.com/krishnanlab/pecanpy/blob/master/LICENSE.md).\n\n### Citation\nIf you use PecanPy, please cite:\nLiu R, Krishnan A (2021) **PecanPy: a fast, efficient, and parallelized Python implementation of _node2vec_.** _Bioinformatics_ https://doi.org/10.1093/bioinformatics/btab202\n\nIf you find _node2vec+_ useful, please cite:\nLiu R, Hirn M, Krishnan A (2023) **Accurately modeling biased random walks on weighted graphs using _node2vec+_.** _Bioinformatics_ https://doi.org/10.1093/bioinformatics/btad047\n\n### Authors\nRenming Liu, Arjun Krishnan*\n\u003e\\*General correspondence should be addressed to AK at 
arjun.krishnan@cuanschutz.edu.\n\n### Funding\nThis work was primarily supported by US National Institutes of Health (NIH) grants R35 GM128765 to AK and in part by MSU start-up funds to AK.\n\n### Acknowledgements\nWe thank [Christopher A. Mancuso](https://github.com/ChristopherMancuso), [Anna Yannakopoulos](http://yannakopoulos.com/), and the rest of the [Krishnan Lab](https://www.thekrishnanlab.org/team) for valuable discussions and feedback on the software and manuscript. Thanks to [Charles T. Hoyt](https://github.com/cthoyt) for making the software `pip` installable and for an extensive code review.\n\n### References\n\n**Original _node2vec_**\n* Grover, A. and Leskovec, J. (2016) node2vec: Scalable Feature Learning for Networks. ArXiv160700653 Cs Stat.\nOriginal _node2vec_ software and networks\n  * https://snap.stanford.edu/node2vec/ contains the original software and the networks (PPI, BlogCatalog, and Wikipedia) used in the original study (Grover and Leskovec, 2016).\n\n**Other networks**\n* Stark, C. et al. (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Res., 34, D535–D539.\n  * BioGRID human protein-protein interactions.\n\n* Szklarczyk, D. et al. (2015) STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic Acids Res., 43, D447–D452.\n  * STRING predicted human gene interactions.\n\n* Greene, C.S. et al. (2015) Understanding multicellular function and disease with human tissue-specific networks. Nat. Genet., 47, 569–576.\n  * GIANT-TN is a generic genome-scale human gene network. GIANT-TN-c01 is a sub-network of GIANT-TN where edges with edge weight below 0.01 are discarded.\n\nBioGRID (Stark et al., 2006), STRING (Szklarczyk et al., 2015), and GIANT-TN (Greene et al., 2015) are available from https://doi.org/10.5281/zenodo.3352323.\n\n* Law, J.N. et al. (2019) Accurate and Efficient Gene Function Prediction using a Multi-Bacterial Network. 
bioRxiv, 646687.\n  * SSN200 is a cross-species network of proteins from 200 species with the edges representing protein sequence similarities. Downloaded from https://bioinformatics.cs.vt.edu/~jeffl/supplements/2019-fastsinksource/.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnanlab%2FPecanPy","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrishnanlab%2FPecanPy","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishnanlab%2FPecanPy/lists"}