{"id":13689371,"url":"https://github.com/Aghasemian/OptimalLinkPrediction","last_synced_at":"2025-05-01T23:34:14.332Z","repository":{"id":72703164,"uuid":"208381396","full_name":"Aghasemian/OptimalLinkPrediction","owner":"Aghasemian","description":"This page is a companion for our paper on optimal link prediction, written by Amir Ghasemian, Homa Hosseinmardi, Aram Galstyan, Edoardo M. Airoldi, and Aaron Clauset. (arXiv:1909.07578)","archived":false,"fork":false,"pushed_at":"2024-03-12T02:02:54.000Z","size":8752,"stargazers_count":59,"open_issues_count":1,"forks_count":20,"subscribers_count":5,"default_branch":"master","last_synced_at":"2024-08-03T15:17:34.428Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Aghasemian.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-09-14T03:06:27.000Z","updated_at":"2024-05-20T02:11:19.000Z","dependencies_parsed_at":null,"dependency_job_id":"287e0a13-c7ae-4902-b7ad-d570a0820e2e","html_url":"https://github.com/Aghasemian/OptimalLinkPrediction","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aghasemian%2FOptimalLinkPrediction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aghasemian%2FOptimalLinkPrediction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aghasemian%2FOptimalLinkPrediction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Aghasemian%2FOptimalLinkPrediction/mani
fests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Aghasemian","download_url":"https://codeload.github.com/Aghasemian/OptimalLinkPrediction/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224282254,"owners_count":17285795,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T15:01:44.967Z","updated_at":"2024-11-12T13:31:36.010Z","avatar_url":"https://github.com/Aghasemian.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Optimal Link Prediction\n\n\u003cp align=\"justify\"\u003eThis page is a companion for the paper \t\n  \n\u003e \u003cb\u003eAmir Ghasemian\u003c/b\u003e, Homa Hosseinmardi, Aram Galstyan, Edoardo M. Airoldi and Aaron Clauset\n\u003e \u003cbr\u003e\u003cb\u003e\u003ca href=\"https://www.pnas.org/content/early/2020/09/03/1914950117\" target=\"_blank\"\u003eStacking Models for Nearly Optimal Link Prediction in Complex Networks\u003c/a\u003e, \u003ci\u003ePNAS USA\u003c/i\u003e 117(38), 23393-23400 (2020).\n\non optimal link prediction.\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src =\"Images/OptimalLinkPrediction_logo.png\" width=700\u003e\u003cbr\u003e\n\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003eHere, we provide (i) a reference set of networks as a benchmark for link prediction (Fig. 
S1 of the paper), (ii) the necessary code to generate 42 topological features for each network (Table S1 of the paper), and (iii) a stacking method that combines these topological features for link prediction.\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src =\"Images/Fig_S1.png\" width=700\u003e\u003cbr\u003e\n\u003cb\u003eFig. S1 of the paper\u003c/b\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src =\"Images/Table_S1.png\" width=900\u003e\u003cbr\u003e\n\u003cb\u003eTable S1 of the paper\u003c/b\u003e\n\u003c/p\u003e\n\n\n\u003cp align=\"justify\"\u003e The most common approach to predicting missing links constructs a score function from network statistics of each unconnected node pair. We studied 42 of these topological predictors in the paper, including predictors based on node degrees, common neighbors, random walks, and node and edge centralities, among others (see SI Appendix, Table S1). Also commonly used for link prediction are models of large-scale network structure and the proximity of an unconnected pair after a network's nodes are embedded into a latent space. We also studied 11 model-based methods (Table S2 of the paper) and 150 embedding-based predictors, the latter derived from two popular graph embedding algorithms and six notions of distance or similarity in the latent space. In total, we considered 203 features of node pairs. \u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src =\"Images/Table_S2.png\" width=900\u003e\u003cbr\u003e\n\u003cb\u003eTable S2 of the paper\u003c/b\u003e\n\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003eAcross domains, predictor importances cluster in interesting ways, such that some individual predictors and some families of predictors perform better on specific domains. For instance, examining the 10 most-important predictors by domain (29 unique predictors; Fig. 
1 of the paper), we find that topological methods, such as those based on common neighbors or localized random walks, perform well on social networks but less well on networks from other domains. In contrast, model-based methods perform relatively well across domains, but often perform less well on social networks than do topological measures and some embedding-based methods. Together, these results indicate that predictor methods exhibit a broad diversity of errors, which tend to correlate somewhat with scientific domain.\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src =\"Images/Fig_1.png\" width=900\u003e\u003cbr\u003e\n\u003cb\u003eFig. 1 of the paper\u003c/b\u003e\n\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003eThis performance heterogeneity highlights the practical relevance to link prediction of the general No Free Lunch theorem, which proves that across all possible inputs, every machine learning method has the same average performance, and hence accuracy must be assessed on a per-dataset basis. The observed diversity of errors indicates that none of the 203 individual predictors is a universally-best method for the subset of all inputs that are realistic. However, that diversity also implies that a nearly-optimal link prediction method for realistic inputs could be constructed by combining individual methods so that the best individual method is applied for each given input. Such a meta-learning algorithm cannot circumvent the No Free Lunch theorem, but it can achieve optimal performance on realistic inputs by effectively redistributing its worse-than-average performance onto unrealistic inputs, which are unlikely to be encountered in practice.\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003e On this page we also provide one of the stacking methods from our paper, so that it is accessible to all researchers in the field. 
In the provided Python module, we construct 42 topological features and combine them using a standard random forest as the supervised learning algorithm. We do not include the model-based or embedding-based features, or the corresponding stacked models, because the implementations we used of these techniques are not ours to reshare; we refer interested readers to the sources cited in the paper for specific details.\u003c/p\u003e\n\n### Download the package:\n\u003cp align=\"left\"\u003e\n\u003ca href=\"./Benchmark/OLP_updated.pickle\"\u003eDownload Pickle Format\u003c/a\u003e.\u003c/p\u003e\n\n\u003cp align=\"justify\"\u003eThis package contains the corpus of 550 real-world networks, a slightly expanded version of \u003ca href=\"https://github.com/Aghasemian/CommunityFitNet\"\u003ethe CommunityFitNet corpus\u003c/a\u003e, from many scientific domains drawn from the Index of Complex Networks (\u003ca href=\"https://icon.colorado.edu/#!/\"\u003eICON\u003c/a\u003e). This corpus spans a variety of sizes and structures, with 23% social, 23% economic, 32% biological, 12% technological, 3% information, and 7% transportation graphs (Fig. S1 of the paper). More information regarding the partitions achieved by 16 state-of-the-art community detection algorithms over these networks is provided in \u003ca href=\"https://github.com/Aghasemian/CommunityFitNet\"\u003e CommunityFitNet\u003c/a\u003e.\u003c/p\u003e\n\n### Download the code:\n\u003cp align=\"left\"\u003e\n\u003ca href=\"./Code/OLP.py\"\u003eTopol. 
Stacking Method\u003c/a\u003e.\u003c/p\u003e\n\n### Instructions for using the package and running the code:\n\n\u003cp align=\"justify\"\u003e To load the data:\u003c/p\u003e\n\n```python\nimport pickle\n\n# load the data; the with-block closes the file once loading is done\nwith open('./Benchmark/OLP_updated.pickle', 'rb') as infile:\n    df = pickle.load(infile)\n\n# read edge lists for all networks\ndf_edgelists = df['edges_id'] # column 'edges_id' in dataframe df includes the edge list\n                              # for each network\n\n# extract the edge list for the first network\nedges_orig = df_edgelists.iloc[0] # a numpy array of edge list for original graph\n```\n\n\u003cp align=\"justify\"\u003e To run the topological feature stacking model on one of the networks in the real dataset:\u003c/p\u003e\n\n```python\nimport OLP as olp\n# run topological stacking model\nolp.topol_stacking(edges_orig)\n```\n\n\u003cp align=\"justify\"\u003eTo run a demo:\u003c/p\u003e\n\n```python\nimport OLP as olp\nolp.demo()\n```\n\n### How to cite this work:\n\u003cp\u003eIf you use this code or data in your research, please cite it as follows:\u003c/p\u003e\n\u003cpre\u003e\n@article{ghasemian2020stacking,\n  title = {Stacking models for nearly optimal link prediction in complex networks},\n  author = {Ghasemian, Amir and Hosseinmardi, Homa and Galstyan, Aram and Airoldi, Edoardo M and Clauset, Aaron},\n  journal = {Proceedings of the National Academy of Sciences},\n  volume = {117},\n  number = {38},\n  pages = {23393--23400},\n  year = {2020},\n  publisher = {National Acad Sciences},\n}\n\u003c/pre\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAghasemian%2FOptimalLinkPrediction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FAghasemian%2FOptimalLinkPrediction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FAghasemian%2FOptimalLinkPrediction/lists"}