{"id":20414408,"url":"https://github.com/pykeen/ilpc2022","last_synced_at":"2025-04-12T16:44:22.190Z","repository":{"id":43711656,"uuid":"460713416","full_name":"pykeen/ilpc2022","owner":"pykeen","description":"🏅 KG Inductive Link Prediction Challenge (ILPC) 2022","archived":false,"fork":false,"pushed_at":"2022-03-12T01:22:19.000Z","size":2813,"stargazers_count":84,"open_issues_count":2,"forks_count":17,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-26T11:11:12.520Z","etag":null,"topics":["benchmarking","machine-learning","pykeen"],"latest_commit_sha":null,"homepage":"https://pykeen.github.io/ilpc2022/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pykeen.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-02-18T04:48:16.000Z","updated_at":"2024-12-13T09:05:21.000Z","dependencies_parsed_at":"2022-09-24T15:52:11.275Z","dependency_job_id":null,"html_url":"https://github.com/pykeen/ilpc2022","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Filpc2022","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Filpc2022/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Filpc2022/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pykeen%2Filpc2022/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pykeen","download_url":"https://codeload.github.com/pykeen/ilpc2022/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248600108,"owners_count":21131425,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","machine-learning","pykeen"],"created_at":"2024-11-15T06:09:46.605Z","updated_at":"2025-04-12T16:44:22.160Z","avatar_url":"https://github.com/pykeen.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# KG Inductive Link Prediction Challenge (ILPC) 2022\n\n[![arXiv](https://img.shields.io/badge/arXiv-2203.01520-b31b1b)](https://arxiv.org/abs/2203.01520)\n[![Zenodo DOI](https://zenodo.org/badge/460713416.svg)](https://zenodo.org/badge/latestdoi/460713416)\n\n[This repository](https://github.com/pykeen/ilpc2022) introduces the [ILPC'22 Small](data/small) and [ILPC'22 Large](data/large)\ndatasets for benchmarking inductive link prediction models and outlines the 2022\nincarnation of the Inductive Link Prediction Challenge (ILPC).\n\n## 🗄️ Datasets\n\n\u003cimg alt=\"A schematic diagram of inductive link prediction\"\nsrc=\"https://pykeen.readthedocs.io/en/latest/_images/ilp_1.png\"\nheight=\"200\" align=\"right\"\n/\u003e\n\nWhile in *transductive* link prediction, the training and inference graph are\nthe same (and therefore contain the same entities), in *inductive* link\nprediction, there is a disjoint inference graph that potentially contains new,\nunseen entities.\n\nFor this challenge, we sampled two datasets from [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page), \nthe largest publicly available and open KG. Inductive link prediction implies\ntraining a model on one graph (denoted as `training`) and performing inference, \neg, validation and test, over a new graph (denoted as `inference`). \n\nDataset creation principles:\n* Represents a real-world KG used in many NLP and ML tasks (Wikidata)\n* Larger than [existing benchmarks](https://github.com/pykeen/pykeen/blob/master/src/pykeen/datasets/inductive/ilp_teru.py)\n* Allows for fast iteration and hypothesis testing - we sampled two datasets, small and large, that vary in sizes of training and\ninference graphs. \n* The ratio of training / inference graph sizes is challenging for modern GNNs.\n* Training graph is a connected component\n* Inference graph is a connected component\n\n\nBoth the small and large variants of the dataset can be found in the\n[`data`](data) folder of this repository. Each contains four splits\ncorresponding to the diagram:\n\n* `train.txt` - the training graph on which you are supposed to train a model\n* `inference.txt` - the inductive inference graph **disjoint** with the training\n  one - that is, it has a new non-overlapping set of entities, the missing links\n  are sampled from this graph\n* `inductive_validation.txt` - validation set of triples to predict, uses\n  entities from the **inference** graph\n* `inductive_test.txt` - test set of triples to predict, uses entities from\n  the **inference** graph\n* a hold-out test set of triples - kept by the organizers for the final ranking\n  😉 , uses entities from the **inference** graph\n\n### [ILPC'22 Small](data/small)\n\n| Split                |  Entities |   Relations | Triples |\n|----------------------|----------:|------------:|--------:|\n| Train                |    10,230 |          96 |  78,616 |\n| Inference            |     6,653 | 96 (subset) |  20,960 |\n| Inference validation |     6,653 | 96 (subset) |   2,908 |\n| Inference test       |     6,653 | 96 (subset) |   2,902 |\n| Hold-out test set    |     6,653 | 96 (subset) |   2,894 |\n\n### [ILPC'22 Large](data/large)\n\n| Split                | Entities |         Relations | Triples |\n|----------------------|---------:|------------------:|--------:|\n| Train                |   46,626 |               130 | 202,446 |\n| Inference            |   29,246 |      130 (subset) |  77,044 |\n| Inference validation |   29,246 |      130 (subset) |  10,179 |\n| Inference test       |   29,246 |      130 (subset) |  10,184 |\n| Hold-out test set    |   29,246 |      130 (subset) |  10,172 |\n\n## 🏅 Challenge\n\nThe Challenge aims to streamline community efforts in the emerging area of representation learning techniques beyond shallow entity embeddings.\nWe invite submissions proposing new inductive models as well as extending baseline models to achieve higher performance.\n\nWe use the\nfollowing [rank-based evaluation metrics](https://pykeen.readthedocs.io/en/stable/tutorial/understanding_evaluation.html):\n\n* MRR (Inverse Harmonic Mean Rank) - higher is better, range `[0, 1]`\n* Hits @ K (H@K; with K as one of `{1, 3, 5, 10, 100}`) - higher is better, range `[0, 1]`\n* AMRI (Adjusted Arithmetic Mean Rank Index) - higher is better, compares model scoring \nagainst random scoring, range `[-1, 1]`. AMRI=0 means the model is not better than random scoring.\n\nMaking a submission:\n\n1. Fork the repo\n2. Train your inductive link prediction model\n3. Save the model weights using the `--save` flag\n4. Upload model weights on GitHub or other platforms (Dropbox, Google Drive,\n   etc)\n5. Open an issue in **this** repo with the link to your repo, performance\n   metrics, and model weights\n\n## 🎸 Baselines\n\nWe provide an example workflow in [`main.py`](main.py) for training and\nevaluating two variants of the [NodePiece](https://arxiv.org/abs/2106.12144)\nmodel using [PyKEEN](https://github.com/pykeen/pykeen):\n\n* `InductiveNodePiece` - plain tokenizer + tokens MLP encoder to bootstrap node\n  representations. Fast.\n* `InductiveNodePieceGNN` - everything above + an additional\n  2-layer [CompGCN](https://arxiv.org/abs/1911.03082) message passing encoder.\n  Slower but performs better.\n\nThe example can be run with `python main.py` and the options can be listed\nwith `python main.py --help`.\n\n\u003c!--\nTraining shallow entity embeddings in this setup is useless as trained embeddings cannot be used for inference over unseen entities.\nThat's why we need new representation learning mechanisms - in particular, we use [NodePiece](https://arxiv.org/abs/2106.12144) for the baselines.\n\nNodePiece in the inductive mode will use the set of relations seen in the training graph to *tokenize* entities in the training and inference graphs.\nWe can afford tokenizing the nodes in the *inference* graph since the set of relations **is shared** between training and inference graphs \n(more formally, the set of relations of the inference graph is a subset of training ones).\n\nFor more information on the models check out the [PyKEEN tutorial](https://pykeen.readthedocs.io/en/latest/tutorial/inductive_lp.html) on inductive link prediction with NodePiece\n--\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eInstallation Instructions\u003c/summary\u003e\n\nMain requirements:\n\n* python \u003e= 3.9\n* torch \u003e= 1.10\n\nYou will need PyKEEN 1.8.0 or newer.\n\n```shell\n$ pip install pykeen\n```\n\nBy the time of creation of this repo 1.8.0 is not yet there, but the latest\nversion from sources contains everything we need\n\n```shell\n$ pip install git+https://github.com/pykeen/pykeen.git\n```\n\nIf you plan to use GNNs (including the `InductiveNodePieceGNN` baseline) make\nsure you install [torch-scatter](https://github.com/rusty1s/pytorch_scatter)\nand [torch-geometric](https://github.com/pyg-team/pytorch_geometric)\ncompatible with your python, torch, and CUDA versions.\n\nRunning the code on a GPU is strongly recommended.\n\n\u003c/details\u003e\n\n### Baseline Performance on Small Dataset\n\nWe report the performance of both variants of the NodePiece model on the small\nvariant of the dataset after running the following:\n\n* InductiveNodePieceGNN (32d, 50 epochs, 24K params) - NodePiece (5 tokens per\n  node, MLP aggregator) + 2-layer CompGCN with DistMult composition function +\n  DistMult decoder. Training time: **77 min***\n  ```shell\n  $ python main.py --dataset small -d 32 -e 50 -n 16 -m 2.0 -lr 0.0001 --gnn\n  ```\n* InductiveNodePiece (32d, 50 epochs, 15.5K params) - NodePiece (5 tokens per\n  node, MLP aggregator) + DistMult decoder. Training time: **6 min***\n  ```shell\n  $ python main.py --dataset small -d 32 -e 50 -n 16 -m 5.0 -lr 0.0001\n  ```\n\n| **Model**             |        MRR |      H@100 |       H@10 |        H@5 |        H@3 |        H@1 |      AMRI |\n|-----------------------|-----------:|-----------:|-----------:|-----------:|-----------:|-----------:|----------:|\n| InductiveNodePieceGNN | **0.1326** | **0.4705** | **0.2509** | **0.1899** | **0.1396** | **0.0763** | **0.730** |\n| InductiveNodePiece    |     0.0381 |     0.4678 |     0.0917 |     0.0500 |     0.0219 |      0.007 |     0.666 |\n\n### Baseline Performance on Large Dataset\n\nWe report the performance of both variants of the NodePiece model on the large\nvariant of the dataset after running the following:\n\n* InductiveNodePieceGNN (32d, 53 epochs, 24K params) - NodePiece (5 tokens per\n  node, MLP aggregator) + 2-layer CompGCN with DistMult composition function +\n  DistMult decoder. Training time: **8 hours***\n  ```shell\n  $ python main.py --dataset large -d 32 -e 53 -n 16 -m 20.0 -lr 0.0001 --gnn\n  ```\n* InductiveNodePiece (32d, 17 epochs, 15.5K params) - NodePiece (5 tokens per\n  node, MLP aggregator) + DistMult decoder. Training time: **5 min***\n  ```shell\n  $ python main.py --dataset large -d 32 -e 17 -n 16 -m 15.0 -lr 0.0001\n  ```\n\n| **Model**             |    MRR |     H@100 |       H@10 |        H@5 |        H@3 |    H@1 |      AMRI |\n|-----------------------|-------:|----------:|-----------:|-----------:|-----------:|-------:|----------:|\n| InductiveNodePieceGNN | 0.0705 | **0.374** | **0.1458** | **0.0990** | **0.0730** | 0.0319 | **0.682** |\n| InductiveNodePiece    | 0.0651 |     0.287 |     0.1246 |     0.0809 |     0.0542 | 0.0373 |     0.646 |\n\n\\* Note: All models were trained on a single RTX 8000. Average memory\nconsumption during training is about 2 GB VRAM on the `small` dataset and about\n3 GB on `large`.  \n\n## 👋 Attribution\n\n### ⚖️ License\n\nThe code in this package is licensed under the MIT License. The datasets in this\nrepository are licensed under the Creative Commons Zero license. The trained\nmodels and their weights are licensed under the Creative Commons Zero license.\n\n### 📖 Citation\n\nIf you use the ILPC'22 datasets in your work, please cite the following:\n\n```bibtex\n@article{Galkin2022,\n  archivePrefix = {arXiv},\n  arxivId = {2203.01520},\n  author = {Galkin, Mikhail and Berrendorf, Max and Hoyt, Charles Tapley},\n  eprint = {2203.01520},\n  month = {mar},\n  title = {{An Open Challenge for Inductive Link Prediction on Knowledge Graphs}},\n  url = {http://arxiv.org/abs/2203.01520},\n  year = {2022}\n}\n```\n\n### 🎁 Support\n\nThis project has been supported by several organizations (in alphabetical order):\n\n- [Harvard Program in Therapeutic Science - Laboratory of Systems Pharmacology](https://hits.harvard.edu/the-program/laboratory-of-systems-pharmacology/)\n- [Ludwig-Maximilians-Universität München](https://www.en.uni-muenchen.de/index.html)\n- [Mila](https://mila.quebec/)\n- [Munich Center for Machine Learning (MCML)](https://mcml.ai/)\n\n### 🏦 Funding\n\nThis project has been funded by the following grants:\n\n| Funding Body                                             | Program                                                                          | Grant         |\n|----------------------------------------------------------|----------------------------------------------------------------------------------|---------------|\n| DARPA                                                    | [Young Faculty Award (PI: Benjamin Gyori)](https://indralab.github.io/#projects) | W911NF2010255 |\n| German Federal Ministry of Education and Research (BMBF) | [Munich Center for Machine Learning (MCML)](https://mcml.ai)                     | 01IS18036A    |\n| Samsung                                                  | Samsung AI Grant                                                                 | -             |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpykeen%2Filpc2022","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpykeen%2Filpc2022","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpykeen%2Filpc2022/lists"}