{"id":28700494,"url":"https://github.com/deepgraphlearning/inductiveqe","last_synced_at":"2025-06-14T11:08:06.112Z","repository":{"id":63013079,"uuid":"564128335","full_name":"DeepGraphLearning/InductiveQE","owner":"DeepGraphLearning","description":"Official implementation of Inductive Logical Query Answering in Knowledge Graphs (NeurIPS 2022)","archived":false,"fork":false,"pushed_at":"2022-11-10T03:35:58.000Z","size":3204,"stargazers_count":48,"open_issues_count":0,"forks_count":6,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-04-12T05:36:27.700Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/DeepGraphLearning.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-10T03:32:54.000Z","updated_at":"2025-03-21T09:31:02.000Z","dependencies_parsed_at":"2022-11-10T23:00:34.681Z","dependency_job_id":null,"html_url":"https://github.com/DeepGraphLearning/InductiveQE","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/DeepGraphLearning/InductiveQE","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FInductiveQE","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FInductiveQE/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FInductiveQE/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FInductiveQE/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/DeepGraphLearning","download_url":"https://codeload.github.com/DeepGraphLearning/InductiveQE/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/DeepGraphLearning%2FInductiveQE/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259804865,"owners_count":22913903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-14T11:08:05.433Z","updated_at":"2025-06-14T11:08:06.107Z","avatar_url":"https://github.com/DeepGraphLearning.png","language":"Python","readme":"# Inductive Logical Query Answering in Knowledge Graphs (NeurIPS 2022) #\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://arxiv.org/pdf/2210.08008.pdf\"\u003e\u003cimg src=\"http://img.shields.io/badge/Paper-PDF-red.svg\" alt=\"NeurIPS paper\"\u003e\u003c/a\u003e\n\u003ca href=\"https://doi.org/10.5281/zenodo.7306046\"\u003e\u003cimg src=\"https://zenodo.org/badge/DOI/10.5281/zenodo.7306046.svg\" alt=\"InductiveQE dataset\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n![InductiveQE animation](asset/inductive_qe.gif)\n\nThis is the official code base of the paper\n\n[Inductive Logical Query Answering in Knowledge Graphs][paper]\n\n[Mikhail Galkin](https://migalkin.github.io),\n[Zhaocheng Zhu](https://kiddozhu.github.io),\n[Hongyu Ren](http://hyren.me/),\n[Jian Tang](https://jian-tang.com)\n\n[paper]: https://arxiv.org/abs/2210.08008\n\n## Overview ##\n\n**Important: the camera-ready NeurIPS'22 version was identified to have datasets with possible test set leakages.\nThe new version including this repository and 
updated Arxiv submission have new datasets and experiments where this\nissue has been fixed. We recommend to use the latest version of datasets ([2.0 on Zenodo][dataset]) and experiments ([v2 on arXiv][paper]) for further\ncomparisons.**\n\n[dataset]: https://zenodo.org/record/7306046\n\nInductive query answering is the setup where at inference time an underlying graph can have new, unseen entities.\nIn this paper, we study a practical inductive setup when a training graph **is extended** with more nodes and edges\nat inference time. That is, an inference graph is always a superset of the training graph.\nNote that the inference graph always shares the same set of relation types with the training graph.\n\nThe two big implications of the inductive setup:\n* test queries involve new, unseen nodes where answers can be both seen and unseen nodes;\n* training queries now might have more answers among new nodes.\n\nThe two inductive approaches implemented in this repo:\n1. **NodePiece-QE** (Inductive node representations): based on [NodePiece](https://github.com/migalkin/NodePiece) and [CQD](https://github.com/pminervini/KGReasoning/).\nTrain on 1p link prediction, **inference-only** zero-shot logical query answering over unseen entities.\nThe NodePiece encoder can be extended with the additional GNN encoder (CompGCN) that is denoted as **NodePiece-QE w/ GNN** in the paper.\n2. 
**Inductive GNN-QE** (Inductive relational structure representations): based on [GNN-QE](https://github.com/DeepGraphLearning/GNN-QE).\nTrainable on complex queries, achieves higher performance than NodePiece-QE but is more expensive to train.\n\nWe additionally provide a dummy Edge-type Heuristic (`model.HeuristicBaseline`) that only considers possible tails of the last relation projection step of any query pattern.\n\n## Data ##\n\nWe created 10 new inductive query answering datasets where validation/test graphs extend the training graph and contain new entities:\n* Small-scale: 9 datasets based on FB15k-237 with the ratio of *inference-to-train nodes* varies from 106% to 550%, total of 15k nodes for various splits.\n* Large-scale: 1 dataset based on OGB WikiKG2 with the fixed ratio of 133% and 1.5M training nodes but with 500K new nodes and 5M new edges at inference.\n\n\u003cdetails\u003e\n\u003csummary\u003eDatasets Description\u003c/summary\u003e\n\nEach dataset is a zip archive containing 17 files:\n\n* `train_graph.txt` (pt for wikikg) - original training graph\n* `val_inference.txt` (pt) - inference graph (validation split), new nodes in validation are disjoint with the test inference graph\n* `val_predict.txt` (pt) - missing edges in the validation inference graph to be predicted.\n* `test_inference.txt` (pt) - inference graph (test splits), new nodes in test are disjoint with the validation inference graph\n* `test_predict.txt` (pt) - missing edges in the test inference graph to be predicted.\n* `train/valid/test_queries.pkl` - queries of the respective split, 14 query types for fb-derived datasets, 9 types for WikiKG (EPFO-only)\n* `*_answers_easy.pkl` - easy answers to respective queries that do not require predicting missing links but only edge traversal\n* `*_answers_hard.pkl` - hard answers to respective queries that DO require predicting missing links and against which the final metrics will be computed\n* `train_answers_valid.pkl` - the extended 
set of answers for training queries on the bigger validation graph, most of training queries have at least 1 more new answers. This is supposed to be an inference-only dataset to measure faithfulness of trained models\n* `train_answers_test.pkl` - the extended set of answers for training queries on the bigger test graph, most of training queries have at least 1 more new answers. This is supposed to be an inference-only dataset to measure faithfulness of trained models\n* `og_mappings.pkl` - contains entity2id / relation2id dictionaries mapping local node/relation IDs from a respective dataset to the original fb15k237 / wikikg2\n* `stats.txt` - a small file with dataset stats\n\u003c/details\u003e\n\nAll datasets are available on [Zenodo](https://zenodo.org/record/7306046), please refer to v2.0 of the datasets.\nThe datasets will be downloaded automatically upon the first run.\n\nAdditionally, we provide lightweight dumps ([Zenodo](https://zenodo.org/record/7306061)) just of those graphs (without queries and answers) for training simple link prediction and KG completion models.\nPlease refer to v2.0 of the datasets.\n\n## Installation ##\n\nThe dependencies can be installed via either conda or pip. 
NodePiece-QE and GNN-QE are compatible\nwith Python 3.7/3.8/3.9 and PyTorch \u003e= 1.8.0.\n\n### From Conda ###\n\n```bash\nconda install torchdrug pytorch cudatoolkit -c milagraph -c pytorch -c pyg\nconda install pytorch-sparse pytorch-scatter -c pyg\nconda install easydict pyyaml -c conda-forge\n```\n\n### From Pip ###\n\n```bash\npip install torchdrug torch\npip install easydict pyyaml\npip install wandb tensorboardx\n```\n\nThen install `torch-scatter` and `torch-sparse` following the instructions in the [Github repo](https://github.com/rusty1s/pytorch_sparse).\nFor example, for PyTorch 1.10 and CUDA 10.2:\n\n```bash\npip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cu102.html\n```\n\n\n## Usage ##\n\n### NodePiece-QE ###\n\nConceptually, running NodePiece-QE consists of two parts:\n1. Training a neural link predictor using NodePiece (+ optional GNN), saving materialized embeddings of the test graph.\n2. Running CQD over the saved embeddings.\n\n**Step 1: Training a Link Predictor**\n\nUse the `NodePiece` model with the task `InductiveKnowledgeGraphCompletion` applied to the dataset of choice.\n\nWe prepared 5 configs for FB15k-237-derived datasets in the `config/lp_pretraining` directory, 2 for NodePiece w/o GNN\nand 3 for NodePiece w/ GNN, following the reported hyperparameters in the paper.\n`_550` configs have a higher `input_dim` so we decided to have a dedicated file for them to send less params to the\ntraining script.\n\nWe also provide 2 configs for the WikiKG graph and recommend running pre-training in the multi-gpu mode due to the size\nof the graph.\n\nExample of training a vanilla NodePiece on the 175% dataset:\n\n```bash\npython script/run.py -c config/lp_pretraining/nodepiece_nognn.yaml --ratio 175 --temp 0.5 --epochs 2000 --gpus [0] --logger console\n```\n\nNodePiece + GNN on the 175% dataset:\n```bash\npython script/run.py -c config/lp_pretraining/nodepiece_gnn.yaml --ratio 175 --temp 1.0 --epochs 1000 --gpus [0] 
--logger console\n```\n\nFor datasets of ratios 106-150 use the 5-layer GNN config `config/lp_pretraining/nodepiece_gnn.yaml`.\n\n* Use `--gpus null` to run the scripts on a CPU.\n* Use `--logger wandb` to send training logs to wandb, don't forget to prepend env variable `WANDB_ENTITY=(your_entity)`\nbefore executing the python script.\n\nAfter training, materialized entity and relation embeddings of the test graph will be stored in the `output_dir` folder.\n\nWikiKG training requires a vocabulary of mined NodePiece anchors, we ship a precomputed\nvocab `20000_anchors_d0.4_p0.4_r0.2_25sp_bfs.pkl` together with the `wikikg.zip` archive.\nYou can mine your own vocab playing around with the `NodePieceTokenizer` -- mining is implemented on a GPU and\nshould be much faster than the original NodePiece implementation.\n\nAn example WikiKG link prediction pre-training config should contain `--vocab` param to the mined vocab, e.g.,\n```bash\npython script/run.py -c config/lp_pretraining/wikikg_nodepiece_nognn.yaml --gpus [0] --vocab /path/to/pickle/vocab.pkl\n```\n\nWe highly recommend training both no-GNN and GNN versions of NodePiece on WikiKG using several GPUs, for example\n```bash\npython -m torch.distributed.launch --nproc_per_node=2 script/run.py -c config/lp_pretraining/wikikg_nodepiece_nognn.yaml --gpus [0,1] --vocab /path/to/pickle/vocab.pkl\n```\n\n**Step 2: CQD Inference**\n\nUse the pre-trained link predictor to run CQD inference over EPFO queries (negation is not supported in this version of CQD).\n\nExample of running CQD on the pre-trained 200d NodePiece w/ GNN model over the 175% dataset\n* Note that we need to specify a 2x smaller embedding dimension of the training model as by default we train a\nComplEx model with two parts - real and complex;\n* Use the full path to the **embeddings** of the pre-trained models, they are named smth like `/path/epoch_1000_ents`\nand `/path/epoch_1000_rels`, so just use the common prefix 
`/path/epoch_1000`.\n\n```bash\npython cqd/main.py --cuda --do_test --data_path ./data/175 -d 100 -cpu 6 --log_steps 10000 --test_log_steps 10000 --geo cqd --print_on_screen --cqd-k 32 --cqd-sigmoid --tasks \"1p.2p.3p.2i.3i.ip.pi.2u.up\" --inductive --checkpoint_path /path/epoch_1000 --skip_tr\n```\n\nTo evaluate training queries on the bigger test graphs, use the argument `--eval_train`\n\n```bash\npython cqd/main.py --cuda --do_test --data_path ./data/175 -d 100 -cpu 6 --log_steps 10000 --test_log_steps 10000 --geo cqd --print_on_screen --cqd-k 32 --cqd-sigmoid --tasks \"1p.2p.3p.2i.3i.ip.pi.2u.up\" --inductive --checkpoint_path /path/epoch_1000 --eval_train\n```\n\n### GNN-QE ###\n\nTo train GNN-QE and evaluate on the valid/test queries and desired dataset ratio, use the `gnnqe_main.yaml` config.\nExample on the 175% dataset:\n\n```bash\npython script/run.py -c config/complex_query/gnnqe_main.yaml --ratio 175 --gpus [0]\n```\n\nAlternatively, you may specify `--gpus null` to run GNN-QE on a CPU.\n\nThe hyperparameters are designed for 32GB GPUs, but you may adjust the batch size in the config file\nto fit a smaller GPU memory.\n\nTo run GNN-QE with multiple GPUs or multiple machines, use the following commands\n\n```bash\npython -m torch.distributed.launch --nproc_per_node=2 script/run.py -c config/complex_query/gnnqe_main.yaml --gpus [0,1]\n```\n\n```bash\npython -m torch.distributed.launch --nnodes=4 --nproc_per_node=4 script/run.py -c config/complex_query/gnnqe_main.yaml --gpus [0,1,2,3,0,1,2,3,0,1,2,3,0,1,2,3]\n```\n\nTo evaluate training queries on the bigger test graphs, use the config `gnnqe_eval_train.yaml` and specify the\ncheckpoint `--checkpoint` of the trained model. 
The best performing checkpoint is written in the log files after training the `main` config.\nFor example, if the best performing 175% model is `model_epoch_1.pth` then the path will be:\n```bash\npython script/run.py -c config/complex_query/gnnqe_eval_train.yaml --ratio 175 --gpus [0] --checkpoint /path/to/model/model_epoch_1.pth\n```\n\n### Heuristic Baseline ###\n\nFinally, we provide configs for the inference-only rule-based heuristic baseline\nthat only considers possible tails of the last relation projection step of any query pattern.\nThe two configs are `config/complex_query/heuristic_main.yaml` and `config/complex_query/heuristic_eval_train.yaml`.\n\nTo run the baseline on test queries (for example, on the 175% dataset):\n\n```bash\npython script/run.py -c config/complex_query/heuristic_main.yaml --ratio 175 --gpus [0]\n```\n\nTo run the baseline on train queries over bigger test graphs:\n```bash\npython script/run.py -c config/complex_query/heuristic_eval_train.yaml --ratio 175 --gpus [0]\n```\n\n## Citation ##\n\nIf you find this project useful in your research, please cite the following paper\n\n```bibtex\n@inproceedings{galkin2022inductive,\n  title={Inductive Logical Query Answering in Knowledge Graphs},\n  author={Mikhail Galkin and Zhaocheng Zhu and Hongyu Ren and Jian Tang},\n  booktitle={Advances in Neural Information Processing Systems},\n  year={2022},\n  url={https://openreview.net/forum?id=-vXEN5rIABY}\n}\n```","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Finductiveqe","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepgraphlearning%2Finductiveqe","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepgraphlearning%2Finductiveqe/lists"}
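As a side note on the CQD `-d 100` flag used with a "200d" NodePiece model: ComplEx stores a real half and an imaginary half concatenated in one vector, so CQD takes half the stored dimension as its rank. The sketch below is illustrative only (plain numpy, not the repo's actual code) and shows how such a concatenated vector is scored:

```python
import numpy as np

def complex_score(h, r, t):
    """ComplEx score Re(<h, r, conj(t)>) for embeddings stored as
    [real_half | imaginary_half] concatenated vectors."""
    d = h.shape[-1] // 2  # rank = half of the stored dimension
    h_re, h_im = h[:d], h[d:]
    r_re, r_im = r[:d], r[d:]
    t_re, t_im = t[:d], t[d:]
    return float(np.sum(h_re * r_re * t_re
                        + h_im * r_re * t_im
                        + h_re * r_im * t_im
                        - h_im * r_im * t_re))

rng = np.random.default_rng(0)
h, r, t = (rng.normal(size=200) for _ in range(3))  # "200d" model -> rank 100

# Sanity check against the direct complex-number formulation
h_c, r_c, t_c = (v[:100] + 1j * v[100:] for v in (h, r, t))
assert np.isclose(complex_score(h, r, t),
                  np.real(np.sum(h_c * r_c * np.conj(t_c))))
```

This is why a model trained with `input_dim` 200 is evaluated with `-d 100`.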
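The Edge-type Heuristic described above (considering only possible tails of the query's last relation projection) can be pictured in a few lines. This is a simplified stand-in, not the interface of `model.HeuristicBaseline` in the repo:

```python
from collections import defaultdict

def build_tails_by_relation(edges):
    """Index the inference graph: relation id -> set of observed tail nodes."""
    tails = defaultdict(set)
    for head, rel, tail in edges:
        tails[rel].add(tail)
    return tails

def heuristic_answers(tails_by_rel, last_relation):
    """Candidate answers = every node ever seen as a tail of the query's
    last relation projection, ignoring the rest of the query structure."""
    return tails_by_rel.get(last_relation, set())

edges = [(0, 5, 1), (2, 5, 3), (4, 7, 1)]
idx = build_tails_by_relation(edges)
# Any query whose final projection uses relation 5 -> candidates {1, 3}
assert heuristic_answers(idx, 5) == {1, 3}
```

Despite ignoring everything but the final projection, such a baseline sets a useful floor for the learned models.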
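The easy/hard answer split in the Data section implies a filtered ranking protocol: hard answers are ranked while other correct answers are masked out. A rough sketch of that idea, assumed from the file descriptions rather than extracted from the repo's evaluation code (`filtered_mrr` is a hypothetical helper):

```python
def filtered_mrr(scores, easy_answers, hard_answers):
    """Mean reciprocal rank over hard answers; other correct answers
    (easy or hard) are filtered out of each ranking."""
    all_answers = easy_answers | hard_answers
    rrs = []
    for ans in hard_answers:
        # Count entities scored above this answer, excluding other answers
        rank = 1 + sum(1 for ent, s in scores.items()
                       if s > scores[ans] and ent not in all_answers)
        rrs.append(1.0 / rank)
    return sum(rrs) / len(rrs)

scores = {0: 0.9, 1: 0.8, 2: 0.7, 3: 0.1}
# Entity 0 is an easy (traversal-only) answer; entity 2 is the hard answer.
# After filtering 0, only entity 1 outranks 2 -> rank 2 -> MRR 0.5
assert filtered_mrr(scores, easy_answers={0}, hard_answers={2}) == 0.5
```

The same masking idea applies to the `train_answers_valid/test.pkl` files, where newly reachable answers on the bigger graphs are scored for faithfulness.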