{"id":28532305,"url":"https://github.com/predict-idlab/mindwalc","last_synced_at":"2025-07-07T14:30:53.166Z","repository":{"id":40975413,"uuid":"191322267","full_name":"predict-idlab/MINDWALC","owner":"predict-idlab","description":"Code \u0026 experiments for MINDWALC: Mining Interpretable, Discriminative Walks for Classification of Nodes in a Graph","archived":false,"fork":false,"pushed_at":"2024-07-04T10:37:57.000Z","size":531,"stargazers_count":13,"open_issues_count":4,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-06-09T15:53:02.312Z","etag":null,"topics":["classification","data-mining","decision-tree","interpretability","knowledge-graph"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/predict-idlab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-11T07:56:27.000Z","updated_at":"2024-11-06T14:42:07.000Z","dependencies_parsed_at":"2023-01-18T10:15:28.075Z","dependency_job_id":"54dc427a-94cb-4d5b-88e4-b8935b452bae","html_url":"https://github.com/predict-idlab/MINDWALC","commit_stats":null,"previous_names":["predict-idlab/mindwalc"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/predict-idlab/MINDWALC","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FMINDWALC","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FMINDWALC/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FMINDWALC/
releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FMINDWALC/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/predict-idlab","download_url":"https://codeload.github.com/predict-idlab/MINDWALC/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/predict-idlab%2FMINDWALC/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264091882,"owners_count":23556200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["classification","data-mining","decision-tree","interpretability","knowledge-graph"],"created_at":"2025-06-09T15:38:05.637Z","updated_at":"2025-07-07T14:30:53.160Z","avatar_url":"https://github.com/predict-idlab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MINDWALC: Mining Interpretable, Discriminative Walks for Classification of Nodes in a Graph\n\nMINDWALC is an algorithm that efficiently mines for a specific type of walks that maximize information gain. The walks have the following form: a walk of length `l` starts with a root, followed by `l - 2` wildcards (`*`) and then a named entity. An example could be: `root -\u003e * -\u003e * -\u003e * -\u003e Ghent` which would match the walk `Gilles Vandewiele --\u003e studiedAt --\u003e Ghent University --\u003e locatedIn --\u003e Ghent`. For this, root is replaced by the instance which we are classifying. 
MINDWALC can be combined with three different classification approaches.\n\n## Approach 1: building a decision tree with walks\n\nWe can recursively mine walks to build a decision tree. An example is displayed below. With this decision tree, we try to classify researchers into one of four research groups ([benchmark dataset AIFB](https://en.wikiversity.org/wiki/AIFB_DataSet)). In the root node, we find the walk `root -\u003e * -\u003e * -\u003e * -\u003e * -\u003e * -\u003e viewProjektOWL/id68instance`. When this walk can be found in the neighborhood of an instance, that instance can no longer belong to research affiliation `id4instance`, as this label does not occur in the right subtree. Moreover, this walk demonstrates the added value of enforcing a fixed depth through wildcards. We could reach an entity of type Project in only two hops from an instance in AIFB (e.g. `root -\u003e * -\u003e viewProjektOWL/id68instance`), but this yields far less information gain than requiring six hops. Only two people, both from affiliation `id3instance`, are directly involved in the Project `id68instance`; in other words, the two-hop path can only be matched for them. On the other hand, these two people have written a large number of papers together with the other researchers in their affiliation. 
As such, a walk that first hops from a certain person (the root) to one of his or her papers, and from there to one of the two people mentioned earlier through an `author` predicate, can be found for 45 people from affiliation `id3instance`, 3 people from `id2instance` and 2 people from `id1instance`.\n\n![A decision tree that can be used to classify researchers, represented as a Knowledge Graph, into one of four research groups.](images/tree_example.png) \n\n## Approach 2: building a forest\n\nAlternatively, a forest of trees can be built, where each tree is built from a subset of the samples and vertices. This often results in better predictive performance but comes at the cost of a higher runtime and lower interpretability.\n\n## Approach 3: creating binary feature vectors\n\nInstead of mining walks recursively, we can also perform only a single pass over the data and keep track of the K walks that maximize information gain. These walks can then, in turn, be used to create binary feature vectors for training and testing entities. These feature vectors can then be fed to any classification algorithm.\n\n## Software Dependencies\n### graphviz (optional)\nGraphviz is used to visualize trained decision trees and export them as PDF files. \nIf you need this feature, please install the Graphviz system package for your OS:\n\n- Ubuntu/Debian: `sudo apt-get install graphviz`\n- MacOS: `brew install graphviz`\n- Windows: https://graphviz.org/download/\n\nThen install the Python package `graphviz` with pip:\n```bash\npip install graphviz\n```\n\n## How can I use MINDWALC for my own dataset?\n\nDead simple! Our algorithm requires the following input:\n* A Knowledge Graph object -- we implemented our own Knowledge Graph object. 
We provide a function `Graph.rdflib_to_graph` to convert a graph from [rdflib](https://github.com/RDFLib/rdflib).\n* A list of train URIs -- our algorithm will extract features from the neighborhoods around these URIs (nodes in the KG)\n* A list of corresponding training labels -- should be in the same order as the train URIs\n\nFor the AIFB dataset, this becomes:\n```python3\nimport rdflib\nimport pandas as pd\nfrom sklearn.metrics import accuracy_score\n\nfrom tree_builder import MINDWALCTree, MINDWALCForest, MINDWALCTransform\nfrom datastructures import Graph\n\ng = rdflib.Graph()\ng.parse('data/AIFB/aifb.n3', format='n3')\n\ntrain_data = pd.read_csv('data/AIFB/AIFB_train.tsv', sep='\\t')\ntrain_entities = [rdflib.URIRef(x) for x in train_data['person']]\ntrain_labels = train_data['label_affiliation']\n\ntest_data = pd.read_csv('data/AIFB/AIFB_test.tsv', sep='\\t')\ntest_entities = [rdflib.URIRef(x) for x in test_data['person']]\ntest_labels = test_data['label_affiliation']\n\n# Predicates that would leak the target label; list their rdflib.URIRef\n# values here so the corresponding edges are excluded from the graph.\nlabel_predicates = []\n\nkg = Graph.rdflib_to_graph(g, label_predicates=label_predicates)\n\nclf = MINDWALCTree()\n#clf = MINDWALCForest()\n#clf = MINDWALCTransform()\n\nclf.fit(kg, train_entities, train_labels)\n\npreds = clf.predict(kg, test_entities)\nprint(accuracy_score(test_labels, preds))\n```\n\nWe also provide an example Jupyter notebook (`kgptree/Example (AIFB).ipynb`).\n\n## Reproducing paper results\n\nIn order to reproduce the results, you will first have to obtain the different datasets. The knowledge graph datasets can be obtained from [here](http://data.dws.informatik.uni-mannheim.de/rmlod/LOD_ML_Datasets/). Afterwards, the script `mindwalc/experiments/benchmark_knowledge_graphs.py` can be run, which will generate pickle files in the `output/` directory for each run. We already populated the `output/` directory with our own measurements. 
Afterwards, the pickle files can be processed by running `mindwalc/experiments/parse_results.py`.\n\n## How to cite\n\nIf you use `MINDWALC` for research purposes, we would appreciate citations:\n```\n@inproceedings{vandewiele2019inducing,\n  title={Inducing a decision tree with discriminative paths to classify entities in a knowledge graph},\n  author={Vandewiele, Gilles and Steenwinckel, Bram and Ongenae, Femke and De Turck, Filip},\n  booktitle={SEPDA2019, the 4th International Workshop on Semantics-Powered Data Mining and Analytics},\n  pages={1--6},\n  year={2019}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpredict-idlab%2Fmindwalc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpredict-idlab%2Fmindwalc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpredict-idlab%2Fmindwalc/lists"}