{"id":17498334,"url":"https://github.com/simonepri/datasets-knowledge-embedding","last_synced_at":"2025-04-14T19:33:16.737Z","repository":{"id":48242593,"uuid":"192817331","full_name":"simonepri/datasets-knowledge-embedding","owner":"simonepri","description":"📝 A collection of common datasets used in knowledge embedding","archived":false,"fork":false,"pushed_at":"2020-03-22T14:29:52.000Z","size":77287,"stargazers_count":146,"open_issues_count":1,"forks_count":11,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-28T07:51:10.953Z","etag":null,"topics":["datasets","fb15k","fb15k-237","knowledge-embedding","wn18","wn18rr","yago3-10"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/simonepri.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"license","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-06-19T23:30:47.000Z","updated_at":"2024-12-20T09:13:54.000Z","dependencies_parsed_at":"2022-08-24T09:41:49.860Z","dependency_job_id":null,"html_url":"https://github.com/simonepri/datasets-knowledge-embedding","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fdatasets-knowledge-embedding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fdatasets-knowledge-embedding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fdatasets-knowledge-embedding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/simonepri%2Fdatasets-knowledge-embedding/manifests","owner_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/owners/simonepri","download_url":"https://codeload.github.com/simonepri/datasets-knowledge-embedding/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248945757,"owners_count":21187380,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["datasets","fb15k","fb15k-237","knowledge-embedding","wn18","wn18rr","yago3-10"],"created_at":"2024-10-19T16:28:22.954Z","updated_at":"2025-04-14T19:33:16.695Z","avatar_url":"https://github.com/simonepri.png","language":"Shell","readme":"\u003ch1 align=\"center\"\u003e\n  \u003cb\u003edatasets-knowledge-embedding\u003c/b\u003e\n\u003c/h1\u003e\n\u003cp align=\"center\"\u003e\n  \u003c!-- License --\u003e\n  \u003ca href=\"https://github.com/simonepri/datasets-knowledge-embedding/tree/master/license\"\u003e\n    \u003cimg src=\"https://img.shields.io/github/license/simonepri/datasets-knowledge-embedding.svg\" alt=\"Project license\" /\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  📝 A collection of common datasets used in knowledge embedding\n\u003c/p\u003e\n\n\n## Synopsis\n\nThis project collects different datasets used in various knowledge embedding related papers.\nIt also standardizes the format of these datasets, making it easier to use them in the evaluation of new works.\n\nThe [datasets](#datasets) can be downloaded from the [release page][release].  
\nFor licensing information, please refer to the original dataset license file.\n\nIf you are using this collection of datasets, please consider starring ⭐️ the project to support it.\n\n\n## Datasets format\n\nEvery subfolder in this repo is a single dataset.  \nEvery folder contains the following `18` files.\n\n| File name | Description |\n|-----------|-------------|\n| `edges_as_text_{train,valid,test}.tsv` | These three files contain the three splits of the dataset where entities and relations are in a textual form (e.g. `italy\tlocatedin\teurope`).   |\n| `edges_as_text_all.tsv` | The concatenation of `edges_as_text_train.tsv`, `edges_as_text_valid.tsv`, and `edges_as_text_test.tsv`. |\n| `edges_as_id_{train,valid,test}.tsv` | These three files contain the three splits of the dataset where entities and relations are mapped to a numerical ID (e.g. `38\t1\t2`). Entities and relations that are more frequent are mapped to lower integers (e.g. the entity/relation with ID `0` is the most frequent entity/relation in the dataset).   |\n| `edges_as_id_all.tsv` | The concatenation of `edges_as_id_train.tsv`, `edges_as_id_valid.tsv`, and `edges_as_id_test.tsv`. |\n| `map_entity_id_to_text.tsv` | This file contains the mapping from numerical IDs used for entities in `edges_as_id_*.tsv` to the textual representation used in `edges_as_text_*.tsv` (e.g. `38\titaly, 2\teurope`). |\n| `map_relation_id_to_text.tsv` | This file contains the mapping from numerical IDs used for relations in `edges_as_id_*.tsv` to the textual representation used in `edges_as_text_*.tsv` (e.g. `1\tlocatedin`). |\n| `frequency_entities_{all,train,valid,test}.tsv` | These files contain the frequency of each entity in the various splits of the dataset. |\n| `frequency_relations_{all,train,valid,test}.tsv` | These files contain the frequency of each relation in the various splits of the dataset. 
|\n\n\n## Add a new dataset\n\nIf you want to add a new dataset to this collection, first create three files called `train.tsv`, `valid.tsv`, and `test.tsv`, containing the edges for the train, validation, and test splits respectively.  \nThe files must contain tab-separated triples of the form `(head entity, relation, tail entity)`.\n\nOnce you have done this, process the three files with the following bash command.\n\n```bash\nbash build.sh train.tsv valid.tsv test.tsv .\n```\n\nThe script uses the [edgelist-mapper][github:simonepri/edgelist-mapper] tool under the hood.\n\n\n## Datasets\n\nThe datasets are distributed in two formats, namely text-based and id-based (see the [dataset format section](#datasets-format) for the difference).\n\n### COUNTRIES-S1\nThis dataset was introduced in [On Approximate Reasoning Capabilities of Low-Rank Vector Spaces](https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10257).  \nThe link to the original dataset as released by the authors is unknown but a copy has been taken from [here](https://github.com/TimDettmers/ConvE/tree/master/countries).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 271 | 2 | 1159 | 1111 | 24 | 24 |\n\n[![Download COUNTRIES-S1.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/COUNTRIES-S1.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S1.tgz) [![Download COUNTRIES-S1-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/COUNTRIES-S1-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S1-ID.tgz)\n\n\n### COUNTRIES-S2\nThis dataset was introduced in [On Approximate Reasoning Capabilities of Low-Rank Vector 
Spaces](https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10257).  \nThe link to the original dataset as released by the authors is unknown but a copy has been taken from [here](https://github.com/TimDettmers/ConvE/tree/master/countries).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 271 | 2 | 1111 | 1063 | 24 | 24 |\n\n[![Download COUNTRIES-S2.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/COUNTRIES-S2.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S2.tgz) [![Download COUNTRIES-S2-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/COUNTRIES-S2-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S2-ID.tgz)\n\n### COUNTRIES-S3\nThis dataset was introduced in [On Approximate Reasoning Capabilities of Low-Rank Vector Spaces](https://www.aaai.org/ocs/index.php/SSS/SSS15/paper/view/10257).  
\nThe link to the original dataset as released by the authors is unknown but a copy has been taken from [here](https://github.com/TimDettmers/ConvE/tree/master/countries).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 271 | 2 | 1033 | 985 | 24 | 24 |\n\n[![Download COUNTRIES-S3.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/COUNTRIES-S3.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S3.tgz) [![Download COUNTRIES-S3-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/COUNTRIES-S3-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/COUNTRIES-S3-ID.tgz)\n\n### FB15K\nThis dataset was introduced in [Translating Embeddings for Modeling Multi-relational Data](https://dl.acm.org/doi/10.5555/2999792.2999923).  \nThe original dataset as released by the authors is available [here](https://everest.hds.utc.fr/doku.php?id=en:transe).\n\n\u003e Entities in this dataset are represented through Freebase IDs (e.g. `/m/07l450, /film/film/genre, /m/082gq`). Since they are hard to read, we are considering mapping them to Wikipedia pages (e.g. 
`The_Last_King_of_scotland_(film), /film/film/genre, War_film`).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 14951 | 1345 | 592213 | 483142 | 50000 | 59071 |\n\n[![Download FB15K.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/FB15K.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K.tgz) [![Download FB15K-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/FB15K-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K-ID.tgz)\n\n### FB15K-237\nThis dataset was introduced in [Observed versus latent features for knowledge base and text inference](https://www.aclweb.org/anthology/W15-4007/).  \nThe original dataset as released by the authors is available [here](https://www.microsoft.com/en-us/download/details.aspx?id=52312).\n\n\u003e Entities in this dataset are represented through Freebase IDs (e.g. `/m/07l450, /film/film/genre, /m/082gq`). Since they are hard to read, we are considering mapping them to Wikipedia pages (e.g. 
`The_Last_King_of_scotland_(film), /film/film/genre, War_film`).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 14541 | 237 | 310116 | 272115 | 17535 | 20466 |\n\n[![Download FB15K-237.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/FB15K-237.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K-237.tgz) [![Download FB15K-237-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/FB15K-237-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/FB15K-237-ID.tgz)\n\n### KINSHIP\nThis dataset was introduced in [Learning systems of concepts with an infinite relational model](https://dl.acm.org/doi/10.5555/1597538.1597600).  \nThe original dataset as released by the authors is available [here](http://www.charleskemp.com/code/irm.html).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 104 | 25 | 10686 | 8544 | 1068 | 1074 |\n\n[![Download KINSHIP.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/KINSHIP.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/KINSHIP.tgz) [![Download KINSHIP-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/KINSHIP-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/KINSHIP-ID.tgz)\n\n### NATIONS\nThis dataset was introduced in [Learning systems of concepts with an infinite relational model](https://dl.acm.org/doi/10.5555/1597538.1597600).  
\nThe original dataset as released by the authors is available [here](http://www.charleskemp.com/code/irm.html).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 14 | 55 | 1992 | 1592 | 199 | 201 |\n\n[![Download NATIONS.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/NATIONS.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/NATIONS.tgz) [![Download NATIONS-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/NATIONS-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/NATIONS-ID.tgz)\n\n### UMLS\nThis dataset was introduced in [Learning systems of concepts with an infinite relational model](https://dl.acm.org/doi/10.5555/1597538.1597600).  \nThe original dataset as released by the authors is available [here](http://www.charleskemp.com/code/irm.html).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 135 | 46 | 6529 | 5216 | 652 | 661 |\n\n[![Download UMLS.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/UMLS.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/UMLS.tgz) [![Download UMLS-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/UMLS-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/UMLS-ID.tgz)\n\n### WN18\nThis dataset was introduced in [Translating Embeddings for Modeling Multi-relational Data](https://dl.acm.org/doi/10.5555/2999792.2999923).  
\nThe original dataset as released by the authors is available [here](https://everest.hds.utc.fr/doku.php?id=en:transe).\n\n\u003e In the original dataset, the entities are represented through WordNet offset IDs (e.g. `01257145 derivationally_related_form 07488875`), but the version distributed here has the offsets mapped to WordNet synsets that can be read by the `nltk` library (e.g. `sensual.s.02\tderivationally_related_form\tsensuality.n.01`).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 41105 | 18 | 151442 | 141442 | 5000 | 5000 |\n\n[![Download WN18.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/WN18.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18.tgz) [![Download WN18-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/WN18-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18-ID.tgz)\n\n### WN18RR\nThis dataset was introduced in [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476).  \nThe original dataset as released by the authors is available [here](https://github.com/TimDettmers/ConvE).\n\n\u003e In the original dataset, the entities are represented through WordNet offset IDs (e.g. `01257145 derivationally_related_form 07488875`), but the version distributed here has the offsets mapped to WordNet synsets that can be read by the `nltk` library (e.g. 
`sensual.s.02\tderivationally_related_form\tsensuality.n.01`).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 41105 | 11 | 93003 | 86835 | 3034 | 3134 |\n\n[![Download WN18RR.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/WN18RR.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18RR.tgz) [![Download WN18RR-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/WN18RR-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/WN18RR-ID.tgz)\n\n### YAGO3-10\nThis dataset was introduced in [Convolutional 2D Knowledge Graph Embeddings](https://arxiv.org/abs/1707.01476).  \nThe original dataset as released by the authors is available [here](https://github.com/TimDettmers/ConvE).\n\n| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |\n|----------|----------------|-------|-------------|------------------|------------|\n| 123182 | 37 | 1089040 | 1079040 | 5000 | 5000 |\n\n[![Download YAGO3-10.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/YAGO3-10.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/YAGO3-10.tgz) [![Download YAGO3-10-ID.tgz](https://img.shields.io/github/downloads/simonepri/datasets-knowledge-embedding/latest/YAGO3-10-ID.tgz\n)](https://github.com/simonepri/datasets-knowledge-embedding/releases/latest/download/YAGO3-10-ID.tgz)\n\n\n## Authors\n\n- **Simone Primarosa** - [simonepri][github:simonepri]\n\nSee also the list of [contributors][contributors] who participated in this project.\n\n\n## License\n\nThis project is licensed under the MIT License - see the [license][license] file for details.\n\n\u003c!-- Links --\u003e\n[license]: 
https://github.com/simonepri/datasets-knowledge-embedding/tree/master/license\n[contributors]: https://github.com/simonepri/datasets-knowledge-embedding/contributors\n[release]: https://github.com/simonepri/datasets-knowledge-embedding/releases/latest\n\n[github:simonepri]: https://github.com/simonepri\n\n[github:simonepri/edgelist-mapper]: https://github.com/simonepri/edgelist-mapper\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonepri%2Fdatasets-knowledge-embedding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsimonepri%2Fdatasets-knowledge-embedding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsimonepri%2Fdatasets-knowledge-embedding/lists"}