🗃️ Small library to simplify collecting and loading of entity alignment benchmark datasets
- Host: GitHub
- URL: https://github.com/dobraczka/sylloge
- Owner: dobraczka
- License: mit
- Created: 2022-08-15T11:20:22.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-14T18:22:23.000Z (8 months ago)
- Last Synced: 2024-09-27T18:07:45.321Z (4 months ago)
- Topics: datasets, entity-alignment, entity-resolution, knowledge-graph
- Language: Python
- Homepage: https://sylloge.readthedocs.io
- Size: 281 KB
- Stars: 6
- Watchers: 3
- Forks: 1
- Open Issues: 7
- Metadata Files:
  - Readme: README.md
  - Changelog: CHANGELOG.md
  - License: LICENSE
sylloge
=======
This simple library aims to collect entity-alignment benchmark datasets and make them easily available.
Usage
=====
Load benchmark datasets:
```
>>> from sylloge import OpenEA
>>> ds = OpenEA()
>>> ds
OpenEA(backend=pandas, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
>>> ds.rel_triples_right.head()
head relation tail
0 http://www.wikidata.org/entity/Q6176218 http://www.wikidata.org/entity/P27 http://www.wikidata.org/entity/Q145
1 http://www.wikidata.org/entity/Q212675 http://www.wikidata.org/entity/P161 http://www.wikidata.org/entity/Q446064
2 http://www.wikidata.org/entity/Q13512243 http://www.wikidata.org/entity/P840 http://www.wikidata.org/entity/Q84
3 http://www.wikidata.org/entity/Q2268591 http://www.wikidata.org/entity/P31 http://www.wikidata.org/entity/Q11424
4 http://www.wikidata.org/entity/Q11300470 http://www.wikidata.org/entity/P178 http://www.wikidata.org/entity/Q170420
>>> ds.attr_triples_left.head()
head relation tail
0 http://dbpedia.org/resource/E534644 http://dbpedia.org/ontology/imdbId 0044475
1 http://dbpedia.org/resource/E340590 http://dbpedia.org/ontology/runtime 6480.0^^
2 http://dbpedia.org/resource/E840454 http://dbpedia.org/ontology/activeYearsStartYear 1948^^
3 http://dbpedia.org/resource/E971710 http://purl.org/dc/elements/1.1/description English singer-songwriter
4 http://dbpedia.org/resource/E022831 http://dbpedia.org/ontology/militaryCommand Commandant of the Marine Corps
```
The gold standard entity links are stored as an [eche](https://github.com/dobraczka/eche) `ClusterHelper`, which provides convenient functionality:
```
>>> ds.ent_links.clusters[0]
{'http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186'}
>>> ('http://www.wikidata.org/entity/Q21197', 'http://dbpedia.org/resource/E123186') in ds.ent_links
True
>>> ('http://dbpedia.org/resource/E123186', 'http://www.wikidata.org/entity/Q21197') in ds.ent_links
True
>>> ds.ent_links.links('http://www.wikidata.org/entity/Q21197')
'http://dbpedia.org/resource/E123186'
>>> ds.ent_links.all_pairs()
```
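The repr above shows that `OpenEA` is parameterized by graph pair, size, and version. A hedged sketch of selecting a different variant (the parameter names come from the repr; the concrete argument values follow the OpenEA benchmark's naming and are assumptions here):
```
>>> # values "D_Y", "100K", "V2" are assumed from the OpenEA benchmark naming
>>> ds = OpenEA(graph_pair="D_Y", size="100K", version="V2")
```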
Most datasets are binary matching tasks, but some, such as the `MovieGraphBenchmark`, provide a multi-source setting:
```
>>> from sylloge import MovieGraphBenchmark
>>> ds = MovieGraphBenchmark(graph_pair="multi")
>>> ds
MovieGraphBenchmark(backend=pandas, graph_pair=multi, rel_triples_0=17507, attr_triples_0=20800, rel_triples_1=27903, attr_triples_1=23761, rel_triples_2=15455, attr_triples_2=20902, ent_links=3598, folds=5)
>>> ds.dataset_names
('imdb', 'tmdb', 'tvdb')
```
Here, the [`PrefixedClusterHelper`](https://eche.readthedocs.io/en/latest/reference/eche/#eche.PrefixedClusterHelper) provides various convenience functions. Get pairs between a specific dataset pair:
```
>>> list(ds.ent_links.pairs_in_ds_tuple(("imdb", "tmdb")))[0]
('https://www.scads.de/movieBenchmark/resource/IMDB/nm0641721', 'https://www.scads.de/movieBenchmark/resource/TMDB/person1236714')
```
Get the number of intra-dataset pairs:
```
>>> ds.ent_links.number_of_intra_links
(1, 64, 22663)
```
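The returned tuple can be paired with the dataset names, assuming both follow the same order (a small illustrative sketch, not part of the library API):
```
>>> # map each dataset name to its intra-dataset link count (matching order assumed)
>>> dict(zip(ds.dataset_names, ds.ent_links.number_of_intra_links))
{'imdb': 1, 'tmdb': 64, 'tvdb': 22663}
```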
For all datasets you can get a canonical name for a dataset instance, e.g. to create folders for storing experiment results:
```
>>> ds.canonical_name
'openea_d_w_15k_v1'
```
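For example, a per-dataset results directory can be created from the canonical name with standard library tools (the directory layout is purely illustrative):
```
>>> from pathlib import Path
>>> results_dir = Path("results") / ds.canonical_name  # e.g. results/openea_d_w_15k_v1
>>> results_dir.mkdir(parents=True, exist_ok=True)
```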
You can use [dask](https://www.dask.org/) as a backend for larger datasets:
```
>>> ds = OpenEA(backend="dask")
>>> ds
OpenEA(backend=dask, graph_pair=D_W, size=15K, version=V1, rel_triples_left=38265, rel_triples_right=42746, attr_triples_left=52134, attr_triples_right=138246, ent_links=15000, folds=5)
```
This replaces the pandas DataFrames with dask DataFrames.
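A minimal sketch of the difference in behaviour; `head` and `compute` are standard dask DataFrame methods, while the attribute names come from the repr above:
```
>>> ds = OpenEA(backend="dask")
>>> ds.rel_triples_left.head()     # evaluates lazily, reading only what is needed
>>> ds.rel_triples_left.compute()  # materializes the full table as a pandas DataFrame
```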
Datasets can be written and read as parquet via `to_parquet` and `read_parquet`. After the initial read, datasets are cached in this format. The `cache_path` can be set explicitly, and caching can be disabled via `use_cache=False` when initializing a dataset.
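A hedged sketch of the round trip; the names `to_parquet`, `read_parquet`, `cache_path`, and `use_cache` come from the text above, but the exact call signatures are assumptions:
```
>>> ds = OpenEA(cache_path="/tmp/sylloge_cache")     # cache under a custom path
>>> ds_fresh = OpenEA(use_cache=False)               # always re-read the source files
>>> ds.to_parquet("my_openea_copy")                  # write as parquet (path argument assumed)
>>> ds_back = OpenEA.read_parquet("my_openea_copy")  # read back (classmethod usage assumed)
```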
Some datasets come with pre-determined splits:
```bash
tree ~/.data/sylloge/open_ea/cached/D_W_15K_V1
├── attr_triples_left_parquet
├── attr_triples_right_parquet
├── dataset_names.txt
├── ent_links_parquet
├── folds
│ ├── 1
│ │ ├── test_parquet
│ │ ├── train_parquet
│ │ └── val_parquet
│ ├── 2
│ │ ├── test_parquet
│ │ ├── train_parquet
│ │ └── val_parquet
│ ├── 3
│ │ ├── test_parquet
│ │ ├── train_parquet
│ │ └── val_parquet
│ ├── 4
│ │ ├── test_parquet
│ │ ├── train_parquet
│ │ └── val_parquet
│ └── 5
│ ├── test_parquet
│ ├── train_parquet
│ └── val_parquet
├── rel_triples_left_parquet
└── rel_triples_right_parquet
```
Some don't:
```bash
tree ~/.data/sylloge/oaei/cached/starwars_swg
├── attr_triples_left_parquet
│ └── part.0.parquet
├── attr_triples_right_parquet
│ └── part.0.parquet
├── dataset_names.txt
├── ent_links_parquet
│ └── part.0.parquet
├── rel_triples_left_parquet
│ └── part.0.parquet
└── rel_triples_right_parquet
└── part.0.parquet
```
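Where folds exist, they can be accessed on the dataset instance. A minimal sketch, assuming each entry of `ds.folds` exposes `train`/`test`/`val` link frames mirroring the cache layout above (the attribute names are an assumption):
```
>>> ds = OpenEA()
>>> fold = ds.folds[0]  # first of the five folds shown above
>>> fold.train.head()   # training portion of the entity links (attribute name assumed)
```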
Installation
============
```bash
pip install sylloge
```

Datasets
========
| Dataset family name | Year | # of Datasets | Sources | References |
|:--------------------|:----:|:-------------:|:-------:|:----------|
| [OpenEA](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.OpenEA) | 2020 | 16 | DBpedia, Yago, Wikidata | [Paper](http://www.vldb.org/pvldb/vol13/p2326-sun.pdf), [Repo](https://github.com/nju-websoft/OpenEA#dataset-overview) |
| [MED-BBK](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.MED_BBK) | 2020 | 1 | Baidu Baike | [Paper](https://aclanthology.org/2020.coling-industry.17.pdf), [Repo](https://github.com/ZihengZZH/industry-eval-EA/tree/main#benchmark) |
| [MovieGraphBenchmark](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.MovieGraphBenchmark) | 2022 | 3 | IMDB, TMDB, TheTVDB | [Paper](http://ceur-ws.org/Vol-2873/paper8.pdf), [Repo](https://github.com/ScaDS/MovieGraphBenchmark) |
| [OAEI](https://sylloge.readthedocs.io/en/latest/source/datasets.html#sylloge.OAEI) | 2022 | 5 | Fandom wikis | [Paper](https://ceur-ws.org/Vol-3324/oaei22_paper0.pdf), [Website](http://oaei.ontologymatching.org/2022/knowledgegraph/index.html) |

Broader statistics are provided in `dataset_statistics.csv`. You can also get a pandas DataFrame with statistics for specific datasets, e.g. to create tables for publications:
```
>>> from sylloge import MovieGraphBenchmark
>>> ds = MovieGraphBenchmark(graph_pair="multi")
>>> from sylloge.create_statistic import create_statistics_df
>>> stats_df = create_statistics_df([ds])
>>> stats_df.loc[("MovieGraphBenchmark", "moviegraphbenchmark_multi", "imdb")]
                                                               Entities  Relation Triples  Attribute Triples  ...  Clusters  Intra-dataset Matches  All Matches
Dataset family      Task Name                 Dataset Name                                                    ...
MovieGraphBenchmark moviegraphbenchmark_multi imdb                 5129             17507              20800  ...      3598                      1        31230

[1 rows x 9 columns]
```
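Since `stats_df` is a regular pandas DataFrame, it can be rendered directly for a publication, for example via pandas' built-in LaTeX export:
```
>>> print(stats_df.to_latex())  # statistics table as LaTeX source
```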