https://github.com/dobraczka/klinker

🧱 blocking methods for entity resolution
blocking data-integration deduplication entity-alignment entity-resolution link-discovery record-linkage

Host: GitHub
URL: https://github.com/dobraczka/klinker
Owner: dobraczka
License: mit
Created: 2023-02-12T14:00:38.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-09-12T12:18:37.000Z (8 months ago)
Last Synced: 2025-04-16T19:21:39.730Z (10 days ago)
Topics: blocking, data-integration, deduplication, entity-alignment, entity-resolution, link-discovery, record-linkage
Language: Python
Homepage: https://klinker.readthedocs.io
Size: 1.19 MB
Stars: 6
Watchers: 3
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

README

klinker

Installation
============

Clone the repo and change into the directory:

```bash
git clone https://github.com/dobraczka/klinker.git
cd klinker
```

For GPU usage, create a [micromamba](https://mamba.readthedocs.io/en/latest/micromamba-installation.html) environment:

```bash
micromamba env create -n klinker-conda --file=klinker-conda.yaml
```

Activate it and install the remaining dependencies:

```bash
micromamba activate klinker-conda
pip install -e .
```

Alternatively, if you don't intend to use a GPU, you can install it in a virtual environment:

```bash
python -m venv klinker-env
source klinker-env/bin/activate
# Quote the extras so shells such as zsh do not expand the brackets
pip install -e ".[all]"
```

or via [poetry](https://python-poetry.org/docs/):

```bash
poetry install
```

Usage
=====

Load a dataset:

```python
from sylloge import MovieGraphBenchmark
from klinker.data import KlinkerDataset

# Wrap the "tmdb-tvdb" pair of the Movie Graph Benchmark (loaded via sylloge)
ds = KlinkerDataset.from_sylloge(MovieGraphBenchmark(graph_pair="tmdb-tvdb"))
```

Create blocks and write to parquet:

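```python
from klinker.blockers import SimpleRelationalTokenBlocker

blocker = SimpleRelationalTokenBlocker()
# Assign entities of both sources to blocks, passing the relation data as well
blocks = blocker.assign(
    left=ds.left,
    right=ds.right,
    left_rel=ds.left_rel,
    right_rel=ds.right_rel,
)
# Persist the blocks so they can be evaluated later without recomputing
blocks.to_parquet("tmdb-tvdb-tokenblocked")
```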
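
To give an intuition for what a token blocker does, here is a minimal plain-Python sketch of the general token blocking idea: records that share a token end up in the same block. It is an illustration only and not klinker's implementation; the function `token_blocks` and the toy records are hypothetical, and the sketch ignores the relational information (`left_rel`, `right_rel`) passed to the blocker above.

```python
from collections import defaultdict


def token_blocks(left, right):
    """Group record ids from two sources by the tokens they share.

    `left` and `right` map record id -> text (hypothetical toy format).
    Returns token -> (left ids, right ids), keeping only tokens that
    occur in both sources, since other blocks yield no candidate pairs.
    """
    index = defaultdict(lambda: (set(), set()))
    for side, records in enumerate((left, right)):
        for rec_id, text in records.items():
            # Lowercase and split on whitespace as a naive tokenizer
            for token in set(text.lower().split()):
                index[token][side].add(rec_id)
    return {tok: ids for tok, ids in index.items() if ids[0] and ids[1]}


left = {"l1": "The Matrix 1999", "l2": "Blade Runner"}
right = {"r1": "Matrix Reloaded", "r2": "Runner Runner 2013"}
toy_blocks = token_blocks(left, right)
print(sorted(toy_blocks))
```

Only the tokens `matrix` and `runner` appear on both sides here, so only those two blocks survive.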

Read blocks from parquet and evaluate:

```python
from klinker import KlinkerBlockManager
from klinker.eval_metrics import Evaluation

# Load the blocks written in the previous step
kbm = KlinkerBlockManager.read_parquet("tmdb-tvdb-tokenblocked")
# Evaluate the blocks against the dataset
ev = Evaluation.from_dataset(blocks=kbm, dataset=ds)
```
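
As a rough guide to what a blocking evaluation measures, the sketch below computes three quality measures commonly used for blocking: recall (share of true matches that survive as candidate pairs), precision (share of candidate pairs that are true matches), and reduction ratio (share of the full cross product that is avoided). It is plain Python under stated assumptions: the function names, the toy blocks, and the gold pairs are hypothetical, and whether klinker's `Evaluation` reports exactly these values is an assumption, so consult the klinker documentation for its actual attributes.

```python
from itertools import product


def candidate_pairs(blocks):
    """Collect all (left id, right id) pairs that share at least one block."""
    pairs = set()
    for left_ids, right_ids in blocks.values():
        pairs.update(product(left_ids, right_ids))
    return pairs


def blocking_quality(blocks, gold, n_left, n_right):
    """Compute recall, precision and reduction ratio for a blocking result."""
    pairs = candidate_pairs(blocks)
    true_positives = len(pairs & gold)
    return {
        "recall": true_positives / len(gold),
        "precision": true_positives / len(pairs) if pairs else 0.0,
        # Fraction of the n_left * n_right comparisons that blocking avoids
        "reduction_ratio": 1 - len(pairs) / (n_left * n_right),
    }


# Hypothetical toy input: block key -> (left ids, right ids)
toy_blocks = {"matrix": ({"l1"}, {"r1"}), "runner": ({"l2"}, {"r2"})}
gold = {("l1", "r1"), ("l2", "r3")}
quality = blocking_quality(toy_blocks, gold, n_left=2, n_right=3)
print(quality)
```

In this toy case one of the two gold pairs is found among two candidate pairs, so recall and precision are both 0.5, while only 2 of the 6 possible comparisons remain.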

Reproduce Experiments
=====================

The `experiment.py` script provides commands for datasets and blockers. Run `python experiment.py --help` to list the available commands. Subcommands offer help as well, e.g. `python experiment.py gcn-blocker --help`.

A dataset command must come before a blocker command.

For example, if you used micromamba for installation:

```bash
micromamba run -n klinker-conda python experiment.py movie-graph-benchmark-dataset --graph-pair "tmdb-tvdb" relational-token-blocker
```

This is similar to the steps described in the Usage section above.

To reproduce the results from the paper precisely, we provide run scripts adapted from our SLURM batch scripts in the `run_scripts` folder. Please consult `run_scripts/README.md` for further information. For archival purposes, the experiment artifacts and the source code are stored on [Zenodo](https://zenodo.org/records/12774407).
