Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dobraczka/klinker
🧱 blocking methods for entity resolution
blocking data-integration deduplication entity-alignment entity-resolution link-discovery record-linkage
- Host: GitHub
- URL: https://github.com/dobraczka/klinker
- Owner: dobraczka
- License: mit
- Created: 2023-02-12T14:00:38.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-15T11:34:04.000Z (9 months ago)
- Last Synced: 2024-05-22T00:05:55.555Z (9 months ago)
- Topics: blocking, data-integration, deduplication, entity-alignment, entity-resolution, link-discovery, record-linkage
- Language: Python
- Homepage: https://klinker.readthedocs.io
- Size: 861 KB
- Stars: 4
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
klinker
Installation
============
Clone the repo and change into the directory:
```bash
git clone https://github.com/dobraczka/klinker.git
cd klinker
```

For usage with a GPU, create a [micromamba](https://mamba.readthedocs.io/en/latest/micromamba-installation.html) environment:
```bash
micromamba env create -n klinker-conda --file=klinker-conda.yaml
```

Activate it and install the remaining dependencies:
```bash
mamba activate klinker-conda
pip install -e .
```

Alternatively, if you don't intend to use a GPU, you can install it in a virtual environment:
```bash
python -m venv klinker-env
source klinker-env/bin/activate
pip install -e ".[all]"
```

Or via [poetry](https://python-poetry.org/docs/):
```bash
poetry install
```

Usage
=====
Load a dataset:
```python
from sylloge import MovieGraphBenchmark
from klinker.data import KlinkerDataset

ds = KlinkerDataset.from_sylloge(MovieGraphBenchmark(graph_pair="tmdb-tvdb"))
```

Create blocks and write to parquet:
```python
from klinker.blockers import SimpleRelationalTokenBlocker

blocker = SimpleRelationalTokenBlocker()
blocks = blocker.assign(left=ds.left, right=ds.right, left_rel=ds.left_rel, right_rel=ds.right_rel)
blocks.to_parquet("tmdb-tvdb-tokenblocked")
```

Read blocks from parquet and evaluate:
```python
from klinker import KlinkerBlockManager
from klinker.eval_metrics import Evaluation

kbm = KlinkerBlockManager.read_parquet("tmdb-tvdb-tokenblocked")
ev = Evaluation.from_dataset(blocks=kbm, dataset=ds)
```

Reproduce Experiments
=====================
The `experiment.py` script has commands for datasets and blockers. You can use `python experiment.py --help` to show the available commands. Subcommands also offer help, e.g. `python experiment.py gcn-blocker --help`.
You have to use a dataset command before a blocker command.
For example, if you used micromamba for installation:
```bash
micromamba run -n klinker-conda python experiment.py movie-graph-benchmark-dataset --graph-pair "tmdb-tvdb" relational-token-blocker
```
This would be similar to the steps described in the above usage section.

In order to precisely reproduce the results from the paper, we provide (adapted) run scripts from our SLURM batch scripts in the `run_scripts` folder. Please consult `run_scripts/README.md` for further information. For archival purposes, the experiment artifacts and the source code are stored on [Zenodo](https://zenodo.org/records/12774407).
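As background for the blocking and evaluation steps in the usage section, the following is a minimal pure-Python sketch of plain token blocking and three commonly used blocking measures (recall, precision, reduction ratio). It is not klinker's implementation and does not use its API. The function names, the toy records, and the gold pairs are all invented for illustration, and the sketch ignores the relational information that a blocker such as `SimpleRelationalTokenBlocker` takes as input.

```python
from collections import defaultdict
from itertools import product


def token_blocks(records):
    """Map each lowercase whitespace token to the ids of records containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for token in text.lower().split():
            index[token].add(record_id)
    return index


def candidate_pairs(left, right):
    """Return all (left_id, right_id) pairs that share at least one token."""
    left_index = token_blocks(left)
    right_index = token_blocks(right)
    pairs = set()
    # Only tokens that occur on both sides form a usable block.
    for token in left_index.keys() & right_index.keys():
        pairs.update(product(left_index[token], right_index[token]))
    return pairs


def evaluate(pairs, gold, n_left, n_right):
    """Compute recall, precision, and reduction ratio for a candidate set."""
    true_positives = len(pairs & gold)
    return {
        "recall": true_positives / len(gold),
        "precision": true_positives / len(pairs) if pairs else 0.0,
        # Share of the full cross product that no longer has to be compared.
        "reduction_ratio": 1 - len(pairs) / (n_left * n_right),
    }


# Toy data, hypothetical and for illustration only.
left = {"l1": "The Matrix", "l2": "Blade Runner", "l3": "Alien"}
right = {"r1": "Matrix Reloaded", "r2": "Blade Runner 2049", "r3": "Aliens"}
gold = {("l2", "r2"), ("l3", "r3")}

pairs = candidate_pairs(left, right)
metrics = evaluate(pairs, gold, len(left), len(right))
print(sorted(pairs))
print(metrics)
```

In this toy example the true match `("l3", "r3")` is missed because "Alien" and "Aliens" share no exact token. This is one reason why exact-token blocking is often complemented by other blocking strategies.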