{"id":18985032,"url":"https://github.com/dobraczka/klinker","last_synced_at":"2025-04-19T20:26:10.688Z","repository":{"id":195825432,"uuid":"600754825","full_name":"dobraczka/klinker","owner":"dobraczka","description":"🧱 blocking methods for entity resolution","archived":false,"fork":false,"pushed_at":"2024-09-12T12:18:37.000Z","size":1248,"stargazers_count":6,"open_issues_count":2,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-16T19:21:39.730Z","etag":null,"topics":["blocking","data-integration","deduplication","entity-alignment","entity-resolution","link-discovery","record-linkage"],"latest_commit_sha":null,"homepage":"https://klinker.readthedocs.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dobraczka.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-12T14:00:38.000Z","updated_at":"2025-01-01T02:00:58.000Z","dependencies_parsed_at":"2023-09-19T19:48:46.838Z","dependency_job_id":"edb79faf-d66f-4f16-8b4c-d9c8b2ea9c17","html_url":"https://github.com/dobraczka/klinker","commit_stats":null,"previous_names":["dobraczka/klinker"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dobraczka%2Fklinker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dobraczka%2Fklinker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dobraczka%2Fklinker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories
/dobraczka%2Fklinker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dobraczka","download_url":"https://codeload.github.com/dobraczka/klinker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":249793718,"owners_count":21326576,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["blocking","data-integration","deduplication","entity-alignment","entity-resolution","link-discovery","record-linkage"],"created_at":"2024-11-08T16:24:09.591Z","updated_at":"2025-04-19T20:26:10.654Z","avatar_url":"https://github.com/dobraczka.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://github.com/dobraczka/klinker/raw/main/docs/assets/logo.png\" alt=\"klinker logo\", width=200/\u003e\n\u003c/p\u003e\n\u003ch2 align=\"center\"\u003e klinker\u003c/h2\u003e\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/dobraczka/klinker/actions/workflows/main.yml\"\u003e\u003cimg alt=\"Actions Status\" src=\"https://github.com/dobraczka/klinker/actions/workflows/main.yml/badge.svg?branch=main\"\u003e\u003c/a\u003e\n\u003ca href='https://klinker.readthedocs.io/en/latest/?badge=latest'\u003e\u003cimg src='https://readthedocs.org/projects/klinker/badge/?version=latest' alt='Documentation Status' /\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg alt=\"Code style: black\" src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp 
align=\"center\"\u003e\n\u003cimg src=\"https://github.com/dobraczka/klinker/raw/main/docs/assets/KlinkerArchitectureNoLogo.png\" alt=\"klinker overview\", width=800/\u003e\n\u003c/p\u003e\n\nInstallation\n============\nClone the repo and change into the directory:\n\n```bash\ngit clone https://github.com/dobraczka/klinker.git\ncd klinker\n```\n\nFor GPU usage, create a [micromamba](https://mamba.readthedocs.io/en/latest/micromamba-installation.html) environment:\n\n```bash\nmicromamba env create -n klinker-conda --file=klinker-conda.yaml\n```\n\nActivate it and install the remaining dependencies:\n```\nmamba activate klinker-conda\npip install -e .\n```\n\nAlternatively, if you don't intend to use a GPU, you can install it in a virtual environment:\n```\npython -m venv klinker-env\nsource klinker-env/bin/activate\npip install -e .[all]\n```\n\nor via [poetry](https://python-poetry.org/docs/):\n```\npoetry install\n```\n\nUsage\n=====\nLoad a dataset:\n```python\nfrom sylloge import MovieGraphBenchmark\nfrom klinker.data import KlinkerDataset\n\nds = KlinkerDataset.from_sylloge(MovieGraphBenchmark(graph_pair=\"tmdb-tvdb\"))\n```\n\nCreate blocks and write to parquet:\n\n```python\nfrom klinker.blockers import SimpleRelationalTokenBlocker\n\nblocker = SimpleRelationalTokenBlocker()\nblocks = blocker.assign(left=ds.left, right=ds.right, left_rel=ds.left_rel, right_rel=ds.right_rel)\nblocks.to_parquet(\"tmdb-tvdb-tokenblocked\")\n```\n\nRead blocks from parquet and evaluate:\n```python\nfrom klinker import KlinkerBlockManager\nfrom klinker.eval_metrics import Evaluation\n\nkbm = KlinkerBlockManager.read_parquet(\"tmdb-tvdb-tokenblocked\")\nev = Evaluation.from_dataset(blocks=kbm, dataset=ds)\n```\n\nReproduce Experiments\n=====================\n\nThe `experiment.py` script has commands for datasets and blockers. You can use `python experiment.py --help` to show the available commands. Subcommands also offer help, e.g. 
`python experiment.py gcn-blocker --help`.\n\nYou have to use a dataset command before a blocker command.\n\nFor example, if you used micromamba for installation:\n```bash\nmicromamba run -n klinker-conda python experiment.py movie-graph-benchmark-dataset --graph-pair \"tmdb-tvdb\" relational-token-blocker\n```\nThis is similar to the steps described in the Usage section above.\n\nTo precisely reproduce the results from the paper, we provide run scripts (adapted from our SLURM batch scripts) in the `run_scripts` folder. Please consult the `run_scripts/README.md` for further information. For archival purposes, the experiment artifacts and the source code are stored in [Zenodo](https://zenodo.org/records/12774407).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdobraczka%2Fklinker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdobraczka%2Fklinker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdobraczka%2Fklinker/lists"}