{"id":48372555,"url":"https://github.com/ai-team-uoa/autoer","last_synced_at":"2026-04-05T17:04:44.881Z","repository":{"id":243648423,"uuid":"791240796","full_name":"AI-team-UoA/AutoER","owner":"AI-team-UoA","description":"Code \u0026 Experiments for IEEE Access paper \"Auto-Configuring Entity Resolution Pipelines\" by K.Nikoletos, V.Efthymiou, G.Papadakis and K.Stafanidis","archived":false,"fork":false,"pushed_at":"2025-08-18T09:18:25.000Z","size":133323,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-09-09T01:17:13.594Z","etag":null,"topics":["automl","end-to-end","entity-resolution","hyperparameter-tuning"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AI-team-UoA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2024-04-24T11:01:19.000Z","updated_at":"2025-08-18T09:18:28.000Z","dependencies_parsed_at":"2024-06-10T12:27:24.454Z","dependency_job_id":"88a68f02-1d8d-472a-a326-aa9281c86c89","html_url":"https://github.com/AI-team-UoA/AutoER","commit_stats":null,"previous_names":["ai-team-uoa/pyjedai-autoconfiguration","ai-team-uoa/autoer"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/AI-team-UoA/AutoER","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-team-UoA%2FAutoER","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-team-UoA%2FAutoER/tags","releases_url":"https://repos.ecosyste.
ms/api/v1/hosts/GitHub/repositories/AI-team-UoA%2FAutoER/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-team-UoA%2FAutoER/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AI-team-UoA","download_url":"https://codeload.github.com/AI-team-UoA/AutoER/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AI-team-UoA%2FAutoER/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31442926,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-05T15:22:31.103Z","status":"ssl_error","status_checked_at":"2026-04-05T15:22:00.205Z","response_time":75,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automl","end-to-end","entity-resolution","hyperparameter-tuning"],"created_at":"2026-04-05T17:04:44.080Z","updated_at":"2026-04-05T17:04:44.870Z","avatar_url":"https://github.com/AI-team-UoA.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cbr\u003e\u003cb\u003e\u003ch1\u003eAuto-Configuring Entity Resolution Pipelines\u003c/h1\u003e\u003c/b\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\" style=\"font-size:20px; font-weight:bold;\"\u003e\n    Konstantinos Nikoletos\u003csup\u003e1\u003c/sup\u003e, \n  
  Vasilis Efthymiou\u003csup\u003e2\u003c/sup\u003e, \n    George Papadakis\u003csup\u003e3\u003c/sup\u003e, \n    Kostas Stefanidis\u003csup\u003e4\u003c/sup\u003e\n\u003c/div\u003e\n\n\u003cdiv align=\"center\" style=\"font-size:14px; font-weight:normal; margin-top:6px;\"\u003e\n    \u003csup\u003e1\u003c/sup\u003eNational and Kapodistrian University of Athens, Greece (\u003ci\u003ek.nikoletos@di.uoa.gr\u003c/i\u003e)\u003cbr\u003e\n    \u003csup\u003e2\u003c/sup\u003eHarokopio University of Athens, Greece (\u003ci\u003evefthym@hua.gr\u003c/i\u003e)\u003cbr\u003e\n    \u003csup\u003e3\u003c/sup\u003eNational and Kapodistrian University of Athens, Greece (\u003ci\u003egpapadis@di.uoa.gr\u003c/i\u003e)\u003cbr\u003e\n    \u003csup\u003e4\u003c/sup\u003eTampere University, Finland (\u003ci\u003ekonstantinos.stefanidis@tuni.fi\u003c/i\u003e)\n\u003c/div\u003e\n\n\n---\n\n## Overview\n\nEntity Resolution (ER) is the task of identifying and linking different descriptions of the same real-world entity (e.g., a person, product, publication, or location) across diverse datasets. While ER is essential for improving data quality and enabling downstream applications such as analytics and machine learning, building an effective ER pipeline is far from trivial.  \n\nAn end-to-end ER workflow typically consists of multiple steps — such as **blocking**, **similarity estimation**, and **clustering** — each of which requires careful selection and tuning of algorithms and parameters. The search space for possible configurations is enormous, and the performance of a given pipeline is highly sensitive to these choices. 
Traditionally, this tuning process has been manual, time-consuming, and dependent on the availability of ground truth labels, making ER both labor-intensive and error-prone.\n\nThis project tackles the challenge by introducing **Auto-Configuring Entity Resolution Pipelines**, a framework that leverages **pre-trained language models** and **AutoML techniques** to automatically configure efficient, high-performing ER workflows. The framework addresses two key scenarios:\n\n1. **Ground-Truth Aware Auto-Configuration**  \n   When a portion of ground truth matches is available, we frame parameter tuning as a hyperparameter optimization problem. By integrating **sampling-based search techniques** (e.g., Random, TPE, QMC, Bayesian optimization), our approach drastically reduces the number of trials needed to approximate the optimal configuration, achieving near-optimal effectiveness in **orders of magnitude less time** compared to exhaustive grid search.\n\n2. **Ground-Truth Agnostic Auto-Configuration**  \n   When no ground truth is available, we introduce a **regression-based approach** that predicts the effectiveness of pipeline configurations. Using dataset profiling features and configurations from other datasets with ground truth, we train models (Random Forest and AutoML ensembles) to generalize and recommend effective configurations for unseen datasets.\n\n\u003cdiv align=\"center\"\u003e\n    \u003cimg src=\"./Nikoletos-paper/figures/pyjedai/pipeline-AutoER.png\" alt=\"ETEER Pipeline\" width=\"650\"/\u003e\n  \u003cbr\u003eFigure 1: End-to-End ER (ETEER) pipeline leveraged by AutoER.\n\u003c/div\u003e\n\n### Key Contributions\n- **Problem Definition:** Formalization of two novel ER problems — automatic pipeline configuration *with* and *without* ground truth — not previously explored in ER research.  
\n- **Sampling for ER:** First application of hyperparameter optimization methods (Random, TPE, QMC, Bayesian search) to ER pipelines, demonstrating they can achieve near-optimal F1 with only a fraction of the search cost.  \n- **Regression-Based Auto-Configuration:** First regression-based solution for ER configuration without ground truth, leveraging dataset features and supervised learning to predict effective pipeline setups.  \n- **Extensive Evaluation:** Empirical results on **11 real-world benchmark datasets** show that the proposed approaches consistently balance high effectiveness with significant runtime efficiency.  \n- **Open-Source Implementation:** The framework is released openly to foster reproducibility and further research in automated ER.  \n\n\n# Repository Structure\n\n- `data/` – datasets used in experiments.  \n- `figures/` – figures from the paper (pipeline, results, etc.).  \n- `sheets/` – CSV files and spreadsheets with experimental results.  \n- `with_gt/` – code \u0026 scripts for **Problem 1** (auto-config with ground truth).  \n- `without_gt/` – code \u0026 scripts for **Problem 2** (auto-config without ground truth).  \n- `baseline/` – replication of ZeroER \u0026 DITTO baselines.  \n- `benchmarking/` – scalability evaluation on DBpedia.  \n- `results.ipynb` – notebook to generate figures and tables.  \n\n# Datasets\n\nFrom the initial directory, run the following commands to download and prepare the datasets:\n\n```\nchmod +x prepare_datasets.sh\n./prepare_datasets.sh\n```\n\n# Problem 1: **With** Ground-Truth file\n\n## Build\n\nCreate conda env:\n\n```\nconda env create -f autoconf_env_p1_p2.yml\nconda activate autoconf_p1_p2\n```\n\n## Execution\n\nGo to `/with_gt/scripts/` and run \n\n```\nnohup ./run_exps.sh 2\u003e\u00261 \u0026 \n```\n\nAt the end, the output files are concatenated into the appropriate format. 
\n\n# Problem 2: **Without** Ground-Truth file\n\n## AutoML Approach\n\n### Build\n\nCreate conda env:\n\n```\nconda env create -f autoconf_env_automl.yml\nconda activate autoconf_automl\n```\n\n### Execute\n\nTo run one experiment:\n```\npython -u regression_with_automl.py --trials_type $t --hidden_dataset $d --config $config_file\n```\n\nwhere:\n- `--trials_type` stands for training instances type\n- `--hidden_dataset` stands for training with Di..j and holding Dx as hidden for testing\n-  `--config` specifies experiment type\n\nTo run all one-by-one:\n```\nnohup ./automl_exps.sh ./automl/configs/12_4_0.json \u003e ./automl/logs/EXPS_12_4_0.log  2\u003e\u00261 \u0026\n```\n\nThe config file specifies the experiment's characteristics, such as the overall and per-model hours for auto-sklearn. \n\nAt the end, you need to concatenate all results into a format that can be read by the notebook, for merging purposes. \n\nExecute:\n\n```\npython concatenate.py --exp 12_4_0\n```\n\nwhere `--exp` stands for the experiment name executed before.\n\n## Regression\n\n\n### Build\n\nCreate conda env:\n\n```\nconda env create -f autoconf_env_p1_p2.yml\nconda activate autoconf_env_p1_p2\n```\n\n### Execute\n\nTo run one experiment:\n```\npython -u regression_with_sklearn.py --trials $dataset --regressor \"LINEAR\"\n```\n\nwhere:\n- `--trials` stands for training instances type\n\nTo run all one-by-one:\n```\nnohup ./sklearn_exps.sh \u003e sklearn_exps.log 2\u003e\u00261 \u0026\n```\n\n## Merging all results into common files\n\nAfter all experiments have finished, run:\n\n```\npython concatenate_exps.py\n```\n\nand you're ready!\n\n# Scalability tests on DBpedia dataset\n\n## Using AutoML approach\n\nExecuting this will create the top-1 workflow suggested per training trials type for DBPedia.\n```\nnohup ./run_dbpedia_exps.sh \u003e ./logs/dbpedia.log  2\u003e\u00261 \u0026\n```\n\n## Using Regression approach\n\nCreate predictions for all instances:\n\n```\npython 
eteer_evaluate_ind_regressors.py --config ./configs/D1D10_DBPEDIA_ALL_LinearRegression.json\npython eteer_evaluate_ind_regressors.py --config ./configs/D1D10_DBPEDIA_OPTUNA_LinearRegression.json\npython eteer_evaluate_ind_regressors.py --config ./configs/D1D10_DBPEDIA_GRIDSEARCH_LinearRegression.json\n```\n\nwhere:\n- `--config` stands for the experiment specifications (these configs are included). For example, the name `D1D10_DBPEDIA_ALL_LinearRegression.json` means: train on D1...D10, test on DBPEDIA, use all trial instances, and use Linear Regression.\n\n\n## Evaluating the prediction to get the real F1 (applies to both types of training)\n\nFor AutoML:\n```\n ./eval_dbpedia_exps.sh\n```\n\nor for LR:\n\n```\nnohup python -u evaluate.py --confcsv ./results/D1D10_DBPEDIA_{$TYPE}_LinearRegression.csv  --datajson ./configs/data/dbpedia.json \u003e ./logs/D1D10_DBPEDIA_LR.log 2\u003e\u00261 \u0026\n```\n\nsame for `{$TYPE} = ALL, OPTUNA, GRIDSEARCH`\n\nwhere:\n-  `--confcsv`: is used in a similar way as before\n-  ` --datajson`: contains the needed information of the dataset that will be evaluated\n\n# Baseline\n\n## ZeroER\n\n1. Go to `cd ./baselines`\n2. Create conda env\n    1. `conda env create -f environment.yml`\n    2. `conda activate ZeroER`\n3. 
Run all exps `./run.sh ./logs`\n\n## DITTO\n\nDownloading NVIDIA container toolkit:\n```\nchmod +x nvidia_installation.sh\n./nvidia_installation.sh\n```\n\nCreating the environment:\n```\nsudo docker build -t ditto ditto\n```\n\nConfiguration:\n```\nCUDA_VISIBLE_DEVICES=0 python train_ditto.py --task AutoER/D2  --batch_size 16 --max_len 256 --lr 3e-5 --n_epochs 5 --lm roberta --fp16 --da del --dk product --summarize\n```\n\nBlocks for DITTO created in ready_for_ditto_input directory, using:\n```\ntransform_all_for_ditto.sh\n```\nand more specifically:\n```\npython blocking.py --datajson '../../data/configs/D2.json'\n```\n\nwhere datajson is the configuration file for the dataset.\n\n\nMoving files inside docker container:\n```\ndocker cp ./configs.json acc70a93a256:/workspace/ditto     \ndocker cp ./ready_for_ditto_input/ acc70a93a256:/workspace/ditto/data/./ready_for_ditto_input/  \ndocker cp ./train_ditto.py acc70a93a256:/workspace/ditto\ndocker cp ./run_all_inside.sh 54d79d32d83d:/workspace/ditto\n``` \n\nEntering docker:\n```\nsudo docker run -it --gpus all --entrypoint=/bin/bash ditto       \n```\n\nInside docker:\n```\ncd /workspace/ditto\nmkdir logs\nchmod +x run_all_inside.sh\nnohup ./run_all_inside.sh \u003e nohup.out 2\u003e\u00261 \u0026 \n```\n\nResults will be in `./workspace/ditto/logs/`.\n\n# Resources\n\n| Spec    | Exp. P1 \u0026 P2                             | Exp. 
P2 - AutoML                                                   |\n|---------|------------------------------------------|--------------------------------------------------------------------|\n| OS      | Ubuntu 22.04 jammy                       | Ubuntu 22.04 jammy                                                 |\n| Kernel  | x86_64 Linux 6.2.0-36-generic            | x86_64 Linux 6.5.0-18-generic                                      |\n| CPU     | Intel Core i7-9700K @ 8x 4.9GHz [46.0°C] | Intel Xeon E5-4603 v2 @ 32x 2.2GHz [31.0°C]                        |\n| GPU     | NVIDIA GeForce RTX 2080 Ti               | Matrox Electronics Systems Ltd. G200eR2                            |\n| RAM     | 6622MiB / 64228MiB                       | 4381MiB / 128831MiB                                                |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-team-uoa%2Fautoer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fai-team-uoa%2Fautoer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fai-team-uoa%2Fautoer/lists"}