{"id":33129353,"url":"https://github.com/ipums/hlink","last_synced_at":"2026-02-19T21:59:13.460Z","repository":{"id":39851607,"uuid":"486261782","full_name":"ipums/hlink","owner":"ipums","description":"Hierarchical record linkage at scale","archived":false,"fork":false,"pushed_at":"2025-11-24T21:41:27.000Z","size":20255,"stargazers_count":13,"open_issues_count":17,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-11-27T12:51:50.065Z","etag":null,"topics":["machine-learning","pyspark","python","record-linkage"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mpl-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ipums.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE.txt","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2022-04-27T16:02:50.000Z","updated_at":"2025-11-21T20:36:55.000Z","dependencies_parsed_at":"2023-01-28T14:00:57.687Z","dependency_job_id":"bbf67b03-36b5-469b-801c-5ff533383611","html_url":"https://github.com/ipums/hlink","commit_stats":{"total_commits":178,"total_committers":6,"mean_commits":"29.666666666666668","dds":0.1910112359550562,"last_synced_commit":"32b771b1e14cda6e28081a95beccc902d7c864e2"},"previous_names":[],"tags_count":29,"template":false,"template_full_name":null,"purl":"pkg:github/ipums/hlink","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipums%2Fhlink","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipums%2Fhlink/tags","releases_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/ipums%2Fhlink/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipums%2Fhlink/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ipums","download_url":"https://codeload.github.com/ipums/hlink/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ipums%2Fhlink/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29634614,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-19T18:02:07.722Z","status":"ssl_error","status_checked_at":"2026-02-19T18:01:46.144Z","response_time":117,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["machine-learning","pyspark","python","record-linkage"],"created_at":"2025-11-15T08:00:37.547Z","updated_at":"2026-02-19T21:59:13.444Z","avatar_url":"https://github.com/ipums.png","language":"Python","funding_links":[],"categories":["Software"],"sub_categories":[],"readme":"[![HLink Docker CI](https://github.com/ipums/hlink/actions/workflows/docker-build.yml/badge.svg)](https://github.com/ipums/hlink/actions/workflows/docker-build.yml)\n\n# hlink: hierarchical record linkage at scale\n\nhlink is a Python package that provides a flexible, configuration-driven solution to probabilistic record linking at scale. 
It provides a high-level API for Python as well as a standalone command line interface for running linking jobs with little to no programming. hlink supports the linking process from beginning to end, including preprocessing, filtering, training, model exploration, blocking, feature generation and scoring.\n\nIt is used at [IPUMS](https://www.ipums.org/) to link U.S. historical census data, but can be applied to any record linkage job.\nA paper on the creation and applications of this program on historical census data can be found at \u003chttps://www.tandfonline.com/doi/full/10.1080/01615440.2021.1985027\u003e.\n\n### Suggested Citation\nWellington, J., R. Harper, and K.J. Thompson. 2022. \"hlink.\" https://github.com/ipums/hlink: Institute for Social Research and Data Innovation, University of Minnesota.\n\n## Installation\n\nhlink requires\n\n- Python 3.10, 3.11, or 3.12\n- Java 8 or greater for integration with PySpark\n\nYou can install the newest version of the Python package directly from PyPI with pip:\n```\npip install hlink\n```\n\nWe do our best to make hlink compatible with Python 3.10-3.12. If you have a\nproblem using hlink on one of these versions of Python, please open an issue\nthrough GitHub. Versions of Python older than 3.10 are not supported.\n\nNote that PySpark 3.5 does not yet officially support Python 3.12. If you\nencounter PySpark-related import errors while running hlink on Python 3.12, try\n\n- Installing the setuptools package. The distutils package was deleted from the\n  standard library in Python 3.12, but some versions of PySpark still import\n  it. The setuptools package provides a hacky stand-in distutils library which\n  should fix some import errors in PySpark. We install setuptools in our\n  development and test dependencies so that our tests work on Python 3.12.\n\n- Downgrading Python to 3.10 or 3.11. PySpark officially supports these\n  versions of Python. 
So you should have better chances getting PySpark to work\n  well on Python 3.10 or 3.11.\n\n### Additional Machine Learning Algorithms\n\nhlink has optional support for two additional machine learning algorithms,\n[XGBoost](https://xgboost.readthedocs.io/en/stable/index.html) and\n[LightGBM](https://lightgbm.readthedocs.io/en/latest/index.html). Both of these\nalgorithms are highly performant gradient boosting libraries, each with its own\ncharacteristics. These algorithms are not implemented directly in Spark, so\nthey require some additional dependencies. To install the required Python\ndependencies, run\n\n```\npip install hlink[xgboost]\n```\n\nfor XGBoost or\n\n```\npip install hlink[lightgbm]\n```\n\nfor LightGBM. If you would like to install both at once, you can run\n\n```\npip install hlink[xgboost,lightgbm]\n```\n\nto get the Python dependencies for both. Both XGBoost and LightGBM also require\nlibomp, which will need to be installed separately if you don't already have it.\n\nAfter installing the dependencies for one or both of these algorithms, you can\nuse them as model types in training and model exploration. You can read more\nabout these models in the hlink documentation [here](https://hlink.docs.ipums.org/models.html).\n\n## Docs\n\nThe documentation site can be found at [hlink.docs.ipums.org](https://hlink.docs.ipums.org).\nThis includes information about installation and setting up your configuration files.\n\nAn example script and configuration file can be found in the `examples` directory.\n\n## Quick Start\n\nThe main class in the library is LinkRun, which represents a complete linking job. It provides access to each of the link tasks and their steps. 
Here is an example script that uses LinkRun to do some linking.\n\n```python\nfrom hlink.linking.link_run import LinkRun\nfrom hlink.spark.factory import SparkFactory\nfrom hlink.configs.load_config import load_conf_file\n\n# First we create a SparkSession with all default configuration settings.\nfactory = SparkFactory()\nspark = factory.create()\n\n# Now let's load in our config file. See the example config below.\n# This config file is in toml format, but we also allow json format.\n# Alternatively you can create a python dictionary directly with the same\n# keys and values as is in the config.\nconfig = load_conf_file(\"./my_conf.toml\")\n\nlr = LinkRun(spark, config)\n\n# Get some information about each of the steps in the\n# preprocessing task.\nprep_steps = lr.preprocessing.get_steps()\nfor (i, step) in enumerate(prep_steps):\n    print(f\"Step {i}:\", step)\n    print(\"Required input tables:\", step.input_table_names)\n    print(\"Generated output tables:\", step.output_table_names)\n\n# Run all of the steps in the preprocessing task.\nlr.preprocessing.run_all_steps()\n\n# Run the first two steps in the matching task.\nlr.matching.run_step(0)\nlr.matching.run_step(1)\n\n# Get the potential_matches table.\nmatches = lr.get_table(\"potential_matches\")\n\nassert matches.exists()\n\n# Get the Spark DataFrame for the potential_matches table.\nmatches_df = matches.df()\n```\n\nAn example configuration file:\n\n```toml\n### hlink config file ###\n# This is a sample config file for the hlink program in toml format.\n\n# The name of the unique identifier in the datasets\nid_column = \"id\" \n\n### INPUT ###\n\n# The input datasets\n[datasource_a]\nalias = \"a\"\nfile = \"data/A.csv\"\n\n[datasource_b]\nalias = \"b\"\nfile = \"data/B.csv\"\n\n### PREPROCESSING ###\n\n# The columns to extract from the sources and the preprocessing to be done on them.\n[[column_mappings]]\ncolumn_name = \"NAMEFRST\"\ntransforms = [\n    {type = 
\"lowercase_strip\"}\n]\n\n[[column_mappings]]\ncolumn_name = \"NAMELAST\"\ntransforms = [\n    {type = \"lowercase_strip\"}\n]\n\n[[column_mappings]]\ncolumn_name = \"AGE\"\ntransforms = [\n    {type = \"add_to_a\", value = 10}\n]\n\n[[column_mappings]]\ncolumn_name = \"SEX\"\n\n\n### BLOCKING ###\n\n# Blocking parameters\n# Here we are blocking on sex and +/- age.\n# This means that comparisons are only done on records\n# where the SEX fields match exactly and the AGE\n# fields are within a distance of 2 of each other.\n[[blocking]]\ncolumn_name = \"SEX\"\n\n[[blocking]]\ncolumn_name = \"AGE_2\"\ndataset = \"a\"\nderived_from = \"AGE\"\nexpand_length = 2\nexplode = true\n\n### COMPARISON FEATURES ###\n\n# Here we detail the comparison features that are\n# created between the two records. In this case\n# we are comparing first and last names using\n# the Jaro-Winkler metric.\n\n[[comparison_features]]\nalias = \"NAMEFRST_JW\"\ncolumn_name = \"NAMEFRST\"\ncomparison_type = \"jaro_winkler\"\n\n[[comparison_features]]\nalias = \"NAMELAST_JW\"\ncolumn_name = \"NAMELAST\"\ncomparison_type = \"jaro_winkler\"\n\n# Here we detail the thresholds at which we would\n# like to keep potential matches. In this case\n# we will keep only matches where the first name\n# Jaro-Winkler score is greater than 0.79 and\n# the last name Jaro-Winkler score is greater than 0.84.\n\n[comparisons]\noperator = \"AND\"\n\n[comparisons.comp_a]\ncomparison_type = \"threshold\"\nfeature_name = \"NAMEFRST_JW\"\nthreshold = 0.79\n\n[comparisons.comp_b]\ncomparison_type = \"threshold\"\nfeature_name = \"NAMELAST_JW\"\nthreshold = 0.84\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipums%2Fhlink","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fipums%2Fhlink","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fipums%2Fhlink/lists"}