{"id":21426764,"url":"https://github.com/j535d165/recordlinkage","last_synced_at":"2025-05-14T13:06:16.593Z","repository":{"id":37390526,"uuid":"44471657","full_name":"J535D165/recordlinkage","owner":"J535D165","description":"A powerful and modular toolkit for record linkage and duplicate detection in Python","archived":false,"fork":false,"pushed_at":"2024-02-21T18:15:38.000Z","size":73399,"stargazers_count":997,"open_issues_count":64,"forks_count":156,"subscribers_count":32,"default_branch":"master","last_synced_at":"2025-04-13T06:15:34.877Z","etag":null,"topics":["data-matching","dedupe","deduplication","entity-resolution","machine-learning","privacy","python","python-library","record-linkage","similarity","string-distance","utrecht-university"],"latest_commit_sha":null,"homepage":"http://recordlinkage.readthedocs.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/J535D165.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2015-10-18T09:00:02.000Z","updated_at":"2025-04-12T14:02:32.000Z","dependencies_parsed_at":"2023-02-09T19:15:46.222Z","dependency_job_id":"39dbf440-8545-4007-9d58-d60a0baf3222","html_url":"https://github.com/J535D165/recordlinkage","commit_stats":{"total_commits":868,"total_committers":21,"mean_commits":"41.333333333333336","dds":0.1716589861751152,"last_synced_commit":"2af1dce972fd02ee1fc41c76c7c0b1420e2dc122"},"previous_names":[],"tags_count":29,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J535D165%2Frecordlinkage","tags_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J535D165%2Frecordlinkage/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J535D165%2Frecordlinkage/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/J535D165%2Frecordlinkage/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/J535D165","download_url":"https://codeload.github.com/J535D165/recordlinkage/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254149953,"owners_count":22022851,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-matching","dedupe","deduplication","entity-resolution","machine-learning","privacy","python","python-library","record-linkage","similarity","string-distance","utrecht-university"],"created_at":"2024-11-22T21:43:26.466Z","updated_at":"2025-05-14T13:06:16.537Z","avatar_url":"https://github.com/J535D165.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/J535D165/recordlinkage/master/docs/images/recordlinkage-banner-transparent.svg\"\u003e\u003cbr\u003e\n\u003c/div\u003e\n\n# RecordLinkage: powerful and modular Python record linkage toolkit\n\n[![Pypi Version](https://badge.fury.io/py/recordlinkage.svg)](https://pypi.python.org/pypi/recordlinkage/)\n[![Github Actions CI Status](https://github.com/J535D165/recordlinkage/workflows/tests/badge.svg?branch=master)](https://github.com/J535D165/recordlinkage/actions)\n[![Code 
Coverage](https://codecov.io/gh/J535D165/recordlinkage/branch/master/graph/badge.svg)](https://codecov.io/gh/J535D165/recordlinkage)\n[![Documentation Status](https://readthedocs.org/projects/recordlinkage/badge/?version=latest)](https://recordlinkage.readthedocs.io/en/latest/?badge=latest)\n[![Zenodo DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.3559042.svg)](https://doi.org/10.5281/zenodo.3559042)\n\n**RecordLinkage** is a powerful and modular record linkage toolkit to\nlink records in or between data sources. The toolkit provides most of\nthe tools needed for record linkage and deduplication. The package\ncontains indexing methods, functions to compare records and classifiers.\nThe package is developed for research and the linking of small or medium\nsized files.\n\nThis project is inspired by the [Freely Extensible Biomedical Record\nLinkage (FEBRL)](https://sourceforge.net/projects/febrl/) project, which\nis a great project. In contrast with FEBRL, the recordlinkage project\nuses [pandas](http://pandas.pydata.org/) and\n[numpy](http://www.numpy.org/) for data handling and computations. The\nuse of *pandas*, a flexible and powerful data analysis and manipulation\nlibrary for Python, makes the record linkage process much easier and\nfaster. The extensive *pandas* library can be used to integrate your\nrecord linkage directly into existing data manipulation projects.\n\nOne of the aims of this project is to make an easily extensible record\nlinkage framework. 
It is easy to include your own indexing algorithms,\ncomparison/similarity measures and classifiers.\n\n## Basic linking example\n\nImport the `recordlinkage` module with all important tools for record\nlinkage and import the data manipulation framework **pandas**.\n\n``` python\nimport recordlinkage\nimport pandas\n```\n\nLoad your data into pandas DataFrames.\n\n``` python\ndf_a = pandas.DataFrame(YOUR_FIRST_DATASET)\ndf_b = pandas.DataFrame(YOUR_SECOND_DATASET)\n```\n\nComparing all records can be computationally intensive. Therefore, we\nmake a set of candidate links with one of the built-in indexing techniques\nlike **blocking**. In this example, only pairs of records that agree on\nthe surname are returned.\n\n``` python\nindexer = recordlinkage.Index()\nindexer.block('surname')\ncandidate_links = indexer.index(df_a, df_b)\n```\n\nFor each candidate link, compare the records with one of the comparison\nor similarity algorithms in the Compare class.\n\n``` python\nc = recordlinkage.Compare()\n\nc.string('name_a', 'name_b', method='jarowinkler', threshold=0.85)\nc.exact('sex', 'gender')\nc.date('dob', 'date_of_birth')\nc.string('str_name', 'streetname', method='damerau_levenshtein', threshold=0.7)\nc.exact('place', 'placename')\nc.numeric('income', 'income', method='gauss', offset=3, scale=3, missing_value=0.5)\n\n# The comparison vectors\nfeature_vectors = c.compute(candidate_links, df_a, df_b)\n```\n\nClassify the candidate links into matching or distinct pairs based on\ntheir comparison result with one of the [classification\nalgorithms](https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html).\nThe following code classifies candidate pairs with a Logistic Regression\nclassifier. 
This (supervised machine learning) algorithm requires\ntraining data.\n\n``` python\nlogrg = recordlinkage.LogisticRegressionClassifier()\nlogrg.fit(TRAINING_COMPARISON_VECTORS, TRAINING_PAIRS)\n\nlogrg.predict(feature_vectors)\n```\n\nThe following code shows the classification of candidate pairs with the\nExpectation-Conditional Maximisation (ECM) algorithm. This variant of\nthe Expectation-Maximisation algorithm doesn't require training data\n(unsupervised machine learning).\n\n``` python\necm = recordlinkage.ECMClassifier()\necm.fit_predict(feature_vectors)\n```\n\n## Main Features\n\nThe main features of this Python record linkage toolkit are:\n\n-   Clean and standardise data with easy-to-use tools\n-   Make pairs of records with smart indexing methods such as\n    **blocking** and **sorted neighbourhood indexing**\n-   Compare records with a large number of comparison and similarity\n    measures for different types of variables such as strings, numbers\n    and dates.\n-   Several classification algorithms, both supervised and\n    unsupervised.\n-   Common record linkage evaluation tools\n-   Several built-in datasets.\n\n## Documentation\n\nThe most recent documentation and API reference can be found at\n[recordlinkage.readthedocs.org](http://recordlinkage.readthedocs.org/en/latest/).\nThe documentation provides some basic usage examples like\n[deduplication](http://recordlinkage.readthedocs.io/en/latest/guides/data_deduplication.html)\nand\n[linking](http://recordlinkage.readthedocs.io/en/latest/guides/link_two_dataframes.html)\ncensus data. More examples are coming soon. If you have interesting\nexamples to share, let us know.\n\n## Installation\n\nThe Python Record linkage Toolkit requires Python 3.8 or higher. 
Install the\npackage easily with pip\n\n``` sh\npip install recordlinkage\n```\n\nThe toolkit depends on popular packages like\n[Pandas](https://github.com/pydata/pandas),\n[Numpy](http://www.numpy.org), [Scipy](https://www.scipy.org/) and\n[Scikit-learn](http://scikit-learn.org/). A complete list of\ndependencies can be found in the [installation\nmanual](https://recordlinkage.readthedocs.io/en/latest/installation.html)\nas well as recommended and optional dependencies.\n\n## License\n\nThe license for this record linkage tool is BSD-3-Clause.\n\n## Citation\n\nPlease cite this package when it is used in an academic context. Ensure\nthat the DOI and version match the installed version. Citation styles\ncan be found on the publisher's website\n[10.5281/zenodo.3559042](https://doi.org/10.5281/zenodo.3559042).\n\n``` text\n@software{de_bruin_j_2019_3559043,\n  author       = {De Bruin, J},\n  title        = {{Python Record Linkage Toolkit: A toolkit for\n                   record linkage and duplicate detection in Python}},\n  month        = dec,\n  year         = 2019,\n  publisher    = {Zenodo},\n  version      = {v0.14},\n  doi          = {10.5281/zenodo.3559043},\n  url          = {https://doi.org/10.5281/zenodo.3559043}\n}\n```\n\n## Need help?\n\nStuck on your record linkage code or problem? Any other questions? Don't\nhesitate to send me an email (\u003cjonathandebruinos@gmail.com\u003e).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj535d165%2Frecordlinkage","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fj535d165%2Frecordlinkage","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fj535d165%2Frecordlinkage/lists"}