{"id":13535770,"url":"https://github.com/RobinL/fuzzymatcher","last_synced_at":"2025-04-02T02:31:22.314Z","repository":{"id":37851436,"uuid":"111990189","full_name":"RobinL/fuzzymatcher","owner":"RobinL","description":"Record linking package that fuzzy matches two Python pandas dataframes using sqlite3 fts4","archived":false,"fork":false,"pushed_at":"2022-08-09T18:27:44.000Z","size":868,"stargazers_count":281,"open_issues_count":22,"forks_count":60,"subscribers_count":11,"default_branch":"master","last_synced_at":"2024-10-13T00:08:20.334Z","etag":null,"topics":["data-matching","fuzzy-matching","probabalistic-matching","pypi"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RobinL.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2017-11-25T08:58:00.000Z","updated_at":"2024-09-16T20:41:03.000Z","dependencies_parsed_at":"2022-08-08T22:01:52.356Z","dependency_job_id":null,"html_url":"https://github.com/RobinL/fuzzymatcher","commit_stats":null,"previous_names":[],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Ffuzzymatcher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Ffuzzymatcher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Ffuzzymatcher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RobinL%2Ffuzzymatcher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RobinL","download_url":"https://codeload.github.com/RobinL/fuzzymatcher/tar.gz/
refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246743623,"owners_count":20826565,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-matching","fuzzy-matching","probabalistic-matching","pypi"],"created_at":"2024-08-01T09:00:26.474Z","updated_at":"2025-04-02T02:31:21.906Z","avatar_url":"https://github.com/RobinL.png","language":"Python","funding_links":[],"categories":["Tools:","Python","Software"],"sub_categories":["Fuzzy matching/identity resolution:"],"readme":".. image:: https://badge.fury.io/py/fuzzymatcher.svg\n    :target: https://badge.fury.io/py/fuzzymatcher\n\n.. image:: https://codecov.io/gh/RobinL/fuzzymatcher/branch/dev/graph/badge.svg\n  :target: https://codecov.io/gh/RobinL/fuzzymatcher\n\n\nfuzzymatcher\n======================================\n\n**Note:  fuzzymatcher is no longer actively maintained.  Please see** `splink \u003chttps://github.com/moj-analytical-services/splink\u003e`_ **for a more accurate, scalable and performant solution**\n\nA Python package that allows the user to fuzzy match two pandas dataframes based on one or more common fields.\n\nfuzzymatcher uses ``sqlite3``'s Full Text Search to find potential matches.\n\nIt then uses `probabilistic record linkage \u003chttps://en.wikipedia.org/wiki/Record_linkage#Probabilistic_record_linkage\u003e`_ to score matches.\n\nFinally, it outputs a list of the matches it has found and their associated scores.\n\n\nInstallation\n------------\n\n``pip install fuzzymatcher``\n\nNote that you will need a build of sqlite which includes FTS4.  
This seems to be widely included by default, but otherwise `see here \u003chttps://www.sqlite.org/fts3.html#compiling_and_enabling_fts3_and_fts4\u003e`_.\n\nUsage\n-----\n\nSee `examples.ipynb \u003chttps://github.com/RobinL/fuzzymatcher/blob/master/examples.ipynb\u003e`_ for examples of usage and the output.\n\nYou can run these examples interactively `here \u003chttps://mybinder.org/v2/gh/RobinL/fuzzymatcher/master?filepath=examples.ipynb\u003e`_.\n\nSimple example\n--------------\n\nSuppose you have a table called ``df_left`` which looks like this:\n\n====  =============\n  id  ons_name\n====  =============\n   0  Darlington\n   1  Monmouthshire\n   2  Havering\n   3  Knowsley\n   4  Charnwood\n ...  etc.\n====  =============\n\nAnd you want to link it to a table ``df_right`` that looks like this:\n\n====  =========================\n  id  os_name\n====  =========================\n   0  Darlington (B)\n   1  Havering London Boro\n   2  Sir Fynwy - Monmouthshire\n   3  Knowsley District (B)\n   4  Charnwood District (B)\n ...  etc.\n====  =========================\n\nYou can write:\n\n.. code:: python\n\n  import fuzzymatcher\n  fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on = \"ons_name\", right_on = \"os_name\")\n\nAnd you'll get:\n\n==================  =============  =========================\n  best_match_score  ons_name       os_name\n==================  =============  =========================\n          0.178449  Darlington     Darlington (B)\n          0.133371  Monmouthshire  Sir Fynwy - Monmouthshire\n          0.102473  Havering       Havering London Boro\n          0.155775  Knowsley       Knowsley District (B)\n          0.155775  Charnwood      Charnwood District (B)\n               ...  etc.           
etc.\n==================  =============  =========================\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRobinL%2Ffuzzymatcher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FRobinL%2Ffuzzymatcher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FRobinL%2Ffuzzymatcher/lists"}