https://github.com/RobinL/fuzzymatcher
Record linking package that fuzzy matches two Python pandas dataframes using sqlite3 fts4
- Host: GitHub
- URL: https://github.com/RobinL/fuzzymatcher
- Owner: RobinL
- License: mit
- Created: 2017-11-25T08:58:00.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2022-08-09T18:27:44.000Z (over 2 years ago)
- Last Synced: 2024-10-13T00:08:20.334Z (2 months ago)
- Topics: data-matching, fuzzy-matching, probabalistic-matching, pypi
- Language: Python
- Homepage:
- Size: 848 KB
- Stars: 281
- Watchers: 11
- Forks: 60
- Open Issues: 22
Metadata Files:
- Readme: README.rst
- License: LICENSE
.. image:: https://badge.fury.io/py/fuzzymatcher.svg
    :target: https://badge.fury.io/py/fuzzymatcher

.. image:: https://codecov.io/gh/RobinL/fuzzymatcher/branch/dev/graph/badge.svg
    :target: https://codecov.io/gh/RobinL/fuzzymatcher

fuzzymatcher
======================================

**Note: fuzzymatcher is no longer actively maintained. Please see** `splink `_ **for a more accurate, scalable and performant solution**

A Python package that allows the user to fuzzy match two pandas dataframes based on one or more common fields.

fuzzymatcher uses ``sqlite3``'s Full Text Search to find potential matches.

It then uses `probabilistic record linkage `_ to score matches.

Finally, it outputs a list of the matches it has found and their associated scores.
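
To illustrate the first step only, the sketch below uses Python's built-in ``sqlite3`` module to index a few names in an FTS4 table and look up candidates that share at least one token with a query string. The table name ``right_fts`` and the helper ``find_candidates`` are hypothetical, the sample names are illustrative, and this is not fuzzymatcher's actual implementation; it performs no scoring.

.. code:: python

    import re
    import sqlite3

    # Illustrative names to search against (assumed sample data).
    right_names = [
        "Darlington (B)",
        "Havering London Boro",
        "Sir Fynwy - Monmouthshire",
        "Knowsley District (B)",
        "Charnwood District (B)",
    ]

    con = sqlite3.connect(":memory:")
    # Requires an SQLite build that includes FTS4.
    con.execute("CREATE VIRTUAL TABLE right_fts USING fts4(name)")
    con.executemany(
        "INSERT INTO right_fts(name) VALUES (?)", [(n,) for n in right_names]
    )


    def find_candidates(con, name):
        """Return (docid, name) rows sharing at least one token with `name`."""
        tokens = re.findall(r"[A-Za-z0-9]+", name)
        if not tokens:
            return []
        # Quote each token and join with OR so any shared token is a hit.
        query = " OR ".join(f'"{t}"' for t in tokens)
        rows = con.execute(
            "SELECT docid, name FROM right_fts WHERE name MATCH ?", (query,)
        )
        return rows.fetchall()


    print(find_candidates(con, "Monmouthshire"))
    # [(3, 'Sir Fynwy - Monmouthshire')]

A candidate search like this only narrows the field; deciding which candidate is the best match is the job of the scoring step.
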

Installation
------------

``pip install fuzzymatcher``

Note that you will need a build of sqlite which includes FTS4. This seems to be widely included by default, but otherwise `see here `_.
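
One way to check this from Python is to try creating an FTS4 virtual table in an in-memory database. This is a minimal sketch; ``has_fts4`` is a hypothetical helper name and is not part of fuzzymatcher.

.. code:: python

    import sqlite3


    def has_fts4() -> bool:
        """Return True if this Python's SQLite build can create FTS4 tables."""
        con = sqlite3.connect(":memory:")
        try:
            # Raises OperationalError when the fts4 module is not compiled in.
            con.execute("CREATE VIRTUAL TABLE fts4_probe USING fts4(content)")
            return True
        except sqlite3.OperationalError:
            return False
        finally:
            con.close()


    print("FTS4 available:", has_fts4())
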

Usage
-----

See `examples.ipynb `_ for examples of usage and the output.

You can run these examples interactively `here `_.

Simple example
--------------

Suppose you have a table called ``df_left`` which looks like this:

==== =============
id   ons_name
==== =============
0    Darlington
1    Monmouthshire
2    Havering
3    Knowsley
4    Charnwood
...  etc.
==== =============

And you want to link it to a table ``df_right`` that looks like this:

==== =========================
id   os_name
==== =========================
0    Darlington (B)
1    Havering London Boro
2    Sir Fynwy - Monmouthshire
3    Knowsley District (B)
4    Charnwood District (B)
...  etc.
==== =========================

You can write:

.. code:: python

    import fuzzymatcher
    fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on = "ons_name", right_on = "os_name")

And you'll get:

================== ============= =========================
best_match_score   ons_name      os_name
================== ============= =========================
0.178449           Darlington    Darlington (B)
0.133371           Monmouthshire Sir Fynwy - Monmouthshire
0.102473           Havering      Havering London Boro
0.155775           Knowsley      Knowsley District (B)
0.155775           Charnwood     Charnwood District (B)
...                etc.          etc.
================== ============= =========================