Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/robinl/uk_address_matcher
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/robinl/uk_address_matcher
- Owner: RobinL
- License: mit
- Created: 2024-05-07T20:45:53.000Z (8 months ago)
- Default Branch: main
- Last Pushed: 2024-07-03T16:42:56.000Z (6 months ago)
- Last Synced: 2024-10-13T00:08:20.629Z (2 months ago)
- Language: Python
- Size: 3.32 MB
- Stars: 6
- Watchers: 1
- Forks: 1
- Open Issues: 4
Metadata Files:
- Readme: readme.md
- License: LICENSE
README
# Matching UK addresses using Splink
High performance address matching using a pre-trained [Splink](https://github.com/moj-analytical-services/splink) model.
Assuming you have two DuckDB dataframes in this format:
| unique_id | address_concat | postcode |
|-----------|------------------------------|-----------|
| 1 | 123 Fake Street, Faketown | FA1 2KE |
| 2 | 456 Other Road, Otherville | NO1 3WY |
| ... | ... | ... |

Match them with:
```python
from uk_address_matcher.cleaning_pipelines import (
    clean_data_using_precomputed_rel_tok_freq,
)
from uk_address_matcher.splink_model import _performance_predict

# df_1 and df_2 are the two DuckDB dataframes described above;
# con is assumed to be the DuckDB connection they belong to.
df_1_c = clean_data_using_precomputed_rel_tok_freq(df_1, con=con)
df_2_c = clean_data_using_precomputed_rel_tok_freq(df_2, con=con)

linker, predictions = _performance_predict(
    df_addresses_to_match=df_1_c,
    df_addresses_to_search_within=df_2_c,
    con=con,
    match_weight_threshold=-10,
    output_all_cols=True,
    include_full_postcode_block=True,
)
```

Initial tests suggest you can match ~1,000 addresses per second against a list of 30 million addresses on a laptop.
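
With a low `match_weight_threshold` such as the -10 used above, the predictions may contain several candidate matches for each input address. The sketch below keeps only the highest-scoring candidate per address. It works on plain dictionaries with made-up values rather than on the real output, and the column names `unique_id_l`, `unique_id_r` and `match_weight` are assumptions based on Splink's usual prediction output, so check them against what `_performance_predict` actually returns.

```python
def best_match_per_address(rows):
    """Keep the highest match_weight candidate for each unique_id_l.

    `rows` is an iterable of dicts; the column names are assumed,
    not confirmed against this package's output.
    """
    best = {}
    for row in rows:
        key = row["unique_id_l"]
        if key not in best or row["match_weight"] > best[key]["match_weight"]:
            best[key] = row
    return best


# Hypothetical prediction rows, for illustration only
rows = [
    {"unique_id_l": 1, "unique_id_r": 101, "match_weight": 12.4},
    {"unique_id_l": 1, "unique_id_r": 102, "match_weight": -3.1},
    {"unique_id_l": 2, "unique_id_r": 205, "match_weight": 7.9},
]

best = best_match_per_address(rows)
for unique_id, row in best.items():
    print(unique_id, "->", row["unique_id_r"], row["match_weight"])
```

On large outputs the same selection is probably better done in SQL inside DuckDB; the loop above is only meant to show the logic.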
Refer to [the example](example.py), which has detailed comments, for how to match your data.
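
If your source data is not yet in the three-column format shown in the table above, a small helper along these lines could shape it first. This is a minimal sketch in plain Python: the function name `to_matcher_row` and the idea of starting from separate address lines are hypothetical, and tidying the postcode here is an illustrative choice, not something the package is known to require.

```python
import re


def to_matcher_row(unique_id, address_lines, postcode):
    """Build one row with the columns unique_id, address_concat, postcode."""
    # Join the non-empty address lines with ", " as in the example table
    address_concat = ", ".join(
        line.strip() for line in address_lines if line and line.strip()
    )
    # Collapse repeated whitespace and upper-case the postcode
    postcode_clean = re.sub(r"\s+", " ", postcode.strip()).upper()
    return {
        "unique_id": unique_id,
        "address_concat": address_concat,
        "postcode": postcode_clean,
    }


row = to_matcher_row(1, ["123 Fake Street", "", " Faketown "], "fa1  2ke")
print(row)
```

Rows built this way would still need loading into DuckDB (for example via a pandas DataFrame) before being passed to the cleaning function.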
See [an example of comparing two addresses](example_compare_two.py) to get a sense of what the model does and how it scores a pair of addresses.
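
As background on the scores: Splink models express evidence as match weights, which are log2 odds. The weights from the individual comparisons are added to a prior weight, and the total `w` converts to a probability as `2^w / (1 + 2^w)`. The sketch below illustrates that arithmetic only. The comparison names and numbers are hypothetical and are not taken from this package's trained model.

```python
def total_match_weight(prior_weight, comparison_weights):
    """Sum a prior weight and per-comparison weights (all log2 odds)."""
    return prior_weight + sum(comparison_weights.values())


def match_weight_to_probability(match_weight):
    """Convert a match weight (log2 odds) to a probability."""
    odds = 2.0 ** match_weight
    return odds / (1.0 + odds)


# Hypothetical weights, for illustration only
comparison_weights = {"postcode": 6.0, "numeric_tokens": 4.5, "other_tokens": -1.5}
total = total_match_weight(-12.0, comparison_weights)

print(total)  # -3.0
print(round(match_weight_to_probability(total), 4))  # 0.1111
print(round(match_weight_to_probability(-10), 6))  # 0.000976
```

Read this way, a `match_weight_threshold` of -10 corresponds to a match probability of roughly 0.1%, so it keeps very weak candidates as well as strong ones.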
Run an interactive example in your browser:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RobinL/uk_address_matcher/blob/main/match_example_data.ipynb) Match 5,000 FHRS records to 21,952 Companies House records in under 10 seconds.
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/RobinL/uk_address_matcher/blob/main/interactive_comparison.ipynb) Investigate and understand how the model works.