https://github.com/robinl/uk_address_matcher
https://github.com/robinl/uk_address_matcher
Last synced: 4 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/robinl/uk_address_matcher
- Owner: RobinL
- Created: 2024-05-07T20:45:53.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-20T14:49:11.000Z (4 months ago)
- Last Synced: 2025-03-20T15:38:16.458Z (4 months ago)
- Language: Python
- Size: 3.52 MB
- Stars: 12
- Watchers: 1
- Forks: 1
- Open Issues: 21
-
Metadata Files:
- Readme: readme.md
Awesome Lists containing this project
README
# High performance UK addresses matcher (geocoder)
Extremely fast address matching using a pre-trained [Splink](https://github.com/moj-analytical-services/splink) model.
```
Full time taken: 11.05 seconds
to match 176,640 messy addresses to 273,832 canonical addresses
at a rate of 15,008 addresses per second(On Macbook M4 Max)
```## Installation
At the moment this uses a branch of Splink only available on Github.
```bash
pip install --pre uk_address_matcher
```## Usage
High performance address matching using a pre-trained [Splink](https://github.com/moj-analytical-services/splink) model.
Assuming you have two duckdb dataframes in this format:
| unique_id | address_concat | postcode |
|-----------|------------------------------|-----------|
| 1 | 123 Fake Street, Faketown | FA1 2KE |
| 2 | 456 Other Road, Otherville | NO1 3WY |
| ... | ... | ... |### Basic Matching
Match them with:
```python
import duckdbfrom uk_address_matcher import (
clean_data_using_precomputed_rel_tok_freq,
get_linker,
best_matches_with_distinguishability,
improve_predictions_using_distinguishing_tokens,
)p_ch = "./example_data/companies_house_addresess_postcode_overlap.parquet"
p_fhrs = "./example_data/fhrs_addresses_sample.parquet"con = duckdb.connect(database=":memory:")
df_ch = con.read_parquet(p_ch).order("postcode")
df_fhrs = con.read_parquet(p_fhrs).order("postcode")df_ch_clean = clean_data_using_precomputed_rel_tok_freq(df_ch, con=con)
df_fhrs_clean = clean_data_using_precomputed_rel_tok_freq(df_fhrs, con=con)linker = get_linker(
df_addresses_to_match=df_fhrs_clean,
df_addresses_to_search_within=df_ch_clean,
con=con,
include_full_postcode_block=True,
additional_columns_to_retain=["original_address_concat"],
)# First pass - standard probabilistic linkage
df_predict = linker.inference.predict(
threshold_match_weight=-50, experimental_optimisation=True
)
df_predict_ddb = df_predict.as_duckdbpyrelation()# Second pass - improve predictions using distinguishing tokens
df_predict_improved = improve_predictions_using_distinguishing_tokens(
df_predict=df_predict_ddb,
con=con,
match_weight_threshold=-20,
)# Find best matches within group and compute distinguishability
best_matches = best_matches_with_distinguishability(
df_predict=df_predict_improved,
df_addresses_to_match=df_fhrs,
con=con,
)best_matches
```
### Two-Pass Matching Approach
The package uses a two-pass approach to achieve high accuracy matching:
1. **First Pass**: A standard probabilistic linkage model using Splink generates candidate matches for each input address.
2. **Second Pass**: Within each candidate group, the model analyzes distinguishing tokens to refine matches:
- Identifies tokens that uniquely distinguish addresses within a candidate group
- Detects "punishment tokens" (tokens in the messy address that don't match the current candidate but do match other candidates)
- Uses this contextual information to improve match scoresThis approach is particularly effective when matching to a canonical (deduplicated) address list, as it can identify subtle differences between very similar addresses.
Refer to [the example](example_matching.py), which has detailed comments, for how to match your data.
See [an example of comparing two addresses](example_compare_two.py) to get a sense of what it does/how it scores
Run an interactive example in your browser:
[](https://colab.research.google.com/github/RobinL/uk_address_matcher/blob/main/match_example_data.ipynb) Match 5,000 FHRS records to 21,952 companies house records in < 10 seconds.
[](https://colab.research.google.com/github/RobinL/uk_address_matcher/blob/main/interactive_comparison.ipynb) Investigate and understand how the model works
## Development
The scripts and tests will run better if you create .vscode/settings.json with the following:
```json
{
"jupyter.notebookFileRoot": "${workspaceFolder}",
"python.analysis.extraPaths": [
"${workspaceFolder}"
],
"python.testing.pytestEnabled": true,
"python.testing.unittestEnabled": false,
"python.testing.pytestArgs": [
"-v",
"--capture=tee-sys"
]
}
```