https://github.com/dedupeio/address-matching
Python script for matching a list of messy addresses against a gazetteer using dedupe.
- Host: GitHub
- URL: https://github.com/dedupeio/address-matching
- Owner: dedupeio
- Created: 2014-02-17T20:11:57.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2020-03-31T20:47:01.000Z (about 6 years ago)
- Last Synced: 2025-03-28T19:53:40.312Z (about 1 year ago)
- Language: Python
- Size: 48.9 MB
- Stars: 62
- Watchers: 8
- Forks: 19
- Open Issues: 5
Metadata Files:
- Readme: README.md
README
address-matching
================
Python script for matching a list of messy addresses against a gazetteer using dedupe. It can also work as a rough geocoder if your gazetteer includes latitude/longitude information.
Part of the [Dedupe.io](https://dedupe.io/) cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data.
## Setup
Here's how to get this script working, even if you don't already have dedupe installed.
```bash
git clone git@github.com:dedupeio/address-matching.git
cd address-matching
pip install "numpy>=1.6"
pip install -r requirements.txt
```
## Gazetteer
You will need a Gazetteer of all unique addresses in a given area. For this example, we used the [Cook County Address Point shapefile](https://datacatalog.cookcountyil.gov/GIS-Maps/ccgisdata-Address-Point-Chicago/jev2-4wjs).
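To illustrate how a gazetteer with coordinates can double as a geocoding lookup, here is a minimal sketch. It assumes the address points have already been exported to CSV; the column names (`id`, `address`, `lat`, `lon`), the helper functions, and the sample rows are all hypothetical and are not taken from the Cook County data or from this repository's script.

```python
import csv
import io

# Made-up sample rows for illustration; a real export will have
# different column names and values.
SAMPLE_GAZETTEER_CSV = """id,address,lat,lon
g1,123 N MAIN ST,41.8800,-87.6300
g2,456 W OAK AVE,41.9000,-87.6500
"""


def load_gazetteer(csv_file):
    """Read gazetteer rows into a dict keyed by record id."""
    gazetteer = {}
    for row in csv.DictReader(csv_file):
        gazetteer[row["id"]] = {
            "address": row["address"],
            "lat": float(row["lat"]),
            "lon": float(row["lon"]),
        }
    return gazetteer


def coordinates_for(gazetteer, record_id):
    """Return (lat, lon) for a matched gazetteer record, or None if absent."""
    record = gazetteer.get(record_id)
    if record is None:
        return None
    return (record["lat"], record["lon"])


gazetteer = load_gazetteer(io.StringIO(SAMPLE_GAZETTEER_CSV))
```

Once a messy address has been matched to a gazetteer id, `coordinates_for(gazetteer, matched_id)` gives its approximate location.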
## List of addresses you want to match
This program takes a list of addresses and matches them to individual records in the Gazetteer. For this example, we are using a messy list of early childhood education locations in Chicago. This file can have multiple entries referring to the same place.
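The idea of several messy entries resolving to one gazetteer record can be sketched with the standard library alone. This is not how dedupe works: it swaps in plain string similarity from `difflib` as a stand-in for dedupe's learned matching, and the `normalize` and `best_match` helpers, the 0.6 threshold, and the sample addresses are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(address):
    """Uppercase, drop punctuation, and collapse repeated whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", address.upper())
    return " ".join(cleaned.split())


def best_match(messy_address, gazetteer, threshold=0.6):
    """Return the id of the most similar gazetteer address, or None.

    Similarity is difflib's ratio, a crude stand-in for a trained model.
    """
    target = normalize(messy_address)
    best_id, best_score = None, 0.0
    for record_id, canonical in gazetteer.items():
        score = SequenceMatcher(None, target, normalize(canonical)).ratio()
        if score > best_score:
            best_id, best_score = record_id, score
    return best_id if best_score >= threshold else None


# Made-up sample data: two canonical addresses, four messy entries.
gazetteer = {"g1": "123 N MAIN ST", "g2": "456 W OAK AVE"}
messy = {
    "m1": "123 n. main street",
    "m2": "123 North Main St",
    "m3": "456 west oak ave.",
    "m4": "999 Unknown Rd",
}

matches = {messy_id: best_match(addr, gazetteer) for messy_id, addr in messy.items()}
```

Here `m1` and `m2` both resolve to `g1`, `m3` resolves to `g2`, and `m4` stays unmatched because nothing in the gazetteer is similar enough.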
## Usage
Once you have a Gazetteer and a messy input file, run `address_matching.py`:
```bash
python address_matching.py
```
You will be prompted to label some training pairs so that dedupe can learn which records refer to the same address. [More on this here](https://github.com/datamade/dedupe/blob/master/README.md#training).
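As a rough picture of what labeling produces, the sketch below sorts candidate pairs into matches and non-matches. It is a hypothetical helper, not dedupe's own labeling code; the `match`/`distinct` layout is an assumption about how labeled pairs are typically organized, and the `label_pairs` and `ask_on_console` names are invented for this example.

```python
def label_pairs(candidate_pairs, ask):
    """Sort candidate record pairs into 'match' and 'distinct' buckets.

    `ask` is any callable taking (messy_record, gazetteer_record) and
    returning 'y' (same address), 'n' (different), or 'u' (unsure).
    """
    labeled = {"match": [], "distinct": []}
    for messy_record, gazetteer_record in candidate_pairs:
        answer = ask(messy_record, gazetteer_record)
        if answer == "y":
            labeled["match"].append((messy_record, gazetteer_record))
        elif answer == "n":
            labeled["distinct"].append((messy_record, gazetteer_record))
        # Any other answer, including 'u', leaves the pair unlabeled.
    return labeled


def ask_on_console(messy_record, gazetteer_record):
    """Show both records and read a y/n/u answer from the terminal."""
    print("Messy:    ", messy_record)
    print("Gazetteer:", gazetteer_record)
    prompt = "Do these refer to the same address? (y)es / (n)o / (u)nsure: "
    return input(prompt).strip().lower()
```

Calling `label_pairs(pairs, ask_on_console)` would run an interactive session; passing a different `ask` callable makes the same logic easy to script.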
The output will be saved to `address_matching_output.csv`.