https://github.com/nickcrews/spacy-address

Parse oneline US addresses using a spaCy NER model trained on OSM data
address address-parsing osm osm-data spacy spacy-nlp usaddress


Host: GitHub
URL: https://github.com/nickcrews/spacy-address
Owner: NickCrews
License: mit
Created: 2024-10-06T20:34:39.000Z (4 months ago)
Default Branch: main
Last Pushed: 2024-10-16T21:47:48.000Z (3 months ago)
Last Synced: 2024-10-18T19:15:59.443Z (3 months ago)
Topics: address, address-parsing, osm, osm-data, spacy, spacy-nlp, usaddress
Language: Jupyter Notebook
Homepage:
Size: 458 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# spacy-address

Use [spaCy](https://spacy.io/)'s NER pipeline to parse one-line US addresses.

Uses the labeled data from [usaddress](https://github.com/datamade/usaddress) with spaCy's very easy [training infrastructure](https://spacy.io/usage/training).

Inspired by the code and blog from https://github.com/swapnil-saxena/address-parser.

## Usage

There are currently two models, `en-us-address-ner-sm` and `en-us-address-ner-lg`, following the naming conventions for small and large that spaCy uses.

### en-us-address-ner-sm

You probably want this one. By the figures listed for the two models, it is roughly 10x faster and over 80x smaller than the large model, for a drop in F1 of about 0.004.

As of 2024-10-06:

- F1 score for NER of .978

- model size of ~5MB

- on my 2021 M1 MacBook Pro, tags 1000 addresses in ~.2 sec

### en-us-address-ner-lg

Much larger and slower, a little more accurate.

As of 2024-10-06:

- F1 score for NER of .982

- model size of ~420MB

- on my 2021 M1 MacBook Pro, tags 1000 addresses in ~2 sec

You can find the released models in the [GitHub releases](https://github.com/NickCrews/spacy-address/releases).
Each release shows the most up-to-date model size and F1 score.
Unfortunately, the tagging speed isn't reported anywhere that is easy to find.
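Since speed isn't published with the releases, one option is to measure it locally. The following is a minimal sketch, not part of this project: `addresses_per_second` is a hypothetical helper that times any callable that tags a batch of strings. With a loaded spaCy pipeline `nlp`, that callable could be `lambda texts: list(nlp.pipe(texts))`. A whitespace splitter stands in here so the sketch runs without a model installed.

```python
import time
from typing import Callable, Iterable, List


def addresses_per_second(
    tag_batch: Callable[[List[str]], object],
    addresses: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the best observed throughput of tag_batch, in addresses per second."""
    batch = list(addresses)
    if not batch:
        raise ValueError("need at least one address to time")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tag_batch(batch)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from a very fast callable or a coarse clock.
    return len(batch) / max(best, 1e-9)


# Stand-in tagger (an assumption for illustration): splits on whitespace.
# With a real model: addresses_per_second(lambda texts: list(nlp.pipe(texts)), sample)
sample = ["123 E St Elias Street S, Oklahoma City, OK 99507-1234"] * 1000
rate = addresses_per_second(lambda texts: [t.split() for t in texts], sample)
print(f"{rate:,.0f} addresses/sec")
```

Taking the best of several runs reduces the effect of one-off costs such as warm-up. The result depends heavily on hardware and batch size, so it will not necessarily match the figures quoted above.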

You can install from a release directly with pip:

```bash
python -m pip install "en-us-address-ner-sm @ https://github.com/NickCrews/spacy-address/releases/download/20241007-072524-sm/en_us_address_ner_sm-0.0.0-py3-none-any.whl"
```

The model can now be loaded from Python:

```python
import spacy

nlp = spacy.load("en-us-address-ner-sm")
doc = nlp("CO John SMITH, 123 E St elias stree S,   Oklahoma City, OK 99507-1234")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")
# CO John SMITH (Recipient)
# 123 (AddressNumber)
# E (StreetNamePreDirectional)
# St elias (StreetName)            # St isn't confused as an abbreviation for street!
# stree (StreetNamePostType)       # Typos are tagged correctly!
# S (StreetNamePostDirectional)
# Oklahoma City (PlaceName)        # Oklahoma isn't confused as a state!
# OK (StateName)
# 99507-1234 (ZipCode)
```
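Downstream code often wants the tagged spans keyed by label instead of a flat list. The sketch below is one way to do that. `group_by_label` is a hypothetical helper, not part of this project, and it uses lists as values on the assumption that a label may appear more than once in an address. The input pairs are copied from the output shown above. With a real `doc`, they would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_label(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs into a dict mapping each label to its texts, in order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Pairs taken from the example output above, so no model is needed to run this.
entities = [
    ("CO John SMITH", "Recipient"),
    ("123", "AddressNumber"),
    ("E", "StreetNamePreDirectional"),
    ("St elias", "StreetName"),
    ("stree", "StreetNamePostType"),
    ("S", "StreetNamePostDirectional"),
    ("Oklahoma City", "PlaceName"),
    ("OK", "StateName"),
    ("99507-1234", "ZipCode"),
]
parsed = group_by_label(entities)
print(parsed["PlaceName"])  # ['Oklahoma City']
print(parsed["ZipCode"])    # ['99507-1234']
```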

This uses the tags from the "United States Thoroughfare, Landmark, and Postal Address Data Standard (Publication 28)".
See [labels.py](./labels.py) for details.

## License

Released under the MIT license.