Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/nickcrews/spacy-address
Parse oneline US addresses using a spaCy NER model trained on OSM data
https://github.com/nickcrews/spacy-address
address address-parsing osm osm-data spacy spacy-nlp usaddress
Last synced: 3 months ago
JSON representation
Parse oneline US addresses using a spaCy NER model trained on OSM data
- Host: GitHub
- URL: https://github.com/nickcrews/spacy-address
- Owner: NickCrews
- License: mit
- Created: 2024-10-06T20:34:39.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-16T21:47:48.000Z (3 months ago)
- Last Synced: 2024-10-18T19:15:59.443Z (3 months ago)
- Topics: address, address-parsing, osm, osm-data, spacy, spacy-nlp, usaddress
- Language: Jupyter Notebook
- Homepage:
- Size: 458 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# spacy-address
Use [spaCy](https://spacy.io/)'s NER pipeline to parse oneline US addresses
Uses the the labeled data from [usaddress](https://github.com/datamade/usaddress)
with spaCy's very easy [training infrastructure](https://spacy.io/usage/training)Inspired by the code and blog from https://github.com/swapnil-saxena/address-parser.
## Usage
There are currently two models, `en-us-address-ner-sm` and `en-us-address-ner-lg`,
following the naming conventions for small and large that spaCy uses.### en-us-address-ner-sm
You probably want this one. Much better efficiency for not much worse accuracy.
As of 2024-10-06:
- F1 score for NER of .978
- model size of ~5MB
- on my 2021 M1 MacBook Pro, tags 1000 addresses in ~.2 sec### en-us-address-ner-lg
Much larger and slower, a little more accurate.
As of 2024-10-06:
- F1 score for NER of .982
- model size of ~420MB
- on my 2021 M1 MacBook Pro, tags 1000 addresses in ~2 secYou can find the released models in various [github releases](https://github.com/NickCrews/spacy-address/releases).
There, you can see the most up to date model size and F1 score.
The speed isn't reported anywhere easily, unfortunately.You can install from a release directly with pip:
```bash
python -m pip install "en-us-address-ner-sm @ https://github.com/NickCrews/spacy-address/releases/download/20241007-072524-sm/en_us_address_ner_sm-0.0.0-py3-none-any.whl"
```Now, this is accessible from python:
```python
import spacynlp = spacy.load("en-us-address-ner-sm")
doc = nlp("CO John SMITH, 123 E St elias stree S, Oklahoma City, OK 99507-1234")
for ent in doc.ents:
print(f"{ent.text} ({ent.label_})")
# CO John SMITH (Recipient)
# 123 (AddressNumber)
# E (StreetNamePreDirectional)
# St elias (StreetName) # St isn't confused as an abbreviation for street!
# stree (StreetNamePostType) # Typos are tagged correctly!
# S (StreetNamePostDirectional)
# Oklahoma City (PlaceName) # Oklahoma isn't confused as a state!
# OK (StateName)
# 99507-1234 (ZipCode)
```This uses the tags from the
"United States Thoroughfare, Landmark, and Postal Address Data Standard (Publication 28)".
See [labels.py](./labels.py) for details## Licence
Released under the MIT license.