https://github.com/lcvriend/toponym-extraction
| thesis project | Toponym extraction from LexisNexis data using named entity recognition
case-study extracting-toponyms lexisnexis ner
- Host: GitHub
- URL: https://github.com/lcvriend/toponym-extraction
- Owner: lcvriend
- License: gpl-3.0
- Created: 2019-06-21T13:41:04.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2020-02-23T22:10:21.000Z (about 6 years ago)
- Last Synced: 2025-03-11T00:12:33.940Z (about 1 year ago)
- Topics: case-study, extracting-toponyms, lexisnexis, ner
- Language: Python
- Homepage: https://lcvriend.github.io/toponym-extraction/
- Size: 7.79 MB
- Stars: 3
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Toponym extraction
[Case study](https://lcvriend.github.io/toponym_extraction/)
[Launch on Binder](https://mybinder.org/v2/gh/lcvriend/toponym_extraction/master?filepath=notebooks%2Fexplore_data.ipynb)
This repo contains:
1. [Tools](#tools) for extracting toponyms (and lemmata) from newspaper articles downloaded from LexisNexis.
2. The [results](#results) collected with these tools for a study of toponyms in Brexit news coverage in Dutch newspapers.
3. A short write-up of this [case study](https://lcvriend.github.io/toponym_extraction/). Check out the interactive map [here](https://lcvriend.github.io/toponym_extraction/map_toponyms.html).
## Workflow

## Tools
Three main scripts were used to generate the data for this case study. Each script contains further documentation on how it should be used:
- **Build NER model**: [Create a spaCy NER model for extracting toponyms](scripts/01_create_model.py)
- **Build data set**: [Extract text and metadata from LexisNexis files](scripts/02_textraction.py)
- **Extract toponyms**: [Apply the model to the data set and extract statistics from it](scripts/03_spacify.py)
The `PhraseAnnotator` in [annotation_tools](src/annotation_tools.py) can be used to annotate the NER results.
## Results
The tools currently extract two main statistics for each geographical category defined in the `[MODEL]` section of [config.ini](config.ini):
1. Total frequency
2. Article counts
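The difference between the two statistics can be illustrated with a minimal sketch. It assumes that "total frequency" means the number of times a category is mentioned across all articles, and that "article counts" means the number of articles mentioning the category at least once. The function name and the category labels are hypothetical and are not taken from the scripts or from `config.ini`.

```python
from collections import Counter


def category_statistics(articles):
    """Return (total_frequency, article_counts) per category.

    `articles` is an iterable of lists, one list per article, holding the
    category label of every toponym recognized in that article.
    """
    total_frequency = Counter()
    article_counts = Counter()
    for labels in articles:
        total_frequency.update(labels)      # every mention counts
        article_counts.update(set(labels))  # each article counts once per category
    return total_frequency, article_counts


# Hypothetical labels; the real categories are defined in config.ini.
articles = [
    ["places_uk", "places_uk", "places_eu"],
    ["places_eu"],
    [],
]
total, per_article = category_statistics(articles)
print(total["places_uk"], per_article["places_uk"])  # 2 1
print(total["places_eu"], per_article["places_eu"])  # 2 2
```

A category that is mentioned repeatedly in a single article raises the total frequency but not the article count, which is why the two numbers can diverge.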
These scripts generally store results in Python's [pickle](https://docs.python.org/3/library/pickle.html) format. To make the results of this study generally available, the following data have been added to the repo as CSV files (some of them zipped):
1. The metadata for the [LexisNexis dataset](data/lexisnexis_dataset.csv)
2. The statistics of the [toponym recognition](results/toponym_results.gz)
3. The statistics of the [lemmata recognition](results/lemmata_results.gz)
4. The [annotation data](annotations)
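The files listed above can also be read outside the notebook. The sketch below uses only the standard library and assumes that the `.gz` files are gzip-compressed CSV files with a header row. The helper name `read_results` is hypothetical, and no column names are assumed.

```python
import csv
import gzip
from pathlib import Path


def read_results(path):
    """Read a CSV file into a list of dicts, handling gzip compression."""
    path = Path(path)
    # Assumption: files ending in .gz are gzip-compressed CSV with a header row.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, mode="rt", encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


if __name__ == "__main__":
    # Path taken from the list above; run from the root of the repo.
    rows = read_results("results/toponym_results.gz")
    print(len(rows), "rows")
    print(list(rows[0].keys()) if rows else "no rows")
```

Printing the keys of the first row is a quick way to see which columns a file contains before exploring it further.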
The data and results are available through an online Jupyter notebook. Access the notebook by clicking this button:
[Launch on Binder](https://mybinder.org/v2/gh/lcvriend/toponym_extraction/master?filepath=notebooks%2Fexplore_data.ipynb)
Use [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) and [altair](https://altair-viz.github.io/index.html) to explore the data.