{"id":24491282,"url":"https://github.com/lcvriend/toponym-extraction","last_synced_at":"2025-09-15T08:38:56.652Z","repository":{"id":165291129,"uuid":"193104481","full_name":"lcvriend/toponym-extraction","owner":"lcvriend","description":"| thesis project | Toponym extraction from LexisNexis data using named entity recognition ","archived":false,"fork":false,"pushed_at":"2020-02-23T22:10:21.000Z","size":8173,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2025-03-11T00:12:33.940Z","etag":null,"topics":["case-study","extracting-toponyms","lexisnexis","ner"],"latest_commit_sha":null,"homepage":"https://lcvriend.github.io/toponym-extraction/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lcvriend.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-06-21T13:41:04.000Z","updated_at":"2025-01-09T12:40:20.000Z","dependencies_parsed_at":null,"dependency_job_id":"bb852d20-2d4c-4a38-b220-8bdc173daf58","html_url":"https://github.com/lcvriend/toponym-extraction","commit_stats":null,"previous_names":["lcvriend/toponym-extraction"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lcvriend%2Ftoponym-extraction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lcvriend%2Ftoponym-extraction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lcvriend%2Ftoponym-extraction/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/lcvriend%2Ftoponym-extraction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lcvriend","download_url":"https://codeload.github.com/lcvriend/toponym-extraction/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243673294,"owners_count":20328913,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["case-study","extracting-toponyms","lexisnexis","ner"],"created_at":"2025-01-21T18:17:40.675Z","updated_at":"2025-03-15T02:24:55.673Z","avatar_url":"https://github.com/lcvriend.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Toponym extraction\n\n[![Case Study](https://img.shields.io/badge/Repo-case_study-blue)](https://lcvriend.github.io/toponym_extraction/)\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lcvriend/toponym_extraction/master?filepath=notebooks%2Fexplore_data.ipynb)  \n\nThis repo contains:\n1. [Tools](#tools) for extracting toponyms (and lemmata) from newspaper articles downloaded from LexisNexis.\n2. The [results](#results) that were collected with these tools for a study of toponyms in Brexit news coverage in Dutch newspapers.\n3. A short write-up on this [case study](https://lcvriend.github.io/toponym_extraction/). 
Check out the interactive map [here](https://lcvriend.github.io/toponym_extraction/map_toponyms.html).\n\n## Workflow\n\n\u003cimg src=\"docs/illustrations/workflow.svg\" alt=\"Workflow\"\u003e\n\n## Tools\nThree main scripts were used to generate the data for this case study. Each script contains further documentation on how it should be used:\n- **Build NER model**: [Create a spaCy NER model for extracting toponyms](scripts/01_create_model.py)\n- **Build data set**: [Extract text and metadata from LexisNexis files](scripts/02_textraction.py)\n- **Extract toponyms**: [Apply the model to the data set and extract statistics from it](scripts/03_spacify.py)\n\nThe `PhraseAnnotator` in [annotation_tools](src/annotation_tools.py) can be used to annotate the NER results.\n\n## Results\nThis tool currently extracts two main statistics for each geographical category defined in the `[MODEL]` section of [config.ini](config.ini):\n1. Total frequency\n2. Article counts\n\nThe scripts generally store results in Python's [pickle](https://docs.python.org/3/library/pickle.html) format. To make the results of this study generally available, the following data have been added to the repo as CSV files (some zipped):\n1. The metadata for the [LexisNexis dataset](data/lexisnexis_dataset.csv)\n2. The statistics of the [toponym recognition](results/toponym_results.gz)\n3. The statistics of the [lemmata recognition](results/lemmata_results.gz)\n4. The [annotation data](annotations)\n\nThe data and results have been made available through an online Jupyter notebook. 
Access the notebook by clicking this button:  \n\n[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/lcvriend/toponym_extraction/master?filepath=notebooks%2Fexplore_data.ipynb)\n\nUse [pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) and [altair](https://altair-viz.github.io/index.html) to explore the data.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flcvriend%2Ftoponym-extraction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flcvriend%2Ftoponym-extraction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flcvriend%2Ftoponym-extraction/lists"}