https://github.com/veldhub/veld_data__apis_oebl__ner_gold
Data velds encapsulating NLP / NER gold data.
https://github.com/veldhub/veld_data__apis_oebl__ner_gold
gold-data named-entity-recognition ner nlp spacy spacy-nlp spacy-nlp-ner training-data
Last synced: 3 months ago
JSON representation
Data velds encapsulating NLP / NER gold data.
- Host: GitHub
- URL: https://github.com/veldhub/veld_data__apis_oebl__ner_gold
- Owner: veldhub
- License: mit
- Created: 2024-12-04T14:00:44.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-01-20T19:29:50.000Z (5 months ago)
- Last Synced: 2025-01-30T10:18:32.377Z (5 months ago)
- Topics: gold-data, named-entity-recognition, ner, nlp, spacy, spacy-nlp, spacy-nlp-ner, training-data
- Homepage:
- Size: 1.19 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
#  veld_data__apis_oebl__ner_gold
Randomized sentences from historical German biographies, containing named entities.
## Source of data
The original data was extracted from the [Austrian Biographical Lexicon
(ÖBL)](https://www.oeaw.ac.at/acdh/oebl) in the context of the [Austrian Prosopographical
Information System (APIS) project](https://www.oeaw.ac.at/acdh/projects/completed-projects/apis).From there, samples were randomly pulled and annotated for Named Entity Recognition tasks, which
form this dataset.The texts concern numerous smaller biographies in the time period between 19th and early 20th
century within historical Austria-Hungary, and were produced by the [Austrian Acadamey of
Sciences](https://www.oeaw.ac.at/en) between 1957 and 2023.The language style is rather condensed and contains a lot of domain-specific abbreviations.
The contained NER tags are `PER` (Person), `ORG` (Organisation), `LOC` (Location).
Extracted and transformed from: https://gitlab.oeaw.ac.at/acdh-ch/apis/spacy-ner in the context of a
VELD chain: https://github.com/steffres/veld_chain_6__apis_ner_transform_to_conll2003In the original spacy-ner repo, several data sets are split into corpus / training / eval sets,
scattered across folders with their respective nlp models, and also spread across different formats
(txt, json, pickle). In order to reuse this data, it was extracted, harmonized, cleaned,
deduplicated and transformed it into one consistent data source as json files, where only texts and
the indices and tags of their contained named entities are persisted.