Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/izuna385/wikia-and-wikipedia-el-dataset-creator
You can create datasets from Wikia/Wikipedia that can be used for entity recognition and Entity Linking. Dumps for ja-wiki and VTuber-wiki are available!
JSON representation
- Host: GitHub
- URL: https://github.com/izuna385/wikia-and-wikipedia-el-dataset-creator
- Owner: izuna385
- License: other
- Created: 2021-04-11T10:27:45.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2021-05-02T10:40:22.000Z (over 3 years ago)
- Last Synced: 2024-12-25T18:50:26.391Z (18 days ago)
- Topics: deep-learning, entity-linking, named-entity-recognition, natural-language-processing, natural-language-understanding, spacy, wikipedia
- Language: Python
- Homepage: https://qiita.com/izuna385/items/2d1dfa623924d823f633
- Size: 104 KB
- Stars: 17
- Watchers: 1
- Forks: 2
- Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Wikia/Wikipedia-NER-and-EL-Dataset-Creator
* You can create datasets from Wikia/Wikipedia that can be used for both Named Entity Recognition and Entity Linking.
* A sample dataset is available [here](https://drive.google.com/drive/folders/1gvqrj9f4IVi3lscwsa_EdAp0I4CpNTAe?usp=sharing). See also [preprocessed data examples](#preprocessed-data-example-from-wikia).
## Sample ja-wiki dataset.
* [Here](https://drive.google.com/drive/folders/1fOKw46uljEDJi3ezfDHQ9JBg5PoWLifo?usp=sharing)
## Create en-wiki dataset.
* In progress on the `feature/FixEnParseBug` branch.
## Environment Setup for Preprocessing.
```
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
$ (install wikiextractor==3.0.5 from source, https://github.com/attardi/wikiextractor, to enable the --json option)
```
## Dataset Preparation
### For Wikia
* Download `[worldname]_pages_current.xml` from the Wikia statistics page to `./dataset/`.
* For example, if you are interested in Virtual YouTubers, download the `virtualyoutuber_pages_current.xml` dump from [here](https://virtualyoutuber.fandom.com/wiki/Special:Statistics).
### For Wikipedia
* Download a Wikipedia dump from [here (en)](https://dumps.wikimedia.org/enwiki/) or [here (ja)](https://dumps.wikimedia.org/jawiki/) and unzip the bzip2 file.
## Sample Script for Creating EL Dataset.
```
$ sh ./scripts/vtuber.sh
```
### Parameters for Creating Dataset
* `-augmentation_with_title_set_string_match` (Default: `True`)
* When this parameter is `True`, we first construct a title set from all pages in one Wikia `.xml`. Then, whenever a string matches an entry in this title set, we treat that mention as annotated.
* `-in_document_augmentation_with_its_title` (Default: `True`)
* When this parameter is `True`, we add further annotations via distant supervision from the title of the page where the mention appears.
* For example, [the page of *Anakin Skywalker*](https://starwars.fandom.com/wiki/Anakin_Skywalker) mentions him without an anchor link, as *Anakin* or *Skywalker*.
* With this parameter on, we treat these mentions as annotated ones.
* `-spacy_model` (Default: `en_core_web_md`)
* Specify spaCy model for sentence boundary detection.
* Note: SBD with spaCy is conducted only when `-multiprocessing` is `False`.
* `-language` (Default: `en`)
* Specify the language of the documents.
* When `en` is selected and `-multiprocessing` is `False`, [spaCy](https://github.com/explosion/spaCy) is used for SBD.
* When `en` is selected and `-multiprocessing` is `True`, [pysbd](https://github.com/nipunsadvilkar/pySBD) is used for SBD.
* When `ja` is selected, [konoha](https://github.com/himkt/konoha/) is used for SBD.
* `-multiprocessing` (Default: `False`)
* If `True`, documents preprocessed with wikiextractor are processed in parallel.
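The two augmentation switches above boil down to exact string matching against a title inventory. Below is a minimal, hypothetical sketch of that idea; the function and variable names are illustrative and not the repository's actual code:

```python
def augment_with_title_set(sentences, title_set):
    """Mark every exact occurrence of a known page title as an annotated
    mention (a sketch of the described string-match augmentation)."""
    annotations = []
    for sent in sentences:
        for title in title_set:
            start = sent.find(title)
            while start != -1:
                end = start + len(title)
                annotations.append({
                    "mention": title,
                    "annotation_doc_entity_title": title,
                    "original_sentence": sent,
                    "original_sentence_mention_start": start,
                    "original_sentence_mention_end": end,
                })
                start = sent.find(title, end)  # continue past this match
    return annotations

titles = {"Nijisanji"}
anns = augment_with_title_set(
    ["Melissa Kinrenka is a member of Nijisanji."], titles
)
```

Distant supervision of this kind is noisy (ambiguous surface forms match too), which is why both switches can be turned off.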
## License
* The dataset was constructed from Wikias (from FANDOM) and Wikipedia. It is licensed under the Creative Commons Attribution-ShareAlike License (CC-BY-SA).
## Preprocessed data example from [Wikia](https://www.wikia.org/).
* [data](https://drive.google.com/drive/folders/1gvqrj9f4IVi3lscwsa_EdAp0I4CpNTAe?usp=sharing)
### `annotation.json`
| key | content |
| ------------------------------- | ------------------------------------------------------------------------------------ |
| `document_title` | Page title where the annotation exists. |
| `anchor_sent` | Sentence with the mention wrapped in anchor markers. This anchor can be used for Entity Linking. |
| `annotation_doc_entity_title` | The entity the mention links to after disambiguation. Redirects are also resolved. |
| `mention` | Surface form as it is in sentence where the mention appeared. |
| `original_sentence` | Sentence without anchors. |
| `original_sentence_mention_start` | Mention span start position in original sentence. |
| `original_sentence_mention_end` | Mention span end position in the original sentence. |

* For instance, a real-world example of `annotation.json` from the [virtualyoutuber Wikia](https://virtualyoutuber.fandom.com/):
```json
[
{
"document_title": "Melissa Kinrenka",
"anchor_sent": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji .",
"annotation_doc_entity_title": "Nijisanji",
"mention": "Nijisanji",
"original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
"original_sentence_mention_start": 75,
"original_sentence_mention_end": 84
},
{
"document_title": "Melissa Kinrenka",
"anchor_sent": " Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
"annotation_doc_entity_title": "Melissa Kinrenka",
"mention": "Melissa Kinrenka",
"original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
"original_sentence_mention_start": 0,
"original_sentence_mention_end": 16
},
...
]
```
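The offset fields are plain character positions, so they can be sanity-checked by slicing `original_sentence`. A small check using the first record above (offsets copied from the example):

```python
record = {
    "mention": "Nijisanji",
    "original_sentence": (
        "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese "
        "Virtual YouTuber and member of Nijisanji."
    ),
    "original_sentence_mention_start": 75,
    "original_sentence_mention_end": 84,
}

# Slicing the sentence with the stored offsets recovers the surface form.
span = record["original_sentence"][
    record["original_sentence_mention_start"]:
    record["original_sentence_mention_end"]
]
assert span == record["mention"]
```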
### `doc_title2sents.json`
* Maps each redirect-resolved title to its description sentences after sentence splitting.
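The file is a plain JSON object, so it can be consumed with the standard `json` module. A minimal sketch, using an inline sample in place of the real file (the file path would be dataset-specific):

```python
import json

# Inline sample standing in for the real doc_title2sents.json file.
raw = """
{
  "Ibrahim": [
    "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
    "A former oil tycoon from the Corvus Empire."
  ]
}
"""
doc_title2sents = json.loads(raw)

# The keys double as an entity inventory, e.g. a candidate title set for linking.
title_set = set(doc_title2sents)
definition = doc_title2sents["Ibrahim"][0]
```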
```json
{
"Furen E Lustario": [
"Furen E Lustario (フレン・E・ルスタリオ) is a female Japanese Virtual YouTuber and member of Nijisanji.",
"A female knight of the Corvus Empire.",
"Introduction Video.",
"Furen's introduction.",
"Personality.",
"Furen lacks a surprising amount of common sense.",
"It has been displayed in at least two streams that she cannot tell from left to right.",
...
],
"Ibrahim": [
"Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
"A former oil tycoon from the Corvus Empire.",
"Since the value of oil has fallen, he now makes a living from a hot spring that he accidentally dug up.",
"History.",
"Background.",
"Ibrahim made his YouTube debut on 1 February 2020.",
...
],
...
}
```
## WIP
* Add entity type to `doc_title2sents.json` for each entity.
## Contact
* `izuna385(_atmark)gmail.com`
* PRs and issues are welcome!