https://github.com/izuna385/wikia-and-wikipedia-el-dataset-creator

You can create datasets from Wikia/Wikipedia that can be used for entity recognition and Entity Linking. Dumps for ja-wiki and VTuber-wiki are available!

deep-learning entity-linking named-entity-recognition natural-language-processing natural-language-understanding spacy wikipedia

# Wikia/Wikipedia-NER-and-EL-Dataset-Creator
* You can create datasets from Wikia/Wikipedia that can be used for *both* entity recognition and entity linking.

* A sample dataset is available [here](https://drive.google.com/drive/folders/1gvqrj9f4IVi3lscwsa_EdAp0I4CpNTAe?usp=sharing). See also [preprocessed data examples](#preprocessed-data-example-from-wikia).

## Sample ja-wiki dataset.

* [Here](https://drive.google.com/drive/folders/1fOKw46uljEDJi3ezfDHQ9JBg5PoWLifo?usp=sharing)

## Create en-wiki dataset.

* Work in progress on the `feature/FixEnParseBug` branch.

## Environment Setup for Preprocessing.
```
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
# Then install wikiextractor==3.0.5 from source (https://github.com/attardi/wikiextractor) to enable the --json option.
```
## Dataset Preparation
### For Wikia
* Download `[worldname]_pages_current.xml` from the Wikia's statistics page into `./dataset/`.

* For example, if you are interested in Virtual YouTubers, download the `virtualyoutuber_pages_current.xml` dump from [here](https://virtualyoutuber.fandom.com/wiki/Special:Statistics).
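
To get a feel for what such a dump contains, the page titles can be listed with the Python standard library. This is only a minimal sketch, not the repository's own parser: `iter_page_titles` is a hypothetical helper, and `SAMPLE_DUMP` is a tiny hand-written stand-in for a real `*_pages_current.xml` file.

```python
import io
import xml.etree.ElementTree as ET


def iter_page_titles(xml_source):
    """Yield the <title> text of every <page> in a MediaWiki XML export."""
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        # Export files put every tag in a versioned XML namespace,
        # so compare only the local part of the tag name.
        if elem.tag.rsplit("}", 1)[-1] != "page":
            continue
        for child in elem:
            if child.tag.rsplit("}", 1)[-1] == "title":
                yield child.text
                break
        elem.clear()  # drop the children of pages that were already handled


# Hand-written stand-in for a real dump (assumption, for illustration only).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Nijisanji</title><revision><text>...</text></revision></page>
  <page><title>Melissa Kinrenka</title><revision><text>...</text></revision></page>
</mediawiki>"""

titles = list(iter_page_titles(io.BytesIO(SAMPLE_DUMP)))
print(titles)
```

For a real dump, pass the file path (for example `./dataset/virtualyoutuber_pages_current.xml`) instead of the in-memory sample; `iterparse` streams the file, so the whole dump does not need to fit in memory.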

### For Wikipedia
* Download a Wikipedia dump from [here (en)](https://dumps.wikimedia.org/enwiki/) or [here (ja)](https://dumps.wikimedia.org/jawiki/) and decompress the bzip2 file.
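
If the `bunzip2` command is not at hand, the decompression step can be sketched in Python with the standard `bz2` module. `decompress_bz2` and the file names below are hypothetical; the demo compresses a tiny placeholder payload so that it runs without a real dump.

```python
import bz2
import shutil
import tempfile
from pathlib import Path


def decompress_bz2(src: Path, dst: Path, chunk_size: int = 1024 * 1024) -> None:
    """Stream-decompress a .bz2 file without loading it into memory."""
    with bz2.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)


# Demo on a throwaway file; a real dump would be several gigabytes.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample-pages-articles.xml.bz2"  # hypothetical name
    dst = Path(tmp) / "sample-pages-articles.xml"
    src.write_bytes(bz2.compress(b"<mediawiki></mediawiki>"))
    decompress_bz2(src, dst)
    restored = dst.read_bytes()

print(restored)
```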

## Sample Script for Creating EL Dataset.
```
$ sh ./scripts/vtuber.sh
```

### Parameters for Creating Dataset
* `-augmentation_with_title_set_string_match` (Default: `True`)

  * When `True`, a title set is first built from all pages in one Wikia `.xml` dump. Any string in the text that matches a title in this set is then treated as an annotated mention.

* `-in_document_augmentation_with_its_title` (Default: `True`)

  * When `True`, mentions of a page's own title inside that page are added as extra annotations by distant supervision.

  * For example, [the page of *Anakin Skywalker*](https://starwars.fandom.com/wiki/Anakin_Skywalker) mentions him without an anchor link, as *Anakin* or *Skywalker*.

  * With this parameter on, these mentions are treated as annotated ones.

* `-spacy_model` (Default: `en_core_web_md`)

  * Specifies the spaCy model used for sentence boundary detection (SBD).

  * Note: SBD with spaCy is performed only when `-multiprocessing` is `False`.

* `-language` (Default: `en`)

  * Specifies the language of the documents.

  * When `en` is selected and `-multiprocessing` is `False`, [spaCy](https://github.com/explosion/spaCy) is used for SBD.

  * When `en` is selected and `-multiprocessing` is `True`, [pysbd](https://github.com/nipunsadvilkar/pySBD) is used for SBD.

  * When `ja` is selected, [konoha](https://github.com/himkt/konoha/) is used for SBD.

* `-multiprocessing` (Default: `False`)

  * If `True`, the documents produced by wikiextractor are processed in parallel.
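
The two augmentation options can be pictured as string matching against a table of surface forms. The following is a rough sketch under stated assumptions, not the repository's implementation: `match_surface_forms` is a hypothetical helper, splitting a title on whitespace to obtain *Anakin* and *Skywalker* is an assumption, and the Anakin sentence is made up for the demo.

```python
import re
from typing import Dict, List


def match_surface_forms(sentence: str, surface_to_title: Dict[str, str]) -> List[dict]:
    """Annotate non-overlapping matches of known surface forms in a sentence."""
    annotations = []
    taken = [False] * len(sentence)
    # Try longer surface forms first so "Anakin Skywalker" beats "Anakin".
    for surface in sorted(surface_to_title, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for m in re.finditer(pattern, sentence):
            if any(taken[m.start():m.end()]):
                continue  # overlaps a longer match that was already accepted
            taken[m.start():m.end()] = [True] * (m.end() - m.start())
            annotations.append({
                "annotation_doc_entity_title": surface_to_title[surface],
                "mention": m.group(0),
                "original_sentence": sentence,
                "original_sentence_mention_start": m.start(),
                "original_sentence_mention_end": m.end(),
            })
    return sorted(annotations, key=lambda a: a["original_sentence_mention_start"])


# (1) Title-set string match: every page title maps to itself.
title_set = {"Nijisanji", "Melissa Kinrenka"}
title_hits = match_surface_forms(
    "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
    {title: title for title in title_set},
)

# (2) In-document augmentation: parts of the page title map back to the page.
page_title = "Anakin Skywalker"
in_doc_forms = {page_title: page_title}
in_doc_forms.update({part: page_title for part in page_title.split()})
in_doc_hits = match_surface_forms("Anakin left Tatooine as a boy.", in_doc_forms)

for hit in title_hits + in_doc_hits:
    print(hit["mention"], "->", hit["annotation_doc_entity_title"])
```

Plain string matching of this kind produces noisy labels (a common word that happens to be a page title will also match), which is why the result is best thought of as distant supervision rather than gold annotation.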

## License
* The dataset was constructed from Wikias (hosted on FANDOM) and Wikipedia. It is licensed under the Creative Commons Attribution-ShareAlike License (CC BY-SA).

## Preprocessed data example from [Wikia](https://www.wikia.org/).
* [data](https://drive.google.com/drive/folders/1gvqrj9f4IVi3lscwsa_EdAp0I4CpNTAe?usp=sharing)

### `annotation.json`
| key | content |
| ------------------------------- | ------------------------------------------------------------------------------------ |
| `document_title` | Title of the page where the annotation exists. |
| `anchor_sent` | Sentence with the mention wrapped in `<a>` and `</a>` anchor tags. This anchored form can be used for entity linking. |
| `annotation_doc_entity_title` | Title of the entity the mention should be linked to once disambiguated. Redirects are also taken into account. |
| `mention` | Surface form of the mention as it appears in the sentence. |
| `original_sentence` | Sentence without anchors. |
| `original_sentence_mention_start` | Start position of the mention span in the original sentence. |
| `original_sentence_mention_end` | End position of the mention span in the original sentence. |

* For instance, a real-world example of `annotation.json` from the [virtualyoutuber wikia](https://virtualyoutuber.fandom.com/) is shown below.

```json
[
  {
    "document_title": "Melissa Kinrenka",
    "anchor_sent": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of <a> Nijisanji </a>.",
    "annotation_doc_entity_title": "Nijisanji",
    "mention": "Nijisanji",
    "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
    "original_sentence_mention_start": 75,
    "original_sentence_mention_end": 84
  },
  {
    "document_title": "Melissa Kinrenka",
    "anchor_sent": "<a> Melissa Kinrenka </a> (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
    "annotation_doc_entity_title": "Melissa Kinrenka",
    "mention": "Melissa Kinrenka",
    "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
    "original_sentence_mention_start": 0,
    "original_sentence_mention_end": 16
  },
  ...
]
...

```
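
A quick sanity check on such records is to slice `original_sentence` with the two offsets and compare the result with `mention`. The sketch below does that and also rebuilds an anchored sentence from the offsets. `span_matches_mention` and `add_anchor` are hypothetical helpers, and the `<a>` / `</a>` markers with surrounding spaces are an assumption about the anchor format.

```python
records = [
    {
        "mention": "Nijisanji",
        "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "original_sentence_mention_start": 75,
        "original_sentence_mention_end": 84,
    },
    {
        "mention": "Melissa Kinrenka",
        "original_sentence": "Melissa Kinrenka (メリッサ・キンレンカ) is a Japanese Virtual YouTuber and member of Nijisanji.",
        "original_sentence_mention_start": 0,
        "original_sentence_mention_end": 16,
    },
]


def span_matches_mention(record: dict) -> bool:
    """True if the offsets cut exactly the mention out of the sentence."""
    start = record["original_sentence_mention_start"]
    end = record["original_sentence_mention_end"]  # exclusive, as in Python slices
    return record["original_sentence"][start:end] == record["mention"]


def add_anchor(record: dict, open_tag: str = "<a>", close_tag: str = "</a>") -> str:
    """Wrap the mention span in anchor markers (marker format is assumed)."""
    sent = record["original_sentence"]
    start = record["original_sentence_mention_start"]
    end = record["original_sentence_mention_end"]
    return f"{sent[:start]}{open_tag} {sent[start:end]} {close_tag}{sent[end:]}"


all_ok = all(span_matches_mention(r) for r in records)
anchored = [add_anchor(r) for r in records]
print(all_ok)
print(anchored[0])
```

In the two sample records the offsets are character positions with an exclusive end, which is what the slice `sentence[start:end]` expects.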
### `doc_title2sents.json`
* Maps each redirect-resolved title to its description, split into sentences.
```json
{
  "Furen E Lustario": [
    "Furen E Lustario (フレン・E・ルスタリオ) is a female Japanese Virtual YouTuber and member of Nijisanji.",
    "A female knight of the Corvus Empire.",
    "Introduction Video.",
    "Furen's introduction.",
    "Personality.",
    "Furen lacks a surprising amount of common sense.",
    "It has been displayed in at least two streams that she cannot tell from left to right.",
    ...
  ],
  "Ibrahim": [
    "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
    "A former oil tycoon from the Corvus Empire.",
    "Since the value of oil has fallen, he now makes a living from a hot spring that he accidentally dug up.",
    "History.",
    "Background.",
    "Ibrahim made his YouTube debut on 1 February 2020.",
    ...
  ],
  ...
}
```
```
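
One way the two files could be combined for entity linking is to pair each annotated mention with the first sentences of its gold entity's page. This is a sketch under assumptions: `build_el_examples` and the output keys are hypothetical, and the two annotation records are hand-written in the format shown earlier rather than taken from the dataset.

```python
from typing import Dict, List

doc_title2sents: Dict[str, List[str]] = {
    "Ibrahim": [
        "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
        "A former oil tycoon from the Corvus Empire.",
        "Since the value of oil has fallen, he now makes a living from a hot spring that he accidentally dug up.",
    ],
}

# Hand-written records for illustration only.
annotations = [
    {
        "annotation_doc_entity_title": "Ibrahim",
        "mention": "Ibrahim",
        "original_sentence": "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
    },
    {
        "annotation_doc_entity_title": "Nijisanji",  # not in the excerpt above
        "mention": "Nijisanji",
        "original_sentence": "Ibrahim (イブラヒム) is a male Japanese Virtual YouTuber and a member of Nijisanji.",
    },
]


def build_el_examples(annotations, doc_title2sents, max_sents: int = 2) -> List[dict]:
    """Pair each mention with a short description of its gold entity."""
    examples = []
    for ann in annotations:
        title = ann["annotation_doc_entity_title"]
        sents = doc_title2sents.get(title)
        if sents is None:
            continue  # no description available for this entity
        examples.append({
            "mention": ann["mention"],
            "context": ann["original_sentence"],
            "gold_title": title,
            "gold_description": " ".join(sents[:max_sents]),
        })
    return examples


examples = build_el_examples(annotations, doc_title2sents)
print(examples[0]["gold_description"])
```

Taking only the first few sentences keeps the description short; how many sentences are useful is a modelling choice, not something the dataset prescribes.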

## WIP
* Add an entity type for each entity to `doc_title2sents.json`.

## Contact
* `izuna385(_atmark)gmail.com`
* PRs and issues are welcome!