An open API service indexing awesome lists of open source software.

https://github.com/wikipedia2vec/wikipedia2vec

A tool for learning vector representations of words and entities from Wikipedia
https://github.com/wikipedia2vec/wikipedia2vec

embeddings natural-language-processing nlp python text-classification wikipedia

Last synced: about 1 month ago
JSON representation

A tool for learning vector representations of words and entities from Wikipedia

Awesome Lists containing this project

README

        

# Wikipedia2Vec

[![tests](https://github.com/wikipedia2vec/wikipedia2vec/actions/workflows/test.yml/badge.svg?branch=master)](https://github.com/wikipedia2vec/wikipedia2vec/actions/workflows/test.yml)
[![pypi Version](https://img.shields.io/pypi/v/wikipedia2vec.svg?style=flat-square&logo=pypi&logoColor=white)](https://pypi.org/project/wikipedia2vec/)

Wikipedia2Vec is a tool used for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia.
It is developed and maintained by [Studio Ousia](http://www.ousia.jp).

This tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space.
Embeddings can be easily trained by a single command with a publicly available Wikipedia dump as input.

This tool implements the [conventional skip-gram model](https://en.wikipedia.org/wiki/Word2vec) to learn the embeddings of words, and its extension proposed in [Yamada et al. (2016)](https://arxiv.org/abs/1601.01343) to learn the embeddings of entities.

An empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available [here](https://arxiv.org/abs/1812.06280).

Documentation are available online at [http://wikipedia2vec.github.io/](http://wikipedia2vec.github.io/).

## Basic Usage

Wikipedia2Vec can be installed via PyPI:

```bash
% pip install wikipedia2vec
```

With this tool, embeddings can be learned by running a _train_ command with a Wikipedia dump as input.
For example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:

```bash
% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE
```

Then, the learned embeddings are written to _MODEL_FILE_.
Note that this command can take many optional parameters.
Please refer to [our documentation](https://wikipedia2vec.github.io/wikipedia2vec/commands/) for further details.

## Pretrained Embeddings

Pretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from [this page](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/).

## Use Cases

Wikipedia2Vec has been applied to the following tasks:

- Entity linking: [Yamada et al., 2016](https://arxiv.org/abs/1601.01343), [Eshel et al., 2017](https://arxiv.org/abs/1706.09147), [Chen et al., 2019](https://arxiv.org/abs/1911.03834), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681), [van Hulst et al., 2020](https://arxiv.org/abs/2006.01969).
- Named entity recognition: [Sato et al., 2017](http://www.aclweb.org/anthology/I17-2017), [Lara-Clares and Garcia-Serrano, 2019](http://ceur-ws.org/Vol-2421/eHealth-KD_paper_6.pdf).
- Question answering: [Yamada et al., 2017](https://arxiv.org/abs/1803.08652), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).
- Entity typing: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960).
- Text classification: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960), [Yamada and Shindo, 2019](https://arxiv.org/abs/1909.01259), [Alam et al., 2020](https://link.springer.com/chapter/10.1007/978-3-030-61244-3_9).
- Relation classification: [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).
- Paraphrase detection: [Duong et al., 2018](https://ieeexplore.ieee.org/abstract/document/8606845).
- Knowledge graph completion: [Shah et al., 2019](https://aaai.org/ojs/index.php/AAAI/article/view/4162), [Shah et al., 2020](https://www.aclweb.org/anthology/2020.textgraphs-1.9/).
- Fake news detection: [Singh et al., 2019](https://arxiv.org/abs/1906.11126), [Ghosal et al., 2020](https://arxiv.org/abs/2010.10836).
- Plot analysis of movies: [Papalampidi et al., 2019](https://arxiv.org/abs/1908.10328).
- Novel entity discovery: [Zhang et al., 2020](https://arxiv.org/abs/2002.00206).
- Entity retrieval: [Gerritse et al., 2020](https://link.springer.com/chapter/10.1007%2F978-3-030-45439-5_7).
- Deepfake detection: [Zhong et al., 2020](https://arxiv.org/abs/2010.07475).
- Conversational information seeking: [Rodriguez et al., 2020](https://arxiv.org/abs/2005.00172).
- Query expansion: [Rosin et al., 2020](https://arxiv.org/abs/2012.12065).

## References

If you use Wikipedia2Vec in a scientific publication, please cite the following paper:

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, [Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia](https://arxiv.org/abs/1812.06280).

```
@inproceedings{yamada2020wikipedia2vec,
title = "{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia",
author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},
booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},
year = {2020},
publisher = {Association for Computational Linguistics},
pages = {23--30}
}
```

The embedding model was originally proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, [Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation](https://arxiv.org/abs/1601.01343).

```
@inproceedings{yamada2016joint,
title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},
author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},
booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},
year={2016},
publisher={Association for Computational Linguistics},
pages={250--259}
}
```

The text classification model implemented in [this example](https://github.com/wikipedia2vec/wikipedia2vec/tree/master/examples/text_classification) was proposed in the following paper:

Ikuya Yamada, Hiroyuki Shindo, [Neural Attentive Bag-of-Entities Model for Text Classification](https://arxiv.org/abs/1909.01259).

```
@article{yamada2019neural,
title={Neural Attentive Bag-of-Entities Model for Text Classification},
author={Yamada, Ikuya and Shindo, Hiroyuki},
booktitle={Proceedings of The 23th SIGNLL Conference on Computational Natural Language Learning},
year={2019},
publisher={Association for Computational Linguistics},
pages = {563--573}
}
```

## License

[Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0)