{"id":13585139,"url":"https://github.com/wikipedia2vec/wikipedia2vec","last_synced_at":"2025-04-07T06:32:42.755Z","repository":{"id":38271786,"uuid":"44961854","full_name":"wikipedia2vec/wikipedia2vec","owner":"wikipedia2vec","description":"A tool for learning vector representations of words and entities from Wikipedia","archived":false,"fork":false,"pushed_at":"2024-05-03T21:51:02.000Z","size":2530,"stargazers_count":949,"open_issues_count":8,"forks_count":103,"subscribers_count":34,"default_branch":"master","last_synced_at":"2025-03-08T21:31:32.289Z","etag":null,"topics":["embeddings","natural-language-processing","nlp","python","text-classification","wikipedia"],"latest_commit_sha":null,"homepage":"http://wikipedia2vec.github.io/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wikipedia2vec.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2015-10-26T10:37:06.000Z","updated_at":"2025-03-04T06:48:06.000Z","dependencies_parsed_at":"2022-07-11T00:16:11.036Z","dependency_job_id":"aaf47d72-e117-4b2b-8048-6aa1cf29835f","html_url":"https://github.com/wikipedia2vec/wikipedia2vec","commit_stats":null,"previous_names":[],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wikipedia2vec%2Fwikipedia2vec","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wikipedia2vec%2Fwikipedia2vec/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wikipedia2vec%2Fwikipedia2vec/releases","manifests_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wikipedia2vec%2Fwikipedia2vec/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wikipedia2vec","download_url":"https://codeload.github.com/wikipedia2vec/wikipedia2vec/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247607550,"owners_count":20965942,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["embeddings","natural-language-processing","nlp","python","text-classification","wikipedia"],"created_at":"2024-08-01T15:04:45.618Z","updated_at":"2025-04-07T06:32:42.716Z","avatar_url":"https://github.com/wikipedia2vec.png","language":"Python","readme":"# Wikipedia2Vec\n\n[![tests](https://github.com/wikipedia2vec/wikipedia2vec/actions/workflows/test.yml/badge.svg?branch=master)](https://github.com/wikipedia2vec/wikipedia2vec/actions/workflows/test.yml)\n[![pypi Version](https://img.shields.io/pypi/v/wikipedia2vec.svg?style=flat-square\u0026logo=pypi\u0026logoColor=white)](https://pypi.org/project/wikipedia2vec/)\n\nWikipedia2Vec is a tool used for obtaining embeddings (or vector representations) of words and entities (i.e., concepts that have corresponding pages in Wikipedia) from Wikipedia.\nIt is developed and maintained by [Studio Ousia](http://www.ousia.jp).\n\nThis tool enables you to learn embeddings of words and entities simultaneously, and places similar words and entities close to one another in a continuous vector space.\nEmbeddings can be easily trained by a single command with a publicly available Wikipedia dump as 
input.\n\nThis tool implements the [conventional skip-gram model](https://en.wikipedia.org/wiki/Word2vec) to learn the embeddings of words, and its extension proposed in [Yamada et al. (2016)](https://arxiv.org/abs/1601.01343) to learn the embeddings of entities.\n\nAn empirical comparison between Wikipedia2Vec and existing embedding tools (i.e., FastText, Gensim, RDF2Vec, and Wiki2vec) is available [here](https://arxiv.org/abs/1812.06280).\n\nDocumentation is available online at [http://wikipedia2vec.github.io/](http://wikipedia2vec.github.io/).\n\n## Basic Usage\n\nWikipedia2Vec can be installed via PyPI:\n\n```bash\n% pip install wikipedia2vec\n```\n\nWith this tool, embeddings can be learned by running a _train_ command with a Wikipedia dump as input.\nFor example, the following commands download the latest English Wikipedia dump and learn embeddings from this dump:\n\n```bash\n% wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2\n% wikipedia2vec train enwiki-latest-pages-articles.xml.bz2 MODEL_FILE\n```\n\nThen, the learned embeddings are written to _MODEL_FILE_.\nNote that this command can take many optional parameters.\nPlease refer to [our documentation](https://wikipedia2vec.github.io/wikipedia2vec/commands/) for further details.\n\n## Pretrained Embeddings\n\nPretrained embeddings for 12 languages (i.e., English, Arabic, Chinese, Dutch, French, German, Italian, Japanese, Polish, Portuguese, Russian, and Spanish) can be downloaded from [this page](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/).\n\n## Use Cases\n\nWikipedia2Vec has been applied to the following tasks:\n\n- Entity linking: [Yamada et al., 2016](https://arxiv.org/abs/1601.01343), [Eshel et al., 2017](https://arxiv.org/abs/1706.09147), [Chen et al., 2019](https://arxiv.org/abs/1911.03834), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681), [van Hulst et al., 2020](https://arxiv.org/abs/2006.01969).\n- Named entity recognition: [Sato et al., 
2017](http://www.aclweb.org/anthology/I17-2017), [Lara-Clares and Garcia-Serrano, 2019](http://ceur-ws.org/Vol-2421/eHealth-KD_paper_6.pdf).\n- Question answering: [Yamada et al., 2017](https://arxiv.org/abs/1803.08652), [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).\n- Entity typing: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960).\n- Text classification: [Yamada et al., 2018](https://arxiv.org/abs/1806.02960), [Yamada and Shindo, 2019](https://arxiv.org/abs/1909.01259), [Alam et al., 2020](https://link.springer.com/chapter/10.1007/978-3-030-61244-3_9).\n- Relation classification: [Poerner et al., 2020](https://arxiv.org/abs/1911.03681).\n- Paraphrase detection: [Duong et al., 2018](https://ieeexplore.ieee.org/abstract/document/8606845).\n- Knowledge graph completion: [Shah et al., 2019](https://aaai.org/ojs/index.php/AAAI/article/view/4162), [Shah et al., 2020](https://www.aclweb.org/anthology/2020.textgraphs-1.9/).\n- Fake news detection: [Singh et al., 2019](https://arxiv.org/abs/1906.11126), [Ghosal et al., 2020](https://arxiv.org/abs/2010.10836).\n- Plot analysis of movies: [Papalampidi et al., 2019](https://arxiv.org/abs/1908.10328).\n- Novel entity discovery: [Zhang et al., 2020](https://arxiv.org/abs/2002.00206).\n- Entity retrieval: [Gerritse et al., 2020](https://link.springer.com/chapter/10.1007%2F978-3-030-45439-5_7).\n- Deepfake detection: [Zhong et al., 2020](https://arxiv.org/abs/2010.07475).\n- Conversational information seeking: [Rodriguez et al., 2020](https://arxiv.org/abs/2005.00172).\n- Query expansion: [Rosin et al., 2020](https://arxiv.org/abs/2012.12065).\n\n## References\n\nIf you use Wikipedia2Vec in a scientific publication, please cite the following paper:\n\nIkuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, Yuji Matsumoto, [Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from 
Wikipedia](https://arxiv.org/abs/1812.06280).\n\n```\n@inproceedings{yamada2020wikipedia2vec,\n  title = \"{W}ikipedia2{V}ec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from {W}ikipedia\",\n  author={Yamada, Ikuya and Asai, Akari and Sakuma, Jin and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu and Matsumoto, Yuji},\n  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations},\n  year = {2020},\n  publisher = {Association for Computational Linguistics},\n  pages = {23--30}\n}\n```\n\nThe embedding model was originally proposed in the following paper:\n\nIkuya Yamada, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, [Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation](https://arxiv.org/abs/1601.01343).\n\n```\n@inproceedings{yamada2016joint,\n  title={Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation},\n  author={Yamada, Ikuya and Shindo, Hiroyuki and Takeda, Hideaki and Takefuji, Yoshiyasu},\n  booktitle={Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning},\n  year={2016},\n  publisher={Association for Computational Linguistics},\n  pages={250--259}\n}\n```\n\nThe text classification model implemented in [this example](https://github.com/wikipedia2vec/wikipedia2vec/tree/master/examples/text_classification) was proposed in the following paper:\n\nIkuya Yamada, Hiroyuki Shindo, [Neural Attentive Bag-of-Entities Model for Text Classification](https://arxiv.org/abs/1909.01259).\n\n```\n@inproceedings{yamada2019neural,\n  title={Neural Attentive Bag-of-Entities Model for Text Classification},\n  author={Yamada, Ikuya and Shindo, Hiroyuki},\n  booktitle={Proceedings of The 23rd SIGNLL Conference on Computational Natural Language Learning},\n  year={2019},\n  publisher={Association for Computational Linguistics},\n  pages = {563--573}\n}\n```\n\n## 
License\n\n[Apache License 2.0](http://www.apache.org/licenses/LICENSE-2.0)\n","funding_links":[],"categories":["Python"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwikipedia2vec%2Fwikipedia2vec","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwikipedia2vec%2Fwikipedia2vec","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwikipedia2vec%2Fwikipedia2vec/lists"}