https://github.com/piskvorky/gensim-data
Data repository for pretrained NLP models and NLP corpora.
- Host: GitHub
- URL: https://github.com/piskvorky/gensim-data
- Owner: piskvorky
- License: lgpl-2.1
- Created: 2017-10-13T18:22:15.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2018-03-16T13:55:53.000Z (almost 7 years ago)
- Last Synced: 2024-09-09T18:15:05.896Z (4 months ago)
- Topics: corpora, dataset, gensim, glove-model, lda-model, lsi-model, pretrained-models, word2vec-model
- Language: Python
- Homepage: https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
- Size: 66.4 KB
- Stars: 974
- Watchers: 39
- Forks: 131
- Open Issues: 20
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
Awesome Lists containing this project
- awesome-ai4lam - Gensim datasets
README
# What is Gensim-data for?
Research datasets regularly disappear, change over time, become obsolete, or come without a sane implementation for reading and processing their data format.
For this reason, [Gensim](https://github.com/RaRe-Technologies/gensim) launched its own dataset storage, committed to long-term support and a sane, standardized usage API, and focused on datasets for **unstructured text processing** (no images or audio). This [Gensim-data](https://github.com/RaRe-Technologies/gensim-data) repository serves as that storage.
**There's no need for you to use this repository directly**. Instead, simply install Gensim and use its download API (see the Quickstart below). It will "talk" to this repository automagically.
💡 When you use the Gensim download API, all data is stored in the `~/gensim-data` folder in your home directory.
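To check what has already been downloaded, you can inspect that folder directly. The sketch below assumes the layout suggested by the Quickstart output paths (one subfolder per dataset or model under `~/gensim-data`). The helper name `list_downloaded` is hypothetical and not part of Gensim.

```python
from pathlib import Path


def list_downloaded(base_dir=None):
    """Return the names of subfolders found in the local gensim-data folder.

    Hypothetical helper, not part of Gensim. Assumes one subfolder per
    downloaded dataset or model, e.g. ~/gensim-data/text8/.
    """
    base = Path(base_dir) if base_dir is not None else Path.home() / "gensim-data"
    if not base.is_dir():
        return []  # the folder does not exist yet, so nothing was downloaded
    # Keep directories only; plain files in the folder are ignored.
    return sorted(p.name for p in base.iterdir() if p.is_dir())


if __name__ == "__main__":
    print(list_downloaded())
```

Passing `base_dir` explicitly makes the helper easy to point at a different location, for example a scratch folder in a test.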
Read more about the project rationale and design decisions in this article: [New Download API for Pretrained NLP Models and Datasets](https://rare-technologies.com/new-download-api-for-pretrained-nlp-models-and-datasets-in-gensim/).
# How does it work?
Technically, the actual (sometimes large) corpora and model files are stored as [release attachments](https://github.com/RaRe-Technologies/gensim-data/releases) here on GitHub. Each dataset (and each new version of each dataset) gets its own release, forever immutable.
Each release is accompanied by a usage example and release notes, for example: [Corpus of USPTO Patents from 2017](https://github.com/RaRe-Technologies/gensim-data/releases/tag/patent-2017); [English Wikipedia from 2017 with plaintext section](https://github.com/RaRe-Technologies/gensim-data/releases/tag/wiki-english-20171001).
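The two example links follow the same pattern: the releases URL, then `/tag/`, then the dataset name. A minimal sketch of that pattern is shown below. The helper `release_page_url` is hypothetical, and the assumption that every dataset name maps to a release tag in this way is inferred only from those two links.

```python
from urllib.parse import quote

RELEASES_URL = "https://github.com/RaRe-Technologies/gensim-data/releases"


def release_page_url(name):
    """Build the release-notes URL for a dataset or model name.

    Hypothetical helper: the "/tag/<name>" pattern is inferred from the
    example links in this README, not from a documented guarantee.
    """
    # Percent-encode the name so unusual characters cannot break the URL.
    return f"{RELEASES_URL}/tag/{quote(name, safe='')}"


if __name__ == "__main__":
    print(release_page_url("patent-2017"))
```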
🔴 **Each dataset comes with its own license, which users should study carefully before using the dataset!**
----
## Quickstart
To load a model or corpus, use either the Python or command line interface of [Gensim](https://github.com/RaRe-Technologies/gensim) (you'll need Gensim installed first):
- **Python API**
Example: load a pre-trained model (GloVe word vectors):
```python
import gensim.downloader as api

info = api.info()  # show info about available models/datasets
model = api.load("glove-twitter-25")  # download the model and return as object ready for use
model.most_similar("cat")

"""
output:

[(u'dog', 0.9590819478034973),
 (u'monkey', 0.9203578233718872),
 (u'bear', 0.9143137335777283),
 (u'pet', 0.9108031392097473),
 (u'girl', 0.8880630135536194),
 (u'horse', 0.8872727155685425),
 (u'kitty', 0.8870542049407959),
 (u'puppy', 0.886769711971283),
 (u'hot', 0.8865255117416382),
 (u'lady', 0.8845518827438354)]
"""
```

Example: load a corpus and use it to train a Word2Vec model:
```python
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

corpus = api.load('text8')  # download the corpus and return it opened as an iterable
model = Word2Vec(corpus)  # train a model from the corpus
model.wv.most_similar("car")  # query the trained word vectors, which live on `model.wv`

"""
output:

[(u'driver', 0.8273754119873047),
 (u'motorcycle', 0.769528865814209),
 (u'cars', 0.7356342077255249),
 (u'truck', 0.7331641912460327),
 (u'taxi', 0.718338131904602),
 (u'vehicle', 0.7177008390426636),
 (u'racing', 0.6697118878364563),
 (u'automobile', 0.6657308340072632),
 (u'passenger', 0.6377975344657898),
 (u'glider', 0.6374964714050293)]
"""
```

Example: **only** download a dataset and return the local file path (no opening):
```python
import gensim.downloader as api

print(api.load("20-newsgroups", return_path=True))  # output: /home/user/gensim-data/20-newsgroups/20-newsgroups.gz
print(api.load("glove-twitter-25", return_path=True))  # output: /home/user/gensim-data/glove-twitter-25/glove-twitter-25.gz
```

- The same operations, but from the **CLI (command line interface)**:
```bash
python -m gensim.downloader --info  # show info about available models/datasets
python -m gensim.downloader --download text8  # download text8 dataset to ~/gensim-data/text8
python -m gensim.downloader --download glove-twitter-25  # download model to ~/gensim-data/glove-twitter-25/
```

----
## Available data
### Datasets
| name | file size | read_more |
|------|-----------|-----------|
| 20-newsgroups | 13 MB | http://qwone.com/~jason/20Newsgroups/ |
| fake-news | 19 MB | https://www.kaggle.com/mrisdal/fake-news |
| patent-2017 | 2944 MB | http://patents.reedtech.com/pgrbft.php |
| quora-duplicate-questions | 20 MB | https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs |
| semeval-2016-2017-task3-subtaskA-unannotated | 223 MB | http://alt.qcri.org/semeval2016/task3/<br>http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf<br>https://github.com/RaRe-Technologies/gensim-data/issues/18<br>https://github.com/Witiko/semeval-2016_2017-task3-subtaskA-unannotated-english |
| semeval-2016-2017-task3-subtaskBC | 6 MB | http://alt.qcri.org/semeval2017/task3/<br>http://alt.qcri.org/semeval2017/task3/data/uploads/semeval2017-task3.pdf<br>https://github.com/RaRe-Technologies/gensim-data/issues/18<br>https://github.com/Witiko/semeval-2016_2017-task3-subtaskB-english |
| text8 | 31 MB | http://mattmahoney.net/dc/textdata.html |
| wiki-english-20171001 | 6214 MB | https://dumps.wikimedia.org/enwiki/20171001/ |
### Models
| name | num vectors | file size | base dataset | read_more | parameters |
|------|-------------|-----------|--------------|-----------|------------|
| conceptnet-numberbatch-17-06-300 | 1917247 | 1168 MB | ConceptNet, word2vec, GloVe, and OpenSubtitles 2016 | http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14972<br>https://github.com/commonsense/conceptnet-numberbatch<br>http://conceptnet.io/ | dimension - 300 |
| fasttext-wiki-news-subwords-300 | 999999 | 958 MB | Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens) | https://fasttext.cc/docs/en/english-vectors.html<br>https://arxiv.org/abs/1712.09405<br>https://arxiv.org/abs/1607.01759 | dimension - 300 |
| glove-twitter-100 | 1193514 | 387 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 100 |
| glove-twitter-200 | 1193514 | 758 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 200 |
| glove-twitter-25 | 1193514 | 104 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 25 |
| glove-twitter-50 | 1193514 | 199 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 50 |
| glove-wiki-gigaword-100 | 400000 | 128 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 100 |
| glove-wiki-gigaword-200 | 400000 | 252 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 200 |
| glove-wiki-gigaword-300 | 400000 | 376 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 300 |
| glove-wiki-gigaword-50 | 400000 | 65 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | https://nlp.stanford.edu/projects/glove/<br>https://nlp.stanford.edu/pubs/glove.pdf | dimension - 50 |
| word2vec-google-news-300 | 3000000 | 1662 MB | Google News (about 100 billion words) | https://code.google.com/archive/p/word2vec/<br>https://arxiv.org/abs/1301.3781<br>https://arxiv.org/abs/1310.4546<br>https://www.microsoft.com/en-us/research/publication/linguistic-regularities-in-continuous-space-word-representations/?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf | dimension - 300 |
| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | https://www.academia.edu/24306935/WebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models<br>http://rusvectores.org/en/<br>https://github.com/RaRe-Technologies/gensim-data/issues/3 | dimension - 300<br>window_size - 10 |
(generated by [generate_table.py](https://github.com/RaRe-Technologies/gensim-data/blob/master/generate_table.py) based on [list.json](https://github.com/RaRe-Technologies/gensim-data/blob/master/list.json))
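As a rough illustration of how a catalog of entries could be turned into Markdown tables like the ones above, here is a minimal sketch. It is not the actual `generate_table.py`: the function names, the `<br>` separator for multi-valued cells, the column keys, and the sample entry are all assumptions made for illustration.

```python
def format_cell(value):
    """Turn one metadata value into the text of a single table cell."""
    if isinstance(value, (list, tuple)):
        # Multi-valued fields (e.g. several links) share one cell.
        return "<br>".join(str(item) for item in value)
    return str(value)


def render_table(entries, columns):
    """Render a {name: metadata} dict as a Markdown table, sorted by name."""
    header = "| name | " + " | ".join(columns) + " |"
    divider = "|------|" + "|".join("-" * (len(c) + 2) for c in columns) + "|"
    rows = []
    for name in sorted(entries):
        meta = entries[name]
        cells = [format_cell(meta.get(col, "")) for col in columns]
        rows.append("| " + " | ".join([name] + cells) + " |")
    return "\n".join([header, divider] + rows)


if __name__ == "__main__":
    # Made-up entry for illustration only; not a real dataset.
    sample = {
        "example-corpus": {
            "file_size": "1 MB",
            "read_more": ["https://example.org/a", "https://example.org/b"],
        }
    }
    print(render_table(sample, ["file_size", "read_more"]))
```

Missing keys render as empty cells, so entries with incomplete metadata still produce a well-formed row.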
----
# Want to add a new corpus or model?
1. Compress your dataset using gzip or bz2.
2. Share the compressed file on any file-sharing service.
3. Create a [new issue](https://github.com/RaRe-Technologies/gensim-data/issues) and give us the dataset link. Add a **detailed description** of **why** and **how** you created the dataset, any related papers or research, and how you expect other users to use it. Include a code example where relevant.
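For step 1, the compression can be done with Python's standard library alone. The sketch below uses gzip; the helper name `gzip_file` and the default `.gz` output path are assumptions for illustration, not a project requirement.

```python
import gzip
import shutil


def gzip_file(src_path, dst_path=None):
    """Compress src_path with gzip and return the path of the compressed file.

    Hypothetical helper; by default the output is written next to the
    input with a ".gz" suffix appended.
    """
    dst_path = dst_path or f"{src_path}.gz"
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        # Copy in chunks so a large corpus need not fit in memory.
        shutil.copyfileobj(src, dst)
    return dst_path
```

The `bz2` module offers a matching `bz2.open`, so the same pattern works for bz2 by swapping the opener and the suffix.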
----------------
`Gensim-data` is open source software released under the [LGPL 2.1 license](https://github.com/rare-technologies/gensim-data/blob/master/LICENSE).
Copyright (c) 2018 [RARE Technologies](https://rare-technologies.com/).