Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/chakki-works/chakin

Simple downloader for pre-trained word vectors

datasets machine-learning natural-language-processing word-embeddings word-vectors

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/chakki-works/chakin
Owner: chakki-works
License: mit
Created: 2017-05-19T03:40:25.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2022-06-21T21:11:46.000Z (almost 2 years ago)
Last Synced: 2024-02-21T02:22:39.775Z (4 months ago)
Topics: datasets, machine-learning, natural-language-processing, word-embeddings, word-vectors
Language: Python
Homepage: https://medium.com/chakki/simple-downloader-for-public-word-embeddings-fdbd3ce7ba5b
Size: 172 KB
Stars: 332
Watchers: 18
Forks: 49
Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

awesome-embedding-models - chakin

README

# chakin

**chakin** is a downloader for pre-trained word vectors. It supports [many vectors](#supported-vectors).

The library lets you search the available vectors and download the ones you need with a single function call, without tracking down each file by hand.

-----------------

# Installation

To install chakin, run:

```shell
$ pip install chakin
```

# Usage

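To download pre-trained word vectors, start a Python interpreter, search for the vectors available in your language, and pass the index number of the one you want to `chakin.download`: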

```shell
$ python
```

```python
>>> import chakin
>>> chakin.search(lang='English')
                   Name  Dimension                     Corpus VocabularySize
2          fastText(en)        300                  Wikipedia           2.5M
11         GloVe.6B.50d         50  Wikipedia+Gigaword 5 (6B)           400K
12        GloVe.6B.100d        100  Wikipedia+Gigaword 5 (6B)           400K
13        GloVe.6B.200d        200  Wikipedia+Gigaword 5 (6B)           400K
14        GloVe.6B.300d        300  Wikipedia+Gigaword 5 (6B)           400K
15       GloVe.42B.300d        300          Common Crawl(42B)           1.9M
16      GloVe.840B.300d        300         Common Crawl(840B)           2.2M
17    GloVe.Twitter.25d         25               Twitter(27B)           1.2M
18    GloVe.Twitter.50d         50               Twitter(27B)           1.2M
19   GloVe.Twitter.100d        100               Twitter(27B)           1.2M
20   GloVe.Twitter.200d        200               Twitter(27B)           1.2M
21  word2vec.GoogleNews        300          Google News(100B)           3.0M
>>> chakin.download(number=2, save_dir='./') # select fastText(en)
Test: 100% ||               | Time: 0:00:02  60.7 MiB/s
'./wiki.en.vec'
```
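
Once the file is on disk you will usually want to read it into memory. The following is a minimal sketch, not part of chakin: `load_vec` is a hypothetical helper that assumes the file uses the plain-text layout of fastText `.vec` files, where the first line holds the vocabulary size and dimension and every later line holds a word followed by its space-separated values. Other downloads (for example zipped GloVe archives or the binary word2vec file) would need different handling.

```python
def load_vec(path, limit=None):
    """Read a text-format vector file into a {word: [float, ...]} dict.

    Hypothetical helper, not provided by chakin. Assumes a header line
    "<vocab_size> <dimension>" followed by one "<word> <v1> ... <vN>"
    line per word.
    """
    vectors = {}
    with open(path, encoding="utf-8", errors="ignore") as f:
        header = f.readline().split()
        if len(header) != 2:
            raise ValueError("expected a '<vocab_size> <dimension>' header")
        dim = int(header[1])
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) <= dim:
                continue  # skip malformed or truncated lines
            # Take the last `dim` fields as the vector, so that a token
            # containing spaces is still kept in one piece.
            word = " ".join(parts[:-dim])
            vectors[word] = [float(v) for v in parts[-dim:]]
            if limit is not None and len(vectors) >= limit:
                break  # stop early to keep memory use small
    return vectors
```

For a quick experiment, something like `load_vec('./wiki.en.vec', limit=10000)` reads only the first 10,000 entries instead of the whole vocabulary.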

# Supported vectors

So far, chakin supports the following word vectors:

| Name                      | Dimension | Corpus                    | VocabularySize | Method             | Language   |
|---------------------------|-----------|---------------------------|----------------|--------------------|------------|
| fastText(ar)              | 300       | Wikipedia                 | 610K           | fastText           | Arabic     |
| fastText(de)              | 300       | Wikipedia                 | 2.3M           | fastText           | German     |
| fastText(en)              | 300       | Wikipedia                 | 2.5M           | fastText           | English    |
| fastText(es)              | 300       | Wikipedia                 | 985K           | fastText           | Spanish    |
| fastText(fr)              | 300       | Wikipedia                 | 1.2M           | fastText           | French     |
| fastText(it)              | 300       | Wikipedia                 | 871K           | fastText           | Italian    |
| fastText(ja)              | 300       | Wikipedia                 | 580K           | fastText           | Japanese   |
| fastText(ko)              | 300       | Wikipedia                 | 880K           | fastText           | Korean     |
| fastText(pt)              | 300       | Wikipedia                 | 592K           | fastText           | Portuguese |
| fastText(ru)              | 300       | Wikipedia                 | 1.9M           | fastText           | Russian    |
| fastText(zh)              | 300       | Wikipedia                 | 330K           | fastText           | Chinese    |
| GloVe.6B.50d              | 50        | Wikipedia+Gigaword 5 (6B) | 400K           | GloVe              | English    |
| GloVe.6B.100d             | 100       | Wikipedia+Gigaword 5 (6B) | 400K           | GloVe              | English    |
| GloVe.6B.200d             | 200       | Wikipedia+Gigaword 5 (6B) | 400K           | GloVe              | English    |
| GloVe.6B.300d             | 300       | Wikipedia+Gigaword 5 (6B) | 400K           | GloVe              | English    |
| GloVe.42B.300d            | 300       | Common Crawl(42B)         | 1.9M           | GloVe              | English    |
| GloVe.840B.300d           | 300       | Common Crawl(840B)        | 2.2M           | GloVe              | English    |
| GloVe.Twitter.25d         | 25        | Twitter(27B)              | 1.2M           | GloVe              | English    |
| GloVe.Twitter.50d         | 50        | Twitter(27B)              | 1.2M           | GloVe              | English    |
| GloVe.Twitter.100d        | 100       | Twitter(27B)              | 1.2M           | GloVe              | English    |
| GloVe.Twitter.200d        | 200       | Twitter(27B)              | 1.2M           | GloVe              | English    |
| word2vec.GoogleNews       | 300       | Google News(100B)         | 3.0M           | word2vec           | English    |
| word2vec.Wiki-NEologd.50d | 50        | Wikipedia                 | 335K           | word2vec + NEologd | Japanese   |
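
The Dimension and VocabularySize columns also give a rough idea of how much memory a set of vectors needs once loaded. The sketch below is a back-of-the-envelope estimate only: `parse_size` and `estimate_bytes` are hypothetical helpers, and the 4 bytes per value assumes the vectors are held as a dense array of 32-bit floats. chakin does not report this figure, and the downloaded files themselves will differ in size.

```python
def parse_size(text):
    """Convert a VocabularySize cell such as '400K' or '2.5M' to an int."""
    multipliers = {"K": 1_000, "M": 1_000_000}
    suffix = text[-1].upper()
    if suffix in multipliers:
        return int(round(float(text[:-1]) * multipliers[suffix]))
    return int(text)


def estimate_bytes(vocab_size, dimension, bytes_per_value=4):
    """Estimate in-memory size: words x dimensions x bytes per value.

    bytes_per_value=4 is an assumption (32-bit floats), not a chakin setting.
    """
    return parse_size(vocab_size) * dimension * bytes_per_value


# GloVe.6B.50d: 400,000 words x 50 dimensions x 4 bytes
print(estimate_bytes("400K", 50))   # 80000000 bytes, about 80 MB

# fastText(en): 2,500,000 words x 300 dimensions x 4 bytes
print(estimate_bytes("2.5M", 300))  # 3000000000 bytes, about 3 GB
```

An estimate like this can help decide between, say, a 50-dimensional and a 300-dimensional GloVe set before starting a large download.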