University of Thessaly

# word2vec-thesis

🎓 Diploma Thesis | A Word2vec comparative study of CBOW and Skipgram

Thesis as published in the UTH Institutional Repository

- Thesis
- Thesis Presentation

Developed as part of the thesis "Big Data Analytics using Machine Learning Algorithms" - A Word2vec comparative study of CBOW and Skipgram.

- Dataset
- Releases
- Pre-trained models

## Run
```bash
# Download the latest available release
wget https://github.com/estamos/word2vec-thesis/releases/download/final/word2vec-thesis-final.tar.gz
tar -xvf word2vec-thesis-final.tar.gz
cd word2vec-thesis-final

# Copy the comparison script next to the pre-trained models
cp test/test.py models
cd models

# Install the dependencies and run the comparison
pip install gensim tabulate
python test.py
```
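
A minimal sketch of a comparison script in the spirit of the released `test/test.py` (which is what the commands above run), assuming the model filenames from the repository tree below; the probe word `"computer"` is a hypothetical example, and the real script may compare the models differently:

```python
from gensim.models import Word2Vec
from tabulate import tabulate

# Load both pre-trained models (filenames taken from the repository tree)
cbow = Word2Vec.load("word2vec-cbow-trained.model")
skipgram = Word2Vec.load("word2vec-skipgram-trained.model")

# Compare the nearest neighbours each architecture finds for a probe word
word = "computer"  # hypothetical probe word, not taken from the thesis
rows = []
for (cw, cs), (sw, ss) in zip(cbow.wv.most_similar(word, topn=5),
                              skipgram.wv.most_similar(word, topn=5)):
    rows.append([cw, f"{cs:.3f}", sw, f"{ss:.3f}"])

print(tabulate(rows, headers=["CBOW", "cosine", "Skipgram", "cosine"]))
```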
## Word2vec - CBOW & Skipgram Comparative Tool



## Word2vec Architectures Performance Comparison Graphs

#### Effective words per epoch

#### Training time per epoch

## Word2vec Parameterization

|**Gensim parameter**|**TensorFlow parameter**|**Type**|**Details**|
|:------------------:|:----------------------:|:------:|:---------:|
| alpha | learning_rate | float | The initial learning rate |
| cbow_mean | - | boolean | 0: use the sum of the context word vectors; 1: use the mean (only applies when CBOW is used) |
| epochs | epochs | int | Number of iterations (epochs) over the corpus |
| hs | - | boolean | 1: hierarchical softmax will be used for model training; 0: if negative is non-zero, negative sampling will be used |
| min_count | min_count | int | Ignores all words with total frequency lower than this |
| negative | num_neg_samples | int | How many "noise words" should be drawn |
| sample | subsample | float | The threshold for configuring which higher-frequency words are randomly downsampled |
| sg | - | boolean | 0: CBOW; 1: Skipgram |
| vector_size | embedding_dim | int | Dimensionality of the word vectors |
| window | window_size | int | Maximum distance between the current and predicted word within a sentence |
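
The gensim column maps one-to-one onto the `Word2Vec` constructor. A minimal sketch, assuming gensim 4.x keyword names (`vector_size` and `epochs` replaced the older `size` and `iter`); the values shown are gensim's defaults, except `min_count`, which is lowered so the toy corpus survives vocabulary pruning:

```python
from gensim.models import Word2Vec

# Toy corpus so the snippet runs standalone; the thesis trains on a Wikipedia corpus
documents = [["big", "data", "analytics"], ["word2vec", "cbow", "skipgram"]]

model = Word2Vec(
    sentences=documents,
    sg=0,             # 0: CBOW, 1: Skipgram
    hs=0,             # 0 with negative > 0: negative sampling is used
    negative=5,       # how many "noise words" are drawn
    cbow_mean=1,      # 1: mean of the context word vectors, 0: sum (CBOW only)
    alpha=0.025,      # initial learning rate
    sample=1e-3,      # downsampling threshold for higher-frequency words
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # max distance between current and predicted word
    min_count=1,      # gensim default is 5
    epochs=5,         # iterations over the corpus
)
```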

## Statistics
Trained with the following parameters:

|**Gensim parameter**|**Value**|
|:------:|:---------:|
| window | 10 |
| min_count | 2 |
| workers | 10 |
| total_examples | len(documents) |
| epochs | 10 |
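
A hedged reconstruction of a training loop consistent with these settings, assuming the corpus file from the repository tree below and gensim defaults for everything not listed; the actual training code lives in `train/cbow/cbow.py` and `train/skipgram/skipgram.py`:

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# Stream the corpus line by line (path taken from the repository tree)
documents = LineSentence("dataset/wiki_en_corpus.txt")

for sg, name in ((0, "cbow"), (1, "skipgram")):
    model = Word2Vec(window=10, min_count=2, workers=10, sg=sg)
    model.build_vocab(documents)
    # corpus_count plays the role of len(documents) in the table above
    model.train(documents, total_examples=model.corpus_count, epochs=10)
    model.save(f"models/word2vec-{name}-trained.model")
```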
### Total training time (seconds)

|**CBOW**|**Skipgram**|
|:------:|:----------:|
|956.5|3768.5|

### Total effective words

|**CBOW**|**Skipgram**|
|:------:|:----------:|
|1327456338|1327454735|

### Training time per epoch (seconds)

|**Epoch**|**CBOW**|**Skipgram**|
|:-------:|:------:|:----------:|
|Average|95.65|376.85|
|1|95.9|338.3|
|2|95.3|340.0|
|3|96.7|339.9|
|4|96.1|448.0|
|5|95.4|339.3|
|6|95.3|339.8|
|7|95.6|339.9|
|8|95.3|599.3|
|9|95.3|342.8|
|10|95.6|341.2|

### Effective words per epoch

|**Epoch**|**CBOW**|**Skipgram**|
|:-------:|:------:|:----------:|
|Average|132745634|132745474|
|1|132750757|132744876|
|2|132744712|132741580|
|3|132743879|132750658|
|4|132748376|132743435|
|5|132747942|132749631|
|6|132746112|132744974|
|7|132744511|132745877|
|8|132742194|132744706|
|9|132740767|132745693|
|10|132747088|132743305|
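
Taken together, the totals above mean CBOW processed essentially the same number of effective words roughly 3.9 times faster than Skipgram; a quick check:

```python
# Totals taken from the tables above (times in seconds)
cbow_time, skipgram_time = 956.5, 3768.5
cbow_words, skipgram_words = 1_327_456_338, 1_327_454_735

print(f"CBOW:     {cbow_words / cbow_time:,.0f} effective words/s")
print(f"Skipgram: {skipgram_words / skipgram_time:,.0f} effective words/s")
print(f"Skipgram/CBOW time ratio: {skipgram_time / cbow_time:.2f}x")
```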

#### Tree
```
.
├── LICENSE
├── README.md
├── dataset
│   └── wiki_en_corpus.txt
├── logs
│   ├── cbow-log.rtf
│   └── skipgram-log.rtf
├── models
│   ├── word2vec-cbow-trained.model
│   ├── word2vec-cbow-trained.model.syn1neg.npy
│   ├── word2vec-cbow-trained.model.wv.vectors.npy
│   ├── word2vec-skipgram-trained.model
│   ├── word2vec-skipgram-trained.model.syn1neg.npy
│   └── word2vec-skipgram-trained.model.wv.vectors.npy
├── test
│   └── test.py
└── train
    ├── cbow
    │   └── cbow.py
    └── skipgram
        └── skipgram.py
```