https://github.com/OlgaChernytska/word2vec-pytorch

Implementation of the first paper on word2vec
https://github.com/OlgaChernytska/word2vec-pytorch

deep-learning natural-language-processing pytorch word2vec

Last synced: 5 days ago
JSON representation

Implementation of the first paper on word2vec

Host: GitHub
URL: https://github.com/OlgaChernytska/word2vec-pytorch
Owner: OlgaChernytska
Created: 2021-08-24T11:46:44.000Z (about 4 years ago)
Default Branch: main
Last Pushed: 2022-01-04T18:56:26.000Z (over 3 years ago)
Last Synced: 2024-09-26T20:30:15.180Z (about 1 year ago)
Topics: deep-learning, natural-language-processing, pytorch, word2vec
Language: Python
Homepage:
Size: 18 MB
Stars: 197
Watchers: 2
Forks: 87
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Word2Vec in PyTorch

Implementation of the first paper on word2vec - [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). For detailed explanation of the code here, check my post - [Word2vec with PyTorch: Reproducing Original Paper](https://notrocketscience.blog/word2vec-with-pytorch-implementing-original-paper/).

## Word2Vec Overview

There 2 model architectures desctibed in the paper:

- Continuous Bag-of-Words Model (CBOW), that predicts word based on its context;
- Continuous Skip-gram Model (Skip-Gram), that predicts context for a word.

Difference with the original paper:

- Trained on [WikiText-2](https://pytorch.org/text/stable/datasets.html#wikitext-2) and [WikiText103](https://pytorch.org/text/stable/datasets.html#wikitext103) inxtead of Google News corpus.
- Context for both models is represented as 4 history and 4 future words.
- For CBOW model averaging for context word embeddings used instead of summation.
- For Skip-Gram model all context words are sampled with the same probability.
- Plain Softmax was used instead of Hierarchical Softmax. No Huffman tree used either.
- Adam optimizer was used instead of Adagrad.
- Trained for 5 epochs.
- Regularization applied: embedding vector norms are restricted to 1.

### CBOW Model in Details
#### High-Level Model
![alt text](docs/cbow_overview.png)
#### Model Architecture
![alt text](docs/cbow_detailed.png)

### Skip-Gram Model in Details
#### High-Level Model
![alt text](docs/skipgram_overview.png)
#### Model Architecture
![alt text](docs/skipgram_detailed.png)

## Project Structure

```
.
├── README.md
├── config.yaml
├── notebooks
│ └── Inference.ipynb
├── requirements.txt
├── train.py
├── utils
│ ├── constants.py
│ ├── dataloader.py
│ ├── helper.py
│ ├── model.py
│ └── trainer.py
└── weights
```

- **utils/dataloader.py** - data loader for WikiText-2 and WikiText103 datasets
- **utils/model.py** - model architectures
- **utils/trainer.py** - class for model training and evaluation

- **train.py** - script for training
- **config.yaml** - file with training parameters
- **weights/** - folder where expriments artifacts are stored
- **notebooks/Inference.ipynb** - demo of how embeddings are used

## Usage

```
python3 train.py --config config.yaml
```

Before running the command, change the training parameters in the config.yaml, most important:

- model_name ("skipgram", "cbow")
- dataset ("WikiText2", "WikiText103")
- model_dir (directory to store experiment artifacts, should start with "weights/")

## License
This project is licensed under the terms of the [MIT license](https://choosealicense.com/licenses/mit/).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/OlgaChernytska/word2vec-pytorch

Awesome Lists containing this project

README