https://github.com/OlgaChernytska/word2vec-pytorch
Implementation of the first paper on word2vec
deep-learning natural-language-processing pytorch word2vec
- Host: GitHub
- URL: https://github.com/OlgaChernytska/word2vec-pytorch
- Owner: OlgaChernytska
- Created: 2021-08-24T11:46:44.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-01-04T18:56:26.000Z (about 3 years ago)
- Last Synced: 2024-09-26T20:30:15.180Z (5 months ago)
- Topics: deep-learning, natural-language-processing, pytorch, word2vec
- Language: Python
- Homepage:
- Size: 18 MB
- Stars: 197
- Watchers: 2
- Forks: 87
- Open Issues: 2
Metadata Files:
- Readme: README.md
# Word2Vec in PyTorch
Implementation of the first paper on word2vec - [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/abs/1301.3781). For a detailed explanation of the code, see my post - [Word2vec with PyTorch: Reproducing Original Paper](https://notrocketscience.blog/word2vec-with-pytorch-implementing-original-paper/).
## Word2Vec Overview
There are 2 model architectures described in the paper:
- Continuous Bag-of-Words Model (CBOW), which predicts a word based on its context;
- Continuous Skip-gram Model (Skip-Gram), which predicts the context for a word.

Differences from the original paper:
- Trained on [WikiText-2](https://pytorch.org/text/stable/datasets.html#wikitext-2) and [WikiText103](https://pytorch.org/text/stable/datasets.html#wikitext103) instead of the Google News corpus.
- Context for both models is represented as 4 history and 4 future words.
- For the CBOW model, context word embeddings are averaged instead of summed.
- For the Skip-Gram model, all context words are sampled with the same probability.
- Plain Softmax was used instead of Hierarchical Softmax. No Huffman tree was used either.
- Adam optimizer was used instead of Adagrad.
- Trained for 5 epochs.
- Regularization applied: embedding vector norms are restricted to 1.

### CBOW Model in Details
#### High-Level Model
*(Figure: high-level view of the CBOW model.)*
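With a context of 4 history and 4 future words, CBOW training examples can be built by sliding a window over the token sequence. The following is a minimal sketch, not the code from `utils/dataloader.py`: the function name `cbow_examples`, the parameter `n_words`, and the choice to skip positions without a full window are assumptions made for illustration.

```python
def cbow_examples(token_ids, n_words=4):
    """Build (context, target) pairs for CBOW.

    The context is the n_words tokens before and the n_words tokens after
    the target. Positions without a full window are skipped (an assumption
    of this sketch).
    """
    examples = []
    for i in range(n_words, len(token_ids) - n_words):
        context = token_ids[i - n_words:i] + token_ids[i + 1:i + n_words + 1]
        examples.append((context, token_ids[i]))
    return examples


# Toy sequence of 10 token ids: only positions 4 and 5 have a full window.
examples = cbow_examples(list(range(10)))
print(examples[0])  # ([0, 1, 2, 3, 5, 6, 7, 8], 4)
```

Each example pairs 8 context token ids with the single middle token that the model has to predict.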
#### Model Architecture
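The differences listed earlier (averaged context embeddings, embedding norms restricted to 1, plain softmax, Adam) suggest a very small network. The sketch below is one possible way to write it in PyTorch and is not taken from `utils/model.py`: the class name `CBOW`, the embedding size of 300, the toy vocabulary size, and the learning rate are assumptions, and using the `max_norm` argument of `nn.Embedding` is just one way to restrict the norms.

```python
import torch
import torch.nn as nn


class CBOW(nn.Module):
    """Hypothetical CBOW sketch: average the context embeddings, then project to the vocabulary."""

    def __init__(self, vocab_size, embed_dim=300):
        super().__init__()
        # max_norm=1.0 renormalizes any looked-up vector whose norm exceeds 1.
        self.embeddings = nn.Embedding(vocab_size, embed_dim, max_norm=1.0)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):
        x = self.embeddings(context_ids)  # (batch, n_context, embed_dim)
        x = x.mean(dim=1)                 # average instead of sum
        return self.linear(x)             # logits over the whole vocabulary


# One training step on random data (toy sizes, for illustration only).
model = CBOW(vocab_size=100, embed_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=0.025)  # assumed learning rate
criterion = nn.CrossEntropyLoss()  # plain softmax + negative log-likelihood

context = torch.randint(0, 100, (32, 8))  # 4 history + 4 future word ids
target = torch.randint(0, 100, (32,))     # id of the middle word

optimizer.zero_grad()
logits = model(context)
loss = criterion(logits, target)
loss.backward()
optimizer.step()
```

`nn.CrossEntropyLoss` applies the softmax over the full vocabulary internally, which matches the "plain Softmax" choice but becomes expensive as the vocabulary grows.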
*(Figure: CBOW model architecture.)*

### Skip-Gram Model in Details
#### High-Level Model
*(Figure: high-level view of the Skip-Gram model.)*
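For Skip-Gram the direction is reversed: each middle word is paired with every word in its window. Because this implementation samples all context words with the same probability, every neighbour within 4 positions yields exactly one pair, whereas the original paper samples distant words less often. The sketch below is illustrative only; the function name `skipgram_pairs` and the handling of sequence edges are assumptions, not the code from `utils/dataloader.py`.

```python
def skipgram_pairs(token_ids, n_words=4):
    """Build (center, context) pairs for Skip-Gram.

    Every word within n_words positions of the center yields one pair,
    so all context words carry the same weight. Positions without a full
    window are skipped (an assumption of this sketch).
    """
    pairs = []
    for i in range(n_words, len(token_ids) - n_words):
        center = token_ids[i]
        for j in range(i - n_words, i + n_words + 1):
            if j != i:
                pairs.append((center, token_ids[j]))
    return pairs


# Toy sequence of 10 token ids: 2 usable centers with 8 context words each.
pairs = skipgram_pairs(list(range(10)))
print(len(pairs))  # 16
print(pairs[:3])   # [(4, 0), (4, 1), (4, 2)]
```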
#### Model Architecture
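A Skip-Gram network can be written with the same two layers as CBOW, minus the averaging step, since the input is a single word. This is a hedged sketch rather than the code in `utils/model.py`: the class name `SkipGram`, the embedding size of 300, and the toy sizes in the usage lines are assumptions.

```python
import torch
import torch.nn as nn


class SkipGram(nn.Module):
    """Hypothetical Skip-Gram sketch: embed the center word, then score every vocabulary word."""

    def __init__(self, vocab_size, embed_dim=300):
        super().__init__()
        # max_norm=1.0 keeps looked-up embedding vectors at norm 1 or below.
        self.embeddings = nn.Embedding(vocab_size, embed_dim, max_norm=1.0)
        self.linear = nn.Linear(embed_dim, vocab_size)

    def forward(self, center_ids):
        x = self.embeddings(center_ids)  # (batch, embed_dim)
        return self.linear(x)            # logits over the whole vocabulary


# Toy forward pass: 32 center words from a vocabulary of 100.
model = SkipGram(vocab_size=100, embed_dim=16)
center = torch.randint(0, 100, (32,))
logits = model(center)
probs = torch.softmax(logits, dim=1)  # plain softmax over context candidates
```

Each (center, context) pair is a separate training example, so the context word id serves as the class label for a cross-entropy loss.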
*(Figure: Skip-Gram model architecture.)*

## Project Structure
```
.
├── README.md
├── config.yaml
├── notebooks
│ └── Inference.ipynb
├── requirements.txt
├── train.py
├── utils
│ ├── constants.py
│ ├── dataloader.py
│ ├── helper.py
│ ├── model.py
│ └── trainer.py
└── weights
```

- **utils/dataloader.py** - data loader for WikiText-2 and WikiText103 datasets
- **utils/model.py** - model architectures
- **utils/trainer.py** - class for model training and evaluation
- **train.py** - script for training
- **config.yaml** - file with training parameters
- **weights/** - folder where experiment artifacts are stored
- **notebooks/Inference.ipynb** - demo of how embeddings are used

## Usage
```
python3 train.py --config config.yaml
```

Before running the command, set the training parameters in config.yaml. The most important ones are:
- model_name ("skipgram", "cbow")
- dataset ("WikiText2", "WikiText103")
- model_dir (directory to store experiment artifacts, should start with "weights/")

## License
This project is licensed under the terms of the [MIT license](https://choosealicense.com/licenses/mit/).