Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dayyass/language-modeling
Pipeline for training Language Models using PyTorch.
- Host: GitHub
- URL: https://github.com/dayyass/language-modeling
- Owner: dayyass
- Created: 2021-01-02T17:55:26.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-05-24T20:34:15.000Z (over 2 years ago)
- Last Synced: 2023-03-06T20:35:50.041Z (over 1 year ago)
- Topics: decoding, deep-learning, gpt-2, language-modeling, lstm, natural-language-processing, ngrams, nlp, python, pytorch, rnn, sampling, text-generation
- Language: Python
- Homepage:
- Size: 68.4 KB
- Stars: 12
- Watchers: 1
- Forks: 0
- Open Issues: 11
Metadata Files:
- Readme: README.md
README
### About
Pipeline for training Language Models using PyTorch.
Inspired by the Yandex Data School [NLP Course](https://github.com/yandexdataschool/nlp_course) ([week03](https://github.com/yandexdataschool/nlp_course/tree/2020/week03_lm): Language Modeling).

### Usage
First, install dependencies:
```
# clone repo
git clone https://github.com/dayyass/language_modeling.git

# install dependencies
cd language_modeling
pip install -r requirements.txt
```

### Data Format
The data is a plain text file with space-separated words on each line.
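For illustration, a couple of hypothetical lines in this format (whitespace-tokenized, lowercased sentences, one per line) might look like:
```
we present a novel approach to natural language processing .
the proposed model outperforms strong baselines on several benchmarks .
```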
More about it [here](data/README.md).

### Statistical Language Modeling
#### Training
Script for training statistical language models:
```
python statistical_lm/train.py --path_to_data "data/arxiv_train.txt" --n 3 --path_to_save "models/3_gram_language_model.pkl"
```
Required arguments:
- **--path_to_data** - path to train data
- **--n** - n-gram order

Optional arguments:
- **--smoothing** - smoothing method (available: None, "add-k") (default: *None*)
- **--delta** - smoothing additive parameter (only for add-k smoothing) (default: *1.0*)
- **--path_to_save** - path to save model (default: *"models/language_model.pkl"*)
- **--verbose** - verbose (default: *True*)
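For intuition about the **--n**, **--smoothing** and **--delta** options above: an n-gram model boils down to a table of context counts. A minimal sketch of counting with optional add-k (Laplace) smoothing, written against the data format described earlier (an illustration only, not the repository's implementation):
```
from collections import defaultdict

def train_ngram_lm(path_to_data, n=3, delta=0.0):
    """Count n-grams; delta > 0 corresponds to add-k / Laplace smoothing."""
    context_counts = defaultdict(lambda: defaultdict(int))
    vocab = set()
    with open(path_to_data) as f:
        for line in f:
            tokens = ["<bos>"] * (n - 1) + line.split() + ["<eos>"]
            vocab.update(tokens)
            for i in range(n - 1, len(tokens)):
                context_counts[tuple(tokens[i - n + 1 : i])][tokens[i]] += 1

    def prob(context, word):
        counts = context_counts[tuple(context)]
        total = sum(counts.values())
        # add-k: add delta to every count; with delta == 0 this is the plain MLE estimate
        return (counts[word] + delta) / (total + delta * len(vocab))

    return prob
```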
#### Validation
Script for validating statistical language models using perplexity:
```
python statistical_lm/validate.py --path_to_data "data/arxiv_test.txt" --path_to_model "models/3_gram_language_model.pkl"
```
Required arguments:
- **--path_to_data** - path to validation data
- **--path_to_model** - path to language model

Optional arguments:
- **--verbose** - verbose (default: *True*)
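Perplexity here is the exponentiated average negative log-probability that the model assigns to the held-out tokens. A minimal sketch of the computation, reusing a `prob(context, word)` function like the one above (illustrative, not the repository's exact code):
```
import math

def perplexity(path_to_data, prob, n=3):
    """exp of the average negative log-probability over all test tokens."""
    log_prob_sum, num_tokens = 0.0, 0
    with open(path_to_data) as f:
        for line in f:
            tokens = ["<bos>"] * (n - 1) + line.split() + ["<eos>"]
            for i in range(n - 1, len(tokens)):
                p = prob(tuple(tokens[i - n + 1 : i]), tokens[i])
                # an unseen, unsmoothed n-gram has p == 0 and makes the perplexity infinite
                log_prob_sum += math.log(p) if p > 0 else float("-inf")
                num_tokens += 1
    return math.exp(-log_prob_sum / num_tokens)
```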
#### Inference
Script for generating new sequences using statistical language models:
```
python statistical_lm/inference.py --path_to_model "models/3_gram_language_model.pkl" --prefix "artificial" --temperature 0.5
```
Required arguments:
- **--path_to_model** - path to language model

Optional arguments:
- **--prefix** - prefix before sequence generation (default: *""*)
- **--strategy** - decoding strategy (available: "sampling", "top-k-uniform", "top-k", "top-p-uniform", "top-p" and "beam search") (default: *"sampling"*)
- **--temperature** - sampling temperature; if temperature == 0.0, the most likely token is always taken, i.e. greedy decoding (only for the "sampling" decoding strategy) (default: *0.0*)
- **--k** - top-k parameter (only for "top-k-uniform" and "top-k" decoding strategy) (default: *10*)
- **--p** - top-p parameter (only for "top-p-uniform" and "top-p" decoding strategy) (default: *0.9*)
- **--max_length** - max number of generated words (default: *100*)
- **--seed** - random seed (default: *42*)
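The **--temperature** option above rescales the predicted distribution before the next word is drawn. A minimal sketch of temperature sampling, with temperature == 0.0 falling back to greedy decoding as noted above (illustrative only, not the repository's code):
```
import numpy as np

rng = np.random.default_rng(42)  # --seed

def sample_with_temperature(probs, temperature=0.0):
    """probs: next-token probabilities; temperature == 0.0 -> greedy (argmax)."""
    probs = np.asarray(probs, dtype=float)
    if temperature == 0.0:
        return int(np.argmax(probs))
    # raising to 1/T sharpens the distribution for T < 1 and flattens it for T > 1
    scaled = probs ** (1.0 / temperature)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))
```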
Command output with 3-gram language model trained on [*arxiv.txt*](data/README.md) with prefix "*artificial*" and temperature 0.5:
```
artificial neural network ( cnn ) architectures on h2o platform for real - world applications .
```

### RNN Language Modeling
#### Training
Script for training RNN language models:
```
python rnn_lm/train.py --path_to_data "data/arxiv_train.txt" --path_to_save_folder "models/rnn_language_model" --n_epoch 5 --max_length 512 --batch_size 128 --embedding_dim 64 --rnn_hidden_size 256
```
Required arguments:
- **--path_to_data** - path to train data
- **--n_epoch** - number of epochs
- **--batch_size** - dataloader batch_size
- **--embedding_dim** - embedding dimension
- **--rnn_hidden_size** - LSTM hidden size

Optional arguments:
- **--path_to_save_folder** - path to save folder (default: *"models/rnn_language_model"*)
- **--max_length** - max sentence length (chars) (default: *None*)
- **--shuffle** - dataloader shuffle (default: *True*)
- **--rnn_num_layers** - number of LSTM layers (default: *1*)
- **--rnn_dropout** - LSTM dropout (default: *0.0*)
- **--train_eval_freq** - evaluation frequency (number of batches) (default: *50*)
- **--clip_grad_norm** - max_norm parameter in clip_grad_norm (default: *1.0*)
- **--seed** - random seed (default: *42*)
- **--device** - torch device (available: "cpu", "cuda") (default: *"cuda"*)
- **--verbose** - verbose (default: *True*)
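For orientation, a char-level RNN language model of this kind can be as small as an embedding layer, an LSTM and a linear projection back to the vocabulary. A minimal PyTorch sketch using the argument names above (an illustration of the architecture, not the repository's exact model):
```
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=64, rnn_hidden_size=256,
                 rnn_num_layers=1, rnn_dropout=0.0):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, rnn_hidden_size, num_layers=rnn_num_layers,
                            dropout=rnn_dropout, batch_first=True)
        self.head = nn.Linear(rnn_hidden_size, vocab_size)

    def forward(self, char_ids, hidden=None):
        # char_ids: [batch_size, seq_len] indices into the character vocabulary
        out, hidden = self.lstm(self.embedding(char_ids), hidden)
        return self.head(out), hidden  # logits over the character vocabulary
```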
#### Validation
Script for validating RNN language models using perplexity:
```
python rnn_lm/validate.py --path_to_data "data/arxiv_test.txt" --path_to_model_folder "models/rnn_language_model" --max_length 512
```
Required arguments:
- **--path_to_data** - path to validation data
- **--path_to_model_folder** - path to language model folder

Optional arguments:
- **--max_length** - max sentence length (chars) (default: *None*)
- **--seed** - random seed (default: *42*)
- **--device** - torch device (available: "cpu", "cuda") (default: *"cuda"*)
- **--verbose** - verbose (default: *True*)
#### Inference
Script for generating new sequences using RNN language models:
```
python rnn_lm/inference.py --path_to_model_folder "models/rnn_language_model" --prefix "artificial" --temperature 0.5
```
Required arguments:
- **--path_to_model_folder** - path to language model folder

Optional arguments:
- **--prefix** - prefix before sequence generation (default: *""*)
- **--temperature** - sampling temperature; if temperature == 0.0, the most likely token is always taken, i.e. greedy decoding (default: *0.0*)
- **--max_length** - max number of generated tokens (chars) (default: *100*)
- **--seed** - random seed (default: *42*)
- **--device** - torch device (available: "cpu", "cuda") (default: *"cuda"*)
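Generation with such a model is autoregressive: encode the prefix, sample one character, feed it back in, and repeat up to **--max_length**. A sketch reusing the hypothetical `RNNLanguageModel` and `sample_with_temperature` helpers from above (the `char2idx` / `idx2char` mappings and the `<bos>` token are assumptions, not the repository's API):
```
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, char2idx, idx2char, prefix="", temperature=0.0, max_length=100):
    model.eval()
    ids = [char2idx["<bos>"]] + [char2idx[c] for c in prefix]
    logits, hidden = model(torch.tensor([ids]), None)  # encode the prefix
    generated = list(prefix)
    for _ in range(max_length):
        probs = F.softmax(logits[0, -1], dim=-1).numpy()
        next_id = sample_with_temperature(probs, temperature)
        generated.append(idx2char[next_id])  # a real script would also stop at an <eos> char
        logits, hidden = model(torch.tensor([[next_id]]), hidden)
    return "".join(generated)
```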
Command output with RNN language model trained on [*arxiv.txt*](data/README.md) with prefix "*artificial*" and temperature 0.5:
```
artificial visual information of the number , using an intervidence for detection for order to the recognition
```

### Models
List of implemented models:
- [x] [N-gram Language Model](https://github.com/dayyass/language_modeling/blob/b962edac04dfe10a3f87dfa16d4d37508af6d5de/model.py#L57)
- [x] [RNN Language Model](https://github.com/dayyass/language_modeling/blob/407d02b79d6d7fd614dc7c5fd235ad269cddcb2d/rnn_lm/model.py#L6) (char-based)
- [ ] GPT Language Model

### Decoding Strategy
- [x] greedy
- [x] temperature sampling
- [x] top-k-uniform
- [x] top-k
- [x] top-p-uniform
- [x] top-p
- [ ] beam search
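Top-k and top-p (nucleus) decoding both restrict sampling to a subset of likely tokens before drawing; a minimal sketch of the two filters (the "top-k-uniform" / "top-p-uniform" variants presumably sample uniformly within the kept set instead of proportionally; illustrative only, not the repository's code):
```
import numpy as np

def top_k_filter(probs, k=10):
    """Zero out everything but the k most probable tokens, then renormalize."""
    probs = np.asarray(probs, dtype=float).copy()
    probs[probs < np.sort(probs)[-k]] = 0.0
    return probs / probs.sum()

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    probs = np.asarray(probs, dtype=float).copy()
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    # keep every token up to and including the one that pushes the mass past p
    keep = cumulative - probs[order] < p
    probs[order[~keep]] = 0.0
    return probs / probs.sum()
```
Either filter can be combined with a sampler such as `sample_with_temperature` above: filter and renormalize first, then draw from the result.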
### Smoothing (only for N-gram Language Models)
- [x] no smoothing
- [x] add-k / Laplace smoothing
- [ ] interpolation smoothing
- [ ] back-off / Katz smoothing
- [ ] Kneser-Ney smoothing

### Models Comparison
Generation comparison is available [here](https://github.com/dayyass/language-modeling/wiki/Generation-Comparison).

#### Statistical Language Modeling
| perplexity (train / test) | none | add-k / Laplace | interpolation | back-off / Katz | Kneser-Ney |
| ------------------------- | ---------------- | -------------------| ------------- | --------------- | ---------- |
| **1-gram** | 881.27 / 1832.23 | 882.63 / 1838.22 | - | - | - |
| **2-gram** | 95.32 / 8.57e+7 | 1106.79 / 1292.02 | - | - | - |
| **3-gram** | 12.78 / 6.2e+22 | 7032.91 / 10499.24 | - | - | - |