{"id":18398889,"url":"https://github.com/lyeoni/pretraining-for-language-understanding","last_synced_at":"2025-04-07T05:33:56.212Z","repository":{"id":196956115,"uuid":"194381382","full_name":"lyeoni/pretraining-for-language-understanding","owner":"lyeoni","description":"Pre-training of Language Models for Language Understanding","archived":false,"fork":false,"pushed_at":"2019-08-24T14:21:34.000Z","size":575,"stargazers_count":83,"open_issues_count":0,"forks_count":14,"subscribers_count":8,"default_branch":"master","last_synced_at":"2025-03-22T14:03:32.832Z","etag":null,"topics":["language-model","language-modeling","language-understanding","nlp","pytorch","pytorch-tutorial"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lyeoni.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2019-06-29T08:12:39.000Z","updated_at":"2024-04-30T15:06:54.000Z","dependencies_parsed_at":"2023-09-28T11:47:28.698Z","dependency_job_id":null,"html_url":"https://github.com/lyeoni/pretraining-for-language-understanding","commit_stats":null,"previous_names":["lyeoni/pretraining-for-language-understanding"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyeoni%2Fpretraining-for-language-understanding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyeoni%2Fpretraining-for-language-understanding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyeoni%2Fpretraining-for-language-understanding/releases","manifests_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/lyeoni%2Fpretraining-for-language-understanding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lyeoni","download_url":"https://codeload.github.com/lyeoni/pretraining-for-language-understanding/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247601378,"owners_count":20964861,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["language-model","language-modeling","language-understanding","nlp","pytorch","pytorch-tutorial"],"created_at":"2024-11-06T02:24:47.717Z","updated_at":"2025-04-07T05:33:53.180Z","avatar_url":"https://github.com/lyeoni.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Pre-training For Language Understanding\n[![LICENSE](https://img.shields.io/github/license/lyeoni/pretraining-for-language-understanding?style=flat-square)](https://github.com/lyeoni/pretraining-for-language-understanding/blob/master/LICENSE)\n[![GitHub issues](https://img.shields.io/github/issues/lyeoni/pretraining-for-language-understanding?style=flat-square\u0026color=yellow)](https://github.com/lyeoni/pretraining-for-language-understanding/issues)\n[![GitHub stars](https://img.shields.io/github/stars/lyeoni/pretraining-for-language-understanding?style=flat-square\u0026color=important)](https://github.com/lyeoni/pretraining-for-language-understanding/stargazers)\n[![GitHub 
forks](https://img.shields.io/github/forks/lyeoni/pretraining-for-language-understanding?style=flat-square\u0026color=blueviolet)](https://github.com/lyeoni/pretraining-for-language-understanding/network/members)\n\nPre-training a language model is now a significant step in NLP for language understanding.\n\nA language model is first trained on a massive corpus, and can then be used as a component in other models that need to handle language (e.g. for downstream tasks).\n\n## Overview\n### Language Model\nA Language Model (LM) captures **the distribution over all possible sentences**.\n- Input : a sentence\n- Output : the probability of the input sentence\n\nAlthough language modeling is typically _unsupervised learning_ on a massive corpus, this repo casts it as a _sequence of supervised learning_ problems.\n\n#### Autoregressive Language Model\n\n\u003cp align=\"center\"\u003e\n\u003cimg width=\"500\" src=\"https://storage.googleapis.com/deepmind-live-cms/documents/BlogPost-Fig2-Anim-160908-r01.gif\" align=\"middle\"\u003e\n\u003c/p\u003e\n\u003cbr\u003e\n\nAn autoregressive language model captures the distribution over the next token given all the previous tokens. 
In other words, it looks at the previous tokens and predicts the next token.\n\nThe objective of an autoregressive language model is expressed as follows:\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://latex.codecogs.com/svg.latex?\\dpi{100}\u0026space;input\\;\u0026space;sentence\u0026space;:\u0026space;x\u0026space;=\u0026space;(x_{1},\u0026space;x_{2},...,\u0026space;x_{T})\" title=\"input\\; sentence : x = (x_{1}, x_{2},..., x_{T})\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://latex.codecogs.com/svg.latex?\\dpi{100}\u0026space;likelihood\u0026space;:\u0026space;p(x)\u0026space;=\u0026space;p(x_{1})p(x_{2}|x_{1})\\cdots\u0026space;p(x_{T}|x_{1},...,x_{T-1})\u0026space;=\u0026space;\\prod_{t=1}^{T}p(x_{t}|\u0026space;x_{\u003ct})\" title=\"likelihood : p(x) = p(x_{1})p(x_{2}|x_{1})\\cdots p(x_{T}|x_{1},...,x_{T-1}) = \\prod_{t=1}^{T}p(x_{t}| x_{\u003ct})\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://latex.codecogs.com/svg.latex?\\dpi{100}\u0026space;objective:\u0026space;\\underset{\\theta}{max}\\;\u0026space;log\\;p_{\\theta}(x)\u0026space;=\u0026space;\\underset{\\theta}{max}\\;\u0026space;\\sum_{t=1}^{T}log\\;p_{\\theta}(x_{t}|x_{\u003ct})\" title=\"objective: \\underset{\\theta}{max}\\; log\\;p_{\\theta}(x) = \\underset{\\theta}{max}\\; \\sum_{t=1}^{T}log\\;p_{\\theta}(x_{t}|x_{\u003ct})\" /\u003e\n\u003c/p\u003e\n\nBecause an autoregressive language model must run either forward or backward, it can use context from only one direction. It is therefore difficult to capture context in both directions simultaneously.\n\nRNNLM and ELMo are typical examples of autoregressive language models, and **Unidirectional/Bidirectional LSTM language models** are covered in this repo.\n\n- cf. Bidirectional LSTM LM and ELMo use context in both directions. However, only a shallow understanding is possible because they use contexts that are learned independently in each direction.\n- cf. 
For a detailed description of the model architecture, refer to the papers and repos in the Reference section below.\n\n## 1. Build Corpus\n\n### Wikipedia\nWikipedia regularly publishes dumps of its entire content. You can download the Korean Wikipedia dump [here](https://dumps.wikimedia.org/kowiki/) (and the English Wikipedia dump [here](https://dumps.wikimedia.org/enwiki/)).\nWikipedia recommends using `pages-articles.xml.bz2`, which includes only the latest version of every article and is approximately 600 MB compressed (for English, `pages-articles-multistream.xml.bz2`).\n\nYou can use the `wikipedia_ko.sh` script to download the latest Korean Wikipedia dump. For English, use `wikipedia_en.sh`.\n\nexample:\n```\n$ cd build_corpus\n$ chmod +x wikipedia_ko.sh\n$ ./wikipedia_ko.sh\n```\n\nThe dump downloaded by the shell script above is in XML format, so it must be parsed into plain text. The Python script `WikiExtractor.py` in the [attardi/wikiextractor](https://github.com/attardi/wikiextractor) repo extracts and cleans text from the dump.\n\nexample:\n```\n$ git clone https://github.com/attardi/wikiextractor\n$ python wikiextractor/WikiExtractor.py kowiki-latest-pages-articles.xml\n\n$ head -n 4 text/AA/wiki_02\n\u003cdoc id=\"577\" url=\"https://ko.wikipedia.org/wiki?curid=577\" title=\"천문학\"\u003e\n천문학\n\n천문학(天文學, )은 별이나 행성, 혜성, 은하와 같은 천체와, 지구 대기의 ..\n\u003c/doc\u003e\n```\n\nThe extracted text is saved as multiple text files of a fixed size. To combine them, use `build_corpus.py`. 
The output `corpus.txt` contains _4,277,241 sentences, 55,568,030 words_.\n\nexample:\n```\n$ python build_corpus.py \u003e corpus.txt\n$ wc corpus.txt \n4277241  55568030 596460787 corpus.txt\n```\n\nNow, you need to split the corpus to train-set and test-set.\n\n```\n$ cat corpus.txt | shuf \u003e corpus.shuf.txt\n$ head -n 855448 corpus.shuf.txt \u003e corpus.test.txt\n$ tail -n 3421793 corpus.shuf.txt \u003e corpus.train.txt\n$ wc -l corpus.train.txt corpus.test.txt\n  3421793 corpus.train.txt\n   855448 corpus.test.txt\n  4277241 합계\n```\n \n## 2. Preprocessing\n\n### Build Vocab\nOur corpus `corpus.txt` has 55,568,030 words, and 608,221 unique words. If the minimum frequency needed to include a token in the vocabulary is set to 3, the vocabulary contains **_297,773_** unique words.\n\nHere we use the train corpus `corpus.train.txt` to build vocabulary.\nThe vocabulary built by train corpus contains **_557,627_** unique words, and **_271,503_** unique words that appear at least three times.\n\nexample:\n```\n$ python build_vocab.py --corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --min_freq 3 --lower\nNamespace(bos_token='\u003cbos\u003e', corpus='build_corpus/corpus.train.txt', eos_token='\u003ceos\u003e', is_tokenized=False, lower=True, min_freq=3, pad_token='\u003cpad\u003e', tokenizer='mecab', unk_token='\u003cunk\u003e', vocab='vocab.train.pkl')\nVocabulary size:  271503\nVocabulary saved to vocab.train.pkl\n```\n\nSince the vocabulary file is too large(~1.3GB) to upload on this repo, I uploaded it to Google Drive.\n- `vocab.train.pkl` : [[download]](https://drive.google.com/file/d/195kdXPQtiG0eqppH-L2VKoHAcgqCSR1l/view?usp=sharing)\n\n## 3. 
Training\n\n```\n$ python lm_trainer.py -h\nusage: lm_trainer.py [-h] --train_corpus TRAIN_CORPUS --vocab VOCAB\n                     --model_type MODEL_TYPE [--test_corpus TEST_CORPUS]\n                     [--is_tokenized] [--tokenizer TOKENIZER]\n                     [--max_seq_len MAX_SEQ_LEN] [--multi_gpu] [--cuda CUDA]\n                     [--epochs EPOCHS] [--batch_size BATCH_SIZE]\n                     [--clip_value CLIP_VALUE] [--shuffle SHUFFLE]\n                     [--embedding_size EMBEDDING_SIZE]\n                     [--hidden_size HIDDEN_SIZE] [--n_layers N_LAYERS]\n                     [--dropout_p DROPOUT_P]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --train_corpus TRAIN_CORPUS\n  --vocab VOCAB\n  --model_type MODEL_TYPE\n                        Model type selected in the list: LSTM, BiLSTM\n  --test_corpus TEST_CORPUS\n  --is_tokenized        Whether the corpus is already tokenized\n  --tokenizer TOKENIZER\n                        Tokenizer used for input corpus tokenization\n  --max_seq_len MAX_SEQ_LEN\n                        The maximum total input sequence length after\n                        tokenization\n  --multi_gpu           Whether to training with multiple GPU\n  --cuda CUDA           Whether CUDA is currently available\n  --epochs EPOCHS       Total number of training epochs to perform\n  --batch_size BATCH_SIZE\n                        Batch size for training\n  --clip_value CLIP_VALUE\n                        Maximum allowed value of the gradients. 
The gradients\n                        are clipped in the range\n  --shuffle SHUFFLE     Whether to reshuffle at every epoch\n  --embedding_size EMBEDDING_SIZE\n                        Word embedding vector dimension\n  --hidden_size HIDDEN_SIZE\n                        Hidden size of LSTM\n  --n_layers N_LAYERS   Number of layers in LSTM\n  --dropout_p DROPOUT_P\n                        Dropout rate used for dropout layer in LSTM\n```\n\nexample:\n```\n$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --batch_size 16\n```\n\nYou can set your own parameter values via command-line arguments.\n\n### Training with multiple GPUs\n\nTraining a model on a single GPU is not only very slow, it also limits the batch size, model size, and so on.\nTo accelerate training with multiple GPUs and use a larger model, include the `--multi_gpu` flag as shown below. For more details, please check [here](https://github.com/lyeoni/pretraining-for-language-understanding/blob/master/parallel.py).\n\n#### Training Unidirectional LSTM Language Model\nThis example trains a unidirectional LSTM model on the Wikipedia corpus using parallel training on 8 * V100 GPUs.\n\n```\n$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --multi_gpu\nNamespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='LSTM', multi_gpu=True, n_layers=3, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')\n=========MODEL=========\n DataParallelModel(\n  (module): LSTMLM(\n    (embedding): Embedding(271503, 256)\n    (lstm): LSTM(256, 1024, num_layers=3, batch_first=True, dropout=0.2)\n    (fc): Linear(in_features=1024, out_features=512, bias=True)\n    (fc2): Linear(in_features=512, out_features=271503, 
bias=True)\n    (softmax): LogSoftmax()\n  )\n)\n```\n\n#### Training Bidirectional LSTM Language Model\nThis example trains a bidirectional LSTM model on the Wikipedia corpus using parallel training on 8 * V100 GPUs.\n\n```\n$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type BiLSTM --n_layers 1 --multi_gpu\nNamespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='BiLSTM', multi_gpu=True, n_layers=1, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')\n=========MODEL=========\n DataParallelModel(\n  (module): BiLSTMLM(\n    (embedding): Embedding(271503, 256)\n    (lstm): LSTM(256, 1024, batch_first=True, dropout=0.2, bidirectional=True)\n    (fc): Linear(in_features=2048, out_features=1024, bias=True)\n    (fc2): Linear(in_features=1024, out_features=512, bias=True)\n    (fc3): Linear(in_features=512, out_features=271503, bias=True)\n    (softmax): LogSoftmax()\n  )\n)\n```\n\n## 4. Evaluation\n\n### Perplexity\n\nA language model captures the distribution over all possible sentences. The best language model is the one that best predicts an unseen sentence. 
Perplexity is a very common measure of how well a probability distribution predicts unseen sentences.\n\n**_Perplexity_** : _inverse probability of the given sentence, normalized by the number of words (by taking the geometric mean)_\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://latex.codecogs.com/svg.latex?\\dpi{100}\u0026space;PP(W)\u0026space;=\u0026space;P(w_{1},\u0026space;w_{2}...w_{n})^{-\\frac{1}{n}}\u0026space;=\\sqrt[n]{\\frac{1}{P(w_{1},\u0026space;w_{2}...w_{n})}}\" title=\"PP(W) = P(w_{1}, w_{2}...w_{n})^{-\\frac{1}{n}} =\\sqrt[n]{\\frac{1}{P(w_{1}, w_{2}...w_{n})}}\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://latex.codecogs.com/svg.latex?\\dpi{100}\u0026space;Chain\\;\u0026space;rule:\\;\u0026space;PP(W)\u0026space;=\u0026space;\\sqrt[n]{\\prod_{i=1}^{n}\\frac{1}{P(w_{i}|w_{1}...w_{i-1})}}\" title=\"Chain\\; rule:\\; PP(W) = \\sqrt[n]{\\prod_{i=1}^{n}\\frac{1}{P(w_{i}|w_{1}...w_{i-1})}}\" /\u003e\n\u003c/p\u003e\n\nTaking the logarithm of the equation above shows that perplexity is the exponentiated negative average log-likelihood. In other words, maximizing probability is the same as minimizing perplexity.\n\n### Results\n\nPerplexity is the metric used to compare the models below.\nA low perplexity indicates that the probability distribution is good at predicting the sentence.\n\n|Model|Loss|Perplexity|\n|-|-:|-:|\n|Unidirectional-LSTM|3.496|33.037|\n|Bidirectional-LSTM|1.896|6.669|\n|Bidirectional-LSTM-Large (_hidden_size_ = 1024)|1.771|5.887|\n\u003cbr\u003e\n\n\n## Reference\n\n### General\n- [Google DeepMind] [WaveNet: A Generative Model for Raw Audio](https://deepmind.com/blog/wavenet-generative-model-raw-audio/)\n- [Dan Jurafsky] [CS 124: From Languages to Information at Stanford](https://web.stanford.edu/class/cs124/lec/languagemodeling2019.pdf)\n- [attardi/wikiextractor] [WikiExtractor](https://github.com/attardi/wikiextractor)\n\n### Models\n\n#### Unidirectional LSTM LM\n- [DSKSD] [6. 
Recurrent Neural Networks and Language Models](https://nbviewer.jupyter.org/github/DSKSD/DeepNLP-models-Pytorch/blob/master/notebooks/06.RNN-Language-Model.ipynb)\n- [yunjey/pytorch-tutorial] [Language Model (RNN-LM)](https://github.com/yunjey/pytorch-tutorial/blob/master/tutorials/02-intermediate/language_model/main.py)\n- [pytorch/examples] [Word-level language modeling RNN](https://github.com/pytorch/examples/tree/master/word_language_model)\n\n#### Bidirectional LSTM LM\n- [Mousa, Amr, and Björn Schuller] [Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models:A Generative Approach to Sentiment Analysis](https://www.aclweb.org/anthology/E17-1096)\n- [Motoki Wu] [The Bidirectional Language Model](https://medium.com/@plusepsilon/the-bidirectional-language-model-1f3961d1fb27)\n\n### Multi GPU Training\n- [matthew l][PyTorch Multi-GPU 제대로 학습하기](https://medium.com/daangn/pytorch-multi-gpu-%ED%95%99%EC%8A%B5-%EC%A0%9C%EB%8C%80%EB%A1%9C-%ED%95%98%EA%B8%B0-27270617936b)\n- [zhanghang1989/PyTorch-Encoding] [PyTorch-Encoding](https://github.com/zhanghang1989/PyTorch-Encoding)\n, [Issue: How to use the DataParallelCriterion, DataParallelModel](https://github.com/zhanghang1989/PyTorch-Encoding/issues/54)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyeoni%2Fpretraining-for-language-understanding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flyeoni%2Fpretraining-for-language-understanding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flyeoni%2Fpretraining-for-language-understanding/lists"}