{"id":13534968,"url":"https://github.com/brightmart/bert_language_understanding","last_synced_at":"2025-04-13T05:03:15.650Z","repository":{"id":108167111,"uuid":"154396042","full_name":"brightmart/bert_language_understanding","owner":"brightmart","description":"Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN","archived":false,"fork":false,"pushed_at":"2019-01-01T16:54:54.000Z","size":16762,"stargazers_count":964,"open_issues_count":9,"forks_count":211,"subscribers_count":47,"default_branch":"master","last_synced_at":"2025-04-04T05:05:47.428Z","etag":null,"topics":["attention-is-all-you-need","bert-model","document-classification","fasttext","language-model","language-understanding","nlp","pre-training","question-answering","self-attention","text-classification","textcnn","transfer-learning","transformer-encoder"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brightmart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-23T20:58:53.000Z","updated_at":"2025-04-03T22:37:14.000Z","dependencies_parsed_at":null,"dependency_job_id":"abf32c3b-8eb2-4f0e-929c-58dde59e62ea","html_url":"https://github.com/brightmart/bert_language_understanding","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Fbert_language_understanding","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2
Fbert_language_understanding/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Fbert_language_understanding/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Fbert_language_understanding/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/brightmart","download_url":"https://codeload.github.com/brightmart/bert_language_understanding/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248665748,"owners_count":21142123,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["attention-is-all-you-need","bert-model","document-classification","fasttext","language-model","language-understanding","nlp","pre-training","question-answering","self-attention","text-classification","textcnn","transfer-learning","transformer-encoder"],"created_at":"2024-08-01T08:00:47.762Z","updated_at":"2025-04-13T05:03:15.630Z","avatar_url":"https://github.com/brightmart.png","language":"Python","readme":"# Ideas from google's bert for language understanding: Pre-train TextCNN\n\n## Table of Contents\n\n#### 1.Introduction\n\n#### 2.Performance\n\n#### 3.Usage\n\n#### 4.Sample Data, Data Format \n\n#### 5.Suggestion to User\n\n#### 6.Short Description of BERT\n\n#### 7.Long Description of BERT from author\n\n#### 8.Pretrain Language Understanding Task \n\n#### 9.Environment\n\n#### 10.Implementation Details\n\n#### 11.Questions for Better Understanding of Transformer and BERT\n\n#### 12.Toy Task\n\n#### 13.Multi-label Classification 
Task\n\n#### 14.TODO List\n\n#### 15.Conclusion\n\n#### 16.References\n\n\n\n## Introduction\n##### Pre-train is all you need!\n\nBERT recently achieved new state-of-the-art results on more than 10 NLP tasks.\n\nThis is a TensorFlow implementation of Pre-training of Deep Bidirectional Transformers for Language Understanding (BERT) and Attention Is All You Need (Transformer).\n\nUpdate: the majority of the work replicating the main ideas of these two papers is done; there is a clear performance gain from pre-training a model \u0026 fine-tuning it, compared to training the model from scratch.\n\n##### Experiment with pre-training and fine-tuning \nWe ran an experiment replacing BERT's backbone network, swapping the Transformer for a TextCNN, and found that pre-training the model as a masked language model on a large amount of raw data boosts performance notably.\n\nMore generally, we believe the pre-train and fine-tune strategy is independent of both the model and the pre-training task: you can replace the backbone network as you like, and add or define new pre-training tasks beyond masked language modeling and next-sentence prediction. What surprised us is that on a middle-sized data set (say, one million examples), even without external data, a pre-training task such as masked language modeling can boost performance by a big margin, and the model converges faster; sometimes fine-tuning needs only a few epochs.\n\n##### Intention\nWhile there is an open-source implementation of the Transformer (\u003ca href='https://github.com/tensorflow/tensor2tensor'\u003etensor2tensor\u003c/a\u003e) and an official BERT implementation is coming soon, they are (or may be) hard to read and not easy to understand. 
\n\nWe do not intend to replicate the original papers entirely, but to apply their main ideas and solve NLP problems in a better way.\n\nThe majority of the work here was done in another repository last year: \u003ca href='https://github.com/brightmart/text_classification'\u003etext classification\u003c/a\u003e\n\n## Performance \n\nMIDDLE SIZE DATASET(\u003ca href='https://pan.baidu.com/s/1HUzBXB_-zzqv-abWZ74w2Q'\u003ecail2018\u003c/a\u003e, 450k)\n\nModel                        | TextCNN(No-pretrain)| TextCNN(Pretrain-Finetuning)| Gain from pre-train \n---                          | ---                 | ---                         | -----------     \nF1 Score after 1 epoch       |  0.09               | 0.58                        |  0.49\nF1 Score after 5 epochs      |  0.40               | 0.74                        |  0.35\nF1 Score after 7 epochs      |  0.44               | 0.75                        |  0.31\nF1 Score after 35 epochs     |  0.58               | 0.75                        |  0.27\nTraining Loss at beginning   |  284.0              | 84.3                        |  199.7\nValidation Loss after 1 epoch|  13.3               | 1.9                         |  11.4\nValidation Loss after 5 epochs|  6.7               | 1.3                         |  5.4\nTraining time(single gpu)    |  8h                 | 2h                          |  6h\n----------------------------------------------------------------------------------------------\nNotice: \n\n  a. the fine-tuning stage completed training after running just 7 epochs, when the max epoch of 35 was reached; \n\n     in fact, the fine-tuning stage started training from epoch 27, where the pre-train stage ended.\n\n  b. the F1 score reported here is on the validation set, an average of micro and macro F1. \n\n  c. the F1 score after 35 epochs is reported on the test set.\n  \n  d. 
from 450k raw documents, we generated 2 million training examples for the masked language model, \n  \n     and the pre-train stage finished within 5 hours on a single GPU. \n\nfine-tuning after pre-train:\n\n\u003cimg src=\"https://github.com/brightmart/bert_language_understanding/blob/master/data/img/pretrain_completed.jpeg\"  width=\"60%\" height=\"60%\" /\u003e\n\nno pre-train:\n\n\u003cimg src=\"https://github.com/brightmart/bert_language_understanding/blob/master/data/img/no_pretrain.jpeg\"  width=\"60%\" height=\"60%\" /\u003e\n\nSMALL SIZE DATASET(private, 100k)\n\nModel                        | TextCNN(No-pretrain) | TextCNN(Pretrain-Finetuning) | Gain from pre-train \n---                          | ---                  | ---                          | -----------                \nF1 Score after 1 epoch       |  0.44                | 0.57                         |  0.13\nValidation Loss after 1 epoch|  55.1                | 1.0                          |  54.1\nTraining Loss at beginning   |  68.5                | 8.2                          |  60.3\n            \n------------------------------------------------------------------------------------------------\n\n\n\n## Usage\n\nif you want to try BERT with masked-language-model pre-training and fine-tuning, take two steps:\n\n  ##### [step 1] pre-train the masked language model with BERT: \n     \n     python train_bert_lm.py [DONE]\n \n  ##### [step 2] fine-tuning:  \n   \n     python train_bert_fine_tuning.py [DONE]\n    \n   as you can see, even at the start of fine-tuning, just after restoring parameters from the pre-trained model, the loss is already smaller\n   \n   than when training from scratch, and the F1 score is also higher, while a brand-new model may start from 0.\n   \n   Notice: to help you try new ideas quickly, you can set the hyper-parameter test_mode to True: 
it will then load only a small amount of data and start training quickly.\n  \n  \n   ##### [basic step] to handle a classification problem with the Transformer (optional): \n    \n     python train_transform.py [DONE, but a bug prevents it from converging; you are welcome to fix it, email: brightmart@hotmail.com]\n        \n  #### Optional hyper-parameters\n  \n  d_model: dimension of the model. [512]\n  \n  num_layer: number of layers. [6]\n  \n  num_header: number of heads of self-attention. [8]\n  \n  d_k: dimension of Key(K); the dimension of Query(Q) is the same. [64]\n  \n  d_v: dimension of V. [64]\n  \n    the default hyper-parameters are d_model=512,h=8,d_k=d_v=64 (big). if you want to train the model fast, have a small data set, \n    \n    or want to train a small model, use d_model=128,h=8,d_k=d_v=16 (small), or d_model=64,h=8,d_k=d_v=8 (tiny).\n  \n \n## Sample Data \u0026 Data Format\n\n##### for the pre-train stage \neach line is a document (several sentences) or a single sentence; that is, free text you can get easily.\n\ncheck data/bert_train.txt or bert_train2.txt in the zip file.\n\n##### for data used in the fine-tuning stage:\n\ninput and output are on the same line, and each label starts with '__label__'. \n\nthere is a space between the input string and the first label, and labels are also separated by spaces.\n\ne.g. \ntoken1 token2 token3 __label__l1 __label__l5 __label__l3\n\ntoken1 token2 token3 __label__l2 __label__l4\n\ncheck data/bert_train.txt or bert_train2.txt in the zip file.\n\ncheck the 'data' folder for sample data. \u003ca href='https://pan.baidu.com/s/1HUzBXB_-zzqv-abWZ74w2Q'\u003edownload a middle-sized data set here\u003c/a\u003e with 450k examples and 206 classes; each input is a document, the average length is around 300, and one or more labels are associated with each input.\n\n\u003ca href='https://ai.tencent.com/ailab/nlp/embedding.html'\u003edownload pre-trained word embeddings from tencent ailab\u003c/a\u003e \n## Suggestion for User\n\n1. 
things can be easy: \n           \n      1) download the data set (around 200M: 450k examples, with some cache files), unzip it and put it in the data/ folder,\n           \n      2) run step 1 for pre-training, \n           \n      3) and run step 2 for fine-tuning.\n\n2. I finished the above three steps and want better performance. what can I do further? do I need to find a big dataset?\n\n     No. you can generate a big data set yourself for the pre-train stage by downloading some free text; make sure each line \n     \n     is a document or sentence, then replace data/bert_train2.txt with your new data file.\n\n3. what's more?\n\n     try some bigger hyper-parameters or a bigger model (by replacing the backbone network) until the model can observe all of your pre-train data.\n\n     play around with the model in model/bert_cnn_model.py, or check the pre-processing in data_util_hdf5.py.\n\n\n\n## Short Description of BERT:\nPre-train a masked language model and a next-sentence-prediction task on a large corpus, \n\nbased on a multi-layer self-attention model, then fine-tune by adding a classification layer.\n\nAs the BERT model is based on the Transformer, we are currently working on adding the pre-training tasks to this model.\n\n\u003cimg src=\"https://github.com/brightmart/bert_language_understanding/blob/master/data/img/aa3.jpeg\"  width=\"60%\" height=\"60%\" /\u003e\n\n\u003cimg src=\"https://github.com/brightmart/bert_language_understanding/blob/master/data/img/aa4.jpeg\"  width=\"65%\" height=\"65%\" /\u003e\n\n\nNotice: \n cail2018 is around 450k, as linked above.\n\n the training size of the private data set is around 100k, the number of classes is 9, and for each input there exist one or more label(s).\n \n the F1 score for cail2018 is reported as micro F1.\n\n## Long Description of BERT from the author\nThe basic idea is very simple. 
For several years, people have been getting very good results \"pre-training\" DNNs as language models \n\nand then fine-tuning on some downstream NLP task (question answering, natural language inference, sentiment analysis, etc.).\n\nLanguage models are typically left-to-right, e.g.:\n\n    \"the man went to a store\"\n\n     P(the | \u003cs\u003e)*P(man|\u003cs\u003e the)*P(went|\u003cs\u003e the man)*…\n\nThe problem is that for the downstream task you usually don't want a language model; you want the best possible contextual representation of \n\neach word. If each word can only see context to its left, clearly a lot is missing. So one trick people have used is to also train a \n\nright-to-left model, e.g.:\n\n     P(store|\u003c/s\u003e)*P(a|store \u003c/s\u003e)*…\n\nNow you have two representations of each word, one left-to-right and one right-to-left, and you can concatenate them together for your downstream task.\n\nBut intuitively, it would be much better if we could train a single model that was deeply bidirectional.\n\nIt's unfortunately impossible to train a deep bidirectional model like a normal LM, because that would create cycles where words can indirectly\n \n\"see themselves,\" and the predictions become trivial.\n\nWhat we can do instead is the very simple trick used in de-noising auto-encoders, where we mask some percentage of words from the input and \n\nhave to reconstruct those words from context. 
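The masking trick just described can be sketched in a few lines. This is a minimal illustration, not code from this repository: the function name make_masked_example is hypothetical, and the 80/10/10 split between [MASK], a random token, and the unchanged token follows the recipe in the BERT paper.

```python
import random

MASK = "[MASK]"

def make_masked_example(tokens, vocab, mask_prob=0.15, rng=random):
    """Hide a fraction of tokens so the model must reconstruct them
    from context, as in a de-noising auto-encoder.

    Each selected token becomes [MASK] 80% of the time, a random
    vocabulary token 10% of the time, and stays unchanged 10% of
    the time (the BERT recipe)."""
    inputs = list(tokens)
    positions, targets = [], []          # where we masked, and the answers
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:    # leave most tokens alone
            continue
        positions.append(i)
        targets.append(token)
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK             # usual case: hide the token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # sometimes: a random token
        # otherwise: keep the original token unchanged
    return inputs, positions, targets
```

The loss is then computed only at the recorded positions, which is why the positions and the original target tokens are returned alongside the corrupted input.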
We call this a \"masked LM\", but it is often called a Cloze task.\n\n\n## Pretrain Language Understanding Task \n\n### task 1: masked language model\n \n  we feed the input through a deep Transformer encoder and then use the final hidden states corresponding to the masked positions to\n    \n   predict which word was masked, exactly as we would train a language model.\n\n    source_file: each line is a sequence of tokens, and can be a sentence.\n    Input Sequence  : The man went to [MASK] store with [MASK] dog\n    Target Sequence :                  the                his\n    \n   how do we get the last hidden state of the masked position(s)?\n   \n     1) we keep a batch of position indices,\n     2) one-hot them and multiply with the representation of the sequences,\n     3) everything along the second dimension (sequence_length) is 0, and only one place is 1,\n     4) thus we can sum up without losing any information.\n            \n   for more detail, check the mask_language_model method in pretrain_task.py and train_bert_lm.py\n\n### task 2: next sentence prediction\n  \n  many language understanding tasks, like question answering and inference, need to understand the relationship\n  \n  between sentences. however, a language model is only able to understand within a single sentence. 
next sentence\n  \n  prediction is a simple task that helps the model do better on these kinds of tasks.\n  \n  50% of the time the second sentence is the actual next sentence of the first one, and 50% of the time it is not.\n   \n  given two sentences, the model is asked to predict whether the second sentence is the real next sentence of \n  \n  the first one.\n  \n    Input : [CLS] the man went to the store [SEP] he bought a gallon of milk [SEP]\n    Label : Is Next\n\n    Input = [CLS] the man heading to the store [SEP] penguin [MASK] are flight ##less birds [SEP]\n    Label = NotNext\n    \n\u003cimg src=\"https://github.com/brightmart/bert_language_understanding/blob/master/data/img/aa1.jpeg\"  width=\"65%\" height=\"65%\" /\u003e\n\n\u003cimg src=\"https://github.com/brightmart/bert_language_understanding/blob/master/data/img/aa2.jpeg\"  width=\"65%\" height=\"65%\" /\u003e\n\n\n## Environment\npython 3+, tensorflow 1.10\n\n## Implementation Details\n1. what is shared and not shared between the pre-train and fine-tuning stages?\n\n   1) basically, all parameters of the backbone network are shared between the pre-train and fine-tuning stages.\n   \n   2) as we want to share as many parameters as possible, so that during the fine-tuning stage we need to learn as few \n   \n   parameters as possible, we also share the word embeddings between the two stages.\n   \n   3) therefore most parameters are already learned at the beginning of the fine-tuning stage.\n   \n2. how do we implement the masked language model?\n   \n   to keep things simple, we generate sentences from documents by splitting them into sentences. for each sentence,\n   \n   we truncate and pad it to the same length, then randomly select a word and replace it with [MASK], itself, or a random \n   \n   word.\n   \n3. 
how do we make the fine-tuning stage more efficient without destroying the result and knowledge learned during the pre-train stage?\n   \n   we use a small learning rate during fine-tuning, so that adjustments are made only to a tiny extent.\n\n## Questions for Better Understanding of Transformer and BERT\n\n1. why do we need self-attention?\n\n     self-attention is a new type of network that has recently gained more and more attention. traditionally we use \n\n     RNNs or CNNs to solve problems. however, RNNs are hard to parallelize, and CNNs are not good at modeling position-sensitive tasks.\n\n     self-attention can run in parallel while still modeling long-distance dependencies.\n\n\n2. what is multi-head self-attention, and what do q, k, v stand for?\n\n     multi-head self-attention is self-attention that divides and projects q and k into several different subspaces,\n\n     then does attention in each of them. \n\n     q stands for query, k stands for keys. for a machine translation task, q is the previous hidden state of the decoder, and k represents \n\n     the hidden states of the encoder. each element of k computes a similarity score with q, and softmax is then used\n\n     to normalize the scores into weights. finally, a weighted sum is computed by applying the weights to v.\n\n     but in the self-attention scenario, q, k, v are all the same: the representation of the input sequence of a task.\n\n3. what is the position-wise feed-forward layer?\n\n     it is a feed-forward layer, also called a fully connected (FC) layer. but since in the Transformer all inputs and outputs of the \n\n     layers are sequences of vectors of shape [sequence_length, d_model], and an FC layer is usually applied to a single vector, here\n\n     the same FC layer is applied to each time step independently.\n\n4. what is the main contribution of BERT?\n\n     while pre-training tasks have existed for many years, BERT introduces a new (so-called bidirectional) way to do language modeling \n\n     and use it for downstream tasks, and data for language modeling is everywhere. 
it proved to be powerful, and hence it reshaped \n\n     the NLP world.\n\n5. why do the authors use three different types of tokens when generating training data for the masked language model?\n\n   the authors believe that since there is no [MASK] token in the fine-tuning stage, always masking would create a mismatch between pre-training and fine-tuning.\n\n   it also forces the model to attend to all the context information in a sentence.\n\n6. what made the BERT model achieve new state-of-the-art results on language understanding tasks?\n\n   Big model, big computation, and, most importantly, a new algorithm: pre-training the model using free-text data.\n\n## Toy Task\n\nthe toy task is used to check whether the model works properly without depending on real data.\n\nit asks the model to count numbers and sum up all inputs, and a threshold is used: \n\nif the summation is greater (or less) than the threshold, the model must predict 1 (or 0).\n\ninside model/transform_model.py, there are train and predict methods. \n\nfirst you can run train() to start training, then run predict() to make predictions with the trained model. \n\nas the model is pretty big with the default hyper-parameters (d_model=512, h=8, d_v=d_k=64, num_layer=6), it requires lots of data before it can converge:\n\nat least 10k steps are needed before the loss becomes less than 0.1. if you want to train it fast with a small amount of\n\ndata, you can use the small set of hyper-parameters (d_model=128, h=8, d_v=d_k=16, num_layer=6)\n\n\n## Multi-label Classification Task with Transformer and BERT\nyou can use it to solve binary classification, multi-class classification or multi-label classification problems.\n\nit will print the loss during training, and print the F1 score for each epoch during validation.\n\n## TODO List\n1. fix a bug in the Transformer [IMPORTANT, recruiting a team member; needs a merge request] \n\n   (Transformer: why does the loss of the pre-train stage decrease early on, but still not become small (e.g. loss=8.0)? even with\n\n   more pre-train data, the loss is still not small)\n\n2. 
support sentence pair tasks [IMPORTANT, recruiting a team member; needs a merge request]\n\n3. add the pre-train task of next sentence prediction [IMPORTANT, recruiting a team member; needs a merge request]\n\n4. need a data set for sentiment analysis or text classification in English [IMPORTANT, recruiting a team member; needs a merge request]\n\n5. position embeddings are not yet shared between pre-training and fine-tuning, since the sequence length in the pre-train stage may be \n\n     shorter than in the fine-tuning stage.\n     \n6. special handling of the first token [CLS] as input for classification [DONE]\n\n7. pre-train with fine-tuning: need to load the token vocabulary from the pre-train stage, but the labels from the real task. [DONE]\n\n8. the learning rate should be smaller when fine-tuning. [DONE]\n\n\n## Conclusions\n\n1. pre-train is all you need. while using the Transformer or some other complex deep model can help you achieve top performance\n\n   on some tasks, pre-training even another model such as a TextCNN on a huge amount of raw data and then fine-tuning it on your task-specific data set \n\n   will always help you gain additional performance.\n\nIf you have a suggestion or problem, or want to make a contribution, you are welcome to contact me: brightmart@hotmail.com\n\n## References\n1. \u003ca href='https://arxiv.org/abs/1706.03762'\u003eAttention Is All You Need\u003c/a\u003e\n\n2. \u003ca href='https://arxiv.org/abs/1810.04805'\u003eBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding\u003c/a\u003e\n\n3. \u003ca href='https://github.com/tensorflow/tensor2tensor'\u003eTensor2Tensor for Neural Machine Translation\u003c/a\u003e\n\n4. \u003ca href='https://arxiv.org/abs/1408.5882'\u003eConvolutional Neural Networks for Sentence Classification\u003c/a\u003e\n\n5. 
\u003ca href='https://arxiv.org/pdf/1807.02478.pdf'\u003eCAIL2018: A Large-Scale Legal Dataset for Judgment Prediction\u003c/a\u003e\n\n","funding_links":[],"categories":["other resources for BERT:","Other Resources","Python","Repository \u0026 Tool"],"sub_categories":["Other","Weakly-supervised \u0026 Semi-supervised Learning"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrightmart%2Fbert_language_understanding","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrightmart%2Fbert_language_understanding","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrightmart%2Fbert_language_understanding/lists"}