{"id":19125326,"url":"https://github.com/brightmart/nlu_sim","last_synced_at":"2025-04-09T12:06:43.508Z","repository":{"id":108167704,"uuid":"131139011","full_name":"brightmart/nlu_sim","owner":"brightmart","description":"all kinds of baseline models for sentence similarity 句子对语义相似度模型","archived":false,"fork":false,"pushed_at":"2018-07-05T17:10:59.000Z","size":17155,"stargazers_count":297,"open_issues_count":2,"forks_count":89,"subscribers_count":16,"default_branch":"master","last_synced_at":"2025-04-02T10:12:47.669Z","etag":null,"topics":["atec","nlu","qa","question-answering","questions-and-answers","semantic-similarity","sentence-similarity","similarity-measurement","word2vec"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/brightmart.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-04-26T10:24:11.000Z","updated_at":"2025-02-19T16:51:08.000Z","dependencies_parsed_at":null,"dependency_job_id":"e13602bf-d06e-485f-8464-2f3b5ddfb37c","html_url":"https://github.com/brightmart/nlu_sim","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Fnlu_sim","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Fnlu_sim/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Fnlu_sim/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/brightmart%2Fnlu_sim/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/b
rightmart","download_url":"https://codeload.github.com/brightmart/nlu_sim/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248036063,"owners_count":21037092,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["atec","nlu","qa","question-answering","questions-and-answers","semantic-similarity","sentence-similarity","similarity-measurement","word2vec"],"created_at":"2024-11-09T05:35:27.342Z","updated_at":"2025-04-09T12:06:43.488Z","avatar_url":"https://github.com/brightmart.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"NLU SIMILARITY\n-------------------------------------------------------------------------\nbaseline models for tasks that model a pair of sentences: semantic textual similarity (STS), natural language inference (NLI), paraphrase identification (PI), and question answering (QA).\n\n\n1.Desc\n-------------------------------------------------------------------------\nthis repository contains models that learn to detect sentence similarity for natural language understanding tasks.\n\nthere are two different kinds of models: \n\n 1) sentence encoding-based models that encode each sentence separately,\n \n 2) joint methods that use the encodings of both sentences together (for cross-features or attention from one sentence to the other).\n \nwe will try to cover both kinds of methods.\n\nto find out more about the task and data, or to enter the AI competition, check here:\n\n \u003ca 
href='https://dc.cloud.alipay.com/index#/topic/data?id=3'\u003ehttps://dc.cloud.alipay.com/index#/topic/data?id=3\u003c/a\u003e\n\n\u003cimg src=\"https://github.com/brightmart/nlu_sim/blob/master/data/nlu_similiarity.jpg\"  width=\"60%\" height=\"60%\" /\u003e\n\n\n\n2.Data Processing: data enhancement and word segmentation strategy\n-------------------------------------------------------------------------\n source data is in a .csv file.\n\n data format: line_no,sentence1,sentence2,label. the 4 columns are separated by \"\\t\"\n\n     001\\t question1\\t question2\\t label\n\n distribution of sentence length. 5 stands for length less than 5; 10 stands for length greater than 5 and less than 10, and so on:\n\n { 5: 0.11388705332181162, 10: 0.6559243633406191, 15: 0.1654043613073756, 20: 0.04325725613785391}\n\n as you can see, most sentences in this task are quite short: shorter than 15 or 20.\n\na.swap sentence 1 and sentence 2\n\n       if sentence 1 and sentence 2 have the same meaning, then sentence 2 and sentence 1 also have the same meaning.\n\n       check: method get_training_data() at data_util.py\n\nb.randomly change the word order of a sentence.\n\n       since the key words of a sentence carry most of its meaning, changing the order of these key words should still convey that meaning;\n\n       however, there may be cases, though not a large percentage, where the meaning of the sentence changes when we change the order of words.\n\n       check: method get_training_data() at data_util.py\n\n  after data enhancement: length of training data: 81922; validation data: 1600; test data: 800; percent of true label: 0.217\n\nc.tokenize style\n\n     you can train the model using characters, words, or pinyin. for example, even if you train this model on pinyin, it can still get pretty reasonable performance.\n\n     tokenize sentence in pinyin: we first tokenize the sentence into words, then convert each word into pinyin. e.g. 
it now becomes: ['nihao', 'wo', 'de', 'pengyou']\n\n\n3.Feature Engineering\n-------------------------------------------------------------------------\nget data mining features given two sentences as strings.\n\n    1) n-gram similarity (BLEU score for n-gram=1,2,3...);\n\n    2) length of each question, difference of length\n\n    3) how many words are the same, how many words are unique\n\n    4) whether question 1,2 start with how/why/when (wei shen me, zenme, ruhe, weihe)\n\n    5) edit distance\n\n    6) cosine similarity using bag of words for sentence representation (combine tfidf with word embedding from word2vec, fasttext)\n\n    7) manhattan_distance,canberra_distance,minkowski_distance,euclidean_distance\n\n  check data_mining_features method under data_util.py\n\n\n4.Imbalance Classification for Skew Data\n-------------------------------------------------------------------------\n   20% of labels are positive, 80% are negative. by predicting negative for all test data, you can get 80% accuracy, but recall is 0%.\n\n   1) if you guess randomly, what is the f1 score for your task?\n\n   by using random numbers + a feed forward (fully connected) layer, f1 score is around 0.34. so if, after lots of work, your model achieves less than 0.4,\n\n   it is obviously not a good model.\n\n   2) how to adjust the weight for each label?\n\n   one way is to calculate validation accuracy for each label after an epoch, and use it as an indicator to adjust the weight. 
set a high weight for a label with low accuracy.\n\n   but for a small dataset, validation accuracy may fluctuate (be unstable), so you can use a moving average of accuracy or set a ceiling value for the weight.\n\n   check weight_boosting.py\n\n\n5.Transfer Learning \u0026 Pretrained Word Embedding\n-------------------------------------------------------------------------\n\n   since this is a small dataset, transfer learning may be helpful.\n\n   option 1):\n\n   download pretrained word embedding (embedding size is 64) at 80g big files; it covers around 90% of the words used in this task. it boosts performance by around 4%.\n\n   choose the .bin file, download from  https://pan.baidu.com/s/1o7MWrnc, password: wzqv\n\n\n   option 2):\n   currently we train word embedding on a 1 million dataset from online finance customers, it has\n\n   around 20k words. of the 8000 unique words in this task's dataset, around 5000 also exist in the external dataset. after merging this task's dataset with the\n\n   external dataset and training word2vec on the result, this coverage increases (2336 do not exist, 5930 exist).\n\n   pretrained word embedding is in data/fasttext_fin_model_50.vec\n\n\n6.Models\n-------------------------------------------------------------------------\n1) DualTextCNN:\n\n      each sentence to be compared is passed to a TextCNN to extract features (e.g. f1, f2), then a multiply operation (f1Wf2, where W is a learnable parameter)\n\n      is used to learn the relationship\n\n2) DualBiLSTM:\n\n      each sentence to be compared is passed to a BiLSTM to extract features (e.g. f1, f2), then a multiply operation (f1Wf2, where W is a learnable parameter)\n\n      is used to learn the relationship\n\n3) DualBiLSTMCNN:\n\n     features from DualTextCNN and DualBiLSTM are concatenated, then sent to a linear classifier.\n\n4) Pure Data Mining:\n\n     features from data mining and deep learning (CNN/RNN) are sent to a fully connected layer, and then to a linear classifier.\n\n     check inference_mix in xxx_model.py\n\n5)ESIM: Enhanced LSTM for Natural Language Inference\n\n        
encode with bi-lstm--\u003elocal inference modeling--\u003eenhancement of local information--\u003ecomposition layer--\u003epooling\n\n\u003cimg src=\"https://github.com/brightmart/nlu_sim/blob/master/data/enhanced_sequential_inference_model.jpg\"  width=\"60%\" height=\"60%\" /\u003e\n\n        \n6) SSE: Shortcut-Stacked Sentence Encoders for Multi-Domain Inference\n        \n        shortcut (or residual connected) stacked encoder. --\u003e\n        \n        multiple layers of bi-lstm as encoder with shortcut or residual connections between layers. --\u003e\n        \n        max-pooling --\u003e\n        \n        apply three matching methods to the two vectors, then concatenate these three match vectors (m) --\u003e\n        \n        feed this final concatenated result m into an MLP layer and use a softmax layer to make the final classification.\n\n\u003cimg src=\"https://github.com/brightmart/nlu_sim/blob/master/data/stacked_shortcut_biLSTM.jpg\"  width=\"60%\" height=\"60%\" /\u003e\n\n7) Bi-lstm with self attention:\n    \n         a) we use bi-lstm to get features for each input in the sentence pair. each result is a sequence whose length is the sentence length.\n         \n         b) self-attention is applied to each sequence to get the final feature. 
now it is a vector.\n         \n         c) concatenate the vectors for input 1 and input 2\n         \n         d) use a classifier to learn \n   \n7.Performance\n-------------------------------------------------------------------------\n   performance on validation dataset (separate from training data):\n\nModel | Epoch | Loss | Accuracy | F1 Score | Precision | Recall |\n--- | --- | --- | --- | --- | --- | --- |\nDualTextCNN | 9 | 0.833 | 0.689 | 0.390 | 0.443 | 0.349 |\nDualTextCNN | test | 0.915 | 0.662 | 0.301 | 0.362 | 0.257 |\nBiLSTM | 5 | 0.783 | 0.656 | 0.453 | 0.668 | 0.342 |\nBiLSTM(pinyin) | 8 | 0.816 | 0.685 | 0.445 | 0.587 | 0.358 |\nBiLSTM(word) | 5 | 0.567 | 0.74 | 0.503 | 0.576 | 0.445 |\nBiLSTMCNN(char) | 3 | 0.696 | 0.767 | 0.380 | 0.311 | 0.487 |\nBiLSTMCNN(char) | 9 | 1.131 | 0.636 | 0.464 | 0.712 | x |\nBiLSTMCNN(word) | 9 | 0.775 | 0.639 | 0.401 | 0.547 | 0.316 |\nBiLSTMCNN(word,noDataEnhance) | 9 | 0.871 | 0.601 | 0.411 | 0.632 | 0.305 |\n\n【DualTextCNN2.word.Validation】Epoch 5\t Loss:0.539\tAcc 0.794\tF1 Score:0.575\tPrecision:0.604\tRecall:0.549\n\n【DualTextCNN2.word.Validation】Epoch 8\t Loss:0.528\tAcc 0.787\tF1 Score:0.550\tPrecision:0.586\tRecall:0.517\n\n【DualTextCNN2.char.Validation】Epoch 6\t Loss:0.557\tAcc 0.766\tF1 Score:0.524\tPrecision:0.580\tRecall:0.478\n\n the f1 scores on validation above may not be fully accurate because of a bug in a previous version. the f1 scores on validation below are correct:\n\n (data mining features + deep learning features (CNN and/or RNN)) + feed forward layer ===\u003e f1 score: 0.55\n\n ESIM[9]===\u003ef1 score: 0.49 (100k training data,epoch5)\n \n SSE[10]:Shortcut-Stacked Sentence Encoder(residual):0.516\n \n SSE[10]:Shortcut-Stacked Sentence Encoder(stacked):0.525, embedding and hidden_size is 64.\n \n Bi_lstm-attention:Bi-lstm with self-attention: 0.449\n\n\n\n8.Error Analysis\n----------------------------------------------------------------\n\n   #1. 
i have already adjusted the label weights, and want to fine-tune them to get the best possible performance. what can i do?\n\n   error analysis! print error cases (with information: target label, predicted label, inputs) when you are doing validation for each epoch.\n\n   randomly sample some error cases (e.g. 30 cases). for these 30 cases, count what percent have a target label of true and what percent have a target label of false.\n\n   if you set the weight for the true label to a very high value, or do not use weights at all, you will get two extremes. change your weight when computing loss so that\n\n   the percentages of target label==true and target label==false among the error cases do not differ much.\n\n   error cases for different weights:\n\n   log1:not use weight at all:                           target label is false: 16; target label is true: 4.\n\n   log2:set weight for true label to a high value(3).    target label is false: 10; target label is true: 20.\n\n   log3:set weight for true label to a middle value(1.3).target label is false: 10; target label is true: 10.\n\n   #2. 
where to check the details of error cases?\n\n   see data/log_predict_error.txt\n\n\n9.Usage\n-------------------------------------------------------------------------\n  to train the model using sample data, run the commands below:\n\n  try mix model in word:\n\n  python -u a1_dual_bilstm_cnn_train.py --model_name=mix --ckpt_dir=dual_mix_word_checkpoint/ --tokenize_style=word --name_scope=mix_word\n\n  try mix model in char:\n  \n  python -u a1_dual_bilstm_cnn_train.py --model_name=mix --ckpt_dir=dual_mix_char_checkpoint/ --tokenize_style=char --name_scope=mix_char\n\n\n  The following arguments are optional:\n\n    --model_name             supported models {mix,shortcut_stacked,esim,dual_bilstm_cnn,dual_bilstm,dual_cnn} [mix]\n\n    --tokenize_style         how to tokenize the data {char,word,pinyin} [char]\n\n    --similiarity_strategy   how to match two features {additive,multiply} [additive]\n    \n    --max_pooling_style      how to do max pooling {chunk_max_pooling,max_pooling,k_max_pooling} [chunk_max_pooling]\n\n\n  to make predictions (and save the result to the file system), run the command below:\n\n  ./run.sh data/test.csv data/target_file.csv\n\n  test.csv is the source file you want predictions for; target_file.csv is the file where the predicted results are saved.\n\n\n\n10.Environment\n-------------------------------------------------------------------------\n   python 2.7 + tensorflow 1.8\n\n   for people using python3, just comment out the three lines below at the beginning of the file:\n\n      import sys\n\n      reload(sys)\n\n      sys.setdefaultencoding('utf-8')\n\n\n11.Model Details\n-------------------------------------------------------------------------\n   1)DualTextCNN building blocks:\n\n         a.tokenize sentence by character--\u003eembedding--\u003eb.multiple convolutions with multiple filters--\u003ec.BN(o)--\u003ed.activation--\u003e\n\n         e.max_pooling--\u003ef.concat features--\u003eg.dropout(o)--\u003eh.fully connected layer(o)\n\n   
2)DualBiLSTM:\n\n        a.tokenize sentence by character--\u003eembedding--\u003eb.BiLSTM--\u003ec.k-maxpooling--\u003ed.similarity strategy(additive or multiply)\n\n   3)DualBiLSTMCNN:\n\n        a.get first part feature using DualTextCNN--\u003eb.get second part feature using DualBiLSTM--\u003ec.concat features--\u003ed.FC(o)--\u003ee.Dropout(o)--\u003eclassifier\n\n   4)Mix: combines data mining features with features from RNN and (or) CNN.\n\n       a. get data mining features like cosine similarity using sum of word embeddings, get features from CNN and/or bi-lstm\n\n       b. combine the two kinds of features\n\n       c. fully connected layer + linear classifier\n\n   5)ESIM: Enhanced LSTM for Natural Language Inference\n\n        a.encode input with bi-lstm;\n\n        b.local inference modeling--\u003ecollected over sequences;\n\n        c.enhancement of local information by subtraction and element-wise multiplication\n\n        d.composition layer(bi-lstm)\n\n        e.max and mean pooling--\u003econcat features\n\n        f.classifier\n\n   check method of inference_esim under xxx_model.py, for more check \u003ca href='https://arxiv.org/pdf/1609.06038.pdf'\u003ehere\u003c/a\u003e\n   \n   6) SSE: Shortcut-Stacked Sentence Encoders for Multi-Domain Inference\n  \n        shortcut(or residual connection) stacked encoder. in the original paper, the input of the next layer is the concatenation of the word embedding \n        \n        and the outputs of all previous layers. similarly, when using residual connections, the input of the next layer is the sum of the previous layer's input and output.\n        \n        in this implementation, we use 3 layers of bi-lstm with residual connection.\n        \n        a.multiple layers of bi-lstm as encoder. 
input of the next layer is the word embedding plus all previous outputs, or use residual connections between layers.\n        \n        b.max-pooling\n        \n        c.apply three matching methods to the two vectors:\n        \n              (i) concatenation (ii) element-wise distance and (iii) element-wise product for these two vectors\n              \n          and then concatenate these three match vectors(m)\n          \n        d.feed this final concatenated result m into an MLP layer and use a softmax layer to make the final classification.\n      \n      \n   check method of inference_shortcut_stacked_bilstm under xxx_model.py. for more check \u003ca href='https://arxiv.org/pdf/1708.02312.pdf'\u003ehere\u003c/a\u003e\n\n\n\n12.TODO\n-------------------------------------------------------------------------\n\n   1) extract more data mining features\n\n   2) use traditional machine learning like xgboost, random forest\n\n   3) use pinyin to tackle mistyped characters\n\n   4) try some classic similarity networks\n\n\n13.Conclusion\n-------------------------------------------------------------------------\n   1) for a small dataset like this, which contains only 40k examples (stage 1), data mining features are crucial.\n\n   2) combine data mining features with other features and send them to the neural network\n\n   3) instead of using a big network, a small network is better for a small dataset.\n\n   4) you cannot rely on deep learning for a small dataset\n\n   5) for skewed data, that is, imbalanced data classification (for example, here 20% of labels are positive), use f1 score to evaluate performance.\n\n   adjusting the weight for each label will significantly improve performance. 
it pushes the model to pay more attention to the labels with higher weight.\n\n\n14.Reference\n-------------------------------------------------------------------------\n  1) \u003ca href='https://arxiv.org/pdf/1408.5882v2.pdf'\u003eTextCNN:Convolutional Neural Networks for Sentence Classification\u003c/a\u003e\n\n  2) A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification\n\n  3) Deep Learning for Chatbots, Part 2 – Implementing a Retrieval-Based Model in Tensorflow, from www.wildml.com\n\n  4) \u003ca href='https://www.kaggle.com/c/quora-question-pairs/discussion/34355'\u003eQuora Question Pairs-Can you identify question pairs that have the same intent?(1st place solution)\u003c/a\u003e\n\n  5) \u003ca href='http://www.sohu.com/a/222501203_717210'\u003eQuora Question Pairs.1st place solution.Chinese translation\u003c/a\u003e\n\n  6) \u003ca href='http://nbviewer.jupyter.org/github/danielfrg/word2vec/blob/master/examples/word2vec.ipynb'\u003eWord2Vec tutorial\u003c/a\u003e\n\n  7) \u003ca href='https://github.com/facebookresearch/fastText'\u003efastText:Bag of Tricks for Efficient Text Classification\u003c/a\u003e\n\n  8) \u003ca href='https://youtu.be/vA1V8A69e9c'\u003eAbhishek Thakur - Is That a Duplicate Quora Question? 
(YouTube talk)\u003c/a\u003e\n\n  9) \u003ca href='https://arxiv.org/pdf/1609.06038.pdf'\u003eESIM:Enhanced LSTM for Natural Language Inference\u003c/a\u003e\n\n  10) \u003ca href='https://arxiv.org/pdf/1708.02312.pdf'\u003eSSE:Shortcut-Stacked Sentence Encoders for Multi-Domain Inference\u003c/a\u003e\n  \n  11) \u003ca href=\"https://nlp.stanford.edu/projects/snli/\"\u003eThe Stanford Natural Language Inference (SNLI) Corpus and state of art models\u003c/a\u003e\n  \n  12) \u003ca href=\"https://arxiv.org/abs/1705.02364\"\u003eSupervised Learning of Universal Sentence Representations from Natural Language Inference Data\u003c/a\u003e\n  \n\nif you are smart or can contribute new ideas, join us.\n\nto be continued. for any problems, contact brightmart@hotmail.com\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrightmart%2Fnlu_sim","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbrightmart%2Fnlu_sim","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbrightmart%2Fnlu_sim/lists"}