{"id":19186763,"url":"https://github.com/gauravbh1010tt/dl-text","last_synced_at":"2025-06-18T19:32:01.837Z","repository":{"id":85243104,"uuid":"126597967","full_name":"GauravBh1010tt/DL-text","owner":"GauravBh1010tt","description":"Text pre-processing library for deep learning (Keras, tensorflow).","archived":false,"fork":false,"pushed_at":"2018-08-06T06:07:23.000Z","size":116,"stargazers_count":117,"open_issues_count":3,"forks_count":22,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-04-20T05:32:36.404Z","etag":null,"topics":["deep-learning","nlp-machine-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GauravBh1010tt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-03-24T12:40:48.000Z","updated_at":"2025-01-20T10:47:51.000Z","dependencies_parsed_at":null,"dependency_job_id":"5595865c-5ee0-4024-bde7-c4029f970125","html_url":"https://github.com/GauravBh1010tt/DL-text","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GauravBh1010tt%2FDL-text","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GauravBh1010tt%2FDL-text/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GauravBh1010tt%2FDL-text/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GauravBh1010tt%2FDL-text/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GauravBh1010tt","download_url":"https://codeload.github.com/Gau
ravBh1010tt/DL-text/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252979698,"owners_count":21835105,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","nlp-machine-learning"],"created_at":"2024-11-09T11:16:38.014Z","updated_at":"2025-05-08T01:23:55.696Z","avatar_url":"https://github.com/GauravBh1010tt.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DL-Text : pre-processing modules for deep learning (Keras, TensorFlow).\nThis repository consists of modules for pre-processing textual data. Examples are also given for training deep models (DNN, CNN, RNN, LSTM). Additional functionality includes:\n- Preparing data for problems like sentiment analysis, sentence contextual similarity, question answering, machine translation, etc.\n- Computing lexical and semantic hand-crafted features like word overlap, n-gram overlap, tf-idf, count features, etc.  Most of these features are used in the following papers:\n  - [External features for community question answering](http://maroo.cs.umass.edu/getpdf.php?id=1281). \n  - [Voltron: A Hybrid System For Answer Validation Based On Lexical And Distance Features](http://alt.qcri.org/semeval2015/cdrom/pdf/SemEval043.pdf). 
\n- Implementation of deep models as described in the following papers (for reproducible code refer to [DeepLearn-repo](https://github.com/GauravBh1010tt/DeepLearn)):\n  - [WIKIQA: A Challenge Dataset for Open-Domain Question Answering](https://aclweb.org/anthology/D15-1237).\n  - [Learning to Rank Short Text Pairs with Convolutional Deep Neural Networks](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.723.6492\u0026rep=rep1\u0026type=pdf).\n  - [Neural-based Approaches for Ranking in Community Question Answering](http://aclweb.org/anthology/S/S16/S16-1128.pdf).\n- Implementation of evaluation metrics such as MAP, MRR, AP@k, BM25, etc.\n\n## Dependencies\n#### The required dependencies are listed in requirements.txt. You can install them manually or by using the following command:\n```bash\n$ pip install -r requirements.txt\n```\n\n## Prepare the data for NLP problems like sentiment analysis.\n#### 1. The data and labels look like this:\n```python\nraw_data = ['this,,, is$$ a positive ..sentence','this is a ((*negative ,,@sentence',\n        'yet another..'' positive$$ sentence','the last one is ...,negative']\nlabels = [1,0,1,0]\n```\nThis type of data is commonly used in sentiment analysis problems. The first step is to clean the data:\n```python\nfrom dl_text import dl\ndata = []\nfor sent in raw_data:\n    data.append(dl.clean(sent))\n    \nprint(data)\n['this is a positive sentence', 'this is a negative sentence', \n'yet another positive sentence', 'the last one is negative']\n```\nOnce the raw data is cleaned, the next step is to prepare it so that it can be passed to the deep models. Use the following function:\n```python\ndata_inp = dl.process_data(sent_l = data, dimx = 10)\n```\n\nThe `process_data` function preprocesses the data so that it can be used with deep models. 
It takes the following parameters:\n```python\nprocess_data(sent_l,sent_r,wordVec_model,dimx,dimy,vocab_size,embedding_dim)\n```\nwhere,\n\n- `sent_l` : data to be passed to the model (use only this parameter for single-channel tasks such as sentiment analysis)\n- `sent_r` : data for the second channel (discussed later)\n- `wordVec_model` : pre-trained word vector embeddings (either GloVe or Word2vec)\n- `dimx` and `dimy` : number of words to be included from `sent_l` and `sent_r` respectively (if a sentence has fewer words than this value, it will be padded with 0, otherwise extra words will be truncated)\n- `vocab_size` : number of unique words to be included in the vocabulary\n- `embedding_dim` : dimensionality of the word embeddings in `wordVec_model`\n\n#### 2. Using pre-trained word vector embeddings\n```python\nfrom dl_text import dl\nimport gensim\n\n# for 50-dim glove embeddings use:\nwordVec_model = dl.loadGloveModel('path_of_the_embeddings/glove.6B.50d.txt')\n\n# for 300 dim word2vec embeddings use: \nwordVec_model = gensim.models.KeyedVectors.load_word2vec_format(\"path/GoogleNews-vectors-negative300.bin.gz\",\n                                                                 binary=True)\n\ndata_inp, embedding_matrix = dl.process_data(sent_l = data, wordVec_model = wordVec_model, dimx = 10)\n```\n\n#### 3. 
Defining deep models\n\n```python\nfrom dl_text import dl\nfrom keras.models import Model\nfrom keras.layers import Input, Dense, Dropout, concatenate, Conv1D, Lambda, Flatten, MaxPooling1D\n\n# fully connected network on top of the flattened word embeddings\ndef model_dnn(dimx, embedding_matrix):\n    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')\n    embed = dl.word2vec_embedding_layer(embedding_matrix)(inpx)\n    flat_embed = Flatten()(embed)\n    nnet_h = Dense(units=10,activation='sigmoid')(flat_embed)\n    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)\n    model = Model([inpx],nnet_out)\n    model.compile(loss='mse',optimizer='adam')\n    return model\n\n# 1-D convolution and max pooling over the word embeddings\ndef model_cnn(dimx, embedding_matrix):\n    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')\n    embed = dl.word2vec_embedding_layer(embedding_matrix)(inpx)\n    sent = Conv1D(filters=3,kernel_size=2,activation='relu')(embed)\n    pool = MaxPooling1D()(sent)\n    flat_embed = Flatten()(pool)\n    nnet_h = Dense(units=10,activation='sigmoid')(flat_embed)\n    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)\n    model = Model([inpx],nnet_out)\n    model.compile(loss='mse',optimizer='adam')\n    return model\n```\n#### 4. Training the models\n\n```python\ndata = ['this is a positive sentence', 'this is a negative sentence', 'yet another positive sentence', 'the last one is negative']\nlabels = [1,0,1,0]\n\ndata_inp, embedding_matrix = dl.process_data(sent_l = data, wordVec_model = wordVec_model, dimx = 10)\n\nmodel = model_dnn(dimx = 10, embedding_matrix = embedding_matrix)\nmodel.fit(data_inp, labels)\n\nmodel = model_cnn(dimx = 10, embedding_matrix = embedding_matrix)\nmodel.fit(data_inp, labels)\n```\n## Prepare the data for NLP problems like computing sentence similarity, question answering, etc.\n#### 1. Creating two channel models\nThese types of models use two data streams. They can be used for NLP tasks such as question answering, sentence similarity computation, etc. 
The data looks like this:\n\n```python\ndata_l = ['this is a positive sentence','this is a negative sentence', \n          'yet another positive sentence', 'the last one is negative']\n          \ndata_r = ['positive words are good, better, best, etc.', 'negative words are bad, sad, etc.', \n          'feeling good', 'sooo depressed.']\n         \nlabels = [1,0,1,0]\n```\nHere, `data_l` and `data_r` can be two sentences for computing sentence similarity, question-answer pairs for a question answering problem, etc.\nLet's define a model for these types of tasks:\n\n``` python\nfrom keras.layers import concatenate\n\ndef model_cnn2(dimx, dimy, embedding_matrix):\n    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')\n    embedx = dl.word2vec_embedding_layer(embedding_matrix)(inpx)\n    inpy = Input(shape=(dimy,),dtype='int32',name='inpy')\n    embedy = dl.word2vec_embedding_layer(embedding_matrix)(inpy)\n    \n    sent_l = Conv1D(filters=3,kernel_size=2,activation='relu')(embedx)\n    sent_r = Conv1D(filters=3,kernel_size=2,activation='relu')(embedy)\n    pool_l = MaxPooling1D()(sent_l)\n    pool_r = MaxPooling1D()(sent_r)\n    \n    # join the two channels (concatenation on the last axis needs dimx == dimy)\n    combine = concatenate([pool_l, pool_r])\n    flat_embed = Flatten()(combine)\n    nnet_h = Dense(units=10,activation='sigmoid')(flat_embed)\n    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)\n    # both channels are inputs to the model\n    model = Model([inpx, inpy],nnet_out)\n    model.compile(loss='mse',optimizer='adam')\n    \n    return model\n```\n#### 2. Training a two channel deep model\n\n```python\n\ndata_inp_l, data_inp_r, embedding_matrix = dl.process_data(sent_l = data_l, sent_r = data_r, \n                                                           wordVec_model = wordVec_model, dimx = 10, dimy = 10)\n\nmodel = model_cnn2(dimx = 10, dimy = 10, embedding_matrix = embedding_matrix)\nmodel.fit([data_inp_l, data_inp_r], labels)\n```\n## Hand-crafted features - These can be used with problems like sentence similarity, question answering, etc. \n#### 1. 
Computing lexical and semantic features.\n```python\n\u003e\u003e\u003e from dl_text import lex_sem_ft\n\n\u003e\u003e\u003e sent1 = 'i like natural language processing'\n\u003e\u003e\u003e sent2 = 'i like deep learning'\n\n\u003e\u003e\u003e lex_sem_ft.tokenize(sent1) # tokenizing a sentence\n['i', 'like', 'natural', 'language', 'processing']\n\n\u003e\u003e\u003e lex_sem_ft.overlap(sent1,sent2) # number of common words\n2\n```\nFunctions currently present in `lex_sem_ft` are:\n- tokenize(sent): Tokenize A Given String\n- length(sent) : Number Of Words In A String (Returns Integer)\n- substringCheck(sent1, sent2) : Whether One String Is A Substring Of The Other (Returns 1 Or 0)\n- overlap(sent1, sent2): Number Of Same Words In Two Sentences (Returns Float)\n- overlapSyn(sent1, sent2): Number Of Synonyms In Two Sentences (Returns Float)\n- train_BOW(lst) : Forming Bag Of Words (BOW) (Returns BOW Dictionary)\n- Sum_BOW(sent, dic) : Sum Of BOW Values For A Sentence (Returns Float)\n- train_bigram(lst) : Training Bigram Model (Returns Dictionary of Dictionaries)\n- sum_bigram(sent, model) : Total Sum Of Bigram Probability Of A Sentence (Returns Float)\n- train_trigram(lst): Training Trigram Model (Returns Dictionary of Dictionaries)\n- sum_trigram(sent, model) : Total Sum Of Trigram Probability Of A Sentence (Returns Float)\n- W2V_train(lst1, lst2) : Word2Vec Training (Returns Vector)\n- W2V_Vec(sent1, sent2, vec) : Returns The Difference Between Word2Vec Sum Of All The Words In Two Sentences (Returns Vec)\n- LDA_train(doc) : Trains LDA Model (Returns Model)\n- LDA(doc1, doc2, lda) : Returns Average Of Probability Of Word Present In LDA Model For Input Document (Returns Float)\n\n#### 2. 
Computing text readability features.\n```python\n\u003e\u003e\u003e from dl_text import rd_ft\n\n\u003e\u003e\u003e sent1 = 'i like natural language processing'\n\u003e\u003e\u003e rd_ft.CPW(sent1) # average characters per word\n6.0\n\u003e\u003e\u003e rd_ft.ED('good','great') # edit distance between two words\n4.0\n```\nFunctions currently present in `rd_ft` are:\n- CPW(text) : Average Characters Per Word In A Sentence (Returns Float)\n- WPS(text) : Number Of Words Per Sentence (Returns Integer)\n- SPW(text) : Average Number Of Syllables In Sentence (Returns Float)\n- LWPS(text) : Long Words In A Sentence (Returns Integer)\n- LWR(text) : Fraction Of Long Words In A Sentence (Returns Float)\n- CWPS(text) : Number Of Complex Words Per Sentence (Returns Float)\n- DaleChall(text) : Dale-Chall Readability Index (Returns Float)\n- ED(s1, s2) : Edit Distance Value For Two Strings (Returns Integer)\n- nouns(text) : Get A List Of Nouns From String (Returns List Of Strings)\n- EditDist_Dist(t1,t2) : Average Edit Distance Value For Two Strings And The Average Edit Distance Between The Nouns Present In Them (Returns Float)\n- LCS_Len(a, b) : Longest Common Subsequence (Returns Integer)\n- LCW(t1, t2) : Length Of Longest Common Subsequence (Returns Integer)\n\n## Training deep models using textual sentences and hand-crafted features.\n#### 1. 
Preparing the data\n```python\nfrom dl_text import dl\nfrom dl_text import lex_sem_ft\nfrom dl_text import rd_ft\n\ndata_l = ['this is a positive sentence','this is a negative sentence', \n          'yet another positive sentence', 'the last one is negative']\ndata_r = ['positive words are good, better, best, etc.', 'negative words are bad, sad, etc.', \n          'feeling good', 'sooo depressed.']\nlabels = [1,0,1,0]\n\nwordVec_model = dl.loadGloveModel('path_of_the_embeddings/glove.6B.50d.txt')\n\nall_feat = []\nfor i,j in zip(data_l, data_r):\n    feat1 = lex_sem_ft.overlap(i, j)\n    feat2 = lex_sem_ft.W2V_Vec(i, j, wordVec_model)\n    feat3 = rd_ft.ED(i, j)\n    feat4 = rd_ft.LCW(i, j)\n    all_feat.append(feat1)\n    all_feat.append(feat2)\n    all_feat.append(feat3)\n    all_feat.append(feat4)\n    \ndata_inp_l, data_inp_r, embedding_matrix = dl.process_data(sent_l = data_l, sent_r = data_r, \n                                                           wordVec_model = wordVec_model, dimx = 10, dimy = 10)\n```\n#### 2. 
Let's define a model that combines external features with a deep model.\n\n``` python\nfrom keras.layers import concatenate\n\ndef model_cnn_ft(dimx, dimy, dimft, embedding_matrix):\n    inpx = Input(shape=(dimx,),dtype='int32',name='inpx')\n    embedx = dl.word2vec_embedding_layer(embedding_matrix)(inpx)\n    inpy = Input(shape=(dimy,),dtype='int32',name='inpy')\n    embedy = dl.word2vec_embedding_layer(embedding_matrix)(inpy)\n    # hand-crafted features are real-valued, so this input is float32\n    inpz = Input(shape=(dimft,),dtype='float32',name='inpz')\n    \n    sent_l = Conv1D(filters=3,kernel_size=2,activation='relu')(embedx)\n    sent_r = Conv1D(filters=3,kernel_size=2,activation='relu')(embedy)\n    pool_l = MaxPooling1D()(sent_l)\n    pool_r = MaxPooling1D()(sent_r)\n    \n    # flatten the pooled feature maps so they can be joined with the feature vector\n    flat_l = Flatten()(pool_l)\n    flat_r = Flatten()(pool_r)\n    combine = concatenate([flat_l, flat_r, inpz])\n    nnet_h = Dense(units=10,activation='sigmoid')(combine)\n    nnet_out = Dense(units=2,activation='sigmoid')(nnet_h)\n    # all three inputs must be passed to the model\n    model = Model([inpx, inpy, inpz],nnet_out)\n    model.compile(loss='mse',optimizer='adam')\n    \n    return model\n```\n#### 3. 
Training the deep model.\n```python\nmodel = model_cnn_ft(dimx = 10, dimy = 10, dimft = len(all_feat), embedding_matrix = embedding_matrix)\nmodel.fit([data_inp_l, data_inp_r, all_feat], labels)\n```\n\n## Evaluation metrics - MAP, MRR, AP@k, etc.\nThe mean average precision (MAP) and mean reciprocal rank (MRR) are computed as:\n\n\u003cimg src=\"https://github.com/GauravBh1010tt/DL-text/blob/master/img.JPG\" width=\"550\"\u003e\n\nIn our implementation we assume that the ground truth is arranged starting with the true labels, followed by the false labels.\n```python\n\u003e\u003e\u003e from dl_text import metrics\n\u003e\u003e\u003e pred = [[0,0,1],[0,0,1]] # we have two queries with 3 answers for each; 1 - relevant, 0 - irrelevant\n\n'''Converting the prediction list to dictionary'''\n\n\u003e\u003e\u003e dict1 = {}\n\u003e\u003e\u003e for i,j in enumerate(pred):\n        dict1[i] = j\n        \n\u003e\u003e\u003e metrics.Map(dict1)\n0.33\n\u003e\u003e\u003e metrics.Mrr(dict1)\n33.33\n\n\u003e\u003e\u003e pred = [[0,1,1],[0,1,0]]\n\u003e\u003e\u003e for i,j in enumerate(pred):\n        dict1[i] = j\n\u003e\u003e\u003e metrics.Map(dict1)\n0.5416666666666666\n\u003e\u003e\u003e metrics.Mrr(dict1)\n50.0\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgauravbh1010tt%2Fdl-text","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgauravbh1010tt%2Fdl-text","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgauravbh1010tt%2Fdl-text/lists"}