{"id":15034803,"url":"https://github.com/kk7nc/text_classification","last_synced_at":"2025-05-14T06:14:05.467Z","repository":{"id":40897148,"uuid":"139912879","full_name":"kk7nc/Text_Classification","owner":"kk7nc","description":"Text Classification Algorithms: A Survey","archived":false,"fork":false,"pushed_at":"2025-04-01T00:35:13.000Z","size":14464,"stargazers_count":1810,"open_issues_count":2,"forks_count":543,"subscribers_count":72,"default_branch":"master","last_synced_at":"2025-04-11T01:41:48.053Z","etag":null,"topics":["boosting-algorithms","conditional-random-fields","convolutional-neural-networks","decision-trees","deep-belief-network","deep-learning","deep-neural-network","dimensionality-reduction","document-classification","hierarchical-attention-networks","k-nearest-neighbours","logistic-regression","naive-bayes-classifier","nlp-machine-learning","random-forest","recurrent-neural-networks","rocchio-algorithm","support-vector-machines","text-classification","text-processing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kk7nc.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":"CONTRIBUTING.rst","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2018-07-06T00:10:18.000Z","updated_at":"2025-04-03T14:52:40.000Z","dependencies_parsed_at":"2024-12-19T01:02:25.368Z","dependency_job_id":"e9021723-ddad-4e5e-a4ec-843ed88e10cd","html_url":"https://github.com/kk7nc/Text_Classification","commit_stats":{"total_commits":223,"total_committers":9,"mean_commits":24.77777777777778,"dds":"0.05381165919282
5156","last_synced_commit":"7092ca64619b305e0f184ceb6fd0341f5c16f57d"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kk7nc%2FText_Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kk7nc%2FText_Classification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kk7nc%2FText_Classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kk7nc%2FText_Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kk7nc","download_url":"https://codeload.github.com/kk7nc/Text_Classification/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254083809,"owners_count":22011902,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["boosting-algorithms","conditional-random-fields","convolutional-neural-networks","decision-trees","deep-belief-network","deep-learning","deep-neural-network","dimensionality-reduction","document-classification","hierarchical-attention-networks","k-nearest-neighbours","logistic-regression","naive-bayes-classifier","nlp-machine-learning","random-forest","recurrent-neural-networks","rocchio-algorithm","support-vector-machines","text-classification","text-processing"],"created_at":"2024-09-24T20:26:23.618Z","updated_at":"2025-05-14T06:14:05.445Z","avatar_url":"https://github.com/kk7nc.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"read
me":"\n################################################\nText Classification Algorithms: A Survey\n################################################\n\n|UniversityCube| |DOI| |Best| |medium| |mendeley| |contributions-welcome| |arXiv| |ansicolortags| |contributors| |twitter|\n  \n  \n.. figure:: docs/pic/WordArt.png \n \n \n Referenced paper : `Text Classification Algorithms: A Survey \u003chttps://arxiv.org/abs/1904.08067\u003e`__    \n \n|BPW|  \n\n\n\n##################\nTable of Contents\n##################\n.. contents::\n  :local:\n  :depth: 4\n\n============\nIntroduction\n============\n\n.. figure:: docs/pic/OverviewTextClassification.png \n \n    \n    \n====================================\nText and Document Feature Extraction\n====================================\n\n----\n\n\nText feature extraction and pre-processing are critical steps for classification algorithms. This section starts with text cleaning, since most documents contain a lot of noise, and then discusses two primary methods of text feature extraction: word embedding and weighted words.\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nText Cleaning and Pre-processing\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIn Natural Language Processing (NLP), most text and documents contain many words that are redundant for text classification, such as stop words, misspellings, and slang. In this section, we briefly explain some techniques and methods for cleaning and pre-processing text documents. In many algorithms, such as statistical and probabilistic learning methods, noise and unnecessary features can negatively affect the overall performance. Eliminating these features is therefore extremely important.\n\n\n-------------\nTokenization\n-------------\n\nTokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens. The main goal of this step is to extract individual words in a sentence. 
In text mining, as in text classification, the pipeline needs a parser that tokenizes the documents; for example:\n\nsentence:\n\n.. code::\n\n  After sleeping for four hours, he decided to sleep for another four\n\n\nIn this case, the tokens are as follows:\n\n.. code::\n\n    {'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}\n\n\nHere is Python code for tokenization:\n\n.. code:: python\n\n  from nltk.tokenize import word_tokenize\n  text = \"After sleeping for four hours, he decided to sleep for another four\"\n  tokens = word_tokenize(text)\n  print(tokens)\n\n-----------\nStop words\n-----------\n\n\nText and document classification over social media, such as Twitter and Facebook, is usually affected by the noisy nature (abbreviations, irregular forms) of the text corpora.\n\nHere is an example from `geeksforgeeks \u003chttps://www.geeksforgeeks.org/removing-stop-words-nltk-python/\u003e`__\n\n.. code:: python\n\n  from nltk.corpus import stopwords\n  from nltk.tokenize import word_tokenize\n\n  example_sent = \"This is a sample sentence, showing off the stop words filtration.\"\n\n  # NLTK's English stop word list is lowercase\n  stop_words = set(stopwords.words('english'))\n\n  word_tokens = word_tokenize(example_sent)\n\n  # the membership test is case-sensitive, so 'This' is kept\n  filtered_sentence = [w for w in word_tokens if w not in stop_words]\n\n  print(word_tokens)\n  print(filtered_sentence)\n\n\n\nOutput:\n\n.. code:: python \n\n  ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', \n  'off', 'the', 'stop', 'words', 'filtration', '.']\n  ['This', 'sample', 'sentence', ',', 'showing', 'stop',\n  'words', 'filtration', '.']\n\n\n---------------\nCapitalization\n---------------\n\nSentences can contain a mixture of uppercase and lower case letters. Multiple sentences make up a text document. 
To reduce the problem space, the most common approach is to convert everything to lower case. This maps all words in a document into the same feature space, but it can change the meaning of some words, such as \"US\" to \"us\", where the first represents the United States of America and the second is a pronoun. To solve this, slang and abbreviation converters can be applied.\n\n.. code:: python\n\n  text = \"The United States of America (USA) or America, is a federal republic composed of 50 states\"\n  print(text)\n  print(text.lower())\n\nOutput:\n\n.. code::\n\n  The United States of America (USA) or America, is a federal republic composed of 50 states\n  the united states of america (usa) or america, is a federal republic composed of 50 states\n\n------------------------\nSlangs and Abbreviations\n------------------------\n\nSlang and abbreviations can cause problems during the pre-processing steps. An abbreviation is a shortened form of a word or phrase, such as SVM, which stands for Support Vector Machine. Slang is informal language whose meaning differs from the literal one, such as \"lost the plot\", which essentially means \"they've gone mad\". A common way to deal with these words is to convert them to formal language.\n\n---------------\nNoise Removal\n---------------\n\n\nAnother part of text cleaning as a pre-processing step is noise removal. Text documents generally contain characters such as punctuation or special characters that are not necessary for text mining or classification purposes. Although punctuation is critical for understanding the meaning of a sentence, it can negatively affect classification algorithms.\n\n\nHere is simple code to remove standard noise from text:\n\n\n.. 
code:: python\n\n  import re\n\n  def text_cleaner(text):\n      rules = [\n          {r'\u003e\\s+': u'\u003e'},  # remove spaces after a tag opens or closes\n          {r'\\s+': u' '},  # replace consecutive spaces\n          {r'\\s*\u003cbr\\s*/?\u003e\\s*': u'\\n'},  # newline after a \u003cbr\u003e\n          {r'\u003c/(div)\\s*\u003e\\s*': u'\\n'},  # newline after \u003c/div\u003e\n          {r'\u003c/(p|h\\d)\\s*\u003e\\s*': u'\\n\\n'},  # blank line after \u003c/p\u003e and closing heading tags\n          {r'\u003chead\u003e.*\u003c\\s*(/head|body)[^\u003e]*\u003e': u''},  # remove \u003chead\u003e to \u003c/head\u003e\n          {r'\u003ca\\s+href=\"([^\"]+)\"[^\u003e]*\u003e.*\u003c/a\u003e': r'\\1'},  # show links instead of texts\n          {r'[ \\t]*\u003c[^\u003c]*?/?\u003e': u''},  # remove remaining tags\n          {r'^\\s+': u''}  # remove spaces at the beginning\n      ]\n      # apply the rules in order; each rule is a one-entry dict {pattern: replacement}\n      for rule in rules:\n          for (k, v) in rule.items():\n              regex = re.compile(k)\n              text = regex.sub(v, text)\n      text = text.rstrip()\n      return text.lower()\n    \n\n\n-------------------\nSpelling Correction\n-------------------\n\n\nAn optional part of the pre-processing step is correcting misspelled words. Different techniques, such as hashing-based and context-sensitive spelling correction, or spelling correction using a trie and Damerau-Levenshtein distance bigram, have been introduced to tackle this issue.\n\n\n.. code:: python\n\n  # written for the legacy autocorrect API; newer releases expose a Speller class instead of spell\n  from autocorrect import spell\n\n  print(spell('caaaar'))\n  print(spell('mussage'))\n  print(spell('survice'))\n  print(spell('hte'))\n\nResult:\n\n.. code::\n\n    caesar\n    message\n    service\n    the\n\n\n------------\nStemming\n------------\n\n\nText stemming modifies a word to obtain its variants using linguistic processes such as affixation (the addition of affixes). 
For example, the stem of the word \"studying\" is \"study\", to which the suffix \"-ing\" is attached.\n\n\nHere is an example of stemming from `NLTK \u003chttps://pythonprogramming.net/stemming-nltk-tutorial/\u003e`__\n\n.. code:: python\n\n    from nltk.stem import PorterStemmer\n\n    ps = PorterStemmer()\n\n    example_words = [\"python\",\"pythoner\",\"pythoning\",\"pythoned\",\"pythonly\"]\n    \n    # print the Porter stem of each word\n    for w in example_words:\n        print(ps.stem(w))\n\n\nResult:\n\n.. code::\n\n  python\n  python\n  python\n  python\n  pythonli\n\n-------------\nLemmatization\n-------------\n\n\nText lemmatization is the process of eliminating the redundant prefix or suffix of a word and extracting the base word (lemma).\n\n\n.. code:: python\n\n  from nltk.stem import WordNetLemmatizer\n\n  lemmatizer = WordNetLemmatizer()\n\n  print(lemmatizer.lemmatize(\"cats\"))\n\n~~~~~~~~~~~~~~\nWord Embedding\n~~~~~~~~~~~~~~\n\nDifferent word embedding procedures have been proposed to translate these unigrams into consumable input for machine learning algorithms. A very simple way to perform such embedding is term frequency (TF), where each word is mapped to the number of occurrences of that word in the whole corpus. Other term frequency functions have also been used that represent word frequency as a Boolean or a logarithmically scaled number. Here, each document is converted to a vector of the same length containing the frequency of the words in that document. Although this approach may seem very intuitive, it suffers from the fact that words that are used very commonly in the language might dominate this sort of word representation.\n\n.. 
image:: docs/pic/CBOW.png\n\n\n--------\nWord2Vec\n--------\n\nOriginal from https://code.google.com/p/word2vec/\n\nI’ve copied it to a github project so that I can apply and track community\npatches (starting with capability for Mac OS X\ncompilation).\n\n-  **makefile and some source has been modified for Mac OS X\n   compilation** See\n   https://code.google.com/p/word2vec/issues/detail?id=1#c5\n-  **memory patch for word2vec has been applied** See\n   https://code.google.com/p/word2vec/issues/detail?id=2\n-  Project file layout altered\n\nThere seems to be a segfault in the compute-accuracy utility.\n\nTo get started:\n\n::\n\n   cd scripts \u0026\u0026 ./demo-word.sh\n\nOriginal README text follows:\n\nThis tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can be subsequently used in many natural language processing applications and for further research purposes. \n\n\nThis code provides an implementation of the Continuous Bag-of-Words (CBOW) and\nthe Skip-gram model (SG), as well as several demo scripts.\n\nGiven a text corpus, the word2vec tool learns a vector for every word in\nthe vocabulary using the Continuous Bag-of-Words or the Skip-Gram neural\nnetwork architectures. The user should specify the following:\ndesired vector dimensionality, size of the context window for\neither the Skip-Gram or the Continuous Bag-of-Words model, training\nalgorithm (hierarchical softmax and / or negative sampling), threshold\nfor downsampling the frequent words, number of threads to use, and\nformat of the output word vector file (text or binary).\n\nUsually, other hyper-parameters, such as the learning rate, do not\nneed to be tuned for different training sets.\n\nThe script demo-word.sh downloads a small (100MB) text corpus from the\nweb, and trains a small word vector model. 
After the training is\nfinished, users can interactively explore the similarity of the\nwords.\n\nMore information about the scripts is provided at\nhttps://code.google.com/p/word2vec/\n\n\n----------------------------------------------\nGlobal Vectors for Word Representation (GloVe)\n----------------------------------------------\n\n.. image:: /docs/pic/Glove.PNG\n\nThe GloVe authors provide an implementation of the GloVe model for learning word representations and describe how to download web-dataset vectors or train your own. See the `project page \u003chttp://nlp.stanford.edu/projects/glove/\u003e`__ or the `paper \u003chttp://nlp.stanford.edu/pubs/glove.pdf\u003e`__ for more information on GloVe vectors.\n\n\n------------------------------------\nContextualized Word Representations\n------------------------------------\n\nELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. 
They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.\n\n\n**ELMo representations are:**\n\n-  **Contextual:** The representation for each word depends on the entire context in which it is used.\n-  **Deep:** The word representations combine all layers of a deep pre-trained neural network.\n-  **Character based:** ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.\n\n\n**Tensorflow implementation**\n\nTensorflow implementation of the pretrained biLM used to compute ELMo representations from `\"Deep contextualized word representations\" \u003chttp://arxiv.org/abs/1802.05365\u003e`__.\n\nThis repository supports both training biLMs and using pre-trained models for prediction.\n\nWe also have a pytorch implementation available in `AllenNLP \u003chttp://allennlp.org/\u003e`__.\n\nYou may also find it easier to use the version provided in `Tensorflow Hub \u003chttps://www.tensorflow.org/hub/modules/google/elmo/2\u003e`__ if you just like to make predictions.\n\n**pre-trained models:**\n\nWe have got several pre-trained English language biLMs available for use. Each model is specified with two separate files, a JSON formatted \"options\" file with hyperparameters and a hdf5 formatted file with the model weights. Links to the pre-trained models are available `here \u003chttps://allennlp.org/elmo\u003e`__.\n\nThere are three ways to integrate ELMo representations into a downstream task, depending on your use case.\n\n1. Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.\n2. 
Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive than #1, but is only applicable with a fixed, prescribed vocabulary.\n3. Precompute the representations for your entire dataset and save to a file.\n\nWe have used all of these methods in the past for various use cases. #1 is necessary for evaluating at test time on unseen data (e.g. public SQuAD leaderboard). #2 is a good compromise for large datasets where the size of the file in #3 is unfeasible (SNLI, SQuAD). #3 is a good choice for smaller datasets or in cases where you'd like to use ELMo in other frameworks.\n\nIn all cases, the process roughly follows the same steps. First, create a ``Batcher`` (or ``TokenBatcher`` for #2) to translate tokenized strings to numpy arrays of character (or token) ids. Then, load the pretrained ELMo model (class ``BidirectionalLanguageModel``). Finally, for steps #1 and #2 use ``weight_layers`` to compute the final ELMo representations. For #3, use ``BidirectionalLanguageModel`` to write all the intermediate layers to a file.\n\n\n\n.. figure:: docs/pic/ngram_cnn_highway_1.png\n\n   Architecture of the language model applied to an example sentence [Reference: `arXiv paper \u003chttps://arxiv.org/pdf/1508.06615.pdf\u003e`__].\n\n\n.. figure:: docs/pic/Glove_VS_DCWE.png \n\n--------\nFastText\n--------\n\n.. 
figure:: docs/pic/fasttext-logo-color-web.png\n\nfastText is a library for efficient learning of word representations and sentence classification.\n\n**Github:**  `facebookresearch/fastText \u003chttps://github.com/facebookresearch/fastText\u003e`__\n\n**Models**\n\n-  Recent state-of-the-art `English word vectors \u003chttps://fasttext.cc/docs/en/english-vectors.html\u003e`__.\n-  Word vectors for `157 languages trained on Wikipedia and Crawl \u003chttps://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md\u003e`__.\n-  Models for `language identification \u003chttps://fasttext.cc/docs/en/language-identification.html#content\u003e`__ and `various supervised tasks \u003chttps://fasttext.cc/docs/en/supervised-models.html#content\u003e`__.\n\n**Supplementary data :**\n\n\n-  The preprocessed `YFCC100M data \u003chttps://fasttext.cc/docs/en/dataset.html#content\u003e`__ .\n\n**FAQ**\n\nYou can find `answers to frequently asked questions \u003chttps://fasttext.cc/docs/en/faqs.html#content\u003e`__ on their project `website \u003chttps://fasttext.cc/\u003e`__.\n\n**Cheatsheet**\n\nA `cheatsheet \u003chttps://fasttext.cc/docs/en/cheatsheet.html#content\u003e`__ full of useful one-liners is also provided.\n\n\n\n~~~~~~~~~~~~~~\nWeighted Words\n~~~~~~~~~~~~~~\n\n\n--------------\nTerm frequency\n--------------\n\nTerm frequency, or bag of words, is one of the simplest techniques of text feature extraction. This method counts the occurrences of each word in a document and assigns the counts to the feature space.\n\n\n-----------------------------------------\nTerm Frequency-Inverse Document Frequency\n-----------------------------------------\nThe weight of a term in a document under TF-IDF is given by:\n\n.. image:: docs/eq/tf-idf.gif\n   :width: 10px\n   \nHere N is the number of documents and df(t) is the number of documents containing the term t in the corpus. 
The first part would improve recall and the later would improve the precision of the word embedding. Although tf-idf tries to overcome the problem of common terms in document, it still suffers from some other descriptive limitations. Namely, tf-idf cannot account for the similarity between words in the document since each word is presented as an index. In the recent years, with development of more complex models, such as neural nets, new methods has been presented that can incorporate concepts, such as similarity of words and part of speech tagging. This work uses, word2vec and Glove, two of the most common methods that have been successfully used for deep learning techniques.\n\n\n.. code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    def loadData(X_train, X_test,MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\",str(np.array(X_train).shape[1]),\"features\")\n        return (X_train,X_test)\n   \n   \n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nComparison of Feature Extraction Techniques\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|                **Model**              |                                                                        **Advantages**                                                                    |                                                   **Limitation**                                               
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|            **Weighted Words**         |  * Easy to compute                                                                                                                                       |  * It does not capture the position in the text (syntactic)                                                    |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Easy to compute the similarity between 2 documents using it                                                                                           |  * It does not capture meaning in the text (semantics)                                                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Basic metric to extract the most descriptive terms in a document                                                                                      |                                                                                                                |\n|                                       |                                                                                          
                                                                |  * Common words effect on the results (e.g., “am”, “is”, etc.)                                                 |\n|                                       |  * Works with an unknown word (e.g., New words in languages)                                                                                             |                                                                                                                |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|            **TF-IDF**                 |  * Easy to compute                                                                                                                                       |  * It does not capture the position in the text (syntactic)                                                    |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Easy to compute the similarity between 2 documents using it                                                                                           |  * It does not capture meaning in the text (semantics)              
                                           |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Basic metric to extract the most descriptive terms in a document                                                                                      |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Common words do not affect the results due to IDF (e.g., “am”, “is”, etc.)                                                                            
|                                                                                                                |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|               **Word2Vec**            |  * It captures the position of the words in the text (syntactic)                                                                                         |  * It cannot capture the meaning of the word from the text (fails to capture polysemy)                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * It captures meaning in the words (semantics)                                                                                                          |  * It cannot capture out-of-vocabulary words from corpus                                                       |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|         **GloVe (Pre-Trained)**       |  * It captures the position of the words in the text (syntactic)                                                                                         |  * It cannot capture the meaning of the word from  the text (fails to capture polysemy)                        |\n|                 
                      |                                                                                                                                                          |                                                                                                                |\n|                                       |  * It captures meaning in the words (semantics)                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Memory consumption for storage                                                                              |\n|                                       |  * Trained on huge corpus                                                                                                                                |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from corpus                                                       
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|           **GloVe (Trained)**         |  * It is very straightforward, e.g., to enforce the word vectors to capture sub-linear relationships in the vector space (performs better than Word2vec) |  * Memory consumption for storage                                                                              |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Lower weight for highly frequent word pairs, such as stop words like “am”, “is”, etc. 
Will not dominate training progress                             |  * Needs huge corpus to learn                                                                                  |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from the corpus                                                   |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture the meaning of the word from  the text (fails to capture polysemy)                        |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|               **FastText**            |  * Works for rare words (rare in their character n-grams which are still shared with other words                                                         |  * It cannot capture the meaning of the word from the text (fails 
to capture polysemy)                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Memory consumption for storage                                                                              |\n|                                       |  * Solves out of vocabulary words with n-gram in character level                                                                                         |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Computationally is more expensive in comparing with GloVe and Word2Vec                                      |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|**Contextualized Word Representations**|  * It captures the meaning of the word from the text (incorporates context, handling polysemy)                                                           |  * Memory consumption for storage                                                                              |\n|                                       |                                             
                                                                                                             |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Improves performance notably on downstream tasks. Computationally is more expensive in comparison to others |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Needs another word embedding for all LSTM and feedforward layers                                            |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from a corpus                                                     |\n|                                       |                                                                                                                                                          |                        
                                                                                        |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Works only sentence and document level (it cannot work for individual word level)                           |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n\n\n========================\nDimensionality Reduction\n========================\n\n----\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nPrincipal Component Analysis (PCA)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nPrinciple component analysis~(PCA) is the most popular technique in multivariate analysis and dimensionality reduction. PCA is a method to identify a subspace in which the data approximately lies. This means finding new variables that are uncorrelated and maximizing the variance to preserve as much variability as possible.\n\n\nExample of PCA on text dataset (20newsgroups) from  tf-idf with 75000 features to 2000 components:\n\n.. 
code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n    from sklearn.decomposition import PCA\n    pca = PCA(n_components=2000)\n    X_train_new = pca.fit_transform(X_train)\n    X_test_new = pca.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n    \n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\noutput:\n\n.. code:: python\n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nLinear Discriminant Analysis (LDA)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nLinear Discriminant Analysis (LDA) is another commonly used technique for data classification and dimensionality reduction. LDA is particularly helpful when the within-class frequencies are unequal. 
Class-dependent and class-independent transformations are the two approaches in LDA; they use the ratio of between-class variance to within-class variance and the ratio of overall variance to within-class variance, respectively. \n\n\n\n.. code:: python\n\n\n  from sklearn.feature_extraction.text import TfidfVectorizer\n  import numpy as np\n  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n\n\n  def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n      vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n      X_train = vectorizer_x.fit_transform(X_train).toarray()\n      X_test = vectorizer_x.transform(X_test).toarray()\n      print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n      return (X_train, X_test)\n\n\n  from sklearn.datasets import fetch_20newsgroups\n\n  newsgroups_train = fetch_20newsgroups(subset='train')\n  newsgroups_test = fetch_20newsgroups(subset='test')\n  X_train = newsgroups_train.data\n  X_test = newsgroups_test.data\n  y_train = newsgroups_train.target\n  y_test = newsgroups_test.target\n\n  X_train,X_test = TFIDF(X_train,X_test)\n\n\n\n  LDA = LinearDiscriminantAnalysis(n_components=15)\n  # fit on the labeled training data, then project both splits\n  LDA.fit(X_train,y_train)\n  X_train_new =  LDA.transform(X_train)\n  X_test_new = LDA.transform(X_test)\n\n  print(\"train with old features: \",np.array(X_train).shape)\n  print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n  print(\"test with old features: \",np.array(X_test).shape)\n  print(\"test with new features:\" ,np.array(X_test_new).shape)\n\n\noutput:\n\n.. code:: \n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 15)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 15)\n    \n    \n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nNon-negative Matrix Factorization (NMF)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n.. 
code:: python\n\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn.decomposition import NMF\n\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n\n\n    NMF_ = NMF(n_components=2000)\n    # fit on the training data only, then project both splits\n    NMF_.fit(X_train)\n    X_train_new =  NMF_.transform(X_train)\n    X_test_new = NMF_.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\noutput:\n\n.. code:: \n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n    \n    \n\n~~~~~~~~~~~~~~~~~\nRandom Projection\n~~~~~~~~~~~~~~~~~\nRandom projection (or random features) is a dimensionality reduction technique mostly used for very large datasets or very high-dimensional feature spaces. 
Text and documents, especially with weighted feature extraction, can contain a huge number of underlying features.\nMany researchers have applied random projection to text data for text mining, text classification, and dimensionality reduction.\nThe example below applies a Gaussian random projection to tf-idf features. \n\n\n.. image:: docs/pic/Random%20Projection.png\n\n.. code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n    from sklearn import random_projection\n\n    RandomProjection = random_projection.GaussianRandomProjection(n_components=2000)\n    X_train_new = RandomProjection.fit_transform(X_train)\n    X_test_new = RandomProjection.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\noutput:\n\n.. 
code:: python\n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n    \n~~~~~~~~~~~\nAutoencoder\n~~~~~~~~~~~\n\n\nAn autoencoder is a neural network that is trained to reproduce its input at its output. Autoencoders have achieved great success as dimensionality reduction methods thanks to the representational power of neural networks. The main idea is that a hidden layer with fewer neurons than the input and output layers can be used to reduce the dimension of the feature space. Especially for texts, documents, and sequences that contain many features, an autoencoder can help to process data faster and more efficiently.\n\n\n.. image:: docs/pic/Autoencoder.png\n\n\n\n.. code:: python\n\n  from keras.layers import Input, Dense\n  from keras.models import Model\n\n  # number of input features, e.g. the tf-idf vocabulary size used above (adjust to your data)\n  n = 75000\n\n  # this is the size of our encoded representations\n  encoding_dim = 1500  \n\n  # this is our input placeholder\n  input_layer = Input(shape=(n,))\n  # \"encoded\" is the encoded representation of the input\n  encoded = Dense(encoding_dim, activation='relu')(input_layer)\n  # \"decoded\" is the lossy reconstruction of the input\n  decoded = Dense(n, activation='sigmoid')(encoded)\n\n  # this model maps an input to its reconstruction\n  autoencoder = Model(input_layer, decoded)\n\n  # this model maps an input to its encoded representation\n  encoder = Model(input_layer, encoded)\n  \n\n  encoded_input = Input(shape=(encoding_dim,))\n  # retrieve the last layer of the autoencoder model\n  decoder_layer = autoencoder.layers[-1]\n  # create the decoder model\n  decoder = Model(encoded_input, decoder_layer(encoded_input))\n  \n  autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')\n  \n  \n\nTrain the autoencoder (x_train and x_test are the feature matrices to compress):\n\n\n.. 
code:: python\n\n  autoencoder.fit(x_train, x_train,\n                  epochs=50,\n                  batch_size=256,\n                  shuffle=True,\n                  validation_data=(x_test, x_test))\n                  \n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nT-distributed Stochastic Neighbor Embedding (T-SNE)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n\nT-distributed Stochastic Neighbor Embedding (T-SNE) is a nonlinear dimensionality reduction technique for embedding high-dimensional data in a low-dimensional space, mostly used for visualization. This approach builds on the SNE method of `G. Hinton and S. T. Roweis \u003chttps://www.cs.toronto.edu/~fritz/absps/sne.pdf\u003e`__ . SNE works by converting the high-dimensional Euclidean distances between data points into conditional probabilities which represent similarities.\n\n `Example \u003chttp://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html\u003e`__:\n\n\n.. code:: python\n\n   import numpy as np\n   from sklearn.manifold import TSNE\n   X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])\n   # perplexity must be smaller than the number of samples (4 here)\n   X_embedded = TSNE(n_components=2, perplexity=3).fit_transform(X)\n   X_embedded.shape\n\n\nExample of GloVe and T-SNE for text:\n\n.. image:: docs/pic/TSNE.png\n\n===============================\nText Classification Techniques\n===============================\n\n----\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nRocchio classification\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe first version of the Rocchio algorithm was introduced by Rocchio in 1971 to use relevance feedback in querying full-text databases. Since then many researchers have addressed and developed this technique for text and document classification. This method uses TF-IDF weights for each informative word instead of a set of Boolean features. Using a training set of documents, Rocchio's algorithm builds a prototype vector for each class, which is the average of all training document vectors that belong to that class. 
It then assigns each test document to the class whose prototype vector is most similar to the test document.\n\n\nWhen a nearest centroid classifier is applied to text represented as tf-idf vectors, it is known as the Rocchio classifier.\n\n.. code:: python\n\n    from sklearn.neighbors import NearestCentroid\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', NearestCentroid()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n\n\nOutput:\n\n.. 
code:: python\n\n                  precision    recall  f1-score   support\n\n              0       0.75      0.49      0.60       319\n              1       0.44      0.76      0.56       389\n              2       0.75      0.68      0.71       394\n              3       0.71      0.59      0.65       392\n              4       0.81      0.71      0.76       385\n              5       0.83      0.66      0.74       395\n              6       0.49      0.88      0.63       390\n              7       0.86      0.76      0.80       396\n              8       0.91      0.86      0.89       398\n              9       0.85      0.79      0.82       397\n             10       0.95      0.80      0.87       399\n             11       0.94      0.66      0.78       396\n             12       0.40      0.70      0.51       393\n             13       0.84      0.49      0.62       396\n             14       0.89      0.72      0.80       394\n             15       0.55      0.73      0.63       398\n             16       0.68      0.76      0.71       364\n             17       0.97      0.70      0.81       376\n             18       0.54      0.53      0.53       310\n             19       0.58      0.39      0.47       251\n\n    avg / total       0.74      0.69      0.70      7532\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nBoosting and Bagging\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n---------\nBoosting\n---------\n\n.. image:: docs/pic/Boosting.PNG\n\n\n**Boosting** is an ensemble learning meta-algorithm that primarily reduces bias, and also variance, in supervised learning. It is a family of machine learning algorithms that convert weak learners into strong ones. Boosting is based on the question posed by `Michael Kearns \u003chttps://en.wikipedia.org/wiki/Michael_Kearns_(computer_scientist)\u003e`__  and Leslie Valiant (1988, 1989): can a set of weak learners create a single strong learner? 
A weak learner is defined to be a classifier that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.\n\n\n\n\n.. code:: python\n\n  from sklearn.ensemble import GradientBoostingClassifier\n  from sklearn.pipeline import Pipeline\n  from sklearn import metrics\n  from sklearn.feature_extraction.text import CountVectorizer\n  from sklearn.feature_extraction.text import TfidfTransformer\n  from sklearn.datasets import fetch_20newsgroups\n\n  newsgroups_train = fetch_20newsgroups(subset='train')\n  newsgroups_test = fetch_20newsgroups(subset='test')\n  X_train = newsgroups_train.data\n  X_test = newsgroups_test.data\n  y_train = newsgroups_train.target\n  y_test = newsgroups_test.target\n\n  text_clf = Pipeline([('vect', CountVectorizer()),\n                       ('tfidf', TfidfTransformer()),\n                       ('clf', GradientBoostingClassifier(n_estimators=100)),\n                       ])\n\n  text_clf.fit(X_train, y_train)\n\n\n  predicted = text_clf.predict(X_test)\n\n  print(metrics.classification_report(y_test, predicted))\n\n\nOutput:\n \n.. 
code:: python\n\n               precision    recall  f1-score   support\n            0       0.81      0.66      0.73       319\n            1       0.69      0.70      0.69       389\n            2       0.70      0.68      0.69       394\n            3       0.64      0.72      0.68       392\n            4       0.79      0.79      0.79       385\n            5       0.83      0.64      0.72       395\n            6       0.81      0.84      0.82       390\n            7       0.84      0.75      0.79       396\n            8       0.90      0.86      0.88       398\n            9       0.90      0.85      0.88       397\n           10       0.93      0.86      0.90       399\n           11       0.90      0.81      0.85       396\n           12       0.33      0.69      0.45       393\n           13       0.87      0.72      0.79       396\n           14       0.87      0.84      0.85       394\n           15       0.85      0.87      0.86       398\n           16       0.65      0.78      0.71       364\n           17       0.96      0.74      0.84       376\n           18       0.70      0.55      0.62       310\n           19       0.62      0.56      0.59       251\n\n  avg / total       0.78      0.75      0.76      7532\n\n  \n-------\nBagging\n-------\n\n.. image:: docs/pic/Bagging.PNG\n\n\n.. 
code:: python\n\n    from sklearn.ensemble import BaggingClassifier\n    from sklearn.neighbors import KNeighborsClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', BaggingClassifier(KNeighborsClassifier())),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\nOutput:\n \n.. 
code:: python\n\n               precision    recall  f1-score   support\n            0       0.57      0.74      0.65       319\n            1       0.60      0.56      0.58       389\n            2       0.62      0.54      0.58       394\n            3       0.54      0.57      0.55       392\n            4       0.63      0.54      0.58       385\n            5       0.68      0.62      0.65       395\n            6       0.55      0.46      0.50       390\n            7       0.77      0.67      0.72       396\n            8       0.79      0.82      0.80       398\n            9       0.74      0.77      0.76       397\n           10       0.81      0.86      0.83       399\n           11       0.74      0.85      0.79       396\n           12       0.67      0.49      0.57       393\n           13       0.78      0.51      0.62       396\n           14       0.76      0.78      0.77       394\n           15       0.71      0.81      0.76       398\n           16       0.73      0.73      0.73       364\n           17       0.64      0.79      0.71       376\n           18       0.45      0.69      0.54       310\n           19       0.61      0.54      0.57       251\n\n  avg / total       0.67      0.67      0.67      7532\n  \n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nNaive Bayes Classifier\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nNaïve Bayes text classification has been used in industry\nand academia for a long time (it builds on the theorem of Thomas Bayes,\n1701-1761). This technique\nhas been studied since the 1950s for text and document categorization. The Naive Bayes Classifier (NBC) is a generative\nmodel which is widely used in Information Retrieval. Many researchers have addressed and developed this technique\nfor their applications. We start with the most basic version\nof NBC, which was developed using the term-frequency (Bag of\nWords) feature extraction technique, counting the number of\nwords in documents.\n\n\n.. 
code:: python\n\n    from sklearn.naive_bayes import MultinomialNB\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', MultinomialNB()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n \n \nOutput:\n \n.. code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.80      0.52      0.63       319\n              1       0.81      0.65      0.72       389\n              2       0.82      0.65      0.73       394\n              3       0.67      0.78      0.72       392\n              4       0.86      0.77      0.81       385\n              5       0.89      0.75      0.82       395\n              6       0.93      0.69      0.80       390\n              7       0.85      0.92      0.88       396\n              8       0.94      0.93      0.93       398\n              9       0.92      0.90      0.91       397\n             10       0.89      0.97      0.93       399\n             11       0.59      0.97      0.74       396\n             12       0.84      0.60      0.70       393\n             13       0.92      0.74      0.82       396\n             14       0.84      0.89      0.87       394\n             15       0.44      0.98      0.61       398\n             16       0.64      
0.94      0.76       364\n             17       0.93      0.91      0.92       376\n             18       0.96      0.42      0.58       310\n             19       0.97      0.14      0.24       251\n\n    avg / total       0.82      0.77      0.77      7532\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nK-nearest Neighbor\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nIn machine learning, the k-nearest neighbors algorithm (kNN)\nis a non-parametric technique used for classification.\nThis method has been used in natural language processing (NLP)\nas a text classification technique in many studies over the past\ndecades.\n\n.. image:: docs/pic/KNN.png\n\n.. code:: python\n\n    from sklearn.neighbors import KNeighborsClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', KNeighborsClassifier()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\nOutput:\n\n.. 
code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.43      0.76      0.55       319\n              1       0.50      0.61      0.55       389\n              2       0.56      0.57      0.57       394\n              3       0.53      0.58      0.56       392\n              4       0.59      0.56      0.57       385\n              5       0.69      0.60      0.64       395\n              6       0.58      0.45      0.51       390\n              7       0.75      0.69      0.72       396\n              8       0.84      0.81      0.82       398\n              9       0.77      0.72      0.74       397\n             10       0.85      0.84      0.84       399\n             11       0.76      0.84      0.80       396\n             12       0.70      0.50      0.58       393\n             13       0.82      0.49      0.62       396\n             14       0.79      0.76      0.78       394\n             15       0.75      0.76      0.76       398\n             16       0.70      0.73      0.72       364\n             17       0.62      0.76      0.69       376\n             18       0.55      0.61      0.58       310\n             19       0.56      0.49      0.52       251\n\n    avg / total       0.67      0.66      0.66      7532\n\n\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nSupport Vector Machine (SVM)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nThe original version of SVM was introduced by Vapnik and Chervonenkis in 1963. In the early 1990s, a nonlinear version was introduced by B. E. Boser et al. 
The original SVM was designed for binary classification, but many researchers have extended this technique to multi-class problems.\n\n\nAccording to the scikit-learn documentation, the advantages of support vector machines are:\n\n* Effective in high dimensional spaces.\n* Still effective in cases where the number of dimensions is greater than the number of samples.\n* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.\n* Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.\n\n\nThe disadvantages of support vector machines include:\n\n* If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.\n* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the *Scores and probabilities* section of the scikit-learn documentation).\n\n\n\n.. image:: docs/pic/SVM.png\n\n\n.. 
code:: python\n\n\n    from sklearn.svm import LinearSVC\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', LinearSVC()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\noutput:\n\n\n.. code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.82      0.80      0.81       319\n              1       0.76      0.80      0.78       389\n              2       0.77      0.73      0.75       394\n              3       0.71      0.76      0.74       392\n              4       0.84      0.86      0.85       385\n              5       0.87      0.76      0.81       395\n              6       0.83      0.91      0.87       390\n              7       0.92      0.91      0.91       396\n              8       0.95      0.95      0.95       398\n              9       0.92      0.95      0.93       397\n             10       0.96      0.98      0.97       399\n             11       0.93      0.94      0.93       396\n             12       0.81      0.79      0.80       393\n             13       0.90      0.87      0.88       396\n             14       0.90      0.93      0.92       394\n             15       0.84      0.93      0.88       398\n             16       0.75      0.92      0.82     
  364\n             17       0.97      0.89      0.93       376\n             18       0.82      0.62      0.71       310\n             19       0.75      0.61      0.68       251\n\n    avg / total       0.85      0.85      0.85      7532\n\n\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nDecision Tree\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nOne of the earliest classification algorithms for text and data mining is the decision tree. Decision tree classifiers (DTCs) have been used successfully in many diverse areas of classification. The structure of this technique is a hierarchical decomposition of the data space (training dataset only). The decision tree as a classification task was introduced by `D. Morgan \u003chttp://www.aclweb.org/anthology/P95-1037\u003e`__ and developed by `JR. Quinlan \u003chttps://courses.cs.ut.ee/2009/bayesian-networks/extras/quinlan1986.pdf\u003e`__. The main idea is to build a tree based on the attributes of the data points; the challenge is determining which attribute should be at the parent level and which at the child level. To solve this problem, `De Mantaras \u003chttps://link.springer.com/article/10.1023/A:1022694001379\u003e`__ introduced statistical modeling for feature selection in trees.\n\n\n.. 
code:: python\n\n    from sklearn import tree\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', tree.DecisionTreeClassifier()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\noutput:\n\n\n.. code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.51      0.48      0.49       319\n              1       0.42      0.42      0.42       389\n              2       0.51      0.56      0.53       394\n              3       0.46      0.42      0.44       392\n              4       0.50      0.56      0.53       385\n              5       0.50      0.47      0.48       395\n              6       0.66      0.73      0.69       390\n              7       0.60      0.59      0.59       396\n              8       0.66      0.72      0.69       398\n              9       0.53      0.55      0.54       397\n             10       0.68      0.66      0.67       399\n             11       0.73      0.69      0.71       396\n             12       0.34      0.33      0.33       393\n             13       0.52      0.42      0.46       396\n             14       0.65      0.62      0.63       394\n             15       0.68      0.72      0.70       398\n             16       0.49      0.62      
0.55       364\n             17       0.78      0.60      0.68       376\n             18       0.38      0.38      0.38       310\n             19       0.32      0.32      0.32       251\n\n    avg / total       0.55      0.55      0.55      7532\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nRandom Forest\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nRandom forests, or random decision forests, are an ensemble learning method for text classification. This method was first introduced by `T. Kam Ho \u003chttps://doi.org/10.1109/ICDAR.1995.598994\u003e`__ in 1995 and uses *t* trees in parallel. It was later developed by `L. Breiman \u003chttps://link.springer.com/article/10.1023/A:1010933404324\u003e`__ in 1999, who established convergence for RF in terms of a margin measure.\n\n\n.. image:: docs/pic/RF.png\n\n.. code:: python\n\n    from sklearn.ensemble import RandomForestClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', RandomForestClassifier(n_estimators=100)),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\noutput:\n\n\n.. 
code:: python\n\n\n                    precision    recall  f1-score   support\n\n              0       0.69      0.63      0.66       319\n              1       0.56      0.69      0.62       389\n              2       0.67      0.78      0.72       394\n              3       0.67      0.67      0.67       392\n              4       0.71      0.78      0.74       385\n              5       0.78      0.68      0.73       395\n              6       0.74      0.92      0.82       390\n              7       0.81      0.79      0.80       396\n              8       0.90      0.89      0.90       398\n              9       0.80      0.89      0.84       397\n             10       0.90      0.93      0.91       399\n             11       0.89      0.91      0.90       396\n             12       0.68      0.49      0.57       393\n             13       0.83      0.65      0.73       396\n             14       0.81      0.88      0.84       394\n             15       0.68      0.91      0.78       398\n             16       0.67      0.86      0.75       364\n             17       0.93      0.78      0.85       376\n             18       0.86      0.48      0.61       310\n             19       0.79      0.31      0.45       251\n\n    avg / total       0.77      0.76      0.75      7532\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nConditional Random Field (CRF)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nA Conditional Random Field (CRF) is an undirected graphical model, as shown in the figure. CRFs model the conditional probability of a label sequence *Y* given a sequence of observations *X*, *i.e.* P(Y|X). CRFs can incorporate complex features of the observation sequence without violating the independence assumption, because they model the conditional probability of the label sequence rather than the joint probability P(X,Y). The concepts of a clique, which is a fully connected subgraph, and of a clique potential are used for computing P(Y|X). 
Considering one potential function for each clique of the graph, the probability of a variable configuration corresponds to the product of a series of non-negative potential functions. The value computed by each potential function is equivalent to the probability of the variables in its corresponding clique taking on a particular configuration.\n\n\n.. image:: docs/pic/CRF.png\n\n\nExample from `here \u003chttp://sklearn-crfsuite.readthedocs.io/en/latest/tutorial.html\u003e`__:\nLet’s use the CoNLL 2002 data to build an NER system.\nThe CoNLL 2002 corpus is available in NLTK. We use the Spanish data.\n\n\n.. code:: python\n\n      import nltk\n      import sklearn_crfsuite\n      from sklearn_crfsuite import metrics\n      nltk.corpus.conll2002.fileids()\n      train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))\n      test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))\n      \n      \nsklearn-crfsuite (and python-crfsuite) supports several feature formats; here we use feature dicts.\n\n.. 
code:: python\n\n      def word2features(sent, i):\n          word = sent[i][0]\n          postag = sent[i][1]\n\n          features = {\n              'bias': 1.0,\n              'word.lower()': word.lower(),\n              'word[-3:]': word[-3:],\n              'word[-2:]': word[-2:],\n              'word.isupper()': word.isupper(),\n              'word.istitle()': word.istitle(),\n              'word.isdigit()': word.isdigit(),\n              'postag': postag,\n              'postag[:2]': postag[:2],\n          }\n          if i \u003e 0:\n              word1 = sent[i-1][0]\n              postag1 = sent[i-1][1]\n              features.update({\n                  '-1:word.lower()': word1.lower(),\n                  '-1:word.istitle()': word1.istitle(),\n                  '-1:word.isupper()': word1.isupper(),\n                  '-1:postag': postag1,\n                  '-1:postag[:2]': postag1[:2],\n              })\n          else:\n              features['BOS'] = True\n\n          if i \u003c len(sent)-1:\n              word1 = sent[i+1][0]\n              postag1 = sent[i+1][1]\n              features.update({\n                  '+1:word.lower()': word1.lower(),\n                  '+1:word.istitle()': word1.istitle(),\n                  '+1:word.isupper()': word1.isupper(),\n                  '+1:postag': postag1,\n                  '+1:postag[:2]': postag1[:2],\n              })\n          else:\n              features['EOS'] = True\n\n          return features\n\n\n      def sent2features(sent):\n          return [word2features(sent, i) for i in range(len(sent))]\n\n      def sent2labels(sent):\n          return [label for token, postag, label in sent]\n\n      def sent2tokens(sent):\n          return [token for token, postag, label in sent]\n\n      X_train = [sent2features(s) for s in train_sents]\n      y_train = [sent2labels(s) for s in train_sents]\n\n      X_test = [sent2features(s) for s in test_sents]\n      y_test = [sent2labels(s) for s in 
test_sents]\n\n\nTo see all possible CRF parameters, check its docstring. Here we use the L-BFGS training algorithm (the default) with Elastic Net (L1 + L2) regularization.\n\n\n\n.. code:: python\n\n      crf = sklearn_crfsuite.CRF(\n          algorithm='lbfgs',\n          c1=0.1,\n          c2=0.1,\n          max_iterations=100,\n          all_possible_transitions=True\n      )\n      crf.fit(X_train, y_train)\n\n\nEvaluation\n\n\n.. code:: python\n\n      y_pred = crf.predict(X_test)\n      print(metrics.flat_classification_report(\n          y_test, y_pred,  digits=3\n      ))\n\n\nOutput:\n\n.. code:: python\n\n                     precision    recall  f1-score   support\n\n            B-LOC      0.810     0.784     0.797      1084\n           B-MISC      0.731     0.569     0.640       339\n            B-ORG      0.807     0.832     0.820      1400\n            B-PER      0.850     0.884     0.867       735\n            I-LOC      0.690     0.637     0.662       325\n           I-MISC      0.699     0.589     0.639       557\n            I-ORG      0.852     0.786     0.818      1104\n            I-PER      0.893     0.943     0.917       634\n                O      0.992     0.997     0.994     45355\n\n      avg / total      0.970     0.971     0.971     51533\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nDeep Learning\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n-----------------------------------------\nDeep Neural Networks\n-----------------------------------------\n\nDeep Neural Network (DNN) architectures are designed to learn through multiple connected layers, where each hidden layer receives connections only from the previous layer and provides connections only to the next layer. The input layer connects the feature space (as discussed in the Feature_extraction section) to the first hidden layer. For a DNN, the input could be TF-IDF, word embeddings, etc., as shown for the standard DNN in the figure. 
The output layer houses neurons equal to the number of classes for multi-class classification and only one neuron for binary classification. But our main contribution in this paper is that we have many trained DNNs to serve different purposes. Here, we have multi-class DNNs where each learning model is generated randomly (number of nodes in each layer as well as the number of layers are randomly assigned). Our implementation of Deep Neural Network (DNN) is basically a discriminatively trained model that uses standard back-propagation algorithm and sigmoid or ReLU as activation functions. The output layer for multi-class classification should use Softmax.\n\n\n.. image:: docs/pic/DNN.png\n\nimport packages:\n\n.. code:: python\n\n    from sklearn.datasets import fetch_20newsgroups\n    from keras.layers import  Dropout, Dense\n    from keras.models import Sequential\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n\n\nconvert text to TF-IDF:\n\n.. code:: python\n\n    def TFIDF(X_train, X_test,MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\",str(np.array(X_train).shape[1]),\"features\")\n        return (X_train,X_test)\n\n\nBuild a DNN Model for Text:\n\n.. 
code:: python\n\n    def Build_Model_DNN_Text(shape, nClasses, dropout=0.5):\n        \"\"\"\n        Build_Model_DNN_Text(shape, nClasses, dropout)\n        Build Deep neural networks Model for text classification\n        Shape is input feature space\n        nClasses is number of classes\n        \"\"\"\n        model = Sequential()\n        node = 512 # number of nodes\n        nLayers = 4 # number of hidden layers\n\n        model.add(Dense(node,input_dim=shape,activation='relu'))\n        model.add(Dropout(dropout))\n        for i in range(0,nLayers):\n            model.add(Dense(node,input_dim=node,activation='relu'))\n            model.add(Dropout(dropout))\n        model.add(Dense(nClasses, activation='softmax'))\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                      optimizer='adam',\n                      metrics=['accuracy'])\n\n        return model\n\n\n\nLoad text dataset (20newsgroups):\n\n.. code:: python\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n\n\nrun DNN and see our result:\n\n\n.. code:: python\n\n    X_train_tfidf,X_test_tfidf = TFIDF(X_train,X_test)\n    model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 20)\n    model_DNN.fit(X_train_tfidf, y_train,\n                                  validation_data=(X_test_tfidf, y_test),\n                                  epochs=10,\n                                  batch_size=128,\n                                  verbose=2)\n\n    # class index with the highest softmax probability for each document\n    predicted = np.argmax(model_DNN.predict(X_test_tfidf), axis=1)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\nModel summary:\n\n.. 
code:: python \n\n    _________________________________________________________________\n    Layer (type)                 Output Shape              Param #   \n    =================================================================\n    dense_1 (Dense)              (None, 512)               38400512  \n    _________________________________________________________________\n    dropout_1 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_2 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_2 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_3 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_3 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_4 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_4 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_5 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_5 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_6 (Dense)              (None, 20)                10260     \n    =================================================================\n    Total params: 39,461,396\n    Trainable params: 39,461,396\n    Non-trainable params: 0\n    _________________________________________________________________\n\n\n\nOutput:\n\n.. 
code:: python \n\n        Train on 11314 samples, validate on 7532 samples\n        Epoch 1/10\n         - 16s - loss: 2.7553 - acc: 0.1090 - val_loss: 1.9330 - val_acc: 0.3184\n        Epoch 2/10\n         - 15s - loss: 1.5330 - acc: 0.4222 - val_loss: 1.1546 - val_acc: 0.6204\n        Epoch 3/10\n         - 15s - loss: 0.7438 - acc: 0.7257 - val_loss: 0.8405 - val_acc: 0.7499\n        Epoch 4/10\n         - 15s - loss: 0.2967 - acc: 0.9020 - val_loss: 0.9214 - val_acc: 0.7767\n        Epoch 5/10\n         - 15s - loss: 0.1557 - acc: 0.9543 - val_loss: 0.8965 - val_acc: 0.7917\n        Epoch 6/10\n         - 15s - loss: 0.1015 - acc: 0.9705 - val_loss: 0.9427 - val_acc: 0.7949\n        Epoch 7/10\n         - 15s - loss: 0.0595 - acc: 0.9835 - val_loss: 0.9893 - val_acc: 0.7995\n        Epoch 8/10\n         - 15s - loss: 0.0495 - acc: 0.9866 - val_loss: 0.9512 - val_acc: 0.8079\n        Epoch 9/10\n         - 15s - loss: 0.0437 - acc: 0.9867 - val_loss: 0.9690 - val_acc: 0.8117\n        Epoch 10/10\n         - 15s - loss: 0.0443 - acc: 0.9880 - val_loss: 1.0004 - val_acc: 0.8070\n\n\n                       precision    recall  f1-score   support\n\n                  0       0.76      0.78      0.77       319\n                  1       0.67      0.80      0.73       389\n                  2       0.82      0.63      0.71       394\n                  3       0.76      0.69      0.72       392\n                  4       0.65      0.86      0.74       385\n                  5       0.84      0.75      0.79       395\n                  6       0.82      0.87      0.84       390\n                  7       0.86      0.90      0.88       396\n                  8       0.95      0.91      0.93       398\n                  9       0.91      0.92      0.92       397\n                 10       0.98      0.92      0.95       399\n                 11       0.96      0.85      0.90       396\n                 12       0.71      0.69      0.70       393\n                 13       
0.95      0.70      0.81       396\n                 14       0.86      0.91      0.88       394\n                 15       0.85      0.90      0.87       398\n                 16       0.79      0.84      0.81       364\n                 17       0.99      0.77      0.87       376\n                 18       0.58      0.75      0.65       310\n                 19       0.52      0.60      0.55       251\n\n        avg / total       0.82      0.81      0.81      7532\n\n\n-----------------------------------------\nRecurrent Neural Networks (RNN)\n-----------------------------------------\n\n.. image:: docs/pic/RNN.png\n\nAnother neural network architecture that researchers have used for text mining and classification is the Recurrent Neural Network (RNN). An RNN assigns more weight to the previous data points of a sequence. Therefore, this technique is a powerful method for text, string, and sequential data classification. Moreover, this technique could be used for image classification as we did in this work. In an RNN, the neural net considers the information of previous nodes in a sophisticated way, which allows for better semantic analysis of the structures in the dataset.\n\n\nGated Recurrent Unit (GRU)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nGated Recurrent Unit (GRU) is a gating mechanism for RNNs that was introduced by `J. Chung et al. \u003chttps://arxiv.org/abs/1412.3555\u003e`__ and `K. Cho et al. \u003chttps://arxiv.org/abs/1406.1078\u003e`__. GRU is a simplified variant of the LSTM architecture, with the following differences: a GRU contains two gates and does not possess any internal memory (as shown in the figure), and a second non-linearity (tanh in the figure) is not applied.\n\n.. image:: docs/pic/LSTM.png\n\nLong Short-Term Memory (LSTM)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nLong Short-Term Memory (LSTM) was introduced by `S. Hochreiter and J. 
Schmidhuber \u003chttps://www.mitpressjournals.org/doi/abs/10.1162/neco.1997.9.8.1735\u003e`__ and has since been developed further by many researchers.\n\nLSTM is a special type of RNN that preserves long-term dependencies more effectively than basic RNNs. This is particularly useful for overcoming the vanishing gradient problem. Although an LSTM has a chain-like structure similar to an RNN, it uses multiple gates to carefully regulate the amount of information allowed into each node state. The figure shows the basic cell of an LSTM model.\n\n\n\nimport packages:\n\n.. code:: python\n\n\n    from keras.layers import Dropout, Dense, GRU, Embedding\n    from keras.models import Sequential\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n    from keras.preprocessing.text import Tokenizer\n    from keras.preprocessing.sequence import pad_sequences\n    from sklearn.datasets import fetch_20newsgroups\n\nconvert text to word embedding (Using GloVe):\n\n.. code:: python\n\n    def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):\n        np.random.seed(7)\n        text = np.concatenate((X_train, X_test), axis=0)\n        text = np.array(text)\n        tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n        tokenizer.fit_on_texts(text)\n        sequences = tokenizer.texts_to_sequences(text)\n        word_index = tokenizer.word_index\n        text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n        print('Found %s unique tokens.' 
% len(word_index))\n        indices = np.arange(text.shape[0])\n        # np.random.shuffle(indices)\n        text = text[indices]\n        print(text.shape)\n        X_train = text[0:len(X_train), ]\n        X_test = text[len(X_train):, ]\n        embeddings_index = {}\n        f = open(\".\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\")\n        for line in f:\n\n            values = line.split()\n            word = values[0]\n            try:\n                coefs = np.asarray(values[1:], dtype='float32')\n            except ValueError:\n                # skip malformed lines instead of storing a stale or undefined vector\n                continue\n            embeddings_index[word] = coefs\n        f.close()\n        print('Total %s word vectors.' % len(embeddings_index))\n        return (X_train, X_test, word_index,embeddings_index)\n\nBuild a RNN Model for Text:\n\n.. code:: python\n\n\n    def Build_Model_RNN_Text(word_index, embeddings_index, nclasses,  MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n        \"\"\"\n        def buildModel_RNN(word_index, embeddings_index, nclasses,  MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n        word_index is the word index,\n        embeddings_index is embeddings index, look at data_helper.py\n        nclasses is number of classes,\n        MAX_SEQUENCE_LENGTH is maximum length of text sequences\n        \"\"\"\n\n        model = Sequential()\n        hidden_layer = 3\n        gru_node = 32\n\n        embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n        for word, i in word_index.items():\n            embedding_vector = embeddings_index.get(word)\n            if embedding_vector is not None:\n                # words not found in the embedding index keep their random initialization.\n                if len(embedding_matrix[i]) != len(embedding_vector):\n                    print(\"could not broadcast input array from shape\", str(len(embedding_matrix[i])),\n                          \"into shape\", str(len(embedding_vector)), \" Please make sure your\"\n                                
                                    \" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,\")\n                    exit(1)\n                embedding_matrix[i] = embedding_vector\n        model.add(Embedding(len(word_index) + 1,\n                                    EMBEDDING_DIM,\n                                    weights=[embedding_matrix],\n                                    input_length=MAX_SEQUENCE_LENGTH,\n                                    trainable=True))\n\n\n        print(gru_node)\n        for i in range(0,hidden_layer):\n            model.add(GRU(gru_node,return_sequences=True, recurrent_dropout=0.2))\n            model.add(Dropout(dropout))\n        model.add(GRU(gru_node, recurrent_dropout=0.2))\n        model.add(Dropout(dropout))\n        model.add(Dense(256, activation='relu'))\n        model.add(Dense(nclasses, activation='softmax'))\n\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                          optimizer='adam',\n                          metrics=['accuracy'])\n        return model\n\n\n\n\nrun RNN and see our result:\n\n\n.. code:: python\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)\n\n\n    model_RNN = Build_Model_RNN_Text(word_index,embeddings_index, 20)\n\n    model_RNN.fit(X_train_Glove, y_train,\n                                  validation_data=(X_test_Glove, y_test),\n                                  epochs=10,\n                                  batch_size=128,\n                                  verbose=2)\n\n    predicted = model_RNN.predict_classes(X_test_Glove)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\nModel summary:\n\n.. 
code:: python \n\n    _________________________________________________________________\n    Layer (type)                 Output Shape              Param #   \n    =================================================================\n    embedding_1 (Embedding)      (None, 500, 50)           8960500   \n    _________________________________________________________________\n    gru_1 (GRU)                  (None, 500, 256)          235776    \n    _________________________________________________________________\n    dropout_1 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_2 (GRU)                  (None, 500, 256)          393984    \n    _________________________________________________________________\n    dropout_2 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_3 (GRU)                  (None, 500, 256)          393984    \n    _________________________________________________________________\n    dropout_3 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_4 (GRU)                  (None, 256)               393984    \n    _________________________________________________________________\n    dense_1 (Dense)              (None, 20)                5140      \n    =================================================================\n    Total params: 10,383,368\n    Trainable params: 10,383,368\n    Non-trainable params: 0\n    _________________________________________________________________\n\n\n\nOutput:\n\n.. 
code:: python \n\n    Train on 11314 samples, validate on 7532 samples\n    Epoch 1/20\n     - 268s - loss: 2.5347 - acc: 0.1792 - val_loss: 2.2857 - val_acc: 0.2460\n    Epoch 2/20\n     - 271s - loss: 1.6751 - acc: 0.3999 - val_loss: 1.4972 - val_acc: 0.4660\n    Epoch 3/20\n     - 270s - loss: 1.0945 - acc: 0.6072 - val_loss: 1.3232 - val_acc: 0.5483\n    Epoch 4/20\n     - 269s - loss: 0.7761 - acc: 0.7312 - val_loss: 1.1009 - val_acc: 0.6452\n    Epoch 5/20\n     - 269s - loss: 0.5513 - acc: 0.8112 - val_loss: 1.0395 - val_acc: 0.6832\n    Epoch 6/20\n     - 269s - loss: 0.3765 - acc: 0.8754 - val_loss: 0.9977 - val_acc: 0.7086\n    Epoch 7/20\n     - 270s - loss: 0.2481 - acc: 0.9202 - val_loss: 1.0485 - val_acc: 0.7270\n    Epoch 8/20\n     - 269s - loss: 0.1717 - acc: 0.9463 - val_loss: 1.0269 - val_acc: 0.7394\n    Epoch 9/20\n     - 269s - loss: 0.1130 - acc: 0.9644 - val_loss: 1.1498 - val_acc: 0.7369\n    Epoch 10/20\n     - 269s - loss: 0.0640 - acc: 0.9808 - val_loss: 1.1442 - val_acc: 0.7508\n    Epoch 11/20\n     - 269s - loss: 0.0567 - acc: 0.9828 - val_loss: 1.2318 - val_acc: 0.7414\n    Epoch 12/20\n     - 268s - loss: 0.0472 - acc: 0.9858 - val_loss: 1.2204 - val_acc: 0.7496\n    Epoch 13/20\n     - 269s - loss: 0.0319 - acc: 0.9910 - val_loss: 1.1895 - val_acc: 0.7657\n    Epoch 14/20\n     - 268s - loss: 0.0466 - acc: 0.9853 - val_loss: 1.2821 - val_acc: 0.7517\n    Epoch 15/20\n     - 271s - loss: 0.0269 - acc: 0.9917 - val_loss: 1.2869 - val_acc: 0.7557\n    Epoch 16/20\n     - 271s - loss: 0.0187 - acc: 0.9950 - val_loss: 1.3037 - val_acc: 0.7598\n    Epoch 17/20\n     - 268s - loss: 0.0157 - acc: 0.9959 - val_loss: 1.2974 - val_acc: 0.7638\n    Epoch 18/20\n     - 270s - loss: 0.0121 - acc: 0.9966 - val_loss: 1.3526 - val_acc: 0.7602\n    Epoch 19/20\n     - 269s - loss: 0.0262 - acc: 0.9926 - val_loss: 1.4182 - val_acc: 0.7517\n    Epoch 20/20\n     - 269s - loss: 0.0249 - acc: 0.9918 - val_loss: 1.3453 - val_acc: 0.7638\n\n\n             
      precision    recall  f1-score   support\n\n              0       0.71      0.71      0.71       319\n              1       0.72      0.68      0.70       389\n              2       0.76      0.62      0.69       394\n              3       0.67      0.58      0.62       392\n              4       0.68      0.67      0.68       385\n              5       0.75      0.73      0.74       395\n              6       0.82      0.74      0.78       390\n              7       0.83      0.83      0.83       396\n              8       0.81      0.90      0.86       398\n              9       0.92      0.90      0.91       397\n             10       0.91      0.94      0.93       399\n             11       0.87      0.76      0.81       396\n             12       0.57      0.70      0.63       393\n             13       0.81      0.85      0.83       396\n             14       0.74      0.93      0.82       394\n             15       0.82      0.83      0.83       398\n             16       0.74      0.78      0.76       364\n             17       0.96      0.83      0.89       376\n             18       0.64      0.60      0.62       310\n             19       0.48      0.56      0.52       251\n\n    avg / total       0.77      0.76      0.76      7532\n\n-----------------------------------------\nConvolutional Neural Networks (CNN)\n-----------------------------------------\n\nAnother deep learning architecture employed for hierarchical document classification is the Convolutional Neural Network (CNN). Although originally built for image processing, with an architecture similar to the visual cortex, CNNs have also been used effectively for text classification. In a basic CNN for image processing, an image tensor is convolved with a set of kernels of size *d by d*. The outputs of these convolution layers are called feature maps and can be stacked to provide multiple filters on the input. 
To reduce the computational complexity, CNNs use pooling, which reduces the size of the output from one layer to the next in the network. Different pooling techniques are used to reduce outputs while preserving important features.\n\nThe most common pooling method is max pooling, where the maximum element is selected from the pooling window. In order to feed the pooled output from the stacked feature maps to the next layer, the maps are flattened into one column. The final layers in a CNN are typically fully connected dense layers.\nIn general, during the back-propagation step of a convolutional neural network, both the weights and the feature detector filters are adjusted. A potential problem of CNNs used for text is the number of 'channels', *Sigma* (the size of the feature space). For text this can be very large (e.g. 50K), whereas for images it is less of a problem (e.g. only 3 RGB channels). This means the dimensionality of a CNN for text is very high.\n\n\n.. image:: docs/pic/CNN.png\n\nImport packages:\n\n.. code:: python\n\n\n    from keras.layers import Dropout, Dense,Input,Embedding,Flatten, MaxPooling1D, Conv1D\n    from keras.models import Sequential,Model\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n    from keras.preprocessing.text import Tokenizer\n    from keras.preprocessing.sequence import pad_sequences\n    from sklearn.datasets import fetch_20newsgroups\n    from keras.layers import Concatenate\n\n\n\nConvert text to word embeddings (using GloVe):\n\n.. 
code:: python\n\n    def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):\n        np.random.seed(7)\n        # Tokenize train and test together so both share one vocabulary\n        text = np.concatenate((X_train, X_test), axis=0)\n        text = np.array(text)\n        tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n        tokenizer.fit_on_texts(text)\n        sequences = tokenizer.texts_to_sequences(text)\n        word_index = tokenizer.word_index\n        # Pad or truncate every document to MAX_SEQUENCE_LENGTH tokens\n        text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n        print('Found %s unique tokens.' % len(word_index))\n        indices = np.arange(text.shape[0])\n        # np.random.shuffle(indices)\n        text = text[indices]\n        print(text.shape)\n        # Split the padded matrix back into train and test rows\n        n_train = len(X_train)\n        X_train = text[:n_train]\n        X_test = text[n_train:]\n        embeddings_index = {}\n        # Each GloVe line is a word followed by its vector components\n        with open(\".\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\") as f:\n            for line in f:\n                values = line.split()\n                word = values[0]\n                try:\n                    coefs = np.asarray(values[1:], dtype='float32')\n                except ValueError:\n                    # Skip malformed lines instead of reusing the previous vector\n                    continue\n                embeddings_index[word] = coefs\n        print('Total %s word vectors.' % len(embeddings_index))\n        return (X_train, X_test, word_index,embeddings_index)\n\n\nBuild a CNN model for text:\n\n.. 
code:: python\n\n    def Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n\n        \"\"\"\n            def buildModel_CNN(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n            word_index is the word index,\n            embeddings_index is the embeddings index, look at data_helper.py\n            nclasses is the number of classes,\n            MAX_SEQUENCE_LENGTH is the maximum length of text sequences,\n            EMBEDDING_DIM is an int value for the dimension of the word embedding, look at data_helper.py\n        \"\"\"\n\n        model = Sequential()\n        embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n        for word, i in word_index.items():\n            embedding_vector = embeddings_index.get(word)\n            if embedding_vector is not None:\n                # words not found in the embedding index keep their random initialization.\n                if len(embedding_matrix[i]) !=len(embedding_vector):\n                    print(\"could not broadcast input array from shape\",str(len(embedding_matrix[i])),\n                                     \"into shape\",str(len(embedding_vector)),\" Please make sure your\"\n                                     \" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,\")\n         ","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkk7nc%2Ftext_classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkk7nc%2Ftext_classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkk7nc%2Ftext_classification/lists"}