https://github.com/iamkankan/natural-language-processing-nlp-tutorial
NLP tutorials and guidelines to learn efficiently
- Host: GitHub
- URL: https://github.com/iamkankan/natural-language-processing-nlp-tutorial
- Owner: iAmKankan
- Created: 2019-09-01T07:50:29.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-01-17T09:19:26.000Z (over 2 years ago)
- Last Synced: 2025-02-02T18:23:27.917Z (4 months ago)
- Topics: bigrams, bow, cbow, glove, lemmatization, one-hot-encoding, stemming, stopwords, tf-idf-vectorizer, tokenization, unigram, word-embeddings, word2vec
- Homepage:
- Size: 123 KB
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
## Index

### _NLP(Natural language processing)_
* [What is NLP?](#natural-language-processing-nlp)
* [Why learn NLP?](#why-learn-nlp)
* [General NLP tasks](#typical-nlp-tasks)
* [Why is NLP so hard?](#why-it-is-hard)
* [Main NLP Approaches](#main-approaches-of-nlp)
* [NLP Roadmap](#nlp-roadmap)
### [Text Preprocessing Level 1- Stopwords, Tokens, Stemming, Lemmatization](https://github.com/iAmKankan/NaturalLanguageProcessing-NLP/tree/master/Text%20Preprocessing%20Level%23%201)
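The Level-1 steps can be sketched in a few lines of plain Python. This is a toy illustration only (the stopword list and suffix-stripping "stemmer" below are my own simplifications); real projects would use NLTK or spaCy, which implement these steps far more carefully.

```python
import re

# Toy stopword list for illustration; NLTK ships a much larger one.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "are"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop high-frequency function words that carry little content."""
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Crude suffix-stripping stemmer (a Porter stemmer is far more careful)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The cats are playing in the garden")
print(remove_stopwords(tokens))   # stopwords dropped
print([stem(t) for t in tokens])  # crude stems, e.g. playing -> play
```

Lemmatization, by contrast, maps each word to its dictionary form using vocabulary and morphology (e.g. "better" -> "good"), which a rule-based suffix stripper like the one above cannot do.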
### [Text Preprocessing Level 2- Bag Of Words, TFIDF, Unigrams, Bigram](https://github.com/iAmKankan/NaturalLanguageProcessing-NLP/tree/master/Text%20Preprocessing%20Level%23%202)
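Bag-of-words and TF-IDF can be computed by hand over a toy corpus, which makes the weighting transparent. A minimal sketch (in practice scikit-learn's `CountVectorizer` / `TfidfVectorizer` do this, with smoothing and normalization options; the plain unsmoothed IDF below is a simplification):

```python
import math
from collections import Counter

# Toy corpus; each document is tokenized by whitespace.
docs = ["dog bites man", "man bites dog", "dog eats food"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(doc):
    """Bag-of-words: raw count of each vocabulary word in the document."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vector(doc):
    """TF-IDF with plain term frequency and unsmoothed IDF = log(N/df)."""
    counts = Counter(doc)
    n = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for d in tokenized if w in d)
        vec.append(tf * math.log(n / df))
    return vec

print(vocab)                     # ['bites', 'dog', 'eats', 'food', 'man']
print(bow_vector(tokenized[0]))  # [1, 1, 0, 0, 1]
```

Note that "dog", which appears in every document, gets an IDF of log(3/3) = 0, so TF-IDF zeroes out words that occur everywhere; unigrams are the single tokens counted here, while bigrams would count adjacent token pairs instead.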
### [Text Preprocessing Level 3- _Word Embeddings (Word2vec, One-hot, The Skip-Gram Model, CBOW, GloVe)_](https://github.com/iAmKankan/Natural-Language-Processing-NLP-Tutorial/blob/master/word_embedding.md)
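Two of the Level-3 building blocks can be sketched directly: one-hot vectors, and the (center, context) training pairs that skip-gram and CBOW models are trained on. This is only an illustration of the data preparation; training an actual Word2Vec model would use a library such as gensim.

```python
sentence = "the quick brown fox jumps".split()
vocab = sorted(set(sentence))
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: skip-gram predicts context from the center word.
    CBOW uses the same pairs in reverse, predicting the center from context."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(one_hot("fox"))             # a single 1 at the index of "fox"
print(skipgram_pairs(sentence)[:3])
```

One-hot vectors grow with vocabulary size and encode no similarity between words; trained embeddings (Word2Vec, GloVe) replace them with dense vectors where related words end up close together.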
### _NLP and Probability_
* [_Why do we need Probability in NLP?_](https://github.com/iAmKankan/Mathematics/blob/main/probability/README.md)
### [_References_](#references-1)

## Natural Language Processing (NLP)

#### _**Natural language processing**_ is a subfield of artificial intelligence that helps computers understand, interpret, and use human language.
* NLP allows computers to communicate with people using human languages.
* NLP also gives computers the ability to read text, hear speech, and interpret it.
* NLP draws on several disciplines, including computational linguistics and computer science, as it attempts to bridge the gap between human and computer communication.
* NLP breaks language down into shorter, more basic pieces called **_tokens_** (words, punctuation, etc.) and attempts to understand the relationships between tokens.
* This approach often uses higher-level NLP features, such as:
- **Sentiment analysis:** Identifies the general mood or subjective opinions expressed in large amounts of text; widely used for opinion mining.
- **Contextual extraction:** Extracts structured data from text-based sources.
- **Text-to-speech and speech-to-text:** Transforms voice into text and vice versa.
- **Document summarization:** Automatically creates a synopsis, condensing large amounts of text.
- **Machine translation:** Translates text or speech from one language into another.

## Typical NLP tasks

| ${\color{Purple}\textrm{Information Retrieval}}$ | ${\color{Purple}\textrm{Find documents based on keywords}}$ |
|:------------------------|:-------------------------------------------------------------------------------------------------------|
| ${\color{Purple}\textrm{Information Extraction}}$ |${\color{Purple}\textrm{ Identify and extract personal name, date, company name, city..}}$ |
| ${\color{Purple}\textrm{Language generation}}$ |${\color{Purple}\textrm{ Description based on a photograph, Title for a photograph}}$ |
| ${\color{Purple}\textrm{Text classification}}$ |${\color{Purple}\textrm{ Assigning predefined categorization to documents. Identify Spam emails and move them to a Spam folder}}$ |
| ${\color{Purple}\textrm{Machine Translation}}$ |${\color{Purple}\textrm{ Translate any language Text to another}}$ |
| ${\color{Purple}\textrm{Grammar checkers}}$ |${\color{Purple}\textrm{ Check the grammar of text in any language}}$ |

### Why learn NLP?

#### Some examples of applications built with NLP:
1. Spell correction (MS Word or any other editor)
2. Search engines (Google, Bing, Yahoo)
3. Speech engines (Siri, Google Assistant)
4. Spam classifiers (all e-mail services)
5. News feeds (Google, Yahoo!, and so on)
6. Machine translation (Google Translate)
7. IBM Watson

#### Some NLP tools
Most of these tools are written in Java and have similar functionality (Gensim and NLTK are Python libraries): **GATE**, **Mallet**, **OpenNLP**, **UIMA**, the **Stanford toolkit**, **Gensim**, **NLTK (Natural Language Toolkit)**.

## Why it is Hard?

* Multiple ways of representing the same scenario
* Requires common sense and contextual understanding
* Complexity of representing information (from simple to specialized vocabulary)
* Mixing of visual cues
* Ambiguous in nature
* Idioms, metaphors, sarcasm ("Yeah, right!"), double negatives, etc. make automatic processing difficult
* Human language interpretation depends on real-world, common-sense, and contextual knowledge

### NLP Roadmap
### Main Approaches of NLP

### Natural Language Processing and Deep Learning

* Developments in the field of deep learning have led to massive increases in NLP performance.
* **Before deep learning**, the main techniques used in NLP were the _**bag-of-words**_ model and techniques like _**TF-IDF**_, _**Naive Bayes**_, and _**Support Vector Machines (SVM)**_.
* In fact, these are quick, robust, simple systems even by today's standards.
* **Nowadays**, in advanced areas of NLP, we use techniques like _**Hidden Markov Models**_ to do things like
  * ***speech recognition*** and
  * ***part-of-speech tagging***.

> #### **Problem with Bag of Words**
* Consider the phrases - ***dog toy*** and ***toy dog***
* These are different things, but in a Bag of Words model word order does not matter, so they would be treated the same.
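A few lines of Python make the problem concrete: the two phrases produce identical bag-of-words counts, while bigrams (adjacent token pairs) still tell them apart.

```python
from collections import Counter

def bow(text):
    """Bag-of-words representation: token counts, order discarded."""
    return Counter(text.split())

def bigrams(text):
    """Adjacent token pairs, which do preserve local word order."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

print(bow("dog toy") == bow("toy dog"))        # True: order is lost
print(bigrams("dog toy"), bigrams("toy dog"))  # different pairs
```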
> #### **Solution**
* **Neural Networks**- Modeling sentences as sequences and as hierarchies (**_LSTM_**) has led to state-of-the-art improvements over previous go-to techniques.
* **Word Embeddings**- These give words a dense vector representation so that words can be plugged into a neural network just like any other feature vector.

### Sequence Learning

Sequence learning is the study of machine learning algorithms designed for applications that require **sequential data** or **temporal data**.
#### RECURRENT NEURAL NETWORK
* Sequential data prediction is considered a key problem in machine learning and artificial intelligence
* Unlike images, where we look at the entire image at once, we read **text documents sequentially** to understand the content.
* The likelihood of any sentence can be estimated from everyday use of language.
* The earlier sequence of words (in time) is important for predicting the next word, sentence, paragraph, or chapter.
* If a word occurs twice in a sentence but cannot be accommodated in the sliding window, the word is learned twice
* An RNN is an architecture that does not impose a fixed-length limit on the prior context

**RNN language model** - encoding a sentence into a fixed-size vector - exploding and vanishing gradients - **LSTM** - **GRU**
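The idea of predicting the next word from the earlier sequence can be illustrated with the simplest possible sequence model: a count-based bigram language model (the toy corpus below is invented for the example). An RNN language model replaces these fixed-window counts with a learned hidden state that can, in principle, carry unbounded prior context.

```python
from collections import Counter, defaultdict

# Toy corpus for illustration.
corpus = "the dog chased the cat and the cat chased the mouse".split()

# Count which word follows which: a bigram model only looks one word back.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Most frequent word observed immediately after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' follows 'the' most often here
```

Because the window is fixed at one word, the model cannot use anything said earlier in the sentence; that limitation is exactly what RNNs (and their gated variants, LSTM and GRU) were designed to remove.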
## References
