https://github.com/iamkankan/natural-language-processing-nlp-tutorial
NLP tutorials and guidelines to learn efficiently
- Host: GitHub
- URL: https://github.com/iamkankan/natural-language-processing-nlp-tutorial
- Owner: iAmKankan
- Created: 2019-09-01T07:50:29.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-01-17T09:19:26.000Z (over 2 years ago)
- Last Synced: 2025-02-02T18:23:27.917Z (4 months ago)
- Topics: bigrams, bow, cbow, glove, lemmatization, one-hot-encoding, stemming, stopwords, tf-idf-vectorizer, tokenization, unigram, word-embeddings, word2vec
- Homepage:
- Size: 123 KB
- Stars: 8
- Watchers: 2
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
## Index

### _NLP(Natural language processing)_
* [What is NLP?](#natural-language-processing-nlp)
* [Why learn NLP?](#why-learn-nlp)
* [General NLP tasks](#typical-nlp-tasks)
* [Why is NLP so hard?](#why-it-is-hard)
* [Main NLP Approaches](#main-approaches-of-nlp)
* [NLP Roadmap](#nlp-roadmap)
### [Text Preprocessing Level 1- Stopwords, Tokens, Stemming, Lemmatization](https://github.com/iAmKankan/NaturalLanguageProcessing-NLP/tree/master/Text%20Preprocessing%20Level%23%201)
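The Level-1 steps can be sketched in a few lines of plain Python. This is a toy illustration only (the stopword list and suffix-stripping "stemmer" below are my own simplifications); real projects would use NLTK or spaCy, which implement these steps far more carefully.

```python
import re

# Toy stopword list for illustration; NLTK ships a much larger one.
STOPWORDS = {"the", "is", "a", "of", "and", "to", "in", "are"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens):
    """Drop high-frequency function words that carry little content."""
    return [t for t in tokens if t not in STOPWORDS]

def stem(token):
    """Crude suffix-stripping stemmer (a Porter stemmer is far more careful)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The cats are playing in the garden")
print(remove_stopwords(tokens))   # stopwords dropped
print([stem(t) for t in tokens])  # crude stems, e.g. playing -> play
```

Lemmatization, by contrast, maps each word to its dictionary form using vocabulary and morphology (e.g. "better" -> "good"), which a rule-based suffix stripper like the one above cannot do.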
### [Text Preprocessing Level 2- Bag Of Words, TFIDF, Unigrams, Bigram](https://github.com/iAmKankan/NaturalLanguageProcessing-NLP/tree/master/Text%20Preprocessing%20Level%23%202)
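Bag-of-words and TF-IDF can be computed by hand over a toy corpus, which makes the weighting transparent. A minimal sketch (in practice scikit-learn's `CountVectorizer` / `TfidfVectorizer` do this, with smoothing and normalization options; the plain unsmoothed IDF below is a simplification):

```python
import math
from collections import Counter

# Toy corpus; each document is tokenized by whitespace.
docs = ["dog bites man", "man bites dog", "dog eats food"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def bow_vector(doc):
    """Bag-of-words: raw count of each vocabulary word in the document."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def tfidf_vector(doc):
    """TF-IDF with plain term frequency and unsmoothed IDF = log(N/df)."""
    counts = Counter(doc)
    n = len(tokenized)
    vec = []
    for w in vocab:
        tf = counts[w] / len(doc)
        df = sum(1 for d in tokenized if w in d)
        vec.append(tf * math.log(n / df))
    return vec

print(vocab)                     # ['bites', 'dog', 'eats', 'food', 'man']
print(bow_vector(tokenized[0]))  # [1, 1, 0, 0, 1]
```

Note that "dog", which appears in every document, gets an IDF of log(3/3) = 0, so TF-IDF zeroes out words that occur everywhere; unigrams are the single tokens counted here, while bigrams would count adjacent token pairs instead.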
### [Text Preprocessing Level 3- _Word Embeddings (Word2vec, One-hot, The Skip-Gram Model, CBOW, GloVe)_](https://github.com/iAmKankan/Natural-Language-Processing-NLP-Tutorial/blob/master/word_embedding.md)
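Two of the Level-3 building blocks can be sketched directly: one-hot vectors, and the (center, context) training pairs that skip-gram and CBOW models are trained on. This is only an illustration of the data preparation; training an actual Word2Vec model would use a library such as gensim.

```python
sentence = "the quick brown fox jumps".split()
vocab = sorted(set(sentence))
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Sparse vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def skipgram_pairs(tokens, window=1):
    """(center, context) pairs: skip-gram predicts context from the center word.
    CBOW uses the same pairs in reverse, predicting the center from context."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(one_hot("fox"))             # a single 1 at the index of "fox"
print(skipgram_pairs(sentence)[:3])
```

One-hot vectors grow with vocabulary size and encode no similarity between words; trained embeddings (Word2Vec, GloVe) replace them with dense vectors where related words end up close together.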
### _NLP and Probability_
* [_Why do we need Probability in NLP?_](https://github.com/iAmKankan/Mathematics/blob/main/probability/README.md)
### [_References_](#references-1)

## Natural Language Processing (NLP)

#### _**Natural language processing**_ is a subfield of artificial intelligence that helps computers understand, interpret, and use human language.
* NLP allows computers to communicate with people using human languages.
* NLP also gives computers the ability to read text, hear speech, and interpret it.
* NLP draws on several disciplines, including computational linguistics and computer science, as it attempts to bridge the gap between human and computer communication.
* NLP breaks language down into shorter, more basic pieces called **_tokens_** (words, punctuation, etc.) and attempts to understand the relationships between tokens.
* This approach often uses higher-level NLP features, such as:
- **Sentiment analysis:** Identifies the general mood or subjective opinions expressed in large amounts of text; widely used for opinion mining.
- **Contextual extraction:** Extracts structured data from text-based sources.
- **Text-to-speech and speech-to-text:** Transforms voice into text and vice versa.
- **Document summarization:** Automatically creates a synopsis, condensing large amounts of text.
- **Machine translation:** Translates text or speech from one language into another.

## Typical NLP tasks

| ${\color{Purple}\textrm{Information Retrieval}}$ | ${\color{Purple}\textrm{Find documents based on keywords}}$ |
|:------------------------|:-------------------------------------------------------------------------------------------------------|
| ${\color{Purple}\textrm{Information Extraction}}$ |${\color{Purple}\textrm{ Identify and extract personal name, date, company name, city..}}$ |
| ${\color{Purple}\textrm{Language generation}}$ |${\color{Purple}\textrm{ Description based on a photograph, Title for a photograph}}$ |
| ${\color{Purple}\textrm{Text classification}}$ |${\color{Purple}\textrm{ Assigning predefined categorization to documents. Identify Spam emails and move them to a Spam folder}}$ |
| ${\color{Purple}\textrm{Machine Translation}}$ |${\color{Purple}\textrm{ Translate any language Text to another}}$ |
| ${\color{Purple}\textrm{Grammar checkers}}$ |${\color{Purple}\textrm{ Check the grammar of text in any language}}$ |

### Why learn NLP?

#### Some examples of applications built with NLP:
1. Spell correction (MS Word or any other editor)
2. Search engines (Google, Bing, Yahoo)
3. Speech engines (Siri, Google Assistant)
4. Spam classifiers (all e-mail services)
5. News feeds (Google, Yahoo!, and so on)
6. Machine translation (Google Translate)
7. IBM Watson

#### Some NLP tools
Most of these tools are written in Java and have similar functionality (Gensim and NLTK are Python libraries): **GATE**, **Mallet**, **OpenNLP**, **UIMA**, the **Stanford toolkit**, **Gensim**, **NLTK (Natural Language Toolkit)**.

## Why it is Hard?

* Multiple ways of representing the same scenario
* Requires common sense and contextual understanding
* Complexity of representing information (from simple to specialized vocabulary)
* Mixing of visual cues
* Ambiguous in nature
* Idioms, metaphors, sarcasm ("Yeah, right!"), double negatives, etc. make automatic processing difficult
* Human language interpretation depends on real-world, common-sense, and contextual knowledge

### NLP Roadmap
### Main Approaches of NLP

### Natural Language Processing and Deep Learning

* Developments in the field of deep learning have led to massive increases in NLP performance.
* **Before deep learning**, the main techniques used in NLP were the _**bag-of-words**_ model and techniques like _**TF-IDF**_, _**Naive Bayes**_, and _**Support Vector Machines (SVM)**_.
* In fact, these are quick, robust, simple systems even by today's standards.
* **Nowadays**, in advanced areas of NLP, we use techniques like _**Hidden Markov Models**_ to do things like
  * ***speech recognition*** and
  * ***part-of-speech tagging***.

> #### **Problem with Bag of Words**
* Consider the phrases - ***dog toy*** and ***toy dog***
* These are different things, but in a Bag of Words model word order does not matter, so they would be treated the same.
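A few lines of Python make the problem concrete: the two phrases produce identical bag-of-words counts, while bigrams (adjacent token pairs) still tell them apart.

```python
from collections import Counter

def bow(text):
    """Bag-of-words representation: token counts, order discarded."""
    return Counter(text.split())

def bigrams(text):
    """Adjacent token pairs, which do preserve local word order."""
    tokens = text.split()
    return list(zip(tokens, tokens[1:]))

print(bow("dog toy") == bow("toy dog"))        # True: order is lost
print(bigrams("dog toy"), bigrams("toy dog"))  # different pairs
```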
> #### **Solution**
* **Neural Networks**- Modeling sentences as sequences and as hierarchies (**_LSTM_**) has led to state-of-the-art improvements over previous go-to techniques.
* **Word Embeddings**- These give words a dense vector representation so that words can be plugged into a neural network just like any other feature vector.

### Sequence Learning

Sequence learning is the study of machine learning algorithms designed for applications that require **sequential data** or **temporal data**.
#### RECURRENT NEURAL NETWORK
* Sequential data prediction is considered a key problem in machine learning and artificial intelligence
* Unlike images, where we look at the entire image at once, we read **text documents sequentially** to understand the content.
* The likelihood of any sentence can be estimated from everyday use of language.
* The earlier sequence of words (in time) is important for predicting the next word, sentence, paragraph, or chapter.
* If a word occurs twice in a sentence but cannot be accommodated in the sliding window, the word is learned twice
* An RNN is an architecture that does not impose a fixed-length limit on the prior context

**RNN language model** - encoding a sentence into a fixed-size vector - exploding and vanishing gradients - **LSTM** - **GRU**
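The idea of predicting the next word from the earlier sequence can be illustrated with the simplest possible sequence model: a count-based bigram language model (the toy corpus below is invented for the example). An RNN language model replaces these fixed-window counts with a learned hidden state that can, in principle, carry unbounded prior context.

```python
from collections import Counter, defaultdict

# Toy corpus for illustration.
corpus = "the dog chased the cat and the cat chased the mouse".split()

# Count which word follows which: a bigram model only looks one word back.
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(word):
    """Most frequent word observed immediately after `word` in the corpus."""
    return following[word].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' follows 'the' most often here
```

Because the window is fixed at one word, the model cannot use anything said earlier in the sentence; that limitation is exactly what RNNs (and their gated variants, LSTM and GRU) were designed to remove.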
## References
