{"id":22588633,"url":"https://github.com/iamkankan/natural-language-processing-nlp-tutorial","last_synced_at":"2026-01-08T01:02:55.990Z","repository":{"id":51205149,"uuid":"205646010","full_name":"iAmKankan/Natural-Language-Processing-NLP-Tutorial","owner":"iAmKankan","description":"NLP tutorials and guidelines to learn efficiently","archived":false,"fork":false,"pushed_at":"2023-01-17T09:19:26.000Z","size":126,"stargazers_count":8,"open_issues_count":1,"forks_count":1,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-02T18:23:27.917Z","etag":null,"topics":["bigrams","bow","cbow","glove","lemmatization","one-hot-encoding","stemming","stopwords","tf-idf-vectorizer","tokenization","unigram","word-embeddings","word2vec"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iAmKankan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-09-01T07:50:29.000Z","updated_at":"2024-04-04T19:28:36.000Z","dependencies_parsed_at":"2023-02-10T08:45:48.368Z","dependency_job_id":null,"html_url":"https://github.com/iAmKankan/Natural-Language-Processing-NLP-Tutorial","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iAmKankan%2FNatural-Language-Processing-NLP-Tutorial","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iAmKankan%2FNatural-Language-Processing-NLP-Tutorial/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iAmKankan%2FNatural-Language-Processing-NLP-Tutorial/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iAmKankan%2FNatural-Language-Processing-NLP-Tutorial/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iAmKankan","download_url":"https://codeload.github.com/iAmKankan/Natural-Language-Processing-NLP-Tutorial/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246072945,"owners_count":20719408,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigrams","bow","cbow","glove","lemmatization","one-hot-encoding","stemming","stopwords","tf-idf-vectorizer","tokenization","unigram","word-embeddings","word2vec"],"created_at":"2024-12-08T08:10:03.885Z","updated_at":"2026-01-08T01:02:55.951Z","avatar_url":"https://github.com/iAmKankan.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"## Index\n![deep](https://user-images.githubusercontent.com/12748752/150695343-8977b5d0-3cd4-4959-b90e-9fe72d336d42.png)\n### _NLP (Natural Language Processing)_\n * [What is NLP?](#natural-language-processing-nlp)\n * [Why learn NLP?](#why-learn-nlp)\n * [General NLP tasks](#typical-nlp-tasks)\n * [Why NLP is so hard](#why-it-is-hard)\n * [Main NLP Approaches](#main-approaches-of-nlp)\n * [NLP Roadmap](#nlp-roadmap)\n \n### [Text Preprocessing Level 1- Stopwords, Tokens, Stemming, Lemmatization](https://github.com/iAmKankan/NaturalLanguageProcessing-NLP/tree/master/Text%20Preprocessing%20Level%23%201)\n### [Text Preprocessing Level 2- Bag Of Words, TF-IDF, Unigrams, 
Bigram](https://github.com/iAmKankan/NaturalLanguageProcessing-NLP/tree/master/Text%20Preprocessing%20Level%23%202)\n### [Text Preprocessing Level 3- _Word Embeddings(Word2vec, One-hot, The Skip-Gram Model, CBOW, GloVe)_](https://github.com/iAmKankan/Natural-Language-Processing-NLP-Tutorial/blob/master/word_embedding.md)\n### _NLP and Probability_\n * [_Why we need _Probability_ in NLP_?](https://github.com/iAmKankan/Mathematics/blob/main/probability/README.md)\n### [_References_](#references-1)\n\n\n## Natural Language Processing (NLP)\n![deep](https://user-images.githubusercontent.com/12748752/150695343-8977b5d0-3cd4-4959-b90e-9fe72d336d42.png)\n#### _**Natural language processing**_ is a subfield of artificial intelligence that helps computers understand, interpret, and use human languages. \n\u003cimg src=\"https://user-images.githubusercontent.com/12748752/150695497-d66d6dae-37e6-48d4-8144-a7b8473435c9.png\" ALIGN=\"right\" width=50% /\u003e\n\n* NLP allows computers to communicate with people using human languages. \n* NLP also gives computers the ability to read text, hear speech, and try to interpret it. \n* NLP draws on several disciplines, including computational linguistics and computer science, to bridge the gap between human and computer communication.\n* NLP breaks language down into shorter, more basic pieces called **_tokens_** (words, punctuation marks, etc.) and attempts to understand the relationships between them. 
\n* This approach is often used to build higher-level NLP features, such as:\n  - **Sentiment analysis:** Identifies the general mood or subjective opinions expressed in large amounts of text. It is especially useful for opinion mining.\n  - **Contextual Extraction:** Extracts structured data from text-based sources.\n  - **Text-to-Speech and Speech-to-Text:** Converts text into speech and vice versa.\n  - **Document Summarization:** Automatically creates a synopsis, condensing large amounts of text.\n  - **Machine Translation:** Translates the text or speech of one language into another language.\n\n## Typical NLP tasks\n![light](https://user-images.githubusercontent.com/12748752/150695340-c086876c-1e29-4493-b03b-cbff51dba02a.png)\n| ${\\color{Purple}\\textrm{Information Retrieval}}$ | ${\\color{Purple}\\textrm{Find documents based on keywords}}$ |\n|:------------------------|:-------------------------------------------------------------------------------------------------------|\n| ${\\color{Purple}\\textrm{Information Extraction}}$ |${\\color{Purple}\\textrm{ Identify and extract personal names, dates, company names, cities, etc.}}$ |\n| ${\\color{Purple}\\textrm{Language generation}}$ |${\\color{Purple}\\textrm{ Generate a description or a title for a photograph}}$ |\n| ${\\color{Purple}\\textrm{Text classification}}$ |${\\color{Purple}\\textrm{ Assigning predefined categories to documents. 
Identify Spam emails and move them to a Spam folder}}$ |\n| ${\\color{Purple}\\textrm{Machine Translation}}$ |${\\color{Purple}\\textrm{ Translate text from one language to another}}$ |\n| ${\\color{Purple}\\textrm{Grammar checkers}}$ |${\\color{Purple}\\textrm{ Check the grammar for any language}}$ |\n\n### Why learn NLP?\n![light](https://user-images.githubusercontent.com/12748752/150695340-c086876c-1e29-4493-b03b-cbff51dba02a.png)\n#### Some examples of products built with NLP:\n\n   1. Spell correction (MS Word or any other editor)\n   2. Search engines (Google, Bing, Yahoo)\n   3. Speech engines (like Siri, Google Assistant)\n   4. Spam classifiers (all e-mail services)\n   5. News feeds (Google, Yahoo!, and so on)\n   6. Machine translation (Google Translate)\n   7. IBM Watson\n\n#### Some NLP tools\nMany of these tools are written in Java (Gensim and NLTK are Python libraries) and offer similar functionality: **GATE**, **Mallet**, **OpenNLP**, **UIMA**, **Stanford toolkit**, **Gensim**, **NLTK (Natural Language Toolkit)**.\n\n## Why it is Hard?\n![light](https://user-images.githubusercontent.com/12748752/150695340-c086876c-1e29-4493-b03b-cbff51dba02a.png)\n* The same scenario can be expressed in multiple ways\n* Meaning relies on common sense and context\n* Information is represented with varying complexity (from simple to hard vocabulary)\n* Mixing of visual cues\n* Language is ambiguous by nature\n* Idioms, metaphors, sarcasm (Yeah, right!), double negatives, etc. 
make it difficult for automatic processing\n* Human language interpretation depends on real-world, common-sense, and contextual knowledge\n\n### NLP Roadmap\n![light](https://user-images.githubusercontent.com/12748752/150695340-c086876c-1e29-4493-b03b-cbff51dba02a.png)\n\u003cimg src=\"https://user-images.githubusercontent.com/12748752/182266135-83fed5d6-1cb4-4f99-83f1-332f5bde9a7a.png\" width=60%/\u003e\n\n### Main Approaches of NLP\n![light](https://user-images.githubusercontent.com/12748752/150695340-c086876c-1e29-4493-b03b-cbff51dba02a.png)\n\n\u003cimg src=\"https://user-images.githubusercontent.com/12748752/150872225-4f7b267c-0672-44b7-aac3-0851c029250f.png\" width=100% /\u003e\n\n### Natural Language Processing and Deep Learning\n![light](https://user-images.githubusercontent.com/12748752/134754235-ae8efaf0-a27a-46f0-b439-b114cbb8cf3e.png)\n* Developments in the field of deep learning have led to massive increases in NLP performance.\n* **Before deep learning**, the main techniques used in NLP were the _**bag-of-words**_ model and techniques like _**TF-IDF**_, _**Naive Bayes**_ and _**Support Vector Machines (SVM)**_.\n* In fact, these still make for a quick, robust, and simple system by today's standards.\n\u003cimg src=\"https://user-images.githubusercontent.com/12748752/151645612-ce7511a7-13e7-44b2-afe2-30eab70b9980.png\" width=60% /\u003e\n\n* **Nowadays**, in advanced areas of NLP, we use techniques like _**Hidden Markov Models**_ to do things like \n  * ***speech recognition*** and \n  * ***part-of-speech tagging***.\n\n\u003e #### **Problem with Bag of Words**\n   * Consider the phrases - ***dog toy*** and ***toy dog***\n   * These are different things, but in a Bag of Words model word order does not matter, so the two phrases would be treated the same.\n\u003e #### **Solution**\n   * **Neural Network**- Modeling sentences as sequences and as hierarchies (**_LSTM_**) has led to state-of-the-art improvements over previous go-to techniques.\n   * **Word Embeddings**- These 
give words a dense vector representation so that they can be plugged into a neural network just like any other feature vector. \n\n### Sequence Learning\n![plum](https://user-images.githubusercontent.com/12748752/126882596-b9ba4645-7001-435e-9a3c-d4416a2543c1.png)\nSequence learning is the study of machine learning algorithms designed for applications that involve **sequential data** or **temporal data**.\n#### RECURRENT NEURAL NETWORK\n* Sequential data prediction is considered a key problem in machine learning and artificial intelligence\n* Unlike images, where we look at the entire image at once, we read **text documents sequentially** to understand the content. \n* The likelihood of any sentence can be determined from everyday use of language.\n* The earlier sequence of words (in time) is important for predicting the next word, sentence, paragraph or chapter.\n* If a word occurs twice in a sentence but the two occurrences cannot fit in the same sliding window, then the word is learned twice\n* An RNN is an architecture that does not impose a fixed-length limit on the prior context\n\n**RNN-Language Model**-_Encoding a sentence into a fixed-size vector_-Exploding and vanishing gradients-LSTM-GRU\n\n## References\n![deep](https://user-images.githubusercontent.com/12748752/150695343-8977b5d0-3cd4-4959-b90e-9fe72d336d42.png)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamkankan%2Fnatural-language-processing-nlp-tutorial","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiamkankan%2Fnatural-language-processing-nlp-tutorial","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiamkankan%2Fnatural-language-processing-nlp-tutorial/lists"}