https://github.com/camara94/python-text-mining
In order to be successful in this course, you will need to know how to program in Python. The expectation is that you have completed the first three courses in this Applied Data Science with Python series, specifically Course 1 on Introduction to Data Science in Python and Course 3 on Applied Machine Learning in Python, so that you are familiar with the numpy and pandas Python libraries for data manipulation, and scikit-learn toolkit for machine learning algorithms.
- Host: GitHub
- URL: https://github.com/camara94/python-text-mining
- Owner: camara94
- License: mit
- Created: 2021-11-08T03:37:43.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2021-12-04T01:42:19.000Z (almost 4 years ago)
- Last Synced: 2025-02-15T08:43:34.999Z (8 months ago)
- Topics: nltk, nltk-python, python3, text-mining, text-mining-analysis, text-mining-in-python
- Language: Jupyter Notebook
- Homepage:
- Size: 17.1 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Python text mining
In order to be successful in this course, you will need to know how to program in Python. The expectation is that you have completed the first three courses in this Applied Data Science with Python series, specifically Course 1 on Introduction to Data Science in Python and Course 3 on Applied Machine Learning in Python, so that you are familiar with the numpy and pandas Python libraries for data manipulation, and scikit-learn toolkit for machine learning algorithms.
## Primitive constructs in Text
* Sentences / input strings
* Words or Tokens
* Characters
* Documents, larger files

In this Python course we talk about all of these concepts and their properties.
## Week 1
Useful links:
1. https://docs.python.org/3/library/re.html
2. https://www.analyticsvidhya.com/blog/2014/11/text-data-cleaning-steps-python/
3. https://ieva.rocks/2016/08/07/cleaning-text-for-nlp/
4. https://chrisalbon.com/python/cleaning_text.html
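
The regular-expression and text-cleaning material linked above can be tried out with a small sketch like the following. It uses only the standard `re` module; the sample string and the `clean_text` helper are hypothetical illustrations, not code taken from the course labs.

```python
import re

def clean_text(text):
    """Lowercase text, drop URLs and punctuation, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)    # remove URLs
    text = re.sub(r"[^a-z0-9@#\s]", " ", text)   # keep letters, digits, @ and #
    return re.sub(r"\s+", " ", text).strip()     # collapse runs of whitespace

sample = "Check this out: https://example.com #NLP is FUN!!! @user"
cleaned = clean_text(sample)
hashtags = re.findall(r"#\w+", cleaned)   # tokens that start with '#'
mentions = re.findall(r"@\w+", cleaned)   # tokens that start with '@'

print(cleaned)
print(hashtags, mentions)
```

Which characters to keep is a design choice: here `@` and `#` survive cleaning so that mentions and hashtags can still be extracted afterwards.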
## Week 2
In this module, we will talk about **Natural Language**.

### What is Natural Language?
* Language used for everyday communication by humans, as opposed to artificial computer languages
    * English
    * Chinese
    * Spanish
* Natural Language Processing: any computation on or manipulation of natural language
* Natural languages evolve
    * new words get added
    * old words lose popularity
    * language rules themselves may change

## NLP Tasks: A Broad Spectrum
* Counting words, counting frequency of words
* Finding sentence boundaries
* Part of speech tagging
* Parsing the sentence structure
* Identifying semantic roles
* Identifying entities in a sentence
* Finding which pronoun refers to which entity

## An Introduction to NLTK
* NLTK: Natural Language Toolkit
* Open source library in Python
* Has support for most NLP tasks
* Also provides access to numerous text corpora

## Usage of NLTK
* Import the library: `import nltk`
* Download some text corpora: `nltk.download()`
* Load the sample texts: `from nltk.book import *`

For more information, see the Week 2 lab.
## Tokenization
* Recall splitting a sentence into words / tokens
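
As a minimal sketch of the difference between naive splitting and a real tokenizer, the example below compares `str.split` with NLTK's `TreebankWordTokenizer`. That tokenizer is chosen here because it works without downloaded model data, whereas `nltk.word_tokenize` needs tokenizer data fetched via `nltk.download()`. The sample sentence is only an illustration.

```python
from nltk.tokenize import TreebankWordTokenizer

sentence = "Children shouldn't drink a sugary drink before bed."

# Naive approach: split on whitespace; punctuation stays glued to words.
whitespace_tokens = sentence.split()

# Treebank-style tokenization separates punctuation and contractions.
treebank_tokens = TreebankWordTokenizer().tokenize(sentence)

print(whitespace_tokens)
print(treebank_tokens)
```

The whitespace split keeps `bed.` as one token, while the Treebank tokenizer emits the final period separately and splits `shouldn't` into `should` and `n't`.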
## Part-of-speech (POS) Tagging
* Recall high school grammar: nouns, verbs, adjectives,...
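
To show what tagger input and output look like, here is a rough sketch using NLTK's rule-based `RegexpTagger`. This is a deliberate simplification: the pretrained tagger behind `nltk.pos_tag` needs downloaded model data, and the suffix rules below are hypothetical and far too crude for real text.

```python
from nltk.tag import RegexpTagger

# Hypothetical suffix rules, tried in order; the last rule is a catch-all.
patterns = [
    (r"^(the|a|an)$", "DT"),   # determiners
    (r".*ing$", "VBG"),        # gerunds
    (r".*ed$", "VBD"),         # simple past verbs
    (r".*ly$", "RB"),          # adverbs
    (r".*s$", "NNS"),          # plural nouns
    (r".*", "NN"),             # default: singular noun
]

tagger = RegexpTagger(patterns)
tagged = tagger.tag(["the", "dogs", "barked", "loudly"])

print(tagged)
```

Each token is paired with a Penn Treebank style tag. Rules like these break quickly on ambiguous words, which is why trained taggers are used in practice.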
## Ambiguity in POS Tagging

## Parsing Sentence Structure
## Ambiguity in Parsing


## POS Tagging & Parsing Complexity

## Take Home Concepts

## Examples of Text Classification

## Supervised Learning

## Supervised Classification Step

## Supervised Classification Model

## Divide Dataset in two parts

## Classification paradigms

## Questions to ask in Supervised Learning

## Why is textual data unique?

## Types of textual features (1)

## Types of textual features (2)

## Types of textual features (3)

## Naive Bayes Classifiers
## Case study: Classifying text search queries


## Probabilistic model

## Bayes' Rule
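For reference, this is Bayes' rule in the form usually applied to classification, where $y$ is a class label and $X$ the observed features. The notation is an assumption chosen for this sketch and may differ from the course slides.

$$
P(y \mid X) = \frac{P(y)\,P(X \mid y)}{P(X)}
$$

Since $P(X)$ is the same for every class, comparing classes only requires the numerator $P(y)\,P(X \mid y)$.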

## Naive Bayes Classification


## Example classification

## Naïve Bayes: Learning parameters


## Naïve Bayes: Smoothing
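A minimal sketch of why smoothing matters, assuming add-one (Laplace) smoothing over word counts. The training queries, labels, and the `likelihood` helper are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Hypothetical training queries, each labeled with a topic.
training = [
    ("python snake reptile", "zoology"),
    ("snake venom reptile", "zoology"),
    ("python code programming", "computer science"),
    ("code programming software", "computer science"),
]

word_counts = {}  # label -> Counter of word occurrences
for text, label in training:
    word_counts.setdefault(label, Counter()).update(text.split())
vocabulary = {word for counts in word_counts.values() for word in counts}

def likelihood(word, label, alpha=1):
    """P(word | label) with add-alpha smoothing; alpha=0 disables smoothing."""
    counts = word_counts[label]
    total = sum(counts.values())
    return (counts[word] + alpha) / (total + alpha * len(vocabulary))

unsmoothed = likelihood("venom", "computer science", alpha=0)
smoothed = likelihood("venom", "computer science", alpha=1)
print(unsmoothed, smoothed)
```

Without smoothing, a word never seen with a class gets probability zero, which wipes out the whole product for that class. Adding a small count to every word keeps the estimate positive.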

## Take Home Concepts

## Two Naïve Bayes Variants For Text

## Support Vector Machine
## Decision Boundaries

## Choosing a Decision Boundary

## Finding a Linear Boundary



## SVM: Multi-class classification






## SVM Parameters (1): Parameter C

## SVM Parameters (2): Other Parameters

## Take Home Messages

## Using Sklearn's NaiveBayesClassifier
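One way this could look in scikit-learn, sketched with `MultinomialNB` on bag-of-words counts. The tiny dataset and variable names are hypothetical, and the choice of `CountVectorizer` for features is an assumption of this example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data; real tasks need far more examples.
train_texts = [
    "python snake reptile",
    "snake venom reptile",
    "python code programming",
    "code programming software",
]
train_labels = ["zoology", "zoology", "cs", "cs"]
test_texts = ["reptile venom", "software code"]
test_labels = ["zoology", "cs"]

# Turn each text into a vector of word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

clf_nb = MultinomialNB()  # add-one smoothing (alpha=1.0) by default
clf_nb.fit(X_train, train_labels)
predicted = clf_nb.predict(X_test)
score = f1_score(test_labels, predicted, average="micro")

print(predicted, score)
```

The vectorizer is fitted on the training texts only and then reused for the test texts, so both share the same vocabulary.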

## Using Sklearn's SVM Classifier
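A comparable sketch with a linear support vector machine, using `sklearn.svm.SVC`. The data is hypothetical, and `C=1.0` is simply the library default rather than a tuned value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Hypothetical toy data; real tasks need far more examples.
train_texts = [
    "python snake reptile",
    "snake venom reptile",
    "python code programming",
    "code programming software",
]
train_labels = ["zoology", "zoology", "cs", "cs"]
test_texts = ["reptile venom", "software code"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# A linear kernel is a common starting point for sparse text features.
clf_svm = SVC(kernel="linear", C=1.0)
clf_svm.fit(X_train, train_labels)
predicted = clf_svm.predict(X_test)

print(predicted)
```

Smaller values of `C` mean stronger regularization and a wider margin, so the model tolerates more training errors. Larger values fit the training data more tightly.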

## Model Selection in Scikit-learn
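Two common model-selection tools in scikit-learn are a hold-out split and cross-validation. The sketch below shows both with `train_test_split` and `cross_val_score`; the dataset, split size, and fold count are hypothetical choices made to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: four texts per class.
texts = [
    "python snake reptile", "snake venom reptile",
    "reptile lizard venom", "lizard snake python",
    "python code programming", "code programming software",
    "software bug code", "programming bug python",
]
labels = ["zoology"] * 4 + ["cs"] * 4

X = CountVectorizer().fit_transform(texts)

# Hold out a quarter of the data, keeping class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0, stratify=labels
)

# 2-fold cross-validation on the training portion only.
scores = cross_val_score(MultinomialNB(), X_train, y_train, cv=2)

print(X_train.shape[0], X_test.shape[0], scores)
```

Keeping the test split out of cross-validation leaves it available for a final, unbiased evaluation once parameters are chosen.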


## Supervised Text Classification in NLTK

## Using NLTK's NaiveBayesClassifier
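A minimal sketch of NLTK's own classifier, which is trained on `(feature dict, label)` pairs. The `word_features` helper and the data are hypothetical.

```python
import nltk

def word_features(text):
    """Bag-of-words features: each word present maps to True."""
    return {word: True for word in text.split()}

# Hypothetical toy data as (features, label) pairs.
train_set = [
    (word_features("python snake reptile"), "zoology"),
    (word_features("snake venom reptile"), "zoology"),
    (word_features("python code programming"), "cs"),
    (word_features("code programming software"), "cs"),
]
test_set = [
    (word_features("reptile venom"), "zoology"),
    (word_features("software code"), "cs"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)
label = classifier.classify(word_features("reptile venom"))
accuracy = nltk.classify.accuracy(classifier, test_set)

print(label, accuracy)
```

After training, `classifier.show_most_informative_features()` prints the features that best separate the labels.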

## Using NLTK's SklearnClassifier
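NLTK's `SklearnClassifier` wraps a scikit-learn estimator so it can be trained on NLTK-style feature dicts. The sketch below wraps `MultinomialNB`; the `word_features` helper and the data are hypothetical.

```python
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

def word_features(text):
    """Bag-of-words features: each word present maps to True."""
    return {word: True for word in text.split()}

# Hypothetical toy data as (features, label) pairs.
train_set = [
    (word_features("python snake reptile"), "zoology"),
    (word_features("snake venom reptile"), "zoology"),
    (word_features("python code programming"), "cs"),
    (word_features("code programming software"), "cs"),
]

# The wrapper converts feature dicts into a sparse matrix internally.
classifier = SklearnClassifier(MultinomialNB()).train(train_set)
label = classifier.classify(word_features("software code"))

print(label)
```

Any scikit-learn classifier can be swapped in for `MultinomialNB` without changing the feature extraction code.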

## Take Home Concept

## Application of semantic similarity
## WordNet

## Semantic similarity using WordNet
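Path-based similarity can be sketched without WordNet itself, as one over one plus the length of the shortest path between two concepts. The tiny hierarchy below is hand-built and hypothetical; it is not real WordNet data, which would need to be downloaded through NLTK.

```python
# Hypothetical, hand-built is-a hierarchy (child -> parent).
PARENT = {
    "elk": "deer",
    "deer": "ruminant",
    "giraffe": "ruminant",
    "ruminant": "even-toed ungulate",
    "even-toed ungulate": "ungulate",
    "horse": "equine",
    "equine": "odd-toed ungulate",
    "odd-toed ungulate": "ungulate",
}

def ancestors(concept):
    """Return {ancestor: distance} for a concept, including itself at 0."""
    distances, steps = {}, 0
    while concept is not None:
        distances[concept] = steps
        concept = PARENT.get(concept)
        steps += 1
    return distances

def path_similarity(a, b):
    """1 / (1 + length of the shortest path between a and b)."""
    up_a, up_b = ancestors(a), ancestors(b)
    shared = set(up_a) & set(up_b)
    shortest = min(up_a[c] + up_b[c] for c in shared)
    return 1 / (1 + shortest)

for other in ("elk", "giraffe", "horse"):
    print("deer", other, round(path_similarity("deer", other), 2))
```

Concepts that meet low in the hierarchy (deer and elk) score higher than concepts that only share a distant ancestor (deer and horse).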

## Coming back to our deer example

## Similarity with NLP in Python

## Distributional Similarity: Context

## Strength of association between words
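One widely used measure of association strength is pointwise mutual information (PMI), which compares how often two words co-occur with how often they would co-occur by chance. The sketch below assumes that measure; the toy corpus and helper names are hypothetical.

```python
import math

# Hypothetical toy corpus: each inner list is one context (e.g. a sentence).
contexts = [
    ["deer", "forest"],
    ["deer", "forest"],
    ["deer", "road"],
    ["car", "road"],
]

def probability(*words):
    """Fraction of contexts that contain all the given words."""
    hits = sum(all(w in context for w in words) for context in contexts)
    return hits / len(contexts)

def pmi(word, other):
    """log2( P(word, other) / (P(word) * P(other)) )."""
    joint = probability(word, other)
    return math.log2(joint / (probability(word) * probability(other)))

print(pmi("deer", "forest"))
print(pmi("deer", "road"))
```

A positive value means the pair co-occurs more often than chance would predict, and a negative value means less often. Pairs that never co-occur have no defined PMI in this sketch, since the logarithm of zero is undefined.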

## Take Home Concepts
## What is Topic Modeling?



## Generative Models for Text

## Generative Models can be complex

## Latent Dirichlet Allocation (LDA)

## Topic Modeling in Practice

## Topic Modeling: Summary

## Working with LDA in Python
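A small sketch of fitting LDA in Python, using scikit-learn's `LatentDirichletAllocation`. This library is an assumption of the example; the course labs may use a different one, such as gensim. The documents and the number of topics are hypothetical.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents; LDA needs much more text to find real topics.
docs = [
    "snake reptile venom lizard",
    "reptile lizard snake",
    "code software programming bug",
    "programming software code",
]

# LDA works on raw word counts.
vectorizer = CountVectorizer()
doc_term = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(doc_term)  # one row of topic weights per document

# Map column indices back to words to list the top words per topic.
index_to_word = {i: w for w, i in vectorizer.vocabulary_.items()}
top_words = [
    [index_to_word[i] for i in topic.argsort()[::-1][:3]]
    for topic in lda.components_
]

print(doc_topics.round(2))
print(top_words)
```

The number of topics has to be chosen in advance, and interpreting each topic from its top words is left to the reader.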


## Take Home Concepts

## Information is hidden in free-text

## Information Extraction

## Fields of Interest

## Named Entity Recognition

## Approaches to identify named entities


## Relation extraction

## Co-reference resolution

## Question Answering

## Take Home Concepts

## Additional Resources & Readings
* [http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf](http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf)
* [https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)
* [https://en.wikipedia.org/wiki/Plate_notation](https://en.wikipedia.org/wiki/Plate_notation)
* [https://www.nltk.org/howto/wordnet.html](https://www.nltk.org/howto/wordnet.html)