https://github.com/gonzaferreiro/NLP_with_20newsgroups
In this brief project we're gonna explore a few NLP tools using a Sklearn dataset and the following modelling techniques: bag of words, Hashing and TF-IDF vectorizer.
- Host: GitHub
- URL: https://github.com/gonzaferreiro/NLP_with_20newsgroups
- Owner: gonzaferreiro
- Created: 2019-07-08T13:41:19.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-07-08T16:17:11.000Z (over 5 years ago)
- Last Synced: 2024-08-01T13:36:03.118Z (3 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 1.41 MB
- Stars: 10
- Watchers: 0
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Introduction to Natural Language Processing
## Introduction
In this brief project we're gonna explore a few NLP tools using a Sklearn dataset. The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.
In this project we'll only try to predict four categories of the Sklearn dataset:
* alt.atheism
* talk.religion.misc
* comp.graphics
* sci.space

Feel free to check the [dataset documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) to learn more about it.
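
As a rough illustration of the setup described above, here is a minimal sketch of how the four categories could be loaded with scikit-learn's `fetch_20newsgroups`. The helper name `load_subset` and the constant `CATEGORIES` are hypothetical names chosen for this example, and the notebook in the repository may load the data differently.

```python
from sklearn.datasets import fetch_20newsgroups

# The four categories listed in this README.
CATEGORIES = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]


def load_subset(subset):
    """Fetch one split ("train" or "test") restricted to CATEGORIES.

    Hypothetical helper for illustration only. The first call downloads
    the dataset and caches it locally, so it needs network access.
    """
    return fetch_20newsgroups(subset=subset, categories=CATEGORIES)
```

Calling `load_subset("train")` and `load_subset("test")` should return the two date-based splits. Each result exposes the raw posts in `.data`, the integer labels in `.target`, and the category names in `.target_names`.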
## What you'll find in this repository
* Introduction to the dataset and its exploration
* Bag of words model: what it is and application
* Exploring most common words in several ways
* Looking at our model's confusion matrix
* Using Hashing and TF-IDF: theoretical introduction and application
* A comparison of classifiers
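
To give a feel for the three text representations named in the list above, here is a small sketch that runs scikit-learn's `CountVectorizer` (bag of words), `TfidfVectorizer` and `HashingVectorizer` on a toy corpus. The three sentences are made up for this example and are not taken from the 20 newsgroups data. The value of `n_features` is also an arbitrary choice.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Toy corpus (hypothetical, not from the 20 newsgroups dataset).
docs = [
    "the rocket reached orbit",
    "the image was rendered in 3d",
    "the rocket image was rendered",
]

# Bag of words: one column per vocabulary word, values are raw counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# TF-IDF: same vocabulary, but words that appear in many documents
# (such as "the") are down-weighted relative to rarer words.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# Hashing: no vocabulary is stored; each word is hashed into one of a
# fixed number of columns, so no fitting step is needed.
hasher = HashingVectorizer(n_features=2**8)
X_hash = hasher.transform(docs)

print(sorted(bow.vocabulary_))
print(X_bow.shape, X_tfidf.shape, X_hash.shape)
```

The bag of words and TF-IDF matrices share the same shape because they are built on the same vocabulary, while the width of the hashed matrix is fixed by `n_features` regardless of how many distinct words the corpus contains. Any of the three matrices can then be passed to a scikit-learn classifier.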