
https://github.com/gonzaferreiro/NLP_with_20newsgroups

In this brief project we explore a few NLP tools using a scikit-learn dataset and the following modelling techniques: bag of words, hashing, and TF-IDF vectorizers.

# Introduction to Natural Language Processing

## Introduction

In this brief project we're going to explore a few NLP tools using a scikit-learn dataset. The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

In this project we'll only try to predict four of the dataset's categories:

* alt.atheism
* talk.religion.misc
* comp.graphics
* sci.space

Feel free to check the [dataset documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) to know more about it.
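
As a rough idea of what loading these four categories can look like, here is a minimal sketch built on scikit-learn's `fetch_20newsgroups`. The helper name `load_split` and the choice to strip headers, footers and quotes are assumptions made for illustration, not necessarily what the notebook in this repository does.

```python
# The four categories this project tries to predict.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_split(subset):
    """Return (texts, labels, label_names) for the 'train' or 'test' subset.

    `load_split` is a hypothetical helper, not part of scikit-learn.
    """
    if subset not in ("train", "test"):
        raise ValueError("subset must be 'train' or 'test'")

    # scikit-learn is imported lazily; the first call downloads and caches
    # the dataset, so it needs network access once.
    from sklearn.datasets import fetch_20newsgroups

    bunch = fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        # Assumption: drop metadata so a model has to rely on the message body.
        remove=("headers", "footers", "quotes"),
    )
    return bunch.data, bunch.target, bunch.target_names
```

Calling `load_split("train")` and `load_split("test")` would then give the two date-based subsets described above, with labels that index into the returned list of category names.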

## What you'll find in this repository

* Introduction to the dataset and its exploration
* Bag of words model: what it is and how to apply it
* Exploring most common words in several ways
* Looking at our model's confusion matrix
* Using hashing and TF-IDF: a theoretical introduction and its application
* A comparison of classifiers
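
To give a feel for the three vectorization ideas listed above before opening the notebook, here is a small from-scratch sketch on a made-up three-sentence corpus. It is not the repository's code: in scikit-learn these ideas correspond to `CountVectorizer`, `HashingVectorizer` and `TfidfVectorizer`. The toy documents, the `hash_vector` helper, the bucket count and the use of `zlib.crc32` as the hash function are all assumptions chosen for illustration, and the TF-IDF part skips the row normalization that scikit-learn applies by default.

```python
import math
import zlib
from collections import Counter

# Made-up toy corpus (assumption, not taken from 20 newsgroups).
docs = [
    "the shuttle reached orbit",
    "the image rendering is slow",
    "orbit of the shuttle",
]
tokenized = [d.split() for d in docs]
counts = [Counter(tokens) for tokens in tokenized]

# Bag of words: one column per vocabulary word, each cell is a raw count.
vocab = sorted({w for tokens in tokenized for w in tokens})
bow = [[c[w] for w in vocab] for c in counts]

# Hashing: no vocabulary is stored; a hash of the word picks the column.
# Different words can collide in the same bucket.
N_BUCKETS = 8  # tiny on purpose; real applications use far more buckets


def hash_vector(tokens, n_buckets=N_BUCKETS):
    vec = [0] * n_buckets
    for w in tokens:
        vec[zlib.crc32(w.encode("utf-8")) % n_buckets] += 1
    return vec


hashed = [hash_vector(tokens) for tokens in tokenized]

# TF-IDF: counts are reweighted so that words appearing in many documents
# weigh less. This uses a smoothed IDF, ln((1 + n) / (1 + df)) + 1.
n_docs = len(tokenized)
df = {w: sum(w in tokens for tokens in tokenized) for w in vocab}
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
tfidf = [[count * idf[w] for count, w in zip(row, vocab)] for row in bow]

print(vocab)
print(bow[0])
print(hashed[0])
print([round(x, 3) for x in tfidf[0]])
```

In this toy corpus "the" appears in every document, so it gets the lowest IDF weight, while a word such as "reached" that appears only once gets the highest. The hashed vectors always have `N_BUCKETS` columns regardless of how many distinct words the corpus contains, which is the main appeal of hashing for large vocabularies.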