https://github.com/gonzaferreiro/NLP_with_20newsgroups

In this brief project we're gonna explore a few NLP tools using a Sklearn dataset and the following modelling techniques: bag of words, Hashing and TF-IDF vectorizer.

Host: GitHub
URL: https://github.com/gonzaferreiro/NLP_with_20newsgroups
Owner: gonzaferreiro
Created: 2019-07-08T13:41:19.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2019-07-08T16:17:11.000Z (over 5 years ago)
Last Synced: 2024-08-01T13:36:03.118Z (7 months ago)
Language: Jupyter Notebook
Homepage:
Size: 1.41 MB
Stars: 10
Watchers: 0
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Introduction to Natural Language Processing

## Introduction

In this brief project we're gonna explore a few NLP tools using a Sklearn dataset. The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

In this project we'll only try to predict four of the categories in the Sklearn dataset:

* alt.atheism
* talk.religion.misc
* comp.graphics
* sci.space

Feel free to check the [dataset documentation](https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html) to learn more about it.
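
As a rough sketch of how the four categories above could be loaded, the snippet below wraps Sklearn's `fetch_20newsgroups` in a small helper. The helper name `load_subset`, the `random_state` value and the choice to strip headers, footers and quotes are assumptions made for illustration; the notebook in this repository may load the data differently.

```python
from sklearn.datasets import fetch_20newsgroups

# The four categories this project focuses on.
CATEGORIES = [
    "alt.atheism",
    "talk.religion.misc",
    "comp.graphics",
    "sci.space",
]


def load_subset(subset):
    """Fetch the 'train' or 'test' split, restricted to CATEGORIES.

    The first call downloads the dataset, so it needs network access.
    Removing headers, footers and quotes is an optional choice made in
    this sketch, not something the project description specifies.
    """
    return fetch_20newsgroups(
        subset=subset,
        categories=CATEGORIES,
        remove=("headers", "footers", "quotes"),
        shuffle=True,
        random_state=42,  # arbitrary seed, only for reproducible shuffling
    )
```

Calling `load_subset("train")` and `load_subset("test")` returns objects whose `data` attribute holds the raw posts and whose `target` attribute holds the integer category labels.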

## What you'll find in this repository

* Introduction to the dataset and its exploration
* Bag of words model: what it is and application
* Exploring most common words in several ways
* Looking at the confusion matrix of our model
* Using Hashing and TF-IDF: theoretical introduction and application
* A comparison of classifiers
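
To give a feel for the techniques listed above, here is a minimal sketch that applies the three vectorizers to a tiny made-up corpus and then compares two classifiers through their confusion matrices. The toy sentences, the `n_features` value and the choice of `MultinomialNB` and `LogisticRegression` are assumptions for illustration only; they are not taken from the notebook.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, NOT taken from the 20 newsgroups data.
docs = [
    "the rocket reached orbit around the moon",
    "nasa launched a new satellite into orbit",
    "the shuttle crew returned from the space station",
    "render the image with a faster graphics card",
    "the polygon mesh needs a better shading algorithm",
    "this image format stores pixels and colour palettes",
]
labels = [0, 0, 0, 1, 1, 1]  # 0 = space, 1 = graphics

vectorizers = {
    # Bag of words: one column per vocabulary word, values are raw counts.
    "bag of words": CountVectorizer(),
    # Hashing: words are hashed into a fixed number of columns, no vocabulary stored.
    "hashing": HashingVectorizer(n_features=2**10, alternate_sign=False),
    # TF-IDF: counts reweighted so words common to many documents weigh less.
    "tf-idf": TfidfVectorizer(),
}

shapes = {}
for name, vectorizer in vectorizers.items():
    X = vectorizer.fit_transform(docs)
    shapes[name] = X.shape
    print(f"{name}: matrix shape {X.shape}")

classifiers = {
    "naive Bayes": MultinomialNB(),
    "logistic regression": LogisticRegression(),
}

matrices = {}
for name, classifier in classifiers.items():
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(docs, labels)
    # Predicting on the training documents only demonstrates the API;
    # a real evaluation would use the held-out test subset.
    predictions = model.predict(docs)
    matrices[name] = confusion_matrix(labels, predictions)
    print(name)
    print(matrices[name])
```

The bag of words and TF-IDF matrices share the same shape because both build the same vocabulary, while the hashing matrix always has `n_features` columns regardless of the corpus.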