
# NLP News Classification

This repository hosts a notebook featuring an in-depth analysis of several **neural network** models (RNN, CNN, feed-forward) and **Multinomial Naive Bayes**, along with an app deployed using Streamlit. The following models were evaluated:

- Basic Multinomial Naive Bayes
- Basic Keras Model
- LSTM Model
- LSTM GRU Model
- LSTM Bidirectional Model
- TextVectorization + Keras Embedding
- `text_to_word_sequence` + Word2Vec Embedding
- Basic CNN Model

The dataset was downloaded from Kaggle and contains a set of fake and real news articles.

The app can be tested online. Feel free to ⭐ and clone this repo 😉

## 👨‍💻 Tech Stack

- Visual Studio Code
- Jupyter Notebook

## 📐 Set Up

The project began with a set of helper functions that streamline data analysis and model evaluation:

- **Plot Word Cloud**: Generates a word cloud for a specific label value and displays it in a subplot.
- **Plot Confusion Matrix**: Visualizes classification results using a confusion matrix.
- **Plot Precision/Recall Results**: Computes model accuracy, precision, recall, and F1-score for binary classification models, returning the results in a DataFrame.
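
As a rough illustration of the third helper, here is a minimal sketch that assumes scikit-learn and pandas are available. The function name `precision_recall_results`, its `model_name` parameter, and the column names are hypothetical; the notebook's own helper may be organized differently.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def precision_recall_results(y_true, y_pred, model_name="model"):
    """Return accuracy, precision, recall and F1 of a binary classifier as a one-row DataFrame."""
    accuracy = accuracy_score(y_true, y_pred)
    # average="binary" scores the positive class (label 1) only.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return pd.DataFrame(
        {
            "accuracy": [accuracy],
            "precision": [precision],
            "recall": [recall],
            "f1": [f1],
        },
        index=[model_name],
    )


# Toy labels: 1 = true news, 0 = fake news (this encoding is an assumption).
results = precision_recall_results([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 0, 1], "toy")
print(results)
```

Returning one row per model makes it straightforward to stack the rows with `pd.concat` and compare models side by side.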

## 👨‍🔬 Data Analysis

The analysis started with an examination of the dataset, including its columns and their distributions. The dataset consists of two files (fake and true), each with the following columns:

- Title
- Text
- Subject
- Date

### Labels Distribution

After merging the two files, the labels turned out to be well balanced: fake and true articles each make up roughly half of the data (about 52% and 48% of the raw rows), so neither oversampling nor undersampling was needed. The dataset initially contained 23,481 fake and 21,417 true news articles; 209 duplicate rows were removed.
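
The merge and balance check can be sketched with pandas as follows. The two toy DataFrames stand in for the fake and true files, and the `label` column name and its values are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-ins for the two source files; the real data would be loaded with pd.read_csv.
fake = pd.DataFrame({
    "title": ["a", "b", "b"],
    "text": ["x", "y", "y"],
    "subject": ["s1", "s2", "s2"],
    "date": ["d1", "d2", "d2"],
})
true = pd.DataFrame({
    "title": ["c", "d"],
    "text": ["z", "w"],
    "subject": ["s3", "s4"],
    "date": ["d3", "d4"],
})

# Tag each file with its label before merging (the column name "label" is hypothetical).
fake["label"] = "fake"
true["label"] = "true"

news = pd.concat([fake, true], ignore_index=True)
news = news.drop_duplicates().reset_index(drop=True)

# Share of each label; values near 0.5 indicate a balanced dataset.
distribution = news["label"].value_counts(normalize=True)
print(distribution)
```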

### Subjects Distribution

The subject column contains eight different topics, and true news and fake news fall under different subjects. In other words, the labels are clearly separated by subject.
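
One way to check this kind of separation is a cross-tabulation of subject against label. The sketch below uses made-up subject names and a hypothetical `label` column; it is not taken from the notebook.

```python
import pandas as pd

# Toy data: subjects s1 and s2 hold only fake articles, s3 and s4 only true ones.
news = pd.DataFrame({
    "subject": ["s1", "s1", "s2", "s3", "s3", "s4"],
    "label": ["fake", "fake", "fake", "true", "true", "true"],
})

# Rows are subjects, columns are labels, cells are article counts.
counts = pd.crosstab(news["subject"], news["label"])
print(counts)

# Subjects that appear under both labels; an empty list means
# the subject alone is enough to tell the labels apart.
shared = counts[(counts["fake"] > 0) & (counts["true"] > 0)].index.tolist()
print(shared)
```

When the list of shared subjects is empty, the subject column effectively encodes the label, which is a reason to be cautious about feeding it to a model.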

### WordCloud

A word cloud visualization showed that the terms "Trump" and "US" were among the most common words in both label categories.
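
A word cloud is essentially a picture of word frequencies, so the same observation can be checked with plain counts. This is a small sketch on invented sentences using only the standard library; the helper name `top_words` is hypothetical.

```python
from collections import Counter

# Toy documents grouped by label; the real analysis runs on the full article text.
docs = {
    "fake": ["Trump said US would act", "US and Trump again"],
    "true": ["US officials said Trump signed", "Trump met US leaders"],
}


def top_words(texts, n=2):
    """Count whitespace-separated tokens across texts and return the n most frequent."""
    counter = Counter(word for text in texts for word in text.split())
    return counter.most_common(n)


for label, texts in docs.items():
    print(label, top_words(texts))
```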

## 📶 Data Preprocessing

In parallel with data analysis, several preprocessing steps were undertaken to create a clean dataset for further modeling:

- Removal of duplicate rows
- Elimination of rows with empty cells
- Merging of the text and title columns into a single column
- Text cleaning: removal of punctuation, numbers, special characters, and stopwords, followed by lemmatization

After cleaning, approximately 6,000 rows turned out to be duplicates of other rows. Removing them left a final dataset of 38,835 rows while maintaining a balanced label distribution.
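
The cleaning steps listed above can be sketched as follows. This is a simplified illustration, not the notebook's code: the stopword list is a tiny hand-written sample, lowercasing is an added assumption, lemmatization is left out to avoid external downloads, and the `content` column name is hypothetical.

```python
import re

import pandas as pd

# A tiny illustrative stopword list; a real run would use a full list from an NLP library.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def clean_text(text):
    """Lowercase, keep only ASCII letters and spaces, then drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # removes punctuation, digits and special characters
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(tokens)


news = pd.DataFrame({
    "title": ["Breaking: 3 things!", "Breaking 3 things", None],
    "text": ["The senate voted in 2017.", "the Senate voted in 2017", "orphan row"],
})

news = news.dropna()  # eliminate rows with empty cells
news["content"] = (news["title"] + " " + news["text"]).map(clean_text)  # merge title and text

# Cleaning can make previously distinct rows identical, so deduplicate again.
news = news.drop_duplicates(subset="content").reset_index(drop=True)
print(news["content"].tolist())
```

The first two toy rows differ only in punctuation and casing, so they collapse into one row after cleaning. This is one plausible way additional duplicates can appear once the text has been normalized.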

## 👨‍🔬 Modeling

The project involved training several models with varying configurations: a Multinomial Naive Bayes baseline, a basic feed-forward Keras model, several recurrent models (LSTM, GRU, and bidirectional LSTM), and a basic CNN model.
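
To make one of these configurations concrete, here is a minimal sketch of a bidirectional LSTM fed by `TextVectorization` and a Keras `Embedding` layer. It assumes TensorFlow 2.x; the vocabulary size, sequence length, layer widths, toy sentences, and label encoding are all placeholder assumptions rather than the notebook's settings.

```python
import numpy as np
import tensorflow as tf

# Toy corpus; 1 = true news, 0 = fake news (this encoding is an assumption).
texts = [
    "senate passes budget bill",
    "shocking secret they hide",
    "court rules on appeal",
    "you will not believe this",
]
labels = np.array([1, 0, 1, 0], dtype="float32")

# Map raw strings to fixed-length integer sequences.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=1000, output_sequence_length=8)
vectorizer.adapt(tf.constant(texts))
X = vectorizer(tf.constant(texts)).numpy()

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=1000, output_dim=16),  # learned word vectors
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(8)),    # reads the sequence in both directions
    tf.keras.layers.Dense(1, activation="sigmoid"),            # probability of the positive class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=1, verbose=0)

probs = model.predict(X, verbose=0)
print(probs.shape)
```

Vectorizing outside the model keeps the network's input purely numeric; the layer could also be placed inside the model so that it accepts raw strings directly.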

### Model Performance Evaluation

All models achieved high accuracy, frequently above 90%. The evaluation proceeded in several steps:

1. **Baseline Model with GridSearch:**
   - A Multinomial Naive Bayes model was established using the `TfidfVectorizer`.
   - Despite being a basic model, it set the initial benchmark for performance.

2. **Advanced Models with TextVectorization and Keras Embedding:**
   - A series of neural network models was tested using the `TextVectorization` layer and a Keras `Embedding` layer.
   - These models consistently reached accuracies exceeding 99%.
   - The learned vectorization and embedding significantly improved performance over the baseline.

3. **Best-Performing Model: LSTM Bidirectional with Tokenization and Word Embedding:**
   - The LSTM Bidirectional model, known for its sequence modeling capabilities, was identified as the best performer.
   - It was further evaluated with a different tokenizer and embedding, specifically `text_to_word_sequence` and a Word2Vec embedding.
   - With this setup, performance remained high, but accuracy was slightly lower than that of the other models.
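
The baseline from step 1 can be sketched as a scikit-learn pipeline tuned with `GridSearchCV`. The parameter grid, the number of folds, and the toy sentences are assumptions chosen to keep the example small; the notebook's actual grid is not stated here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus; 1 = true news, 0 = fake news (this encoding is an assumption).
texts = [
    "senate passes budget bill",
    "court rules on appeal today",
    "minister signs trade agreement",
    "parliament votes on new law",
    "shocking secret they hide",
    "you will not believe this",
    "celebrity clone rumor exposed",
    "miracle cure doctors hate",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # text -> TF-IDF weighted term counts
    ("nb", MultinomialNB()),       # Naive Bayes over the TF-IDF features
])

# Hypothetical grid: n-gram range for the vectorizer, smoothing strength for Naive Bayes.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
predictions = search.predict(["senate votes on budget"])
```

Putting the vectorizer inside the pipeline means it is refit on each training fold, so the cross-validated score is not inflated by vocabulary learned from the held-out fold.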

## 👏 App Deployment

The last step was to deploy the app using Streamlit, where it can be tested online.