https://github.com/anthippi/naive-bayes-imdb-classification
A custom Naive Bayes classifier for sentiment analysis of movie reviews from the IMDb dataset, utilizing feature selection based on Information Gain and comparing its performance with scikit-learn's BernoulliNB.
- Host: GitHub
- URL: https://github.com/anthippi/naive-bayes-imdb-classification
- Owner: Anthippi
- License: mit
- Created: 2025-01-05T15:29:44.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-07-08T23:06:19.000Z (7 months ago)
- Last Synced: 2025-07-09T00:23:21.234Z (7 months ago)
- Topics: classification, imdb, matplotlib, naive-bayes-classifier, numpy, pandas, scikit-learn, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 143 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.en.md
- License: LICENSE
README
# Naive Bayes Classifier for Sentiment Analysis (IMDB Dataset)
This project implements a Naive Bayes classifier for sentiment analysis on movie reviews from the IMDb dataset. The classifier is built from scratch to demonstrate fundamental concepts of natural language processing (NLP), feature selection, and probabilistic modeling.
---
## Contents
- Vocabulary creation using `Information Gain`
- Text-to-feature vector conversion (bag-of-words)
- Custom `Naive Bayes` classifier implementation
- Comparison with `Scikit-learn BernoulliNB`
- Model evaluation with precision, recall, F1, and accuracy metrics
---
## Code Structure
### `WordStat`
Tracks per-word statistics: how many positive and how many negative documents contain each word.
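As a minimal sketch of the idea (field names here are illustrative, not necessarily those used in the notebook):
```python
from dataclasses import dataclass

@dataclass
class WordStat:
    """Per-word document counts, the raw input to Information Gain."""
    word: str
    pos_docs: int = 0  # positive documents containing the word
    neg_docs: int = 0  # negative documents containing the word

    @property
    def doc_count(self) -> int:
        return self.pos_docs + self.neg_docs
```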
### `VocabularyBuilder`
Implements vocabulary creation:
- Filters words (drops the `n` most frequent and the `k` least frequent)
- Calculates Information Gain (IG) for feature selection (see the sketch below)
- Generates final vocabulary with `m` features
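The notebook's exact implementation is not reproduced here, but for a binary word-presence feature the Information Gain reduces to a difference of entropies. A self-contained sketch using the document counts tracked above (all names are illustrative):
```python
import math

def entropy(p: float) -> float:
    """Binary entropy in bits; 0 by convention at p = 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(pos_docs: int, neg_docs: int,
                     total_pos: int, total_neg: int) -> float:
    """IG(C; X) = H(C) - P(X=1) H(C|X=1) - P(X=0) H(C|X=0)
    for a binary word-presence feature X and class C."""
    n = total_pos + total_neg
    present = pos_docs + neg_docs
    absent_pos = total_pos - pos_docs
    absent_neg = total_neg - neg_docs
    absent = absent_pos + absent_neg
    h_class = entropy(total_pos / n)
    h_present = entropy(pos_docs / present) if present else 0.0
    h_absent = entropy(absent_pos / absent) if absent else 0.0
    return h_class - (present / n) * h_present - (absent / n) * h_absent
```
Words are then ranked by this score, and the top `m` survive into the final vocabulary.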
### `FeatureVector`
Converts documents to feature vectors:
- Binary features (1 = word present, 0 = absent)
- Appends sentiment label (1 = positive, 0 = negative)
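A minimal sketch of this conversion (assuming each review has already been tokenized into a set of words):
```python
def to_feature_vector(tokens, vocabulary, label):
    """Binary bag-of-words vector for one review.

    tokens:     set of words appearing in the review
    vocabulary: ordered list of the m selected words
    label:      1 = positive, 0 = negative (appended last)
    """
    vector = [1 if word in tokens else 0 for word in vocabulary]
    vector.append(label)
    return vector
```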
### `NaiveBayes`
Custom Naive Bayes implementation:
- Uses log probabilities with Laplace smoothing
- Calculates log-likelihood for each feature
- Predicts most probable class for new documents
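A compact sketch of a Bernoulli-style Naive Bayes on binary vectors, with Laplace smoothing and log-probability scoring (assuming features and labels are split into a matrix `X` and a vector `y`; the notebook's class may differ in detail):
```python
import numpy as np

class NaiveBayesSketch:
    """Bernoulli Naive Bayes on binary feature matrices (illustrative)."""

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.log_prior_ = np.log(
            np.array([(y == c).mean() for c in self.classes_]))
        # Laplace-smoothed P(x_j = 1 | class), one row per class
        prob = np.array([(X[y == c].sum(axis=0) + 1) / ((y == c).sum() + 2)
                         for c in self.classes_])
        self.log_p1_ = np.log(prob)        # log P(x_j = 1 | c)
        self.log_p0_ = np.log(1.0 - prob)  # log P(x_j = 0 | c)
        return self

    def predict(self, X):
        X = np.asarray(X)
        # log P(c) + sum_j [x_j log p1 + (1 - x_j) log p0] per class
        scores = (self.log_prior_
                  + X @ self.log_p1_.T
                  + (1 - X) @ self.log_p0_.T)
        return self.classes_[np.argmax(scores, axis=1)]
```
Working in log space avoids floating-point underflow when multiplying hundreds of per-feature probabilities.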
---
## Customization
Modify these parameters in the notebook:
```python
# Vocabulary parameters
n = 50 # Ignore top-n frequent words
k = 80 # Ignore bottom-k rare words
m = 500 # Select top-m features
# Training parameters
train_size = 25000  # Total training samples (12,500 per class)
test_size = 25000   # Total test samples (12,500 per class)
```
---
## Execution Instructions
1. Download the IMDB dataset (e.g. the Stanford Large Movie Review Dataset)
2. Unzip the dataset to your project directory
3. Run the Jupyter notebook:
```bash
jupyter notebook NaiveBayesClassifier.ipynb
```
## Execution Pipeline
### Vocabulary Creation
Parameters: `n=50`, `k=80`, `m=500`
- Removes the 50 most frequent words and the 80 least frequent words
- Selects top-500 words based on Information Gain
### Feature Vector Creation
- Converts all reviews into binary feature vectors
### Training & Evaluation
- Trains custom Naive Bayes classifier
- Trains Scikit-learn BernoulliNB for comparison
- Prints precision, recall, F1-score, and accuracy metrics
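The scikit-learn side of the comparison presumably boils down to a few lines like these (`X_train`, `y_train`, `X_test`, and `y_test` are assumed names for the split data):
```python
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

model = BernoulliNB(alpha=1.0)  # alpha=1.0 corresponds to Laplace smoothing
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=["Negative", "Positive"]))
```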
## Model Performance Comparison
|Model | Accuracy | Precision | Recall | F1 Score |
|----------------------------|----------|-----------|--------|----------|
| Custom NaiveBayes | 0.85 | 0.84 | 0.85 | 0.84 |
| Scikit-Learn NaiveBayes | 0.86 | 0.85 | 0.86 | 0.85 |
## Requirements
- Python 3.7+
- Libraries:
- `numpy`
- `pandas`
- `matplotlib`
- `scikit-learn`
Install dependencies:
```bash
pip install numpy pandas matplotlib scikit-learn seaborn jupyter
```
---
## Key Features
### Information Gain Calculation
Statistical method for selecting the most discriminative features
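Formally, the gain for a binary word-presence feature X with respect to the class label C is the drop in class entropy from observing the feature:
```math
IG(C; X) = H(C) - \sum_{x \in \{0, 1\}} P(X = x)\, H(C \mid X = x)
```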
### Laplace Smoothing
Handles unseen words during classification
### Efficient Vectorization
Binary features optimized for memory and performance
### Visual Analytics
Learning curves, confusion matrices, and feature importance plots
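For example, a confusion matrix can be plotted directly from the predictions with scikit-learn and matplotlib (a sketch; `y_test` and `y_pred` as in the evaluation step above):
```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=["Negative", "Positive"], cmap="Blues")
plt.title("Custom Naive Bayes: confusion matrix")
plt.show()
```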
---
## Sample Output
```text
Vocabulary creation completed (500 words)
First 10 vocabulary words: ['excellent', 'wonderful', 'best', 'perfect', 'great', ...]

Training set size: 25,000 vectors
Test set size: 25,000 vectors

Custom NaiveBayes Results:
              precision    recall  f1-score   support

    Negative       0.84      0.85      0.84     12500
    Positive       0.85      0.84      0.84     12500

    accuracy                           0.85     25000

Scikit-Learn Results:
              precision    recall  f1-score   support

    Negative       0.85      0.86      0.85     12500
    Positive       0.86      0.85      0.85     12500

    accuracy                           0.86     25000
```