https://github.com/imran-sony/sentiment-analysis-imdb
Sentiment Analysis of IMDB Review Dataset
beautifulsoup bert-embeddings gensim nltk pytorch tf-idf transformers word2vec-embeddinngs
- Host: GitHub
- URL: https://github.com/imran-sony/sentiment-analysis-imdb
- Owner: imran-sony
- Created: 2025-11-03T16:48:18.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-03T17:03:32.000Z (7 months ago)
- Last Synced: 2025-11-03T19:09:08.723Z (7 months ago)
- Topics: beautifulsoup, bert-embeddings, gensim, nltk, pytorch, tf-idf, transformers, word2vec-embeddinngs
- Language: Jupyter Notebook
- Homepage:
- Size: 24.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# 🎬 Sentiment Analysis on IMDB Dataset
This project compares three different text-representation techniques — **TF-IDF, Word2Vec, and BERT embeddings** — for sentiment classification on the IMDB movie reviews dataset using Logistic Regression as the classifier.
## 📚 Project Overview
The goal is to evaluate how classical and modern NLP techniques perform on sentiment analysis tasks.
We use:
- **TF-IDF** → traditional statistical feature representation
- **Word2Vec** → word embeddings capturing semantic meaning
- **BERT (DistilBERT)** → transformer-based contextual embeddings
A Logistic Regression classifier is trained and evaluated on each representation, and the results are compared using standard classification metrics.
## 🧠 Workflow
### 1️⃣ Load Dataset
The IMDB dataset is loaded with the Hugging Face `datasets` library:

```python
from datasets import load_dataset

dataset = load_dataset('imdb')
```

The dataset ships with predefined `train` and `test` splits of 25,000 labeled reviews each.
### 2️⃣ Preprocessing
Steps include:
- Lowercasing
- Removing HTML tags
- Removing punctuation and numbers
- Tokenization with NLTK
- Stopword removal
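
The steps above can be sketched in a few lines. This is a dependency-free illustration, not the notebook's code: a regular expression stands in for BeautifulSoup, whitespace splitting stands in for the NLTK tokenizer, and `STOPWORDS` is a tiny hypothetical set rather than NLTK's full English list.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's English list is much larger.
STOPWORDS = {"a", "an", "the", "is", "was", "it", "this", "of", "and", "to", "in"}


def preprocess(text):
    """Return a list of cleaned tokens for one review."""
    text = text.lower()                               # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)              # drop HTML tags such as <br />
    text = re.sub(r"[^a-z\s]", " ", text)             # drop punctuation and digits
    tokens = text.split()                             # whitespace tokenization (stand-in for NLTK)
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


tokens = preprocess("This movie was GREAT!<br />10/10, loved it.")
print(tokens)
```

Replacing removed characters with a space rather than an empty string keeps words on either side of a tag from being glued together.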
### 3️⃣ Feature Extraction Methods
🔹 TF-IDF
- Represent text as numerical vectors using term frequency–inverse document frequency.
- Train a Logistic Regression classifier on the resulting vectors.
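
To make the weighting concrete, here is a minimal from-scratch sketch using the textbook formula tf × log(N / df). The function name `tfidf_vectors` is hypothetical, and the numbers will differ from scikit-learn's `TfidfVectorizer`, which by default uses a smoothed IDF and L2-normalizes each vector.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (vocabulary, list of TF-IDF vectors)."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        length = len(doc) or 1  # guard against empty documents
        # Term frequency (count / document length) times inverse document frequency.
        vectors.append([counts[tok] / length * idf[tok] for tok in vocab])
    return vocab, vectors


docs = [["great", "movie"], ["bad", "movie"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors)
```

In this toy corpus "movie" appears in every document, so its IDF is log(1) = 0 and it contributes nothing, while "great" and "bad" carry all the weight.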
🔹 Word2Vec
- Train a Word2Vec model on tokenized text.
- Represent each document as the average of its word vectors.
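
The averaging step can be sketched as follows. `average_vector` and `toy_vectors` are hypothetical names, and the 3-dimensional vectors are made up for illustration; with Gensim, the trained model's `wv` attribute would supply the real lookups. Returning a zero vector for documents with no known words is an assumption of this sketch.

```python
def average_vector(tokens, word_vectors, dim):
    """Mean of the vectors of all in-vocabulary tokens; zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim  # assumption: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 3-dimensional vectors; a trained Word2Vec model would supply real ones.
toy_vectors = {"great": [1.0, 0.0, 2.0], "movie": [3.0, 2.0, 0.0]}

doc_vec = average_vector(["great", "movie", "unseen"], toy_vectors, dim=3)
print(doc_vec)
```

Out-of-vocabulary tokens such as "unseen" are skipped, so they do not pull the average toward zero.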
🔹 BERT (DistilBERT)
- Use DistilBERT embeddings for contextual representation.
- Extract token embeddings from the last hidden state.
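
Logistic Regression needs one fixed-length vector per review, so the per-token embeddings have to be reduced somehow. The README does not say which reduction the notebook uses; the sketch below shows two common options on toy data, taking the first ([CLS]) position or averaging over non-padding tokens. The function name `masked_mean_pool` and the 2-dimensional vectors are illustrative assumptions (DistilBERT base produces 768-dimensional vectors).

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose attention-mask entry is 1, ignoring padding."""
    dim = len(hidden_states[0])
    kept = [vec for vec, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        return [0.0] * dim
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy "last hidden state" for one review: 4 token positions, 2 dimensions (made up).
hidden = [[1.0, 1.0], [3.0, 5.0], [2.0, 0.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]  # the last position is padding

cls_vec = hidden[0]                         # option 1: vector at the first ([CLS]) position
mean_vec = masked_mean_pool(hidden, mask)   # option 2: mean over real tokens
print(cls_vec, mean_vec)
```

Masking matters because padded positions still receive vectors from the model, and including them would skew the average for short reviews.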
### 4️⃣ Classification
A Logistic Regression classifier is trained on each feature representation.
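
For readers who want to see what the classifier does with a feature vector, here is a from-scratch sketch of binary logistic regression trained by batch gradient descent. It is not the notebook's code, which presumably uses scikit-learn (listed under Technologies Used); the function names, learning rate, and epoch count are illustrative assumptions, and unlike scikit-learn's `LogisticRegression` this sketch applies no regularization.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this example: p(y=1 | x) - y.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(X, w, b):
    """Label 1 when the linear score is non-negative, i.e. p(y=1 | x) >= 0.5."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0 for xi in X]


# Toy 2-dimensional features (made up); label 1 = positive, 0 = negative.
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
print(predict(X, w, b))
```

The same training loop applies unchanged whether the rows of `X` are TF-IDF vectors, averaged Word2Vec vectors, or pooled DistilBERT embeddings; only the dimensionality differs.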
### 5️⃣ Evaluation Metrics
Evaluate performance using:
- Accuracy
- Precision
- Recall
- F1-score
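
As a reference for how these four metrics relate, the sketch below computes them from the confusion counts for binary labels. `binary_metrics` is a hypothetical helper and the labels are made up; scikit-learn offers equivalent functions in `sklearn.metrics` (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`).

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration: 2 true positives, 2 true negatives, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision asks how many predicted positives were correct, recall asks how many actual positives were found, and F1 is their harmonic mean.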
## 📊 Comparison

## 🧩 Technologies Used
- Python
- PyTorch
- Hugging Face Transformers
- scikit-learn
- NLTK
- Gensim
- BeautifulSoup
- Datasets Library
## 🚀 How to Run
Clone this repository:

```shell
git clone https://github.com/imran-sony/sentiment-analysis-imdb.git
cd sentiment-analysis-imdb
```

Then open `IMDB.ipynb` in Jupyter and run the cells.