https://github.com/saksham-jain177/text_classification

ML pipeline for classifying IMDb reviews as positive or negative using TF-IDF and Logistic Regression. Features an interactive Streamlit UI with caching for efficient predictions.
https://github.com/saksham-jain177/text_classification

machine-learning nlp sentiment-analysis streamlit text-classification

Last synced: about 1 month ago
JSON representation

ML pipeline for classifying IMDb reviews as positive or negative using TF-IDF and Logistic Regression. Features an interactive Streamlit UI with caching for efficient predictions.

Host: GitHub
URL: https://github.com/saksham-jain177/text_classification
Owner: saksham-jain177
Created: 2025-02-11T14:54:24.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-11T14:56:48.000Z (over 1 year ago)
Last Synced: 2025-08-18T04:06:53.311Z (10 months ago)
Topics: machine-learning, nlp, sentiment-analysis, streamlit, text-classification
Language: Python
Homepage:
Size: 6.84 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# Customer Review Sentiment Classification

## Overview
This project implements a machine learning pipeline for classifying customer reviews from the IMDb dataset as positive or negative. The solution covers data loading, text preprocessing, TF-IDF feature extraction, model training, evaluation, and an interactive Streamlit interface for real-time predictions.**The training pipeline also uses caching and persistent model saving to avoid retraining on every run.**

## Objectives
- **Data Collection:** Load reviews from the [aclImdb dataset](https://ai.stanford.edu/~amaas/data/sentiment/).
- **Preprocessing:** Clean and tokenize reviews.
- **Feature Extraction:** Convert text to TF-IDF features.
- **Model Training:** Train a classifier (Logistic Regression) to predict sentiment.
- **Evaluation:** Assess model performance using standard metrics.
- **User Interface:** Provide an interactive UI for evaluation and review classification.

## Components

### Data Collection & Preprocessing
- Load the [aclImdb dataset](https://ai.stanford.edu/~amaas/data/sentiment/) (organized into train/test with positive and negative reviews).
- Clean and tokenize review texts.

### Feature Extraction
- Use TF-IDF vectorization to convert reviews into numerical features.

### Model Training & Evaluation
- Split the data into training and test sets.
- Train a Logistic Regression model.
- Evaluate the model using accuracy, precision, recall, F1-score, and a confusion matrix.

### User Interface
- A Streamlit app to run the entire pipeline, display evaluation metrics, and classify new reviews.
- Caching and model persistence to avoid retraining on every run.

## How to Run

1. **Clone the Repository:**

```
git clone https://github.com/saksham-jain177/text_classification.git
cd text-classification
```
2. **Install Dependencies:**
```
pip install -r requirements.txt
```
3. **Run the Application:**
```
streamlit run app.py
```

## Directory Structure
text_classification/
├── app.py # Main application file (Streamlit interface)
├── data/
│ └── aclImdb/ # IMDb dataset organized into train/test with pos/neg reviews
├── evaluation.py # metrics and visualization
├── feature_extraction.py # TF-IDF feature extraction
├── model.py # model training
├── preprocessing.py # data loading and text preprocessing
├── requirements.txt
└── README.md
## Challenges and Insights
- Balancing data cleaning and feature extraction to capture meaningful signals.
- Tuning the TF-IDF vectorizer for effective text representation.
- Achieving robust model performance given the variability in customer reviews.

## Future Improvements
- Experimenting with alternative classifiers (e.g., Naive Bayes, SVM) and ensemble methods.
- Integrate hyperparameter tuning for optimized performance.
- Enhance the UI with additional visualizations and batch prediction capabilities.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/saksham-jain177/text_classification

Awesome Lists containing this project

README