https://github.com/ankitwalimbe/sentiment-analysis
Sentiment analysis of Amazon Fashion reviews using VADER and a baseline ML model (TF-IDF + SGDClassifier). Includes visualizations, reproducible notebook, and recruiter-ready documentation.
data-analysis machine-learning matplotlib nlp pandas python seaborn sentiment-analysis sklearn
- Host: GitHub
- URL: https://github.com/ankitwalimbe/sentiment-analysis
- Owner: AnkitWalimbe
- Created: 2025-09-19T20:34:33.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2025-09-19T21:24:32.000Z (9 months ago)
- Last Synced: 2025-09-19T23:57:17.324Z (9 months ago)
- Topics: data-analysis, machine-learning, matplotlib, nlp, pandas, python, seaborn, sentiment-analysis, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 883 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Sentiment Analysis on Amazon Fashion Reviews
This project applies Natural Language Processing (NLP) to analyze customer sentiment in **Amazon Fashion reviews**.
It combines **VADER sentiment analysis** with a basic **machine learning model (SGDClassifier + TF-IDF)** to classify reviews into **positive, neutral, or negative** categories.
---
## Project Overview
- **Dataset**: Amazon Fashion Reviews (JSONL format, >1M reviews).
  The dataset is too large to host here; see [data/README.md](data/README.md) for download instructions.
- **Goal**: Explore review text, extract sentiment, and compare rule-based (VADER) vs machine learning approaches.
- **Why it matters**: Customer sentiment is a key driver for **returns reduction**, **CX insights**, and **product feedback loops**.
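
A minimal sketch of one way to read a JSONL file of this size without loading it all into memory. The field names `rating`, `title`, `text`, `asin`, and `user_id` are the ones listed under Workflow; the `timestamp` key and its millisecond unit are assumptions about the raw file, and `iter_reviews` is a hypothetical helper, not code taken from the notebook.

```python
import json
from datetime import datetime, timezone

# Fields kept for analysis; key names are assumed to match the raw JSONL.
FIELDS = ("rating", "title", "text", "asin", "user_id")


def iter_reviews(path, limit=None):
    """Yield one cleaned dict per JSONL line, skipping rows without review text."""
    kept = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            raw = json.loads(line)
            if not raw.get("text"):
                continue  # a review without text cannot be scored
            row = {field: raw.get(field) for field in FIELDS}
            # Assumption: "timestamp" is a Unix epoch in milliseconds.
            ts = raw.get("timestamp")
            if ts is not None:
                moment = datetime.fromtimestamp(ts / 1000, tz=timezone.utc)
                row["date"] = moment.date().isoformat()
                row["time"] = moment.time().isoformat(timespec="seconds")
            yield row
            kept += 1
            if limit is not None and kept >= limit:
                break
```
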
---
## Project Highlights
- Processed **~23K Amazon Fashion reviews**.
- Applied **VADER sentiment analysis** to classify reviews.
- Built a baseline **ML model** using TF-IDF + SGDClassifier.
- Visualized sentiment distributions and agreement between VADER and ML.
- Saved key plots and reports for reproducibility.
---
## Repository Structure
```plaintext
sentiment-analysis/
├── data/        <- raw data (ignored), dataset instructions in README.md
├── notebooks/   <- Jupyter notebooks
├── outputs/     <- figures and saved results
├── reports/     <- generated CSVs (ignored in Git)
├── .gitignore
└── README.md    <- project documentation
```
---
## Workflow
1. **Data Preprocessing**
   - Converted timestamps into structured `date` and `time`.
   - Selected relevant fields: `rating`, `title`, `text`, `asin`, `user_id`.
   - Handled missing values.
2. **Exploratory Data Analysis**
   - Visualized rating distributions.
   - Checked review counts by star ratings.
3. **Sentiment Analysis with VADER**
   - Applied **NLTK VADER** to calculate `pos`, `neg`, `neu`, and `compound` scores.
   - Classified reviews into **positive, negative, neutral** buckets.
4. **Machine Learning Model (Baseline)**
   - Used **TF-IDF Vectorizer** to transform review text.
   - Trained **SGDClassifier** (linear SVM) on balanced subsets.
   - Evaluated with accuracy, classification report, and confusion matrix.
5. **Comparison: VADER vs ML**
   - Created a comparison dataset of predictions.
   - Observed significant **bias toward positive reviews** due to data imbalance.
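
Step 3 can be sketched roughly as follows. The ±0.05 cut-offs on the `compound` score are the thresholds conventionally used with VADER; whether the notebook uses exactly these values is an assumption, and `label_from_compound` and `score_reviews` are hypothetical helper names. `score_reviews` needs NLTK installed and a one-time `nltk.download("vader_lexicon")`.

```python
def label_from_compound(compound, pos_threshold=0.05, neg_threshold=-0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment bucket."""
    if compound >= pos_threshold:
        return "positive"
    if compound <= neg_threshold:
        return "negative"
    return "neutral"


def score_reviews(texts):
    """Return (compound, label) pairs for an iterable of review strings."""
    # Imported here so the thresholding helper works without NLTK installed.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    results = []
    for text in texts:
        scores = analyzer.polarity_scores(text)  # keys: neg, neu, pos, compound
        compound = scores["compound"]
        results.append((compound, label_from_compound(compound)))
    return results
```
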
---
## Key Outputs
- **Figures**:
  - Rating distribution plots.
  - Sentiment distribution (VADER).
  - Confusion matrices (ML model & VADER vs ML).
- **Reports**:
  - Sentiment comparison CSV.

All outputs are saved in the `outputs/figures/` and `reports/` directories.
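
As an illustration of what the VADER vs ML comparison measures, the sketch below counts label pairs and the share of reviews on which both methods agree. The label lists are made-up toy data and `agreement_table` is a hypothetical helper; the notebook's own comparison code may differ.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")


def agreement_table(vader_labels, ml_labels):
    """Count (VADER label, ML label) pairs and the share of matching rows."""
    if len(vader_labels) != len(ml_labels):
        raise ValueError("both label lists must have the same length")
    pairs = Counter(zip(vader_labels, ml_labels))
    # Rows are VADER labels, columns are ML labels, both in LABELS order.
    matrix = [[pairs[(v, m)] for m in LABELS] for v in LABELS]
    matches = sum(pairs[(label, label)] for label in LABELS)
    rate = matches / len(vader_labels) if vader_labels else 0.0
    return matrix, rate


# Toy predictions, for illustration only.
vader = ["positive", "positive", "neutral", "negative", "positive"]
ml = ["positive", "neutral", "neutral", "positive", "positive"]
matrix, rate = agreement_table(vader, ml)
for label, row in zip(LABELS, matrix):
    print(f"{label:>8}: {row}")
print(f"agreement rate: {rate:.2f}")
```
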
---
## Learnings
- VADER works well as a quick, rule-based baseline for sentiment.
- ML models trained on imbalanced data struggle to classify negative/neutral reviews.
- TF-IDF ignores word order and context; dense embeddings such as Word2Vec or GloVe, or contextual models such as BERT, could improve results.
- Undersampling balances training data but can **lose information**. Alternative approaches (e.g., SMOTE) may perform better.
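
The information loss from undersampling is easy to see on toy data. The sketch below downsamples every class to the size of the smallest one and reports how many rows survive; the class counts are invented for illustration, and `undersample` is a hypothetical helper rather than the notebook's implementation.

```python
import random
from collections import Counter, defaultdict


def undersample(rows, label_key="label", seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# Toy data skewed toward positive reviews.
rows = (
    [{"text": f"good {i}", "label": "positive"} for i in range(80)]
    + [{"text": f"meh {i}", "label": "neutral"} for i in range(12)]
    + [{"text": f"bad {i}", "label": "negative"} for i in range(8)]
)
balanced = undersample(rows)
print(Counter(row["label"] for row in balanced))
print(f"kept {len(balanced)} of {len(rows)} rows")
```
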
---
## Next Steps
- Experiment with **Word2Vec / BERT embeddings**.
- Test **Random Forests / Neural Networks** for better classification.
- Fine-tune sentiment thresholds in VADER.
- Automate preprocessing and evaluation via a pipeline.
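
One possible shape for such a pipeline, using scikit-learn's `Pipeline` to chain the TF-IDF and SGDClassifier steps from the baseline. This is a sketch under assumptions: the n-gram range, split size, and seed are illustrative choices, the corpus is a made-up toy set, and `build_pipeline` and `train_and_report` are hypothetical names.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def build_pipeline(seed=42):
    """TF-IDF features feeding a linear SVM trained with SGD."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
        # loss="hinge" makes SGDClassifier a linear SVM.
        ("clf", SGDClassifier(loss="hinge", random_state=seed)),
    ])


def train_and_report(texts, labels, seed=42):
    """Fit on a stratified train split and report metrics on the held-out split."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.25, stratify=labels, random_state=seed
    )
    model = build_pipeline(seed).fit(x_train, y_train)
    report = classification_report(y_test, model.predict(x_test), zero_division=0)
    return model, report


# Tiny illustrative corpus; real runs would use the preprocessed reviews.
texts = [
    "love this dress, fits perfectly",
    "great quality and fast shipping",
    "beautiful shoes, very comfortable",
    "excellent jacket, highly recommend",
    "it is okay, nothing special",
    "average quality for the price",
    "fits fine but color is plain",
    "decent shirt, neither good nor bad",
    "terrible fabric, fell apart",
    "too small and poorly made",
    "awful stitching, returned it",
    "cheap material, very disappointed",
]
labels = ["positive"] * 4 + ["neutral"] * 4 + ["negative"] * 4
model, report = train_and_report(texts, labels)
print(report)
```
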
---
## Tech Stack
- **Python** (pandas, NumPy, tqdm)
- **NLP**: NLTK, VADER Sentiment
- **ML**: scikit-learn, imbalanced-learn
- **Visualization**: seaborn, matplotlib
---
## Citation
Dataset reference:
Hou, Yupeng et al. (2024).
*Bridging Language and Items for Retrieval and Recommendation.*
[arXiv:2403.03952](https://arxiv.org/abs/2403.03952)
---