https://github.com/securdrgorp/bigdata-amazon-reviews
Build a system capable of predicting sentiment (positive, neutral, negative) of comments in real time (online) and displaying the results in an offline dashboard.
- Host: GitHub
- URL: https://github.com/securdrgorp/bigdata-amazon-reviews
- Owner: SecurDrgorP
- Created: 2025-05-13T03:41:09.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-01T21:28:47.000Z (about 1 year ago)
- Last Synced: 2025-06-02T07:27:48.323Z (about 1 year ago)
- Topics: amazon-reviews, big-data, docker, docker-compose, flask, kafka, mongodb, shell-scripts, spark, spark-streaming
- Language: Python
- Homepage:
- Size: 10.5 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Amazon Reviews Sentiment Analysis - Big Data Project
## Overview
This project implements an end-to-end big data pipeline for analyzing Amazon product reviews. It combines data preprocessing, a machine learning model that classifies each review as positive, neutral, or negative, and real-time streaming to process reviews and visualize the results in a web dashboard.
## Architecture
The system architecture consists of the following components:
- **Data Preprocessing**: Cleans and transforms raw Amazon review data
- **Machine Learning**: Trains and deploys a sentiment analysis model
- **Streaming Pipeline**: Processes reviews in real time using Kafka and Spark
- **Storage Layer**: Stores processed data and results in MongoDB
- **Web Dashboard**: Visualizes insights through a Flask web application
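As an illustration of what the Data Preprocessing component might do, here is a minimal text-cleaning sketch. The function name `clean_review_text` and the specific cleaning steps are assumptions made for this example; the repository's actual preprocessing may differ.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")   # matches HTML tags such as <br />
_SPACE_RE = re.compile(r"\s+")     # matches runs of whitespace


def clean_review_text(raw: str) -> str:
    """Normalize one review body (hypothetical helper, not from the repository)."""
    text = _TAG_RE.sub(" ", raw)       # drop HTML tags first
    text = html.unescape(text)         # then decode entities, e.g. "&amp;" -> "&"
    text = text.lower()
    return _SPACE_RE.sub(" ", text).strip()


print(clean_review_text("Great product &amp; fast shipping!<br />Would   buy AGAIN."))
```

Tags are removed before entities are decoded so that an escaped `&lt;` in a review is not mistaken for the start of a tag.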
## Technologies
- **Apache Kafka**: Message streaming platform
- **Apache Spark**: Distributed data processing
- **MongoDB**: NoSQL database for storing reviews and results
- **Flask**: Web framework for the dashboard
- **Docker**: Containerization for easy deployment
- **Python**: Primary programming language
- **spaCy**: NLP library for text processing
## Installation & Setup
### Prerequisites
- Docker and Docker Compose
- Python 3.7+
### Steps
1. Clone the repository:
```bash
git clone https://github.com/securdrgorp/bigdata-amazon-reviews.git
cd bigdata-amazon-reviews
```
2. Create and activate a virtual environment:
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/macOS
source venv/bin/activate
# On Windows
venv\Scripts\activate
```
3. Create and configure environment variables:
```bash
cp .env.example .env
# Edit .env with appropriate values
```
4. Install Python dependencies and the spaCy language model:
```bash
pip install -r requirements.txt
python -m spacy download en_core_web_sm
```
5. Start Docker containers:
```bash
docker-compose up -d
```
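For reference, a Python component could read the values configured in `.env` roughly as sketched below. The variable names (`KAFKA_BOOTSTRAP_SERVERS`, `KAFKA_TOPIC`, `MONGO_URI`) and their defaults are hypothetical; the authoritative list is in `.env.example`. Python does not read `.env` files on its own, so this assumes the values have already been exported into the process environment, for example by the shell or by a loader such as python-dotenv.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    kafka_bootstrap_servers: str
    kafka_topic: str
    mongo_uri: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read connection settings, falling back to local defaults.

    All variable names and defaults are hypothetical; check .env.example.
    """
    return Settings(
        kafka_bootstrap_servers=env.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092"),
        kafka_topic=env.get("KAFKA_TOPIC", "amazon-reviews"),
        mongo_uri=env.get("MONGO_URI", "mongodb://localhost:27017"),
    )


# Passing a plain dict makes the loader easy to try without touching the real environment.
print(load_settings({"KAFKA_TOPIC": "reviews"}))
```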
## Usage
1. Run the Spark consumer to process the data:
```bash
./run_consumer.sh
```
2. Start the Kafka producer to ingest review data:
```bash
./run_producer.sh
```
3. Access the dashboard at http://localhost:5000
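To give a rough idea of what the producer step involves, the sketch below serializes review records as JSON and hands them to a `send` callable. It is not the code behind `run_producer.sh`: the topic name, the record fields (`reviewText`, `overall`), and the helper names are assumptions, and the Kafka client is replaced by a stand-in so the example runs without a broker.

```python
import json
from typing import Callable, Dict, Iterable


def encode_review(review: Dict[str, object]) -> bytes:
    """Serialize one review as UTF-8 JSON (the payload format assumed here)."""
    return json.dumps(review, ensure_ascii=False).encode("utf-8")


def publish_reviews(
    reviews: Iterable[Dict[str, object]],
    send: Callable[[str, bytes], object],
    topic: str = "amazon-reviews",  # hypothetical topic name
) -> int:
    """Pass each encoded review to `send(topic, payload)` and return the count."""
    count = 0
    for review in reviews:
        send(topic, encode_review(review))
        count += 1
    return count


# Stand-in for a real Kafka client: collect messages in a list.
sent = []
publish_reviews(
    [{"reviewText": "Works as described.", "overall": 5}],  # hypothetical fields
    send=lambda topic, payload: sent.append((topic, payload)),
)
print(sent[0][0], json.loads(sent[0][1]))
```

With the kafka-python client, for instance, `send` could wrap a `KafkaProducer` instance's `send(topic, value=payload)` method. Whether the repository uses that client is not stated here.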
## Data Pipeline
1. **Data Preparation**: Raw Amazon review data is cleaned and preprocessed
2. **Producer**: Kafka producer streams review data into the pipeline
3. **Consumer**: Spark processes the streams and performs sentiment analysis
4. **Storage**: Results are stored in MongoDB
5. **Visualization**: Flask application renders insights through a web dashboard
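The flow from consumer to dashboard can be sketched in plain Python with in-memory stand-ins. Everything here is illustrative: `keyword_model` is a toy replacement for the trained model, a list takes the place of MongoDB, and the label tally mimics the kind of aggregate a dashboard might display. The real pipeline runs on Spark and Kafka.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

LABELS = ("positive", "neutral", "negative")


def keyword_model(text: str) -> str:
    """Toy stand-in for the trained model: scores a few hand-picked words."""
    words = text.lower().split()
    score = sum(w in {"great", "love", "excellent"} for w in words)
    score -= sum(w in {"bad", "broken", "terrible"} for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def process_stream(
    messages: Iterable[str],
    predict: Callable[[str], str],
    store: List[Dict[str, str]],
) -> Counter:
    """Label each message, append the result to `store`, and tally the labels."""
    totals = Counter({label: 0 for label in LABELS})
    for text in messages:
        label = predict(text)
        store.append({"text": text, "sentiment": label})  # MongoDB stand-in
        totals[label] += 1
    return totals


results: List[Dict[str, str]] = []
totals = process_stream(
    ["Great phone, love it", "Arrived broken", "It is a charger"],
    keyword_model,
    results,
)
print(dict(totals))
```

Passing `predict` and `store` as parameters keeps the stages decoupled, which mirrors how the pipeline separates the model, the stream processor, and the storage layer.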
## Model Training
The sentiment analysis model can be retrained using:
```bash
cd model
python evaluate_model.py
```
Alternatively, step through the training process in the notebook:
```bash
jupyter notebook model/train_model.ipynb
```
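When comparing models, overall accuracy alone can hide weak performance on a less frequent class, so per-class recall is worth checking as well. The sketch below computes both for string labels. It is a generic illustration with made-up labels, not the logic of `evaluate_model.py`.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def evaluate(y_true: Sequence[str], y_pred: Sequence[str]) -> Tuple[float, Dict[str, float]]:
    """Return overall accuracy and per-class recall for string labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    support = Counter(y_true)  # number of examples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct per class
    accuracy = sum(hits.values()) / len(y_true)
    recall = {label: hits[label] / n for label, n in support.items()}
    return accuracy, recall


# Made-up labels for illustration only.
truth = ["positive", "positive", "neutral", "negative"]
preds = ["positive", "neutral", "neutral", "negative"]
accuracy, recall = evaluate(truth, preds)
print(accuracy, recall)
```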
## Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Acknowledgments
- Amazon review dataset providers
- The open source community for the amazing tools used in this project