https://github.com/abdellatif-laghjaj/hespress-comments-analysis
This project performs sentiment analysis on comments scraped from Hespress articles. It uses a big data pipeline consisting of Apache Kafka, Apache Spark, HDFS, and MongoDB to process and store the data.
- Host: GitHub
- URL: https://github.com/abdellatif-laghjaj/hespress-comments-analysis
- Owner: abdellatif-laghjaj
- Created: 2024-11-24T11:15:34.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-12-16T20:58:53.000Z (10 months ago)
- Last Synced: 2025-01-27T06:55:37.224Z (9 months ago)
- Topics: big-data, comments-system, deep-learning, lstm, sentiment-analysis
- Language: Jupyter Notebook
- Homepage:
- Size: 215 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Hespress Comment Sentiment Analysis
This project performs sentiment analysis on comments scraped from Hespress articles. It uses a big data pipeline
consisting of Apache Kafka, Apache Spark, HDFS, and MongoDB to process and store the data.

## Contributors
- **LAGHAJ ABDELLATIF:** https://github.com/abdellatif-laghjaj
- **EL BETTIOUI REDA:** https://github.com/XredaX
- **ADRIOUICH ABDELILAH:** https://github.com/isMeXar

## Architecture
The project follows a hybrid batch and real-time processing architecture:
1. **Data Source (Hespress):** Comments are scraped from Hespress articles using a custom scraper.
2. **Data Ingestion (Kafka):** Scraped comments are streamed into a Kafka topic.
3. **Batch Processing (Spark):**
   - Spark reads comments from the Kafka topic in batches.
   - Preprocessing steps (cleaning, normalization) are applied.
   - Sentiment is predicted using a pre-trained deep learning model.
   - Processed comments, including sentiment, are stored in MongoDB.
4. **Real-time Processing (Spark Streaming):**
   - Spark Streaming consumes comments from the same Kafka topic in real time.
   - Preprocessing and sentiment prediction are performed similarly to the batch layer.
   - Real-time sentiment results are stored in MongoDB (see the sketch after this list).
5. **Storage (MongoDB):** MongoDB stores both batch and real-time processed comments.
6. **Persistent Storage (HDFS):** The raw comments ingested from Kafka are stored on HDFS for data durability and
   potential replay/reprocessing.
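
For illustration only, the real-time path (Kafka → Spark Structured Streaming → MongoDB) could be sketched roughly as below. The topic name and bootstrap server match the setup commands later in this README, but the MongoDB URI, database/collection names, connector package version, and the `predict_sentiment` placeholder are assumptions rather than the project's actual code in `processors/spark_processor.py`.

```python
# Hedged sketch of the streaming layer; identifiers marked as assumed are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
from pymongo import MongoClient

spark = (SparkSession.builder
         .appName("hespress-comments-stream")
         # Kafka connector version is an assumption; match it to your Spark install.
         .config("spark.jars.packages",
                 "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0")
         .getOrCreate())

def predict_sentiment(text):
    # Placeholder only -- the project uses its pre-trained LSTM model here.
    return "positive" if text else "neutral"

sentiment_udf = udf(predict_sentiment, StringType())

comments = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "hespress_comments")
            .load()
            .selectExpr("CAST(value AS STRING) AS comment")
            .withColumn("sentiment", sentiment_udf(col("comment"))))

def write_to_mongo(batch_df, batch_id):
    # foreachBatch runs on the driver, so pymongo can be used directly here
    # (collecting is fine for small micro-batches; db/collection names are assumed).
    docs = [row.asDict() for row in batch_df.collect()]
    if docs:
        MongoClient("mongodb://localhost:27017")["hespress"]["comments"].insert_many(docs)

query = comments.writeStream.foreachBatch(write_to_mongo).start()
query.awaitTermination()
```
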
## Project Structure

```
hespress-comments-analysis/
├── config/ # Configuration files
│ ├── kafka_config.py
│ └── mongodb_config.py
├── models/ # Data models
│ └── comment.py
├── processors/ # Data processing logic
│ ├── batch_processor.py
│ └── spark_processor.py
├── storage/ # Data storage handlers
│ ├── hdfs_handler.py
│ ├── kafka_handler.py
│ └── mongodb_handler.py
├── utils/ # Utility functions
│ ├── scrapper.py
│ └── sentiments_processor.py
├── model/ # Sentiment Analysis Model files
│ ├── sentiment_model.h5
│ ├── tokenizer.json
│ └── label_encoder.pkl
├── dashboard/ # Flask dashboard
├── main.py # Main application entry point
├── requirements.txt # Project dependencies
└── README.md # This file
```
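
For context, `models/comment.py` defines the comment record that flows through the pipeline. Its actual fields are not documented in this README; the dataclass below is only an assumed illustration.

```python
# Hypothetical shape of a comment record -- field names are assumptions, not the repo's model.
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class Comment:
    article_url: str
    author: str
    text: str
    date: str
    sentiment: Optional[str] = None  # filled in after sentiment prediction

    def to_dict(self) -> dict:
        """Convert to a plain dict, e.g. for MongoDB insertion or Kafka serialization."""
        return asdict(self)
```
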
## Getting Started

### Prerequisites
- **Python 3.7+:** Make sure you have a compatible Python version installed.
- **Java:** Required for Kafka and Spark.
- **Hadoop and HDFS:** Install and configure Hadoop and HDFS.
- **Apache Kafka:** Install and configure Kafka.
- **Apache Spark:** Install and configure Spark.
- **MongoDB:** Install and run MongoDB.
- **Flask:** Install Flask for the dashboard.

### Installation
1. Clone the repository:
```bash
git clone https://github.com/abdellatif-laghjaj/hespress-comments-analysis.git
cd hespress-comments-analysis
```
2. Create a virtual environment (recommended):
```bash
python3 -m venv .venv
source .venv/bin/activate
```
3. Install the required packages:
```bash
pip install -r requirements.txt
```
4. **Set up Model Files:** Make sure to include the pre-trained model files (`sentiment_model.h5`, `tokenizer.json`, and
   `label_encoder.pkl`) in the `model/` directory.
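
The project's own loading logic lives in `utils/sentiments_processor.py`; purely as an assumed sketch, loading these three artifacts with Keras and pickle might look like the following (the maximum sequence length and variable names are guesses):

```python
# Hedged sketch: load the pre-trained artifacts listed above. Paths match the
# project layout; max_len and the preprocessing details are assumptions.
import pickle

import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import tokenizer_from_json

model = load_model("model/sentiment_model.h5")

with open("model/tokenizer.json", encoding="utf-8") as f:
    tokenizer = tokenizer_from_json(f.read())

with open("model/label_encoder.pkl", "rb") as f:
    label_encoder = pickle.load(f)

def predict_sentiment(text, max_len=100):
    # Tokenize, pad, predict, then map the class index back to its label.
    seq = pad_sequences(tokenizer.texts_to_sequences([text]), maxlen=max_len)
    probs = model.predict(seq, verbose=0)
    return label_encoder.inverse_transform([int(np.argmax(probs, axis=1)[0])])[0]
```
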
5. **Configuration:**
   - Update `config/kafka_config.py` with your Kafka bootstrap servers and topic information.
   - Update `config/mongodb_config.py` with your MongoDB connection details.
   - Update HDFS details in your `main.py` and `hdfs_handler.py`.
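
This README does not show the contents of the config modules; an assumed, minimal shape could be:

```python
# config/kafka_config.py -- assumed shape; adapt names and values to the actual module.
KAFKA_BOOTSTRAP_SERVERS = "localhost:9092"
KAFKA_TOPIC = "hespress_comments"

# config/mongodb_config.py -- assumed shape.
MONGODB_URI = "mongodb://localhost:27017"
MONGODB_DATABASE = "hespress"
MONGODB_COLLECTION = "comments"
```
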
### Running the Application

1. **Start ZooKeeper and Kafka:** Use the appropriate commands to start your ZooKeeper and Kafka servers.
```bash
sudo systemctl start zookeeper
```

```bash
sudo systemctl start kafka
```
2. **Start MongoDB:** Make sure your MongoDB server is running.
```bash
sudo systemctl start mongod
```
3. **Start HDFS:** Ensure your Hadoop and HDFS services are running.
```bash
start-all.sh
```
4. **Create Kafka Topic:** Create a Kafka topic to store the comments.
```bash
kafka-topics.sh \
--create --topic hespress_comments \
--bootstrap-server localhost:9092 \
--partitions 1 --replication-factor 1
```
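
Optionally, before running the full pipeline, you can check that the topic accepts messages with a small producer script using the `kafka-python` client. This script and its payload fields are illustrative assumptions, not part of the repository.

```python
# Optional smoke test -- assumes the kafka-python package is installed.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, ensure_ascii=False).encode("utf-8"),
)
producer.send("hespress_comments", {"author": "test", "text": "تعليق تجريبي"})
producer.flush()
```
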
5. **Run the main application:**
```bash
python main.py
```

## Contributing
Contributions are welcome! Please open an issue or submit a pull request.