https://github.com/aymane-maghouti/big-data-project

This project aims to predict smartphone prices using a combination of batch and stream processing techniques in a Big Data environment. The architecture follows the Lambda Architecture pattern, providing both real-time and batch processing capabilities to users.
https://github.com/aymane-maghouti/big-data-project
apache-airflow apache-kafka apache-spark batch-processing big-data-projects hbase hdfs ingestion java lambda-architecture machine-learning postgresql-database powerbi pyspark python spring-boot streaming
Last synced: 4 months ago
JSON representation
Host: GitHub
URL: https://github.com/aymane-maghouti/big-data-project
Owner: aymane-maghouti
Created: 2024-04-02T02:33:48.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2024-04-15T09:08:08.000Z (about 1 year ago)
Last Synced: 2025-02-02T01:31:47.798Z (5 months ago)
Topics: apache-airflow, apache-kafka, apache-spark, batch-processing, big-data-projects, hbase, hdfs, ingestion, java, lambda-architecture, machine-learning, postgresql-database, powerbi, pyspark, python, spring-boot, streaming
Language: Python
Homepage: https://youtu.be/iClZyC_TZyA
Size: 960 KB
Stars: 10
Watchers: 1
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

        # Smartphone Price Prediction in Big Data Environment

## Table of Contents

1. [Project Overview](#1-project-overview)

2. [Technologies Used](#2-technologies-used)

3. [Architecture](#3-architecture)

4. [Repository Structure](#4-repository-structure)

5. [Software Requirements for Running the Project](#5-software-requirements-for-running-the-project)

6. [How to Run](#6-how-to-run)

7. [Dashboards](#7-dashboards)

8. [Acknowledgments](#8-acknowledgments)

9. [Conclusion](#9-conclusion)

10. [Contacts](#10-contacts)

## 1. Project Overview

This project aims to predict smartphone prices using a combination of batch and stream processing techniques in a Big Data environment. The architecture follows the Lambda Architecture pattern, providing both real-time and batch processing capabilities to users.

## 2. Technologies Used

* **Ingestion Layer:** Apache Kafka (message broker)

* **Stream Layer:** XGBoost (machine learning model), Apache HBase (real-time View)

* **Batch Layer:** Apache Spark (data processing framework), Apache Airflow (workflow orchestration), PostgreSQL (data warehouse (Batch View))

* **Visualization:** Spring Boot (web application framework), Power BI (interactive dashboards)

## 3. Architecture

- Here is the architecture :

 ![architecture](images/architecture.png)

The project architecture consists of five main layers: the ingestion layer, the batch layer, the stream layer, the serving layer and the visualization layer.

### Ingestion Layer

- **Apache Kafka**: Utilized for real-time data ingestion from an API providing smartphone data.

   - **Consumer**: Collects data from the API and feeds it into the stream and batch layer.

### Stream Layer

- **Producer**: A machine learning model developed using XGBoost to estimate smartphone prices. This model runs in real-time and stores predictions in a realtime view. (details about the model here )

### Batch Layer

- **HDFS**: Data from the API is stored in HDFS as part of the data lake solution.

 - **PySpark**: Performs data transformation on stored data using PySpark.

 - **Apache Airflow**: Orchestrates the batch processing workflow.

### Serving Layer

- **Realtime View**: Implemented using HBase to provide real-time access to predicted smartphone prices.

- **Batch View**: Transformed data is stored in PostgreSQL, as the data warehouse solution.

### Visualization Layer

- **Spring Boot Web Application**: Provides a user interface to view real-time smartphone prices.

- **Power BI Dashboard**: Provides batch users with a visualization of processed data.

## 4. Repository Structure

The repository is organized as follows:

```batch 

Big-Data-Project:.

|   README.md

|

+---images

|       architecture.png

|       dashboard_phone.png

|       run_web_app.png

|       spring_boot_web_app.png

|

\---Main

    |   commands.sh

    |   Dashboard.pbix

    |

    +---.idea

    |       workspace.xml

    |

    +---Lambda

    |   |   docker-compose.yaml

    |   |   producer.py

    |   |   transform.py

    |   |

    |   +---.idea

    |   |   |   .gitignore

    |   |   |   .name

    |   |   |   misc.xml

    |   |   |   modules.xml

    |   |   |   price prediction (big data envirnment).iml

    |   |   |   vcs.xml

    |   |   |   workspace.xml

    |   |   |

    |   |   \---inspectionProfiles

    |   |           profiles_settings.xml

    |   |

    |   +---Batch_layer

    |   |   |   batch_layer.py

    |   |   |   batch_pipeline.py

    |   |   |   HDFS_consumer.py

    |   |   |   put_data_hdfs.py

    |   |   |   save_data_postgresql.py

    |   |   |   spark_tranformation.py

    |   |   |   __init__.py

    |   |   |

    |   |   +---dags

    |   |   |       syc_with_Airflow.py

    |   |   |       __init__.py

    |   |   |

    |   |   \---__pycache__

    |   |           batch_layer.cpython-310.pyc

    |   |           HDFS_consumer.cpython-310.pyc

    |   |           put_data_hdfs.cpython-310.pyc

    |   |           save_data_postgresql.cpython-310.pyc

    |   |           spark_tranformation.cpython-310.pyc

    |   |           __init__.cpython-310.pyc

    |   |

    |   +---ML_operations

    |   |   |   xgb_model.pkl

    |   |   |

    |   |   \---__pycache__

    |   +---real_time_web_app(Flask)

    |   |   |   app.py

    |   |   |   get_Data_from_hbase.py

    |   |   |

    |   |   +---static

    |   |   |   +---css

    |   |   |   |       style.css

    |   |   |   |

    |   |   |   \---js

    |   |   |           script.js

    |   |   |

    |   |   +---templates

    |   |   |       index.html

    |   |   |

    |   |   \---__pycache__

    |   |           get_Data_from_hbase.cpython-310.pyc

    |   |

    |   +---Stream_data

    |   |   |   stream_data.csv

    |   |   |   stream_data.py

    |   |   |

    |   |   \---__pycache__

    |   +---Stream_layer

    |   |       insert_data_hbase.py

    |   |       ML_consumer.py

    |   |       stream_pipeline.py

    |   |       __init__.py

    |   |

    |   \---__pycache__

    |           producer.cpython-310.pyc

    |           transform.cpython-310.pyc

    |

    \---real_time_app(Spring boot)

        |   .classpath

        |   .gitignore

        |   .project

        |   HELP.md

        |   mvnw

        |   mvnw.cmd

        |   pom.xml

        |

        +---.mvn

        |   \---wrapper

        |           maven-wrapper.jar

        |           maven-wrapper.properties

        |

        +---.settings

        |       org.eclipse.core.resources.prefs

        |       org.eclipse.jdt.core.prefs

        |       org.eclipse.m2e.core.prefs

        |

        +---src

        |   +---main

        |   |   +---java

        |   |   |   \---com

        |   |   |       \---example

        |   |   |           \---demo

        |   |   |               |   RealTimeAppApplication.java

        |   |   |               |

        |   |   |               +---controller

        |   |   |               |       IndexController.java

        |   |   |               |

        |   |   |               \---service

        |   |   |                       HbaseService.java

        |   |   |

        |   |   \---resources

        |   |       |   application.properties

        |   |       |

        |   |       +---static

        |   |       |   +---css

        |   |       |   |       style.css

        |   |       |   |

        |   |       |   \---js

        |   |       |           script.js

        |   |       |

        |   |       \---templates

        |   |               index.html

        |   |

        |   \---test

        |       \---java

        |           \---com

        |               \---example

        |                   \---demo

        |                           RealTimeAppApplicationTests.java

        |

        \---target

            +---classes

            |   |   application.properties

            |   |

            |   +---com

            |   |   \---example

            |   |       \---demo

            |   |           |   RealTimeAppApplication.class

            |   |           |

            |   |           +---controller

            |   |           |       IndexController.class

            |   |           |

            |   |           \---service

            |   |                   HbaseService.class

            |   |

            |   +---META-INF

            |   |   |   MANIFEST.MF

            |   |   |

            |   |   \---maven

            |   |       \---com.example

            |   |           \---real_time_app

            |   |                   pom.properties

            |   |                   pom.xml

            |   |

            |   +---static

            |   |   +---css

            |   |   |       style.css

            |   |   |

            |   |   \---js

            |   |           script.js

            |   |

            |   \---templates

            |           index.html

            |

            \---test-classes

                \---com

                    \---example

                        \---demo

                                RealTimeAppApplicationTests.class

```

## 5. Software Requirements for Running the Project

This project requires the following software to be installed and configured on your system:

**Big Data Stack:**

* **Apache Kafka (version 2.6.0)**

* **Apache HBase (version 1.2.6)** 

* **Apache Hadoop (version 2.7.0)** 

* **Apache Spark (version 3.3.4)** 

* **PostgreSQL database**

**Programming Languages and Frameworks:**

* **Python (version 3.10.x or later)** 

* **Java 17 (or compatible version)** 

* **Spring Boot** 

**Machine Learning Library:**

* **XGBoost** 

**Additional Tools:**

* **Apache Airflow** 

* **Power BI Desktop** 

By installing and configuring these tools, you will have the necessary environment to run this project and leverage its real-time and batch processing capabilities for smartphone price prediction and analysis.

## 6. How to Run

To set up and run the project locally, follow these steps:

  - Clone the repository:

   ```bash

   git clone https://github.com/aymane-maghouti/Big-Data-Project

   ```

#### **1. Stream Layer**

   - Start Apache zookeeper

   ```batch 

zookeeper-server-start.bat C:/kafka_2.13_2.6.0/config/zookeeper.properties

```

   - Start Kafka server

   ```batch 

kafka-server-start.bat C:/kafka_2.13_2.6.0/config/server.properties

```

   - Create Kafka topic

   ```batch 

kafka-topics.bat --create --topic smartphoneTopic --bootstrap-server localhost:9092

```

  - Run the kafka producer

   ```batch 

kafka-console-producer.bat --topic smartphoneTopic --bootstrap-server localhost:9092

```

  - Run the kafka consumer

   ```batch 

kafka-console-consumer.bat --topic smartphoneTopic --from-beginning --bootstrap-server localhost:9092

```

  - Start HDFS and yarn (start-all or start-dfs and start-yarn)

   ```batch 

start-all  

```

   - Start Hbase

   ```batch 

start-hbase  

```

   - Run thrift server (for Hbase)

   ```batch 

hbase thrift start

```

after all this run `stream_pipeline.py` script.

and then open the spring boot appliation in your idea and run  it (you can access to the web app locally on  `localhost:8081/`)

---

![spring_boot](images/run_web_app.png)

note that there is another version of the web app developed using Flask micro-framework(watch the demo video for mor details)

#### **2. Batch Layer**

   - Start the Apache Airflow instance: 

   ```batch 

docker-compose up -d

```

   Access the Apache Airflow web UI (localhost:8080) and run the DAG

   - Start Apache Spark

   ```batch 

spark-shell

```

   - Start Apache zookeeper

   ```batch 

zookeeper-server-start.bat C:/kafka_2.13_2.6.0/config/zookeeper.properties

```

   - Start Kafka server

   ```batch 

kafka-server-start.bat C:/kafka_2.13_2.6.0/config/server.properties

```

  - Run the kafka producer

   ```batch 

kafka-console-producer.bat --topic smartphoneTopic --bootstrap-server localhost:9092

```

  - Run the kafka consumer

   ```batch 

kafka-console-consumer.bat --topic smartphoneTopic --from-beginning --bootstrap-server localhost:9092

```

  - Run HDFS and yarn (start-all or start-dfs and start-yarn)

   ```batch 

start-all  

```

   - Open power BI file `dashboard.pbix` attached with this project 

after all this run `syc_with_Airflow.py` script.

## 7. Dashboards

This project utilizes two dashboards to visualize smartphone price predictions and historical data:

#### **1. Real-Time Dashboard (Spring Boot Application):**

- This dashboard is built using a Spring Boot web application.

- It displays the **predicted price of smartphones in real-time**.

- Users can access this dashboard through a web interface. 

Here is the UI of th Spring Boot web application:

![spring_boot_web_ap](images/spring_boot_web_app.png)

#### **2. Batch Dashboard (Power BI):**

- This dashboard leverages Power BI for interactive data exploration.

- It provides insights into **historical smartphone price trends**.

- This dashboard is designed for batch users interested in historical analysis.

Here is the  Dashboard created in Power BI:

![Phone Dashboard](images/dashboard_phone.png)

## 8. Acknowledgments

- Special thanks to the open-source communities behind `Python`, `Kafka`, `HDFS` , `Spark`,`Hbase`,`Spring Boot`and `Airflow`

## 9. Conclusion

- This big data architecture effectively predicts smartphone prices in real-time and provides historical analysis capabilities. The Lambda architecture facilitates efficient stream processing for real-time predictions using XGBoost and HBase, while Apache Airflow orchestrates batch processing with Spark to populate the PostgreSQL data warehouse for historical insights. This solution empowers real-time and batch users with valuable price information, enabling data-driven decision-making.

you can watch the demo video here 

## 10. Contacts

For any inquiries or further information, please contact:

- **Name:** Aymane Maghouti

- **Email:** [email protected]

- **LinkedIn:** Aymane Maghouti
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/aymane-maghouti/big-data-project

Awesome Lists containing this project

README