https://github.com/fa3001/streaming-data-processing
Streaming Data Processing for room properties.
- Host: GitHub
- URL: https://github.com/fa3001/streaming-data-processing
- Owner: FA3001
- Created: 2024-07-19T07:42:47.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-07-27T15:45:55.000Z (10 months ago)
- Last Synced: 2024-12-23T01:15:08.659Z (5 months ago)
- Topics: airflow, docker, elasticsearch, hdfs, kafka, kibana, makfile, minio, spark-streaming, yarn, zookeeper
- Language: Python
- Homepage:
- Size: 1.31 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Streaming Data Processing
## Used Technologies and Services
- Apache Airflow
- Apache Zookeeper
- Apache Kafka
- Apache Hadoop HDFS
- Apache Spark (PySpark)
- Apache Hadoop YARN
- Elasticsearch
- Kibana
- MinIO
- Docker

## Overview
- Take a compressed data source from a URL
- Process the raw data with **PySpark**, store it in **HDFS**, and monitor cluster resources with **Apache Hadoop YARN**.
- Use **data-generator** to simulate streaming data and send it to **Apache Kafka** (a minimal producer sketch follows this list).
- Read the streaming data from Kafka topic using **PySpark (Spark Streaming)**.
- Write the streaming data to **Elasticsearch**, and visualize it using **Kibana**.
- Write the streaming data to **MinIO (AWS Object Storage)**.
- Use **Apache Airflow** to orchestrate the whole data pipeline.
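
The repository ships its own **data-generator**; purely as a hedged illustration of what "simulate streaming data" means here, a minimal kafka-python producer that replays the prepared CSV row by row could look like the sketch below (the topic name, pacing, and `sensors.csv` path are assumptions, not the data-generator's actual behaviour):

```python
import csv
import json
import time

from kafka import KafkaProducer  # kafka-python package

TOPIC = "office_input"  # topic name is an assumption

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Replay the prepared sensors.csv row by row to imitate a live sensor feed
with open("sensors.csv", newline="") as f:
    for row in csv.DictReader(f):       # each row becomes a {column: string} dict
        producer.send(TOPIC, value=row)
        time.sleep(0.5)                 # arbitrary pacing

producer.flush()
```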
## Steps of the Project
- All services are run via Docker and a Makefile.
- All steps of the data pipeline can be seen in the Airflow DAG; they are all explained here as well.

To run all services, use:
```bash
make start-all
```

### Download the Data:
We should first download the data via the command:
```bash
wget -O //sensors.zip https://github.com/dogukannulu/datasets/raw/master/sensors_instrumented_in_an_office_building_dataset.zip
```
This zip file contains a folder named `KETI`. Each folder inside `KETI` represents a room number, and each room contains five `csv` files, each representing a property of that room. These properties are:

- CO2
- Humidity
- Light
- Temperature
- PIR (Passive Infrared Sensor Data)

Each csv also includes a timestamp column.
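
For illustration, here is a minimal pandas sketch of how one room's five files could be lined up into a single frame; the room folder name, the per-property file names, and the two-column (timestamp, value) layout are assumptions about the dataset rather than code from the repository:

```python
import pandas as pd

ROOM_DIR = "KETI/413"  # hypothetical room folder
PROPERTIES = ["co2", "humidity", "light", "temperature", "pir"]  # file names assumed

frames = []
for prop in PROPERTIES:
    # each file is assumed to hold two unnamed columns: a timestamp and the measured value
    df = pd.read_csv(f"{ROOM_DIR}/{prop}.csv", names=["ts", prop]).set_index("ts")
    frames.append(df)

# one row per timestamp, one column per property
room = pd.concat(frames, axis=1).reset_index()
print(room.head())
```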
### Unzip the Downloaded Data and Remove README.txt:
We should then unzip the data and remove the bundled `README.txt` via the following commands:

```bash
unzip //sensors_instrumented_in_an_office_building_dataset.zip -d //
rm //KETI/README.txt
```

### Put data to HDFS:
The `KETI` folder is now downloaded to our local machine.
Since PySpark gets the data from HDFS, we should put the local folder to HDFS
as well, using the following commands:

```bash
docker exec -it bash
hdfs dfs -mkdir -p /user/hadoop/keti/
hdfs dfs -put /path/in/container/KETI /user/hadoop/keti/
```
We can browse the HDFS location we put the data in via `localhost:9000`.

### Running the Read-Write PySpark/Pandas Script:
Both `read_and_write_pandas.py` and `read_and_write_spark.py` can be used to modify the initial data; they both do the same job. All the methods and operations are described with comments and docstrings in both scripts.
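
As a rough, hedged sketch of what this step does: read the per-room CSVs from HDFS, tag each row with its room, join the five properties, and write one combined CSV. The namenode host, column names, and output path are assumptions; the authoritative logic lives in `read_and_write_spark.py`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("read_and_write_sketch").getOrCreate()

# paths and column names below are illustrative assumptions
base = "hdfs://namenode:9000/user/hadoop/keti/KETI"
properties = ["co2", "humidity", "light", "temperature", "pir"]

frames = []
for prop in properties:
    df = (
        spark.read.csv(f"{base}/*/{prop}.csv")   # one property file from every room folder
        .toDF("ts", prop)                        # two columns assumed: timestamp + value
        .withColumn("room", F.element_at(F.split(F.input_file_name(), "/"), -2))
    )
    frames.append(df)

# join the five property frames on room + timestamp into one wide table
sensors = frames[0]
for df in frames[1:]:
    sensors = sensors.join(df, on=["room", "ts"], how="inner")

(sensors.coalesce(1)
 .write.mode("overwrite")
 .option("header", True)
 .csv("hdfs://namenode:9000/user/hadoop/keti/sensors"))
```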
We can check `localhost:8088` (the **YARN** UI) to see the resource usage of the running jobs while the Spark script is running.
Written data:

**_NOTE:_** With this step, we have our data ready. You can see it as `sensors.csv` in this repo.
### Creating the Kafka Topic:
The script `kafka_admin_client.py` under the folder `kafka_admin_client` can be used to create a Kafka topic; it prints an `already_exists` message if a topic with that name already exists (a minimal sketch of such a client is shown at the end of this section).

We can check whether the topic has been created as follows:
```bash
kafka-topics.sh --bootstrap-server localhost:9092 --list
```

Streaming data example:

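For reference, a minimal sketch of what such a topic-creation client might look like with kafka-python; the topic name and partition/replication settings are assumptions, and the real implementation is `kafka_admin_client.py`:

```python
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# topic name and sizing are assumptions for illustration
topic = NewTopic(name="office_input", num_partitions=1, replication_factor=1)

try:
    admin.create_topics([topic])
    print(f"created topic {topic.name}")
except TopicAlreadyExistsError:
    print("already_exists")
finally:
    admin.close()
```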
### Writing data to Elasticsearch using Spark Streaming:
We can access the Elasticsearch data via the Kibana UI at `localhost:5601`.
All the methods and operations are described with comments and docstrings in `spark_to_elasticsearch.py`.

Sample Elasticsearch data:

We can run this script via `spark_to_elasticsearch.sh`, which also runs the Spark virtualenv.

Logs of the script:

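As a hedged sketch of this step: a Structured Streaming job that reads the Kafka topic and writes each record to an Elasticsearch index. The message schema, topic and index names are assumptions, and the `es` sink requires the elasticsearch-hadoop connector on the Spark classpath; the authoritative code is `spark_to_elasticsearch.py`:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("spark_to_elasticsearch_sketch").getOrCreate()

# assumed message layout; the real schema is defined in spark_to_elasticsearch.py
schema = StructType([
    StructField("ts", StringType()),
    StructField("room", StringType()),
    StructField("co2", DoubleType()),
    StructField("humidity", DoubleType()),
    StructField("light", DoubleType()),
    StructField("temperature", DoubleType()),
    StructField("pir", DoubleType()),
])

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "office_input")                      # topic name assumed
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

query = (
    stream.writeStream
    .format("es")                                             # provided by elasticsearch-hadoop
    .option("checkpointLocation", "/tmp/es_checkpoint")
    .option("es.nodes", "localhost")
    .option("es.port", "9200")
    .start("office_input")                                    # target index
)
query.awaitTermination()
```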
### Writing data to MinIO using Spark Streaming:
We can access the MinIO UI via `localhost:9001`.
All the methods and operations are described with comments and docstrings in
`spark_to_minio.py`.

We can run this script via `spark_to_minio.sh`, which also runs the Spark virtualenv.

Sample MinIO data:
**_NOTE:_** We can also check the running Spark jobs via `localhost:4040`
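
Similarly, a hedged sketch of writing the same Kafka stream to MinIO through the `s3a` filesystem; the endpoint, credentials, and bucket name are assumptions, and the `hadoop-aws` package must be on the Spark classpath (the authoritative code is `spark_to_minio.py`):

```python
from pyspark.sql import SparkSession

# endpoint, credentials and bucket below are assumptions for illustration
spark = (
    SparkSession.builder.appName("spark_to_minio_sketch")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "office_input")                   # topic name assumed
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)

query = (
    stream.writeStream
    .format("parquet")                                     # any Spark file sink works here
    .option("path", "s3a://sensors-bucket/office_input/")  # bucket name assumed
    .option("checkpointLocation", "s3a://sensors-bucket/checkpoints/office_input/")
    .start()
)
query.awaitTermination()
```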
### Airflow DAG Trigger:
We can trigger the Airflow DAG at `localhost:1502`. Triggering the DAG runs the whole data pipeline described above with one click.

Airflow DAG:


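For orientation, a minimal sketch of how such a DAG could chain the steps with `BashOperator` tasks; the DAG id, task ids, script paths, and task ordering are assumptions about the repository layout, not the project's actual DAG:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="streaming_data_pipeline",     # hypothetical DAG id
    start_date=datetime(2024, 7, 1),
    schedule_interval=None,               # triggered manually from the UI
    catchup=False,
) as dag:
    prepare_data = BashOperator(
        task_id="read_and_write_spark",
        bash_command="spark-submit /opt/airflow/scripts/read_and_write_spark.py",
    )
    create_topic = BashOperator(
        task_id="create_kafka_topic",
        bash_command="python /opt/airflow/scripts/kafka_admin_client.py",
    )
    to_elasticsearch = BashOperator(
        task_id="spark_to_elasticsearch",
        # trailing space keeps Airflow from treating the .sh path as a Jinja template
        bash_command="bash /opt/airflow/scripts/spark_to_elasticsearch.sh ",
    )
    to_minio = BashOperator(
        task_id="spark_to_minio",
        bash_command="bash /opt/airflow/scripts/spark_to_minio.sh ",
    )

    prepare_data >> create_topic >> [to_elasticsearch, to_minio]
```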
Running streaming applications on Airflow may create some issues. In that case, we can run the bash scripts instead.

### Create Dashboard on Elasticsearch/Kibana:
We can check the amount of streaming data (and how it changes over time) in Elasticsearch by running the following command:

```
GET /_cat/indices?v
```

We can create a new dashboard using the data in the `office_input` index (a small Python query sketch is given after the chart list below). Here are some sample graphs:




The dashboard contains:
- Percentage of Movement Pie Chart
- Average CO2 per Room Line Chart
- Average PIR per Room Absolute Value Graph
- Average Light per Movement Status Gauge
- Average PIR per Room Bar Chart
- Average Temperature per Movement Bar Chart
- Average Humidity per Hour Area Chart
- Median of CO2 per Movement Status Bar Chart
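
For a programmatic counterpart to the checks above, a small sketch using the official `elasticsearch` Python client (8.x-style keyword arguments); the `room` and `co2` field names are assumptions about the `office_input` mapping:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # default Elasticsearch port assumed

# Rough equivalent of `GET /_cat/indices?v` for one index: how many documents have arrived so far
print(es.count(index="office_input")["count"])

# Counterpart of the "Average CO2 per Room" chart as a terms + avg aggregation
# (if `room` is mapped as text, query `room.keyword` instead)
resp = es.search(
    index="office_input",
    size=0,
    aggs={
        "per_room": {
            "terms": {"field": "room"},
            "aggs": {"avg_co2": {"avg": {"field": "co2"}}},
        }
    },
)
for bucket in resp["aggregations"]["per_room"]["buckets"]:
    print(bucket["key"], bucket["avg_co2"]["value"])
```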