https://github.com/kathrin-92/stream-processing-dlbdsede02
Spark Streaming Service for environmental sensor data. Includes a script to simulate streaming data as well as PostgreSQL Database setup.
apache-spark postgres sensor-data spark-streaming streaming
Last synced: 11 months ago
- Host: GitHub
- URL: https://github.com/kathrin-92/stream-processing-dlbdsede02
- Owner: Kathrin-92
- Created: 2025-02-05T06:55:03.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-16T17:23:31.000Z (12 months ago)
- Last Synced: 2025-04-26T07:57:13.942Z (11 months ago)
- Topics: apache-spark, postgres, sensor-data, spark-streaming, streaming
- Language: Python
- Homepage:
- Size: 52.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Use Case
Sensors that measure various environmental parameters are installed in a city to monitor air quality.
These include particulate matter, carbon monoxide, ozone, sulphur dioxide, nitrogen dioxide and other pollutants.
The measured values are obtained via an API from the German Federal Environment Agency and processed in a stream
processing pipeline with Apache Spark Structured Streaming.
The aim is to store the collected data efficiently and make it available in aggregated form.
# Project Structure
## 1. API service
The API service includes the retrieval, storage and simulation of sensor data.
A cron job runs a script that retrieves data from the Umweltbundesamt (German Federal Environment Agency) API once a day; the schedule is configurable.
This data comes from various cities in Germany and is used to generate a simulated data stream.
- **fetch_data.py**: Retrieves sensor data and metadata, processes it and saves it as .csv files. Metadata is written to a PostgreSQL table.
- **stream_simulation.py**: Simulates a data stream by reading the .csv file row by row and saving each row as a JSON file. Files older than 6 minutes are deleted regularly.
- **main.py**: Orchestrates the entire process.
> [!TIP]
> The Docker container **api_service** (container name: api_service_container) executes this service.
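
The simulation step described above can be sketched roughly as follows. This is a minimal illustration, not the code in `stream_simulation.py`: the function names `simulate_stream` and `delete_old_files`, the file naming scheme, and the one-second delay are all assumptions made for the example. Only the 6-minute retention (360 seconds) comes from the description.

```python
import csv
import json
import time
from pathlib import Path


def simulate_stream(csv_path, out_dir, delay_s=1.0, max_age_s=360.0):
    """Write each CSV row as its own JSON file (hypothetical helper).

    After every write, JSON files older than max_age_s seconds are pruned.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            target = out / f"measurement_{i:06d}.json"  # assumed naming scheme
            tmp = target.with_suffix(".tmp")
            tmp.write_text(json.dumps(row), encoding="utf-8")
            # Rename into place so a reader never sees a half-written file.
            tmp.replace(target)
            written.append(target)
            delete_old_files(out, max_age_s)
            time.sleep(delay_s)
    return written


def delete_old_files(directory, max_age_s, now=None):
    """Delete *.json files whose modification time is older than max_age_s."""
    now = time.time() if now is None else now
    removed = []
    for path in Path(directory).glob("*.json"):
        if now - path.stat().st_mtime > max_age_s:
            path.unlink()
            removed.append(path)
    return removed
```

Writing to a temporary name and renaming is a design choice of this sketch: a file-based streaming reader that watches the directory then only ever picks up complete JSON files.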
## 2. PostgreSQL database
The database is provided as a Docker container and contains three tables:
- airquality_metadata: Contains information on the pollutant components.
- airquality_raw: Stores the complete historical data set of the sensor measurements.
- airquality_aggregated: Contains aggregated values (min, max, average) for each pollutant component.
The database can be queried via the terminal with the following command:
```bash
docker exec -it postgres_db psql -U postgres -d airquality_sensor_data
```
> [!TIP]
> The Docker container **postgres** (container name: postgres_db, as used in the command above) runs the database.
## 3. Spark Streaming Job
The Spark streaming job processes the continuously incoming sensor data and stores it in the PostgreSQL database.
There are two main processes: the storage of the raw data and the aggregation of the measured values.
First, the streaming job reads in the incoming JSON files line by line.
Each new measurement is saved in the raw table as a complete historical data set.
The data is saved in append mode so that new data is added continuously without overwriting existing entries.
In parallel, the incoming sensor data is aggregated in 4-minute time windows.
The maximum, minimum and average measured values within this period are calculated for each pollutant component.
These aggregated values are then written to the aggregated table.
> [!TIP]
> The Docker container **spark-streaming** executes this.
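
To make the aggregation step concrete, here is a plain-Python sketch of what a 4-minute tumbling-window aggregation computes. It is not the Spark job itself and does not use the Spark API; it only illustrates the grouping logic (window start plus pollutant component) and the min/max/average calculation. The tuple layout `(timestamp, component, value)`, the function names, and the alignment of windows to the Unix epoch are assumptions of this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=4)  # window length taken from the description
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)  # assumed window alignment


def window_start(ts, size=WINDOW):
    """Return the start of the tumbling window that contains ts (tz-aware)."""
    return ts - (ts - EPOCH) % size


def aggregate(measurements, size=WINDOW):
    """Group (timestamp, component, value) tuples by window and component.

    Returns a dict keyed by (window_start, component) holding min, max and avg.
    """
    groups = defaultdict(list)
    for ts, component, value in measurements:
        groups[(window_start(ts, size), component)].append(value)
    return {
        key: {"min": min(v), "max": max(v), "avg": sum(v) / len(v)}
        for key, v in groups.items()
    }
```

For example, two NO2 readings at 12:00 and 12:03 UTC fall into the same window starting at 12:00, while a reading at 12:05 belongs to the next window starting at 12:04. A streaming engine additionally has to decide when a window is complete and how to treat late data, which this batch-style sketch leaves out.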
# Using the code
> [!IMPORTANT]
> Follow these steps to run the project:
- Install Docker: The entire environment runs in Docker containers.
- Start Docker Compose: run `docker-compose up --build` or use Docker Desktop.
- Adjust the cron job: If necessary, the cron job in the API service can be adjusted to execute the API query at a different time.
- Use the healthcheck or start streaming manually (see the comments in `docker-compose.yml`).