https://github.com/rizkipragustono/etl_data_pipelines

Practice Project: ETL and Data Pipelines with Shell, Airflow and Kafka

apache-airflow apache-kafka bash data-engineering python

Last synced: 2 months ago

Host: GitHub
URL: https://github.com/rizkipragustono/etl_data_pipelines
Owner: rizkipragustono
Created: 2025-01-12T09:49:52.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-01-12T10:13:56.000Z (over 1 year ago)
Last Synced: 2025-03-24T12:16:49.532Z (over 1 year ago)
Topics: apache-airflow, apache-kafka, bash, data-engineering, python
Language: Python
Homepage:
Size: 5.86 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# ETL and Data Pipelines with Shell, Airflow and Kafka
You are a Data Engineer at a data analytics consulting company, assigned to a project that aims to de-congest national highways by analyzing road traffic data from different toll plazas. Each highway is operated by a different toll operator with varying IT setups and file formats. Your job is to create three data pipelines to collect, process, and store this data.

## Pipeline 1: Batch Processing with Apache Airflow and BashOperator
The first pipeline uses Apache Airflow with BashOperator to automate batch data collection from different toll operators. It fetches data files in various formats (e.g., CSV, JSON, XML), processes them using bash scripts, and consolidates them into a single, unified file.
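As a rough illustration of this batch pipeline, the sketch below chains three `BashOperator` tasks: two extraction steps and one consolidation step. The staging directory, file names, column positions, and task ids are hypothetical (the README does not specify them), the JSON step assumes `jq` is installed, and the `schedule` argument assumes Airflow 2.4 or later. The input files are assumed to have been downloaded already.

```python
from datetime import datetime, timedelta

# Hypothetical staging directory; not taken from the repository.
STAGING = "/tmp/tolldata"

# Task id -> bash command, in execution order. File names and columns are assumptions.
BASH_STEPS = {
    # Keep the first four comma-separated columns of one operator's CSV export.
    "extract_csv": f"cut -d',' -f1-4 {STAGING}/toll_a.csv > {STAGING}/toll_a_clean.csv",
    # Flatten a JSON array of records into CSV rows (requires jq).
    "extract_json": (
        "jq -r '.[] | [.vehicle_id, .vehicle_type, .toll_plaza_id, .timestamp] | @csv' "
        f"{STAGING}/toll_b.json > {STAGING}/toll_b_clean.csv"
    ),
    # Concatenate the cleaned files into one unified file.
    "consolidate": (
        f"cat {STAGING}/toll_a_clean.csv {STAGING}/toll_b_clean.csv "
        f"> {STAGING}/consolidated.csv"
    ),
}


def build_dag():
    """Build the DAG; Airflow is imported here so the module loads without it."""
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    dag = DAG(
        dag_id="toll_batch_etl",  # hypothetical DAG id
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
        default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
    )
    previous = None
    for task_id, command in BASH_STEPS.items():
        task = BashOperator(task_id=task_id, bash_command=command, dag=dag)
        if previous is not None:
            previous >> task  # run the steps one after another
        previous = task
    return dag


try:
    dag = build_dag()  # Airflow discovers DAG objects defined at module level
except ImportError:
    dag = None  # Airflow is not installed; only BASH_STEPS is usable
```

Keeping the commands in a plain dictionary makes it easy to add another format, such as an XML extraction step, without touching the wiring code.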

## Pipeline 2: Data Processing with Apache Airflow and PythonOperator
The second pipeline uses Apache Airflow with PythonOperator to process the consolidated data. Python scripts handle data transformations, such as aggregating traffic data by toll plaza, and load it into a database for further analysis.
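The transformation and load step could look something like the sketch below. It assumes the consolidated file is a CSV with a header row containing a `toll_plaza_id` column, and it uses SQLite as a stand-in database. The table name, file paths, function names, and DAG id are all hypothetical, and the `schedule` argument assumes Airflow 2.4 or later.

```python
import csv
import sqlite3
from collections import Counter
from datetime import datetime


def aggregate_by_plaza(csv_path):
    """Count vehicle passages per toll plaza in the consolidated CSV."""
    counts = Counter()
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            counts[row["toll_plaza_id"]] += 1
    return counts


def load_counts(counts, db_path):
    """Write the per-plaza counts to a SQLite table (a stand-in for the real database)."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE IF NOT EXISTS plaza_traffic "
                "(toll_plaza_id TEXT PRIMARY KEY, vehicle_count INTEGER)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO plaza_traffic VALUES (?, ?)",
                counts.items(),
            )
    finally:
        conn.close()


def transform_and_load(csv_path, db_path):
    """Callable handed to PythonOperator: aggregate, then load."""
    load_counts(aggregate_by_plaza(csv_path), db_path)


def build_dag():
    """Build the DAG; Airflow is imported here so the module loads without it."""
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="toll_aggregate",  # hypothetical DAG id
        start_date=datetime(2025, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="transform_and_load",
            python_callable=transform_and_load,
            op_kwargs={
                "csv_path": "/tmp/tolldata/consolidated.csv",  # hypothetical path
                "db_path": "/tmp/tolldata/traffic.db",  # hypothetical path
            },
        )
    return dag


try:
    dag = build_dag()
except ImportError:
    dag = None  # Airflow is not installed; the plain functions still work
```

Because `transform_and_load` is an ordinary function, it can be run and tested outside Airflow before being scheduled.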

## Pipeline 3: Real-Time Streaming with Kafka
The third pipeline collects real-time data as vehicles pass through toll plazas. Each vehicle record, including `vehicle_id`, `vehicle_type`, `toll_plaza_id`, and `timestamp`, is streamed to Kafka. The data is then processed in real time and loaded into a database for live traffic analysis.
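One possible shape for this streaming flow is sketched below, assuming the `kafka-python` client and JSON-encoded messages. The topic name, broker address, table name, and all function names are hypothetical, and SQLite again stands in for the real database. The Kafka imports sit inside the functions so the message helpers can be exercised without a broker.

```python
import json
import sqlite3
from datetime import datetime, timezone

TOPIC = "toll"  # hypothetical topic name
BROKER = "localhost:9092"  # hypothetical broker address


def make_event(vehicle_id, vehicle_type, toll_plaza_id):
    """Build one toll-plaza passage record with the fields named in the README."""
    return {
        "vehicle_id": vehicle_id,
        "vehicle_type": vehicle_type,
        "toll_plaza_id": toll_plaza_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def encode(event):
    """Serialize an event to UTF-8 JSON bytes for Kafka."""
    return json.dumps(event).encode("utf-8")


def decode(payload):
    """Deserialize Kafka message bytes back into a dict."""
    return json.loads(payload.decode("utf-8"))


def init_db(conn):
    """Create the target table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS live_traffic "
        "(vehicle_id TEXT, vehicle_type TEXT, toll_plaza_id TEXT, timestamp TEXT)"
    )
    conn.commit()


def store_event(conn, event):
    """Insert one decoded event into the table."""
    conn.execute(
        "INSERT INTO live_traffic VALUES "
        "(:vehicle_id, :vehicle_type, :toll_plaza_id, :timestamp)",
        event,
    )
    conn.commit()


def produce(events):
    """Send events to Kafka (requires kafka-python and a running broker)."""
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=BROKER, value_serializer=encode)
    for event in events:
        producer.send(TOPIC, event)
    producer.flush()  # block until buffered messages are delivered


def consume(db_path):
    """Read events from Kafka and store each one as it arrives."""
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        TOPIC,
        bootstrap_servers=BROKER,
        value_deserializer=decode,
        auto_offset_reset="earliest",
    )
    conn = sqlite3.connect(db_path)
    try:
        init_db(conn)
        for message in consumer:  # blocks and yields messages indefinitely
            store_event(conn, message.value)
    finally:
        conn.close()
```

In practice `produce` and `consume` would run as separate processes, for example `produce([make_event("V100", "car", "P7")])` in a simulator script and `consume("/tmp/tolldata/live.db")` in a long-running consumer.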

Together, these three pipelines handle both historical and real-time traffic data, supplying the analysis needed to reduce highway congestion.
