https://github.com/bdbao/etl-randuser
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra.
- Host: GitHub
- URL: https://github.com/bdbao/etl-randuser
- Owner: bdbao
- Created: 2025-03-07T12:45:26.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-07T13:42:37.000Z (over 1 year ago)
- Last Synced: 2025-12-27T10:28:52.713Z (6 months ago)
- Topics: apache-airflow, apache-kafka, apache-spark, api, cassandra, data-streaming, docker-compose, postgresql, python
- Language: Python
- Homepage:
- Size: 300 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# End-to-End Realtime Data Streaming Project
## Table of Contents
- [End-to-End Realtime Data Streaming Project](#end-to-end-realtime-data-streaming-project)
- [Table of Contents](#table-of-contents)
- [Introduction](#introduction)
- [System Architecture](#system-architecture)
- [Technologies](#technologies)
- [Getting Started](#getting-started)
## Introduction
This project walks through building an end-to-end data engineering pipeline, covering each stage from data ingestion through processing to storage. The stack comprises Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Every service is containerized with Docker for ease of deployment and scalability.
## System Architecture

The project is designed with the following components:
- **Data Source**: We use the `randomuser.me` API to generate random user data for the pipeline.
- **Apache Airflow**: Responsible for orchestrating the pipeline and storing fetched data in a PostgreSQL database.
- **Apache Kafka and Zookeeper**: Used for streaming data from PostgreSQL to the processing engine.
- **Control Center and Schema Registry**: Provide monitoring and schema management for the Kafka streams.
- **Apache Spark**: Processes the streamed data using a master node and worker nodes.
- **Cassandra**: Stores the processed data.
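The first hop of this flow (API record to Kafka message) can be sketched as follows. This is a minimal illustration, not the code the DAG actually runs: the function names `fetch_user`, `format_user`, and `to_kafka_value`, as well as the flattened field names, are assumptions chosen for the example. Only the nested layout of the `randomuser.me` response is taken from the API itself.

```python
import json
import urllib.request

API_URL = "https://randomuser.me/api/"


def fetch_user(url: str = API_URL) -> dict:
    """Fetch one random user record from the API (performs a network call)."""
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    return payload["results"][0]


def format_user(raw: dict) -> dict:
    """Flatten the nested API record into a flat dict.

    The output field names are hypothetical; adapt them to the real table.
    """
    location = raw["location"]
    street = location["street"]
    address = (
        f"{street['number']} {street['name']}, {location['city']}, "
        f"{location['state']}, {location['country']}"
    )
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "gender": raw["gender"],
        "address": address,
        # The API returns the postcode as either an int or a string.
        "postcode": str(location["postcode"]),
        "email": raw["email"],
        "username": raw["login"]["username"],
        "phone": raw["phone"],
    }


def to_kafka_value(user: dict) -> bytes:
    """Serialize a flattened user as UTF-8 JSON, a common Kafka value format."""
    return json.dumps(user).encode("utf-8")


# Hand-written sample in the shape of one API result, so the sketch runs offline.
SAMPLE = {
    "gender": "female",
    "name": {"title": "Ms", "first": "Ada", "last": "Example"},
    "location": {
        "street": {"number": 12, "name": "Example Street"},
        "city": "Springfield",
        "state": "Example State",
        "country": "Exampleland",
        "postcode": 12345,
    },
    "email": "ada@example.com",
    "login": {"username": "ada123"},
    "phone": "555-0100",
}
```

Calling `to_kafka_value(format_user(SAMPLE))` yields the bytes a producer could publish to a topic such as `users_created`. Flattening before publishing keeps the downstream Spark schema simple.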
## Technologies
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
- DataGrip
## Getting Started
```bash
git clone https://github.com/bdbao/etl-randuser.git
cd etl-randuser
chmod +x scripts/entrypoint.sh
docker compose up -d
```
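The containers take a while to start, so it can help to poll the published ports before moving on. The sketch below is one possible approach using only the standard library. The helper names `port_open` and `wait_until` are hypothetical, and the example ports (8080 for the Airflow UI, 9042 for Cassandra) assume the compose file publishes them on `localhost`. An open port only shows that the service accepts connections, not that it has finished initializing.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int, connect_timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=connect_timeout):
            return True
    except OSError:
        return False


def wait_until(check: Callable[[], bool], timeout: float = 60.0,
               interval: float = 1.0) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

For example, `wait_until(lambda: port_open("localhost", 8080), timeout=120)` blocks until the Airflow webserver port answers or two minutes have passed.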
- Run the DAG in the **Airflow UI** (http://localhost:8080; username and password are both `admin`). Wait until the Scheduler is ready before triggering it.
- View Kafka Consumer:
```bash
docker exec -it broker kafka-console-consumer --bootstrap-server broker:29092 --topic users_created --from-beginning
```
- Run the Spark streaming job locally:
```bash
spark-submit --version # e.g. reports Scala 2.12.18 and Spark 3.5.4
# Then edit .config(...) of create_spark_connection() in scripts/spark_stream.py to match your versions
pip install cassandra-driver # (if not installed)
# Quote local[2] so shells such as zsh do not treat the brackets as a glob pattern
spark-submit --master "local[2]" --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.kafka:kafka-clients:3.5.1 scripts/spark_stream.py
open http://localhost:4040/StreamingQuery # `open` is macOS-specific; on Linux use xdg-open
```
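The `--packages` value has to agree with the Scala and Spark versions that `spark-submit --version` reports. A small helper can assemble the coordinate string so the Scala suffix stays consistent. This is a sketch: `scala_binary_version` and `build_packages` are hypothetical names, and the helper does not check whether the chosen versions are compatible with each other.

```python
def scala_binary_version(scala_version: str) -> str:
    """Reduce a full Scala version such as '2.12.18' to the '2.12' artifact suffix."""
    major, minor = scala_version.split(".")[:2]
    return f"{major}.{minor}"


def build_packages(scala_version: str, connector_version: str,
                   spark_version: str, kafka_clients_version: str) -> str:
    """Build the comma-separated Maven coordinates for spark-submit --packages."""
    scala = scala_binary_version(scala_version)
    coordinates = [
        f"com.datastax.spark:spark-cassandra-connector_{scala}:{connector_version}",
        f"org.apache.spark:spark-sql-kafka-0-10_{scala}:{spark_version}",
        f"org.apache.kafka:kafka-clients:{kafka_clients_version}",
    ]
    return ",".join(coordinates)


# Reproduces the coordinates used in the spark-submit command above.
print(build_packages("2.12.18", "3.5.1", "3.5.1", "3.5.1"))
```

The same string can be set through Spark's `spark.jars.packages` option if the script configures packages in code. Whether `create_spark_connection()` does so is an assumption to verify in `scripts/spark_stream.py`.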
- (Optional) Run in Spark Standalone Mode:
```bash
spark-submit --master spark://localhost:7077 --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1,org.apache.kafka:kafka-clients:3.5.1 scripts/spark_stream.py
open http://localhost:9090
```
- Open **DataGrip** (or another client) to view the Cassandra data (username and password are both `cassandra`), or query it with `cqlsh`:
```bash
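# Open a CQL shell inside the Cassandra container
docker exec -it cassandra cqlsh -u cassandra -p cassandra localhost 9042
# The remaining lines are CQL statements; run them at the cqlsh prompt
DESCRIBE spark_streams.created_users;
SELECT * FROM spark_streams.created_users;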
```
- Stop and remove all containers and their volumes:
```bash
docker compose down --volumes
```