https://github.com/thee-unruly/data-engineering-end-to-end

Data Engineering End to End Project

Host: GitHub
URL: https://github.com/thee-unruly/data-engineering-end-to-end
Owner: Thee-Unruly
Created: 2024-08-12T12:47:33.000Z (10 months ago)
Default Branch: main
Last Pushed: 2024-09-13T13:34:59.000Z (9 months ago)
Last Synced: 2025-01-07T22:10:22.909Z (5 months ago)
Language: Python
Homepage:
Size: 10.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

**Automating Data Streaming with Apache Airflow, Kafka, Spark, and Cassandra!**

**Docker-Compose Breakdown:**

1. *Zookeeper*: Coordinates Kafka brokers.

2. *Kafka Broker*: Core message broker, set up with both internal (broker:29092) and external (localhost:9092) connections.

3. *Schema Registry*: Manages Kafka message schemas for data consistency and evolution.

4. *Kafka Control Center*: Web UI for monitoring Kafka.

5. *Airflow*: Handles task scheduling and workflow automation, using PostgreSQL as a metadata backend.

6. *Spark*: Processes streaming data from Kafka for real-time transformations.

7. *Cassandra*: Stores processed data from Spark as a scalable NoSQL database.
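
The two Kafka listeners mean a client must use a different address depending on where it runs: containers on the Compose network use `broker:29092`, while processes on the host use `localhost:9092`. A minimal sketch of that choice follows. The `RUNNING_IN_DOCKER` variable is a hypothetical flag you would set yourself in the Compose file; it is not something Docker or Kafka defines.

```python
import os

# Addresses taken from the Docker-Compose breakdown above.
INTERNAL_BOOTSTRAP = "broker:29092"    # for containers on the Compose network
EXTERNAL_BOOTSTRAP = "localhost:9092"  # for processes on the host machine


def kafka_bootstrap_servers(env=None):
    """Return the Kafka address that fits the current process.

    RUNNING_IN_DOCKER is an assumed, user-defined environment variable.
    """
    env = os.environ if env is None else env
    in_docker = env.get("RUNNING_IN_DOCKER", "").lower() in {"1", "true", "yes"}
    return INTERNAL_BOOTSTRAP if in_docker else EXTERNAL_BOOTSTRAP
```

An Airflow task or Spark job running inside the stack would resolve to the internal listener, while a script started from a local terminal would resolve to the external one.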

**Next Steps:**

1. *Kafka + Spark Integration*: Kafka delivers real-time data on a topic (e.g., users_created); Spark Structured Streaming consumes the topic, processes the records, and stores the results in Cassandra.

2. *Cassandra + Spark Integration*: Use Spark-Cassandra connectors to seamlessly write processed data into Cassandra.

3. *Airflow + Spark Integration*: Airflow DAGs trigger Spark jobs for data processing after Kafka ingestion tasks.

4. *Schema Registry*: Enforce Avro/Protobuf schemas for structured Kafka messages, ensuring smooth producer-consumer communication.
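
The Kafka to Spark to Cassandra path described in these steps could look roughly like the sketch below. It is not the repository's code: the keyspace, table, checkpoint path, and user field names are hypothetical, and it assumes JSON-encoded messages rather than Avro/Protobuf. It also assumes the Spark Kafka source and the Spark Cassandra Connector packages are on the Spark classpath and that the target table already exists.

```python
# Options for Spark's Kafka source; the topic name comes from the text above.
KAFKA_SOURCE_OPTIONS = {
    "kafka.bootstrap.servers": "broker:29092",  # internal listener
    "subscribe": "users_created",
    "startingOffsets": "earliest",
}

# Options for the Cassandra sink; all three values are hypothetical.
CASSANDRA_SINK_OPTIONS = {
    "keyspace": "spark_streams",
    "table": "created_users",
    "checkpointLocation": "/tmp/checkpoints/users_created",
}

# Assumed message fields; adjust to the real payload.
USER_FIELDS = ["id", "first_name", "last_name", "email"]


def start_users_stream(spark):
    """Start a streaming query from Kafka into Cassandra.

    `spark` is an existing SparkSession. pyspark is imported here so the
    option dictionaries above can be inspected without Spark installed.
    """
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    schema = StructType(
        [StructField(name, StringType(), True) for name in USER_FIELDS]
    )

    # Kafka rows carry the payload as bytes in the `value` column.
    raw = spark.readStream.format("kafka").options(**KAFKA_SOURCE_OPTIONS).load()

    # Decode the bytes, parse the JSON, and flatten it into columns.
    users = (
        raw.selectExpr("CAST(value AS STRING) AS json")
        .select(from_json(col("json"), schema).alias("user"))
        .select("user.*")
    )

    return (
        users.writeStream.format("org.apache.spark.sql.cassandra")
        .options(**CASSANDRA_SINK_OPTIONS)
        .start()
    )
```

With Avro or Protobuf enforced through the Schema Registry, the `from_json` step would be replaced by the matching deserializer; the source and sink options would stay the same.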

Together, these components are designed for fault tolerance, scalability, and real-time analytics, which suits live data processing use cases such as user profiling.

**Key Highlights:**

1. *Airflow DAG*: Automates daily data collection via an API and streams the formatted data to Kafka topics.

2. *Kafka Streaming*: Simulates real-time user data ingestion.

3. *Spark Structured Streaming*: Transforms and maps data to schemas for downstream processes.

4. *Cassandra*: Efficiently stores and retrieves processed data from Spark.
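
The collect, format, and publish step of the DAG can be sketched as plain Python, independent of Airflow and Kafka. This is an illustration under assumptions, not the repository's implementation: the record layout (`name.first`, `name.last`, `email`) is hypothetical, and `fetch` and `send` are injected callables standing in for the API call and the Kafka producer.

```python
import json
import uuid


def format_user(raw):
    """Flatten one raw API record into the message to publish.

    The input field names are hypothetical.
    """
    return {
        "id": str(uuid.uuid4()),
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "email": raw["email"],
    }


def stream_users(fetch, send, topic="users_created", count=10):
    """Fetch `count` records, format each, and publish it as JSON bytes.

    fetch: callable returning one raw record (e.g., an HTTP API call).
    send:  callable taking (topic, payload_bytes), e.g., a producer's send.
    Returns the number of messages handed to `send`.
    """
    sent = 0
    for _ in range(count):
        payload = json.dumps(format_user(fetch())).encode("utf-8")
        send(topic, payload)
        sent += 1
    return sent
```

In the pipeline, `stream_users` would be the callable of a daily-scheduled Airflow task, with `send` bound to a Kafka producer connected to `broker:29092`. Keeping the producer injected makes the formatting logic easy to test without a running broker.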

This end-to-end pipeline is ideal for automating workflows, handling real-time data ingestion, and enabling robust data analytics.

#DataEngineering #ApacheKafka #ApacheSpark #Cassandra #ApacheAirflow #RealTimeData #BigData #Streaming
