Data Engineering End to End Project
https://github.com/thee-unruly/data-engineering-end-to-end
- Host: GitHub
- URL: https://github.com/thee-unruly/data-engineering-end-to-end
- Owner: Thee-Unruly
- Created: 2024-08-12T12:47:33.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2024-09-13T13:34:59.000Z (9 months ago)
- Last Synced: 2025-01-07T22:10:22.909Z (5 months ago)
- Language: Python
- Homepage:
- Size: 10.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
**Automating Data Streaming with Apache Airflow, Kafka, Spark, and Cassandra!**
**Docker-Compose Breakdown:**
1. *Zookeeper*: Coordinates Kafka brokers.
2. *Kafka Broker*: Core message broker, configured with an internal listener (broker:29092) for containers on the Compose network and an external listener (localhost:9092) for clients on the host (see the producer sketch after this list).
3. *Schema Registry*: Manages Kafka message schemas for data consistency and evolution.
4. *Kafka Control Center*: Web UI for monitoring Kafka.
5. *Airflow*: Handles task scheduling and workflow automation, using PostgreSQL as a metadata backend.
6. *Spark*: Processes streaming data from Kafka for real-time transformations.
7. *Cassandra*: Scalable NoSQL database that stores the data processed by Spark.
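To make the broker's two listeners concrete, here is a minimal producer sketch, assuming the kafka-python package and a client running on the host (a container inside the Compose network would use broker:29092 instead):

```python
from json import dumps

from kafka import KafkaProducer  # pip install kafka-python

# Connect through the broker's external listener, exposed on the host.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda record: dumps(record).encode("utf-8"),
)

# Publish one sample record to the users_created topic used by the pipeline.
producer.send("users_created", {"first_name": "Ada", "last_name": "Lovelace"})
producer.flush()
```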
**Next Steps:**
1. *Kafka + Spark Integration*: Kafka streams real-time data, which Spark Structured Streaming consumes (e.g., from the users_created topic), transforms, and stores in Cassandra.
2. *Cassandra + Spark Integration*: Use the Spark-Cassandra connector to write processed data seamlessly into Cassandra (steps 1 and 2 are sketched after this list).
3. *Airflow + Spark Integration*: Airflow DAGs trigger Spark jobs for data processing after the Kafka ingestion tasks (see the DAG sketch under Key Highlights).
4. *Schema Registry*: Enforce Avro/Protobuf schemas for structured Kafka messages, ensuring smooth producer-consumer communication (an Avro producer is sketched below).
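A minimal sketch of steps 1 and 2, assuming pyspark with the spark-sql-kafka and spark-cassandra-connector packages on the classpath; the spark_streams keyspace, created_users table, and two-field schema are illustrative assumptions, not confirmed names from the repo:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Spark session pointed at the Cassandra service from the Compose stack.
spark = (
    SparkSession.builder.appName("KafkaToCassandra")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Expected shape of the JSON messages on the users_created topic (assumed).
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
])

# Consume the topic and parse each message value into typed columns.
users = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users_created")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Stream the parsed rows into Cassandra via the Spark-Cassandra connector.
query = (
    users.writeStream.format("org.apache.spark.sql.cassandra")
    .option("keyspace", "spark_streams")
    .option("table", "created_users")
    .option("checkpointLocation", "/tmp/users_checkpoint")
    .start()
)
query.awaitTermination()
```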
This full setup offers fault tolerance, scalability, and real-time analytics, making it well suited to live data processing use cases such as user profiling.
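For step 4 above, a hedged sketch of an Avro producer, assuming the confluent-kafka[avro] package and the Schema Registry from the Compose stack listening on its default port 8081; the record schema is an illustrative assumption:

```python
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import StringSerializer

# Avro schema enforced for every message on the users_created topic (assumed).
USER_SCHEMA = """
{
  "type": "record",
  "name": "UserCreated",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"}
  ]
}
"""

# Schema Registry from the Compose stack (default port 8081 assumed).
registry = SchemaRegistryClient({"url": "http://localhost:8081"})

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    "key.serializer": StringSerializer("utf_8"),
    "value.serializer": AvroSerializer(registry, USER_SCHEMA),
})

# The serializer registers the schema and rejects records that don't match it.
producer.produce("users_created", key="user-1",
                 value={"first_name": "Ada", "last_name": "Lovelace"})
producer.flush()
```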
**Key Highlights**:
1. *Airflow DAG*: Automates daily data collection via API and streams the formatted records to Kafka topics (sketched after this list).
2. *Kafka Streaming*: Simulates real-time user data ingestion.
3. *Spark Structured Streaming*: Transforms the stream and maps records onto the target schema for downstream storage.
4. *Cassandra*: Efficiently stores and retrieves processed data from Spark.
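A sketch of such a DAG, covering both the daily API-to-Kafka ingestion and the Airflow-triggered Spark job from the Next Steps; it assumes the apache-airflow-providers-apache-spark package and Airflow 2.4+, and the API URL and application path are hypothetical placeholders:

```python
from datetime import datetime
from json import dumps

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator
from kafka import KafkaProducer


def fetch_and_stream():
    """Pull one record from a (hypothetical) user API and publish it to Kafka."""
    user = requests.get("https://example.com/api/users", timeout=10).json()
    producer = KafkaProducer(
        bootstrap_servers=["localhost:9092"],
        value_serializer=lambda record: dumps(record).encode("utf-8"),
    )
    producer.send("users_created", user)
    producer.flush()


with DAG(
    dag_id="user_data_pipeline",
    start_date=datetime(2024, 8, 12),
    schedule="@daily",  # daily data collection, as described above
    catchup=False,
) as dag:
    ingest = PythonOperator(
        task_id="stream_to_kafka",
        python_callable=fetch_and_stream,
    )

    process = SparkSubmitOperator(
        task_id="run_spark_job",
        application="/opt/airflow/jobs/kafka_to_cassandra.py",  # hypothetical path
        conn_id="spark_default",
    )

    ingest >> process  # Spark processing runs only after ingestion succeeds
```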
This end-to-end pipeline is ideal for automating workflows, handling real-time data ingestion, and enabling robust data analytics.
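To close the loop on storage and retrieval, a small sketch with the DataStax cassandra-driver package that creates the keyspace and table assumed in the Spark sketch above and reads back a few rows:

```python
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Connect to the Cassandra service from the Compose stack.
cluster = Cluster(["localhost"], port=9042)
session = cluster.connect()

# Keyspace and table names match the (assumed) Spark sink configuration.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS spark_streams
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS spark_streams.created_users (
        first_name TEXT,
        last_name TEXT,
        PRIMARY KEY (first_name, last_name)
    );
""")

# Read back a few processed rows to verify the pipeline end to end.
for row in session.execute("SELECT * FROM spark_streams.created_users LIMIT 5"):
    print(row.first_name, row.last_name)
```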
#DataEngineering #ApacheKafka #ApacheSpark #Cassandra #ApacheAirflow #RealTimeData #BigData #Streaming