An open API service indexing awesome lists of open source software.

https://github.com/thee-unruly/data-engineering-end-to-end

Data Engineering End to End Project
https://github.com/thee-unruly/data-engineering-end-to-end

Last synced: 4 months ago
JSON representation

Data Engineering End to End Project

Awesome Lists containing this project

README

        

**Automating Data Streaming with Apache Airflow, Kafka, Spark, and Cassandra!**

**Docker-Compose Breakdown:**

1. *Zookeeper*: Coordinates Kafka brokers.

2. *Kafka Broker*: Core message broker, set up with both internal (broker:29092) and external (localhost:9092) connections.

3. *Schema Registry*: Manages Kafka message schemas for data consistency and evolution.

4. *Kafka Control Center*: Web UI for monitoring Kafka.

5. *Airflow*: Handles task scheduling and workflow automation, using PostgreSQL as a metadata backend.

6. *Spark*: Processes streaming data from Kafka for real-time transformations.

7. *Cassandra*: Stores processed data from Spark as a scalable NoSQL database.

Next Steps:

1. *Kafka + Spark Integration*: Kafka streams real-time data, which is consumed by Spark's Structured Streaming to process data (e.g., users_created topic) and store it in Cassandra.

2. *Cassandra + Spark Integration*: Use Spark-Cassandra connectors to seamlessly write processed data into Cassandra.

3. *Airflow + Spark Integration*: Airflow DAGs trigger Spark jobs for data processing after Kafka ingestion tasks.

4. *Schema Registry*: Enforce Avro/Protobuf schemas for structured Kafka messages, ensuring smooth producer-consumer communication.

This full setup offers fault tolerance, scalability, and real-time analytics, perfect for live data processing use cases like user profiling or scalable streaming architectures.

**Key Highlights**:

1. *Airflow DAG*: Automates daily data collection via API, streams formatted data to Kafka topics.

2. *Kafka Streaming*: Simulates real-time user data ingestion.

3. *Spark Structured Streaming*: Transforms and maps data to schemas for downstream processes.

4. *Cassandra*: Efficiently stores and retrieves processed data from Spark.

This end-to-end pipeline is ideal for automating workflows, handling real-time data ingestion, and enabling robust data analytics.

#DataEngineering #ApacheKafka #ApacheSpark #Cassandra #ApacheAirflow #RealTimeData #BigData #Streaming