https://github.com/nitindatta8/realtime-data-streaming
End-to-end data engineering pipeline with various technologies to ingest real time data.
https://github.com/nitindatta8/realtime-data-streaming
apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra data-engineering data-engineering-pipeline data-processing docker etl-pipeline postgresql
Last synced: 5 months ago
JSON representation
End-to-end data engineering pipeline with various technologies to ingest real time data.
- Host: GitHub
- URL: https://github.com/nitindatta8/realtime-data-streaming
- Owner: NitinDatta8
- Created: 2023-11-03T01:58:03.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-03T02:22:53.000Z (over 2 years ago)
- Last Synced: 2024-01-26T14:36:17.997Z (over 2 years ago)
- Topics: apache-airflow, apache-kafka, apache-spark, apache-zookeeper, big-data, cassandra, data-engineering, data-engineering-pipeline, data-processing, docker, etl-pipeline, postgresql
- Language: Python
- Homepage:
- Size: 284 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Realtime Data Streaming | End-to-End Data Engineering Project
## Introduction
This project employs a multifaceted technological stack to establish an end-to-end data processing pipeline. The workflow commences by fetching data from the `randomuser.me` API to generate synthetic user data. This raw data is subsequently channeled through **Apache Airflow** for data orchestration and storage in a **PostgreSQL** database.
The data is then streamed through **Apache Kafka** in conjunction with **Apache Zookeeper** to facilitate real-time data movement from PostgreSQL to the processing engine. For streamlined management and monitoring of Kafka streams, **Control Center** and **Schema Registry** are employed to handle schema configurations and ensure effective oversight of the data streams.
Subsequently, **Apache Spark** is utilized to conduct data processing tasks, following which the processed data is persisted in a **Cassandra** database, providing a durable storage solution for the refined information.
The entire pipeline is encapsulated within **Docker** containers, affording a streamlined and portable deployment mechanism.
## System Architecture

## Technologies
- **Data Source**: `randomuser.me` API is used to generate random user data for the pipeline.
- **Apache Airflow**: Helps with orchestrating the pipeline and storing fetched data in a PostgreSQL database.
- **Apache Kafka and Zookeeper**: Used for streaming data from PostgreSQL to the processing engine.
- **Control Center and Schema Registry**: Helps in monitoring and schema management of the Kafka streams.
- **Apache Spark**: Responsible for data processing with master and worker nodes.
- **Cassandra**: Database to store the processed data.
- **Docker**: Containerize the entire pipeline.
## Things to learn
- Establishing a data pipeline using Apache Airflow for workflow orchestration and data management.
- Implementing real-time data streaming through Apache Kafka to facilitate data transfer and processing in real-time.
- Enabling distributed synchronization using Apache Zookeeper for robust coordination and reliability in a distributed system.
- Employing data processing techniques powered by Apache Spark for efficient and scalable data transformation and analysis.
- Utilizing data storage solutions with PostgreSQL and Cassandra to securely store and manage structured and unstructured data, respectively.
- Containerizing the entire data engineering infrastructure with Docker to ensure portability and ease of deployment across various environments.
## Acknowledgements
I would like to thank [Yusuf Ganiyu](https://www.linkedin.com/in/yusuf-ganiyu-b90140107/) for this amazing project.
Please follow the tutorial here to build this data engineering pipeline yourself.
[YouTube Video Tutorial](https://www.youtube.com/watch?v=GqAcTrqKcrY)