https://github.com/sadafasad/realtime-data-streaming
Realtime user streaming data pipeline
apache-airflow apache-kafka apache-spark api cassandra python shell-script
- Host: GitHub
- URL: https://github.com/sadafasad/realtime-data-streaming
- Owner: SadafAsad
- Created: 2024-01-21T20:03:48.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-01-25T15:15:16.000Z (about 1 year ago)
- Last Synced: 2024-11-19T22:49:26.547Z (3 months ago)
- Topics: apache-airflow, apache-kafka, apache-spark, api, cassandra, python, shell-script
- Language: Python
- Size: 40 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Realtime User Data Streaming
This project aims to simulate a live data streaming pipeline for collecting, transforming, and storing data. It leverages Python, Airflow, PostgreSQL, Kafka, Spark, and Cassandra to create an end-to-end solution. Docker is employed to containerize everything, ensuring consistency and simplified deployment.

## System Architecture
- Data Collection: [randomuser.me](https://randomuser.me/) API serves as the source of raw data
- Data Transformation: Python is employed for transforming raw data into the required format
- Airflow: Orchestrates the entire workflow
- PostgreSQL: Set up alongside Airflow for metadata storage or other purposes
- Kafka (Confluent Cloud): Central hub for data streaming
  * Zookeeper: Coordinates and manages Kafka brokers
  * Control Center: Provides real-time monitoring and management capabilities for Kafka
  * Schema Registry: Manages schema evolution and compatibility in the Kafka topics
- Spark: Configured with a master and a worker to consume data from Kafka and process it
- Cassandra: Used as the destination storage

## Acknowledgments
Thanks to [e2e-data-engineering](https://github.com/airscholar/e2e-data-engineering), developed by [Yusuf Ganiyu](https://github.com/airscholar), for providing inspiration and code snippets.
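
## Example: Transforming a User Record

The Data Transformation step under System Architecture can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names `format_user` and `to_message`, the output field names, and the sample record are all hypothetical. The nested input keys (`name`, `location`, `login`, and so on) assume the publicly documented shape of one entry in the randomuser.me `results` array. The sketch only flattens a record and serializes it to JSON bytes; fetching from the API and sending to Kafka are left out.

```python
import json

# Hypothetical sample, shaped like one entry of the randomuser.me "results" array.
SAMPLE_USER = {
    "gender": "female",
    "name": {"title": "Ms", "first": "Ada", "last": "Lovelace"},
    "location": {
        "street": {"number": 12, "name": "Example Street"},
        "city": "London",
        "state": "England",
        "country": "United Kingdom",
        "postcode": "N1 9GU",
    },
    "email": "ada.lovelace@example.com",
    "login": {"username": "adal"},
    "phone": "000-0000",
}


def format_user(raw: dict) -> dict:
    """Flatten a nested user record into a flat dict (assumed target format)."""
    location = raw["location"]
    street = location["street"]
    address = (
        f"{street['number']} {street['name']}, {location['city']}, "
        f"{location['state']}, {location['country']}"
    )
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "gender": raw["gender"],
        "address": address,
        # Postcodes may arrive as int or str depending on the country.
        "postcode": str(location["postcode"]),
        "email": raw["email"],
        "username": raw["login"]["username"],
        "phone": raw["phone"],
    }


def to_message(record: dict) -> bytes:
    """Serialize a flat record to UTF-8 JSON bytes, as a producer might send it."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    print(to_message(format_user(SAMPLE_USER)).decode("utf-8"))
```

In a pipeline like the one described here, an Airflow task would presumably call something like `format_user` on each fetched record and hand the resulting bytes to a Kafka producer, with Spark later reading the same flat fields before writing to Cassandra.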