# bousettayounes/real-time-processing-of-users-data

Developing a data pipeline to stream user data from a user generator API, apply necessary transformations, and seamlessly insert the processed data into a storage system.

https://github.com/bousettayounes/real-time-processing-of-users-data
- Host: GitHub
- URL: https://github.com/bousettayounes/real-time-processing-of-users-data
- Owner: bousettayounes
- Created: 2024-08-30T16:36:30.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-09-25T12:50:45.000Z (4 months ago)
- Last Synced: 2025-01-05T15:48:14.759Z (30 days ago)
- Topics: airflow, cassandra, dataengineering, datastreaming, docker, kafka, postgresql, spark, streaming
- Language: Python
- Homepage:
- Size: 4.08 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Real-time User Data Streaming
Developing a data pipeline to stream user data from a user generator API, apply necessary transformations, and seamlessly insert the processed data into a distributed storage system.
In this project, I have used the following technologies:
- DOCKER
- APACHE AIRFLOW
- APACHE KAFKA with schema registry & Control Center
- APACHE CASSANDRA
- APACHE SPARK CLUSTER [PYSPARK]
- POSTGRESQL

---
![Project Architecture](https://github.com/user-attachments/assets/9f1ab06f-515a-4259-9a5a-914cf2393059)
---
This project aims to create a data streaming pipeline using the Kappa architecture, fully deployed within Docker containers for easy management and scalability. The pipeline begins with streaming user data generated by a Random Generator API into a Kafka broker. The data is structured according to a predefined schema stored in the Schema Registry, ensuring consistency and compatibility across the pipeline.
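As a rough illustration of the ingestion side, the sketch below fetches one generated user from the randomuser.me API and publishes it to Kafka with kafka-python. The endpoint, broker address, topic name (`users_created`), and field selection are assumptions rather than details confirmed by the repository, and for brevity it serializes plain JSON instead of registering an Avro schema with the Schema Registry described above:

```python
import json
import requests
from kafka import KafkaProducer

# Assumed broker address; in the dockerized setup this would typically be
# the broker's service name rather than localhost.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def stream_one_user():
    # Fetch a single randomly generated user record (assumed generator API).
    res = requests.get("https://randomuser.me/api/").json()["results"][0]
    user = {
        "first_name": res["name"]["first"],
        "last_name": res["name"]["last"],
        "email": res["email"],
        "address": f'{res["location"]["street"]["number"]} '
                   f'{res["location"]["street"]["name"]}',
    }
    # Publish the record to the (assumed) topic consumed by Spark.
    producer.send("users_created", value=user)
    producer.flush()

if __name__ == "__main__":
    stream_one_user()
```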
Once the data is ingested into a Kafka topic, it is processed in real time by a Spark cluster. Spark applies the necessary transformations to the incoming data streams, and the processed records are then loaded into a Cassandra keyspace for storage and querying.
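The processing step might look like the following PySpark sketch: it subscribes to the topic, parses the JSON payload against an explicit schema, and appends each micro-batch to Cassandra through the Spark Cassandra Connector. The topic, keyspace, and table names are illustrative assumptions, not taken from the repository:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Requires the spark-sql-kafka and spark-cassandra-connector packages on the
# classpath (e.g. via spark-submit --packages); exact versions are assumptions.
spark = (
    SparkSession.builder
    .appName("UserDataStreaming")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

# Expected shape of the JSON records produced by the ingestion step.
schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("email", StringType()),
    StructField("address", StringType()),
])

# Read raw bytes from the Kafka topic and parse the JSON payload.
users = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "users_created")                 # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("data"))
    .select("data.*")
)

# Append each micro-batch to an (assumed) Cassandra keyspace and table.
def write_to_cassandra(batch_df, batch_id):
    (batch_df.write
        .format("org.apache.spark.sql.cassandra")
        .mode("append")
        .options(keyspace="spark_streams", table="created_users")
        .save())

users.writeStream.foreachBatch(write_to_cassandra).start().awaitTermination()
```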
Apache Airflow plays a crucial role by orchestrating the entire data pipeline, managing and scheduling the various tasks involved. This ensures that each component runs in the correct sequence and that dependencies between tasks are handled efficiently.
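A minimal DAG for this kind of pipeline could look like the sketch below; the DAG id, schedule, and `stream_users` callable are hypothetical stand-ins for whatever the repository's DAG actually defines:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def stream_users():
    # Placeholder for the ingestion logic above (fetch users, produce to Kafka).
    ...

# DAG id, start date, and schedule are illustrative assumptions.
with DAG(
    dag_id="user_automation",
    start_date=datetime(2024, 9, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_data_from_api",
        python_callable=stream_users,
    )
```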
The entire data pipeline is deployed within Docker containers, providing an isolated and consistent environment for each component. By leveraging the Kappa architecture, this pipeline focuses on processing real-time data streams, ensuring that the system can efficiently handle large volumes of user-generated data.
This project showcases the integration of Kafka for distributed data streaming, Spark for real-time processing, Cassandra for scalable storage, and Apache Airflow for workflow orchestration, all packaged in Docker for a streamlined and easily deployable solution.