# Real-time User Data Streaming

A data pipeline that streams user data from a user generator API, applies the necessary transformations, and inserts the processed data into a distributed storage system.

In this project, I have used the following technologies:

- Docker
- Apache Airflow
- Apache Kafka (with Schema Registry & Control Center)
- Apache Cassandra
- Apache Spark cluster (PySpark)
- PostgreSQL

----------------------------------------------------------------------------------------------------------------------------------------------

![Project Architecture](https://github.com/user-attachments/assets/9f1ab06f-515a-4259-9a5a-914cf2393059)

----------------------------------------------------------------------------------------------------------------------------------------------

This project implements a data streaming pipeline based on the Kappa architecture, fully deployed within Docker containers for easy management and scalability. The pipeline begins by streaming user data generated by a random user generator API into a Kafka broker. The data is structured according to a predefined schema stored in the Schema Registry, ensuring consistency and compatibility across the pipeline.
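
As a rough illustration of this ingestion step, the sketch below fetches one generated user and publishes it to a Kafka topic. It assumes the randomuser.me API as the user generator and the kafka-python client, and uses plain JSON instead of the Schema Registry's serialization for brevity; the `users_created` topic name and broker address are illustrative, not taken from this repository.

```python
import json

import requests
from kafka import KafkaProducer

# Illustrative ingestion sketch: the API endpoint, topic name, and broker
# address are assumptions, not taken from this repository.
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)


def fetch_user():
    """Fetch a single generated user from the (assumed) randomuser.me API."""
    res = requests.get("https://randomuser.me/api/")
    res.raise_for_status()
    return res.json()["results"][0]


def format_user(raw):
    """Flatten the nested API payload into the fields the pipeline needs."""
    location = raw["location"]
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "gender": raw["gender"],
        "address": f"{location['street']['number']} {location['street']['name']}, "
                   f"{location['city']}, {location['country']}",
        "email": raw["email"],
        "username": raw["login"]["username"],
    }


if __name__ == "__main__":
    user = format_user(fetch_user())
    producer.send("users_created", value=user)
    producer.flush()  # make sure the record leaves the client buffer
```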

Once the data is ingested into a Kafka topic, it is processed in real time by a Spark cluster, which applies the necessary transformations to the incoming streams. After processing, the data is loaded into a Cassandra keyspace for storage and querying.
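
A minimal sketch of this processing step, assuming PySpark Structured Streaming with the DataStax spark-cassandra-connector; the topic, schema fields, keyspace, and table names mirror the illustrative producer above rather than this repository's actual configuration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

# Illustrative processing sketch: read the (assumed) users_created topic,
# parse the JSON payload, and stream it into a Cassandra table.
spark = (
    SparkSession.builder
    .appName("user-stream-processor")
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.4.1,"
            "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
    .config("spark.cassandra.connection.host", "localhost")
    .getOrCreate()
)

schema = StructType([
    StructField("first_name", StringType()),
    StructField("last_name", StringType()),
    StructField("gender", StringType()),
    StructField("address", StringType()),
    StructField("email", StringType()),
    StructField("username", StringType()),
])

users = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "users_created")
    .option("startingOffsets", "earliest")
    .load()
    # The Kafka value arrives as raw bytes; decode it and unpack the JSON fields.
    .select(from_json(col("value").cast("string"), schema).alias("user"))
    .select("user.*")
)

query = (
    users.writeStream
    .format("org.apache.spark.sql.cassandra")
    .option("checkpointLocation", "/tmp/user-stream-checkpoint")
    .option("keyspace", "users_keyspace")
    .option("table", "created_users")
    .start()
)
query.awaitTermination()
```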

Apache Airflow plays a crucial role by orchestrating the entire data pipeline, managing and scheduling the tasks involved. This ensures that each component runs in the correct sequence and that dependencies between tasks are handled efficiently.
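
A hedged sketch of what this orchestration might look like as an Airflow DAG: a single daily task that triggers the ingestion step. The DAG id, schedule, and `stream_users` callable are illustrative placeholders, not this repository's actual DAG.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stream_users():
    """Placeholder for the producer logic sketched above (assumed entry point)."""
    ...


# Illustrative DAG: one task that runs the Kafka ingestion once a day.
with DAG(
    dag_id="user_data_streaming",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="stream_users_to_kafka",
        python_callable=stream_users,
    )
```

In practice the callable would wrap the producer logic shown earlier, so Airflow controls when new data enters the pipeline.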

The entire data pipeline is deployed within Docker containers, providing an isolated and consistent environment for each component. By leveraging the Kappa architecture, this pipeline focuses on processing real-time data streams, ensuring that the system can efficiently handle large volumes of user-generated data.

This project showcases the integration of Kafka for distributed data streaming, Spark for real-time processing, Cassandra for scalable storage, and Apache Airflow for workflow orchestration, all running within Docker for a streamlined and easily deployable solution.