https://github.com/redgerd/data_pipeline_rappelconso


Host: GitHub
URL: https://github.com/redgerd/data_pipeline_rappelconso
Owner: Redgerd
License: mit
Created: 2025-05-18T19:37:58.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-25T15:10:13.000Z (about 1 year ago)
Last Synced: 2025-06-05T11:49:39.787Z (about 1 year ago)
Language: Python
Size: 126 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# Product Recall Streaming Pipeline

![Kafka](https://img.shields.io/badge/Apache%20Kafka-231F20?style=for-the-badge&logo=apachekafka&logoColor=white)

![Airflow](https://img.shields.io/badge/Apache%20Airflow-017CEE?style=for-the-badge&logo=apacheairflow&logoColor=white)

![PySpark](https://img.shields.io/badge/Apache%20PySpark-FF6F00?style=for-the-badge&logo=apachehadoop&logoColor=white)

![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge&logo=docker&logoColor=white)

This project builds a streaming pipeline for product recall data using Apache Kafka, Apache Airflow, Apache Spark (PySpark), PostgreSQL, and Docker.

![Pipeline overview](img/overview%20.png)

## Overview

The data pipeline consists of three main stages:

1. **Data Streaming:**  

   Data is initially streamed from an external API into a Kafka topic. This simulates real-time data ingestion into the system.

2. **Data Processing:**  

   A Spark job consumes the data from the Kafka topic and processes it before saving the results into a PostgreSQL database.

![image](https://github.com/user-attachments/assets/89c14408-d503-4f3b-9c3c-19a3b0a276a8)

3. **Orchestration with Airflow:**  

   The entire workflow, including the Kafka streaming task and the Spark processing job, is orchestrated using Apache Airflow.

![image](https://github.com/user-attachments/assets/8dbe2eea-a658-4459-9288-51b236b6e94c)
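
The handoff between the streaming and processing stages can be sketched in plain Python. This is a minimal illustration, not the repository's actual code: the topic name `recalls`, the record fields `reference` and `brand`, and the helpers `publish_records`, `to_row`, and `InMemoryProducer` are all hypothetical. The producer only needs a `send(topic, value=..., key=...)` method, which matches the shape of `KafkaProducer.send` in the `kafka-python` library, so a real producer could be swapped in.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Protocol


class Producer(Protocol):
    """Anything exposing a kafka-python style send(topic, value=..., key=...)."""

    def send(self, topic: str, value: bytes, key: bytes) -> Any: ...


def publish_records(producer: Producer, topic: str, records: Iterable[dict]) -> int:
    """Stage 1: serialize each API record as JSON and send it to the topic."""
    sent = 0
    for record in records:
        # "reference" is a hypothetical unique id field used as the message key.
        key = str(record["reference"]).encode("utf-8")
        value = json.dumps(record, ensure_ascii=False).encode("utf-8")
        producer.send(topic, value=value, key=key)
        sent += 1
    return sent


def to_row(message_value: bytes) -> tuple[str, str]:
    """Stage 2: decode one message into a (reference, brand) row for the database."""
    record = json.loads(message_value.decode("utf-8"))
    # "brand" is a hypothetical field; normalize whitespace and case before storing.
    brand = (record.get("brand") or "").strip().lower()
    return str(record["reference"]), brand


class InMemoryProducer:
    """Stand-in for a real producer so the sketch runs without a broker."""

    def __init__(self) -> None:
        self.messages: list[tuple[str, bytes, bytes]] = []

    def send(self, topic: str, value: bytes, key: bytes) -> None:
        self.messages.append((topic, key, value))


if __name__ == "__main__":
    producer = InMemoryProducer()
    publish_records(producer, "recalls", [{"reference": "A-1", "brand": "  Acme "}])
    for _topic, _key, value in producer.messages:
        print(to_row(value))
```

In the pipeline described above, the decoding and cleaning step would run inside the Spark job rather than as a standalone function, and the rows would be written to PostgreSQL instead of printed.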

   

## Deployment

All components are containerized with **Docker** and run together through **docker-compose**, so the full stack (Kafka, Spark, Airflow, and PostgreSQL) can be set up and started the same way on any machine with Docker installed.
