{"id":28426911,"url":"https://github.com/redgerd/data_pipeline_rappelconso","last_synced_at":"2025-06-24T23:30:29.663Z","repository":{"id":294612723,"uuid":"985918087","full_name":"Redgerd/data_pipeline_rappelconso","owner":"Redgerd","description":null,"archived":false,"fork":false,"pushed_at":"2025-05-25T15:10:13.000Z","size":129,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-05T11:49:39.787Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Redgerd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-18T19:37:58.000Z","updated_at":"2025-05-25T15:10:16.000Z","dependencies_parsed_at":"2025-05-21T08:57:53.747Z","dependency_job_id":null,"html_url":"https://github.com/Redgerd/data_pipeline_rappelconso","commit_stats":null,"previous_names":["redgerd/data_pipeline_rappelconso"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Redgerd/data_pipeline_rappelconso","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Redgerd%2Fdata_pipeline_rappelconso","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Redgerd%2Fdata_pipeline_rappelconso/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Redgerd%2Fdata_pipeline_rappelconso/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Redgerd%2Fdata_pipeline_rappelconso/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Redgerd","download_url":"https://codeload.github.com/Redgerd/data_pipeline_rappelconso/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Redgerd%2Fdata_pipeline_rappelconso/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":261774426,"owners_count":23207734,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-06-05T11:37:24.229Z","updated_at":"2025-06-24T23:30:29.657Z","avatar_url":"https://github.com/Redgerd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Product Recall Streaming Pipeline\n![Kafka](https://img.shields.io/badge/Apache%20Kafka-231F20?style=for-the-badge\u0026logo=apachekafka\u0026logoColor=white)\n![Airflow](https://img.shields.io/badge/Apache%20Airflow-017CEE?style=for-the-badge\u0026logo=apacheairflow\u0026logoColor=white)\n![PySpark](https://img.shields.io/badge/Apache%20PySpark-FF6F00?style=for-the-badge\u0026logo=apachehadoop\u0026logoColor=white)\n![Docker](https://img.shields.io/badge/docker-%230db7ed.svg?style=for-the-badge\u0026logo=docker\u0026logoColor=white)\n\n\nThis project uses Apache Kafka, Apache Airflow, Apache Spark, PostgreSQL, and Docker.\n\n![alt text](img/overview%20.png)\n\n## Overview\n\nThe data pipeline consists of three main stages:\n\n1. **Data Streaming:**  \n   Data is initially streamed from an external API into a Kafka topic. This simulates real-time data ingestion into the system.\n\n2. 
**Data Processing:**  \n   A Spark job consumes the data from the Kafka topic and processes it before saving the results into a PostgreSQL database.\n\n![image](https://github.com/user-attachments/assets/89c14408-d503-4f3b-9c3c-19a3b0a276a8)\n\n3. **Orchestration with Airflow:**  \n   The entire workflow — including the Kafka streaming task and the Spark processing job — is orchestrated using Apache Airflow.\n\n![image](https://github.com/user-attachments/assets/8dbe2eea-a658-4459-9288-51b236b6e94c)\n   \n## Deployment\n\nAll components are containerized and managed using **Docker** and **docker-compose**, ensuring easy setup, portability, and scalability.\n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredgerd%2Fdata_pipeline_rappelconso","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fredgerd%2Fdata_pipeline_rappelconso","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fredgerd%2Fdata_pipeline_rappelconso/lists"}