{"id":42486888,"url":"https://github.com/nitindatta8/realtime-data-streaming","last_synced_at":"2026-01-28T11:29:30.978Z","repository":{"id":205205268,"uuid":"713669416","full_name":"NitinDatta8/realtime-data-streaming","owner":"NitinDatta8","description":"End-to-end data engineering pipeline with various technologies to ingest real time data.","archived":false,"fork":false,"pushed_at":"2023-11-03T02:22:53.000Z","size":291,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-01-26T14:36:17.997Z","etag":null,"topics":["apache-airflow","apache-kafka","apache-spark","apache-zookeeper","big-data","cassandra","data-engineering","data-engineering-pipeline","data-processing","docker","etl-pipeline","postgresql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NitinDatta8.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2023-11-03T01:58:03.000Z","updated_at":"2024-01-26T14:36:17.998Z","dependencies_parsed_at":null,"dependency_job_id":"a1a66792-9178-4ed7-b862-04a501a6199d","html_url":"https://github.com/NitinDatta8/realtime-data-streaming","commit_stats":{"total_commits":2,"total_committers":2,"mean_commits":1.0,"dds":0.5,"last_synced_commit":"f40b61c2d940fe87b27194d701143b17a98be63a"},"previous_names":["nitindatta8/realtime-data-streaming"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/NitinDatta8/realtime-data-streaming","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NitinDatta8%2Frealtime-data-str
eaming","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NitinDatta8%2Frealtime-data-streaming/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NitinDatta8%2Frealtime-data-streaming/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NitinDatta8%2Frealtime-data-streaming/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NitinDatta8","download_url":"https://codeload.github.com/NitinDatta8/realtime-data-streaming/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NitinDatta8%2Frealtime-data-streaming/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28845088,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-28T10:53:21.605Z","status":"ssl_error","status_checked_at":"2026-01-28T10:53:20.789Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","apache-kafka","apache-spark","apache-zookeeper","big-data","cassandra","data-engineering","data-engineering-pipeline","data-processing","docker","etl-pipeline","postgresql"],"created_at":"2026-01-28T11:29:30.351Z","updated_at":"2026-01-28T11:29:30.965Z","avatar_url":"https://github.com/NitinDatta8.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Realtime Data Streaming | 
End-to-End Data Engineering Project\n\n## Introduction\nThis project builds an end-to-end data processing pipeline from several open-source components. The workflow starts by fetching synthetic user data from the `randomuser.me` API. **Apache Airflow** orchestrates the fetch and stores the raw data in a **PostgreSQL** database.\n\nThe data is then streamed through **Apache Kafka**, coordinated by **Apache Zookeeper**, to move it in real time from PostgreSQL to the processing engine. **Control Center** and **Schema Registry** are used to monitor the Kafka streams and manage their schemas.\n\n**Apache Spark** then processes the data, and the results are persisted in a **Cassandra** database for durable storage.\n\nThe entire pipeline runs in **Docker** containers, which keeps deployment portable. 
\n\n## System Architecture\n![System Architecture](https://github.com/NitinDatta8/realtime-data-streaming/blob/main/Data%20engineering%20architecture.png)\n\n## Technologies\n- **Data Source**: The `randomuser.me` API generates random user data for the pipeline.\n- **Apache Airflow**: Orchestrates the pipeline and stores fetched data in a PostgreSQL database.\n- **Apache Kafka and Zookeeper**: Stream data from PostgreSQL to the processing engine.\n- **Control Center and Schema Registry**: Provide monitoring and schema management for the Kafka streams.\n- **Apache Spark**: Processes the data using master and worker nodes.\n- **Cassandra**: Stores the processed data.\n- **Docker**: Containerizes the entire pipeline.\n\n## Things to learn\n- Establishing a data pipeline with Apache Airflow for workflow orchestration and data management.\n- Implementing real-time data streaming with Apache Kafka to transfer and process data as it arrives.\n- Using Apache Zookeeper for distributed synchronization and coordination.\n- Processing data with Apache Spark for scalable transformation and analysis.\n- Storing data in PostgreSQL (a relational database) and Cassandra (a wide-column NoSQL database).\n- Containerizing the entire data engineering infrastructure with Docker for portable deployment across environments.\n\n## Acknowledgements\nI would like to thank [Yusuf Ganiyu](https://www.linkedin.com/in/yusuf-ganiyu-b90140107/) for this amazing project. 
\n\n\nPlease follow the tutorial here to build this data engineering pipeline yourself.\n[YouTube Video Tutorial](https://www.youtube.com/watch?v=GqAcTrqKcrY)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnitindatta8%2Frealtime-data-streaming","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnitindatta8%2Frealtime-data-streaming","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnitindatta8%2Frealtime-data-streaming/lists"}