{"id":15208857,"url":"https://github.com/prekshivyas/datastreamingetl","last_synced_at":"2026-01-20T21:33:09.069Z","repository":{"id":245416622,"uuid":"818062304","full_name":"prekshivyas/DataStreamingETL","owner":"prekshivyas","description":"Utilizing my background and love for Apache Airflow and Data to build a real-time data streaming pipeline","archived":false,"fork":false,"pushed_at":"2024-06-21T09:35:39.000Z","size":739,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-13T08:16:47.260Z","etag":null,"topics":["apache-airflow","apache-kafka","apache-spark","apache-zookeeper","cassandra","data-engineering","data-ingestion","data-pipeline","data-processing","data-visualization","docker","docker-compose"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/prekshivyas.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-21T02:52:15.000Z","updated_at":"2024-06-21T09:59:43.000Z","dependencies_parsed_at":"2024-06-22T02:48:37.247Z","dependency_job_id":"d27cb7b0-ed9b-4bcc-9af9-b153db79762a","html_url":"https://github.com/prekshivyas/DataStreamingETL","commit_stats":null,"previous_names":["prekshivyas/datastreamingetl"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prekshivyas%2FDataStreamingETL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prekshivyas%2FDataStreamingETL/tags","releases_url":"
https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prekshivyas%2FDataStreamingETL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/prekshivyas%2FDataStreamingETL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/prekshivyas","download_url":"https://codeload.github.com/prekshivyas/DataStreamingETL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247585796,"owners_count":20962402,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","apache-kafka","apache-spark","apache-zookeeper","cassandra","data-engineering","data-ingestion","data-pipeline","data-processing","data-visualization","docker","docker-compose"],"created_at":"2024-09-28T07:02:38.366Z","updated_at":"2026-01-20T21:33:09.057Z","avatar_url":"https://github.com/prekshivyas.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DataStreamingETL\n\nUtilizing my background and love for Apache Airflow and data to build a real-time data streaming pipeline, covering each phase from data ingestion to processing and finally storage.\n\nThe system is built using Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra — all neatly containerized using Docker for an end to end project! 
\n\n\n## Architecture\n![Architecture](/image/DATAETL.png)\n\n\n- The project is designed with the following components:\n\n    - Data Source: randomuser.me API to generate random user data for the pipeline.\n    - Apache Airflow: For orchestrating the pipeline and storing fetched data in a PostgreSQL database.\n    - Apache Kafka and Zookeeper: For streaming data from PostgreSQL to the processing engine.\n    - Control Center and Schema Registry: For monitoring and schema management of Kafka streams. Control Center listens for events on the Schema Registry to visualize data directly on Kafka that is managed by Zookeeper.\n    - Apache Spark: For data processing with master and worker nodes.\n    - Cassandra: Where the processed data will be stored.\n\n\n## Steps to run the project:\n1. Clone the repository:\n\n```\ngit clone https://github.com/prekshivyas/DataStreamingETL.git\n```\n\n2. Navigate to the project directory:\n\n```\ncd DataStreamingETL\n```\n\n3. Run Docker Compose to spin up the services:\n\n```\ndocker-compose up\n```","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprekshivyas%2Fdatastreamingetl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprekshivyas%2Fdatastreamingetl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprekshivyas%2Fdatastreamingetl/lists"}