{"id":22802619,"url":"https://github.com/sidiahmedhabib/e2e-data-engineering","last_synced_at":"2025-07-20T15:03:19.381Z","repository":{"id":258788207,"uuid":"875705972","full_name":"SidiahmedHABIB/e2e-data-engineering","owner":"SidiahmedHABIB","description":"This project is an end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using a variety of powerful tools including Apache Airflow, Apache Kafka, Apache Spark and Cassandra. All components are containerized with Docker for easy deployment and scalability.","archived":false,"fork":false,"pushed_at":"2024-10-20T16:44:52.000Z","size":1814,"stargazers_count":3,"open_issues_count":0,"forks_count":2,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-10T09:18:01.748Z","etag":null,"topics":["apache-airflow","apache-kafka","apache-spark","big-data","cassandra","data-engineering","data-streaming"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SidiahmedHABIB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-10-20T16:36:24.000Z","updated_at":"2024-12-29T13:26:55.000Z","dependencies_parsed_at":"2024-10-20T22:08:19.992Z","dependency_job_id":null,"html_url":"https://github.com/SidiahmedHABIB/e2e-data-engineering","commit_stats":null,"previous_names":["sidiahmedhabib/e2e-data-engineering"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/SidiahmedHABIB/e2e-data-engineering","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/repositories/SidiahmedHABIB%2Fe2e-data-engineering","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SidiahmedHABIB%2Fe2e-data-engineering/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SidiahmedHABIB%2Fe2e-data-engineering/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SidiahmedHABIB%2Fe2e-data-engineering/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SidiahmedHABIB","download_url":"https://codeload.github.com/SidiahmedHABIB/e2e-data-engineering/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SidiahmedHABIB%2Fe2e-data-engineering/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266143941,"owners_count":23883069,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-airflow","apache-kafka","apache-spark","big-data","cassandra","data-engineering","data-streaming"],"created_at":"2024-12-12T09:06:43.304Z","updated_at":"2025-07-20T15:03:19.364Z","avatar_url":"https://github.com/SidiahmedHABIB.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Realtime Data Streaming | End-to-End Data Engineering Project\n\n## Table of Contents\n- [Overview](#overview)\n- [System Architecture](#system-architecture)\n- [Technologies](#technologies)\n- [Getting Started](#getting-started)\n- [License](#license)\n- [MyLinks](#my-links)\n\n## Overview\nIn today's fast-paced, data-driven world, real-time data streaming is crucial for handling 
large volumes of\ndata efficiently and making time-sensitive decisions. Whether it's live updates, monitoring system events, \nor analyzing clickstreams, businesses rely on the ability to collect, process, and store data as it flows in real time.\n\nTo explore this further, I developed an end-to-end data engineering pipeline that automates\nthe data ingestion, processing, and storage lifecycle using a scalable, modern tech stack. \nThis project leverages a variety of technologies to streamline data workflows, making it ideal for \nboth real-time and batch processing use cases.\n\n## System Architecture\n\n![System Architecture](pics/architecture.gif)\n\n#### The project is designed with the following components:\n\n- **Data Source**: We use the `randomuser.me` API to generate random user data for the pipeline.\n- **Apache Airflow**: Orchestrates the pipeline and stores the fetched data in a PostgreSQL database.\n- **Apache Kafka and Zookeeper**: Stream data from PostgreSQL to the processing engine.\n- **Control Center and Schema Registry**: Provide monitoring and schema management for the Kafka streams.\n- **Apache Spark**: Processes the data using its master and worker nodes.\n- **Cassandra**: Stores the processed data.\n- **Docker**: Containerizes the entire pipeline.\n\n#### We can monitor the messages sent to the Kafka topic using Control Center.\n**![Control Center](pics/controlcenter.gif)**\n\n\n## Technologies\n\n- Apache Airflow\n- Python\n- Apache Kafka\n- Apache Zookeeper\n- Apache Spark\n- Cassandra\n- PostgreSQL\n- Docker\n\n## Getting Started\n\n1. Clone the repository:\n    ```bash\n    git clone https://github.com/SidiahmedHABIB/e2e-data-engineering.git\n    ```\n\n2. Navigate to the project directory:\n    ```bash\n    cd e2e-data-engineering\n    ```\n3. Install the Python packages:
\n    ```bash\n    pip install apache-airflow\n    pip install kafka-python\n    pip install pyspark\n    pip install cassandra-driver\n    ```\n4. Run Docker Compose to spin up the services:\n    ```bash\n    docker-compose up\n    ```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n\n## My Links\n[![Facebook](https://img.shields.io/badge/Facebook-1877F2?style=for-the-badge\u0026logo=facebook\u0026logoColor=white)](https://www.facebook.com/habib.sidiahmed.5)   [![LinkedIn](https://img.shields.io/badge/LinkedIn-0077B5?style=for-the-badge\u0026logo=linkedin\u0026logoColor=white)](https://www.linkedin.com/in/sidi-ahmed-habib-18163220a/)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsidiahmedhabib%2Fe2e-data-engineering","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsidiahmedhabib%2Fe2e-data-engineering","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsidiahmedhabib%2Fe2e-data-engineering/lists"}