Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/airscholar/e2e-data-engineering
An end-to-end data engineering pipeline that orchestrates data ingestion, processing, and storage using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. All components are containerized with Docker for easy deployment and scalability.
apache-airflow apache-kafka apache-spark apache-zookeeper big-data cassandra containerization data-engineering data-pipeline data-processing data-storage docker etl-pipeline postgresql real-time-analytics
- Host: GitHub
- URL: https://github.com/airscholar/e2e-data-engineering
- Owner: airscholar
- Created: 2023-09-06T07:38:30.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-10-05T23:33:34.000Z (over 1 year ago)
- Last Synced: 2025-01-15T09:53:47.074Z (7 days ago)
- Topics: apache-airflow, apache-kafka, apache-spark, apache-zookeeper, big-data, cassandra, containerization, data-engineering, data-pipeline, data-processing, data-storage, docker, etl-pipeline, postgresql, real-time-analytics
- Language: Python
- Homepage: https://www.youtube.com/watch?v=GqAcTrqKcrY
- Size: 285 KB
- Stars: 220
- Watchers: 4
- Forks: 101
- Open Issues: 4
Metadata Files:
- Readme: README.md
# Realtime Data Streaming | End-to-End Data Engineering Project
## Table of Contents
- [Introduction](#introduction)
- [System Architecture](#system-architecture)
- [What You'll Learn](#what-youll-learn)
- [Technologies](#technologies)
- [Getting Started](#getting-started)
- [Watch the Video Tutorial](#watch-the-video-tutorial)
## Introduction
This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Everything is containerized using Docker for ease of deployment and scalability.
## System Architecture
![System Architecture](https://github.com/airscholar/e2e-data-engineering/blob/main/Data%20engineering%20architecture.png)
The project is designed with the following components:
- **Data Source**: We use the `randomuser.me` API to generate random user data for our pipeline.
- **Apache Airflow**: Responsible for orchestrating the pipeline and storing fetched data in a PostgreSQL database.
- **Apache Kafka and Zookeeper**: Used for streaming data from PostgreSQL to the processing engine.
- **Control Center and Schema Registry**: Help with monitoring and schema management for our Kafka streams.
- **Apache Spark**: For data processing with its master and worker nodes.
- **Cassandra**: Where the processed data will be stored.
## What You'll Learn
- Setting up a data pipeline with Apache Airflow
- Real-time data streaming with Apache Kafka
- Distributed synchronization with Apache Zookeeper
- Data processing techniques with Apache Spark
- Data storage solutions with Cassandra and PostgreSQL
- Containerizing your entire data engineering setup with Docker
## Technologies
- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker
## Getting Started
1. Clone the repository:
```bash
git clone https://github.com/airscholar/e2e-data-engineering.git
```
2. Navigate to the project directory:
```bash
cd e2e-data-engineering
```
3. Run Docker Compose to spin up the services:
```bash
docker-compose up
```
For more detailed instructions, please check out the video tutorial linked below.
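Once the services are running, the ingestion step from the System Architecture section can be pictured roughly as follows. This is a minimal sketch, not the repository's actual DAG code: the function names (`fetch_user`, `format_user`, `to_payload`) and the choice of flattened fields are assumptions made for illustration. It only assumes that `https://randomuser.me/api/` returns JSON with a `results` array of nested user records, and it uses the Python standard library instead of a Kafka client.

```python
import json
import urllib.request

API_URL = "https://randomuser.me/api/"  # public endpoint of the data source named above


def fetch_user(url: str = API_URL, timeout: float = 10.0) -> dict:
    """Fetch one random user and return the first entry of the "results" array."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)["results"][0]


def format_user(raw: dict) -> dict:
    """Flatten the nested API record into a flat row (hypothetical field selection)."""
    location = raw["location"]
    street = location["street"]
    return {
        "first_name": raw["name"]["first"],
        "last_name": raw["name"]["last"],
        "gender": raw["gender"],
        "address": f"{street['number']} {street['name']}, {location['city']}, "
                   f"{location['state']}, {location['country']}",
        # The API may return the postcode as a number or a string, so normalize it.
        "postcode": str(location["postcode"]),
        "email": raw["email"],
        "username": raw["login"]["username"],
        "registered_date": raw["registered"]["date"],
        "phone": raw["phone"],
    }


def to_payload(row: dict) -> bytes:
    """Serialize a row as UTF-8 JSON bytes, a common format for a Kafka message value."""
    return json.dumps(row).encode("utf-8")


# Example usage (requires network access):
#   payload = to_payload(format_user(fetch_user()))
```

In a setup like the one described here, a task scheduled by Airflow would call something like `fetch_user` and `format_user` periodically and hand the resulting bytes to a Kafka producer; the topic name and producer configuration are left out because they depend on the Docker Compose setup.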
## Watch the Video Tutorial
For a complete walkthrough and practical demonstration, check out our [YouTube Video Tutorial](https://www.youtube.com/watch?v=GqAcTrqKcrY).