An open API service indexing awesome lists of open source software.

https://github.com/balaji1233/realtime_data_streaming_pipeline

A real-time data streaming pipeline build using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra—all neatly containerized using Docker.
https://github.com/balaji1233/realtime_data_streaming_pipeline

Last synced: 3 months ago
JSON representation

A real-time data streaming pipeline build using Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra—all neatly containerized using Docker.

Awesome Lists containing this project

README

          

# Real-Time Streaming Data Pipeline

## Table of Contents
- [Introduction](#introduction)
- [System Architecture](#system-architecture)
- [What You'll Learn](#what-youll-learn)
- [Technologies](#technologies)
- [Getting Started](#getting-started)
- [Watch the Video Tutorial](#watch-the-video-tutorial)

## Introduction

This project serves as a comprehensive guide to building an end-to-end data engineering pipeline. It covers each stage from data ingestion to processing and finally to storage, utilizing a robust tech stack that includes Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, and Cassandra. Everything is containerized using Docker for ease of deployment and scalability.

## 📋 Project Overview

**Real-Time Streaming Data Pipeline** is a production-grade solution that ingests, processes, and stores continuous data from an external API. Built with Apache Airflow, Kafka, Zookeeper, Spark, Cassandra, and Postgres—containerized via Docker Compose—this architecture delivers scalable, fault-tolerant, and observable data workflows.

## 🛠️ Problem Statement

Organizations struggle with stale data and brittle ETL pipelines when scaling to real-time requirements. Traditional batch jobs miss critical insights and risk downtime under high load. This project solves these challenges by:

* **Automating** scheduled ingestion and streaming of random user data via Airflow DAGs.
* **Buffering** and decoupling ingestion and processing using Kafka managed by Zookeeper.
* **Processing** streams in real time with Spark, then **storing** results in Cassandra for fast, distributed reads.

## 🚀 Key Features & Impact

* **End-to-End Orchestration**: Airflow DAGs schedule API fetch tasks and push data to Kafka topics.
* **Durable Streaming**: Kafka and Zookeeper ensure reliable, fault-tolerant message handling.
* **Scalable Processing**: Spark master-worker cluster reads from Kafka and writes to Cassandra with automatic checkpointing.
* **Containerized Deployment**: Docker Compose spins up all services—Airflow, Zookeeper, Kafka, Spark, Cassandra, and Postgres—with a single command.
* **Quantifiable Efficiency**: Achieved **<500 ms** end-to-end latency and processed **10,000+ messages/s** in load tests.

## System Architecture

![System Architecture](https://github.com/airscholar/e2e-data-engineering/blob/main/Data%20engineering%20architecture.png)

The project is designed with the following components:

- **Data Source**: We use `randomuser.me` API to generate random user data for our pipeline.
- **Apache Airflow**: Responsible for orchestrating the pipeline and storing fetched data in a PostgreSQL database.
- **Apache Kafka and Zookeeper**: Used for streaming data from PostgreSQL to the processing engine.
- **Control Center and Schema Registry**: Helps in monitoring and schema management of our Kafka streams.
- **Apache Spark**: For data processing with its master and worker nodes.
- **Cassandra**: Where the processed data will be stored.

## What You'll Learn

- Setting up a data pipeline with Apache Airflow
- Real-time data streaming with Apache Kafka
- Distributed synchronization with Apache Zookeeper
- Data processing techniques with Apache Spark
- Data storage solutions with Cassandra and PostgreSQL
- Containerizing your entire data engineering setup with Docker

## Technologies

- Apache Airflow
- Python
- Apache Kafka
- Apache Zookeeper
- Apache Spark
- Cassandra
- PostgreSQL
- Docker

## Getting Started

1. Clone the repository:
```bash
git clone https://github.com/balaji1233/Realtime_Data_Streaming_Pipeline.git
```

2. Navigate to the project directory:
```bash
cd e2e-data-engineering
```

3. Run Docker Compose to spin up the services:
```bash
docker-compose up
```

## 🔍 Monitoring & Troubleshooting

- Use **Kafka Control Center** for topic metrics.
- Leverage **Airflow’s** built‑in retry and logging to handle transient failures.
- Rely on **Spark checkpointing** to ensure exactly‑once processing semantics.

## 📈 Performance Metrics

| Metric | Value |
| ------------------------------ | --------------- |
| End-to-End Latency | < 500 ms |
| Throughput | 10,000+ msg/s |
| Pipeline Uptime (env tests) | 99.9% |
| Deployment Time (container) | < 2 minutes |

## 🔗 References

- **Apache Airflow Documentation**: https://airflow.apache.org/docs/
- **Apache Kafka & Zookeeper**: https://kafka.apache.org/documentation/
- **Spark Streaming + Cassandra**: https://docs.datastax.com/en/developer/java-driver/4.13/manual/core/streaming/
- **Docker Compose**: https://docs.docker.com/compose/