https://github.com/shivanmathur/real-time-data-streaming-pipeline
- Host: GitHub
- URL: https://github.com/shivanmathur/real-time-data-streaming-pipeline
- Owner: ShivanMathur
- License: mit
- Created: 2025-06-14T23:27:06.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-15T06:27:39.000Z (about 1 year ago)
- Last Synced: 2025-10-07T06:41:07.700Z (9 months ago)
- Language: PLpgSQL
- Size: 1.22 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Real-Time Data Streaming Pipeline 🚀
This project implements a **real-time data streaming architecture** using **Apache NiFi**, **Docker**, **AWS**, and **Snowflake**, simulating continuous customer data ingestion and enabling both current and historical data tracking through **Slowly Changing Dimensions (SCD1 & SCD2)**.
## 🧠 Project Overview
The pipeline simulates real-time streaming of synthetic customer data, processes it using NiFi workflows, stores it in AWS S3, and ingests it into Snowflake using **Snowpipe** and **Streams**. The system supports real-time tracking of inserts/updates and maintains historical versions of data.
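As a rough illustration of the data-generation step, here is a minimal sketch of a batch generator. It is not the repository's script: the project uses `Faker`, while this version substitutes the standard-library `random` module so it runs without extra dependencies. The field names, value lists, and the `make_customer` / `make_batch_csv` helpers are all assumptions made for the example.

```python
import csv
import io
import random
from datetime import datetime, timezone

# Hypothetical schema; the actual column names are not listed in this README.
FIELDS = ["customer_id", "first_name", "last_name", "email", "city", "update_timestamp"]

# Small fixed value lists stand in for Faker's generators.
FIRST_NAMES = ["Ana", "Ben", "Chen", "Dina"]
LAST_NAMES = ["Ito", "Khan", "Lopez", "Meyer"]
CITIES = ["Austin", "Boston", "Denver", "Seattle"]


def make_customer(rng: random.Random, customer_id: int) -> dict:
    """Build one synthetic customer record."""
    first = rng.choice(FIRST_NAMES)
    last = rng.choice(LAST_NAMES)
    return {
        "customer_id": customer_id,
        "first_name": first,
        "last_name": last,
        "email": f"{first}.{last}{customer_id}@example.com".lower(),
        "city": rng.choice(CITIES),
        "update_timestamp": datetime.now(timezone.utc).isoformat(),
    }


def make_batch_csv(rng: random.Random, n_rows: int, id_pool: int = 50) -> str:
    """Render one batch as CSV text.

    IDs are drawn from a small pool so that later batches repeat an ID with
    different attributes, which is what exercises SCD1/SCD2 handling downstream.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    for _ in range(n_rows):
        writer.writerow(make_customer(rng, rng.randint(1, id_pool)))
    return buf.getvalue()
```

Calling `make_batch_csv(random.Random(), 100)` in a loop and writing each result to a timestamped file in a directory that NiFi watches would approximate the continuous feed described above.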
---
## 🧰 Tech Stack & Tools
- **Data Generation:** Python `Faker` library
- **Containerization:** Docker, Docker Compose
- **Orchestration:** Apache NiFi, Apache ZooKeeper
- **Cloud:** AWS EC2, S3, IAM
- **Data Warehouse:** Snowflake (Snowpipe, Streams, Tasks)
- **Data Modeling:** Slowly Changing Dimensions (SCD1 & SCD2)
---
## 🔄 Architecture Overview

## ✅ Features
- 🔁 Continuous data generation and ingestion
- ⛓️ Real-time data flow orchestration using NiFi
- ☁️ Cloud-native storage with AWS S3
- 🧊 Incremental data loading into Snowflake using Snowpipe
- 🧠 Tracks both latest values (SCD1) and historical changes (SCD2)
- 📦 Fully containerized using Docker & Docker Compose
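
To make the SCD1 / SCD2 distinction concrete, here is a small in-memory sketch. It is only an illustration of the two update rules, not the project's Snowflake SQL: the `HistoryRow` columns (`valid_from`, `valid_to`, `is_current`) and the single tracked attribute `city` are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryRow:
    customer_id: int
    city: str
    valid_from: str
    valid_to: Optional[str]  # None while this version is still current
    is_current: bool


def apply_scd1(current: dict, customer_id: int, city: str) -> None:
    """SCD1: overwrite in place, so only the latest value survives."""
    current[customer_id] = city


def apply_scd2(history: list, customer_id: int, city: str, ts: str) -> None:
    """SCD2: close the open version if the value changed, then append a new one."""
    for row in history:
        if row.customer_id == customer_id and row.is_current:
            if row.city == city:
                return  # value unchanged, nothing to record
            row.valid_to = ts
            row.is_current = False
            break
    history.append(HistoryRow(customer_id, city, ts, None, True))


current = {}
history = []
for cid, city, ts in [(1, "Austin", "2025-01-01"), (1, "Denver", "2025-02-01")]:
    apply_scd1(current, cid, city)
    apply_scd2(history, cid, city, ts)
```

After the two updates, `current` holds a single entry for customer 1, while `history` holds two rows: the closed Austin version and the open Denver version.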
---
## 📊 Output
- Snowpipe automatically loads incoming customer data into a raw table in Snowflake.
- A Stream + Task mechanism captures changes in the raw table and applies them to the SCD1 and SCD2 tables.
- The resulting tables support real-time querying, analytics, and historical tracking.
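
The Stream + Task idea can be pictured with a toy model: a stream is an offset into an append-only raw table, and each scheduled task run consumes whatever arrived since the previous run. The sketch below only loosely mirrors how a Snowflake stream advances once its contents are consumed. `ToyStream` and `run_task` are hypothetical names, and real streams also track updates and deletes, which this model ignores.

```python
class ToyStream:
    """Toy change stream: remembers how far into an append-only table it has read."""

    def __init__(self, table: list):
        self._table = table
        self._offset = 0

    def has_data(self) -> bool:
        """True when rows were appended after the last consume()."""
        return self._offset < len(self._table)

    def consume(self) -> list:
        """Return the unread rows and advance the offset past them."""
        rows = self._table[self._offset:]
        self._offset = len(self._table)
        return rows


def run_task(stream: ToyStream, sink: list) -> int:
    """One scheduled run: skip when nothing is new, otherwise apply the new rows."""
    if not stream.has_data():
        return 0
    rows = stream.consume()
    sink.extend(rows)  # stand-in for the merge into the SCD1/SCD2 tables
    return len(rows)
```

Because the offset only moves when rows are consumed, running the task twice in a row does not apply the same change twice.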
## 📧 Contact
For questions or feedback, please reach out to me at [shivanmthr18@gmail.com](mailto:shivanmthr18@gmail.com).