https://github.com/hoangsonww/end-to-end-data-pipeline
A scalable, production-ready data pipeline for real-time streaming & batch processing, integrating Kafka, Spark, Airflow, AWS, Kubernetes, and MLflow. Supports end-to-end data ingestion, transformation, storage, monitoring, and AI/ML serving with CI/CD automation using Terraform & GitHub Actions.
- Host: GitHub
- URL: https://github.com/hoangsonww/end-to-end-data-pipeline
- Owner: hoangsonww
- License: mit
- Created: 2025-02-15T05:26:35.000Z (4 months ago)
- Default Branch: master
- Last Pushed: 2025-04-09T18:31:41.000Z (about 2 months ago)
- Last Synced: 2025-04-09T22:53:31.055Z (about 2 months ago)
- Topics: airflow, apache, docker, elasticsearch, flink, grafana, great-expectations, hadoop, influxdb, kafka, kubernetes, looker, minio, mlflow, postgresql, prometheus, python, spark, sql, terraform
- Language: Python
- Homepage: https://hoangsonww.github.io/End-to-End-Data-Pipeline/
- Size: 2.61 MB
- Stars: 26
- Watchers: 19
- Forks: 20
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Governance: governance/atlas_stub.py
# End-to-End Data Pipeline with Batch & Streaming Processing
This repository contains a **fully integrated, production-ready data pipeline** that supports both **batch** and **streaming** data processing using open-source technologies. It is designed to be easily configured and deployed by any business or individual with minimal modifications.
The pipeline incorporates:
- **Data Ingestion:**
  - **Batch Sources:** SQL databases (MySQL, PostgreSQL), Data Lakes (MinIO as an S3-compatible store), files (CSV, JSON, XML)
  - **Streaming Sources:** Kafka for event logs, IoT sensor data, and social media streams
- **Data Processing & Transformation:**
  - **Batch Processing:** Apache Spark for large-scale ETL jobs, integrated with Great Expectations for data quality checks
  - **Streaming Processing:** Spark Structured Streaming for real-time data processing and anomaly detection
- **Data Storage:**
  - **Raw Data:** Stored in MinIO (S3-compatible storage)
  - **Processed Data:** Loaded into PostgreSQL for analytics and reporting
- **Data Quality, Monitoring & Governance:**
  - **Data Quality:** Great Expectations validates incoming data
  - **Data Governance:** Apache Atlas / OpenMetadata integration (lineage registration)
  - **Monitoring & Logging:** Prometheus and Grafana for system monitoring and alerting
- **Data Serving & AI/ML Integration:**
  - **ML Pipelines:** MLflow for model tracking and feature store integration
  - **BI & Dashboarding:** Grafana dashboards provide real-time insights
- **CI/CD & Deployment:**
  - **CI/CD Pipelines:** GitHub Actions or Jenkins for continuous integration and deployment
  - **Container Orchestration:** Kubernetes with Argo CD for GitOps deployment

Read this README and follow the step-by-step guide to set up the pipeline on your local machine or cloud environment. Customize the pipeline components, configurations, and example applications to suit your data processing needs.
## Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [Directory Structure](#directory-structure)
3. [Components & Technologies](#components--technologies)
4. [Setup Instructions](#setup-instructions)
5. [Configuration & Customization](#configuration--customization)
6. [Example Applications](#example-applications)
7. [Troubleshooting & Further Considerations](#troubleshooting--further-considerations)
8. [Contributing](#contributing)
9. [License](#license)
10. [Final Notes](#final-notes)

## Architecture Overview
The architecture of the end-to-end data pipeline is designed to handle both batch and streaming data processing. Below is a high-level overview of the components and their interactions:
### Flow Diagram
In short, streaming data arrives through Kafka, Spark processes both the batch and the streaming workloads, and the results are stored in PostgreSQL, which serves as the data warehouse. MinIO provides object storage for raw data, and Airflow orchestrates the end-to-end data flow. Great Expectations enforces data quality checks, while Prometheus and Grafana provide monitoring and alerting. MLflow handles model tracking, and Feast provides feature store integration.
> Note: The diagram(s) may not reflect ALL components in the repository, but it provides a good overview of the main components and their interactions. For instance, I added BI tools like Tableau, Power BI, and Looker to the repo for data visualization and reporting.
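
To make the orchestration order concrete, here is a minimal sketch of how the batch stages depend on one another. It uses only the Python standard library (`graphlib`) rather than Airflow itself, and the task names are hypothetical labels for the stages described above, not identifiers taken from the repository's DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical task names mirroring the batch stages described above.
# Each key maps to the set of tasks that must finish before it can run.
BATCH_DEPENDENCIES = {
    "extract_from_mysql": set(),
    "validate_with_great_expectations": {"extract_from_mysql"},
    "upload_raw_to_minio": {"validate_with_great_expectations"},
    "spark_transform": {"upload_raw_to_minio"},
    "load_into_postgresql": {"spark_transform"},
}


def execution_order(dependencies):
    """Return a run order in which every task follows its prerequisites."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step, task in enumerate(execution_order(BATCH_DEPENDENCIES), start=1):
        print(f"{step}. {task}")
```

An orchestrator such as Airflow resolves the same kind of dependency graph, and adds scheduling, retries, and logging on top of it.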
### Text-Based Pipeline Diagram
```
┌────────────────────────────────┐
│          Batch Source          │
│(MySQL, Files, User Interaction)│
└────────────────┬───────────────┘
                 │
                 │ (Extract/Validate)
                 ▼
┌─────────────────────────────────────┐
│          Airflow Batch DAG          │
│ - Extracts data from MySQL          │
│ - Validates with Great Expectations │
│ - Uploads raw data to MinIO         │
└─────────────────┬───────────────────┘
                  │ (spark-submit)
                  ▼
┌────────────────────────────────┐
│        Spark Batch Job         │
│ - Reads raw CSV from MinIO     │
│ - Transforms, cleans, enriches │
│ - Writes transformed data to   │
│   PostgreSQL & MinIO           │
└──────────────┬─────────────────┘
               │ (Load/Analyze)
               ▼
┌────────────────────────────────┐
│      Processed Data Store      │
│ (PostgreSQL, MongoDB, AWS S3)  │
└───────────────┬────────────────┘
                │ (Query/Analyze)
                ▼
┌────────────────────────────────┐
│        Cache & Indexing        │
│     (Elasticsearch, Redis)     │
└────────────────────────────────┘

Streaming Side:
┌─────────────────────────────┐
│      Streaming Source       │
│           (Kafka)           │
└────────────┬────────────────┘
             │
             ▼
┌───────────────────────────────────┐
│        Spark Streaming Job        │
│ - Consumes Kafka messages         │
│ - Filters and detects anomalies   │
│ - Persists anomalies to           │
│   PostgreSQL & MinIO              │
└───────────────────────────────────┘

Monitoring & Governance:
┌────────────────────────────────┐
│          Monitoring &          │
│     Data Governance Layer      │
│ - Prometheus & Grafana         │
│ - Apache Atlas / OpenMetadata  │
└────────────────────────────────┘

ML & Serving:
┌──────────────────────────────┐
│        AI/ML Serving         │
│ - Feature Store (Feast)      │
│ - MLflow Model Tracking      │
│ - Model training & serving   │
│ - BI Dashboards              │
└──────────────────────────────┘

CI/CD & Terraform:
┌──────────────────────────────┐
│       CI/CD Pipelines        │
│ - GitHub Actions / Jenkins   │
│ - Terraform for Cloud Deploy │
└──────────────────────────────┘

Container Orchestration:
┌──────────────────────────────┐
│      Kubernetes Cluster      │
│ - Argo CD for GitOps         │
│ - Helm Charts for Deployment │
└──────────────────────────────┘
```

### Full Flow Diagram with Backend & Frontend Integration (Optional)
A more detailed flow diagram that includes backend and frontend integration is available in the `assets/` directory. This diagram illustrates how the data pipeline components interact with each other and with external systems, including data sources, storage, processing, visualization, and monitoring.
Although the frontend & backend integration is not included in this repository (since it's supposed to only contain the pipeline), you can easily integrate it with your existing frontend application or create a new one using popular frameworks like React, Angular, or Vue.js.
## Directory Structure
```
end-to-end-pipeline/
├── .devcontainer/                      # VS Code Dev Container settings
├── docker-compose.yaml                 # Docker orchestration for all services
├── docker-compose.ci.yaml              # Docker Compose for CI/CD pipelines
├── End_to_End_Data_Pipeline.ipynb      # Jupyter notebook for pipeline overview
├── requirements.txt                    # Python dependencies for scripts
├── .gitignore                          # Standard Git ignore file
├── README.md                           # Comprehensive documentation (this file)
├── airflow/
│   ├── Dockerfile                      # Custom Airflow image with dependencies
│   ├── requirements.txt                # Python dependencies for Airflow
│   └── dags/
│       ├── batch_ingestion_dag.py      # Batch pipeline DAG
│       └── streaming_monitoring_dag.py # Streaming monitoring DAG
├── spark/
│   ├── Dockerfile                      # Custom Spark image with Kafka and S3 support
│   ├── spark_batch_job.py              # Spark batch ETL job
│   └── spark_streaming_job.py          # Spark streaming job
├── kafka/
│   └── producer.py                     # Kafka producer for simulating event streams
├── storage/
│   ├── aws_s3_influxdb.py              # S3-InfluxDB integration stub
│   ├── hadoop_batch_processing.py      # Hadoop batch processing stub
│   └── mongodb_streaming.py            # MongoDB streaming integration stub
├── great_expectations/
│   ├── great_expectations.yaml         # GE configuration
│   └── expectations/
│       └── raw_data_validation.py      # GE suite for data quality
├── governance/
│   └── atlas_stub.py                   # Dataset lineage registration with Atlas/OpenMetadata
├── monitoring/
│   ├── monitoring.py                   # Python script to set up Prometheus & Grafana
│   └── prometheus.yml                  # Prometheus configuration file
├── ml/
│   ├── feature_store_stub.py           # Feature Store integration stub
│   └── mlflow_tracking.py              # MLflow model tracking
├── kubernetes/
│   ├── argo-app.yaml                   # Argo CD application manifest
│   └── deployment.yaml                 # Kubernetes deployment manifest
├── terraform/                          # Terraform scripts for cloud deployment
└── scripts/
    └── init_db.sql                     # SQL script to initialize MySQL and demo data
```

## Components & Technologies
- **Ingestion & Orchestration:**
  - [Apache Airflow](https://airflow.apache.org/): Schedules batch and streaming jobs.
  - [Kafka](https://kafka.apache.org/): Ingests streaming events.
  - [Spark](https://spark.apache.org/): Processes batch and streaming data.
- **Storage & Processing:**
  - [MinIO](https://min.io/): S3-compatible data lake.
  - [PostgreSQL](https://www.postgresql.org/): Stores transformed and processed data.
  - [Great Expectations](https://greatexpectations.io/): Enforces data quality.
  - [AWS S3](https://aws.amazon.com/s3/): Cloud storage integration.
  - [InfluxDB](https://www.influxdata.com/): Time-series data storage.
  - [MongoDB](https://www.mongodb.com/): NoSQL database integration.
  - [Hadoop](https://hadoop.apache.org/): Big data processing integration.
- **Monitoring & Governance:**
  - [Prometheus](https://prometheus.io/): Metrics collection.
  - [Grafana](https://grafana.com/): Dashboard visualization.
  - [Apache Atlas/OpenMetadata](https://atlas.apache.org/): Data lineage and governance.
- **ML & Data Serving:**
  - [MLflow](https://mlflow.org/): Experiment tracking.
  - [Feast](https://feast.dev/): Feature store for machine learning.
  - [BI Tools](https://grafana.com/): Real-time dashboards and insights.

## Setup Instructions
### Prerequisites
- **Docker** and **Docker Compose** must be installed.
- Ensure that **Python 3.9+** is installed locally if you want to run scripts outside of Docker.
- Open ports required:
  - Airflow: 8080
  - MySQL: 3306
  - PostgreSQL: 5432
  - MinIO: 9000 (and console on 9001)
  - Kafka: 9092
  - Prometheus: 9090
  - Grafana: 3000

### Step-by-Step Guide
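
Before working through the steps, it may help to confirm that none of the ports listed under Prerequisites is already taken by another process. The following is a minimal sketch using only the Python standard library; the `REQUIRED_PORTS` mapping and the helper names are illustrative and not part of the repository.

```python
import socket

# Ports listed under Prerequisites; adjust if your docker-compose.yaml maps them differently.
REQUIRED_PORTS = {
    "Airflow": 8080,
    "MySQL": 3306,
    "PostgreSQL": 5432,
    "MinIO API": 9000,
    "MinIO console": 9001,
    "Kafka": 9092,
    "Prometheus": 9090,
    "Grafana": 3000,
}


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is accepting connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection.
        return sock.connect_ex((host, port)) != 0


def busy_ports(ports=None):
    """Return the services whose ports are already in use."""
    ports = REQUIRED_PORTS if ports is None else ports
    return {name: port for name, port in ports.items() if not port_is_free(port)}


if __name__ == "__main__":
    conflicts = busy_ports()
    if conflicts:
        print("Ports already in use:", conflicts)
    else:
        print("All required ports appear to be free.")
```

Run it before launching the stack; once the stack is up, the same ports are expected to show as busy.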
1. **Clone the Repository**
```bash
git clone https://github.com/hoangsonww/End-to-End-Data-Pipeline.git
cd End-to-End-Data-Pipeline
```

2. **Start the Pipeline Stack**
Use Docker Compose to launch all components:
```bash
docker-compose up --build
```
This command will:
- Build custom Docker images for Airflow and Spark.
- Start MySQL, PostgreSQL, Kafka (with Zookeeper), MinIO, Prometheus, Grafana, and Airflow webserver.
- Initialize the MySQL database with demo data (via `scripts/init_db.sql`).

3. **Access the Services**
- **Airflow UI:** [http://localhost:8080](http://localhost:8080)
  Set up connections:
  - `mysql_default` → Host: `mysql`, DB: `source_db`, User: `user`, Password: `pass`
  - `postgres_default` → Host: `postgres`, DB: `processed_db`, User: `user`, Password: `pass`
- **MinIO Console:** [http://localhost:9001](http://localhost:9001) (User: `minio`, Password: `minio123`)
- **Kafka:** Accessible on port `9092`
- **Prometheus:** [http://localhost:9090](http://localhost:9090)
- **Grafana:** [http://localhost:3000](http://localhost:3000) (Default login: `admin/admin`)

4. **Run Batch Pipeline**
- In the Airflow UI, enable the `batch_ingestion_dag` to run the end-to-end batch pipeline.
- This DAG extracts data from MySQL, validates it, uploads raw data to MinIO, triggers a Spark job for transformation, and loads data into PostgreSQL.

5. **Run Streaming Pipeline**
- Open a terminal and start the Kafka producer:
```bash
docker-compose exec kafka python /opt/spark_jobs/../kafka/producer.py
```
- In another terminal, run the Spark streaming job:
```bash
docker-compose exec spark spark-submit --master local[2] /opt/spark_jobs/spark_streaming_job.py
```
- The streaming job consumes events from Kafka, performs real-time anomaly detection, and writes results to PostgreSQL and MinIO.

6. **Monitoring & Governance**
- **Prometheus & Grafana:**
  Use the `monitoring.py` script (or access Grafana) to view real-time metrics and dashboards.
- **Data Lineage:**
  The `governance/atlas_stub.py` script registers lineage between datasets (can be extended for full Apache Atlas integration).

7. **ML & Feature Store**
- Use `ml/mlflow_tracking.py` to simulate model training and tracking.
- Use `ml/feature_store_stub.py` to integrate with a feature store like Feast.

8. **CI/CD & Deployment**
- Use the `docker-compose.ci.yaml` file to set up CI/CD pipelines.
- Use the `kubernetes/` directory for Kubernetes deployment manifests.
- Use the `terraform/` directory for cloud deployment scripts.
- Use the `.github/workflows/` directory for GitHub Actions CI/CD workflows.

### Next Steps
Congratulations! You have successfully set up the end-to-end data pipeline with batch and streaming processing. However, this is a very general pipeline that needs to be customized for your specific use case.
> Note: Be sure to visit the files and scripts in the repository and change the credentials, configurations, and logic to match your environment and use case. Feel free to extend the pipeline with additional components, services, or integrations as needed.
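
As a reminder of which credentials need changing, the sketch below flags any service that is still configured with one of the demo logins documented in this README (`user`/`pass`, `minio`/`minio123`, `admin`/`admin`). The function name and the shape of the `services` mapping are assumptions made for illustration; how you collect the configured credentials depends on your deployment.

```python
# Demo logins documented in this README; none should survive into a real deployment.
DEMO_CREDENTIALS = {
    ("user", "pass"),       # MySQL and PostgreSQL connections
    ("minio", "minio123"),  # MinIO console
    ("admin", "admin"),     # Grafana default login
}


def find_demo_credentials(services):
    """Return the names of services still using a documented demo login.

    `services` maps a service name to a (user, password) pair.
    """
    return sorted(
        name for name, creds in services.items() if tuple(creds) in DEMO_CREDENTIALS
    )


if __name__ == "__main__":
    configured = {
        "mysql": ("user", "pass"),
        "grafana": ("metrics_admin", "a-long-random-secret"),
    }
    print(find_demo_credentials(configured))
```

A check like this could run as an early CI step so that a deployment with leftover demo logins fails fast.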
## Configuration & Customization
- **Docker Compose:**
  All services are defined in `docker-compose.yaml`. Adjust resource limits, environment variables, and service dependencies as needed.
- **Airflow:**
  Customize DAGs in the `airflow/dags/` directory. Use the provided PythonOperators to integrate custom processing logic.
- **Spark Jobs:**
  Edit transformation logic in `spark/spark_batch_job.py` and `spark/spark_streaming_job.py` to match your data and processing requirements.
- **Kafka Producer:**
  Modify `kafka/producer.py` to simulate different types of events or adjust the batch size and frequency using environment variables.
- **Monitoring:**
  Update `monitoring/monitoring.py` and `prometheus.yml` to scrape additional metrics or customize dashboards. Place Grafana dashboard JSON files in the `monitoring/grafana_dashboards/` directory.
- **Governance & ML:**
  Replace stub implementations in `governance/atlas_stub.py` and `ml/` with real integrations as needed.
- **CI/CD & Deployment:**
  Customize CI/CD workflows in `.github/workflows/` and deployment manifests in `kubernetes/` and `terraform/` for your cloud environment.
- **Storage:**
  Data storage options are in the `storage/` directory with AWS S3, InfluxDB, MongoDB, and Hadoop stubs. Replace these with real integrations or credentials as needed.
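
To illustrate the Kafka producer item above, the following sketch shows one way batch size and send interval could be driven by environment variables. The variable names (`PRODUCER_BATCH_SIZE`, `PRODUCER_INTERVAL_SECONDS`) and the event schema are hypothetical and may not match what `kafka/producer.py` actually uses. The sketch only builds the payloads and does not talk to Kafka.

```python
import json
import os
import random
import time


def read_producer_config(env=None):
    """Read batch size and send interval from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return {
        "batch_size": int(env.get("PRODUCER_BATCH_SIZE", "10")),
        "interval_seconds": float(env.get("PRODUCER_INTERVAL_SECONDS", "1.0")),
    }


def make_event(rng, event_id):
    """Build one synthetic sensor reading (illustrative schema)."""
    return {
        "event_id": event_id,
        "sensor_id": f"sensor-{rng.randint(1, 5)}",
        "temperature": round(rng.uniform(15.0, 95.0), 2),
        "timestamp": time.time(),
    }


def make_batch(batch_size, rng=None, start_id=0):
    """Return `batch_size` JSON-encoded events, ready to hand to a producer."""
    if rng is None:
        rng = random.Random()
    return [
        json.dumps(make_event(rng, start_id + offset)).encode("utf-8")
        for offset in range(batch_size)
    ]


if __name__ == "__main__":
    config = read_producer_config()
    for payload in make_batch(config["batch_size"]):
        print(payload.decode("utf-8"))
```

In a real producer, each payload would be passed to the client's send call for the target topic, with a pause of `interval_seconds` between batches.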
## Example Applications
### E-Commerce & Retail
- **Real-Time Recommendations:**
  Process clickstream data to generate personalized product recommendations.
- **Fraud Detection:**
  Detect unusual purchasing patterns or multiple high-value transactions in real-time.

### Financial Services & Banking
- **Risk Analysis:**
  Aggregate transaction data to assess customer credit risk.
- **Trade Surveillance:**
  Monitor market data and employee trades for insider trading signals.

### Healthcare & Life Sciences
- **Patient Monitoring:**
  Process sensor data from medical devices to alert healthcare providers of critical conditions.
- **Clinical Trial Analysis:**
  Analyze historical trial data for predictive analytics in treatment outcomes.

### IoT & Manufacturing
- **Predictive Maintenance:**
  Monitor sensor data from machinery to predict failures before they occur.
- **Supply Chain Optimization:**
  Aggregate data across manufacturing processes to optimize production and logistics.

### Media & Social Networks
- **Sentiment Analysis:**
  Analyze social media feeds in real-time to gauge public sentiment on new releases.
- **Ad Fraud Detection:**
  Identify and block fraudulent clicks on digital advertisements.

Feel free to use this pipeline as a starting point for your data processing needs. Extend it with additional components, services, or integrations to build a robust, end-to-end data platform.
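
As one concrete illustration of the fraud-detection use case ("multiple high-value transactions"), here is a minimal sliding-window sketch in plain Python. It is not the logic in `spark/spark_streaming_job.py`; the class name, thresholds, and window length are assumptions chosen for the example, and it expects timestamps (in seconds) to arrive in non-decreasing order per account.

```python
from collections import defaultdict, deque


class HighValueBurstDetector:
    """Flag accounts that make several high-value transactions within a time window."""

    def __init__(self, amount_threshold=1000.0, max_count=3, window_seconds=600):
        self.amount_threshold = amount_threshold
        self.max_count = max_count
        self.window_seconds = window_seconds
        # Per-account timestamps of recent high-value transactions.
        self._recent = defaultdict(deque)

    def observe(self, account_id, amount, timestamp):
        """Record a transaction; return True when the account reaches max_count
        high-value transactions inside the window."""
        if amount < self.amount_threshold:
            return False
        recent = self._recent[account_id]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while recent and timestamp - recent[0] > self.window_seconds:
            recent.popleft()
        return len(recent) >= self.max_count


if __name__ == "__main__":
    detector = HighValueBurstDetector()
    transactions = [("acct-1", 1500.0, 0), ("acct-1", 2000.0, 100), ("acct-1", 1200.0, 200)]
    for account, amount, ts in transactions:
        print(account, amount, ts, detector.observe(account, amount, ts))
```

The same per-key windowing idea carries over to a streaming engine, where the state would be kept per account by the framework rather than in a local dictionary.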
## Troubleshooting & Further Considerations
- **Service Not Starting:**
Check Docker logs (`docker-compose logs`) to troubleshoot errors with MySQL, Kafka, Airflow, or Spark.
- **Airflow Connection Issues:**
Verify that connection settings (host, user, password) in the Airflow UI match those in `docker-compose.yaml`.
- **Data Quality Errors:**
Inspect Great Expectations logs in the Airflow DAG runs to adjust expectations and clean data.
- **Resource Constraints:**
For production use, consider scaling out services (e.g., running Spark on a dedicated cluster, using managed Kafka).

## Contributing
Contributions, issues, and feature requests are welcome!
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
6. We will review your changes and merge them into the main branch upon approval.

## License
This project is licensed under the [MIT License](https://opensource.org/licenses/MIT).
## Final Notes
This end-to-end data pipeline is designed for rapid deployment and customization. With minor configuration changes, it can be adapted to many business cases, from real-time analytics and fraud detection to predictive maintenance and advanced ML model training. Enjoy building a data-driven future with this pipeline!
---
Thanks for reading! If you found this repository helpful, please star it and share it with others. For questions, feedback, or suggestions, feel free to reach out to me on [GitHub](https://github.com/hoangsonww).
[**⬆️ Back to top**](#end-to-end-data-pipeline-with-batch--streaming-processing)