Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/fa3001/insightpipeline-airflow-mysql-grafana
This project demonstrates a complete data engineering and data analysis workflow using Airflow and Grafana with Docker.
- Host: GitHub
- URL: https://github.com/fa3001/insightpipeline-airflow-mysql-grafana
- Owner: FA3001
- License: MIT
- Created: 2024-08-02T15:21:37.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2024-08-08T06:33:04.000Z (3 months ago)
- Last Synced: 2024-10-11T06:03:36.241Z (27 days ago)
- Topics: airflow, docker, grafana, mysql
- Language: Jupyter Notebook
- Homepage:
- Size: 548 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# InsightPipeline: ETL Workflow for BI Dashboard
## Overview
Welcome to **InsightPipeline**! This project is all about transforming raw data into actionable insights through a seamless ETL (Extract, Transform, Load) process. By orchestrating data flows with Apache Airflow, analyzing data with Jupyter Notebooks, and visualizing results with Grafana, we bring data to life in a BI (Business Intelligence) dashboard.
![Architecture](./images/BI_Project.jpg)
This project demonstrates a complete data engineering and data analysis workflow using Airflow and Grafana with Docker. It includes the following components:
1. **MySQL** - To store raw and processed data.
2. **Jupyter Notebook** - For initial data analysis and exploratory data analysis (EDA).
3. **Airflow** - To create ETL (Extract, Transform, Load) pipelines.
4. **Grafana** - For visualizing data and creating dashboards.

## Project Structure
The project is divided into four main directories:
- `mysql`
- `notebook`
- `airflow`
- `grafana`

Each directory contains a `README.md` file with detailed instructions on how to run the respective component.
## Workflow
### 1. Data Sources
The project uses two data sources (a minimal read example follows the list):
- A MySQL database
- A CSV file
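As a rough illustration of reading both sources into one workspace, here is a minimal sketch using `pandas` and SQLAlchemy. The connection string, table name, and CSV path are assumptions for illustration, not values taken from the repository:

```python
# Hypothetical example: reading both data sources into pandas DataFrames.
# Credentials, table name, and file path below are placeholders.
import pandas as pd
from sqlalchemy import create_engine

# MySQL source (host/user/password/database are assumed values)
engine = create_engine("mysql+pymysql://user:password@localhost:3306/insight_db")
db_df = pd.read_sql("SELECT * FROM raw_events", engine)

# CSV source (path is a placeholder)
csv_df = pd.read_csv("data/raw_data.csv")

print(db_df.shape, csv_df.shape)
```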
### 2. Data Analysis
Data is first analyzed using Jupyter Notebooks (a representative cell is sketched after the list). This step includes:
- Exploring the raw data
- Cleaning and preprocessing the data
- Performing initial transformations and analysis
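The notebooks themselves are not reproduced here, but a typical cleaning-and-exploration cell might look like the following sketch; the column names (`amount`, `created_at`) are hypothetical:

```python
# Hypothetical EDA cell: column names are placeholders, not from the project.
import pandas as pd

df = pd.read_csv("data/raw_data.csv")

# Quick structural overview
print(df.info())
print(df.describe())

# Basic cleaning: drop duplicates, fill missing numeric values
df = df.drop_duplicates()
df["amount"] = df["amount"].fillna(df["amount"].median())

# Simple transformation: parse dates and aggregate by day
df["created_at"] = pd.to_datetime(df["created_at"])
daily = df.groupby(df["created_at"].dt.date)["amount"].sum()
print(daily.head())
```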
### 3. ETL Pipeline
Apache Airflow is used to automate the ETL process (a DAG sketch follows the list). The ETL pipeline includes:
- **Extract**: Reading data from the MySQL database and CSV file.
- **Transform**: Cleaning, merging, and transforming the data.
- **Load**: Loading the processed data back into the MySQL database.
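The project's actual DAGs live in the `airflow` directory; as a hedged sketch of what an extract-transform-load DAG of this shape can look like with the Airflow 2.x TaskFlow API, consider the following. Task bodies, paths, and the connection string are illustrative assumptions only:

```python
# Illustrative Airflow 2.x TaskFlow DAG; task contents are placeholders.
from datetime import datetime
import pandas as pd
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 8, 1), catchup=False)
def insight_etl():
    @task
    def extract() -> dict:
        # Read the raw source; the path is an assumed value.
        df = pd.read_csv("/opt/airflow/data/raw_data.csv")
        return df.to_dict()

    @task
    def transform(raw: dict) -> dict:
        # Minimal placeholder transformation
        df = pd.DataFrame(raw).drop_duplicates()
        return df.to_dict()

    @task
    def load(clean: dict) -> None:
        # Write back to MySQL; the connection string is an assumed value.
        from sqlalchemy import create_engine
        engine = create_engine("mysql+pymysql://user:password@mysql:3306/insight_db")
        pd.DataFrame(clean).to_sql("processed_data", engine,
                                   if_exists="replace", index=False)

    load(transform(extract()))

insight_etl()
```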
### 4. Data Visualization
Grafana is used to visualize the processed data. Dashboards are created to provide insights and track key metrics.
## Running the Project
The entire project can be run using Docker and Docker Compose. This ensures a consistent and reproducible environment.
### Setup
1. **Clone the repository:**
```bash
git clone https://github.com/fa3001/insightpipeline-airflow-mysql-grafana
cd insightpipeline-airflow-mysql-grafana
```

2. **Navigate to each component's directory and follow the instructions in its `README.md` file** to set up and run the individual services:
- **MySQL**: Set up and run the MySQL database.
- **Airflow**: Set up and run the Airflow service.
- **Grafana**: Set up and run the Grafana service.

## Aim
The aim of this project is to test and demonstrate a complete data engineering and data analysis architecture using Airflow and Grafana with Docker. This allows me to enhance my skills in Docker, Airflow (using Python), Grafana, and data analysis in general.
## Evolution Plan
1. **Apache Spark & PySpark**
*Idea*: Incorporate Spark for handling large-scale data transformations.
*Reason*: To leverage distributed computing for improved performance.
*Implementation*: Use PySpark to transform data before loading into the target database (a minimal sketch follows).
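Purely as an illustration of that plan (this is not part of the current project), a PySpark transformation feeding a JDBC load might look like this; table names, columns, and the connection URL are assumptions:

```python
# Hypothetical PySpark sketch for the planned Spark step; all names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("insight-transform").getOrCreate()

# Read the raw CSV, deduplicate, and aggregate per day
raw = spark.read.csv("data/raw_data.csv", header=True, inferSchema=True)
daily = (
    raw.dropDuplicates()
       .withColumn("day", F.to_date("created_at"))
       .groupBy("day")
       .agg(F.sum("amount").alias("total_amount"))
)

# Write to MySQL over JDBC (URL and credentials are assumed values)
daily.write.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/insight_db",
    dbtable="daily_totals",
    user="user",
    password="password",
    driver="com.mysql.cj.jdbc.Driver",
).mode("overwrite").save()
```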
2. **Kafka and CDC (Change Data Capture)**
*Idea*: Integrate Kafka for real-time data streaming and CDC for tracking changes in MySQL.
*Reason*: To handle real-time data updates and ensure the dashboard reflects the latest information.
*Implementation*: Use Kafka with tools like Debezium or Maxwell's Daemon for CDC (a consumer sketch follows).
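To make that idea concrete, a consumer for Debezium-style change events could be as small as the sketch below; the topic name and broker address are assumptions, and `kafka-python` is just one possible client:

```python
# Hypothetical consumer for Debezium-style CDC events; topic/broker are placeholders.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "mysql.insight_db.raw_events",  # Debezium topics follow server.db.table naming
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    change = message.value
    # Debezium wraps each row change in a payload with op ('c', 'u', 'd')
    # plus before/after row images
    payload = change.get("payload", {})
    print(payload.get("op"), payload.get("after"))
```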
3. **Monitoring Tools**
*Idea*: Implement monitoring tools to track the performance and health of the ETL pipeline.
*Reason*: To ensure reliability and quickly address any issues.
*Implementation*: Use tools like Prometheus and Grafana for monitoring (a metrics sketch follows).
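One lightweight way to surface pipeline metrics to Prometheus from Python is the official `prometheus_client` library; the metric names, port, and simulated values below are illustrative only:

```python
# Hypothetical pipeline metrics exposed for Prometheus scraping; names are placeholders.
import time
import random
from prometheus_client import Counter, Gauge, start_http_server

rows_loaded = Counter("etl_rows_loaded_total", "Rows loaded into MySQL")
last_run_duration = Gauge("etl_last_run_duration_seconds", "Duration of the last ETL run")

# Expose metrics on :8000/metrics for Prometheus to scrape
start_http_server(8000)

while True:
    start = time.time()
    # ... run one ETL cycle here; the row count below is simulated ...
    rows_loaded.inc(random.randint(100, 1000))
    last_run_duration.set(time.time() - start)
    time.sleep(60)
```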
## Acknowledgements
- [Apache Airflow](https://airflow.apache.org/)
- [Docker](https://www.docker.com/)
- [Grafana](https://grafana.com/)
- [Jupyter](https://jupyter.org/)