https://github.com/fa3001/insightpipeline-airflow-mysql-grafana

This project demonstrates a complete data engineering and data analysis workflow using Airflow and Grafana with Docker.
airflow docker grafana mysql

# InsightPipeline: ETL Workflow for BI Dashboard 🚀📊

## Overview

Welcome to **InsightPipeline**! This project turns raw data into a BI (Business Intelligence) dashboard through an ETL (Extract, Transform, Load) process: Apache Airflow orchestrates the data flows, Jupyter Notebooks support the analysis, and Grafana visualizes the results.

![Architecture](./images/BI_Project.jpg)

This project demonstrates a complete data engineering and data analysis workflow using Airflow and Grafana with Docker. It includes the following components:
1. **MySQL** - To store raw and processed data.
2. **Jupyter Notebook** - For exploratory data analysis (EDA).
3. **Airflow** - To create the ETL pipelines.
4. **Grafana** - For visualizing data and creating dashboards.

## Project Structure 📁

The project is divided into four main directories:

- `mysql` 🗄️
- `notebook` 📒
- `airflow` 🌬️
- `grafana` 📊

Each directory contains a `README.md` file with detailed instructions on how to run the respective component.

## Workflow 🔄

### 1. Data Sources 📂

The project uses two data sources:
- A MySQL database 🗃️
- A CSV file 📑

### 2. Data Analysis 🔍

Data is first analyzed using Jupyter Notebooks. This step includes:
- Exploring the raw data 🧐
- Cleaning and preprocessing the data 🧹
- Performing initial transformations and analysis 🔬
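
The exploration and cleaning steps above might look roughly like the following. This is a minimal sketch using only the Python standard library; the column names (`order_id`, `customer`, `amount`) and the sample rows are hypothetical and are not taken from the project's notebooks, which may well use pandas instead.

```python
import csv
import io

# Hypothetical raw CSV content; the real project reads its own file.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,
2,bob,
3, Carol ,7.25
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def count_missing(rows):
    """Explore: count empty values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() == "":
                counts[column] = counts.get(column, 0) + 1
    return counts


def clean(rows):
    """Clean: trim whitespace, drop rows without an amount, drop repeated ids."""
    seen = set()
    cleaned = []
    for row in rows:
        amount = (row["amount"] or "").strip()
        if not amount or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().lower(),
            "amount": float(amount),
        })
    return cleaned


rows = load_rows(RAW_CSV)
missing = count_missing(rows)
cleaned = clean(rows)
print(missing)
print(cleaned)
```

Counting missing values first makes it easier to decide which cleaning rules are actually needed before they are moved into the automated pipeline.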

### 3. ETL Pipeline 🚚

Apache Airflow is used to automate the ETL process. The ETL pipeline includes:
- **Extract**: Reading data from the MySQL database and CSV file. 📤
- **Transform**: Cleaning, merging, and transforming the data. 🛠️
- **Load**: Loading the processed data back into the MySQL database. 📥
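
The three stages above can be sketched as plain Python functions. This is an illustrative sketch, not the project's actual DAG: the table names (`raw_orders`, `sales_by_country`), the columns, and the sample data are all hypothetical, and an in-memory SQLite database stands in for MySQL so the example runs without any services.

```python
import csv
import io
import sqlite3

# Hypothetical CSV source; stands in for the project's CSV file.
CSV_SOURCE = """customer_id,country
1,IT
2,FR
"""


def extract(conn, csv_text):
    """Extract: read orders from the database and customers from the CSV."""
    orders = conn.execute(
        "SELECT order_id, customer_id, amount FROM raw_orders"
    ).fetchall()
    customers = list(csv.DictReader(io.StringIO(csv_text)))
    return orders, customers


def transform(orders, customers):
    """Transform: join orders to customers and total the amount per country."""
    country_by_id = {int(c["customer_id"]): c["country"] for c in customers}
    totals = {}
    for _order_id, customer_id, amount in orders:
        country = country_by_id.get(customer_id)
        if country is None:  # skip orders with no matching customer
            continue
        totals[country] = totals.get(country, 0.0) + amount
    return sorted(totals.items())


def load(conn, rows):
    """Load: write the processed rows back into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_country "
        "(country TEXT PRIMARY KEY, total REAL)"
    )
    # INSERT OR REPLACE is SQLite syntax; MySQL would need its own upsert form.
    conn.executemany("INSERT OR REPLACE INTO sales_by_country VALUES (?, ?)", rows)
    conn.commit()


# In-memory SQLite is used only so the sketch runs anywhere;
# the project itself targets MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, customer_id INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5), (4, 9, 1.0)],
)

orders, customers = extract(conn, CSV_SOURCE)
load(conn, transform(orders, customers))
result = conn.execute(
    "SELECT country, total FROM sales_by_country ORDER BY country"
).fetchall()
print(result)
```

In an Airflow setup, each of these functions would typically become its own task so that a failure in one stage can be retried without rerunning the others.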

### 4. Data Visualization 📈

Grafana is used to visualize the processed data. Dashboards are created to provide insights and track key metrics. 📊
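
A dashboard panel backed by a SQL data source is usually driven by an aggregate query over the processed tables. The sketch below shows one such query, run here through Python against an in-memory SQLite database purely for illustration; the `processed_sales` table, its columns, and the sample rows are hypothetical, and the project's real panels query MySQL.

```python
import sqlite3

# Hypothetical processed table, standing in for the MySQL output of the ETL step.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_sales (sale_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO processed_sales VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5)],
)

# One row per day: the shape a time-series panel would plot.
PANEL_QUERY = """
SELECT sale_date AS day, SUM(amount) AS total_sales
FROM processed_sales
GROUP BY sale_date
ORDER BY sale_date
"""

daily_totals = conn.execute(PANEL_QUERY).fetchall()
print(daily_totals)
```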

## Running the Project 🏃‍♂️

The entire project can be run using Docker and Docker Compose. This ensures a consistent and reproducible environment.

### Setup ⚙️

1. **Clone the repository:**
```bash
git clone https://github.com/LorenzoLaMura/InsightPipeline
cd InsightPipeline
```

2. **Navigate to each component's directory and follow the instructions in its `README.md` file** to set up and run the individual services:

- **MySQL**: Set up and run the MySQL database. 🗄️
- **Airflow**: Set up and run the Airflow service. 🌬️
- **Grafana**: Set up and run the Grafana service. 📊

## Aim 🎯

The aim of this project is to test and demonstrate a complete data engineering and data analysis architecture using Airflow and Grafana with Docker. This allows me to enhance my skills in Docker, Airflow (using Python), Grafana, and data analysis in general.

## Evolution Plan 🛤️

1. **Apache Spark & PySpark** 🚀

*Idea*: Incorporate Spark for handling large-scale data transformations.

*Reason*: To leverage distributed computing for improved performance.

*Implementation*: Use PySpark to transform data before loading into the target database.

2. **Kafka and CDC (Change Data Capture)** 📡

*Idea*: Integrate Kafka for real-time data streaming and CDC for tracking changes in MySQL.

*Reason*: To handle real-time data updates and ensure the dashboard reflects the latest information.

*Implementation*: Use Kafka with tools like Debezium or Maxwell's Daemon for CDC.

3. **Monitoring Tools** 🔍

*Idea*: Implement monitoring tools to track the performance and health of the ETL pipeline.

*Reason*: To ensure reliability and quickly address any issues.

*Implementation*: Use tools like Prometheus and Grafana for monitoring.

## Acknowledgements 🙏

- [Apache Airflow](https://airflow.apache.org/) 🌬️
- [Docker](https://www.docker.com/) 🐳
- [Grafana](https://grafana.com/) 📊
- [Jupyter](https://jupyter.org/) 📒