https://github.com/karo23361/etl_pandas_airflow
ETL Using Python and Airflow
- Host: GitHub
- URL: https://github.com/karo23361/etl_pandas_airflow
- Owner: karo23361
- Created: 2025-07-01T07:22:35.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2025-07-02T14:34:56.000Z (3 months ago)
- Last Synced: 2025-07-02T15:32:11.767Z (3 months ago)
- Topics: airflow, airflow-dags, dashboard, docker, docker-compose, etl, etl-pipeline, powerbi, powerbi-visuals, python
- Language: Jupyter Notebook
- Size: 561 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: Readme.md
# ETL Pipeline: Pandas + Airflow + Power BI
**🚀 Project Goal:**
Automated data preparation, transformation, and visualization of sales data using Jupyter Notebook, Apache Airflow, and Power BI.

---
## 🔧 Technologies
- **Pandas** – for data cleaning and transformation
- **Apache Airflow** – for scheduling and automation
- **Docker + docker-compose** – for environment setup
- **Power BI** – for dashboard and reporting
---

## 🗂️ Project Structure
```
├── dags/                     # Airflow DAG files
├── data/
│   └── dirty_cafe_sales.csv  # Raw sales data
├── ETL_TEST.ipynb            # Exploratory ETL notebook
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
└── README.md
```

---
## 1. Data Preparation (Jupyter Notebook)
In the **`ETL_TEST.ipynb`** file, the following steps were performed:
1. Loaded the data from `dirty_cafe_sales.csv`
2. Performed initial exploration and cleaning with Pandas:
   - removed error values, handled nulls, and corrected data types
3. Ran a preliminary analysis: visualizations and descriptive statistics
4. Saved the cleaned data to a CSV file ready for the Airflow pipeline

---
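The cleaning steps above might look roughly like the following sketch. Note that the column names (`Item`, `Quantity`, `Price Per Unit`) and the error markers are assumptions for illustration; the actual schema of `dirty_cafe_sales.csv` may differ.

```python
import numpy as np
import pandas as pd

def clean_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Hedged sketch of the notebook's cleaning steps (assumed columns)."""
    out = df.copy()
    # Treat placeholder error strings as missing values
    out = out.replace({"ERROR": np.nan, "UNKNOWN": np.nan})
    # Correct data types: quantity and unit price should be numeric
    out["Quantity"] = pd.to_numeric(out["Quantity"], errors="coerce")
    out["Price Per Unit"] = pd.to_numeric(out["Price Per Unit"], errors="coerce")
    # Drop rows that are missing essential values
    out = out.dropna(subset=["Item", "Quantity", "Price Per Unit"])
    # Derive a revenue column for later aggregation
    out["Total"] = out["Quantity"] * out["Price Per Unit"]
    return out

# Toy dirty data standing in for dirty_cafe_sales.csv
raw = pd.DataFrame({
    "Item": ["Salad", "Juice", "ERROR", "Juice"],
    "Quantity": ["2", "1", "3", "UNKNOWN"],
    "Price Per Unit": ["10.0", "4.5", "4.5", "4.5"],
})
clean = clean_sales(raw)
clean.to_csv("clean_cafe_sales.csv", index=False)  # ready for the Airflow pipeline
```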
## 2. ETL in Apache Airflow
Inside the `dags/` folder, the file `etl_dag.py` defines a DAG with the following logic:
- **Trigger:** Scheduled or manually triggered
- **Steps:**
  1. *transform* – load the data and apply further cleaning/aggregation with Pandas
  2. *load* – save the final output CSV to a target folder (e.g., `data/`)
- The pipeline runs automatically based on a defined schedule (e.g., daily)

**📌 DAG diagram:**
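As a sketch, the two steps could be implemented as plain callables that `etl_dag.py` wraps in Airflow operators. The column names and file names here are assumptions, not taken from the repository; the Airflow wiring is shown only in comments so the snippet stays self-contained.

```python
import pandas as pd

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Further cleaning/aggregation: total revenue per item (assumed columns)."""
    df = df.dropna(subset=["Item", "Quantity", "Price Per Unit"]).copy()
    df["Total"] = df["Quantity"] * df["Price Per Unit"]
    return df.groupby("Item", as_index=False)["Total"].sum()

def load(df: pd.DataFrame, target: str) -> None:
    """Save the final output CSV to the target folder (e.g., data/)."""
    df.to_csv(target, index=False)

# In dags/etl_dag.py the tasks would be chained roughly as:
#   transform_task = PythonOperator(task_id="transform", python_callable=...)
#   load_task = PythonOperator(task_id="load", python_callable=...)
#   transform_task >> load_task

demo = transform(pd.DataFrame({
    "Item": ["Juice", "Juice", "Salad"],
    "Quantity": [1, 2, 1],
    "Price Per Unit": [4.5, 4.5, 10.0],
}))
load(demo, "sales_by_item.csv")
```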
---
## 3. Analysis & Dashboard in Power BI 📊
The final stage involves loading the cleaned data from Airflow into Power BI, where a dashboard is created with:

- **Total Sales**
  - Displays the cumulative revenue for the dataset.
  - Example: **88.51k PLN**
- **Total Items Sold**
  - Shows the total number of units sold across all products.
  - Example: **29.97k items**
- **Monthly Revenue vs. Target (KPI)**
  - KPI visualization comparing monthly revenue with a defined target.
  - Includes the actual value and the deviation from the goal.
  - Example: **7.52k PLN** against a target of **7.26k PLN** (**+3.55%**)
- **Items Sold per Month**
  - Line chart showing the number of items sold each month from January to December.
  - Highlights trends and seasonality in sales volume.
- **Total Product Sales (by Revenue)**
  - Horizontal bar chart ranking products by total sales value.
  - Example: top-selling product by revenue is **Salad (19.1k PLN)**
- **Total Items Sold (by Quantity)**
  - Vertical bar chart comparing total quantity sold by product.
  - Example: most sold item by quantity is **Juice (4.2k items)**

---
**📌 Power BI dashboard:**
---
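The dashboard aggregates can also be reproduced in pandas as a sanity check before importing into Power BI. This is a sketch over toy data; the figures in the README (88.51k PLN, etc.) come from the real dataset, and the column names are assumptions.

```python
import pandas as pd

# Toy stand-in for the cleaned output CSV (assumed schema)
sales = pd.DataFrame({
    "Item": ["Salad", "Juice", "Juice", "Cake"],
    "Quantity": [2, 3, 1, 4],
    "Total": [20.0, 13.5, 4.5, 12.0],
    "Month": ["2023-01", "2023-01", "2023-02", "2023-02"],
})

total_sales = sales["Total"].sum()                    # "Total Sales" card
total_items = sales["Quantity"].sum()                 # "Total Items Sold" card
items_per_month = sales.groupby("Month")["Quantity"].sum()      # line chart
top_by_revenue = sales.groupby("Item")["Total"].sum().idxmax()  # revenue bar chart
```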
## 🖥️ How to Run the Project
1. **Initialize the Docker containers:**
```
docker compose up airflow-init
docker compose up
```
2. **Access the Airflow UI:**
   Go to [http://localhost:8080](http://localhost:8080), trigger the DAG, and check the logs and outputs in the `data` folder.
3. **Load the data into Power BI Desktop:**
   Import the output CSV from `data`, refresh the source, and build your dashboard.

---
## ✅ Summary
- Data was **initially prepared in a Jupyter Notebook**
- A **DAG was then created in Apache Airflow** to automate the ETL process (extract-transform-load)
- Finally, the clean data was **analyzed in Power BI**, resulting in a professional dashboard