Sales Data Lakehouse Pipeline using Azure & Databricks
- Host: GitHub
- URL: https://github.com/gaur4301/salesanalyticspipeline-using-azuredatafactory-databricks-and-medallionarchitecture
- Owner: Gaur4301
- Created: 2025-06-22T12:13:06.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-07-07T17:44:34.000Z (3 months ago)
- Last Synced: 2025-07-07T18:49:29.187Z (3 months ago)
- Topics: azure, azuredatafactory, azuresqldatabase, databricks, datalakehouse, unitycatalog
- Language: Python
- Homepage:
- Size: 53.7 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Sales Data Lakehouse Pipeline using Azure & Databricks
This project demonstrates how to build a **dynamic, end-to-end sales data pipeline** using modern data engineering tools like **Azure Data Factory**, **Databricks**, **PySpark**, and **Delta Lake**. It follows the **Medallion Architecture** to organize data into Bronze, Silver, and Gold layers and supports both **initial and incremental data loads**.
---
## What's in the Project?
We're working with two CSV files:
- `salesdata.csv`: Full initial data
- `incremental.csv`: New/updated records

Both files are stored on GitHub and loaded into **Azure SQL Database** using a **dynamic Azure Data Factory pipeline**.
---
## Step-by-Step Workflow
### 1. Data Ingestion
- Use a dynamic **ADF pipeline** to push both CSV files into Azure SQL Database.

### 2. Bronze Layer – Raw Zone
- Load data from Azure SQL to **Azure Data Lake Storage Gen2**.
- Save as **Parquet files** in the Bronze layer.
- Support both full and incremental loads, as sketched below.
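
A minimal PySpark sketch of what the Bronze load could look like, assuming a Databricks notebook (where `spark` and `dbutils` are predefined), a JDBC connection to the Azure SQL database, and an ADLS Gen2 container named `bronze`. The server, table, secret scope, and paths are placeholders, not values from this repository.

```python
# Bronze load sketch: read the sales table from Azure SQL over JDBC and land
# it as Parquet in the Bronze zone of ADLS Gen2. All connection details and
# paths below are placeholders.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

bronze_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.sales")  # hypothetical source table
    .option("user", dbutils.secrets.get("<secret-scope>", "sql-user"))
    .option("password", dbutils.secrets.get("<secret-scope>", "sql-password"))
    .load()
)

# Overwrite for the initial load; an incremental run could instead filter on a
# watermark column (e.g. a last-modified date) and append.
(
    bronze_df.write
    .mode("overwrite")
    .parquet("abfss://bronze@<storageaccount>.dfs.core.windows.net/sales/")
)
```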
### 3. Unity Catalog Setup
- Set up **Unity Catalog** in Databricks for schema and access control.
- Create schemas for the Bronze, Silver, and Gold zones (see the sketch below).
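
A minimal sketch of the catalog and schema setup, run from a Databricks notebook in a Unity Catalog-enabled workspace. The catalog name `sales_lakehouse` and the `analysts` group are illustrative, not names from this repository.

```python
# Unity Catalog setup sketch: one catalog, one schema per Medallion layer.
spark.sql("CREATE CATALOG IF NOT EXISTS sales_lakehouse")

for layer in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS sales_lakehouse.{layer}")

# Example of schema-level access control (readers also need USE CATALOG on
# the catalog itself to query these tables).
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA sales_lakehouse.gold TO `analysts`")
```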
### 4. Silver Layer – Cleaned Zone
- Use **Databricks notebooks** to process incremental Bronze data.
- Clean and prepare the data for analytics.
- Store cleaned data as Parquet in the Silver layer, as sketched below.
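
A hedged sketch of the Silver-layer cleaning step: read the Bronze Parquet, apply basic cleanup, and write to the Silver zone. The column names (`order_id`, `order_date`, `amount`) and storage paths are assumptions for illustration.

```python
from pyspark.sql import functions as F

# Silver-layer sketch: basic de-duplication, null filtering, and type casting.
bronze_path = "abfss://bronze@<storageaccount>.dfs.core.windows.net/sales/"
silver_path = "abfss://silver@<storageaccount>.dfs.core.windows.net/sales/"

silver_df = (
    spark.read.parquet(bronze_path)
    .dropDuplicates(["order_id"])                 # de-duplicate on the key
    .filter(F.col("order_date").isNotNull())      # drop records without a date
    .withColumn("order_date", F.to_date("order_date"))
    .withColumn("amount", F.col("amount").cast("double"))
)

silver_df.write.mode("overwrite").parquet(silver_path)
```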
### 5. Gold Layer – Curated Zone
- Create **fact and dimension tables** using PySpark.
- Use **MERGE statements** to handle Slowly Changing Dimensions (SCD).
- Store output as **Delta tables** in the Gold layer (see the MERGE sketch below).
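
A sketch of the Gold-layer upsert using the Delta Lake Python MERGE API, shown here as a simple SCD Type 1 update (changed attributes overwritten in place) rather than the repository's exact MERGE statements. Table, column, and path names are illustrative.

```python
from delta.tables import DeltaTable

# Gold-layer sketch: merge cleaned Silver records into a Delta dimension table.
silver_path = "abfss://silver@<storageaccount>.dfs.core.windows.net/sales/"
updates_df = spark.read.parquet(silver_path).select("customer_id", "name", "city")

dim_customer = DeltaTable.forName(spark, "sales_lakehouse.gold.dim_customer")

(
    dim_customer.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()     # overwrite changed attributes (SCD Type 1)
    .whenNotMatchedInsertAll()  # insert brand-new customers
    .execute()
)
```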
### 6. Job Orchestration
- Build a **Databricks job** that runs all notebooks in order.
- Automate the entire ETL process with a single trigger; a driver-notebook sketch follows.
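
The repository wires this up as a Databricks job; as a rough equivalent, a driver notebook could call the layer notebooks in sequence with `dbutils.notebook.run`. The notebook paths and the one-hour timeout below are assumptions.

```python
# Orchestration sketch: run the layer notebooks in order from a driver
# notebook. Paths and the 3600-second timeout are placeholders.
for notebook_path in [
    "/Workspace/Pipelines/bronze_load",
    "/Workspace/Pipelines/silver_clean",
    "/Workspace/Pipelines/gold_merge",
]:
    result = dbutils.notebook.run(notebook_path, 3600)
    print(f"{notebook_path} returned: {result}")
```

---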
## Final Output
You get a structured and optimized **Sales Analytics Model**, built on:
- Delta Lakehouse
- Medallion architecture
- Scalable ETL design

This model can be connected to Power BI or Databricks dashboards for business reporting and analytics.
---
## Why This Project Matters
- Covers real-world challenges like **incremental loading** and **SCD handling**
- Uses **best practices in data lakehouse architecture**
- Demonstrates **governance** with Unity Catalog
- Ideal for showcasing in **data engineering interviews or portfolios**

---
## Tools & Technologies
- **Azure Data Factory** (Dynamic Pipelines)
- **Azure SQL Database**
- **Azure Data Lake Storage Gen2**
- **Databricks & PySpark**
- **Unity Catalog**
- **Parquet & Delta Lake**

---