# πŸ”„ Sales Data Lakehouse Pipeline using Azure & Databricks

This project demonstrates how to build a **dynamic, end-to-end sales data pipeline** using modern data engineering tools like **Azure Data Factory**, **Databricks**, **PySpark**, and **Delta Lake**. It follows the **Medallion Architecture** to organize data into Bronze, Silver, and Gold layers and supports both **initial and incremental data loads**.

---

## 🧾 What’s in the Project?

We’re working with two CSV files:
- `salesdata.csv`: the full initial dataset
- `incremental.csv`: new and updated records

Both files are stored on GitHub and loaded into **Azure SQL Database** using a **dynamic Azure Data Factory pipeline**.

---

## πŸ—οΈ Step-by-Step Workflow

### 1. Data Ingestion
- Use **ADF dynamic pipeline** to push both CSV files into Azure SQL Database.
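
For orientation, here is roughly what that copy step accomplishes, sketched in PySpark rather than ADF JSON. This is illustrative only: the project itself uses a parameterized ADF Copy activity, and the landing path, JDBC settings, and table name below are placeholders.

```python
# Illustrative sketch only: the project performs this copy with an ADF dynamic pipeline.
# Landing path, JDBC settings, and table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
props = {"user": "<user>", "password": "<password>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

for file_name in ("salesdata.csv", "incremental.csv"):
    df = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv(f"/mnt/raw/{file_name}"))          # hypothetical landing path for the CSVs
    # Append so the incremental file adds to, rather than replaces, existing rows
    df.write.jdbc(url=jdbc_url, table="dbo.sales", mode="append", properties=props)
```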

### 2. Bronze Layer – Raw Zone
- Load data from Azure SQL to **Azure Data Lake Storage Gen2**.
- Save as **Parquet files** in the Bronze layer.
- Support both full and incremental loads.
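
A minimal PySpark sketch of this Bronze load follows; the watermark column (`last_modified`), table name, and storage paths are assumptions, not taken from the repository.

```python
# Sketch of the Bronze load: pull from Azure SQL over JDBC and land Parquet in ADLS Gen2.
# Watermark column, table name, and paths are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
props = {"user": "<user>", "password": "<password>",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

load_type = "incremental"        # or "full" for the initial load
last_watermark = "2024-01-01"    # hypothetical: usually read from a control table

src = "(SELECT * FROM dbo.sales) s" if load_type == "full" else \
      f"(SELECT * FROM dbo.sales WHERE last_modified > '{last_watermark}') s"

bronze_df = spark.read.jdbc(url=jdbc_url, table=src, properties=props)

(bronze_df.write
    .mode("append")              # the initial full load could use overwrite instead
    .parquet("abfss://bronze@<storageaccount>.dfs.core.windows.net/sales/"))
```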

### 3. Unity Catalog Setup
- Set up **Unity Catalog** in Databricks for schema and access control.
- Create schemas for Bronze, Silver, and Gold zones.
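
The catalog and schema names below are assumptions; this is roughly what the setup notebook needs to run (in a Databricks notebook, `spark` is already defined).

```python
# Sketch of the Unity Catalog setup; catalog/schema names and the grant are illustrative.
spark.sql("CREATE CATALOG IF NOT EXISTS sales_lakehouse")
spark.sql("USE CATALOG sales_lakehouse")

for zone in ("bronze", "silver", "gold"):
    spark.sql(f"CREATE SCHEMA IF NOT EXISTS {zone}")

# Example governance rule: let an analyst group query the curated layer only
spark.sql("GRANT USE SCHEMA, SELECT ON SCHEMA gold TO `data-analysts`")
```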

### 4. Silver Layer – Cleaned Zone
- Use **Databricks notebooks** to process incremental Bronze data.
- Clean and prepare the data for analytics.
- Store cleaned data as Parquet in the Silver layer.
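
A sketch of the Silver cleaning step is below; the column names (`order_id`, `order_date`, `amount`) are assumptions chosen for illustration.

```python
# Sketch of a Silver-layer cleaning notebook; column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

bronze_path = "abfss://bronze@<storageaccount>.dfs.core.windows.net/sales/"
silver_path = "abfss://silver@<storageaccount>.dfs.core.windows.net/sales/"

silver_df = (spark.read.parquet(bronze_path)
    .filter(F.col("order_id").isNotNull())               # drop rows missing the business key
    .dropDuplicates(["order_id"])                        # remove repeated records
    .withColumn("order_date", F.to_date("order_date"))   # normalize the date type
    .withColumn("amount", F.col("amount").cast("double")))

silver_df.write.mode("overwrite").parquet(silver_path)
```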

### 5. Gold Layer – Curated Zone
- Create **fact and dimension tables** using PySpark.
- Use **MERGE statements** to handle Slowly Changing Dimensions (SCD).
- Store output as **Delta Tables** in the Gold layer.
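
The MERGE pattern for the dimension tables looks roughly like the sketch below (SCD Type 1 shown for brevity; table and column names are assumptions, and the target Delta table is assumed to exist already).

```python
# Sketch of an SCD merge into a Gold dimension (Type 1: overwrite changed attributes).
# Table and column names are assumptions; the target Delta table must already exist.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

updates_df = (spark.read
    .parquet("abfss://silver@<storageaccount>.dfs.core.windows.net/sales/")
    .select("customer_id", "customer_name", "city")
    .dropDuplicates(["customer_id"]))

dim_customer = DeltaTable.forName(spark, "sales_lakehouse.gold.dim_customer")

(dim_customer.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()       # overwrite attributes that changed
    .whenNotMatchedInsertAll()    # insert customers seen for the first time
    .execute())
```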

### 6. Job Orchestration
- Build a **Databricks job** that runs all notebooks in order.
- Automate the entire ETL process with a single trigger.
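
The repository wires this up as a Databricks Job (Workflows). As a lightweight illustration of the same ordering, a driver notebook could chain the notebooks with `dbutils.notebook.run`; the notebook paths and the `load_type` parameter are assumptions.

```python
# Illustrative driver notebook: the project orchestrates these as a Databricks Job,
# but dbutils.notebook.run shows the same run-in-order idea. Paths are placeholders.
notebooks = [
    "/Workspace/sales/01_bronze_load",
    "/Workspace/sales/02_silver_clean",
    "/Workspace/sales/03_gold_merge",
]

for path in notebooks:
    # Run each notebook in order with a 1-hour timeout, passing the load mode as a parameter
    result = dbutils.notebook.run(path, 3600, {"load_type": "incremental"})
    print(f"{path} finished: {result}")
```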

---

## πŸ“Š Final Output

You get a structured and optimized **Sales Analytics Model**, built on:
- Delta Lakehouse
- Medallion architecture
- Scalable ETL design

This model can be connected to Power BI or Databricks dashboards for business reporting and analytics.
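
As a quick sanity check before pointing a BI tool at the Gold layer, a query like the one below can be run from a notebook or SQL warehouse. The table and column names are hypothetical, following the `sales_lakehouse.gold` naming used in the sketches above.

```python
# Hypothetical example query against the Gold tables to verify the model is queryable.
daily_sales = spark.sql("""
    SELECT d.order_date,
           SUM(f.amount) AS total_sales
    FROM   sales_lakehouse.gold.fact_sales f
    JOIN   sales_lakehouse.gold.dim_date d
           ON f.date_key = d.date_key
    GROUP BY d.order_date
    ORDER BY d.order_date
""")
daily_sales.show(10)
```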

---

## πŸ” Why This Project Matters

- Covers real-world challenges like **incremental loading** and **SCD handling**
- Uses **best practices in data lakehouse architecture**
- Demonstrates **governance** with Unity Catalog
- Ideal for showcasing in **data engineering interviews or portfolios**

---

## 🧱 Tools & Technologies

- **Azure Data Factory** (Dynamic Pipelines)
- **Azure SQL Database**
- **Azure Data Lake Storage Gen2**
- **Databricks & PySpark**
- **Unity Catalog**
- **Parquet & Delta Lake**

---