https://github.com/evialex/adventure-works-de-project

Azure End To End Data Engineering Project

Host: GitHub
URL: https://github.com/evialex/adventure-works-de-project
Owner: EviAleX
Created: 2025-07-27T14:53:55.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-08-08T11:25:15.000Z (11 months ago)
Last Synced: 2025-08-08T13:12:44.219Z (11 months ago)
Topics: azure, databricks, etl, etl-pipeline, git, powerbi, spark
Language: Jupyter Notebook
Homepage:
Size: 1.91 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Adventure-Works-DE-Project

This project demonstrates a complete end-to-end data engineering solution built on the Microsoft Azure platform. The pipeline ingests the AdventureWorks dataset from a public GitHub repository, processes it through a multi-stage architecture, and serves it to Power BI for business intelligence and analytics.

![Architecture Diagram](./images/architecture.png)

---
## 🏛️**Architecture Overview**

The solution is built on Azure services and follows a medallion architecture (Bronze, Silver, Gold layers), which keeps raw, cleaned, and curated data in separate stages to support data quality and scalability.

- **Orchestration**: Azure Data Factory (ADF)
- **Data Lake**: Azure Data Lake Storage (ADLS) Gen2
- **Data Transformation**: Azure Databricks (using Spark)
- **Data Warehousing**: Azure Synapse Analytics
- **Business Intelligence**: Power BI

---
## 🚀**Pipeline Flow**
The data moves through four distinct stages:

### 1. **Ingestion (Bronze Layer)**
- **Azure Data Factory** uses a dynamic copy activity to pull the raw datasets from GitHub via an HTTP connector.

- The raw data is landed in the **Bronze** container in Azure Data Lake Storage without any modifications.
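
One common way to make a copy activity dynamic is to feed it a list of per-file parameters that a ForEach activity iterates over. The sketch below shows how such a list could be generated in Python. It is an illustration only: the file paths and the keys `rel_url`, `sink_folder`, and `file_name` are hypothetical placeholders, not values taken from this repository.

```python
import json

# Hypothetical source files; replace with the real dataset paths.
SOURCE_FILES = [
    "Data/Calendar.csv",
    "Data/Customers.csv",
    "Data/Products.csv",
]


def build_copy_items(files):
    """Return one parameter dict per file for a ForEach-driven copy activity."""
    items = []
    for rel_url in files:
        file_name = rel_url.rsplit("/", 1)[-1]  # e.g. "Calendar.csv"
        folder = file_name.rsplit(".", 1)[0]    # e.g. "Calendar"
        items.append(
            {"rel_url": rel_url, "sink_folder": folder, "file_name": file_name}
        )
    return items


copy_items = build_copy_items(SOURCE_FILES)
print(json.dumps(copy_items, indent=2))
```

The printed JSON could be stored in the data lake and read by a Lookup activity, so adding a new source file only requires a new entry in the list rather than a new pipeline.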

---
### 2. **Transformation (Silver Layer)**
- **Azure Databricks** reads the raw data from the Bronze layer.
- A PySpark job performs key transformations, including cleaning records, normalizing data formats, and structuring the data.

- The cleaned, transformed data is saved in the **Silver** container in the efficient **Parquet** format.
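
A Bronze-to-Silver step along these lines could look roughly like the sketch below. It is not the notebook from this repository: the storage account name, folder name, and date column are hypothetical parameters, and it assumes the Databricks cluster already has access to the storage account.

```python
def lake_path(container, account, folder):
    """Build an abfss:// URI for a folder in ADLS Gen2."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder}"


def bronze_to_silver(spark, account, folder, date_column):
    """Read one raw CSV folder, clean it, and write it to Silver as Parquet."""
    # Imported here so the path helper above works without PySpark installed.
    from pyspark.sql.functions import col, to_date

    # Read the raw CSV files as landed in the Bronze container.
    df = (
        spark.read.option("header", True)
        .option("inferSchema", True)
        .csv(lake_path("bronze", account, folder))
    )

    # Example cleaning steps: drop exact duplicates, cast a column to a date.
    cleaned = df.dropDuplicates().withColumn(
        date_column, to_date(col(date_column))
    )

    # Overwrite the matching Silver folder with Parquet output.
    cleaned.write.mode("overwrite").parquet(lake_path("silver", account, folder))
    return cleaned
```

In a Databricks notebook, where `spark` is already defined, a call such as `bronze_to_silver(spark, "mystorageaccount", "Calendar", "Date")` would process one dataset; all three argument values here are placeholders.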

### 3. **Warehousing (Gold Layer)**
- **Azure Synapse Analytics** connects to the Silver container using a serverless SQL pool.
- External tables and views are created on top of the Parquet files to structure the data for analysis.

- This final, curated data represents the Gold layer, ready for reporting.
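
As a rough illustration of the view approach, the Python helper below generates a T-SQL `CREATE VIEW` statement that reads a Silver Parquet folder through `OPENROWSET` in a serverless SQL pool. The schema name `gold`, the storage account, and the table name are assumptions made for the example, and the schema is assumed to exist already.

```python
def gold_view_sql(table, account, schema="gold", container="silver"):
    """Return a CREATE VIEW statement that reads one Silver Parquet folder."""
    location = f"https://{account}.dfs.core.windows.net/{container}/{table}/"
    return (
        f"CREATE VIEW {schema}.{table} AS\n"
        f"SELECT *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{location}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS src;"
    )


print(gold_view_sql("calendar", "mystorageaccount"))
```

Generating the statements from a list of table names keeps the Gold views consistent. External tables take more setup, since they also need an external data source and an external file format to be defined first.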

---

### 4. **Visualization**
- **Power BI** connects directly to Azure Synapse Analytics.
- Simple dashboards and reports are built to provide actionable insights from the curated data. Polished dashboard design was outside the scope of this project, so the reports are fairly basic.

---
## **Key Takeaways** ✅
This project showcases an automated data engineering solution on Azure that takes raw source data through ingestion, transformation, warehousing, and reporting, demonstrating a complete data lifecycle.

---
## **Acknowledgment** 🎉
This project was inspired by the work of [Ansh Lamba](https://github.com/anshlambagit). For a detailed video walkthrough of a similar project, please check out [his YouTube channel](https://www.youtube.com/watch?v=0GTZ-12hYtU&t=15907s&ab_channel=AnshLamba).
