https://github.com/evialex/adventure-works-de-project
Azure End To End Data Engineering Project
- Host: GitHub
- URL: https://github.com/evialex/adventure-works-de-project
- Owner: EviAleX
- Created: 2025-07-27T14:53:55.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-08-08T11:25:15.000Z (11 months ago)
- Last Synced: 2025-08-08T13:12:44.219Z (11 months ago)
- Topics: azure, databricks, etl, etl-pipeline, git, powerbi, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 1.91 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Adventure-Works-DE-Project
This project demonstrates a complete end-to-end data engineering solution built on the Microsoft Azure platform. The pipeline ingests the AdventureWorks dataset from a public GitHub repository, processes it through a multi-stage architecture, and serves it to Power BI for business intelligence and analytics.

---
## 🏛️ **Architecture Overview**
The solution leverages a modern data stack on Azure, following a medallion architecture (Bronze, Silver, Gold layers) to ensure data quality and scalability.
- **Orchestration**: Azure Data Factory (ADF)
- **Data Lake**: Azure Data Lake Storage (ADLS) Gen2
- **Data Transformation**: Azure Databricks (using Spark)
- **Data Warehousing**: Azure Synapse Analytics
- **Business Intelligence**: Power BI
---
## 🚀 **Pipeline Flow**
The data moves through four distinct stages:
### 1. **Ingestion (Bronze Layer)**
- **Azure Data Factory** uses a dynamic copy activity to pull the raw datasets from GitHub via an HTTP connector.

- The raw data is landed in the **Bronze** container in Azure Data Lake Storage without any modifications.
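
A "dynamic" copy activity is usually driven by a list of parameters, one entry per source file, that a loop in the pipeline iterates over. The sketch below shows one way such a list could be generated. The field names (`rel_url`, `sink_folder`, `sink_file`), the file names, and the `example-owner/example-repo/main/Data` path are hypothetical placeholders for illustration, not values taken from this project.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CopyItem:
    rel_url: str      # path appended to the HTTP source's base URL
    sink_folder: str  # target folder inside the Bronze container
    sink_file: str    # file name written to the data lake


def build_copy_items(source_dir, file_names):
    """Create one parameter set per source file."""
    items = []
    for name in file_names:
        stem = name.rsplit(".", 1)[0]      # drop the file extension
        folder = stem.split("_", 1)[-1]    # "AdventureWorks_Customers" -> "Customers"
        rel_url = f"{source_dir.strip('/')}/{name}"
        items.append(CopyItem(rel_url, folder, name))
    return items


# Hypothetical file names and repository path, for illustration only.
FILES = ["AdventureWorks_Customers.csv", "AdventureWorks_Products.csv"]
items = build_copy_items("example-owner/example-repo/main/Data", FILES)

# A JSON manifest like this could be read by the pipeline and looped over.
manifest = json.dumps([asdict(item) for item in items], indent=2)
print(manifest)
```

Keeping the file list in a manifest means a new source file can be added by editing data rather than the pipeline definition.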

---
### 2. **Transformation (Silver Layer)**
- **Azure Databricks** reads the raw data from the Bronze layer.
- A PySpark job performs key transformations, including cleaning records, normalizing data formats, and structuring the data.

- The cleaned, transformed data is saved in the **Silver** container in the efficient, columnar **Parquet** format.
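
The Bronze-to-Silver step could look roughly like the sketch below. It is not the notebook code from this repository: the function names, the CSV source format, the `M/d/yyyy` date pattern, and the specific cleaning rules (trim strings, parse dates, drop duplicate rows) are assumptions chosen for illustration.

```python
def lake_path(container, account, folder):
    """Build an abfss:// URI for a folder in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{folder}"


def clean(df, date_cols=(), date_format="M/d/yyyy"):
    """Trim string columns, parse the given date columns, drop duplicate rows."""
    # Imported here so the path helper above works without a Spark runtime.
    from pyspark.sql import functions as F

    for name, dtype in df.dtypes:
        if dtype == "string":
            df = df.withColumn(name, F.trim(F.col(name)))

    dtypes = dict(df.dtypes)
    for name in date_cols:
        # Only parse columns that are still strings after schema inference.
        if dtypes.get(name) == "string":
            df = df.withColumn(name, F.to_date(F.col(name), date_format))

    return df.dropDuplicates()


def bronze_to_silver(spark, account, folder, date_cols=()):
    """Read raw CSV files from Bronze and write cleaned Parquet to Silver."""
    raw = (
        spark.read.format("csv")
        .option("header", True)
        .option("inferSchema", True)
        .load(lake_path("bronze", account, folder))
    )
    (
        clean(raw, date_cols)
        .write.format("parquet")
        .mode("overwrite")
        .save(lake_path("silver", account, folder))
    )
```

In a Databricks notebook, where `spark` is already defined, a call might be `bronze_to_silver(spark, "<storage-account>", "Customers")`, assuming the cluster has already been granted access to the storage account.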

### 3. **Warehousing (Gold Layer)**
- **Azure Synapse Analytics** connects to the Silver container using a serverless SQL pool.
- External tables and views are created on top of the Parquet files to structure the data for analysis.


- This final, curated data represents the Gold layer, ready for reporting.
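
As an illustration of the view-over-Parquet pattern, the helper below renders the kind of T-SQL a serverless SQL pool accepts, using `OPENROWSET` with `FORMAT = 'PARQUET'`. The helper name and the schema, view, and storage account names are hypothetical. External tables additionally need an external data source and an external file format, which this sketch leaves out.

```python
def parquet_view_sql(schema, view, account, container, folder):
    """Render a CREATE VIEW statement over a folder of Parquet files."""
    url = f"https://{account}.dfs.core.windows.net/{container}/{folder}/"
    return (
        f"CREATE VIEW {schema}.{view} AS\n"
        "SELECT *\n"
        "FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        "    FORMAT = 'PARQUET'\n"
        ") AS result;"
    )


# Hypothetical names, for illustration only.
sql = parquet_view_sql("gold", "customers", "examplestorage", "silver", "Customers")
print(sql)
```

Generating the statements from a list of folders keeps the Gold views consistent with whatever the Silver layer contains.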

---
### 4. **Visualization**
- **Power BI** connects directly to Azure Synapse Analytics.
- Simple dashboards and reports are built on the curated data. Polished dashboard design was outside the scope of this project, so the reports are fairly basic.

---
## **Key Takeaways** ✅
This project showcases an automated, scalable data engineering solution on Azure. It transforms raw source data into curated, report-ready data, demonstrating a complete data lifecycle.

---
## **Acknowledgment** 🎉
This project was inspired by the work of [Ansh Lamba](https://github.com/anshlambagit). For a detailed video walkthrough of a similar project, please check out [this video on his YouTube channel](https://www.youtube.com/watch?v=0GTZ-12hYtU&t=15907s&ab_channel=AnshLamba).