https://github.com/evialex/adventure-works-de-project

Azure End To End Data Engineering Project

# Adventure-Works-DE-Project

This project demonstrates a complete end-to-end data engineering solution built on the Microsoft Azure platform. The pipeline ingests the AdventureWorks dataset from a public GitHub repository, processes it through a multi-stage architecture, and serves it to Power BI for business intelligence and analytics.

![Architecture Diagram](./images/architecture.png)

---
## 🏛️ **Architecture Overview**

The solution leverages a modern data stack on Azure, following a medallion architecture (Bronze, Silver, Gold layers) to ensure data quality and scalability.

- **Orchestration**: Azure Data Factory (ADF)
- **Data Lake**: Azure Data Lake Storage (ADLS) Gen2
- **Data Transformation**: Azure Databricks (using Spark)
- **Data Warehousing**: Azure Synapse Analytics
- **Business Intelligence**: Power BI

---
## 🚀 **Pipeline Flow**
The data moves through four distinct stages:

### 1. **Ingestion (Bronze Layer)**
- **Azure Data Factory** uses a dynamic copy activity to ingest the raw datasets from GitHub via an HTTP connector.

- The raw data is landed in the **Bronze** container in Azure Data Lake Storage without any modifications.
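
A dynamic copy activity is commonly driven by a list of parameter sets, one per file, that a loop feeds into a single parameterized copy step. The following is a minimal sketch of how such a list might be generated. The file names, the repository path, and the key names (`rel_url`, `sink_folder`, `sink_file`) are illustrative assumptions, not values taken from this project.

```python
import json

# Hypothetical file list; the actual dataset names in the source repository may differ.
SOURCE_FILES = ["Calendar", "Customers", "Products"]


def build_copy_parameters(files, repo_path):
    """Return one parameter set per file for a parameterized copy activity.

    Each entry pairs a relative URL on the HTTP source with the folder and
    file name to use in the Bronze container.
    """
    return [
        {
            "rel_url": f"{repo_path}/{name}.csv",  # appended to the HTTP source base URL
            "sink_folder": name,                    # one folder per dataset in Bronze
            "sink_file": f"{name}.csv",             # raw file is stored unmodified
        }
        for name in files
    ]


# The repository path below is a placeholder, not the real source location.
params = build_copy_parameters(SOURCE_FILES, "example-user/example-repo/main/Data")
print(json.dumps(params, indent=2))
```

The resulting JSON array could be stored as a small configuration file and read by the pipeline, so that adding a dataset only requires a new entry rather than a new copy activity.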

---
### 2. **Transformation (Silver Layer)**
- **Azure Databricks** reads the raw data from the Bronze layer.
- A PySpark job performs key transformations, including cleaning records, normalizing data formats, and structuring the data.

- The cleaned, transformed data is saved in the **Silver** container in the columnar **Parquet** format.
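
The Bronze-to-Silver step can be sketched roughly as follows. This is an illustrative outline, not the notebook used in this project: the storage account name, the folder names, and the `OrderDate` column are assumptions, and access to the storage account is assumed to be configured already.

```python
def lake_path(container, storage_account, folder):
    """Build an ABFS URI for a folder in an ADLS Gen2 container."""
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{folder}"


def bronze_to_silver(spark, storage_account, folder):
    """Read one raw CSV folder from Bronze, clean it, and write Parquet to Silver."""
    # Imported here so the path helper above can be used without Spark installed.
    from pyspark.sql import functions as F

    df = (
        spark.read.format("csv")
        .option("header", True)
        .option("inferSchema", True)
        .load(lake_path("bronze", storage_account, folder))
    )

    cleaned = (
        df.dropDuplicates()  # remove exact duplicate records
        # Hypothetical column: normalize a date column to a proper date type.
        .withColumn("OrderDate", F.to_date(F.col("OrderDate")))
    )

    (
        cleaned.write.format("parquet")
        .mode("overwrite")
        .save(lake_path("silver", storage_account, folder))
    )
    return cleaned
```

In a Databricks notebook, where `spark` is predefined, a call might look like `bronze_to_silver(spark, "examplelake", "Sales")`, with both arguments replaced by real names.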

### 3. **Warehousing (Gold Layer)**
- **Azure Synapse Analytics** connects to the Silver container using a serverless SQL pool.
- External tables and views are created on top of the Parquet files to structure the data for analysis.

- This final, curated data represents the Gold layer, ready for reporting.
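
As a rough illustration of the view case, a serverless SQL pool can expose a Parquet folder through `OPENROWSET`. The helper below only renders the T-SQL text; the `gold` schema, the view name, the storage account, and the folder are placeholder assumptions rather than names from this project, and the inputs are not sanitized.

```python
def render_gold_view(schema, name, storage_account, folder):
    """Render T-SQL for a serverless SQL pool view over a Silver Parquet folder."""
    url = f"https://{storage_account}.dfs.core.windows.net/silver/{folder}/"
    return (
        f"CREATE VIEW {schema}.{name} AS\n"
        f"SELECT *\n"
        f"FROM OPENROWSET(\n"
        f"    BULK '{url}',\n"
        f"    FORMAT = 'PARQUET'\n"
        f") AS result;"
    )


# Placeholder names; substitute the real schema, account, and folder.
sql = render_gold_view("gold", "calendar", "examplelake", "Calendar")
print(sql)
```

Generating the statements from a list of folder names keeps the Gold layer definitions consistent when many tables follow the same pattern.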

---

### 4. **Visualization**
- **Power BI** connects directly to Azure Synapse Analytics.
- Simple dashboards and reports are built to provide insights from the curated data. Polished dashboard design was outside the scope of this project, so the visuals are intentionally basic.

---
## **Key Takeaways** ✅
This project showcases an automated, scalable data engineering solution on Azure. It transforms raw source data into curated, report-ready datasets, covering the full data lifecycle from ingestion to visualization.

---
## **Acknowledgment** 🎉
This project was inspired by the work of [Ansh Lamba](https://github.com/anshlambagit). For a detailed video walkthrough of a similar project, please check out [his YouTube channel](https://www.youtube.com/watch?v=0GTZ-12hYtU&t=15907s&ab_channel=AnshLamba).