https://github.com/edugmenes/azure-data-engineering
# End-to-End Azure Data Engineering

This repository contains my **first Data Engineering project**, built using **Microsoft Azure Cloud** and **Azure Databricks**.

The project focuses on designing and implementing **ETL pipelines** using **PySpark** following the **Medallion Architecture (Bronze, Silver, Gold)**, a modern and widely adopted pattern for building scalable and reliable data platforms.

## 🚀 Project Overview

The main objective of this project is to demonstrate how raw data can be ingested, transformed, and curated into analytics-ready datasets using cloud-native tools and best practices.

The solution covers:
- Data ingestion from raw sources
- Data transformation and cleansing
- Data modeling for analytics consumption
- Distributed data processing with PySpark

## ๐Ÿ—๏ธ Architecture
![Architecture Diagram](/archives/images/Captura%20de%20tela%202026-01-20%20123843.png)

### Azure Event Hubs
Purpose: Real-time data ingestion

Role in project:
- Captures streaming events (clicks, logs, IoT, transactions)
- Highly scalable and fault-tolerant

✨ Without Event Hubs: you'd miss live data or overload systems
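Event Hubs exposes a Kafka-compatible endpoint, so Spark Structured Streaming can consume it with the standard Kafka source. A sketch only: `NAMESPACE`, `EVENTHUB`, and `CONNECTION_STRING` are placeholders, and the `azure-event-hubs-spark` connector is an alternative route.

```python
# Sketch: read an Event Hub via its Kafka-compatible endpoint.
# Assumes an existing `spark` session with the Kafka connector available.
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "NAMESPACE.servicebus.windows.net:9093")
    .option("subscribe", "EVENTHUB")
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'org.apache.kafka.common.security.plain.PlainLoginModule required '
        'username="$ConnectionString" password="CONNECTION_STRING";',
    )
    .load()
)
```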

### Azure Data Factory (ADF)
Purpose: Orchestration & batch ingestion

Role in project:
- Schedules pipelines
- Moves data from source → data lake
- Triggers Databricks jobs

✨ Think of it as the control center

### Azure Databricks (Apache Spark)
Purpose: Data processing & transformation

Role in project:
- Processes huge volumes of data efficiently
- Implements Bronze → Silver → Gold logic
- Handles both batch and streaming data

✨ This is the engine of the architecture

### Azure Data Lake Storage Gen2 (ADLS)
Purpose: Central storage layer

Role in project:
- Stores all data (Bronze, Silver, Gold)
- Cheap, scalable, secure
- Optimized for analytics

✨ This is your single source of truth
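Wiring Spark to ADLS Gen2 is a matter of configuration. A minimal sketch using account-key auth, where `STORAGEACCOUNT` and `ACCOUNT_KEY` are placeholders (a service principal with OAuth is the usual production choice):

```python
# Assumes an existing `spark` session; values are placeholders.
spark.conf.set(
    "fs.azure.account.key.STORAGEACCOUNT.dfs.core.windows.net",
    "ACCOUNT_KEY",
)
# abfss:// URIs address container@account paths in ADLS Gen2.
df = spark.read.parquet(
    "abfss://bronze@STORAGEACCOUNT.dfs.core.windows.net/customers/"
)
```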

### Delta Lake
Purpose: Reliability & governance on top of ADLS

Role in project:
- ACID transactions
- Schema enforcement
- Time travel (data versioning)
- Efficient reads/writes

✨ Delta Lake turns "files" into real analytical tables

## 🧮 Data Model
The project is structured using a **Medallion Architecture**:

```
//
│
└── projet1/
    ├── resources/            # Source and target data
    │   ├── source/           # CSV files received from the source
    │   └── target/           # Files exported for customers
    │
    ├── 01-bronze/            # Raw data
    │   ├── customers/
    │   │   └── customers.parquet
    │   ├── sales/
    │   │   └── sales.parquet
    │   ...
    │
    ├── 02-silver/            # Clean data
    │   ├── customers/
    │   │   └── customers.parquet
    │   ├── sales/
    │   │   └── sales.parquet
    │   ...
    │
    ├── 03-gold/              # Aggregated data
    │   ├── sales_per_category/
    │   ├── sales_per_city/
    │   ...
    │
    ├── metadata/             # Metadata and logs
    │   ├── bronze/
    │   ├── silver/
    │   ├── gold/
    │   ├── ddl/              # CREATE TABLE scripts
    │   ├── logs/             # ETL execution logs
    │   └── checkpoints/      # Auto Loader / streaming checkpoints
    │
    └── tmp/                  # Temporary staging
```

### 🟤 Bronze Layer
- Raw data ingestion
- Minimal transformation
- Preserves source data as-is
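The `metadata/checkpoints/` folder in the tree above points at Databricks Auto Loader; a sketch of Bronze ingestion with it (`cloudFiles` runs only on Databricks, and the `/mnt/` paths are illustrative):

```python
# Assumes a Databricks `spark` session; paths are illustrative.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation", "/mnt/metadata/bronze/schema")
    .load("/mnt/resources/source/")
)
(
    raw.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/metadata/checkpoints/bronze")
    .start("/mnt/01-bronze/sales")
)
```

The checkpoint location is what lets the stream resume after a restart without re-ingesting files.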

### ⚪ Silver Layer
- Data cleansing and normalization
- Data enrichment
- Application of business rules

### 🟡 Gold Layer
- Curated, analytics-ready datasets
- Optimized for reporting and BI use cases

## 🛠️ Technologies Used

- Microsoft Azure
- Azure Databricks
- Apache Spark (PySpark)
- Delta Lake
- Medallion Architecture

## 🎯 Key Learnings

Through this project, I gained hands-on experience with:
- Cloud-based data platforms
- Distributed data processing using Spark
- Building scalable ETL pipelines
- Applying modern Data Engineering design patterns
- Managing data across multiple data layers

## 📌 Notes

This is a **learning project**, created to apply theoretical concepts in a practical environment using industry-standard tools.

Future improvements may include:
- Pipeline orchestration
- Data quality checks
- Performance optimization
- Monitoring and logging