https://github.com/edugmenes/azure-data-engineering
This repository contains my first end-to-end Data Engineering project, built using Microsoft Azure Cloud and Azure Databricks with PySpark.
- Host: GitHub
- URL: https://github.com/edugmenes/azure-data-engineering
- Owner: edugmenes
- Created: 2026-01-20T16:19:07.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-01-23T18:43:58.000Z (2 months ago)
- Last Synced: 2026-01-24T08:37:15.063Z (2 months ago)
- Topics: azure, cloud, data, data-engineering, data-lakehouse, data-structures, databricks, delta-lake, etl-pipelines, lakehouse, lakehouse-architectures, medallion-architecture, microsoft-azure, pyspark, spark
- Language: Jupyter Notebook
- Homepage:
- Size: 230 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# End-to-End Azure Data Engineering
This repository contains my **first Data Engineering project**, built using **Microsoft Azure Cloud** and **Azure Databricks**.
The project focuses on designing and implementing **ETL pipelines** using **PySpark** following the **Medallion Architecture (Bronze, Silver, Gold)**, a modern and widely adopted pattern for building scalable and reliable data platforms.
## Project Overview
The main objective of this project is to demonstrate how raw data can be ingested, transformed, and curated into analytics-ready datasets using cloud-native tools and best practices.
The solution covers:
- Data ingestion from raw sources
- Data transformation and cleansing
- Data modeling for analytics consumption
- Distributed data processing with PySpark
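The four steps above can be sketched end to end. This is a simplified, framework-free Python sketch of the flow (the actual project runs this logic on PySpark); all field names and values are illustrative, not taken from the project's datasets:

```python
# Simplified sketch of the Bronze -> Silver -> Gold flow. Plain Python
# stands in for PySpark so the shape of each stage is easy to see.

def ingest_bronze(raw_rows):
    """Bronze: land the source records as-is, only tagging lineage."""
    return [{**row, "_source": "csv"} for row in raw_rows]

def refine_silver(bronze_rows):
    """Silver: cleanse and normalize (drop incomplete rows, trim text)."""
    return [
        {**row, "city": row["city"].strip().title()}
        for row in bronze_rows
        if row.get("city") and row.get("amount") is not None
    ]

def curate_gold(silver_rows):
    """Gold: aggregate into an analytics-ready dataset (sales per city)."""
    totals = {}
    for row in silver_rows:
        totals[row["city"]] = totals.get(row["city"], 0) + row["amount"]
    return totals

raw = [
    {"city": " sao paulo ", "amount": 10.0},
    {"city": "sao paulo", "amount": 5.0},
    {"city": None, "amount": 3.0},  # incomplete row, dropped in Silver
]
gold = curate_gold(refine_silver(ingest_bronze(raw)))
print(gold)  # {'Sao Paulo': 15.0}
```

Each stage reads only the previous layer's output, which is what makes the medallion layers independently reprocessable.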
## Architecture

### Azure Event Hubs
Purpose: Real-time data ingestion
Role in project:
- Captures streaming events (clicks, logs, IoT, transactions)
- Highly scalable and fault tolerant
✨ Without Event Hubs: you'd miss live data or overload systems
### Azure Data Factory (ADF)
Purpose: Orchestration & batch ingestion
Role in project:
- Schedules pipelines
- Moves data from source → data lake
- Triggers Databricks jobs
✨ Think of it as the control center
### Azure Databricks (Apache Spark)
Purpose: Data processing & transformation
Role in project:
- Processes huge volumes of data efficiently
- Implements Bronze → Silver → Gold logic
- Handles both batch and streaming data
✨ This is the engine of the architecture
### Azure Data Lake Storage Gen2 (ADLS)
Purpose: Central storage layer
Role in project:
- Stores all data (Bronze, Silver, Gold)
- Cheap, scalable, secure
- Optimized for analytics
✨ This is your single source of truth
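Because every layer lives in the same lake, paths tend to follow one naming convention. A hypothetical helper for building ADLS Gen2 URIs could look like this; the storage account (`mystorageacct`) and container (`lakehouse`) names are made up for illustration, while the folder names mirror this repository's data model:

```python
# Hypothetical helper producing abfss:// URIs for each medallion layer.
# Account and container names are illustrative assumptions.

LAYERS = {"bronze": "01-bronze", "silver": "02-silver", "gold": "03-gold"}

def layer_path(layer, dataset,
               account="mystorageacct", container="lakehouse"):
    folder = LAYERS[layer]
    return (f"abfss://{container}@{account}.dfs.core.windows.net/"
            f"{folder}/{dataset}")

print(layer_path("silver", "customers"))
# abfss://lakehouse@mystorageacct.dfs.core.windows.net/02-silver/customers
```

Centralizing path construction like this keeps notebooks from hard-coding storage locations.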
### Delta Lake
Purpose: Reliability & governance on top of ADLS
Role in project:
- ACID transactions
- Schema enforcement
- Time travel (data versioning)
- Efficient reads/writes
✨ Delta Lake turns "files" into real analytical tables
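Two of the behaviours listed above, schema enforcement and time travel, can be imitated in a toy in-memory class. Real Delta tables do this on Parquet files plus a transaction log in ADLS; this sketch only mirrors the idea:

```python
# Toy imitation of Delta Lake schema enforcement (reject rows that don't
# match the declared schema) and time travel (read an earlier version).

class ToyDeltaTable:
    def __init__(self, schema):
        self.schema = schema          # e.g. {"id": int, "name": str}
        self.versions = [[]]          # version 0 is the empty table

    def append(self, rows):
        for row in rows:              # schema enforcement on every write
            for col, typ in self.schema.items():
                if not isinstance(row.get(col), typ):
                    raise TypeError(f"column {col!r} must be {typ.__name__}")
        self.versions.append(self.versions[-1] + rows)

    def read(self, version=None):     # time travel via the version number
        if version is None:
            version = len(self.versions) - 1
        return self.versions[version]

t = ToyDeltaTable({"id": int, "name": str})
t.append([{"id": 1, "name": "ana"}])
t.append([{"id": 2, "name": "bob"}])
print(len(t.read()))           # 2 rows at the latest version
print(len(t.read(version=1)))  # 1 row when "travelling" back to version 1
```

In real Delta Lake the version history is what makes audits and reproducible reprocessing possible.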
## 🧮 Data Model
The project is structured using the **Medallion Architecture**:
    projet1/
    ├── resources/               # Source and target data
    │   ├── source/              # CSV files received from the source
    │   └── target/              # Files exported for customers
    │
    ├── 01-bronze/               # Raw data
    │   ├── customers/
    │   │   └── customers.parquet
    │   ├── sales/
    │   │   └── sales.parquet
    │   └── ...
    │
    ├── 02-silver/               # Clean data
    │   ├── customers/
    │   │   └── customers.parquet
    │   ├── sales/
    │   │   └── sales.parquet
    │   └── ...
    │
    ├── 03-gold/                 # Aggregated data
    │   ├── sales_per_category/
    │   ├── sales_per_city/
    │   └── ...
    │
    ├── metadata/                # Metadata and logs
    │   ├── bronze/
    │   ├── silver/
    │   ├── gold/
    │   ├── ddl/                 # CREATE TABLE scripts
    │   ├── logs/                # ETL execution logs
    │   └── checkpoints/         # Auto Loader / streaming checkpoints
    │
    └── tmp/                     # Temporary staging
### 🟤 Bronze Layer
- Raw data ingestion
- Minimal transformation
- Preserves source data as-is
### ⚪ Silver Layer
- Data cleansing and normalization
- Data enrichment
- Application of business rules
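A concrete Silver-layer step could combine all three bullets: normalize text fields, enrich with defaults, and apply a deduplication rule on a business key. The field names below are assumptions for illustration, not taken from the project's actual datasets:

```python
# Illustrative Silver-layer cleansing: normalize, enrich, deduplicate.

def to_silver(bronze_rows):
    seen, out = set(), []
    for row in bronze_rows:
        key = row["customer_id"]
        if key in seen:               # business rule: keep first occurrence
            continue
        seen.add(key)
        out.append({
            "customer_id": key,
            "email": row["email"].strip().lower(),    # normalization
            "active": bool(row.get("active", True)),  # enrichment/default
        })
    return out

rows = [
    {"customer_id": 1, "email": " Ana@EXAMPLE.com "},
    {"customer_id": 1, "email": "ana@example.com"},   # duplicate key
]
print(to_silver(rows))
# [{'customer_id': 1, 'email': 'ana@example.com', 'active': True}]
```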
### 🟡 Gold Layer
- Curated, analytics-ready datasets
- Optimized for reporting and BI use cases
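A Gold dataset like the repository's `sales_per_category` is essentially a grouped aggregation over cleansed Silver rows. A minimal sketch, with illustrative column names:

```python
# Sketch of a Gold-layer aggregation: total sales amount per category,
# emitted as BI-friendly rows. Column names are assumptions.
from collections import defaultdict

def sales_per_category(silver_sales):
    totals = defaultdict(float)
    for sale in silver_sales:
        totals[sale["category"]] += sale["amount"]
    # sorted output gives reports a stable row order
    return [{"category": c, "total_amount": t}
            for c, t in sorted(totals.items())]

sales = [
    {"category": "books", "amount": 12.5},
    {"category": "games", "amount": 30.0},
    {"category": "books", "amount": 7.5},
]
print(sales_per_category(sales))
# [{'category': 'books', 'total_amount': 20.0},
#  {'category': 'games', 'total_amount': 30.0}]
```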
## 🛠️ Technologies Used
- Microsoft Azure
- Azure Databricks
- Apache Spark (PySpark)
- Delta Lake
- Medallion Architecture
## 🎯 Key Learnings
Through this project, I gained hands-on experience with:
- Cloud-based data platforms
- Distributed data processing using Spark
- Building scalable ETL pipelines
- Applying modern Data Engineering design patterns
- Managing data across multiple data layers
## Notes
This is a **learning project**, created to apply theoretical concepts in a practical environment using industry-standard tools.
Future improvements may include:
- Pipeline orchestration
- Data quality checks
- Performance optimization
- Monitoring and logging