https://github.com/jotstolu/azure-data-engineering-end--to-end-project
An end-to-end Netflix data engineering pipeline built on Microsoft Azure. This project ingests raw Netflix data, applies PySpark transformations, enforces data quality with Delta Live Tables, and orchestrates workflows via Azure Data Factory and Databricks.
- Host: GitHub
- URL: https://github.com/jotstolu/azure-data-engineering-end--to-end-project
- Owner: jotstolu
- Created: 2025-06-11T12:25:31.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-06-14T11:47:03.000Z (4 months ago)
- Last Synced: 2025-06-14T12:34:15.865Z (4 months ago)
- Topics: adf, adlsgen2, azuredatabricks, azuredatafactory, cloudcomputing, dataengineering, datapipeline, dataquality, deltalake, deltalivetables, medallionarchitecture, pyspark
- Homepage:
- Size: 5.48 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 4
Metadata Files:
- Readme: README.md
README
# Azure Data Engineering Project using the Netflix Dataset
An end-to-end data engineering pipeline on Microsoft Azure leveraging the publicly available Netflix dataset. This project covers:
- Data Ingestion (Bronze)
- Data Processing & Cleaning (Silver)
- Data Quality & Delivery (Gold)
- Automation & Orchestration

**Medallion Layers**:
| Layer | Purpose |
| ------ | ---------------------------------------------------------- |
| Bronze | Ingest raw data (Autoloader & ADF) into Delta format |
| Silver | Clean, dedupe, enrich; enforce schemas with PySpark |
| Gold   | Apply Delta Live Tables for quality checks & aggregations |

---
# PROJECT ARCHITECTURE

### Phase 1: Bronze (Raw Ingestion)
- **Sources**
  - `Netflix_titles.csv` in ADLS Gen2 (`rawdata/Netflix_titles.csv`)
  - Lookup tables (directors, cast, categories, countries) from `github`
- **Orchestration**
  - Azure Data Factory pipelines using **Copy Data**, **ForEach**, **Validation**, and **If Condition** activities
  - Parameterized datasets & pipelines for reusability
- **Autoloader**
  - Incremental ingestion of new CSV files into `bronze.netflix_titles_delta` using **Databricks Autoloader**
- **Storage**
  - All raw ingestions stored as Delta tables in the `bronze/` container

### Phase 2: Silver (Cleansing & Enrichment)
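Looking back at Phase 1 for a moment, the Autoloader ingest could be sketched roughly as below. This is a hypothetical sketch, not the repo's actual notebook: the storage account name, container paths, and checkpoint locations are placeholder assumptions, and it only runs inside a Databricks workspace where `spark` is predefined.

```python
# Hypothetical sketch of the Bronze Autoloader ingest (Databricks-only).
# <storage> and all abfss:// paths are placeholders, not taken from the repo.
stream = (
    spark.readStream
    .format("cloudFiles")                       # Databricks Autoloader
    .option("cloudFiles.format", "csv")
    .option("cloudFiles.schemaLocation",
            "abfss://bronze@<storage>.dfs.core.windows.net/_schemas/netflix_titles")
    .option("header", "true")
    .load("abfss://rawdata@<storage>.dfs.core.windows.net/")
)

(
    stream.writeStream
    .format("delta")
    .option("checkpointLocation",
            "abfss://bronze@<storage>.dfs.core.windows.net/_checkpoints/netflix_titles")
    .trigger(availableNow=True)                 # process new files, then stop
    .toTable("bronze.netflix_titles_delta")     # Delta table named in the README
)
```

`trigger(availableNow=True)` gives the incremental, job-style behaviour described above: each run picks up only the files that arrived since the last checkpoint.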

- **Compute**
  - Azure Databricks PySpark notebooks
- **Transformations**
  - Split multi-valued columns (e.g., rating)
  - Remove duplicates, filter invalid records
  - Fill null values
  - Cast data types for analytics readiness
- **Orchestration**
  - Databricks Workflows chaining parameterized notebooks
- **Output**
  - Cleaned Delta tables in the `silver/` container

### Phase 3: Gold (Quality & Aggregation)

- **Framework**
  - Delta Live Tables (DLT) for declarative pipelines
- **Data Quality**
  - Define **Expectations** (e.g., `NOT NULL`, `UNIQUE`)
  - Configure failure actions (e.g., `drop` invalid records)

---
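The expectation mechanics above can be sketched with the DLT Python decorators. The table names and rules here are illustrative assumptions, and the snippet only runs as part of a Delta Live Tables pipeline in a Databricks workspace:

```python
import dlt

# Hypothetical Gold-layer DLT table; table and rule names are placeholders.
@dlt.table(name="gold_netflix_titles", comment="Quality-checked Netflix titles")
@dlt.expect_or_drop("non_null_show_id", "show_id IS NOT NULL")  # drop failing rows
@dlt.expect("has_title", "title IS NOT NULL")                   # log violations, keep rows
def gold_netflix_titles():
    return dlt.read("silver_netflix_titles").dropDuplicates(["show_id"])
```

Note that DLT expectations are row-level SQL predicates, so `NOT NULL` maps directly onto an expectation, while uniqueness is typically handled by deduplicating (as in the `dropDuplicates` call) rather than by an expectation.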
## Technology Stack
| Component | Purpose |
| ------------------------- | ----------------------------------------- |
| Azure Data Factory (ADF) | Data orchestration & ingestion |
| Azure Data Lake Storage | Scalable storage for Delta tables |
| Azure Databricks | Spark-based ETL & Delta Live Tables |
| Delta Lake | ACID-compliant, performant data format |
| Python / PySpark | Data transformation logic |