https://github.com/nareangk/netflix-de-project
This project demonstrates an end-to-end data engineering pipeline using Azure and Databricks, following a Medallion architecture to process and analyze Netflix data.
- Host: GitHub
- URL: https://github.com/nareangk/netflix-de-project
- Owner: nareangk
- Created: 2025-06-28T16:23:12.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2025-06-28T16:59:57.000Z (12 months ago)
- Last Synced: 2025-06-28T17:31:54.640Z (12 months ago)
- Topics: adf, adlsgen2, azure, azuredatabricks, pyspark, python, sql
- Language: Jupyter Notebook
- Homepage:
- Size: 2.64 MB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Netflix Azure Data Engineering Project
## 📊 Project Overview
This project demonstrates an **end-to-end data pipeline** for processing Netflix data using **Azure and Databricks**. It follows the **Medallion Architecture** (Bronze, Silver, Gold) on **Azure Data Lake Storage (ADLS)** and utilizes **Azure Data Factory (ADF)**, **Databricks**, **Delta Live Tables**, and **Databricks Autoloader** for seamless data ingestion, transformation, and orchestration.
---
## 🏗️ Architecture

1. **Data Ingestion (Bronze Layer)**
   - Source: GitHub repository containing Netflix data.
   - Tool: **Azure Data Factory (ADF)** pipelines pull data from GitHub and load it into the **Bronze layer** on **ADLS**.
   - **Linked services** connect **GitHub → ADF** and **ADF → ADLS Bronze**.
   - **Databricks Autoloader** is used to incrementally load new files into the **Bronze Layer**, reducing manual ingestion efforts.
2. **Data Processing & Transformation (Silver Layer)**
   - **Databricks Access Connector** links ADLS to Databricks.
   - **Azure Databricks** processes raw data and applies cleaning & transformations.
   - Data is stored as **Delta Tables** in the **Silver Layer**.
3. **Data Aggregation & Analysis (Gold Layer)**
   - Transformed data is further aggregated in **Delta Live Tables**.
   - **Unity Catalog** is used for data governance & organization.
   - **Databricks Workflows** schedule and orchestrate jobs.
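
The incremental Bronze load described in step 1 could be sketched roughly as follows. This is a minimal illustration rather than the notebook code from this repository: the function names, the CSV file format, and the reuse of the checkpoint path as the schema location are all assumptions, and the `cloudFiles` source is only available on a Databricks runtime.

```python
from typing import Any, Dict


def autoloader_options(schema_location: str, file_format: str = "csv") -> Dict[str, str]:
    """Build the reader options Auto Loader uses to parse files and track schema.

    The "csv" default is an assumption about the source data format.
    """
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }


def start_bronze_stream(spark: Any, source_path: str, target_path: str,
                        checkpoint_path: str) -> Any:
    """Incrementally copy newly arrived files into a Delta location.

    `spark` is expected to be the SparkSession that Databricks provides.
    All three paths are hypothetical and supplied by the caller.
    """
    raw = (
        spark.readStream.format("cloudFiles")
        # Assumption: keep schema tracking next to the stream checkpoint.
        .options(**autoloader_options(checkpoint_path))
        .load(source_path)
    )
    return (
        raw.writeStream.format("delta")
        # The checkpoint records which files were already ingested.
        .option("checkpointLocation", checkpoint_path)
        # Process everything currently available, then stop.
        .trigger(availableNow=True)
        .start(target_path)
    )
```

Because the checkpoint remembers processed files, re-running the same call picks up only files that arrived since the previous run, which is what makes the load incremental.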
---
## 🛠 Technologies Used
- **Azure Data Lake Storage (ADLS)** – Storage for Bronze, Silver, and Gold layers.
- **Azure Data Factory (ADF)** – ETL tool for ingesting data from GitHub to ADLS.
- **Azure Databricks** – Processing, transformation, and analysis engine.
- **Delta Lake & Delta Live Tables** – Optimized storage & real-time transformations.
- **Databricks Autoloader** – Automated and incremental ingestion of new data files.
- **Unity Catalog** – Centralized governance for managing data assets.
- **Databricks Workflows** – Job scheduling and orchestration.
---
## 🎯 Key Features
- ✅ End-to-end data pipeline with Azure & Databricks.
- ✅ Medallion architecture (Bronze, Silver, Gold) for structured data processing.
- ✅ Delta Live Tables for real-time transformations.
- ✅ Automated workflows & scheduling using Databricks Workflows.
- ✅ Unity Catalog for centralized data governance.
---
## 📁 Folder Descriptions
| Folder | Description |
|------------------------|---------------------------------------------------|
| `Azure/` | ADF pipeline + linked services JSON files |
| `Data/` | Sample input data files |
| `Databricks_notebooks/` | Databricks notebooks for transformation/analysis |
| `parameter_file/` | ADF parameter files for dynamic configuration |
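
As an illustration of what a parameter file for dynamic configuration might contain, the sketch below generates a JSON array with one entry per source file, a shape that a pipeline could read and iterate over. The key names `file_name` and `folder_name` and the example file names are hypothetical; the actual files in `parameter_file/` may use a different structure.

```python
import json
from typing import Dict, List


def build_parameter_entries(file_names: List[str]) -> List[Dict[str, str]]:
    """Create one parameter entry per source file.

    Each entry pairs the file to copy with a target folder named after
    the file (without its extension). Both keys are hypothetical.
    """
    entries = []
    for name in file_names:
        stem = name.rsplit(".", 1)[0]
        entries.append({"file_name": name, "folder_name": stem})
    return entries


def to_parameter_json(file_names: List[str]) -> str:
    """Serialize the entries as an indented JSON array."""
    return json.dumps(build_parameter_entries(file_names), indent=2)


if __name__ == "__main__":
    # Hypothetical file names, used only to show the output shape.
    print(to_parameter_json(["netflix_cast.csv", "netflix_category.csv"]))
```

Keeping the file list in a parameter file means a new source file can be added by editing JSON rather than changing the pipeline definition.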