{"id":29403577,"url":"https://github.com/nareangk/netflix-de-project","last_synced_at":"2026-05-05T11:36:17.305Z","repository":{"id":301759509,"uuid":"1010231695","full_name":"nareangk/Netflix-DE-Project","owner":"nareangk","description":"This project demonstrates an end-to-end data engineering pipeline using Azure and Databricks, following a Medallion architecture to process and analyze Netflix data.","archived":false,"fork":false,"pushed_at":"2025-06-28T16:59:57.000Z","size":2772,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-06-28T17:31:54.640Z","etag":null,"topics":["adf","adlsgen2","azure","azuredatabricks","pyspark","python","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nareangk.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-28T16:23:12.000Z","updated_at":"2025-06-28T17:22:19.000Z","dependencies_parsed_at":"2025-06-28T17:32:28.523Z","dependency_job_id":"c70113c6-68e5-45eb-ab87-fe5fdca09ae9","html_url":"https://github.com/nareangk/Netflix-DE-Project","commit_stats":null,"previous_names":["nareangk/netflix-de-project"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/nareangk/Netflix-DE-Project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nareangk%2FNetflix-DE-Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nareangk%2FNetflix-DE-Project/tags","releases_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nareangk%2FNetflix-DE-Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nareangk%2FNetflix-DE-Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nareangk","download_url":"https://codeload.github.com/nareangk/Netflix-DE-Project/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nareangk%2FNetflix-DE-Project/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264637858,"owners_count":23642062,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adf","adlsgen2","azure","azuredatabricks","pyspark","python","sql"],"created_at":"2025-07-10T19:00:45.022Z","updated_at":"2026-05-05T11:36:17.232Z","avatar_url":"https://github.com/nareangk.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Netflix Azure Data Engineering Project\n\nThis project demonstrates an end-to-end data engineering pipeline using Azure and Databricks, following a Medallion architecture to process and analyze Netflix data.\n\n## 📊 Project Overview\n\nThis project demonstrates an **end-to-end data pipeline** for processing Netflix data using **Azure and Databricks**. It follows the **Medallion Architecture** (Bronze, Silver, Gold) on **Azure Data Lake Storage (ADLS)** and utilizes **Azure Data Factory (ADF)**, **Databricks**, **Delta Live Tables**, and **Databricks Autoloader** for seamless data ingestion, transformation, and orchestration.  
\n\n---\n\n## 🏗️ Architecture\n\n![Architecture](https://github.com/user-attachments/assets/c1556b47-4dae-4605-bb61-6c9e33ad95e6)\n\n\n1. **Data Ingestion (Bronze Layer)**\n   - Source: GitHub repository containing Netflix data.  \n   - Tool: **Azure Data Factory (ADF)** pipelines pull data from GitHub and load it into the **Bronze layer** on **ADLS**.\n   - **Linked services** connect **GitHub → ADF** and **ADF → ADLS Bronze**.  \n   - **Databricks Autoloader** is used to incrementally load new files into the **Bronze Layer**, reducing manual ingestion efforts.\n\n2. **Data Processing \u0026 Transformation (Silver Layer)**\n   - **Databricks Access Connector** links ADLS to Databricks.  \n   - **Azure Databricks** processes raw data and applies cleaning \u0026 transformations.  \n   - Data is stored as **Delta Tables** in the **Silver Layer**.  \n\n3. **Data Aggregation \u0026 Analysis (Gold Layer)**\n   - Transformed data is further aggregated in **Delta Live Tables**.  \n   - **Unity Catalog** is used for data governance \u0026 organization.  \n   - **Databricks Workflows** schedule and orchestrate jobs.  \n\n---\n\n## 🛠 Technologies Used  \n- **Azure Data Lake Storage (ADLS)** – Storage for Bronze, Silver, and Gold layers.  \n- **Azure Data Factory (ADF)** – ETL tool for ingesting data from GitHub to ADLS.  \n- **Azure Databricks** – Processing, transformation, and analysis engine.  \n- **Delta Lake \u0026 Delta Live Tables** – Optimized storage \u0026 real-time transformations.  \n- **Databricks Autoloader** – Automated and incremental ingestion of new data files.  \n- **Unity Catalog** – Centralized governance for managing data assets.  
\n- **Databricks Workflows** – Job scheduling and orchestration.\n\n\n---\n\n## 🎯 Key Features\n\n✅ End-to-end data pipeline with Azure \u0026 Databricks.\n\n✅ Medallion architecture (Bronze, Silver, Gold) for structured data processing.\n\n✅ Delta Live Tables for real-time transformations.\n\n✅ Automated workflows \u0026 scheduling using Databricks Workflows.\n\n✅ Unity Catalog for centralized data governance.\n\n---\n\n## 📁 Folder Descriptions\n\n| Folder                 | Description                                        |\n|------------------------|---------------------------------------------------|\n| `Azure/`              | ADF pipeline + linked services JSON files         |\n| `Data/`               | Sample input data files                           |\n| `Databricks_notebooks/` | Databricks notebooks for transformation/analysis  |\n| `parameter_file/`     | ADF parameter files for dynamic configuration     |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnareangk%2Fnetflix-de-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnareangk%2Fnetflix-de-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnareangk%2Fnetflix-de-project/lists"}