https://github.com/bilgeswe/bigdatamanagement
Building a Data Pipeline with Lakehouse Architecture on Microsoft Azure Platform
- Host: GitHub
- URL: https://github.com/bilgeswe/bigdatamanagement
- Owner: bilgeswe
- Created: 2025-04-05T17:53:09.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2025-04-05T18:38:16.000Z (11 months ago)
- Last Synced: 2025-04-30T22:55:29.903Z (10 months ago)
- Topics: azure, azure-pipelines, azure-service, azure-storage, big-data, big-data-analytics, big-data-processing, data-visualization, datalake-ingestion, dataset, kaggle, sql, uml-diagram
- Language: TSQL
- Homepage:
- Size: 2.02 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# BigDataManagement
Building a Data Pipeline with Lakehouse Architecture on Microsoft Azure Platform
## INTRODUCTION
In today’s data-driven world, the ability to efficiently process, analyze, and derive
insights from large datasets is critical for organizations across various industries. This
project focuses on building an end-to-end data pipeline to analyze Netflix's content
dataset, leveraging Azure cloud services to implement a scalable and reliable solution.
The project follows a structured data pipeline model, transitioning data through ingestion,
processing, storage, and serving layers, to create actionable insights for business and
academic purposes. The goal is to identify trends in Netflix content production across
regions and categories, such as the proportion of modern vs. classic content and the
distribution of content duration and count by country. The pipeline design ensures
flexibility, automation, and adaptability to future data needs, adhering to best practices in
data engineering.
For the researcher, this study was eye-opening, as it was the first time working with many of the tools involved.
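As an illustration of the modern vs. classic analysis described above, the split could be computed with a query along the following lines. This is a sketch against the Kaggle Netflix schema; the table name `netflix_titles` and the 2015 cutoff year are assumptions for illustration, not taken from the project itself.

```sql
-- Sketch: proportion of "modern" (released 2015 or later) vs. "classic" titles.
-- Assumes a table named netflix_titles with a release_year column
-- (column name follows the Kaggle dataset; the cutoff year is illustrative).
SELECT
    CASE WHEN release_year >= 2015 THEN 'Modern' ELSE 'Classic' END AS content_era,
    COUNT(*) AS title_count,
    CAST(100.0 * COUNT(*) / SUM(COUNT(*)) OVER () AS DECIMAL(5, 2)) AS pct_of_total
FROM netflix_titles
GROUP BY CASE WHEN release_year >= 2015 THEN 'Modern' ELSE 'Classic' END;
```

A similar `GROUP BY country` query would give the per-country content counts mentioned above.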
## Key Components
- **Azure Data Factory (ADF)**: orchestrates and automates data ingestion and transformation processes.
- **Azure Data Lake Storage Gen2**: serves as the storage layer, structured into Bronze, Silver, and Gold layers for raw, processed, and analytics-ready data.
- **Azure Synapse Analytics**: enables querying, visualizing, and analyzing data using SQL external tables.
- **External Tables**: facilitate the serving layer by providing seamless access to processed data stored in the Gold layer.
- **Visualization Tools**: SQL-based visualizations in Azure Synapse replace Power BI due to subscription constraints.
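The serving layer described above can be sketched as a Synapse serverless SQL external table over the Gold container. The object names, storage path, file format, and column list below are illustrative assumptions, not taken from the repository:

```sql
-- Sketch: exposing analytics-ready Parquet files from the Gold layer
-- as an external table in Azure Synapse serverless SQL.
-- All names and the storage path below are illustrative.
CREATE EXTERNAL DATA SOURCE GoldLayer
WITH (LOCATION = 'https://<storage-account>.dfs.core.windows.net/gold');

CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE dbo.NetflixTitlesGold (
    show_id      VARCHAR(10),
    title        NVARCHAR(200),
    country      NVARCHAR(100),
    release_year INT,
    duration     NVARCHAR(20)
)
WITH (
    LOCATION = '/netflix_titles/',
    DATA_SOURCE = GoldLayer,
    FILE_FORMAT = ParquetFormat
);
```

Once defined, the table can be queried with ordinary `SELECT` statements, which is what makes the Gold layer directly consumable by SQL-based visualizations.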
### Keywords
Data Pipeline, Azure Cloud Services, Data Factory, Data Lake Storage, Synapse
Analytics, Content Analysis, Modern vs. Classic Content, Regional Content Trends,
Machine Learning Integration
Source Material: https://www.kaggle.com/datasets/shivamb/netflix-shows