Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/maazie-khan/olympics-data-enigeering
Worked with Azure Data Factory, Databricks, Data Lake Storage, and Synapse Analytics to build an ETL pipeline for processing and analyzing Olympic Games data from Kaggle.
- Host: GitHub
- URL: https://github.com/maazie-khan/olympics-data-enigeering
- Owner: Maazie-Khan
- Created: 2024-09-12T21:33:32.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-09-12T21:40:10.000Z (2 months ago)
- Last Synced: 2024-10-14T20:40:12.350Z (about 1 month ago)
- Topics: azure, big-data, data-analysis, dataengineering, devops, pipeline
- Language: Jupyter Notebook
- Homepage:
- Size: 3.51 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Olympic Games Data Engineering Pipeline (using Azure)
This repository contains the implementation of an ETL pipeline using **Azure Cloud** services to process **Olympic Games data** from Kaggle.
## Overview
The project extracts data from Kaggle, transforms it using **Apache Spark** in **Azure Databricks**, loads the cleaned data into **Azure Data Lake Storage**, and analyzes it using **Azure Synapse Analytics**.
![architecture](https://github.com/user-attachments/assets/459faf87-c6d7-48d1-aa56-ce1170db48d9)
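The flow above moves files between a raw zone and a transformed zone in Azure Data Lake Storage. As a minimal sketch of how those locations might be addressed, the helper below builds ADLS Gen2 `abfss://` URIs. The storage account name, container name, folder names, and the `athletes.csv` file name are all assumptions for illustration; the README does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LakeLocation:
    """Hypothetical description of one folder in an ADLS Gen2 account."""

    account: str    # storage account name (assumed, not from the README)
    container: str  # container / filesystem name (assumed)
    folder: str     # zone folder, e.g. "raw-data" (assumed)

    def uri(self, file_name: str) -> str:
        # ADLS Gen2 paths follow the pattern
        # abfss://<container>@<account>.dfs.core.windows.net/<path>
        return (
            f"abfss://{self.container}@{self.account}"
            f".dfs.core.windows.net/{self.folder}/{file_name}"
        )


# Placeholder names; replace with the values of your own deployment.
raw = LakeLocation("olympicsstorage", "olympic-data", "raw-data")
transformed = LakeLocation("olympicsstorage", "olympic-data", "transformed-data")

print(raw.uri("athletes.csv"))
print(transformed.uri("athletes"))
```

In a Databricks notebook, a path like this would typically be passed to `spark.read.csv(path, header=True, inferSchema=True)` and the cleaned DataFrame written back with `df.write.mode("overwrite").csv(path)`, assuming the cluster has already been granted access to the storage account.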
### Objectives:
- **Extract** data from Kaggle.
- **Transform** using Spark.
- **Load** into Azure Data Lake Storage.
- **Analyze** using SQL and visualizations in Synapse.

## Solution
1. **Data Extraction**:
- Data is extracted from Kaggle via **Azure Data Factory** and stored in **Azure Data Lake Storage**.
2. **Data Transformation**:
- **Azure Databricks** and **Apache Spark** are used to clean and transform the data.
3. **Data Loading**:
- The transformed data is reloaded into **Azure Data Lake Storage**.
4. **Data Analysis**:
- **Azure Synapse Analytics** runs SQL queries and generates visualizations.

## Tech Stack
- **Azure Data Factory**: Data extraction and orchestration.
- **Azure Databricks**: Data transformation using Apache Spark.
- **Azure Data Lake Storage**: Scalable data storage.
- **Azure Synapse Analytics**: Data querying and visualization.

## Insights
The project allows analysis of Olympic data, such as:
- **Medal Distribution** by country and year.
- **Athlete Trends** in age and performance.
- **Event Performance** analysis across nations.

## Repository Structure
- `notebooks/`: Databricks notebooks for data transformation.
- `queries/`: SQL queries for analysis in Synapse.

## How to Run
1. Set up **Azure Data Factory**, **Databricks**, **Data Lake Storage**, and **Synapse Analytics**.
2. Extract data from Kaggle via **Data Factory**.
3. Transform the data in **Databricks**.
4. Load and query the data in **Synapse**.

## Preview
https://github.com/user-attachments/assets/08ca8988-ebe6-406b-9dfe-47adea61b670
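
To try the medal-distribution idea from the Insights section locally before setting up Synapse, a rough sketch in plain Python could look like the following. The column names (`Country`, `Gold`, `Silver`, `Bronze`) and the sample rows are made up for illustration and are not taken from the Kaggle dataset; the repository's own analysis runs as SQL in Synapse.

```python
import csv
import io

# Made-up rows for illustration only; not real Olympic results.
SAMPLE_CSV = """Country,Gold,Silver,Bronze
Country A,2,1,0
Country B,0,3,1
Country C,1,0,4
"""


def medal_totals(csv_text):
    """Return a dict mapping each country to its total medal count."""
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["Country"]] = sum(
            int(row[column]) for column in ("Gold", "Silver", "Bronze")
        )
    return totals


def rank_by_total(totals):
    """Sort countries by total medals (descending), then by name."""
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_total(medal_totals(SAMPLE_CSV))
for country, total in ranking:
    print(country, total)
```

The same aggregation maps naturally onto a `GROUP BY` with `ORDER BY` in SQL once the transformed files are queryable.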