https://github.com/maazie-khan/olympics-data-enigeering

Worked with Azure Data Factory, Databricks, Data Lake Storage, and Synapse Analytics to build an ETL pipeline for processing and analyzing Olympic Games data from Kaggle.
azure big-data data-analysis dataengineering devops pipeline

Host: GitHub
URL: https://github.com/maazie-khan/olympics-data-enigeering
Owner: Maazie-Khan
Created: 2024-09-12T21:33:32.000Z (2 months ago)
Default Branch: main
Last Pushed: 2024-09-12T21:40:10.000Z (2 months ago)
Last Synced: 2024-10-14T20:40:12.350Z (about 1 month ago)
Topics: azure, big-data, data-analysis, dataengineering, devops, pipeline
Language: Jupyter Notebook
Homepage:
Size: 3.51 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Olympic Games Data Engineering Pipeline (using Azure)

This repository contains the implementation of an ETL pipeline using **Azure Cloud** services to process **Olympic Games data** from Kaggle.

## Overview

The project extracts data from Kaggle, transforms it using **Apache Spark** in **Azure Databricks**, loads the cleaned data into **Azure Data Lake Storage**, and analyzes it using **Azure Synapse Analytics**.

![architecture](https://github.com/user-attachments/assets/459faf87-c6d7-48d1-aa56-ce1170db48d9)

### Objectives:
- **Extract** data from Kaggle.
- **Transform** using Spark.
- **Load** into Azure Data Lake Storage.
- **Analyze** using SQL and visualizations in Synapse.

## Solution

1. **Data Extraction**:
   - Data is extracted from Kaggle via **Azure Data Factory** and stored in **Azure Data Lake Storage**.
2. **Data Transformation**:
   - **Azure Databricks** and **Apache Spark** are used to clean and transform the data.
3. **Data Loading**:
   - The transformed data is written back to **Azure Data Lake Storage**.
4. **Data Analysis**:
   - **Azure Synapse Analytics** runs SQL queries and generates visualizations.
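The transform and load steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository's notebooks: it assumes the storage account is Data Lake Storage Gen2 (hence the `abfss://` scheme), and the container, account, file, and column names (`Country`) are hypothetical placeholders. `transform_athletes` is meant to run in a Databricks notebook, where a `spark` session is already available.

```python
def adls_path(container: str, account: str, relative_path: str) -> str:
    """Build an abfss:// URI for a file in Data Lake Storage Gen2 (assumed)."""
    return (
        f"abfss://{container}@{account}.dfs.core.windows.net/"
        f"{relative_path.lstrip('/')}"
    )


def transform_athletes(spark, raw_path: str, out_path: str) -> None:
    """Read a raw CSV, apply light cleaning, and write the result back.

    Intended for a Databricks notebook, where `spark` is predefined.
    """
    # Imported lazily so the helper above works without PySpark installed.
    from pyspark.sql.functions import col, trim

    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(raw_path)
    )
    cleaned = (
        df.dropDuplicates()
        # "Country" is a hypothetical column name used for illustration.
        .withColumn("Country", trim(col("Country")))
    )
    cleaned.write.mode("overwrite").option("header", "true").csv(out_path)
```

In a notebook this might be called as `transform_athletes(spark, adls_path("olympics", "myaccount", "raw/athletes.csv"), adls_path("olympics", "myaccount", "transformed/athletes"))`, with the container and account names replaced by your own.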

## Tech Stack

- **Azure Data Factory**: Data extraction and orchestration.
- **Azure Databricks**: Data transformation using Apache Spark.
- **Azure Data Lake Storage**: Scalable data storage.
- **Azure Synapse Analytics**: Data querying and visualization.

## Insights

The project allows analysis of Olympic data, such as:
- **Medal Distribution** by country and year.
- **Athlete Trends** in age and performance.
- **Event Performance** analysis across nations.
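As an illustration of the first insight, a medal-distribution query could look like the sketch below. It uses Python's built-in `sqlite3` as a local stand-in for Synapse so the query can be tried without any Azure resources. The table name `medals` and the columns `Country`, `Year`, `Gold`, `Silver`, and `Bronze` are assumptions, not the schema used in this repository's `queries/` folder.

```python
import sqlite3

# Plain GROUP BY aggregate; table and column names are assumed.
MEDAL_QUERY = """
SELECT Country, Year, SUM(Gold + Silver + Bronze) AS TotalMedals
FROM medals
GROUP BY Country, Year
ORDER BY TotalMedals DESC, Country, Year
"""


def medal_distribution(rows):
    """Load (country, year, gold, silver, bronze) rows and total them."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE medals ("
            "Country TEXT, Year INTEGER, "
            "Gold INTEGER, Silver INTEGER, Bronze INTEGER)"
        )
        conn.executemany("INSERT INTO medals VALUES (?, ?, ?, ?, ?)", rows)
        return conn.execute(MEDAL_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Made-up placeholder rows, not real Olympic results.
    sample = [
        ("Country A", 2016, 3, 1, 0),
        ("Country A", 2020, 2, 2, 1),
        ("Country B", 2020, 1, 0, 4),
    ]
    for row in medal_distribution(sample):
        print(row)
```

The same `SELECT` should carry over to Synapse with little change, since it relies only on standard aggregation and ordering.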

## Repository Structure

- `notebooks/`: Databricks notebooks for data transformation.
- `queries/`: SQL queries for analysis in Synapse.

## How to Run

1. Set up **Azure Data Factory**, **Databricks**, **Data Lake Storage**, and **Synapse Analytics**.
2. Extract data from Kaggle via **Data Factory**.
3. Transform the data in **Databricks**.
4. Load the transformed data into **Data Lake Storage** and query it in **Synapse**.
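Before step 3 can read from the lake, the Databricks cluster needs access to the storage account. The README does not say how access is configured, so the following is only one possible approach: setting a storage account key on the Spark session. It assumes a Data Lake Storage Gen2 account; the function names are hypothetical, and a service principal or other credential type may be preferable in practice.

```python
def account_key_conf_name(storage_account: str) -> str:
    """Name of the Spark setting that holds a Gen2 account key."""
    return f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"


def configure_storage_access(spark, storage_account: str, account_key: str) -> None:
    """Register an account key on the Spark session (assumed auth method)."""
    spark.conf.set(account_key_conf_name(storage_account), account_key)
```

In a Databricks notebook the key would typically come from a secret scope, for example `dbutils.secrets.get(scope="...", key="...")`, rather than being pasted into the notebook.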

## Preview
https://github.com/user-attachments/assets/08ca8988-ebe6-406b-9dfe-47adea61b670