https://github.com/gakas14/batch-data-pipeline-using-airflow-spark-emr-snowflake
The project uses Airflow to orchestrate and manage the data pipeline, creating and terminating a transient EMR cluster to save on cost. Apache Spark transforms the data, and the final dataset is loaded into Snowflake.
- Host: GitHub
- URL: https://github.com/gakas14/batch-data-pipeline-using-airflow-spark-emr-snowflake
- Owner: gakas14
- Created: 2024-06-11T04:27:22.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-11T13:45:27.000Z (over 1 year ago)
- Last Synced: 2025-01-17T04:45:55.902Z (9 months ago)
- Topics: apache-airflow, apache-spark, aws, aws-emr, snowflake
- Language: Python
- Homepage: https://medium.com/@abdoulkaled/building-a-batch-etl-pipeline-using-airflow-spark-emr-and-snowflake-05559bb9799a
- Size: 13.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ETL pipeline using Airflow, Spark, EMR, and Snowflake
### This project joins the hourly_ridership (60M records) and wifi_location (300 records) datasets on a shared column and calculates the total daily ridership broken down by three other columns.
- hourly_ridership: Hourly subway ridership estimates by subway station complex and class of fare payment. Link.
- wifi_location: The MTA (Metropolitan Transportation Authority) contracted with Transit Wireless to provide all subway stations with Wi-Fi access and cell service. This dataset is a snapshot of the stations where Wi-Fi was available in parts of 2015 and 2016. Link.
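
The transformation itself amounts to a broadcast join plus a grouped aggregation. Below is a minimal PySpark sketch of that shape; the column names (`station_complex_id`, `transit_timestamp`, `borough`, `payment_method`, `ridership`) and S3 paths are illustrative assumptions, not the repo's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mta-daily-ridership").getOrCreate()

# Assumed input locations; the repo's real paths and formats will differ.
ridership = spark.read.parquet("s3://<bucket>/raw/hourly_ridership/")
wifi = spark.read.parquet("s3://<bucket>/raw/wifi_location/")

# wifi_location has ~300 rows, so broadcasting it avoids shuffling
# the 60M-row ridership table for the join.
joined = ridership.join(F.broadcast(wifi), on="station_complex_id", how="inner")

# Roll the hourly estimates up to daily totals, broken down by three
# columns (assumed here: borough, payment method, station complex).
daily = (
    joined
    .withColumn("transit_date", F.to_date("transit_timestamp"))
    .groupBy("transit_date", "borough", "payment_method", "station_complex_id")
    .agg(F.sum("ridership").alias("total_ridership"))
)

daily.write.mode("overwrite").parquet("s3://<bucket>/curated/daily_ridership/")
```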
### The project uses Airflow to orchestrate and manage the data pipeline, creating and terminating a transient EMR cluster. Apache Spark transforms the data, and the final dataset is loaded into Snowflake.
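
The Amazon provider ships operators for exactly this create → submit → wait → terminate lifecycle. The following is a minimal DAG sketch, assuming Airflow 2 with `apache-airflow-providers-amazon` installed; the cluster config, step definition, and connection IDs are placeholders rather than the repo's actual values.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder transient-cluster config; tune instance types/counts as needed.
JOB_FLOW_OVERRIDES = {
    "Name": "transient-etl-cluster",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Placeholder spark-submit step pointing at the transformation script in S3.
SPARK_STEPS = [
    {
        "Name": "transform_ridership",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://<bucket>/scripts/etl_job.py"],
        },
    }
]

with DAG(
    dag_id="batch_etl_airflow_spark_emr_snowflake",
    start_date=datetime(2024, 6, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Spin up the transient cluster.
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
    )

    # Submit the Spark transformation as an EMR step.
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
        aws_conn_id="aws_default",
    )

    # Block until the step finishes (or fails).
    watch_step = EmrStepSensor(
        task_id="watch_spark_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
        aws_conn_id="aws_default",
    )

    # Tear the cluster down so it only costs money while the job runs.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
        aws_conn_id="aws_default",
    )

    create_cluster >> add_step >> watch_step >> terminate_cluster
```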
#### Read more: https://abdoulkaled.medium.com/building-a-batch-etl-pipeline-using-airflow-spark-emr-and-snowflake-05559bb9799a
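
For the final load, a common pattern (assumed here; the post above describes the repo's actual approach) is a `COPY INTO` from an external stage over the Spark output, issued from the DAG via the Snowflake provider. The table, stage, and connection names below are illustrative.

```python
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# This task would slot into the DAG above after watch_step (the data is
# already in S3, so it can also run after cluster termination).
load_to_snowflake = SnowflakeOperator(
    task_id="load_to_snowflake",
    snowflake_conn_id="snowflake_default",
    sql="""
        COPY INTO analytics.daily_ridership
        FROM @etl_stage/curated/daily_ridership/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
    """,
)
```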
