https://github.com/gakas14/batch-data-pipeline-using-airflow-spark-emr-snowflake
The project uses Airflow to orchestrate and manage the data pipeline, creating and terminating a transient EMR cluster to save on cost. Apache Spark transforms the data, and the final dataset is loaded into Snowflake.
- Host: GitHub
- URL: https://github.com/gakas14/batch-data-pipeline-using-airflow-spark-emr-snowflake
- Owner: gakas14
- Created: 2024-06-11T04:27:22.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-06-11T13:45:27.000Z (over 1 year ago)
- Last Synced: 2025-01-17T04:45:55.902Z (9 months ago)
- Topics: apache-airflow, apache-spark, aws, aws-emr, snowflake
- Language: Python
- Homepage: https://medium.com/@abdoulkaled/building-a-batch-etl-pipeline-using-airflow-spark-emr-and-snowflake-05559bb9799a
- Size: 13.7 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# ETL pipeline using Airflow, Spark, EMR, and Snowflake
### This project joins the hourly_ridership (60M records) and wifi_location (300 records) datasets on a shared column and calculates the total daily ridership broken down by three other columns.
- hourly_ridership: Hourly subway ridership estimates by subway station complex and class of fare payment. Link.
- wifi_location: The MTA (Metropolitan Transportation Authority) contracted with Transit Wireless to provide all subway stations with Wi-Fi access and cell service. This dataset is a snapshot of the stations where Wi-Fi was available in parts of 2015 and 2016. Link.
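
The transformation itself amounts to a broadcast join plus a grouped aggregation. Below is a minimal PySpark sketch of that shape; the column names (`station_complex_id`, `transit_timestamp`, `borough`, `payment_method`, `ridership`) and S3 paths are illustrative assumptions, not the repo's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mta-daily-ridership").getOrCreate()

# Assumed input locations; the repo's real paths and formats will differ.
ridership = spark.read.parquet("s3://<bucket>/raw/hourly_ridership/")
wifi = spark.read.parquet("s3://<bucket>/raw/wifi_location/")

# wifi_location has ~300 rows, so broadcasting it avoids shuffling
# the 60M-row ridership table for the join.
joined = ridership.join(F.broadcast(wifi), on="station_complex_id", how="inner")

# Roll the hourly estimates up to daily totals, broken down by three
# columns (assumed here: borough, payment method, station complex).
daily = (
    joined
    .withColumn("transit_date", F.to_date("transit_timestamp"))
    .groupBy("transit_date", "borough", "payment_method", "station_complex_id")
    .agg(F.sum("ridership").alias("total_ridership"))
)

daily.write.mode("overwrite").parquet("s3://<bucket>/curated/daily_ridership/")
```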
### The project uses Airflow to orchestrate and manage the data pipeline, creating and terminating a transient EMR cluster. Apache Spark transforms the data, and the final dataset is loaded into Snowflake.
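
The Amazon provider ships operators for exactly this create → submit → wait → terminate lifecycle. The following is a minimal DAG sketch, assuming Airflow 2 with `apache-airflow-providers-amazon` installed; the cluster config, step definition, and connection IDs are placeholders rather than the repo's actual values.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.emr import (
    EmrAddStepsOperator,
    EmrCreateJobFlowOperator,
    EmrTerminateJobFlowOperator,
)
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

# Placeholder transient-cluster config; tune instance types/counts as needed.
JOB_FLOW_OVERRIDES = {
    "Name": "transient-etl-cluster",
    "ReleaseLabel": "emr-6.15.0",
    "Applications": [{"Name": "Spark"}],
    "Instances": {
        "InstanceGroups": [
            {
                "Name": "Primary node",
                "InstanceRole": "MASTER",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            }
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "TerminationProtected": False,
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# Placeholder spark-submit step pointing at the transformation script in S3.
SPARK_STEPS = [
    {
        "Name": "transform_ridership",
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://<bucket>/scripts/etl_job.py"],
        },
    }
]

with DAG(
    dag_id="batch_etl_airflow_spark_emr_snowflake",
    start_date=datetime(2024, 6, 1),
    schedule=None,
    catchup=False,
) as dag:
    # Spin up the transient cluster.
    create_cluster = EmrCreateJobFlowOperator(
        task_id="create_emr_cluster",
        job_flow_overrides=JOB_FLOW_OVERRIDES,
        aws_conn_id="aws_default",
    )

    # Submit the Spark transformation as an EMR step.
    add_step = EmrAddStepsOperator(
        task_id="add_spark_step",
        job_flow_id=create_cluster.output,
        steps=SPARK_STEPS,
        aws_conn_id="aws_default",
    )

    # Block until the step finishes (or fails).
    watch_step = EmrStepSensor(
        task_id="watch_spark_step",
        job_flow_id=create_cluster.output,
        step_id="{{ task_instance.xcom_pull(task_ids='add_spark_step')[0] }}",
        aws_conn_id="aws_default",
    )

    # Tear the cluster down so it only costs money while the job runs.
    terminate_cluster = EmrTerminateJobFlowOperator(
        task_id="terminate_emr_cluster",
        job_flow_id=create_cluster.output,
        aws_conn_id="aws_default",
    )

    create_cluster >> add_step >> watch_step >> terminate_cluster
```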
#### Read more: https://abdoulkaled.medium.com/building-a-batch-etl-pipeline-using-airflow-spark-emr-and-snowflake-05559bb9799a
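
For the final load, a common pattern (assumed here; the post above describes the repo's actual approach) is a `COPY INTO` from an external stage over the Spark output, issued from the DAG via the Snowflake provider. The table, stage, and connection names below are illustrative.

```python
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# This task would slot into the DAG above after watch_step (the data is
# already in S3, so it can also run after cluster termination).
load_to_snowflake = SnowflakeOperator(
    task_id="load_to_snowflake",
    snowflake_conn_id="snowflake_default",
    sql="""
        COPY INTO analytics.daily_ridership
        FROM @etl_stage/curated/daily_ridership/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
    """,
)
```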
