https://github.com/undisputed-jay/aws-s3-integration-with-snowflake
This project sets up an ETL pipeline to load Citibike trip data from an AWS S3 bucket into Snowflake. It establishes a secure integration with S3, defines a CSV file format, stages the data, and loads it into a Snowflake table for analysis.
- Host: GitHub
- URL: https://github.com/undisputed-jay/aws-s3-integration-with-snowflake
- Owner: Undisputed-jay
- Created: 2023-06-09T22:12:28.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-11-03T22:27:03.000Z (over 1 year ago)
- Last Synced: 2025-10-06T14:41:40.226Z (8 months ago)
- Topics: aws-s3, snowflake, sql
- Homepage:
- Size: 7.81 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Citibike Data Integration and ETL Pipeline
This repository contains SQL scripts and configurations for a data integration and ETL pipeline that loads Citibike trip data from an Amazon S3 bucket into a Snowflake database. This pipeline uses Snowflake's external storage integration and staging capabilities to access, format, and transform Citibike data for analysis.
Features
- AWS S3 and Snowflake Integration: Connects to a designated S3 bucket through a Snowflake STORAGE INTEGRATION and an IAM role, so no AWS access keys need to be stored in Snowflake.
- Data Staging and File Formatting: Defines a reusable file format so that CSV files are parsed consistently, including how null markers and empty fields are handled.
- Automated Data Loading: Exposes the raw CSV files in S3 through a Snowflake stage and copies them into a structured Snowflake table.
- Scalable ETL Process: The same integration, file format, and stage pattern can be extended to additional Citibike datasets or other data formats.
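The integration, file format, and stage described above could be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the object names (s3_citibike_int, citibike_csv, citibike_stage), the angle-bracket placeholders, and the specific file format options are all assumptions chosen for the example.

```sql
-- Hypothetical names and placeholders; replace with your own values.
CREATE OR REPLACE STORAGE INTEGRATION s3_citibike_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<aws_account_id>:role/<role_name>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<path>/');

-- Reusable CSV format: skip the header row and treat common
-- null markers and empty fields as SQL NULL (assumed options).
CREATE OR REPLACE FILE FORMAT citibike_csv
  TYPE = CSV
  FIELD_DELIMITER = ','
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  NULL_IF = ('NULL', 'null', '')
  EMPTY_FIELD_AS_NULL = TRUE;

-- External stage that points at the bucket through the integration.
CREATE OR REPLACE STAGE citibike_stage
  URL = 's3://<bucket>/<path>/'
  STORAGE_INTEGRATION = s3_citibike_int
  FILE_FORMAT = (FORMAT_NAME = 'citibike_csv');

-- Sanity check: list the files Snowflake can see in the stage.
LIST @citibike_stage;
```

Attaching the file format to the stage means later load statements do not need to repeat the parsing options.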
Table Structure
The pipeline loads data into a trips table, which includes columns for trip duration, start and end times, station details, bike IDs, and user demographic information. This structured format enables quick querying and supports further analytics.
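As an illustration of what such a table might look like, here is a sketch of a trips definition. The column names and types are assumptions based on the commonly published Citibike trip CSV layout and may differ from the ones used in the repository's scripts.

```sql
-- Assumed schema; adjust names and types to match the actual CSV files.
CREATE OR REPLACE TABLE trips (
  tripduration            INTEGER,    -- trip length in seconds
  starttime               TIMESTAMP,
  stoptime                TIMESTAMP,
  start_station_id        INTEGER,
  start_station_name      STRING,
  start_station_latitude  FLOAT,
  start_station_longitude FLOAT,
  end_station_id          INTEGER,
  end_station_name        STRING,
  end_station_latitude    FLOAT,
  end_station_longitude   FLOAT,
  bikeid                  INTEGER,
  usertype                STRING,     -- user demographic fields
  birth_year              INTEGER,
  gender                  INTEGER
);
```

The column order matters for a plain COPY INTO load, which maps CSV fields to table columns by position.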
Usage
- Configure Storage Integration: Give Snowflake permission to access the S3 bucket by setting STORAGE_AWS_ROLE_ARN to the ARN of an IAM role that can read the bucket.
- Run SQL Scripts: Execute the SQL scripts in the repository to create the integration, file format, stage, and table, then load the data from S3.
- Query the Data: Run queries on the trips table to analyze Citibike trip data.
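The steps above might look like the following in practice. This is a hedged sketch: s3_citibike_int and citibike_stage are hypothetical object names, the PATTERN and ON_ERROR settings are illustrative choices, and the query assumes starttime and tripduration columns exist on the trips table.

```sql
-- Step 1: read the values needed for the IAM role's trust policy
-- (STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID in the output).
DESC INTEGRATION s3_citibike_int;

-- Step 2: load every CSV file in the stage into the table.
-- The file format is taken from the stage definition.
COPY INTO trips
  FROM @citibike_stage
  PATTERN = '.*[.]csv'
  ON_ERROR = 'CONTINUE';   -- skip bad rows instead of aborting the load

-- Step 3: example analysis, trips and average duration per hour.
SELECT
  DATE_TRUNC('hour', starttime) AS trip_hour,
  COUNT(*)                      AS num_trips,
  AVG(tripduration) / 60        AS avg_duration_minutes
FROM trips
GROUP BY trip_hour
ORDER BY trip_hour;
```

ON_ERROR = 'CONTINUE' keeps a load going when individual rows fail to parse; a stricter setting such as 'ABORT_STATEMENT' may be preferable when silent row loss is unacceptable.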