https://github.com/undisputed-jay/aws-s3-integration-with-snowflake

This project sets up an ETL pipeline to load Citibike trip data from an AWS S3 bucket into Snowflake. It establishes a secure integration with S3, defines a CSV file format, stages the data, and loads it into a Snowflake table for analysis.

Host: GitHub
URL: https://github.com/undisputed-jay/aws-s3-integration-with-snowflake
Owner: Undisputed-jay
Created: 2023-06-09T22:12:28.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2024-11-03T22:27:03.000Z (over 1 year ago)
Last Synced: 2025-10-06T14:41:40.226Z (10 months ago)
Topics: aws-s3, snowflake, sql
Homepage:
Size: 7.81 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

Citibike Data Integration and ETL Pipeline

This repository contains SQL scripts and configurations for a data integration and ETL pipeline that loads Citibike trip data from an Amazon S3 bucket into a Snowflake database. This pipeline uses Snowflake's external storage integration and staging capabilities to access, format, and transform Citibike data for analysis.

Features

AWS S3 and Snowflake Integration: Connects to a designated S3 bucket through a Snowflake STORAGE INTEGRATION that assumes an AWS IAM role, so no AWS access keys need to be stored in stage definitions.

Data Staging and File Formatting: Defines a reusable CSV file format so that every load parses files the same way, including how null values and missing fields are handled.

Automated Data Loading: Reads the raw CSV files in S3 through a Snowflake stage and copies them into a structured Snowflake table, applying the file format as the data loads.

Scalable ETL Process: The integration, file format, and stage can be reused for additional Citibike datasets or extended to other data formats, supporting future data ingestion and transformations.
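
As a rough illustration of the integration, file format, and stage described in the features above, the objects might be created along these lines. This is a minimal sketch, not the repository's actual script: the object names (citibike_s3_int, citibike_csv, citibike_stage), the placeholder ARN and bucket path, and the specific file format options are all assumptions chosen for the example.

```sql
-- Hypothetical names and placeholders throughout; substitute your own values.

-- Storage integration: lets Snowflake assume an IAM role instead of using access keys.
CREATE OR REPLACE STORAGE INTEGRATION citibike_s3_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<aws_account_id>:role/<role_name>'
  STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<path>/');

-- Reusable CSV file format; the option values here are assumed, not taken from the repo.
CREATE OR REPLACE FILE FORMAT citibike_csv
  TYPE = CSV
  FIELD_DELIMITER = ','
  FIELD_OPTIONALLY_ENCLOSED_BY = '"'
  NULL_IF = ('NULL', 'null', '')          -- treat these strings as SQL NULL
  EMPTY_FIELD_AS_NULL = TRUE              -- empty fields become NULL
  ERROR_ON_COLUMN_COUNT_MISMATCH = FALSE; -- tolerate rows with missing trailing columns

-- External stage pointing at the bucket through the integration.
CREATE OR REPLACE STAGE citibike_stage
  URL = 's3://<bucket>/<path>/'
  STORAGE_INTEGRATION = citibike_s3_int
  FILE_FORMAT = (FORMAT_NAME = 'citibike_csv');

-- Quick check that Snowflake can see the files.
LIST @citibike_stage;
```

Attaching the file format to the stage means later load statements do not have to repeat the parsing options.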

Table Structure

The pipeline loads data into a trips table, which includes columns for trip duration, start and end times, station details, bike IDs, and user demographic information. This structured format enables quick querying and supports further analytics.
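
A table matching that description could look roughly like the following. The README does not list the exact columns, so the column names and types below are assumptions modeled on the public Citibike trip dataset, not the repository's definition.

```sql
-- Assumed column names and types; adjust to match the actual CSV layout.
CREATE OR REPLACE TABLE trips (
  tripduration            INTEGER,    -- trip length in seconds (assumed unit)
  starttime               TIMESTAMP,
  stoptime                TIMESTAMP,
  start_station_id        INTEGER,
  start_station_name      STRING,
  start_station_latitude  FLOAT,
  start_station_longitude FLOAT,
  end_station_id          INTEGER,
  end_station_name        STRING,
  end_station_latitude    FLOAT,
  end_station_longitude   FLOAT,
  bikeid                  INTEGER,
  usertype                STRING,     -- user demographic fields
  birth_year              INTEGER,
  gender                  INTEGER
);
```

The column order matters when loading with a plain COPY INTO, because CSV fields are mapped to table columns by position.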

Usage

Configure Storage Integration: Give Snowflake access to the S3 bucket by setting STORAGE_AWS_ROLE_ARN to the ARN of an IAM role that can read the bucket, and make sure that role's trust policy allows Snowflake to assume it.

Run SQL Scripts: Execute the SQL scripts in the repository to create the integration, file format, stage, and table, then load the data from S3.

Query the Data: Run queries on the trips table to analyze Citibike trip data.
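
The steps above can be sketched as follows, assuming the integration, stage, file format, and table already exist. The names citibike_s3_int, citibike_stage, and citibike_csv are hypothetical, and the column names in the final query are assumed rather than confirmed by the README.

```sql
-- Step 1: look up the values needed for the IAM role's trust policy.
-- The output includes STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID.
DESC INTEGRATION citibike_s3_int;

-- Step 2: load the staged CSV files into the table.
COPY INTO trips
  FROM @citibike_stage
  FILE_FORMAT = (FORMAT_NAME = 'citibike_csv')
  ON_ERROR = 'CONTINUE';   -- assumed choice: skip bad rows instead of aborting the load

-- Step 3: example analysis, the ten busiest start stations.
SELECT
  start_station_name,
  COUNT(*)               AS trip_count,
  AVG(tripduration) / 60 AS avg_trip_minutes  -- assumes tripduration is in seconds
FROM trips
GROUP BY start_station_name
ORDER BY trip_count DESC
LIMIT 10;
```

ON_ERROR = 'CONTINUE' is convenient for exploration; for stricter loads, the default behavior of aborting the statement on the first error may be preferable.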
