Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/undisputed-jay/aws-data-pipeline-csv-to-parquet-with-glue-and-athena
- Host: GitHub
- URL: https://github.com/undisputed-jay/aws-data-pipeline-csv-to-parquet-with-glue-and-athena
- Owner: Undisputed-jay
- Created: 2024-10-22T03:33:12.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-22T04:29:40.000Z (4 months ago)
- Last Synced: 2024-12-20T07:15:17.752Z (about 2 months ago)
- Language: Python
- Size: 28.5 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
Project Overview
This project automates extracting data from CSV files stored in an S3 bucket, transforming the data with AWS Glue, and writing the transformed data back to S3 in Parquet format. An AWS Glue Crawler then updates the metadata in the AWS Glue Data Catalog so the data can be queried with Amazon Athena, enabling efficient SQL analysis of large datasets.
Workflow Summary
- Data Ingestion (S3 to Glue): CSV files are stored in the staging directory in S3. AWS Glue extracts these files, representing them as dynamic frames in the ETL script.
- ETL Job (see the script sketch after this list):
  - The data from the albums, tracks, and artists CSV files is loaded into AWS Glue dynamic frames.
  - Join transformations combine the three tables on their foreign keys (`artist_id`, `track_id`).
  - Irrelevant or duplicate fields are dropped from the final dataset.
- Data Transformation (Glue to Parquet): The final, transformed data is written back to a designated S3 location in Parquet format with Snappy compression for optimized storage and faster query performance.
- Cataloging with Glue Crawler: The AWS Glue Crawler automatically updates the Glue Data Catalog with the schema and metadata of the Parquet files, making the data queryable with Amazon Athena.
- Querying with Athena: Athena runs SQL queries against the transformed data stored in S3, providing seamless analytics and reporting.
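The repository's actual Glue script is not reproduced here, but a minimal sketch of this workflow using the `awsglue` PySpark API might look like the following. The bucket paths, join keys, and dropped field names are illustrative assumptions, not the project's real values.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Placeholder S3 locations -- substitute the bucket and folders used in this project.
STAGING = "s3://your-bucket/staging"
WAREHOUSE = "s3://your-bucket/datawarehouse"

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)


def read_csv(name: str):
    """Load one CSV from the staging folder as a Glue dynamic frame."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [f"{STAGING}/{name}.csv"]},
        format="csv",
        format_options={"withHeader": True},
    )


albums = read_csv("albums")
tracks = read_csv("tracks")
artists = read_csv("artists")

# Join the three tables on their foreign keys (column names are assumptions).
joined = Join.apply(tracks, albums, "track_id", "track_id")
joined = Join.apply(joined, artists, "artist_id", "artist_id")

# Drop fields that are not needed downstream (illustrative names only).
final = joined.drop_fields(["album_url", "artist_href"])

# Write the result to the datawarehouse folder as Snappy-compressed Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=final,
    connection_type="s3",
    connection_options={"path": WAREHOUSE},
    format="parquet",
    format_options={"compression": "snappy"},
)

job.commit()
```

Snappy is Glue's default Parquet compression, so the explicit `format_options` entry mainly documents the intent.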
Steps for Execution
- Set Up AWS Resources:
  - Create an S3 bucket for storing raw CSV files (staging folder) and transformed data (datawarehouse folder).
  - Upload your CSV files (e.g., `albums.csv`, `tracks.csv`, `artists.csv`) to the S3 staging folder.
- Create an AWS Glue Job:
  - Use the provided code as the script for an AWS Glue Job.
  - Ensure the Glue job is configured to access your S3 bucket and has permission to read and write data.
- Run the Glue Job: Execute the Glue job to perform the ETL process, which transforms the data and writes it back to the S3 datawarehouse folder in Parquet format.
- Use AWS Glue Crawler: Configure a Glue Crawler to scan the datawarehouse folder and update the Glue Data Catalog with the schema of the Parquet files.
- Query Data with Athena: Once the data is cataloged, use Athena to run SQL queries on the transformed data for analysis or reporting. A `boto3` sketch of these last two steps follows this list.
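The crawler run and the Athena query can also be driven from Python with `boto3`. The sketch below assumes the crawler, Glue database, and cataloged table already exist; the crawler name, database, table, output location, and example column are placeholders, not values from this repository.

```python
import time

import boto3

# Placeholder resource names -- adjust to match your own setup.
CRAWLER = "datawarehouse-crawler"
DATABASE = "music_db"
RESULTS = "s3://your-bucket/athena-results/"

glue = boto3.client("glue")
athena = boto3.client("athena")

# Run the crawler so the Parquet schema lands in the Glue Data Catalog,
# then poll until it returns to the READY state.
glue.start_crawler(Name=CRAWLER)
while glue.get_crawler(Name=CRAWLER)["Crawler"]["State"] != "READY":
    time.sleep(15)

# Query the cataloged table with Athena; the table and column names here
# are assumptions about what the crawler produced.
query = athena.start_query_execution(
    QueryString=(
        "SELECT artist_name, COUNT(*) AS track_count "
        "FROM datawarehouse GROUP BY artist_name LIMIT 10"
    ),
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": RESULTS},
)
print("Athena query started:", query["QueryExecutionId"])
```

Athena writes its result files to the configured `OutputLocation`, so any writable S3 prefix works there.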
Features
- CSV to Parquet Transformation: Converts raw CSV files to an optimized Parquet format for efficient querying.
- Data Cleaning and Joining: Joins multiple datasets and removes unnecessary fields before saving.
- S3 Integration: Seamless integration with S3 for both input (CSV) and output (Parquet) storage.
- AWS Glue Crawler: Automatically updates the Data Catalog for Athena queries.
- Scalability: Can easily be expanded to handle additional transformations or datasets.
Requirements
- AWS Glue (Job and Crawler)
- S3 Bucket for storage
- Amazon Athena for querying