https://github.com/maxinexiong/cloud-data-warehousing-with-aws-redshift

This project builds a cloud-based ETL pipeline for Sparkify to move data to a cloud data warehouse. It extracts song and user activity data from AWS S3, stages it in Redshift, and transforms it into a star-schema data model with fact and dimension tables, enabling efficient querying to answer business questions.

aws-boto3 aws-redshift aws-s3 cloud-data-warehouse data-warehouse data-warehousing dimensional-model dimensional-modeling etl etl-pipeline extract-transform-load infrastructure-as-code postgresql postgresql-database redshift-cluster


# Cloud Data Warehousing with AWS Redshift

[![GitHub](https://badgen.net/badge/icon/GitHub?icon=github&color=black&label)](https://github.com/MaxineXiong)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Made with Python](https://img.shields.io/badge/Python->=3.6-blue?logo=python&logoColor=white)](https://www.python.org)
[![Amazon Redshift](https://img.shields.io/badge/Amazon_Redshift-8C4FFF?logo=Amazon+Redshift&logoColor=white)](https://aws.amazon.com/redshift/)
[![Amazon S3](https://img.shields.io/badge/Amazon_S3-569A31?logo=Amazon+S3&logoColor=white)](https://aws.amazon.com/s3/)


## Project Description

This project helps *Sparkify*, a music streaming startup, move its data and processes to the cloud. The goal is to build an **ETL pipeline** that extracts data from AWS S3, loads it into staging tables in a database hosted on Amazon Redshift, and transforms it into a star-schema data model to support analytics. The pipeline handles large volumes of song metadata and user activity logs stored in S3, bringing them into a Redshift cluster for analysis by *Sparkify*'s data team. The final output is a star-schema dimensional model in Redshift whose fact and dimension tables support efficient querying to answer business questions such as which songs are most popular, how users listen, and when activity peaks.

The ETL pipeline involves the following steps:

1. **Extracting** song metadata and user activity logs from **S3**.
2. **Loading** the data into **staging tables** in a **Redshift cluster**.
3. **Transforming** the staging data into **fact and dimension tables** following a star schema.

The image below demonstrates the ETL process of moving data from S3 to Redshift:

![overall-process](https://github.com/user-attachments/assets/5c89b88f-d416-4af9-b85f-ed160c4efecc)

The project implements a scalable cloud solution for *Sparkify*'s analytics team to gain insights from their user and song data.
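
As a rough, minimal sketch of what the pipeline driver in `2_etl.py` might look like (the query lists, `dwh.cfg` section and key names below are illustrative assumptions, not the repository's actual code):

```
import configparser
import psycopg2

# Illustrative placeholders: the real project defines these lists in sql_queries.py.
copy_table_queries = ["COPY staging_events FROM ...", "COPY staging_songs FROM ..."]
insert_table_queries = ["INSERT INTO songplays ...", "INSERT INTO users ..."]


def load_staging_tables(cur, conn):
    """Steps 1-2: extract raw JSON from S3 and load it into Redshift staging tables via COPY."""
    for query in copy_table_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn):
    """Step 3: transform staging data into the star-schema fact and dimension tables."""
    for query in insert_table_queries:
        cur.execute(query)
        conn.commit()


def main():
    config = configparser.ConfigParser()
    config.read("dwh.cfg")  # assumed to hold host, dbname, user, password, and port
    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(*config["CLUSTER"].values())
    )
    cur = conn.cursor()
    load_staging_tables(cur, conn)
    insert_tables(cur, conn)
    conn.close()


if __name__ == "__main__":
    main()
```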


## Project Data

The project relies on two datasets stored in AWS S3:

- **Song Data**: Metadata about songs and artists, stored in JSON format in the path `s3://udacity-dend/song_data`.
- **Log Data**: User activity logs generated by the *Sparkify* app, stored in JSON format in the path `s3://udacity-dend/log_data`.

Additionally, the JSON metadata file `s3://udacity-dend/log_json_path.json` describes how the log records are structured, so Redshift can parse them correctly when loading the log data into the staging tables.
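
For illustration, the log staging `COPY` would reference this JSONPaths file, while the song data, whose fields already match the staging columns, can use `json 'auto'`. The statements below follow the style of `sql_queries.py`; the table names, IAM role ARN, and region are assumptions:

```
# Hypothetical staging COPY statements; table names, the IAM role ARN,
# and the region are placeholders, not the repository's actual values.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-s3-read-role>'
    REGION 'us-west-2'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json';
"""

staging_songs_copy = """
    COPY staging_songs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-s3-read-role>'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
"""
```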

The song dataset consists of JSON files partitioned by the first three letters of each song’s track ID. For example, here are file paths to two files in the song dataset:

```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```

Below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

```
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```

The log dataset comprises log files in JSON format, partitioned by year and month. For example, here are file paths to two files in this dataset:

```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```

The image below shows what the data in the log file 2018-11-12-events.json looks like:

![log-data](https://github.com/user-attachments/assets/e127ca71-f26a-4d47-a055-731ce268c524)

These datasets are processed and transformed into a **star-schema data model** in Redshift, as shown in the Entity Relationship Diagram (ERD) below, consisting of fact and dimension tables to facilitate analysis.

![ERD](https://github.com/user-attachments/assets/e590483b-9d90-40d9-8a7b-0baafabe4565)
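
As a hedged illustration of the DDL and transform behind this model (the actual schema lives in `sql_queries.py`; the column names, `DISTKEY`/`SORTKEY` choices, and staging column names below follow a common Sparkify layout and are assumptions):

```
# Hypothetical fact-table DDL and load, in the style of sql_queries.py.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL SORTKEY,
        user_id     INT       NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR   DISTKEY,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```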


## Repository Structure

The repository is structured as follows:

```
Sparkify-ETL-Pipeline/
├── 0_launch_Redshift_cluster.ipynb
├── 1_create_tables.py
├── 2_etl.py
├── 3_test_dimensional_model.ipynb
├── sql_queries.py
├── dwh.cfg
├── .gitignore
├── README.md
└── LICENSE
```

- **0_launch_Redshift_cluster.ipynb**: Jupyter notebook that sets up and configures the Amazon Redshift cluster used in the ETL process (a minimal provisioning sketch appears after this list).
- **1_create_tables.py**: Python script responsible for creating the staging, fact, and dimension tables in the Redshift database.
- **2_etl.py**: Python script that extracts data from S3, loads it into staging tables on Redshift, and then transforms it into the target fact and dimension tables.
- **3_test_dimensional_model.ipynb**: Jupyter notebook used for testing and verifying the data loading process, validating the schema, and running analytic queries.
- **sql_queries.py**: Contains all the SQL queries required for creating tables and performing the ETL operations.
- **dwh.cfg**: Configuration file that stores Redshift cluster, database, and AWS credentials.
- **.gitignore**: Specifies files and directories for Git to ignore, helping to manage sensitive data and unnecessary files.
- **README.md**: Provides an overview and instructions for this repository.
- **LICENSE**: The license file for the project.
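
As a rough sketch of the provisioning step in `0_launch_Redshift_cluster.ipynb` (the `dwh.cfg` section and key names, cluster identifier, node type, and region below are assumptions, not the notebook's actual values):

```
import configparser
import boto3

config = configparser.ConfigParser()
config.read("dwh.cfg")  # assumed to hold AWS credentials and cluster settings

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",  # placeholder region
    aws_access_key_id=config.get("AWS", "KEY"),
    aws_secret_access_key=config.get("AWS", "SECRET"),
)

# Provision a small cluster for the project (all parameters are illustrative).
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="sparkify-cluster",
    DBName="sparkifydb",
    MasterUsername=config.get("CLUSTER", "DB_USER"),
    MasterUserPassword=config.get("CLUSTER", "DB_PASSWORD"),
    IamRoles=[config.get("IAM_ROLE", "ARN")],
)

# When the project is finished, the same client can tear the cluster down:
# redshift.delete_cluster(ClusterIdentifier="sparkify-cluster",
#                         SkipFinalClusterSnapshot=True)
```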


## Usage

1. **Launch Redshift Cluster**: First, configure and launch the Redshift cluster using the `0_launch_Redshift_cluster.ipynb` notebook. This step sets up the target database on Redshift.
2. **Create Tables**: Run `1_create_tables.py` to create the staging, fact, and dimension tables in Redshift. This script can be rerun to reset the database if needed.
3. **Run ETL Pipeline**: Execute `2_etl.py` to load data from S3 into the staging tables in Redshift using the `COPY` command, and then insert the data into the fact and dimension tables using the staging tables.
4. **Test Dimensional Model**: Use `3_test_dimensional_model.ipynb` to validate the schema, check row counts, and run analytic queries to ensure that the model is ready for analytical workloads (an example query is sketched after this list).
5. **Tear Down Cluster**: After completing the project, return to the final step in `0_launch_Redshift_cluster.ipynb` to delete the Redshift cluster and clean up associated resources.
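
For example, an analytic query of the kind run in step 4 might look like this (a hypothetical sketch; the table and column names assume the star schema outlined above):

```
# Which songs are played most often? (illustrative query, not from the repository)
top_songs_query = """
    SELECT s.title, a.name AS artist, COUNT(*) AS plays
    FROM songplays sp
    JOIN songs s   ON sp.song_id = s.song_id
    JOIN artists a ON sp.artist_id = a.artist_id
    GROUP BY s.title, a.name
    ORDER BY plays DESC
    LIMIT 10;
"""
```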


## Contribution

Contributions to this project are welcome. If you'd like to improve the ETL pipeline or add new functionality, please fork the repository, create a new branch, and submit a pull request. Ensure that your code follows best practices and is well documented.


## License

This project is licensed under the [MIT License](https://choosealicense.com/licenses/mit/). Feel free to use, modify, and distribute the project in accordance with the terms of the license.


## Acknowledgement

Special thanks to [Udacity](https://www.udacity.com/) for providing the datasets and project specifications. The song and log data used in this project come from the [Million Song Dataset](http://millionsongdataset.com/) and [event simulator](https://github.com/Interana/eventsim) logs provided by Udacity.