https://github.com/bondxue/Data-Warehouse-with-AWS
:mushroom:Udacity Data Engineering Nanodegree Project 3
aws-s3 data-warehouse redshift
- Host: GitHub
- URL: https://github.com/bondxue/Data-Warehouse-with-AWS
- Owner: bondxue
- License: mit
- Created: 2019-05-03T22:42:54.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-07-14T05:36:42.000Z (almost 6 years ago)
- Last Synced: 2024-08-14T07:09:19.823Z (8 months ago)
- Topics: aws-s3, data-warehouse, redshift
- Language: Python
- Homepage:
- Size: 48.8 KB
- Stars: 12
- Watchers: 1
- Forks: 18
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- jimsghstars - bondxue/Data-Warehouse-with-AWS - :mushroom:Udacity Data Engineering Nanodegree Project 3 (Python)
README
# Project 3: Song Play Analysis with S3 and Redshift
-------------------------
![project passed](https://img.shields.io/badge/project-passed-success.svg)

### Introduction
In this project, we help a music streaming startup, Sparkify, move their user and song databases onto the cloud. Specifically, I build an ETL pipeline that extracts their data from **AWS S3** (data storage), stages it in tables on **AWS Redshift** (a data warehouse with *columnar storage*), and executes **SQL** statements that create the analytics tables from these staging tables.
### Datasets
The datasets used in this project are provided in two public **S3 buckets**. One bucket contains info about songs and artists; the second contains info about actions done by users (which song they are listening to, etc.). The objects in both buckets are *JSON* files. The **Redshift** service is where the data is ingested and transformed: through the `COPY` command we access the JSON files inside the buckets and copy their content into our *staging tables*.
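As a rough illustration, a staging-table `COPY` might look like the snippet below, kept as a Python string in the style of `sql_queries.py`. The S3 paths, JSONPaths file, region, and role ARN are placeholders for illustration, not necessarily the values this repo uses:

```python
# Illustrative COPY for the events staging table -- the S3 paths, region and
# IAM role ARN are placeholders; the real values come from the project config.
staging_events_copy = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
"""
```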
### Database Schema
We have two staging tables which hold *copies* of the JSON data from the **S3 buckets**.
#### Staging Tables
+ **staging_songs** - info about songs and artists
+ **staging_events** - actions done by users (which song they are listening to, etc.)

I create a star schema optimized for queries on song play analysis (a sample `CREATE TABLE` sketch follows the table lists below). It includes the following tables.
#### Fact Table
+ **songplays** - records in event data associated with song plays, i.e. records with page `NextSong`

#### Dimension Tables
+ **users** - users in the app
+ **songs** - songs in music database
+ **artists** - artists in music database
+ **time** - timestamps of records in **songplays** broken down into specific units

*(Database star schema diagram)*
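For a concrete feel of the DDL, here is a hypothetical definition of the **songplays** fact table. The column names and types follow the usual Sparkify schema for this Nanodegree project; the repo's `sql_queries.py` is the authoritative version and may differ:

```python
# Hypothetical fact-table DDL -- column names/types follow the common Sparkify
# star schema; the exact definition in this repo's sql_queries.py may differ.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT       NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""
```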
### Data Warehouse Configurations and Setup
* Create a new `IAM user` in your AWS account
* Give it `AdministratorAccess` by attaching the policy
* Use its access key and secret key to create clients for `EC2`, `S3`, `IAM`, and `Redshift`
* Create an `IAM Role` that allows `Redshift` to access the `S3 bucket` (read-only)
* Create a `Redshift Cluster`, get the `DWH_ENDPOINT` (host address) and `DWH_ROLE_ARN`, and fill in the config file (see the sketch below)
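A minimal `boto3` sketch of these setup steps, assuming placeholder names and values (`dwhRole`, `dwhCluster`, node type, password, region); in the project the real values, plus the IAM user's access/secret keys, live in `dwh.cfg`, so this is an illustration rather than the repo's exact setup code:

```python
import json
import boto3

# Placeholder names/values for illustration only -- the project reads the real
# ones (plus the IAM user's access/secret keys) from dwh.cfg.
iam = boto3.client('iam', region_name='us-west-2')
redshift = boto3.client('redshift', region_name='us-west-2')

# IAM role that Redshift can assume to read from S3
iam.create_role(
    RoleName='dwhRole',
    AssumeRolePolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{'Effect': 'Allow',
                       'Principal': {'Service': 'redshift.amazonaws.com'},
                       'Action': 'sts:AssumeRole'}]}),
    Description='Allows Redshift to read from S3 (read-only)')
iam.attach_role_policy(RoleName='dwhRole',
                       PolicyArn='arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess')
role_arn = iam.get_role(RoleName='dwhRole')['Role']['Arn']   # -> DWH_ROLE_ARN

# Redshift cluster; once it is 'available', its endpoint address -> DWH_ENDPOINT
redshift.create_cluster(
    ClusterType='multi-node',
    NodeType='dc2.large',
    NumberOfNodes=4,
    DBName='dwh',
    ClusterIdentifier='dwhCluster',
    MasterUsername='dwhuser',
    MasterUserPassword='Passw0rd123',
    IamRoles=[role_arn])
```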
### ETL Pipeline
+ Create tables to store the data from the `S3 buckets`.
+ Load the data from the `S3 buckets` into staging tables in the `Redshift Cluster`.
+ Insert data into fact and dimension tables from the staging tables (illustrated below).
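As an example of the last step, a fact-table insert typically joins the two staging tables and keeps only `NextSong` events. The staging column names below are assumptions based on the standard project datasets, not necessarily the repo's exact query:

```python
# Illustrative fact-table load -- staging column names are assumptions based on
# the standard Sparkify datasets; the repo's sql_queries.py may differ.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id,
                           artist_id, session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""
```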
### Project Structure
+ `create_tables.py` - This script drops old tables (if they exist) and re-creates new tables.
+ `etl.py` - This script executes the queries that extract the `JSON` data from the `S3 bucket` and ingest it into `Redshift`.
+ `sql_queries.py` - This file contains variables holding the SQL statements as strings, partitioned into `CREATE`, `DROP`, `COPY` and `INSERT` statements.
+ `dwh.cfg` - Configuration file containing info about `Redshift`, `IAM` and `S3` (a skeleton of how these scripts are wired together follows).
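Below is a rough skeleton of how `create_tables.py` is typically wired together in this project template: read `dwh.cfg` with `configparser`, connect with `psycopg2`, and run the query lists from `sql_queries.py`. The config section name and query-list names are assumptions from the standard template, not necessarily this repo's exact code:

```python
import configparser
import psycopg2

# Assumed to exist in sql_queries.py (standard project template naming)
from sql_queries import create_table_queries, drop_table_queries


def main():
    # The [CLUSTER] section of dwh.cfg is assumed to hold host, dbname, user,
    # password and port for the Redshift cluster.
    config = configparser.ConfigParser()
    config.read('dwh.cfg')

    conn = psycopg2.connect(
        "host={} dbname={} user={} password={} port={}".format(
            *config['CLUSTER'].values()))
    cur = conn.cursor()

    # Drop old tables first, then create fresh ones
    for query in drop_table_queries + create_table_queries:
        cur.execute(query)
        conn.commit()

    conn.close()


if __name__ == "__main__":
    main()
```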
### How to Run
1. Create the tables by running `create_tables.py`.
2. Execute the ETL process by running `etl.py`.