https://github.com/collumbus/aws-cloud-data-warehouses
A data warehouse project on AWS that builds an ETL pipeline for a database hosted on Redshift. To complete the project, we load data from S3 into staging tables on Redshift and execute SQL statements that create the analytics tables from these staging tables.
- Host: GitHub
- URL: https://github.com/collumbus/aws-cloud-data-warehouses
- Owner: Collumbus
- Created: 2021-12-28T09:23:12.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2022-01-05T18:08:44.000Z (over 4 years ago)
- Last Synced: 2024-12-27T12:28:14.440Z (over 1 year ago)
- Language: Jupyter Notebook
- Size: 9.18 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Project: Build A Cloud Data Warehouse
This is a data warehouse project on AWS that builds an ETL pipeline for a database hosted on Redshift. To complete the project, we load data from S3 into staging tables on Redshift and execute SQL statements that create the analytics tables from these staging tables.
## Introduction
A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. The data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.
As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs users are listening to. You can test your database and ETL pipeline by running queries given to you by the Sparkify analytics team and comparing your results with their expected results.
## Project Datasets
You'll be working with two datasets that reside in S3. Here are the S3 links for each:
* Song data: ```s3://udacity-dend/song_data```
* Log data: ```s3://udacity-dend/log_data```
* Log data JSON path: ```s3://udacity-dend/log_json_path.json```
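The three S3 locations above are the inputs for loading the staging tables. As a minimal sketch, assuming the load uses Redshift `COPY` with an IAM role, the statements could be built as below. The helper name `build_copy`, the role ARN, and the `us-west-2` region are illustrative assumptions, not values taken from this repository.

```python
# S3 locations listed in this README.
LOG_DATA = "s3://udacity-dend/log_data"
LOG_JSONPATH = "s3://udacity-dend/log_json_path.json"
SONG_DATA = "s3://udacity-dend/song_data"

# Placeholder role ARN (assumption); a real one would come from dwh.cfg.
EXAMPLE_ROLE_ARN = "arn:aws:iam::123456789012:role/example-redshift-role"


def build_copy(table, source, iam_role_arn, json_option="auto", region="us-west-2"):
    """Return a Redshift COPY statement for JSON files stored in S3.

    json_option is either 'auto' or the S3 path of a JSONPaths file.
    The default region is an assumption.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{source}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}'\n"
        f"REGION '{region}';"
    )


# Log data is loaded with the JSONPaths file; song data relies on 'auto'.
staging_events_copy = build_copy("staging_events", LOG_DATA, EXAMPLE_ROLE_ARN, LOG_JSONPATH)
staging_songs_copy = build_copy("staging_songs", SONG_DATA, EXAMPLE_ROLE_ARN)

print(staging_events_copy)
```

The JSONPaths file is passed only for the log data, on the assumption that it exists because the log field names do not map onto the staging columns automatically.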
## Song Dataset
The first dataset is a subset of real data from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID that follow the `TR` prefix. For example, here are the file paths of two files in this dataset:
```
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
```
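The partitioning shown in these two paths can be expressed as a small function. This is a sketch inferred only from the two examples: it assumes the partition letters are the three characters that follow the `TR` prefix of the track ID, and `song_file_path` is a hypothetical name.

```python
def song_file_path(track_id):
    """Build the relative path of a song file from its track ID.

    Assumption: the three partition letters are the characters right
    after the 'TR' prefix, as in both example paths.
    """
    first, second, third = track_id[2:5]
    return f"song_data/{first}/{second}/{third}/{track_id}.json"


print(song_file_path("TRABCEI128F424C983"))
print(song_file_path("TRAABJL12903CDCF1A"))
```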
And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.
```
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
```
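To make the record's shape concrete, the sketch below parses the sample above and splits it into the fields that feed the `songs` and `artists` dimension tables described later in this README. It only illustrates the field mapping in plain Python; the project itself performs this step with SQL on Redshift.

```python
import json

# The sample song record shown above.
raw = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", '
    '"artist_latitude": null, "artist_longitude": null, '
    '"artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)
record = json.loads(raw)

# Fields matching the columns of the songs dimension table.
song_row = {key: record[key] for key in ("song_id", "title", "artist_id", "year", "duration")}

# Fields matching the artists dimension table (artist_* prefixes dropped).
artist_row = {
    "artist_id": record["artist_id"],
    "name": record["artist_name"],
    "location": record["artist_location"],
    "latitude": record["artist_latitude"],    # JSON null becomes None
    "longitude": record["artist_longitude"],
}

print(song_row)
print(artist_row)
```

Note that this sample has no coordinates and a `year` of 0, so the warehouse tables need to tolerate missing or placeholder values.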
## Log Dataset
The second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings.
The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.
```
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
```
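The year/month partitioning in these paths can be reproduced from a calendar date. A minimal sketch, with `log_file_path` as a hypothetical helper name:

```python
from datetime import date


def log_file_path(day):
    """Build the relative path of the log file for a given day."""
    return f"log_data/{day:%Y}/{day:%m}/{day:%Y-%m-%d}-events.json"


print(log_file_path(date(2018, 11, 12)))
print(log_file_path(date(2018, 11, 13)))
```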
And below is an example of what the data in a log file, 2018-11-12-events.json, looks like.

## Project Structure
```
Cloud Data Warehouse
|____create_tables.py # Database/table creation script
|____etl.py # ELT builder
|____sql_queries.py # SQL query collections
|____dwh.cfg # AWS configuration file
|____tester.ipynb # Tests the whole process
```
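The contents of `dwh.cfg` are not shown in this README. As a hedged sketch, a configuration file of this kind could be read with Python's standard `configparser`. The section names, key names, and values below are assumptions and placeholders, not the repository's actual configuration.

```python
import configparser

# Hypothetical dwh.cfg contents: section and key names are assumptions,
# and the values are placeholders rather than real credentials.
SAMPLE_CFG = """
[CLUSTER]
HOST = example-cluster.abc123.us-west-2.redshift.amazonaws.com
DB_NAME = dwh
DB_USER = dwhuser
DB_PASSWORD = change-me
DB_PORT = 5439

[IAM_ROLE]
ARN = arn:aws:iam::123456789012:role/example-redshift-role

[S3]
LOG_DATA = s3://udacity-dend/log_data
LOG_JSONPATH = s3://udacity-dend/log_json_path.json
SONG_DATA = s3://udacity-dend/song_data
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CFG)  # a script would call config.read("dwh.cfg") instead

cluster = config["CLUSTER"]
# Connection string in the key=value form accepted by PostgreSQL drivers.
dsn = (
    f"host={cluster['HOST']} dbname={cluster['DB_NAME']} "
    f"user={cluster['DB_USER']} password={cluster['DB_PASSWORD']} "
    f"port={cluster.getint('DB_PORT')}"
)
print(config["S3"]["SONG_DATA"])
```

Keeping credentials in a config file like this keeps them out of the scripts, but the file should not be committed with real secrets.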
## ELT Pipeline
#### etl.py
ELT pipeline builder
1. ```load_staging_tables```
* Load raw data from S3 buckets to Redshift staging tables
2. ```insert_tables```
* Transform staging table data to dimensional tables for data analysis
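The two steps above can be sketched as functions that run lists of SQL statements through a DB-API cursor. This is an illustration, not the repository's code: the function signatures are assumptions, and an in-memory SQLite database stands in for Redshift so that the example runs without a cluster (plain `INSERT` statements take the place of `COPY`).

```python
import sqlite3


def load_staging_tables(cur, conn, copy_queries):
    """Run each staging load statement and commit after it."""
    for query in copy_queries:
        cur.execute(query)
        conn.commit()


def insert_tables(cur, conn, insert_queries):
    """Run each INSERT ... SELECT that fills an analytics table."""
    for query in insert_queries:
        cur.execute(query)
        conn.commit()


# SQLite stand-in for the Redshift connection (assumption for the demo).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE staging_songs (song_id TEXT, title TEXT)")
cur.execute("CREATE TABLE songs (song_id TEXT, title TEXT)")

# Staging deliberately receives a duplicate row.
load_staging_tables(cur, conn, [
    "INSERT INTO staging_songs VALUES ('SOUPIRU12A6D4FA1E1', 'Der Kleine Dompfaff')",
    "INSERT INTO staging_songs VALUES ('SOUPIRU12A6D4FA1E1', 'Der Kleine Dompfaff')",
])

# The transform step removes the duplicate while filling the dimension table.
insert_tables(cur, conn, [
    "INSERT INTO songs SELECT DISTINCT song_id, title FROM staging_songs",
])

song_rows = cur.execute("SELECT song_id, title FROM songs").fetchall()
print(song_rows)
```

Separating the load and transform steps means raw data can be reloaded or inspected in the staging tables without touching the analytics tables.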
#### create_tables.py
Creates the staging, fact, and dimension table schemas
1. ```drop_tables```
2. ```create_tables```
#### sql_queries.py
SQL query statement collections for ```create_tables.py``` and ```etl.py```
1. ```*_table_drop```
2. ```*_table_create```
3. ```staging_*_copy```
4. ```*_table_insert```
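The naming patterns above suggest query strings that are grouped into lists and executed in order. The sketch below shows one possible arrangement for the drop and create queries. The list names `drop_table_queries` and `create_table_queries` and the function signatures are assumptions, and SQLite stands in for Redshift so the example is runnable.

```python
import sqlite3

# Query strings following the *_table_drop / *_table_create naming pattern.
staging_songs_table_drop = "DROP TABLE IF EXISTS staging_songs"
staging_songs_table_create = """
CREATE TABLE IF NOT EXISTS staging_songs (
    artist_id        VARCHAR,
    artist_latitude  FLOAT,
    artist_location  TEXT,
    artist_longitude FLOAT,
    artist_name      VARCHAR,
    duration         FLOAT,
    num_songs        INT,
    song_id          VARCHAR,
    title            VARCHAR,
    year             INT
)
"""

# Hypothetical list names; one entry per table in a full project.
drop_table_queries = [staging_songs_table_drop]
create_table_queries = [staging_songs_table_create]


def drop_tables(cur, conn):
    """Drop every table so the schema can be rebuilt from scratch."""
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()


def create_tables(cur, conn):
    """Create every table listed in create_table_queries."""
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the Redshift connection
cur = conn.cursor()
for _ in range(2):  # running twice shows the reset is repeatable
    drop_tables(cur, conn)
    create_tables(cur, conn)

columns = [row[1] for row in cur.execute("PRAGMA table_info(staging_songs)")]
print(columns)
```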
## Database Schema
#### Staging tables
```
staging_events
    artist VARCHAR,
    auth VARCHAR,
    firstName VARCHAR,
    gender CHAR(1),
    itemInSession INT,
    lastName VARCHAR,
    length FLOAT,
    level VARCHAR,
    location TEXT,
    method VARCHAR,
    page VARCHAR,
    registration VARCHAR,
    sessionId INT,
    song VARCHAR,
    status INT,
    ts BIGINT,
    userAgent TEXT,
    userId INT

staging_songs
    artist_id VARCHAR,
    artist_latitude FLOAT,
    artist_location TEXT,
    artist_longitude FLOAT,
    artist_name VARCHAR,
    duration FLOAT,
    num_songs INT,
    song_id VARCHAR,
    title VARCHAR,
    year INT
```
#### Fact table
```
songplays
    songplay_id INT IDENTITY(0,1),
    start_time TIMESTAMP,
    user_id INT,
    level VARCHAR,
    song_id VARCHAR,
    artist_id VARCHAR,
    session_id INT,
    location TEXT,
    user_agent TEXT
```
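The fact table combines columns from both staging tables. The query below is a hedged sketch of how a `songplay_table_insert` string could populate it on Redshift. It rests on several assumptions that this README does not confirm: that `ts` holds epoch milliseconds, that song plays are the events whose `page` is `'NextSong'`, and that events are matched to songs on title and artist name.

```python
# Sketch only: the join keys, the page filter, and the millisecond
# interpretation of ts are assumptions, not taken from this repository.
songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + (se.ts / 1000) * INTERVAL '1 second' AS start_time,
       se.userId,
       se.level,
       ss.song_id,
       ss.artist_id,
       se.sessionId,
       se.location,
       se.userAgent
FROM staging_events AS se
JOIN staging_songs AS ss
  ON se.song = ss.title
 AND se.artist = ss.artist_name
WHERE se.page = 'NextSong';
"""

print(songplay_table_insert)
```

`songplay_id` is left out of the column list because it is declared as `IDENTITY(0,1)` and is generated by Redshift.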
#### Dimension tables
```
users
    user_id INT,
    first_name VARCHAR,
    last_name VARCHAR,
    gender CHAR(1),
    level VARCHAR

songs
    song_id VARCHAR,
    title VARCHAR,
    artist_id VARCHAR,
    year INT,
    duration FLOAT

artists
    artist_id VARCHAR,
    name VARCHAR,
    location TEXT,
    latitude FLOAT,
    longitude FLOAT

time
    start_time TIMESTAMP,
    hour INT,
    day INT,
    week INT,
    month INT,
    year INT,
    weekday VARCHAR
```
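The columns of the `time` table are all derived from a single timestamp. The sketch below shows that derivation in plain Python, assuming the `ts` value in the logs is an epoch timestamp in milliseconds interpreted in UTC. The function name `time_row`, the use of ISO week numbers, and the weekday-as-name convention are illustrative assumptions; in the project this step is done in SQL.

```python
from datetime import datetime, timezone

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")


def time_row(ts_ms):
    """Break an epoch-millisecond timestamp into the time table's columns."""
    start = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": start,
        "hour": start.hour,
        "day": start.day,
        "week": start.isocalendar()[1],       # ISO week number
        "month": start.month,
        "year": start.year,
        "weekday": WEEKDAYS[start.weekday()],  # weekday() returns 0 for Monday
    }


# Hypothetical timestamp: 2018-11-12 05:20:00 UTC.
row = time_row(1542000000000)
print(row)
```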
## Running the project
To test the entire project flow, run the `tester.ipynb` notebook. It also contains example queries for verifying the data in the data warehouse.
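The notebook's queries are not reproduced in this README. As one hedged example of a verification step, the sketch below counts the rows in each table through a DB-API cursor. `count_rows` is a hypothetical helper, and SQLite stands in for the Redshift connection so the example can run anywhere.

```python
import sqlite3


def count_rows(cur, tables):
    """Return a {table: row_count} mapping.

    Table names are interpolated into the SQL text, so only pass
    trusted, hard-coded names.
    """
    counts = {}
    for table in tables:
        cur.execute(f"SELECT COUNT(*) FROM {table}")
        counts[table] = cur.fetchone()[0]
    return counts


# SQLite stand-in with two of the tables from the schema above.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE users (user_id INT, level TEXT)")
cur.execute("CREATE TABLE songs (song_id TEXT, title TEXT)")
cur.execute("INSERT INTO songs VALUES ('SOUPIRU12A6D4FA1E1', 'Der Kleine Dompfaff')")
conn.commit()

counts = count_rows(cur, ["users", "songs"])
print(counts)
```

An empty analytics table after the pipeline has run is a quick signal that a load or insert step did not work as intended.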