Project repository: https://github.com/cschan1828/data-warehouse-with-aws


# Introduction
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. Now they want to move their processes and data onto AWS. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.

Consequently, a data engineer is required to build an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.
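
To make the shape of that pipeline concrete, below is a minimal sketch of the staging-then-transform pattern on Redshift. The table names, columns, bucket, and IAM role are placeholders for illustration only, not the project's actual schema (which lives in the repository's Python/SQL files).

```python
# Illustrative only: placeholder table names, columns, S3 bucket, and IAM role.

# 1) Stage raw JSON from S3 into Redshift with COPY.
COPY_STAGING_EVENTS = """
    COPY staging_events
    FROM 's3://example-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-role'
    REGION 'us-west-2'
    FORMAT AS JSON 'auto';
"""

# 2) Transform staged rows into a dimensional (star-schema) table.
INSERT_SONGPLAYS = """
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM   staging_events e
    JOIN   staging_songs  s
      ON   e.song = s.title AND e.artist = s.artist_name
    WHERE  e.page = 'NextSong';
"""
```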

# Dataset
The data consists of the following two sub-datasets.

### Song Dataset
The first dataset is a subset of real data from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID.
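
For orientation, a single song file might look like the record below. This example is illustrative, drawn from the publicly documented Million Song Dataset subset rather than from this repository:

```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```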

### Log Dataset
The second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the songs in the dataset above. These simulate app activity logs from an imaginary music streaming app based on configuration settings. The log files are partitioned by year and month.
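
Each line in a log file is one JSON event. The field names below follow eventsim's output; the values are purely illustrative:

```json
{
    "artist": "Muse",
    "auth": "Logged In",
    "firstName": "Jordan",
    "gender": "F",
    "itemInSession": 3,
    "lastName": "Hicks",
    "length": 259.26485,
    "level": "free",
    "location": "Salinas, CA",
    "method": "PUT",
    "page": "NextSong",
    "registration": 1540008898796.0,
    "sessionId": 814,
    "song": "Supermassive Black Hole",
    "status": 200,
    "ts": 1543190563796,
    "userAgent": "Mozilla/5.0",
    "userId": "37"
}
```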

# Relevant Files
* create_tables.py - drops existing tables and re-creates them.
* sql_queries.py - queries data from the warehouse for verification.
* etl.py - performs the ETL that extracts JSON data from S3 and ingests it into Redshift (see the sketch after this list).
* dhw.cfg - configuration file that contains info about Redshift, IAM, and S3.
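
As a rough sketch of how etl.py and dhw.cfg could fit together, assuming psycopg2 and a config file with a CLUSTER section (the section and key names below are assumptions for illustration, not the project's actual layout):

```python
import configparser

import psycopg2

# Read cluster and credential info from the config file; section and key
# names here are assumed for illustration.
config = configparser.ConfigParser()
config.read("dhw.cfg")

conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)
cur = conn.cursor()

# Placeholder query lists: in the real project these would hold the COPY
# statements that stage the S3 JSON and the INSERT statements that populate
# the dimensional tables (as in the sketch in the Introduction).
copy_table_queries = []
insert_table_queries = []

for query in copy_table_queries + insert_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```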