https://github.com/joyceannie/data-warehouse-aws

# Data Warehouse with AWS

## Overview

A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. The data resides in S3, in a directory of JSON logs of user activity on the app and a directory of JSON metadata about the songs in the app. The objective of this project is to build an ETL pipeline for a data warehouse: we extract data from S3, stage it in Redshift, and transform it into a set of dimensional tables that the analytics team can use to keep finding insights into what songs users are listening to.

## Datasets

There are two datasets that reside in S3 buckets.

### Song Dataset

The first dataset is a subset of real data from the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song.

Sample Song Record:

> {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

### Log Dataset

The second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the song dataset. These simulate activity logs from a music streaming app based on specified configurations. The log files in the dataset are partitioned by year and month.

Sample Log Record:

> {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}

## Schema

### Fact Table

* fact_songplay: Records from the log data associated with song plays by users.

### Dimension Tables

* dim_user: Users of the Sparkify app. Columns are user_id, first_name, last_name, gender, level.

* dim_song: Songs in the dataset. Columns are song_id, title, artist_id, year, duration.

* dim_artist: Artists in the dataset. Columns are artist_id, name, location, latitude, longitude.

* dim_time: Timestamps of records in songplays broken down into specific units. Columns are start_time, hour, day, week, month, year, weekday.
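
As a sketch, the dim_user definition above might translate into Redshift DDL along these lines, held as a Python string the way a table-creation script would typically carry it (the column types are assumptions; the repository's actual statements may differ):

```
# Hypothetical DDL for dim_user; column types are assumptions.
CREATE_DIM_USER = """
CREATE TABLE IF NOT EXISTS dim_user (
    user_id    INTEGER PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""
```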

## How To Run

You need an AWS account to run this project. Before running the scripts, fill in the configuration file with your AWS credentials and cluster settings.
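
The repository's configuration file is not shown here, but for this kind of project it typically collects the cluster, IAM role, and S3 settings. A hypothetical sketch (every section and key name below is an assumption; 5439 is simply the Redshift default port):

```
[CLUSTER]
HOST=
DB_NAME=sparkify
DB_USER=awsuser
DB_PASSWORD=
DB_PORT=5439

[IAM_ROLE]
ARN=

[S3]
LOG_DATA=s3://<bucket>/log_data
SONG_DATA=s3://<bucket>/song_data
```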

Create the Redshift cluster by running create_cluster_iac.py.

```
$ python create_cluster_iac.py
```
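
The script name suggests the cluster is provisioned as infrastructure-as-code. A minimal sketch of how such a script might create a Redshift cluster with boto3 (the config file name, config keys, region, node type, and cluster identifier are all assumptions, not taken from the repo):

```
import configparser
import boto3

# Read cluster settings from the config file (file and key names assumed).
config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client("redshift", region_name="us-west-2")

# Request a small multi-node cluster; values mirror the config sketch above.
redshift.create_cluster(
    ClusterType="multi-node",
    NumberOfNodes=4,
    NodeType="dc2.large",
    ClusterIdentifier="sparkify-cluster",
    DBName=config["CLUSTER"]["DB_NAME"],
    MasterUsername=config["CLUSTER"]["DB_USER"],
    MasterUserPassword=config["CLUSTER"]["DB_PASSWORD"],
    IamRoles=[config["IAM_ROLE"]["ARN"]],
)
```

A real script would also poll describe_clusters until the cluster status becomes available before reading its endpoint into the config.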

Run create_tables.py to create the staging tables.

```
$ python create_tables.py
```
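
create_tables.py likely connects to the cluster and executes the DDL. A minimal, hedged sketch using psycopg2 (the config file name, config keys, and the single statement shown are assumptions):

```
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")  # config file name is an assumption

# One example statement; the real script would carry DDL for the staging
# tables and all the analytics tables. Column types are assumptions.
create_table_queries = [
    """
    CREATE TABLE IF NOT EXISTS dim_artist (
        artist_id VARCHAR PRIMARY KEY,
        name      VARCHAR,
        location  VARCHAR,
        latitude  FLOAT,
        longitude FLOAT
    );
    """,
]

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()
for query in create_table_queries:
    cur.execute(query)
conn.commit()
conn.close()
```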

Run etl.py to load data from the staging tables into the analytics tables on Redshift.

```
$ python etl.py
```
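
etl.py presumably issues Redshift COPY commands to pull the JSON files from S3 into the staging tables, then INSERT ... SELECT statements to fill the analytics tables. A hedged sketch of both steps (the staging table name and layout are assumptions; the S3 path and IAM role come from the config sketch above):

```
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")  # config file name is an assumption

# Bulk-load raw JSON events from S3 into a staging table; 'staging_events'
# and its layout are assumptions.
copy_events = """
    COPY staging_events
    FROM '{}'
    IAM_ROLE '{}'
    FORMAT AS JSON 'auto';
""".format(config["S3"]["LOG_DATA"], config["IAM_ROLE"]["ARN"])

# Populate one dimension table from staging, deduplicating users.
insert_dim_user = """
    INSERT INTO dim_user (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
"""

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()
for query in (copy_events, insert_dim_user):
    cur.execute(query)
conn.commit()
conn.close()
```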

Now, you can run analytic queries on your Redshift database.

Delete the Redshift cluster by running delete_cluster_iac.py.

```
$ python delete_cluster_iac.py
```
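
For reference, the teardown in delete_cluster_iac.py might reduce to a single boto3 call (the identifier and region are placeholders matching the creation sketch above):

```
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")

# Drop the cluster without a final snapshot; the identifier is a placeholder.
redshift.delete_cluster(
    ClusterIdentifier="sparkify-cluster",
    SkipFinalClusterSnapshot=True,
)
```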