Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/joyceannie/data-warehouse-aws
A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. The data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app. The objective of the project is to create an ETL pipeline that builds a data warehouse: we extract data from S3, stage it in Redshift, and transform it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.
aws aws-s3 data-warehouse python3 redshift redshift-cluster
Last synced: 24 days ago
- Host: GitHub
- URL: https://github.com/joyceannie/data-warehouse-aws
- Owner: joyceannie
- Created: 2022-06-17T21:28:27.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2022-06-22T13:52:06.000Z (over 2 years ago)
- Last Synced: 2023-03-05T11:33:03.348Z (almost 2 years ago)
- Topics: aws, aws-s3, data-warehouse, python3, redshift, redshift-cluster
- Language: Jupyter Notebook
- Homepage:
- Size: 17.6 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Data Warehouse with AWS
## Overview
A music streaming startup, Sparkify, has grown their user base and song database and wants to move their processes and data onto the cloud. The data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in their app. The objective of the project is to create an ETL pipeline that builds a data warehouse: we extract data from S3, stage it in Redshift, and transform it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.
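In Redshift terms, that flow is essentially a COPY from S3 into staging tables followed by INSERT ... SELECT statements into the analytics tables. The sketch below is only an illustration of that idea, not the repository's actual code; the bucket paths, IAM role ARN, staging column names, and the NextSong filter are assumptions.

```python
# A rough sketch of the staging + transformation steps; not the repository's actual code.
# Bucket paths, the IAM role ARN, and staging column names below are assumptions.
import psycopg2

COPY_EVENTS = """
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    FORMAT AS JSON 's3://udacity-dend/log_json_path.json'
    REGION 'us-west-2';
"""

INSERT_SONGPLAYS = """
    INSERT INTO fact_songplay (start_time, user_id, level, song_id, artist_id,
                               session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    LEFT JOIN staging_songs s
      ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
"""

def run_etl(conn_string):
    """Stage the log data from S3, then load the fact table from the staging tables."""
    conn = psycopg2.connect(conn_string)
    cur = conn.cursor()
    cur.execute(COPY_EVENTS)
    cur.execute(INSERT_SONGPLAYS)
    conn.commit()
    conn.close()
```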
## Datasets
There are two datasets that reside in S3 buckets.
### Song Dataset
The first dataset is a subset of real data from the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song.
Sample Song Record:
> {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
### Log Dataset
The second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the song dataset. These simulate activity logs from a music streaming app based on specified configurations. The log files in the dataset are partitioned by year and month.
Sample Log Record:
> {"artist":null,"auth":"Logged In","firstName":"Walter","gender":"M","itemInSession":0,"lastName":"Frye","length":null,"level":"free","location":"San Francisco-Oakland-Hayward, CA","method":"GET","page":"Home","registration":1540919166796.0,"sessionId":38,"song":null,"status":200,"ts":1541105830796,"userAgent":"\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"","userId":"39"}
## Schema
### Fact Table
* fact_songplay: Records log data associated with song plays (one row per song played by a user).
### Dimension Tables
* dim_user: Users of Sparkify app. Columns are user_id, first_name, last_name, gender, level.
* dim_song: Songs in the dataset. Columns are song_id, title, artist_id, year, duration.
* dim_artist: Artists in the dataset. Columns are artist_id, name, location, latitude, longitude.
* dim_time: timestamps of records in songplays broken down into specific units. Columns are start_time, hour, day, week, month, year, weekday.
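The README lists the columns but not the DDL, so the block below is only a sketch of what the Redshift table definitions could look like, written as Python string constants; the fact-table column list, data types, and distribution/sort keys are assumptions rather than the project's actual definitions.

```python
# Hedged sketch of possible Redshift DDL for the star schema.
# Types, dist/sort keys, and the fact-table column list are assumptions.
CREATE_FACT_SONGPLAY = """
    CREATE TABLE IF NOT EXISTS fact_songplay (
        songplay_id  BIGINT IDENTITY(0, 1),
        start_time   TIMESTAMP NOT NULL SORTKEY,
        user_id      INT NOT NULL,
        level        VARCHAR,
        song_id      VARCHAR DISTKEY,
        artist_id    VARCHAR,
        session_id   INT,
        location     VARCHAR,
        user_agent   VARCHAR
    );
"""

CREATE_DIM_USER = """
    CREATE TABLE IF NOT EXISTS dim_user (
        user_id     INT PRIMARY KEY,
        first_name  VARCHAR,
        last_name   VARCHAR,
        gender      CHAR(1),
        level       VARCHAR
    ) DISTSTYLE ALL;
"""
```

dim_song, dim_artist, and dim_time would follow the same pattern using the columns listed above.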
## How To Run
You should have an AWS account to run the project.
Set up the configuration file first (a sketch of its expected contents follows), then create the Redshift cluster by running create_cluster_iac.py.
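The README does not reproduce the configuration file, so the snippet below is only a guess at the kind of dwh.cfg-style file the scripts might read with configparser; the file name, section names, and keys are all assumptions.

```python
# Hedged sketch: reading a dwh.cfg-style configuration file with configparser.
# The file name, sections, and keys are assumptions, not the repository's exact layout.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

# [AWS]      credentials used to provision the cluster
KEY    = config.get("AWS", "KEY")
SECRET = config.get("AWS", "SECRET")

# [CLUSTER]  Redshift connection details used by create_tables.py and etl.py
HOST        = config.get("CLUSTER", "HOST")
DB_NAME     = config.get("CLUSTER", "DB_NAME")
DB_USER     = config.get("CLUSTER", "DB_USER")
DB_PASSWORD = config.get("CLUSTER", "DB_PASSWORD")
DB_PORT     = config.get("CLUSTER", "DB_PORT")

# [IAM_ROLE] ARN that allows Redshift to read the S3 buckets
ROLE_ARN = config.get("IAM_ROLE", "ARN")
```

With the configuration filled in, the first command to run is: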
```
$ python create_cluster_iac.py
```
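create_cluster_iac.py sets up the cluster as infrastructure as code. Its contents aren't shown in this README, but a minimal boto3 sketch of the idea looks roughly like this; the cluster identifier, node type, role ARN, and credentials are placeholders, not the script's real values.

```python
# Hedged boto3 sketch of provisioning a Redshift cluster; not the repository's actual script.
# Identifiers, node sizes, the role ARN, and the password are placeholder assumptions.
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")  # credentials come from your AWS config

redshift.create_cluster(
    ClusterIdentifier="sparkify-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="sparkifydb",
    MasterUsername="dwh_user",
    MasterUserPassword="ReplaceMe123",
    IamRoles=["arn:aws:iam::123456789012:role/redshift-s3-read"],
)

# Wait until the cluster is available, then read its endpoint for the configuration file.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="sparkify-cluster")
cluster = redshift.describe_clusters(ClusterIdentifier="sparkify-cluster")["Clusters"][0]
print(cluster["Endpoint"]["Address"])
```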
Run create_tables.py to create the staging tables.

```
$ python create_tables.py
```

Run etl.py to load data from the staging tables into the analytics tables on Redshift.

```
```
$ python etl.py
```

Now you can run analytic queries on your Redshift database.
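For example, a query for the most-played songs might look like the following sketch; it assumes the table and column names from the schema section and the configuration file sketched earlier.

```python
# Hedged example query: the five most-played songs, joining the fact table to dim_song.
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")  # assumed configuration file, as sketched above

conn = psycopg2.connect(
    host=config.get("CLUSTER", "HOST"),
    dbname=config.get("CLUSTER", "DB_NAME"),
    user=config.get("CLUSTER", "DB_USER"),
    password=config.get("CLUSTER", "DB_PASSWORD"),
    port=config.get("CLUSTER", "DB_PORT"),
)

TOP_SONGS = """
    SELECT s.title, COUNT(*) AS plays
    FROM fact_songplay f
    JOIN dim_song s ON f.song_id = s.song_id
    GROUP BY s.title
    ORDER BY plays DESC
    LIMIT 5;
"""

with conn.cursor() as cur:
    cur.execute(TOP_SONGS)
    for title, plays in cur.fetchall():
        print(title, plays)

conn.close()
```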
Finally, delete the Redshift cluster by running delete_cluster_iac.py.
```
$ python delete_cluster_iac.py
```