{"id":22100490,"url":"https://github.com/joyceannie/data-warehouse-aws","last_synced_at":"2026-04-10T22:32:37.637Z","repository":{"id":37619138,"uuid":"504676312","full_name":"joyceannie/Data-Warehouse-AWS","owner":"joyceannie","description":"A music streaming startup, Sparkify, has grown their user base and song database and want to move their processes and data onto the cloud. The data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app. The objective of the project is to create an ETL pieline to build a datawarehouse . We extract data from S3, stage them in Redshift, and transform data into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to.","archived":false,"fork":false,"pushed_at":"2022-06-22T13:52:06.000Z","size":18,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-05T08:51:11.361Z","etag":null,"topics":["aws","aws-s3","data-warehouse","python3","redshift","redshift-cluster"],"latest_commit_sha":null,"homepage":"","language":"Jupyter 
Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/joyceannie.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-06-17T21:28:27.000Z","updated_at":"2022-06-22T13:50:43.000Z","dependencies_parsed_at":"2022-09-19T12:10:36.907Z","dependency_job_id":null,"html_url":"https://github.com/joyceannie/Data-Warehouse-AWS","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joyceannie%2FData-Warehouse-AWS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joyceannie%2FData-Warehouse-AWS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joyceannie%2FData-Warehouse-AWS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/joyceannie%2FData-Warehouse-AWS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/joyceannie","download_url":"https://codeload.github.com/joyceannie/Data-Warehouse-AWS/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245195916,"owners_count":20575936,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-s3","data-warehouse","p
ython3","redshift","redshift-cluster"],"created_at":"2024-12-01T05:14:21.194Z","updated_at":"2026-04-10T22:32:37.601Z","avatar_url":"https://github.com/joyceannie.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Warehouse with AWS\n\n## Overview\n\nA music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. The data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app. The objective of the project is to create an ETL pipeline to build a data warehouse. We extract data from S3, stage it in Redshift, and transform it into a set of dimensional tables so the analytics team can continue finding insights into what songs their users are listening to.\n\n## Datasets\n\nTwo datasets reside in S3 buckets.\n\n### Song Dataset\n\nThe first dataset is a subset of real data from the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song.\n\nSample Song Record:\n\n\u003e {\"num_songs\": 1, \"artist_id\": \"ARJIE2Y1187B994AB7\", \"artist_latitude\": null, \"artist_longitude\": null, \"artist_location\": \"\", \"artist_name\": \"Line Renaud\", \"song_id\": \"SOUPIRU12A6D4FA1E1\", \"title\": \"Der Kleine Dompfaff\", \"duration\": 152.92036, \"year\": 0}\n\n### Log Dataset\n\nThe second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the song dataset. These simulate activity logs from a music streaming app based on specified configurations. The log files in the dataset are partitioned by year and month. 
\n\nSample Log Record:\n\n\u003e {\"artist\":null,\"auth\":\"Logged In\",\"firstName\":\"Walter\",\"gender\":\"M\",\"itemInSession\":0,\"lastName\":\"Frye\",\"length\":null,\"level\":\"free\",\"location\":\"San Francisco-Oakland-Hayward, CA\",\"method\":\"GET\",\"page\":\"Home\",\"registration\":1540919166796.0,\"sessionId\":38,\"song\":null,\"status\":200,\"ts\":1541105830796,\"userAgent\":\"\\\"Mozilla\\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\\/537.36 (KHTML, like Gecko) Chrome\\/36.0.1985.143 Safari\\/537.36\\\"\",\"userId\":\"39\"}\n\n## Schema\n\n### Fact Table\n\n* fact_songplay: Records log data associated with songs played by users.\n\n### Dimension Tables\n\n* dim_user: Users of the Sparkify app. Columns are user_id, first_name, last_name, gender, level.\n\n* dim_song: Songs in the dataset. Columns are song_id, title, artist_id, year, duration.\n\n* dim_artist: Artists in the dataset. Columns are artist_id, name, location, lattitude, longitude.\n\n* dim_time: Timestamps of songplay records broken down into specific units. Columns are start_time, hour, day, week, month, year, weekday.\n\n## How To Run\n\nYou need an AWS account to run the project.\nSet up the configuration file before running the scripts. 
\n\nCreate the Redshift cluster by running create_cluster_iac.py.\n\n```\n$ python create_cluster_iac.py\n```\n\nRun create_tables.py to create the staging tables.\n\n```\n$ python create_tables.py\n```\n\nRun etl.py to load data from the staging tables into the analytics tables on Redshift.\n\n```\n$ python etl.py\n```\n\nNow you can run analytic queries on your Redshift database.\n\nDelete the Redshift cluster by running delete_cluster_iac.py.\n\n```\n$ python delete_cluster_iac.py\n```\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoyceannie%2Fdata-warehouse-aws","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjoyceannie%2Fdata-warehouse-aws","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjoyceannie%2Fdata-warehouse-aws/lists"}