{"id":20748757,"url":"https://github.com/vuanhtuan1012/data-warehouse","last_synced_at":"2026-04-18T21:32:22.802Z","repository":{"id":155259343,"uuid":"325940538","full_name":"vuanhtuan1012/data-warehouse","owner":"vuanhtuan1012","description":"Building an ETL pipeline that extracts data from S3, stages them in Redshift.","archived":false,"fork":false,"pushed_at":"2021-05-05T04:29:26.000Z","size":439,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-18T03:23:58.789Z","etag":null,"topics":["aws","cloud-data-warehouse","etl-pipeline","redshift"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vuanhtuan1012.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-01-01T08:07:22.000Z","updated_at":"2024-04-12T11:06:52.000Z","dependencies_parsed_at":null,"dependency_job_id":"864403ce-3386-45d3-93a1-5003c2e9f2d5","html_url":"https://github.com/vuanhtuan1012/data-warehouse","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-warehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-warehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-warehouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-war
ehouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vuanhtuan1012","download_url":"https://codeload.github.com/vuanhtuan1012/data-warehouse/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243048116,"owners_count":20227592,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","cloud-data-warehouse","etl-pipeline","redshift"],"created_at":"2024-11-17T08:18:14.266Z","updated_at":"2026-04-18T21:32:22.755Z","avatar_url":"https://github.com/vuanhtuan1012.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Warehouse\n\n## Contents\n\n1. [Introduction](#1-introduction)\n2. [Dataset](#2-dataset)\n3. [Repo Structure](#3-repo-structure)\n4. [Create Tables](#4-create-tables)\n5. [Load Data](#5-load-data)\n6. [Analyses](#6-analyses)\n\n## 1. Introduction\n\nA music streaming startup, Sparkify, has grown their user base and song database and **wants to move their processes and data onto the cloud**. Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.\n\nIn this project, I will:\n- build an ETL pipeline that extracts their data from S3 and stages it in Redshift.\n- transform the data into a set of dimensional tables so that their analytics team can continue finding insights into what songs their users are listening to.\n\n## 2. 
Dataset\n\nI'll be working with two datasets that reside in S3:\n- Song data: `s3://udacity-dend/song_data`\n- Log data: `s3://udacity-dend/log_data`\n\n## 3. Repo Structure\n\nThis repo consists of six files and one directory:\n1. [dwh.cfg](dwh.cfg) contains the configuration of the Redshift database and the IAM role.\n2. [sql_queries.py](sql_queries.py) contains all SQL queries to drop and create tables, and to load and transform data from S3 into tables.\n3. [create_tables.py](create_tables.py) connects to the database and drops each table (if it exists) before creating it.\n4. [etl.py](etl.py) loads data from S3 into staging tables on Redshift, then loads data from the staging tables into the analytic tables.\n5. [analyses.ipynb](analyses.ipynb) generates statistics and some analytic graphs on the database.\n6. [images](images) directory contains all images generated from `analyses.ipynb`.\n7. [README.md](README.md) (this file) gives a summary of the project and explains the table design, the data transformation, and the code. It also provides analyses of song plays.\n\n## 4. Create Tables\n\nWe need two types of tables:\n1. staging tables, which are used to load data from JSON files in the S3 bucket into the database.\n2. analytic tables, which are used to analyze the data and find insights into users' listening habits.\n\n### 4.1. Staging Tables\n\n#### Staging Songs\nFiles in the song dataset are in JSON format and contain metadata about a song and the artist of that song. 
An example of a single song file is shown in the code below.\n\n```JSON\n{\n    \"num_songs\": 1,\n    \"artist_id\": \"ARJIE2Y1187B994AB7\",\n    \"artist_latitude\": null,\n    \"artist_longitude\": null,\n    \"artist_location\": \"\",\n    \"artist_name\": \"Line Renaud\",\n    \"song_id\": \"SOUPIRU12A6D4FA1E1\",\n    \"title\": \"Der Kleine Dompfaff\",\n    \"duration\": 152.92036,\n    \"year\": 0\n}\n```\n\nTherefore, the SQL command below is used to create the staging table `staging_songs`.\n\n```SQL\nCREATE TABLE staging_songs(\n    artist_id VARCHAR(max),\n    artist_latitude FLOAT,\n    artist_location VARCHAR(max),\n    artist_longitude FLOAT,\n    artist_name VARCHAR(max),\n    duration FLOAT,\n    num_songs INT,\n    song_id VARCHAR,\n    title VARCHAR(max),\n    year INT\n)\n```\nIn some cases, the fields `artist_id`, `artist_location`, `artist_name`, and `title` have values longer than 256 characters, so these fields are set to type VARCHAR(max).\n\n#### Staging Events\n\nLog files are in JSON format and are generated by an [event simulator](https://github.com/Interana/eventsim) based on the songs in the song dataset. An example of the data in a log file is shown in the table below.\n\n![log data example](images/log-data.png)\n\nTherefore, the SQL command below is used to create the staging table `staging_events`.\n\n```SQL\nCREATE TABLE staging_events(\n    artist VARCHAR,\n    auth VARCHAR,\n    first_name VARCHAR,\n    gender VARCHAR,\n    item_in_section INT,\n    last_name VARCHAR,\n    length FLOAT,\n    level VARCHAR,\n    location VARCHAR,\n    method VARCHAR,\n    page VARCHAR,\n    registration FLOAT,\n    session_id INT,\n    song VARCHAR,\n    status INT,\n    ts BIGINT,\n    user_agent VARCHAR,\n    user_id INT\n)\n```\n\n### 4.2. Analytic Tables\n\nThe figure below presents the star schema optimized for queries on song play analysis given by the Sparkify team. 
It includes a fact table `songplays` and four dimension tables: `users`, `songs`, `artists`, and `time`.\n\n![sparkify schema](images/sparkify_schema.svg)\n\nThe SQL commands below are used to create the fact and dimension tables.\n\n```SQL\nCREATE TABLE songplays(\n    songplay_id BIGINT IDENTITY(1, 1) NOT NULL PRIMARY KEY,\n    start_time TIMESTAMP NOT NULL,\n    user_id INT NOT NULL,\n    level VARCHAR NOT NULL,\n    song_id VARCHAR NOT NULL,\n    artist_id VARCHAR,\n    session_id INT,\n    location VARCHAR,\n    user_agent VARCHAR\n);\n\nCREATE TABLE users(\n    user_id INT NOT NULL PRIMARY KEY,\n    first_name VARCHAR,\n    last_name VARCHAR,\n    gender VARCHAR,\n    level VARCHAR NOT NULL\n);\n\nCREATE TABLE songs(\n    song_id VARCHAR NOT NULL PRIMARY KEY,\n    title VARCHAR(max) NOT NULL,\n    artist_id VARCHAR,\n    year INT,\n    duration FLOAT\n);\n\nCREATE TABLE artists(\n    artist_id VARCHAR NOT NULL PRIMARY KEY,\n    name VARCHAR(max) NOT NULL,\n    location VARCHAR(max),\n    latitude FLOAT,\n    longitude FLOAT\n);\n\nCREATE TABLE time(\n    start_time TIMESTAMP NOT NULL PRIMARY KEY,\n    hour INT,\n    day INT,\n    week INT,\n    month INT,\n    year INT,\n    weekday INT\n);\n```\n\n## 5. Load Data\n\nFirst, data is loaded from the S3 bucket into staging tables. 
Then it is transformed from staging tables into analytic tables.\n\n### S3 to Staging Tables\n\n```Python\nstaging_events_copy = (\"\"\"\nCOPY staging_events\nFROM {}\nCREDENTIALS 'aws_iam_role={}'\nFORMAT AS JSON {}\nREGION 'us-west-2'\nBLANKSASNULL\nEMPTYASNULL\n\"\"\").format(LOG_DATA, ARN, LOG_JSON_PATH)\n\nstaging_songs_copy = (\"\"\"\nCOPY staging_songs\nFROM {}\nCREDENTIALS 'aws_iam_role={}'\nFORMAT AS JSON 'auto'\nREGION 'us-west-2'\nBLANKSASNULL\nEMPTYASNULL\n\"\"\").format(SONG_DATA, ARN)\n```\n\n### Staging Tables to Analytic Tables\n\n```SQL\nINSERT INTO songplays(start_time, user_id, level, song_id, artist_id,\nsession_id, location, user_agent)\nSELECT\nTIMESTAMP 'epoch' + ts::numeric / 1000 * INTERVAL '1 second' AS start_time,\nuser_id, level, s.song_id, s.artist_id, session_id, location, user_agent\nFROM staging_events e\nJOIN staging_songs s\nON (s.title = e.song) AND (e.artist = s.artist_name)\nWHERE (ts IS NOT NULL) AND (user_id IS NOT NULL)\n      AND (level IS NOT NULL) AND (song_id IS NOT NULL);\n\nINSERT INTO users\nSELECT user_id, first_name, last_name, gender, level\nFROM staging_events\nWHERE (user_id, ts) IN (\n  SELECT user_id, MAX(ts)\n  FROM staging_events\n  WHERE user_id IS NOT NULL\n  GROUP BY user_id) AND (level IS NOT NULL);\n\nINSERT INTO songs\nSELECT DISTINCT song_id, title, artist_id, year, duration\nFROM staging_songs\nWHERE (song_id IS NOT NULL) AND (title IS NOT NULL);\n\nINSERT INTO artists\nSELECT DISTINCT artist_id, artist_name, artist_location,\n                artist_latitude, artist_longitude\nFROM staging_songs\nWHERE (artist_id IS NOT NULL) AND (artist_name IS NOT NULL);\n\nINSERT INTO time\nSELECT DISTINCT\nTIMESTAMP 'epoch' + ts::numeric / 1000 * INTERVAL '1 second' AS start_time,\nDATE_PART(h, start_time) as hour,\nDATE_PART(d, start_time) as day,\nDATE_PART(w, start_time) as week,\nDATE_PART(mon, start_time) as month,\nDATE_PART(y, start_time) as year,\nDATE_PART(weekday, start_time) as weekday\nFROM 
staging_events\nWHERE ts IS NOT NULL;\n```\n\nAs a user's level can change over time, I keep only the most recent record when a user has more than one record in the staging table.\n\n## 6. Analyses\n\nIn this section, I run some queries on the database to find insights into Sparkify's users.\n\n### General Statistics\n\n-   Total songplays = 333\n-   Total users = 97\n-   Total artists = 9553\n-   Total songs = 14896\n\n### Which user level is more active on the Sparkify app?\n\n```SQL\nSELECT level, count(songplay_id)\nFROM songplays\nGROUP BY level;\n```\n![plays per level](images/plays_per_level.png)\n\n### What is the ratio of user levels?\n\n```SQL\nSELECT level, count(user_id)\nFROM users\nGROUP BY level;\n```\n\n![users per level](images/users_per_level.png)\n\n### Which browsers are used to access the Sparkify app?\n\n```SQL\nSELECT user_agent, count(songplay_id)\nFROM songplays\nGROUP BY user_agent;\n```\n\n![browser](images/browser.png)\n\n### How does usage of the Sparkify app change over the weeks?\n\n```SQL\nSELECT week, count(songplay_id)\nFROM songplays\nJOIN time ON songplays.start_time = time.start_time\nGROUP BY week\nORDER BY week;\n```\n\n![week](images/week.png)\n\n### Top users and regions using the Sparkify app\n\n```SQL\nSELECT user_id, count(songplay_id) AS plays\nFROM songplays\nGROUP BY user_id\nORDER BY plays DESC\nLIMIT 10;\n\nSELECT REVERSE(TRIM(SPLIT_PART(REVERSE(location), ',', 1))) AS region, count(songplay_id) AS plays\nFROM songplays\nGROUP BY region\nORDER BY plays DESC\nLIMIT 10;\n```\n\n![top users regions](images/top_users_regions.png)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvuanhtuan1012%2Fdata-warehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvuanhtuan1012%2Fdata-warehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvuanhtuan1012%2Fdata-warehouse/lists"}