{"id":20924557,"url":"https://github.com/alkasaliss/aws-datawarehouse","last_synced_at":"2026-05-18T04:40:53.684Z","repository":{"id":97738671,"uuid":"301540054","full_name":"AlkaSaliss/aws-datawarehouse","owner":"AlkaSaliss","description":"Udacity Data engineering Nanodegree: Project 3 on datawarehouse implementation ","archived":false,"fork":false,"pushed_at":"2020-10-17T15:34:32.000Z","size":287,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-19T17:54:38.464Z","etag":null,"topics":["aws","data-engineering","etl-pipeline","python","redshift","s3-bucket","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AlkaSaliss.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-10-05T21:07:42.000Z","updated_at":"2020-10-17T15:34:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"a6d20a73-cb72-40f5-a574-b0232d104150","html_url":"https://github.com/AlkaSaliss/aws-datawarehouse","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlkaSaliss%2Faws-datawarehouse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlkaSaliss%2Faws-datawarehouse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlkaSaliss%2Faws-datawarehouse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AlkaSaliss%2Fa
ws-datawarehouse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AlkaSaliss","download_url":"https://codeload.github.com/AlkaSaliss/aws-datawarehouse/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243318745,"owners_count":20272139,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","data-engineering","etl-pipeline","python","redshift","s3-bucket","sql"],"created_at":"2024-11-18T20:23:30.075Z","updated_at":"2026-05-18T04:40:48.634Z","avatar_url":"https://github.com/AlkaSaliss.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sparkify : Your Provider for the World's Best Music Library\n\nThis repository contains scripts for **E**xtracting music data from an AWS S3 bucket and loading it into AWS Redshift staging tables, **T**ransforming this data to match the **sparkifydb** database schema, and **L**oading the transformed data into the database's final fact and dimension tables.\n\nThe schemas of the two staging tables are shown below :\n\n| staging_events | staging_songs |\n|----------------|---------------|\n|![db schema](assests/staging_events.png)|![db schema](assests/staging_songs.png)|\n\nThe following diagram represents the database schema, with 1 fact table `songplays` and 4 dimension tables `users`, `time`, `songs` and `artists` :\n\n![db schema](assests/sparkifydb_schema.png)\n\nThe database design follows a `star schema` to help our analyst team, **The Sparkalysts**, in their mission to answer the questions running through the head 
of our CEO **The Big Spark** such as :\n\n1. List of songs listened to by user `Lily` `Koch` ?\n2. In which year did our users listen to music the most ?\n3. ...\n\n## Project dataset\n\nWe have two datasets stored in S3 :\n* song data : JSON files representing a subset of the **Million Song Dataset**; a sample song file is shown below :\n\t```json\n\t{\n\t\t\"song_id\": \"SOUYDPQ12A6D4F88E6\",\n\t\t\"num_songs\": 1,\n\t\t\"title\": \"Tears Of Joy\",\n\t\t\"artist_name\": \"Wendy \u0026 Lisa\",\n\t\t\"artist_latitude\": null,\n\t\t\"year\": 1989,\n\t\t\"duration\": 278.46485,\n\t\t\"artist_id\": \"ARN4X0U1187B9AFF37\",\n\t\t\"artist_longitude\": null,\n\t\t\"artist_location\": \"\"\n\t}\n\t```\n* log data : JSON files representing our users' activity on the songs; a sample log record is shown below :\n\t```json\n\t\t{\n\t\t\t\"artist\": null,\n\t\t\t\"auth\": \"Logged In\",\n\t\t\t\"firstName\": \"Walter\",\n\t\t\t\"gender\": \"M\",\n\t\t\t\"itemInSession\": 0,\n\t\t\t\"lastName\": \"Frye\",\n\t\t\t\"length\": null,\n\t\t\t\"level\": \"free\",\n\t\t\t\"location\": \"San Francisco-Oakland-Hayward, CA\",\n\t\t\t\"method\": \"GET\",\n\t\t\t\"page\": \"Home\",\n\t\t\t\"registration\": 1540919166796.0,\n\t\t\t\"sessionId\": 38,\n\t\t\t\"song\": null,\n\t\t\t\"status\": 200,\n\t\t\t\"ts\": 1541105830796,\n\t\t\t\"userAgent\": \"\\\"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.143 Safari/537.36\\\"\",\n\t\t\t\"userId\": \"39\"\n\t\t}\n\t```\n\n## Project Structure\n\nThe project is structured as follows:\n\n* A script `iac_scripts.py` (Infrastructure As Code) which contains functions for creating/deleting IAM roles and Redshift clusters\n* A script `sql_queries.py` which contains all the SQL queries for creating the `sparkifydb` database and its tables, and for the ETL pipeline\n* A script `create_tables.py` which creates the database and the defined tables\n* A script `etl.py` which extracts the data from S3 and transforms the log 
and song data before loading the processed data into the tables created by the script `create_tables.py`\n\n## Project Setup\n\nSetting everything up requires one extra file: a configuration file `dwh.cfg`, located in the project folder at the same level as the other scripts. This file contains the credentials and AWS settings for connecting to the database. A sample configuration file is shown below :\n\n```\n[CLUSTER]\nHOST='your cluster host address'\nDWH_CLUSTER_TYPE=multi-node\nDWH_NUM_NODES=2\nDWH_NODE_TYPE=dc2.large\nDWH_CLUSTER_IDENTIFIER=dwhCluster\nDB_NAME=sparkifydb\nDB_USER=dwhuser\nDB_PASSWORD=\"your very complicated and long password\"\nDB_PORT=5439\nREGION='us-west-2'\n\n[IAM_ROLE]\nROLE_NAME=dwhRole\nARN='your iam role arn'\n\n[S3]\nLOG_DATA='s3://udacity-dend/log_data'\nLOG_JSONPATH='s3://udacity-dend/log_json_path.json'\nSONG_DATA='s3://udacity-dend/song_data'\n\n[AWS]\nKEY='the key of your IAM user with admin rights'\nSECRET='the secret of your IAM user with admin rights'\n```\n\n\u003e Note : you'll need to create an IAM user with admin rights if you don't already have one\n\nTo set up the project, follow these steps in the given order :\n\n* Run the following command :\n\t```sh\n\tpython iac_scripts.py -t create_role\n\t```\n\tThis will create an IAM Role and save its characteristics in the JSON file `role_arn.json`. Copy the ARN from that file into the `ARN` field of the config file.\n\n* Next, run :\n\t```sh\n\tpython iac_scripts.py -t create_cluster\n\t```\n\tThis will create a Redshift cluster. You can adapt the cluster config according to your needs by modifying the config file, more precisely the `[CLUSTER]` section. 
When the script finishes, it dumps the cluster characteristics to the file `redshift_cluster.json` in the current directory.\n\tYou can then copy the cluster address from this file into the `HOST` field of the configuration file `dwh.cfg`.\n\n* Next, run :\n\t```sh\n\tpython create_tables.py\n\t```\n\tThis connects to the Redshift cluster, drops the tables if they exist, and recreates them.\n\n* Next, run :\n\t```bash\n\tpython etl.py -t staging\n\t```\n\tThis copies the song and log data from S3 into the two staging tables.\n\n* Finally, run :\n\t```bash\n\tpython etl.py -t analytics\n\t```\n\tThis extracts data from the staging tables and applies the necessary transformations before loading it into the fact and dimension tables.\n\n## Design choice\n\nFor all dimension tables, the primary key is used as the sort key, since these keys are used constantly for joins with the fact table.\n\nFor the fact table, the `start_time` column is used as both distribution key and sort key, since a very frequent query would be to get activity for a given date range, month, day, ... Distributing on this column may also give a more evenly balanced distribution across nodes. Other candidate columns for the distribution key are `song_id` and `artist_id`, since joins with the song and artist tables, respectively, might also be frequent.\n\n## TO-DO List\n\n* [ ] Add an analytics dashboard for easier interaction with the database\n* [ ] Run performance benchmarks to compare different distribution strategies and distribution key choices\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falkasaliss%2Faws-datawarehouse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falkasaliss%2Faws-datawarehouse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falkasaliss%2Faws-datawarehouse/lists"}