{"id":18811676,"url":"https://github.com/gares95/data-warehouse_aws-redshift","last_synced_at":"2026-01-11T13:30:19.608Z","repository":{"id":218120526,"uuid":"304048368","full_name":"Gares95/Data-Warehouse_AWS-Redshift","owner":"Gares95","description":"Building an ETL pipeline for a database hosted on Redshift. Project based on Udacity's template. ","archived":false,"fork":false,"pushed_at":"2020-10-14T15:30:07.000Z","size":35,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2024-12-30T00:13:15.564Z","etag":null,"topics":["aws-redshift","data-warehouse","redshift","udacity-nanodegree"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Gares95.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2020-10-14T14:55:06.000Z","updated_at":"2021-02-14T15:51:18.000Z","dependencies_parsed_at":"2024-01-19T21:31:18.473Z","dependency_job_id":"4d09d143-da69-4ae7-872a-c3e2d978c37a","html_url":"https://github.com/Gares95/Data-Warehouse_AWS-Redshift","commit_stats":null,"previous_names":["gares95/data-warehouse_aws-redshift"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gares95%2FData-Warehouse_AWS-Redshift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gares95%2FData-Warehouse_AWS-Redshift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Gares95%2FData-Warehouse_AWS-Redshift/releases","manifests_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/repositories/Gares95%2FData-Warehouse_AWS-Redshift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Gares95","download_url":"https://codeload.github.com/Gares95/Data-Warehouse_AWS-Redshift/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239748248,"owners_count":19690232,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-redshift","data-warehouse","redshift","udacity-nanodegree"],"created_at":"2024-11-07T23:27:16.634Z","updated_at":"2026-01-11T13:30:19.563Z","avatar_url":"https://github.com/Gares95.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Warehouse with AWS Redshift\n***\nThis project contains the files that define and create the tables of a star-schema database and build an ETL pipeline that loads it in Redshift.\n\nThis repository simulates building an ETL pipeline for a music streaming startup whose data resides in S3 and who wants to transform it into a set of dimensional tables for its analytics team. \n\nThe data is extracted from JSON logs in S3 buckets in AWS, where the clients have been loading all the information they've collected over some time. This data will be processed so that the clients can **analyze** it and **extract new relevant information** to support future decisions on marketing, store availability...  
\nHaving this information available through queries gives the client plenty of flexibility.\n\nThe files included in this project are:\n\n* create_tables.py\n* sql_queries.py\n* etl.py\n* dwh.cfg\n* Redshift_Management.ipynb\n\n## Data Files\n***\nThe datasets used for this project, which reside in S3, are:\n- Song data: s3://udacity-dend/song_data\n- Log data: s3://udacity-dend/log_data\n\n### Song Dataset\nThe first dataset is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. Here is an example of a filepath: _\"song_data/A/B/C/TRABCEI128F424C983.json\"_\nAnd here is an example of one of the JSON files: _{\"num_songs\": 1, \"artist_id\": \"ARJIE2Y1187B994AB7\", \"artist_latitude\": null, \"artist_longitude\": null, \"artist_location\": \"\", \"artist_name\": \"Line Renaud\", \"song_id\": \"SOUPIRU12A6D4FA1E1\", \"title\": \"Der Kleine Dompfaff\", \"duration\": 152.92036, \"year\": 0}_\n\n### Log Dataset\nThe second dataset consists of log files in JSON format generated by an _event simulator_ based on the songs in the dataset above. 
These simulate activity logs from a music streaming app based on specified configurations.\nHere is an example of a filepath: _\"log_data/2018/11/2018-11-12-events.json\"_\nAnd here is an example of a JSON file for these events: _{\"artist\": \"None\", \"auth\": \"Logged In\", \"gender\": \"F\", \"itemInSession\": 0, \"lastName\": \"Williams\", \"length\": \"227.15873\", \"level\": \"free\", \"location\": \"Klamath Falls OR\", \"method\": \"GET\", \"page\": \"Home\", \"registration\": \"1.541078e-12\", \"sessionId\": \"438\", \"Song\": \"None\", \"status\": \"200\", \"ts\": \"15465488945234\", \"userAgent\": \"Mozilla/5.0(WindowsNT,6.1;WOW641)\", \"userId\": \"53\"}_\n\n### The star schema tables\nThe star schema created by this program has the following structure:\n\n- _Fact table_:\n1. songplays [songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent]\n\n- _Dimension tables_:\n2. users [user_id, first_name, last_name, gender, level]\n3. songs [song_id, title, artist_id, year, duration]\n4. artist [artist_id, name, location, lattitude, longitude]\n5. time [start_time, hour, day, week, month, year, weekday]\n\n![alt text](https://raw.githubusercontent.com/Gares95/Data-Warehouse_AWS-Redshift/master/Star%20Schema.PNG)\n\n## Program files\n***\n### create_tables.py\nThis file connects to the previously created Redshift cluster and creates the set of dimensional tables that the analytics team will query to find out which songs users are listening to. \n\n### sql_queries.py\nThis file contains the functions imported by \u003cem\u003ecreate_tables.py\u003c/em\u003e  \nto create the set of dimensional tables with a star schema.\nThese functions \u003cem\u003eDrop\u003c/em\u003e old tables, \u003cem\u003eCreate\u003c/em\u003e new ones and \u003cem\u003eInsert\u003c/em\u003e data into them.  
\n\n### etl.py\nThis file copies the data from the S3 buckets into staging tables and then processes it to load the fact and dimension tables.   \n\n### dwh.cfg\nThis file contains the AWS credentials to access the S3 buckets and the Redshift cluster. \nHere you will have to enter your AWS key and secret access key:\n\n\n### Redshift_Management.ipynb\nThis notebook creates the Redshift cluster, if it hasn't been created yet, and the IAM role that will be used. It requires the _access key_ and _secret access key_ of the AWS user.\nThe notebook also returns the ENDPOINT (HOST) and ARN that must be set in the _dwh.cfg_ file to access the cluster when creating the tables. \nThe end of the notebook contains the commands to delete the Redshift cluster and the role it created.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgares95%2Fdata-warehouse_aws-redshift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgares95%2Fdata-warehouse_aws-redshift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgares95%2Fdata-warehouse_aws-redshift/lists"}