{"id":21888369,"url":"https://github.com/najuzilu/cdw-awsredshift","last_synced_at":"2026-05-15T23:32:53.565Z","repository":{"id":168729801,"uuid":"368928593","full_name":"najuzilu/CDW-AWSRedshift","owner":"najuzilu","description":"Building a cloud data warehouse with AWS Redshift.","archived":false,"fork":false,"pushed_at":"2021-08-01T21:27:12.000Z","size":342,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-26T20:31:58.207Z","etag":null,"topics":["aws-ec2","aws-redshift","cloud-data-warehouse","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/najuzilu.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-19T16:09:16.000Z","updated_at":"2021-08-01T21:27:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"cd4daee0-a1fc-40e1-b29d-d0ddeeea4eeb","html_url":"https://github.com/najuzilu/CDW-AWSRedshift","commit_stats":null,"previous_names":["najuzilu/cdw-awsredshift"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/najuzilu%2FCDW-AWSRedshift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/najuzilu%2FCDW-AWSRedshift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/najuzilu%2FCDW-AWSRedshift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/najuzilu%2FCDW-AWSRedshift/manifests","owner_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/owners/najuzilu","download_url":"https://codeload.github.com/najuzilu/CDW-AWSRedshift/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244895457,"owners_count":20527893,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws-ec2","aws-redshift","cloud-data-warehouse","python"],"created_at":"2024-11-28T11:15:07.436Z","updated_at":"2026-05-15T23:32:48.532Z","avatar_url":"https://github.com/najuzilu.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Language](https://img.shields.io/badge/language-python--3.8-blue) [![Contributors][contributors-shield]][contributors-url] [![Stargazers][stars-shield]][stars-url] [![Forks][forks-shield]][forks-url] [![Issues][issues-shield]][issues-url] [![MIT License][license-shield]][license-url] [![LinkedIn][linkedin-shield]][linkedin-url]\n\n\u003cbr /\u003e\n\u003cp align=\"center\"\u003e\n    \u003ca href=\"https://github.com/najuzilu/CDW-AWSRedshift\"\u003e\n        \u003cimg src=\"./images/logo.png\" alt=\"Logo\" width=\"300\" height=\"200\"\u003e\n    \u003c/a\u003e\n    \u003ch3 align=\"center\"\u003eCloud Data Warehouse with AWS Redshift\u003c/h3\u003e\n\u003c/p\u003e\n\n## About the Project\n\nSparkify has grown their user base and song database and want to move their processes and data onto the cloud. 
Their data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in their app.\n\nThey'd like a data engineer to build an ETL pipeline that extracts their data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for their analytics team to continue finding insights into what songs their users are listening to. You'll be able to test your database and ETL pipeline by running queries given to you by the analytics team from Sparkify and comparing your results with their expected results.\n\n## Description\n\nIn this project, you will move Sparkify's processes and data onto the cloud. Specifically, you will build ETL pipelines that extract data from S3 and stage it in Redshift, then transform the data into a set of dimensional tables to allow Sparkify's analytics team to explore user song preferences and find insights.\n\n### Tools\n\n* Python\n* AWS\n* Redshift\n\n## Datasets\n\nYou will work with two datasets that reside in S3. Here are the S3 links for each dataset:\n* Song data: `s3://udacity-dend/song_data`\n* Log data: `s3://udacity-dend/log_data`\n    * Log data JSON path: `s3://udacity-dend/log_json_path.json`.\n\nThe song dataset contains a subset of the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song. 
The files are partitioned by the first three letters after the `TR` prefix of each song's track ID, like so:\n\n```text\nsong_data/A/B/C/TRABCEI128F424C983.json\nsong_data/A/A/B/TRAABJL12903CDCF1A.json\n```\n\nThis is what the content of each JSON file looks like:\n```json\n{\"num_songs\": 1, \"artist_id\": \"ARJIE2Y1187B994AB7\", \"artist_latitude\": null, \"artist_longitude\": null, \"artist_location\": \"\", \"artist_name\": \"Line Renaud\", \"song_id\": \"SOUPIRU12A6D4FA1E1\", \"title\": \"Der Kleine Dompfaff\", \"duration\": 152.92036, \"year\": 0}\n```\n\nThe log dataset contains simulated app activity logs from a music streaming app based on configuration settings. The log files are partitioned by year and month, like so:\n\n```text\nlog_data/2018/11/2018-11-12-events.json\nlog_data/2018/11/2018-11-13-events.json\n```\n\nHere is an example of what the data in the log file looks like.\n![2018-11-12-events](./images/2018-11-12-events.png)\n\n## ERD Model\n\nYou will use the _star database schema_ as the data model for this ETL pipeline, which contains fact and dimension tables. An entity relationship diagram is shown below.\n\n![image1.jpeg](./images/erd.jpeg)\n\n## Getting Started\n\nClone this repository:\n\n```bash\ngit clone https://github.com/najuzilu/CDW-AWSRedshift.git\n```\n\n### Prerequisites\n\n* conda\n* python 3.8\n* psycopg2\n* boto3\n* json\n* botocore\n* configparser\n\nCreate a virtual environment through Anaconda using:\n\n```bash\nconda env create --file environment.yml\n```\n\n## Project Steps\n\n1. Use `dwh_example.cfg` to create and populate a `dwh.cfg` file with the AWS Access Key and Secret Key fields.\n2. Run `create_tables.py` to create a new Redshift cluster on AWS and the tables.\n    ```bash\n    python create_tables.py\n    ```\n3. 
Run `etl.py` to load the data from S3 to staging tables in Redshift and insert data from staging tables to final tables.\n    ```bash\n    python etl.py\n    ```\n    The pipeline will also execute test queries to make sure the tables have been populated.\n\n## Authors\n\nYuna Luzi - @najuzilu\n\n## License\n\nDistributed under the MIT License. See `LICENSE` for more information.\n\n\u003c!-- Links ---\u003e\n\n[contributors-shield]: https://img.shields.io/github/contributors/najuzilu/CDW-AWSRedshift.svg?style=flat-square\n[contributors-url]: https://github.com/najuzilu/CDW-AWSRedshift/graphs/contributors\n[forks-shield]: https://img.shields.io/github/forks/najuzilu/CDW-AWSRedshift.svg?style=flat-square\n[forks-url]: https://github.com/najuzilu/CDW-AWSRedshift/network/members\n[stars-shield]: https://img.shields.io/github/stars/najuzilu/CDW-AWSRedshift.svg?style=flat-square\n[stars-url]: https://github.com/najuzilu/CDW-AWSRedshift/stargazers\n[issues-shield]: https://img.shields.io/github/issues/najuzilu/CDW-AWSRedshift.svg?style=flat-square\n[issues-url]: https://github.com/najuzilu/CDW-AWSRedshift/issues\n[license-shield]: https://img.shields.io/badge/License-MIT-yellow.svg\n[license-url]: https://github.com/najuzilu/CDW-AWSRedshift/blob/master/LICENSE\n[linkedin-shield]: https://img.shields.io/badge/-LinkedIn-black.svg?style=flat-square\u0026logo=linkedin\u0026colorB=555\n[linkedin-url]: https://www.linkedin.com/in/yuna-luzi/\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnajuzilu%2Fcdw-awsredshift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnajuzilu%2Fcdw-awsredshift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnajuzilu%2Fcdw-awsredshift/lists"}