{"id":18648381,"url":"https://github.com/collumbus/data-modeling-with-postgres","last_synced_at":"2026-05-05T23:31:34.302Z","repository":{"id":130255275,"uuid":"436223257","full_name":"Collumbus/Data-Modeling-with-Postgres","owner":"Collumbus","description":"This is a project, where I applied concepts of data modeling with Postgres and build an ETL pipeline using Python. To complete the project has been defined fact and dimension tables for a star schema for a particular analytic focus, and wrote an ETL pipeline that transfers data from files in two local directories into these tables in Postgres using Python and SQL. ","archived":false,"fork":false,"pushed_at":"2021-12-12T13:21:18.000Z","size":586,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-17T20:07:43.997Z","etag":null,"topics":["data-engineering","pipeline","postgres","python","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Collumbus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-12-08T11:30:47.000Z","updated_at":"2021-12-15T18:54:00.000Z","dependencies_parsed_at":"2024-03-08T10:31:15.413Z","dependency_job_id":null,"html_url":"https://github.com/Collumbus/Data-Modeling-with-Postgres","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Collumbus/Data-Modeling-with-Postgres","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collum
bus%2FData-Modeling-with-Postgres","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-with-Postgres/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-with-Postgres/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-with-Postgres/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Collumbus","download_url":"https://codeload.github.com/Collumbus/Data-Modeling-with-Postgres/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Collumbus%2FData-Modeling-with-Postgres/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32672528,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-05T11:29:49.557Z","status":"ssl_error","status_checked_at":"2026-05-05T11:29:48.587Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","pipeline","postgres","python","sql"],"created_at":"2024-11-07T06:30:37.785Z","updated_at":"2026-05-05T23:31:34.280Z","avatar_url":"https://github.com/Collumbus.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data-Modeling-with-Postgres\nThis project arose from the 
need for a startup (a fictitious company) to analyze the song and user activity data collected by its new music streaming app. \n\nThe company's data is stored in JSON files, which makes it hard to query. So, a data model consisting of a Postgres database and a Python ETL pipeline was proposed, which will allow the analytics team to carry out their analysis. \n\n## Datasets\nIn the Song Dataset, each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are the filepaths to two files in this dataset.\n\n```\nsong_data/A/B/C/TRABCEI128F424C983.json\nsong_data/A/A/B/TRAABJL12903CDCF1A.json\n```\n\nAnd below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like.\n```\n{\"num_songs\": 1, \"artist_id\": \"ARJIE2Y1187B994AB7\", \"artist_latitude\": null, \"artist_longitude\": null, \"artist_location\": \"\", \"artist_name\": \"Line Renaud\", \"song_id\": \"SOUPIRU12A6D4FA1E1\", \"title\": \"Der Kleine Dompfaff\", \"duration\": 152.92036, \"year\": 0}\n```\n\nThe Log Dataset consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.\n\nThe log files in this dataset are partitioned by year and month. 
For example, here are filepaths to two files in this dataset.\n\n```\nlog_data/2018/11/2018-11-12-events.json\nlog_data/2018/11/2018-11-13-events.json\n```\n\nWe can inspect one of the log files by reading it into a pandas DataFrame:\n```\nimport pandas as pd\n\n# Each line of the log file is a separate JSON record, hence lines=True\ndf = pd.read_json('data/log_data/2018/11/2018-11-01-events.json', lines=True)\ndf.head()\n```\nThis will allow us to view, as in the image below, the data from the 2018-11-01-events.json file.\n\n![log_sample](/img/log_data.png)\n\n\n\n\n## Project Structure\n\n```\nData Modeling with Postgres\n|____data\t\t\t        # Datasets (json files)\n| |____log_data                         # Log Dataset \n| | |____...\n| |____song_data                        # Song Dataset \n| | |____...\n|\n|____jupyter_notebooks\t\t\t# Notebooks for developing and testing ETL\n| |____etl.ipynb    \t    \t        # Developing ETL builder\n| |____test.ipynb\t    \t        # Testing ETL builder\n|\n|____scripts        \t\t\t# Python scripts\n| |____etl.py\t\t    \t        # ETL builder\n| |____sql_queries.py\t\t        # ETL query helper functions\n| |____create_tables.py\t\t        # Database/table creation script\n```\n\n\n## ETL Pipeline\n### etl.py\nETL pipeline builder\n\n1. `process_data`\n\t* Iterates over the datasets and applies the `process_song_file` and `process_log_file` functions\n2. `process_song_file`\n\t* Processes a song file and inserts records into the _songs_ and _artists_ dimension tables\n3. `process_log_file`\n\t* Processes a log file and inserts records into the _time_ and _users_ dimension tables and the _songplays_ fact table\n\n### create_tables.py\nCreates the fact and dimension table schemas\n\n1. `create_database`\n2. `drop_tables`\n3. `create_tables`\n\n### sql_queries.py\nHelper SQL query statements for `etl.py` and `create_tables.py`\n\n1. `*_table_drop`\n2. `*_table_create`\n3. `*_table_insert`\n4. 
`song_select`\n\n\n## Database Schema\n![Entity Relationship Diagram - ERD](/img/sparkifydb_erd.png)\n\n### Fact table\n```\nsongplays\n\t- songplay_id \tPrimary key created serially\n\t- start_time \tForeign key related to the time table\n\t- user_id\tForeign key related to the users table\n\t- level\n\t- song_id \tForeign key related to the songs table\n\t- artist_id \tForeign key related to the artists table\n\t- session_id\n\t- location\n\t- user_agent\n```\n\n### Dimension tables\n```\nusers\n\t- user_id \tPrimary key created serially\n\t- first_name\n\t- last_name\n\t- gender\n\t- level\n\nsongs\n\t- song_id \tPrimary key\n\t- title\n\t- artist_id     Foreign key related to the artists table\n\t- year\n\t- duration\n\nartists\n\t- artist_id \tPrimary key\n\t- name\n\t- location\n\t- latitude\n\t- longitude\n\ntime\n\t- start_time \tPrimary key\n\t- hour\n\t- day\n\t- week\n\t- month\n\t- year\n\t- weekday\n```\n## Implementing the project...\nFirst, we have to create the database and tables in Postgres:\n* For this, the create and drop statements for each table were written in the [sql_queries.py](https://github.com/Collumbus/Data-Modeling-with-Postgres/blob/main/scripts/sql_queries.py) file. 
Drop statements are needed so that we can delete everything and run the project more than once (for testing purposes).\n\n* Next, we have to run the [create_tables.py](https://github.com/Collumbus/Data-Modeling-with-Postgres/blob/main/scripts/create_tables.py) file to create the database and tables.\nTo run the file, go to the project's root folder in the terminal and enter the command ```python scripts\\create_tables.py ```.\n\nOnce that's done, we need to create the ETL pipeline:\n*  For this, the entire ETL pipeline process was written in the [etl.py](https://github.com/Collumbus/Data-Modeling-with-Postgres/blob/main/scripts/etl.py) file.\n*  As with the previous file, we run the ```etl.py``` file from the terminal with the ```python scripts\\etl.py ``` command.\n\nP.S.: During the development of the ETL pipeline, an MVP was first created, which can be found in the [etl.ipynb](https://github.com/Collumbus/Data-Modeling-with-Postgres/blob/main/jupyter_notebooks/etl.ipynb) Jupyter notebook. 
Before running ```etl.ipynb``` or the [test.ipynb](https://github.com/Collumbus/Data-Modeling-with-Postgres/blob/main/jupyter_notebooks/test.ipynb) tests, the ```create_tables.py``` script must be run.\n\n### Example of query and results for song play analysis\n\nOne of the analysts wrote a query to find the 10 hours of the day with the most song plays by free-tier users, in order to prepare a price list for ads.\n\n### Query\n```\nSELECT \n        COUNT(songplay_id) count_users,\n        t.hour \nFROM songplays sp\nJOIN time t ON sp.start_time = t.start_time\nWHERE sp.level = 'free'\nGROUP BY 2\nORDER BY 1 DESC\nLIMIT 10\n```\n\n### Results\n```\n  | count_users\t| Hour\n-----------------------\n0 |\t118     | 15\n1 |\t111     | 14\n2 |\t100     | 16\n3 |\t81      | 13\n4 |\t74      | 18\n5 |\t63      | 10\n6 |\t62      | 17\n7 |\t57      | 21\n8 |\t57      | 12\n9 |\t48      | 4\n```\n**P.S.: We can run this query directly in pgAdmin's query editor, for example.**","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcollumbus%2Fdata-modeling-with-postgres","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcollumbus%2Fdata-modeling-with-postgres","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcollumbus%2Fdata-modeling-with-postgres/lists"}