{"id":20748752,"url":"https://github.com/vuanhtuan1012/data-modeling-with-postgres","last_synced_at":"2026-04-30T20:32:40.057Z","repository":{"id":155259345,"uuid":"320689454","full_name":"vuanhtuan1012/data-modeling-with-postgres","owner":"vuanhtuan1012","description":"Design database for a music app optimizing for queries on song play analysis.","archived":false,"fork":false,"pushed_at":"2020-12-30T06:45:07.000Z","size":2602,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-19T22:08:10.131Z","etag":null,"topics":["analyses","data-modelling","database","etl-pipeline","jupyter-notebook","optimize-queries","postgresql","python3","seaborn"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vuanhtuan1012.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-12-11T21:38:05.000Z","updated_at":"2020-12-30T06:45:10.000Z","dependencies_parsed_at":null,"dependency_job_id":"4474badd-27d0-4b16-aa9d-c4b2e34b7cc7","html_url":"https://github.com/vuanhtuan1012/data-modeling-with-postgres","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/vuanhtuan1012/data-modeling-with-postgres","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-modeling-with-postgres","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-modeling-with-postgres/tags","releases_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-modeling-with-postgres/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-modeling-with-postgres/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vuanhtuan1012","download_url":"https://codeload.github.com/vuanhtuan1012/data-modeling-with-postgres/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vuanhtuan1012%2Fdata-modeling-with-postgres/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32476682,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-30T13:12:12.517Z","status":"ssl_error","status_checked_at":"2026-04-30T13:12:06.837Z","response_time":57,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analyses","data-modelling","database","etl-pipeline","jupyter-notebook","optimize-queries","postgresql","python3","seaborn"],"created_at":"2024-11-17T08:18:13.416Z","updated_at":"2026-04-30T20:32:40.031Z","avatar_url":"https://github.com/vuanhtuan1012.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Modeling with Postgres\n\n## Introduction\n\nA startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. 
The analytics team is particularly interested in **understanding what songs users are listening to**.\n\nCurrently, they don’t have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app and a directory with JSON metadata on the songs in their app. They’d like to **create a Postgres database with tables designed to optimize queries on song play analysis**.\n\nIn this project, I will:\n- Create a star schema: define fact and dimension tables for analytic focus.\n- Write an ETL pipeline: transfer data from files in two local directories into tables in Postgres.\n- Test the database and ETL pipeline.\n- Analyze the song plays.\n\n## Repo Structure\n\nThis repo consists of seven files and two directories:\n1. [data](data/) directory contains two sub-directories:\n\t- [song_data](data/song_data/) is a subset of real data from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song.\n\t- [log_data](data/log_data/) consists of log files in JSON format generated by an [event simulator](https://github.com/Interana/eventsim) that simulates activity logs from the Sparkify music streaming app.\n2. [create_tables.py](create_tables.py) drops and creates the database `sparkifydb` with its tables. This file is used to reset the database before running ETL scripts.\n3. [etl.py](etl.py) reads and processes files from the `data` directory and loads them into tables.\n4. [sql_queries.py](sql_queries.py) contains all SQL queries and is imported into `create_tables.py`, `etl.py`, and `etl.ipynb`.\n5. [etl.ipynb](etl.ipynb) reads and processes a single file from each of `song_data` and `log_data` and loads the data into tables.\n6. [test.ipynb](test.ipynb) displays the first few rows of each table to check the database.\n7. [dashboard.ipynb](dashboard.ipynb) generates statistics and analytic graphs on the database.\n8. 
[images](images/) directory contains all images generated from `dashboard.ipynb`.\n9. [README.md](README.md) (this file) gives a summary of the project and explains the schema design and the implementation. It also provides analyses on song plays.\n\n## Designing Database\n\nIn designing the database `sparkifydb` in PostgreSQL, the Sparkify team has two objectives:\n1. Load data from JSON files into tables so that it is easy to query.\n2. Design the tables to optimize queries on song play analysis.\n\nEach file in `song_data` contains metadata about a song and the artist of that song. This data, therefore, is stored in two separate tables:\n- **artists** (artist_id, *name, location, latitude, longitude*):\n\t- *artist_id* is the primary key of this table as it's unique to each artist.\n\t- Since the value of *artist_id* in the JSON file is text, its data type is VARCHAR.\n- **songs** (song_id, *artist_id, title, year, duration*):\n\t- *song_id* is the primary key as it's unique to each song.\n\t- *artist_id* is a foreign key that links to the table **artists**.\n\nEach file in `log_data` contains data about the user, the song, the time, the location, the browser, etc., recorded when the user uses the app. This data, therefore, is stored in two separate tables:\n- **users** (user_id, *first_name, last_name, gender, level*):\n\t- *user_id* is the primary key as it's unique to each user.\n\t- Since the value of *user_id* in the JSON file is a number, its data type is INT.\n- **songplays** (songplay_id, *user_id, start_time, song_id, session_id, location, user_agent*):\n\t- *songplay_id* is the primary key. 
It doesn't exist in the JSON file, so its data type is set to SERIAL to make inserting data easier.\n\t- *user_id* and *song_id* are foreign keys that link to the tables **users** and **songs**.\n\nThis design of four tables satisfies 3NF, but it limits flexibility and doesn't optimize song play analysis queries.\n\nFor example, to answer the question *\"which artist is the most popular in the app?\"* we need to join the three tables **songplays, songs**, and **artists**. As another example, to answer *\"which type of user is more active?\"* we need to join the two tables **songplays** and **users**.\n\nTherefore, to optimize queries on song play analysis, we denormalize the schema:\n- Add two more columns, *level* and *artist_id*, to the table **songplays**.\n- Break down *start_time* into specific units: *day, month, year, hour, week, weekday*. They are stored in a new table **time** (start_time, *day, month, year, hour, week, weekday*).\n\nFinally, we have a database schema optimized for queries on song plays, shown in the figure below.\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/sparkify_schema.svg\" alt=\"Sparkify Database Schema\"\u003e\n\u003c/p\u003e\n\n## Data Constraints\n\nForeign keys are required to be NOT NULL. These are the fields:\n- *artist_id* of the table **songs**\n- *start_time, user_id, song_id, artist_id* of the table **songplays**. However, to ease testing, I allow *song_id* and *artist_id* to be NULL.\n\nThe user level indicates the type of user, `free` or `paid`, hence the column *level* of the tables **users** and **songplays** is also NOT NULL.\n\n## Inserting Data\n\nSince the data is loaded from files, there may be a conflict on the primary key of the tables **songs, artists,** and **time** if that song, artist, or time has been added earlier. We therefore need to *do nothing* on conflict when inserting data. 
For example:\n\n```SQL\nsong_table_insert = (\"\"\"\nINSERT INTO songs(song_id, title, artist_id, year, duration)\nVALUES(%s, %s, %s, %s, %s)\nON CONFLICT(song_id)\nDO NOTHING;\n\"\"\")\n```\n\nSince users might change from `free` to `paid` and vice versa, we update the column *level* of the **users** table for existing records.\n\n```SQL\nuser_table_insert = (\"\"\"\nINSERT INTO users(user_id, first_name, last_name, gender, level)\nVALUES(%s, %s, %s, %s, %s)\nON CONFLICT(user_id)\nDO UPDATE\n    SET level = EXCLUDED.level;\n\"\"\")\n```\n\nThe primary key of the table **songplays** is an auto-increment field, so there's no conflict when inserting data. However, the data may contain duplicates, so we need to remove them when inserting data.\n\nI consider two records duplicates if they have the same values in all fields except *songplay_id*.\n\n```Python\n# insert songplay records\nsongplay_data = list()\nfor index, row in df.iterrows():\n\t# get songid and artistid from song and artist tables\n\tcur.execute(song_select, (row.song, row.artist, row.length))\n\tresults = cur.fetchone()\n\n\tif results:\n\t\tsongid, artistid = results\n\telse:\n\t\tsongid, artistid = None, None\n\n\tsongplay_data.append((\n\t\tpd.to_datetime(row.ts, unit='ms'), row.userId,\n\t\trow.level, songid, artistid, row.sessionId,\n\t\trow.location, row.userAgent\n\t))\n# remove duplicates\nsongplay_data = list(set(songplay_data))\ncur.executemany(songplay_table_insert, songplay_data)\n```\n\n## How to run the Python scripts\n\nThe scripts connect to PostgreSQL at the address `127.0.0.1` using the username `student` and password `student`. The user `student` must have permission to create a database. You must also have a database named `studentdb` on your system.\n\nFollow the steps below to test the program:\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/sparkify_process.svg\" alt=\"Sparkify Process\"\u003e\n\u003c/p\u003e\n\n1. 
Create the database `sparkifydb`: run the script `create_tables.py`.\n**Attention:** This script will drop your database `sparkifydb` if it exists.\n2. Import data from the directory `data` into the database:\n\t- Run the script `etl.py` if you want to load data from all JSON files into tables.\n\t- Run the notebook `etl.ipynb` if you want to load data from one JSON file in `song_data` and one JSON file in `log_data` into tables.\n3. After creating the database and importing data, you're free to run notebooks:\n\t- The notebook `dashboard.ipynb` provides general statistics on the tables and gives some analytic graphs on song plays.\n\t- The notebook `test.ipynb` displays 5 rows of each table.\n\n## General Statistics\n\nThese statistics were computed after loading all files in `data` into the database.\n- Total songplays = 6820\n- Total users = 96\n- Total artists = 69\n- Total songs = 71\n\n## Analyses\n\nThese analyses were performed after loading all files in `data` into the database.\n\n#### Which user level is more active on the Sparkify app?\n \n```SQL\nSELECT level, count(songplay_id)\nFROM songplays\nGROUP BY level;\n```\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/plays_per_level.png\" alt=\"plays per level\"\u003e\n\u003c/p\u003e\n\n#### What is the distribution of user levels?\n\n```SQL\nSELECT level, count(user_id)\nFROM users\nGROUP BY level;\n```\n \n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/users_per_level.png\" alt=\"users per level\"\u003e\n\u003c/p\u003e\n\n\n#### Which browsers are used to access the Sparkify app?\n \n```SQL\nSELECT user_agent, count(songplay_id)\nFROM songplays\nGROUP BY user_agent;\n```\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/browser.png\" alt=\"browser\"\u003e\n\u003c/p\u003e\n\n#### Which operating systems are used to access the Sparkify app?\n \n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/os.png\" alt=\"os\"\u003e\n\u003c/p\u003e\n\n#### Which devices are used to access the Sparkify app?\n \n\u003cp 
align=\"center\"\u003e\n\u003cimg src=\"images/device.png\" alt=\"device\"\u003e\n\u003c/p\u003e\n\n#### What is the weekly usage of the Sparkify app?\n\n```SQL\nSELECT week, count(songplay_id)\nFROM songplays\nJOIN time ON songplays.start_time = time.start_time\nGROUP BY week\nORDER BY week;\n```\n \n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/week.png\" alt=\"week\"\u003e\n\u003c/p\u003e\n\n#### Top users and top regions using the Sparkify app\n\n```SQL\nSELECT user_id, count(songplay_id) AS plays\nFROM songplays\nGROUP BY user_id\nORDER BY plays DESC\nLIMIT 10;\n```\n\n```SQL\nSELECT REVERSE(TRIM(SPLIT_PART(REVERSE(location), ',', 1))) AS region, count(songplay_id) AS plays\nFROM songplays\nGROUP BY region\nORDER BY plays DESC\nLIMIT 10;\n```\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"images/top_users_regions.png\" alt=\"top users regions\"\u003e\n\u003c/p\u003e","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvuanhtuan1012%2Fdata-modeling-with-postgres","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvuanhtuan1012%2Fdata-modeling-with-postgres","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvuanhtuan1012%2Fdata-modeling-with-postgres/lists"}