
# Data Modeling with Postgres

## Introduction

A startup called Sparkify wants to analyze the data they’ve been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in **understanding what songs users are listening to**.

Currently, they don’t have an easy way to query their data, which resides in a directory of JSON logs on user activity on the app and a directory with JSON metadata on the songs in their app. They’d like to **create a Postgres database with tables designed to optimize queries on song play analysis**.

In this project, I will:
- Create a star schema: define fact and dimension tables for analytic focus.
- Write an ETL pipeline: transfer data from files in two local directories into tables in Postgres.
- Test the database and ETL pipeline.
- Analyze the song plays.

## Repo Structure

This repo consists of seven files and two directories:
1. [data](data/) directory contains two sub-directories:
    - [song_data](data/song_data/) is a subset of real data from the [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/). Each file is in JSON format and contains metadata about a song and the artist of that song.
    - [log_data](data/log_data/) consists of log files in JSON format generated by the [event simulator](https://github.com/Interana/eventsim), simulating activity logs from the Sparkify music streaming app.
2. [create_tables.py](create_tables.py) drops and creates the database `sparkifydb` with its tables. This file is used to reset the database before running the ETL scripts.
3. [etl.py](etl.py) reads and processes files from the `data` directory and loads them into the tables.
4. [sql_queries.py](sql_queries.py) contains all SQL queries and is imported by `create_tables.py`, `etl.py`, and `etl.ipynb`.
5. [etl.ipynb](etl.ipynb) reads and processes a single file from `song_data` and `log_data` and loads the data into tables.
6. [test.ipynb](test.ipynb) displays the first few rows of each table to check the database.
7. [dashboard.ipynb](dashboard.ipynb) generates statistics and analytic graphs on the database.
8. [images](images/) directory contains all images generated by `dashboard.ipynb`.
9. [README.md](README.md) (this file) gives a summary of the project, explains the schema design and the code, and provides analyses of song plays.

## Designing the Database

In designing the `sparkifydb` database in PostgreSQL, the Sparkify team has two objectives:
1. Load the data from JSON files into tables so that it is easy to query.
2. Design the tables to optimize queries on song play analysis.

Each file in `song_data` contains metadata about a song and the artist of that song. This data, therefore, is stored in two separate tables:
- **artists** (artist_id, *name, location, latitude, longitude*):
    - *artist_id* is the primary key of this table, as it's unique to each artist.
    - Since the value of *artist_id* in the JSON files is text, its data type is VARCHAR.
- **songs** (song_id, *artist_id, title, year, duration*):
    - *song_id* is the primary key, as it's unique to each song.
    - *artist_id* is a foreign key linking to the **artists** table.

Each file in `log_data` contains data about the user, the song, the time, the location, the browser, etc., for each use of the app. This data, therefore, is stored in two separate tables:
- **users** (user_id, *first_name, last_name, gender, level*):
    - *user_id* is the primary key, as it's unique to each user.
    - Since the value of *user_id* in the JSON files is a number, its data type is INT.
- **songplays** (songplay_id, *user_id, start_time, song_id, session_id, location, user_agent*):
    - *songplay_id* is the primary key. It doesn't exist in the JSON files, so its data type is set to SERIAL to ease inserting data.
    - *user_id* and *song_id* are foreign keys linking to the **users** and **songs** tables (see the sketch below).
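
The repo's actual definitions live in `sql_queries.py`; below is a minimal sketch of how these table definitions might look there. The column types are assumptions based on the descriptions above, not the repo's exact code.

```Python
# Hypothetical sketch in the style of sql_queries.py; exact types may differ.
artist_table_create = ("""
CREATE TABLE IF NOT EXISTS artists(
    artist_id VARCHAR PRIMARY KEY,  -- text in the JSON files
    name VARCHAR,
    location VARCHAR,
    latitude FLOAT,
    longitude FLOAT
);
""")

song_table_create = ("""
CREATE TABLE IF NOT EXISTS songs(
    song_id VARCHAR PRIMARY KEY,
    title VARCHAR,
    artist_id VARCHAR,  -- links to artists
    year INT,
    duration FLOAT
);
""")

user_table_create = ("""
CREATE TABLE IF NOT EXISTS users(
    user_id INT PRIMARY KEY,  -- a number in the JSON files
    first_name VARCHAR,
    last_name VARCHAR,
    gender VARCHAR,
    level VARCHAR
);
""")
```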

This design of four tables satisfies 3NF, but it limits flexibility and doesn't optimize song play analysis queries.

For example, to answer the question *"which artist is the most popular on the app?"* we need to join three tables: **songplays**, **songs**, and **artists**. Similarly, to answer *"which type of user is more active?"* we need to join the **songplays** and **users** tables.

Therefore, to optimize queries on song play analysis, we denormalize:
- add two more columns, *level* and *artist_id*, to the **songplays** table.
- break *start_time* down into specific units: *day, month, year, hour, week, weekday*. They are stored in a new table **time** (start_time, *day, month, year, hour, week, weekday*), as sketched below.
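
In the ETL pipeline, these units can be derived from the millisecond timestamp in the log files with pandas. A minimal sketch, assuming `df` is a DataFrame of log records (the exact code in `etl.py`/`etl.ipynb` may differ):

```Python
import pandas as pd

# ts in the log files is a Unix timestamp in milliseconds
t = pd.to_datetime(df['ts'], unit='ms')

# break start_time down into the units stored in the time table
time_df = pd.DataFrame({
    'start_time': t,
    'day': t.dt.day,
    'month': t.dt.month,
    'year': t.dt.year,
    'hour': t.dt.hour,
    'week': t.dt.isocalendar().week,
    'weekday': t.dt.weekday,
})
```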

Finally, we have a database schema optimized for queries on song play analysis, shown in the figure below.


*Sparkify database schema*

## Data Constraints

Foreign keys are required to be NOT NULL. These are the fields:
- *artist_id* of the **songs** table
- *start_time*, *user_id*, *song_id*, and *artist_id* of the **songplays** table. However, to ease the tests, I allow *song_id* and *artist_id* to be NULL (the song lookup finds no match for most log records, as shown in the insertion code below).

The user level indicates the type of user, `free` or `paid`; hence, the *level* column of both the **users** and **songplays** tables is also NOT NULL.
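
Putting these constraints together, the **songplays** definition in `sql_queries.py` presumably looks something like this sketch (assumed, not copied from the repo):

```Python
songplay_table_create = ("""
CREATE TABLE IF NOT EXISTS songplays(
    songplay_id SERIAL PRIMARY KEY,  -- auto-increment; not in the JSON files
    start_time TIMESTAMP NOT NULL,
    user_id INT NOT NULL,
    level VARCHAR NOT NULL,
    song_id VARCHAR,   -- foreign key, left nullable to ease the tests
    artist_id VARCHAR, -- foreign key, left nullable to ease the tests
    session_id INT,
    location VARCHAR,
    user_agent VARCHAR
);
""")
```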

## Inserting Data

Since the data is loaded from files, inserting into the **songs**, **artists**, and **time** tables may conflict on the primary key if that song, artist, or timestamp has been added earlier. We therefore need to *do nothing* on conflict when inserting data. For example,

```SQL
song_table_insert = ("""
INSERT INTO songs(song_id, title, artist_id, year, duration)
VALUES(%s, %s, %s, %s, %s)
ON CONFLICT(song_id)
DO NOTHING;
""")
```
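
`DO NOTHING` silently skips the conflicting row, so re-running `etl.py` over the same files leaves the **songs**, **artists**, and **time** tables unchanged.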

Since users might change from `free` to `paid` and vice versa, we update the *level* column of existing records in the **users** table.

```SQL
user_table_insert = ("""
INSERT INTO users(user_id, first_name, last_name, gender, level)
VALUES(%s, %s, %s, %s, %s)
ON CONFLICT(user_id)
DO UPDATE
SET level = EXCLUDED.level;
""")
```
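
In PostgreSQL, `EXCLUDED` refers to the row that was proposed for insertion, so an existing user's *level* is overwritten with the latest value seen in the logs.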

The primary key of the **songplays** table is an auto-increment field, so there's no conflict when inserting data, but there may be duplicate records. We need to remove duplicates before inserting.

From my perspective, two records are duplicates if they have the same values in all fields except *songplay_id*.

```Python
# insert songplay records
songplay_data = list()
for index, row in df.iterrows():
    # get songid and artistid from song and artist tables
    cur.execute(song_select, (row.song, row.artist, row.length))
    results = cur.fetchone()

    if results:
        songid, artistid = results
    else:
        songid, artistid = None, None

    songplay_data.append((
        pd.to_datetime(row.ts, unit='ms'), row.userId,
        row.level, songid, artistid, row.sessionId,
        row.location, row.userAgent
    ))

# remove duplicates: identical tuples collapse to one entry in the set
songplay_data = list(set(songplay_data))
cur.executemany(songplay_table_insert, songplay_data)
```
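
Collapsing the list through `set()` works because each record is a tuple of hashable values; the ordering it loses doesn't matter, since *songplay_id* is assigned by the database.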

## How to run the Python scripts

The scripts connect to PostgreSQL at `127.0.0.1` with the username `student` and password `student`. The `student` user must have permission to create a database, and a database named `studentdb` must exist on your system.
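
For reference, the connection presumably looks something like the sketch below (hypothetical; the exact call in the scripts may differ):

```Python
import psycopg2

# connect to the default studentdb database first;
# create_tables.py drops and recreates sparkifydb from there
conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
conn.set_session(autocommit=True)
cur = conn.cursor()
```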

Follow the steps below to test the program:


*Sparkify process*

1. Create the database `sparkifydb`: run the script `create_tables.py`.
**Attention:** This script will drop your database `sparkifydb` if it exists.
2. Import data from the directory `data` into the database:
- Run the script `etl.py` if you want to load data from all JSON files into tables.
- Run the notebook `etl.ipynb` if you want to load data from one JSON file in `song_data` and one JSON file in `log_data` into tables.
3. After creating the database and importing data, you're free to run notebooks:
- The notebook `dashboard.ipynb` provides general statistics on the tables and some analytic graphs on song plays.
- The notebook `test.ipynb` displays the first 5 rows of each table.

## General Statistics

These statistics are computed after loading all files in `data` into the database.
- Total songplays = 6820
- Total users = 96
- Total artists = 69
- Total songs = 71
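
These counts can be reproduced with one `COUNT(*)` query per table; a minimal sketch, assuming an open cursor `cur` as in the connection sketch above:

```Python
for table in ('songplays', 'users', 'artists', 'songs'):
    cur.execute(f"SELECT COUNT(*) FROM {table};")
    print(table, cur.fetchone()[0])
```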

## Analyses

These analyses are computed after loading all files in `data` into the database.

#### Which user level is more active on the Sparkify app?

```SQL
SELECT level, count(songplay_id)
FROM songplays
GROUP BY level;
```


*plays per level*

#### What is the rate of user levels?

```SQL
SELECT level, count(user_id)
FROM users
GROUP BY level;
```


*users per level*

#### Which browsers are used to access the Sparkify app?

```SQL
SELECT user_agent, count(songplay_id)
FROM songplays
GROUP BY user_agent;
```


*browsers*

#### Which operating systems are used to access the Sparkify app?


*operating systems*

#### Which devices are used to access the Sparkify app?


*devices*
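
The README doesn't include the queries behind these two charts; presumably `dashboard.ipynb` derives the OS and device from the *user_agent* strings. A rough sketch of one way to do it with pandas (the keyword table and the `agents` DataFrame are assumptions for illustration):

```Python
def classify_os(user_agent):
    # crude keyword matching on the raw user-agent string
    for keyword, os_name in [('Windows', 'Windows'), ('Macintosh', 'macOS'),
                             ('iPhone', 'iOS'), ('iPad', 'iOS'), ('Linux', 'Linux')]:
        if keyword in user_agent:
            return os_name
    return 'Other'

# agents: result of the browser query above as a DataFrame
# with columns user_agent and plays
agents['os'] = agents['user_agent'].apply(classify_os)
print(agents.groupby('os')['plays'].sum())
```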

#### What is the rate of use of the Sparkify app over the weeks?

```SQL
SELECT week, count(songplay_id)
FROM songplays
JOIN time ON songplays.start_time = time.start_time
GROUP BY week
ORDER BY week;
```


*plays per week*

#### Top users and top regions using the Sparkify app

```SQL
SELECT user_id, count(songplay_id) AS plays
FROM songplays
GROUP BY user_id
ORDER BY plays DESC
LIMIT 10;
```

```SQL
SELECT REVERSE(TRIM(SPLIT_PART(REVERSE(location), ',', 1))) AS region, count(songplay_id) AS plays
FROM songplays
GROUP BY region
ORDER BY plays DESC
LIMIT 10;
```
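
The second query extracts the last comma-separated token of *location* (the state or region code): reversing the string turns that token into the first one, `SPLIT_PART` picks it out, and the second `REVERSE` restores its original character order.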


*top users and top regions*