https://github.com/kenhanscombe/project-postgres

Udacity data engineering nanodegree project
https://github.com/kenhanscombe/project-postgres

data-engineering docker-image postgres-database python3 udacity-nanodegree

Last synced: 2 months ago
JSON representation

Udacity data engineering nanodegree project

Host: GitHub
URL: https://github.com/kenhanscombe/project-postgres
Owner: kenhanscombe
Created: 2019-11-06T14:18:48.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2019-11-08T17:36:41.000Z (over 5 years ago)
Last Synced: 2025-04-01T22:47:15.039Z (3 months ago)
Topics: data-engineering, docker-image, postgres-database, python3, udacity-nanodegree
Language: Jupyter Notebook
Size: 32.2 KB
Stars: 22
Watchers: 2
Forks: 30
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # Project 1: Data modeling with Postgres

This **Udacity Data Engineering nanodegree** project creates a postgres database `sparkifydb` for a music app, *Sparkify*. The purpose of the database is to model song and log datasets (originaly stored in JSON format) with a star schema optimised for queries on song play analysis.

> **Note:** The whole exercise can be run in a docker container. See instruction below.

## Schema design and ETL pipeline

The star schema has 1 *fact* table (songplays), and 4 *dimension* tables (users, songs, artists, time). `DROP`, `CREATE`, `INSERT`, and `SELECT` queries are defined in **sql_queries.py**. **create_tables.py** uses functions `create_database`, `drop_tables`, and `create_tables` to create the database sparkifydb and the required tables.

![](sparkify_erd.png?raw=true)

Extract, transform, load processes in **etl.py** populate the **songs** and **artists** tables with data derived from the JSON song files, `data/song_data`. Processed data derived from the JSON log files, `data/log_data`, is used to populate **time** and **users** tables. A `SELECT` query collects song and artist id from the **songs** and **artists** tables and combines this with log file derived data to populate the **songplays** fact table.

## Song play example queries

Simple queries might include number of users with each membership level.

`SELECT COUNT(level) FROM users;`

Day of the week music most frequently listened to.

`SELECT COUNT(weekday) FROM time;`

Or, hour of the day music most often listened to.

`SELECT COUNT(hour) FROM time;`




### **A docker image**

I've created a docker image **postgres-student-image** on docker hub, from which you can run a container with user 'student', password 'student', and database **studentdb** (the starting point for the exercise). You do not need to install postgres (it runs in the container).

To download the image, install [docker](https://docs.docker.com/) which requires you to create a username and password. In a terminal, log into docker hub (you'll be prompted for your docker username and password)

```

docker login docker.io

```

Pull the image

```

docker pull onekenken/postgres-student-image

```

Run the container

```

docker run -d --name postgres-student-container -p 5432:5432 onekenken/postgres-student-image

```

The **create_tables.py** pre-defined connection `conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")` will now connect to the container.

To stop and remove the container after the exercise

```

docker stop postgres-student-container

docker rm postgres-student-container

```

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/kenhanscombe/project-postgres

Awesome Lists containing this project

README