Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/kenhanscombe/project-postgres
Udacity data engineering nanodegree project
- Host: GitHub
- URL: https://github.com/kenhanscombe/project-postgres
- Owner: kenhanscombe
- Created: 2019-11-06T14:18:48.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-08T17:36:41.000Z (about 5 years ago)
- Last Synced: 2023-10-20T05:08:20.407Z (about 1 year ago)
- Topics: data-engineering, docker-image, postgres-database, python3, udacity-nanodegree
- Language: Jupyter Notebook
- Size: 32.2 KB
- Stars: 12
- Watchers: 2
- Forks: 28
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
README
# Project 1: Data modeling with Postgres
This **Udacity Data Engineering nanodegree** project creates a Postgres database `sparkifydb` for a music app, *Sparkify*. The purpose of the database is to model the song and log datasets (originally stored in JSON format) with a star schema optimised for queries on song play analysis.
> **Note:** The whole exercise can be run in a Docker container. See instructions below.
## Schema design and ETL pipeline
The star schema has one *fact* table (**songplays**) and four *dimension* tables (**users**, **songs**, **artists**, **time**). `DROP`, `CREATE`, `INSERT`, and `SELECT` queries are defined in **sql_queries.py**. **create_tables.py** uses the functions `create_database`, `drop_tables`, and `create_tables` to create the database sparkifydb and the required tables.
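The split between **sql_queries.py** and **create_tables.py** can be sketched as below. The column lists are illustrative assumptions inferred from the star-schema description, not the project's actual definitions, and only two of the five tables are shown.

```python
# Sketch of the sql_queries.py layout described above.
# Column definitions are assumptions, not the repository's actual schema.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     CHAR(1),
    level      VARCHAR
);
"""

songplay_table_drop = "DROP TABLE IF EXISTS songplays;"
user_table_drop = "DROP TABLE IF EXISTS users;"

# create_tables.py can then loop over these lists, calling
# cur.execute(query) for each statement in turn.
create_table_queries = [songplay_table_create, user_table_create]
drop_table_queries = [songplay_table_drop, user_table_drop]
```

Keeping the SQL in one module and the execution loop in another means the schema can change without touching the connection logic.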
![](sparkify_erd.png?raw=true)
Extract, transform, load processes in **etl.py** populate the **songs** and **artists** tables with data derived from the JSON song files, `data/song_data`. Processed data derived from the JSON log files, `data/log_data`, is used to populate the **time** and **users** tables. A `SELECT` query collects song and artist IDs from the **songs** and **artists** tables and combines these with log-file-derived data to populate the **songplays** fact table.
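Two steps of that pipeline can be sketched with the standard library alone: expanding a log-file timestamp into the **time** table's columns, and the ID-lookup query for **songplays**. The function and column names here are assumptions (the repository's **etl.py** defines the real ones, and uses pandas rather than `datetime`):

```python
from datetime import datetime, timezone

def expand_timestamp(ts_ms):
    """Expand a log-file epoch timestamp (milliseconds) into the
    assumed columns of the time dimension table:
    (start_time, hour, day, week, month, year, weekday)."""
    dt = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return (dt, dt.hour, dt.day, dt.isocalendar()[1],
            dt.month, dt.year, dt.weekday())

# The songplays lookup described above: match a log record's song title,
# artist name, and duration to the dimension-table IDs (names assumed).
song_select = """
SELECT s.song_id, a.artist_id
FROM songs s
JOIN artists a ON s.artist_id = a.artist_id
WHERE s.title = %s AND a.name = %s AND s.duration = %s;
"""
```

In the ETL loop this would run as `cur.execute(song_select, (row.song, row.artist, row.length))`, with `NULL` IDs inserted when no match is found.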
## Song play example queries
Simple queries might include the number of users at each membership level,
`SELECT level, COUNT(*) FROM users GROUP BY level;`
the day of the week on which music is most frequently listened to,
`SELECT weekday, COUNT(*) FROM time GROUP BY weekday ORDER BY COUNT(*) DESC LIMIT 1;`
or the hour of the day at which music is most often listened to.
`SELECT hour, COUNT(*) FROM time GROUP BY hour ORDER BY COUNT(*) DESC LIMIT 1;`
### A Docker image
I've created a Docker image **postgres-student-image** on Docker Hub, from which you can run a container with user 'student', password 'student', and database **studentdb** (the starting point for the exercise). You do not need to install Postgres; it runs inside the container.
To download the image, install [docker](https://docs.docker.com/), which requires you to create a username and password. In a terminal, log in to Docker Hub (you'll be prompted for your username and password)
```
docker login docker.io
```

Pull the image

```
docker pull onekenken/postgres-student-image
```

Run the container

```
docker run -d --name postgres-student-container -p 5432:5432 onekenken/postgres-student-image
```

The **create_tables.py** pre-defined connection `conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")` will now connect to the container.
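As a quick sanity check that the container is accepting connections, something along these lines works (the helper name is mine, not part of the project; psycopg2 must be installed):

```python
# DSN matching the user/password/database baked into the image above.
DSN = "host=127.0.0.1 dbname=studentdb user=student password=student"

def check_connection(dsn=DSN):
    """Connect to the containerised Postgres and return its version string."""
    import psycopg2  # third-party driver already used by create_tables.py
    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT version();")
            return cur.fetchone()[0]
    finally:
        conn.close()
```

If this raises `OperationalError`, the container is likely not running or port 5432 is already taken on the host.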
To stop and remove the container after the exercise
```
docker stop postgres-student-container
docker rm postgres-student-container
```