https://github.com/alexandrabaturina/data-modeling-with-postgres
ETL pipeline to transfer data from files into a Postgres database using Python and SQL.
- Host: GitHub
- URL: https://github.com/alexandrabaturina/data-modeling-with-postgres
- Owner: alexandrabaturina
- Created: 2022-07-21T17:02:41.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2022-07-23T00:53:29.000Z (about 3 years ago)
- Last Synced: 2025-02-14T20:39:38.170Z (8 months ago)
- Topics: data-modeling, postrgres, python-etl-pipeline, sql-queries, star-schema
- Language: Jupyter Notebook
- Homepage:
- Size: 411 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Engineering ND #1: Data Modeling With Postgres
## Overview
**Data Modeling With Postgres** is the first project of the Udacity [Data Engineering Nanodegree](https://d20vrrgs8k4bvw.cloudfront.net/documents/en-US/Data+Engineering+Nanodegree+Program+Syllabus.pdf). The task is to create a [Postgres](https://www.postgresql.org/) database for a music streaming app and to write an [ETL pipeline](https://databricks.com/glossary/extract-transform-load) that transfers data from files in two local directories into this database using Python and SQL. The project has the following goals:
* Model a relational database in Postgres using a [star schema](https://en.wikipedia.org/wiki/Star_schema)
* Define [fact](https://en.wikipedia.org/wiki/Star_schema#Fact_tables) and [dimension](https://en.wikipedia.org/wiki/Star_schema#Dimension_tables) tables
* Insert data into tables
* Create an ETL pipeline with Python
## Repo Contents
The project includes a ```data``` folder with the *song* and *log* datasets, as well as the following files:
* ```test.ipynb```: displays the first few rows of each table to check the database.
* ```create_tables.py```: drops and creates tables.
* ```etl.ipynb```: reads and processes a single file from ```data/song_data``` and ```data/log_data``` and loads the data into tables. This notebook contains detailed instructions on the ETL process for each of the tables.
* ```etl.py```: reads and processes files from ```data/song_data``` and ```data/log_data``` and loads them into tables.
* ```sql_queries.py```: contains all SQL queries.
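The queries themselves are not reproduced in this README; the sketch below shows one common way such a module is organized in this project template. The table definition and ```ON CONFLICT``` clause are assumptions, not the repo's exact code.

```python
# sql_queries.py -- hypothetical sketch; the actual queries in the repo may differ.

# DROP statements let create_tables.py start from a clean slate.
songplay_table_drop = "DROP TABLE IF EXISTS songplays"
song_table_drop = "DROP TABLE IF EXISTS songs"

# CREATE statements define the schema; columns mirror the song JSON fields.
song_table_create = """
    CREATE TABLE IF NOT EXISTS songs (
        song_id   VARCHAR PRIMARY KEY,
        title     VARCHAR NOT NULL,
        artist_id VARCHAR NOT NULL,
        year      INT,
        duration  FLOAT
    )
"""

# Parameterized INSERT used by the ETL scripts (%s placeholders for psycopg2).
song_table_insert = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING
"""

# create_tables.py iterates over these lists to drop and recreate everything.
create_table_queries = [song_table_create]
drop_table_queries = [songplay_table_drop, song_table_drop]
```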
## Database
### Database Purpose
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their music streaming app. The **sparkifydb** database is designed to optimize queries on song play analysis.
### Data Description
The data resides in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.
#### Song Dataset ```data/song_data```
This dataset is a subset of real data from the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song.
Below is an example of what a single song file ```data/song_data/A/A/B/TRAABJV128F1460C49.json``` looks like.
```
{
"num_songs": 1,
"artist_id": "ARIK43K1187B9AE54C",
"artist_latitude": null,
"artist_longitude": null,
"artist_location": "Beverly Hills, CA",
"artist_name": "Lionel Richie",
"song_id": "SOBONFF12A6D4F84D8",
"title": "Tonight Will Be Alright",
"duration": 307.3824,
"year": 1986
}
```
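Each song file holds a single record, so it can be read straight into a one-row DataFrame. Below is a minimal sketch of extracting the ***songs*** and ***artists*** rows with pandas; the column choices follow the JSON fields above, but treat the code as an assumption rather than the repo's exact ETL.

```python
import pandas as pd

# Each song file contains exactly one JSON object.
df = pd.read_json("data/song_data/A/A/B/TRAABJV128F1460C49.json", lines=True)

# Columns for the songs dimension table.
song_data = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()

# Columns for the artists dimension table.
artist_data = df[[
    "artist_id", "artist_name", "artist_location",
    "artist_latitude", "artist_longitude",
]].values[0].tolist()
```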
#### Log Dataset ```data/log_data```
This dataset consists of log files in JSON format generated by the [eventsim](https://github.com/Interana/eventsim) event simulator, based on the songs in the song dataset. The files simulate activity logs from a music streaming app according to specified configurations.
Below is an example of what a single line of a single file ```data/log_data/2018/11/2018-11-09-events.json``` looks like.
```
{
"artist":"Beastie Boys",
"auth":"Logged In",
"firstName":"Harper",
"gender":"M",
"itemInSession":2,
"lastName":"Barrett",
"length":161.56689,
"level":"paid",
"location":"New York-Newark-Jersey City, NY-NJ-PA",
"method":"PUT",
"page":"NextSong",
"registration":1540685364796.0,
"sessionId":275,
"song":"Lighten Up",
"status":200,
"ts":1541722186796,
"userAgent":"\"Mozilla\/5.0 (Windows NT 6.3; WOW64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"",
"userId":"42"
}
```
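Two details matter when processing these logs: only ```NextSong``` events are actual song plays, and ```ts``` is a Unix timestamp in milliseconds. A hedged sketch of that step, again assuming pandas as in the project template:

```python
import pandas as pd

# Log files contain one JSON object per line.
df = pd.read_json("data/log_data/2018/11/2018-11-09-events.json", lines=True)

# Keep only song-play events; other pages (Home, Login, ...) are not plays.
df = df[df["page"] == "NextSong"]

# ts is milliseconds since the epoch; convert it for the time dimension table.
t = pd.to_datetime(df["ts"], unit="ms")
time_data = list(zip(
    t, t.dt.hour, t.dt.day, t.dt.isocalendar().week,
    t.dt.month, t.dt.year, t.dt.weekday,
))
```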
### Database Design
**sparkifydb** is a Postgres database populated with data from the *song* and *log* datasets. It uses a star schema to optimize queries on song play analysis.
The **sparkifydb** database contains the following tables:
* Fact table
* ***songplays***: records in log data associated with song plays, i.e. records with page ```NextSong```
* Dimension tables
* ***users***: users in the app
* ***songs***: songs in the music database
* ***artists***: artists in the music database
* ***time***: timestamps of records in ***songplays*** broken down into specific units

The **sparkifydb** schema is shown below. Primary and foreign keys are marked as ```PK``` and ```FK```, respectively.
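A minimal sketch of the fact table's DDL shows how the dimensions hang off it; the column types and constraints are assumptions based on the table list above, not the repo's exact definitions.

```python
# Hypothetical DDL for the songplays fact table; each FK points at a dimension.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIMESTAMP REFERENCES time (start_time),
        user_id     INT REFERENCES users (user_id),
        level       VARCHAR,
        song_id     VARCHAR REFERENCES songs (song_id),
        artist_id   VARCHAR REFERENCES artists (artist_id),
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    )
"""
```

With this layout, analytical queries scan the narrow fact table and join out to a dimension only when they need its attributes.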
## Getting Started
To run the ETL pipeline locally:
1. Clone this repo.
2. ```cd``` into the project directory.
3. Run ```create_tables.py``` to create the database and tables:
```sh
python create_tables.py
```
> To confirm the creation of tables with correct columns, run ```test.ipynb```. Make sure to click **Restart kernel** to close the connection to the database after running this notebook.
4. Run ```etl.py``` to fill the tables:
```sh
python etl.py
```
> Remember to run ```create_tables.py``` before running ```etl.py``` to reset tables.
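To spot-check the load without opening the notebook, a quick query against the database works too. The connection parameters below match the defaults commonly used in the Udacity template and are an assumption.

```python
import psycopg2

# Default local connection from the Udacity template (assumed credentials).
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# A populated songplays table confirms the pipeline ran end to end.
cur.execute("SELECT COUNT(*) FROM songplays")
print(cur.fetchone()[0])

conn.close()
```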
## Authors
Alexandra Baturina