https://github.com/joyceannie/data-modeling-with-postgres

# Overview

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data. The objective of the project is to create a database schema and ETL pipeline for this analysis.

# Description

The main focus of the project is data modeling with Postgres and building an ETL pipeline with Python. The first step is to define fact and dimension tables for a star schema around a particular analytic focus. The second step is to write an ETL pipeline that transfers data from files in different directories into these tables in Postgres using Python and SQL.

# Dataset

## Song Dataset

The first dataset is a subset of real data from the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song.

Sample Song Record:

> {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

## Log Dataset

The second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the song dataset. These simulate activity logs from a music streaming app based on specified configurations.
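
The exact fields of the log files are not listed here, but a common way to process eventsim-style output (assuming line-delimited JSON with a `page` field and a millisecond `ts` timestamp, which is an assumption rather than something stated above) looks roughly like this:

```
import pandas as pd

# Hypothetical path; log files are assumed to be line-delimited JSON.
log_file = "data/log_data/example_log.json"
df = pd.read_json(log_file, lines=True)

# Assumption: only 'NextSong' events correspond to actual song plays,
# and 'ts' holds the event time in milliseconds since the epoch.
df = df[df["page"] == "NextSong"].copy()
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")

print(df.head())
```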

# Database Schema

We have used a star schema for the data modeling. The database consists of five tables (a sketch of possible DDL follows the list of tables below).

## Fact Table

* songplays: records in the log data associated with song plays by users.

## Dimension Tables

* users: Users of Sparkify app.

* songs: Songs in the dataset.

* artists: Artists in the dataset.

* time: Timestamps of the user activities.
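
To make the star schema concrete, here is a sketch of what the CREATE TABLE statements (kept in sql_queries.py, described under Project Files) might look like. The column names and types are assumptions inferred from the dataset descriptions above, not the project's exact DDL; the songs and artists tables would follow the same pattern as users.

```
# Sketch of possible CREATE TABLE statements (column choices are assumptions).

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour    INT,
    day     INT,
    week    INT,
    month   INT,
    year    INT,
    weekday INT
);
"""
```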

# Project Files

* sql_queries.py: contains the SQL queries for dropping and creating all the tables, as well as the insert query templates.

* create_tables.py: contains code for setting up the sparkify database and creating all the tables.

* etl.ipynb: a Jupyter notebook to analyze the dataset before loading.

* etl.py: reads and processes files from song_data and log_data and loads them into the tables (a rough sketch of this flow follows the list below).

* test.ipynb: a notebook to perform data validation.
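
As a rough illustration of the etl.py flow referenced above, the sketch below walks the song_data directory and inserts one row per file. The connection parameters, helper function, and insert statement are assumptions for illustration, not the project's exact code.

```
import glob
import os

import pandas as pd
import psycopg2

def get_files(dirpath):
    # Collect all JSON files under a data directory (hypothetical helper).
    pattern = os.path.join(dirpath, "**", "*.json")
    return [os.path.abspath(f) for f in glob.glob(pattern, recursive=True)]

# Connection parameters are assumptions for a local setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkify user=postgres password=postgres")
cur = conn.cursor()

for filepath in get_files("data/song_data"):
    df = pd.read_json(filepath, lines=True)
    song_row = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(
        "INSERT INTO songs (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING",
        song_row,
    )
    conn.commit()

conn.close()
```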

# Environment

Python 3.6 or above

PostgreSQL 9.5 or above

psycopg2 - PostgreSQL database adapter for Python
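
If you want to confirm that PostgreSQL is reachable and psycopg2 is installed before running the scripts, a quick check along these lines works; the connection parameters are assumptions and should match your local setup.

```
import psycopg2

# Connection parameters below are placeholders for a local default setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres")
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```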

# How to Run

Run the Python scripts in the order given below.

```
python create_tables.py
python etl.py
```
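
After both scripts finish, a quick row-count check (in addition to test.ipynb) can confirm that data landed in all five tables. The connection parameters are again assumptions for a local setup.

```
import psycopg2

# Count rows in each table created by create_tables.py and loaded by etl.py.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkify user=postgres password=postgres")
with conn.cursor() as cur:
    for table in ("songplays", "users", "songs", "artists", "time"):
        cur.execute("SELECT COUNT(*) FROM {}".format(table))
        print(table, cur.fetchone()[0])
conn.close()
```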