https://github.com/joyceannie/data-modeling-with-postgres

# Overview

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data. The objective of the project is to create a database schema and ETL pipeline for this analysis.

# Description

The main focus of the project is data modeling with Postgres and building an ETL pipeline with Python. The first step is to define fact and dimension tables for a star schema around a particular analytic focus. The second step is to write an ETL pipeline that transfers data from files in different directories into these tables in Postgres using Python and SQL.

# Dataset

## Song Dataset

The first dataset is a subset of real data from the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song.

Sample Song Record:

> {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

## Log Dataset

The second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the song dataset. These simulate activity logs from a music streaming app based on specified configurations.
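
The exact fields of the log files are not listed here, but a common way to process eventsim-style output (assuming line-delimited JSON with a `page` field and a millisecond `ts` timestamp, which is an assumption rather than something stated above) looks roughly like this:

```
import pandas as pd

# Hypothetical path; log files are assumed to be line-delimited JSON.
log_file = "data/log_data/example_log.json"
df = pd.read_json(log_file, lines=True)

# Assumption: only 'NextSong' events correspond to actual song plays,
# and 'ts' holds the event time in milliseconds since the epoch.
df = df[df["page"] == "NextSong"].copy()
df["start_time"] = pd.to_datetime(df["ts"], unit="ms")

print(df.head())
```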

# Database Schema

We have used a star schema for the data modeling. The database consists of five tables (a sketch of possible DDL follows the list of tables below).

## Fact Table

* songplays: records in the log data associated with song plays by users.

## Dimension Tables

* users: Users of Sparkify app.

* songs: Songs in the dataset.

* artists: Artists in the dataset.

* time: Timestamps of the user activities.
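
To make the star schema concrete, here is a sketch of what the CREATE TABLE statements (kept in sql_queries.py, described under Project Files) might look like. The column names and types are assumptions inferred from the dataset descriptions above, not the project's exact DDL; the songs and artists tables would follow the same pattern as users.

```
# Sketch of possible CREATE TABLE statements (column choices are assumptions).

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour    INT,
    day     INT,
    week    INT,
    month   INT,
    year    INT,
    weekday INT
);
"""
```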

# Project Files

* sql_queries.py: contains the SQL queries for dropping and creating all the tables, as well as the insert query templates.

* create_tables.py: contains code for setting up the sparkify database and creating all the tables.

* etl.ipynb: a Jupyter notebook to analyze the dataset before loading.

* etl.py: reads and processes files from song_data and log_data and loads them into the tables (a rough sketch of this flow follows the list below).

* test.ipynb: a notebook to perform data validation.
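
As a rough illustration of the etl.py flow referenced above, the sketch below walks the song_data directory and inserts one row per file. The connection parameters, helper function, and insert statement are assumptions for illustration, not the project's exact code.

```
import glob
import os

import pandas as pd
import psycopg2

def get_files(dirpath):
    # Collect all JSON files under a data directory (hypothetical helper).
    pattern = os.path.join(dirpath, "**", "*.json")
    return [os.path.abspath(f) for f in glob.glob(pattern, recursive=True)]

# Connection parameters are assumptions for a local setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkify user=postgres password=postgres")
cur = conn.cursor()

for filepath in get_files("data/song_data"):
    df = pd.read_json(filepath, lines=True)
    song_row = df[["song_id", "title", "artist_id", "year", "duration"]].values[0].tolist()
    cur.execute(
        "INSERT INTO songs (song_id, title, artist_id, year, duration) "
        "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING",
        song_row,
    )
    conn.commit()

conn.close()
```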

# Environment

Python 3.6 or above

PostgreSQL 9.5 or above

psycopg2 - PostgreSQL database adapter for Python
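
If you want to confirm that PostgreSQL is reachable and psycopg2 is installed before running the scripts, a quick check along these lines works; the connection parameters are assumptions and should match your local setup.

```
import psycopg2

# Connection parameters below are placeholders for a local default setup.
conn = psycopg2.connect("host=127.0.0.1 dbname=postgres user=postgres password=postgres")
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])
conn.close()
```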

# How to Run

Run the Python scripts in the order given below.

```
python create_tables.py
python etl.py
```
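
After both scripts finish, a quick row-count check (in addition to test.ipynb) can confirm that data landed in all five tables. The connection parameters are again assumptions for a local setup.

```
import psycopg2

# Count rows in each table created by create_tables.py and loaded by etl.py.
conn = psycopg2.connect("host=127.0.0.1 dbname=sparkify user=postgres password=postgres")
with conn.cursor() as cur:
    for table in ("songplays", "users", "songs", "artists", "time"):
        cur.execute("SELECT COUNT(*) FROM {}".format(table))
        print(table, cur.fetchone()[0])
conn.close()
```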