https://github.com/joyceannie/data-modeling-with-postgres
The project focuses on data modeling with Postgres and building an ETL pipeline with Python. The first step is to define the fact and dimension tables of a star schema for a particular analytic focus. The second step is to write an ETL pipeline that uses Python and SQL to load data from files in different directories into these Postgres tables.
- Host: GitHub
- URL: https://github.com/joyceannie/data-modeling-with-postgres
- Owner: joyceannie
- Created: 2022-06-11T18:23:31.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2022-06-22T02:39:36.000Z (over 2 years ago)
- Last Synced: 2023-03-05T11:33:03.235Z (almost 2 years ago)
- Topics: data-engineering, data-modeling, postgresql, python
- Language: Python
- Homepage:
- Size: 397 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Overview
A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. Currently, they don't have an easy way to query their data. The objective of the project is to create a database schema and ETL pipeline for this analysis.
# Description
The project focuses on data modeling with Postgres and building an ETL pipeline with Python. The first step is to define the fact and dimension tables of a star schema for a particular analytic focus. The second step is to write an ETL pipeline that uses Python and SQL to load data from files in different directories into these Postgres tables.
# Dataset
## Song Dataset
The first dataset is a subset of real data from the [Million Song Dataset](http://millionsongdataset.com/). Each file is in JSON format and contains metadata about a song and the artist of that song.
Sample Song Record:
> {"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
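A minimal sketch of reading one such record in Python, using only the standard library. The helper name `parse_song_record` and the rule that an empty location or a year of 0 means "unknown" are assumptions made for illustration; they are not taken from the project's code.

```python
import json

# The sample record above, as it would appear as one line of a song file.
RAW = (
    '{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", '
    '"artist_latitude": null, "artist_longitude": null, '
    '"artist_location": "", "artist_name": "Line Renaud", '
    '"song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", '
    '"duration": 152.92036, "year": 0}'
)


def parse_song_record(raw):
    """Parse one JSON song record (hypothetical helper)."""
    record = json.loads(raw)  # JSON null becomes Python None
    # Assumption: an empty location and a year of 0 are placeholders
    # for "unknown", so both are mapped to None.
    if record.get("artist_location") == "":
        record["artist_location"] = None
    if record.get("year") == 0:
        record["year"] = None
    return record


song = parse_song_record(RAW)
print(song["title"], song["duration"], song["artist_latitude"])
```

Mapping placeholders to `None` means they would be stored as SQL `NULL` rather than as misleading values, which is one possible design choice among several.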
## Log Dataset
The second dataset consists of log files in JSON format generated by this [event simulator](https://github.com/Interana/eventsim) based on the song dataset. These simulate activity logs from a music streaming app based on specified configurations.
# Database Schema
The data model uses a star schema. The database consists of five tables: one fact table and four dimension tables.
## Fact Table
* songplays: Records from the log data associated with songs played by users.
## Dimension Tables
* users: Users of the Sparkify app.
* songs: Songs in the dataset.
* artists: Artists in the dataset.
* time: Timestamps of the user activities.
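As an illustration of how one source record can feed two dimension tables, the sketch below splits the sample song record shown earlier into a `songs` row and an `artists` row. The column names and order, the helper functions, and the primary keys implied by `ON CONFLICT` are all assumptions derived from the sample record's field names, not the project's actual schema. The `%s` placeholders follow psycopg2's parameter style.

```python
# Hypothetical insert templates; the column lists are assumptions.
# ON CONFLICT requires PostgreSQL 9.5+ and a unique key on the named column.
SONG_INSERT = """
    INSERT INTO songs (song_id, title, artist_id, year, duration)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (song_id) DO NOTHING;
"""

ARTIST_INSERT = """
    INSERT INTO artists (artist_id, name, location, latitude, longitude)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (artist_id) DO NOTHING;
"""

# The sample song record, already parsed into a dict.
SAMPLE = {
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": None,
    "artist_longitude": None,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0,
}


def song_row(record):
    """Pick the song-level fields, in the order SONG_INSERT expects."""
    return (record["song_id"], record["title"], record["artist_id"],
            record["year"], record["duration"])


def artist_row(record):
    """Pick the artist-level fields, in the order ARTIST_INSERT expects."""
    return (record["artist_id"], record["artist_name"],
            record["artist_location"], record["artist_latitude"],
            record["artist_longitude"])


print(song_row(SAMPLE))
print(artist_row(SAMPLE))
```

With a psycopg2 cursor `cur`, each row would be passed as the second argument, for example `cur.execute(SONG_INSERT, song_row(SAMPLE))`.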
# Project Files
* sql_queries.py: contains the SQL queries for dropping and creating all the tables, as well as the insert query templates.
* create_tables.py: contains the code for setting up the sparkify database with all the tables.
* etl.ipynb: a Jupyter notebook for exploring the dataset before loading.
* etl.py: reads and processes files from song_data and log_data and loads them into the tables.
* test.ipynb: a notebook to perform data validation.
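Because the data files live in different directories, the ETL step has to discover them before it can load anything. The following is a rough sketch of that discovery step, not the code in etl.py: the function names `get_json_files` and `process_data`, the `.json` extension filter, and the demo directory layout are assumptions.

```python
import os
import tempfile


def get_json_files(root):
    """Return the sorted absolute paths of all .json files under root."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".json"):
                paths.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(paths)


def process_data(root, handler):
    """Call handler(path) for every JSON file under root; return the count."""
    files = get_json_files(root)
    for path in files:
        handler(path)
    return len(files)


# Demo on a throwaway directory tree (layout is made up for illustration).
with tempfile.TemporaryDirectory() as root:
    nested = os.path.join(root, "song_data", "A", "B")
    os.makedirs(nested)
    for name in ("one.json", "two.json", "notes.txt"):
        with open(os.path.join(nested, name), "w") as fh:
            fh.write("{}")

    seen = []
    count = process_data(os.path.join(root, "song_data"), seen.append)
    print(count, [os.path.basename(p) for p in seen])
```

In a real pipeline the handler would parse each file and run the insert queries; here it only records the paths so the sketch runs without a database.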
# Environment
* Python 3.6 or above
* PostgreSQL 9.5 or above
* psycopg2: PostgreSQL database adapter for Python
# How to Run
Run the Python scripts in the order given below.
```
python create_tables.py
python etl.py
```
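The order matters because the tables must exist before the ETL step inserts into them. The drop-then-create pattern can be sketched as follows. This is an illustration only: `sqlite3` stands in for Postgres so that the example runs without a database server, and the query lists, column definitions, and the `reset_tables` helper are assumptions rather than the contents of sql_queries.py or create_tables.py.

```python
import sqlite3

# Hypothetical query lists; the columns are assumptions.
DROP_QUERIES = [
    "DROP TABLE IF EXISTS songs;",
    "DROP TABLE IF EXISTS artists;",
]
CREATE_QUERIES = [
    "CREATE TABLE IF NOT EXISTS songs ("
    "song_id text PRIMARY KEY, title text, artist_id text, "
    "year int, duration numeric);",
    "CREATE TABLE IF NOT EXISTS artists ("
    "artist_id text PRIMARY KEY, name text);",
]


def reset_tables(conn):
    """Drop every table, then recreate it, so each run starts clean."""
    cur = conn.cursor()
    for query in DROP_QUERIES + CREATE_QUERIES:
        cur.execute(query)
    conn.commit()


# sqlite3 in-memory database used as a stand-in for the Postgres connection.
conn = sqlite3.connect(":memory:")
reset_tables(conn)
reset_tables(conn)  # safe to run again: tables are dropped and rebuilt

tables = sorted(
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )
)
print(tables)
```

With psycopg2 the same loop would run against a cursor obtained from a Postgres connection.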