Data modeling and ETL pipeline using Python and PostgreSQL

https://github.com/mikeacosta/data-model-songplays

# data-model-songplays

## Background

The analytics team at music streaming startup Sparkify wants to analyze the song-listening activity of its users. The analysis will be based on JSON user activity logs and song metadata collected from their mobile app.

## Objective

The goal of this project is to design a database schema and build an ETL pipeline that loads data from the activity-log and song-metadata JSON files into a Postgres database, delivering tables against which the analytics team can run optimized queries for song play analysis.

## Data model
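
The schema diagram is not reproduced here. As a hedged sketch, the tables named in the ETL summary (`songplays`, `users`, `songs`, `artists`, `time`) suggest a star schema with `songplays` as the fact table. The column names and types below are assumptions written in the style of `sql_queries.py`, not the project's actual definitions:

```python
# Hypothetical table definitions in the style of sql_queries.py.
# Column names and types are assumptions, not taken from the actual project.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR,
    artist_id VARCHAR,
    year      INT,
    duration  FLOAT
);
"""

artist_table_create = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id VARCHAR PRIMARY KEY,
    name      VARCHAR,
    location  VARCHAR,
    latitude  FLOAT,
    longitude FLOAT
);
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour INT, day INT, week INT, month INT, year INT, weekday INT
);
"""
```

In a star schema like this, each `songplays` row references the dimension tables, so analytical queries typically join the fact table to one or two dimensions rather than scanning raw logs.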

## ETL summary

1. Process song files
    1. Insert unique song records
    2. Insert unique artist records
2. Process log files
    1. Filter by "NextSong" action
    2. Insert time records
    3. Insert unique user records
    4. Insert songplay records, getting `songid` and `artistid` from the `songs` and `artists` tables, respectively
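
The steps above can be sketched in plain Python (no database connection). The helper names and JSON field names are assumptions based on the step descriptions, not the project's actual `etl.py`:

```python
import json
from datetime import datetime, timezone

def extract_song_and_artist(song_file_text):
    """Step 1: pull one song record and one artist record from a song JSON file.
    Field names are assumed, not taken from the real dataset."""
    data = json.loads(song_file_text)
    song = (data["song_id"], data["title"], data["artist_id"],
            data["year"], data["duration"])
    artist = (data["artist_id"], data["artist_name"],
              data.get("artist_location"), data.get("artist_latitude"),
              data.get("artist_longitude"))
    return song, artist

def filter_next_song(log_records):
    """Step 2.1: keep only events where the user actually played a song."""
    return [r for r in log_records if r.get("page") == "NextSong"]

def expand_timestamp(ms):
    """Step 2.2: break a millisecond epoch timestamp into time-table columns."""
    t = datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)
    return (t, t.hour, t.day, t.isocalendar()[1], t.month, t.year, t.weekday())

# Usage with a made-up log record:
logs = [{"page": "NextSong", "ts": 1541106106796, "userId": "8"},
        {"page": "Home", "ts": 1541106106796, "userId": "8"}]
plays = filter_next_song(logs)  # only the NextSong event survives
```

Step 2.4 would then look up `songid` and `artistid` by matching title, artist name, and duration against the `songs` and `artists` tables before inserting each songplay row.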


## Project files

- `sql_queries.py` - queries for creating tables and inserting data
- `create_tables.py` - drops and creates tables; used to reset tables prior to running ETL scripts
- `etl.ipynb` - notebook for developing the ETL process; runs insert queries with sample data from `song_data` and `log_data`
- `test.ipynb` - selects and displays data from each table to verify that data is correctly inserted
- `etl.py` - primary ETL script; populates tables from all activity and song metadata files
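
Since `etl.py` must process every file under the data directories, a sketch of how it might discover them follows; the directory-walking approach and directory names are assumptions, only the file descriptions above come from the project:

```python
import glob
import os

def get_files(root):
    """Collect all JSON files under a data directory (e.g. song_data/ or
    log_data/), walking nested subdirectories. A sketch of what etl.py
    might do; the real implementation may differ."""
    files = []
    for dirpath, _, _ in os.walk(root):
        files.extend(glob.glob(os.path.join(dirpath, "*.json")))
    return sorted(files)
```

Each returned path would then be handed to the song-file or log-file processing routine.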

## Steps to run project

1. Create tables

```shell
python create_tables.py
```

2. Execute ETL pipeline

```shell
python etl.py
```