Data modeling and ETL pipeline using Python and PostgreSQL

https://github.com/mikeacosta/data-model-songplays

# data-model-songplays

## Background

The analytics team at music streaming startup Sparkify wants to analyze the song-listening activity of its users. The analysis will be based on JSON user activity logs and song metadata collected from their mobile app.

## Objective

The goal of this project is to design a database schema and build an ETL pipeline that loads data from the activity-log and song-metadata JSON files into a Postgres database, delivering tables against which the analytics team can run optimized queries for song play analysis.

## Data model
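
The schema diagram is not reproduced here. As a hedged sketch, the tables named in the ETL summary (`songplays`, `users`, `songs`, `artists`, `time`) suggest a star schema with `songplays` as the fact table. The column names and types below are assumptions written in the style of `sql_queries.py`, not the project's actual definitions:

```python
# Hypothetical table definitions in the style of sql_queries.py.
# Column names and types are assumptions, not taken from the actual project.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id   VARCHAR PRIMARY KEY,
    title     VARCHAR,
    artist_id VARCHAR,
    year      INT,
    duration  FLOAT
);
"""

artist_table_create = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id VARCHAR PRIMARY KEY,
    name      VARCHAR,
    location  VARCHAR,
    latitude  FLOAT,
    longitude FLOAT
);
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY,
    hour INT, day INT, week INT, month INT, year INT, weekday INT
);
"""
```

In a star schema like this, each `songplays` row references the dimension tables, so analytical queries typically join the fact table to one or two dimensions rather than scanning raw logs.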

## ETL summary

1. Process song files
    1. Insert unique song records
    2. Insert unique artist records
2. Process log files
    1. Filter by "NextSong" action
    2. Insert time records
    3. Insert unique user records
    4. Insert songplay records, getting `songid` and `artistid` from the `songs` and `artists` tables, respectively
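
The steps above can be sketched in plain Python (no database connection). The helper names and JSON field names are assumptions based on the step descriptions, not the project's actual `etl.py`:

```python
import json
from datetime import datetime, timezone

def extract_song_and_artist(song_file_text):
    """Step 1: pull one song record and one artist record from a song JSON file.
    Field names are assumed, not taken from the real dataset."""
    data = json.loads(song_file_text)
    song = (data["song_id"], data["title"], data["artist_id"],
            data["year"], data["duration"])
    artist = (data["artist_id"], data["artist_name"],
              data.get("artist_location"), data.get("artist_latitude"),
              data.get("artist_longitude"))
    return song, artist

def filter_next_song(log_records):
    """Step 2.1: keep only events where the user actually played a song."""
    return [r for r in log_records if r.get("page") == "NextSong"]

def expand_timestamp(ms):
    """Step 2.2: break a millisecond epoch timestamp into time-table columns."""
    t = datetime.fromtimestamp(ms / 1000.0, tz=timezone.utc)
    return (t, t.hour, t.day, t.isocalendar()[1], t.month, t.year, t.weekday())

# Usage with a made-up log record:
logs = [{"page": "NextSong", "ts": 1541106106796, "userId": "8"},
        {"page": "Home", "ts": 1541106106796, "userId": "8"}]
plays = filter_next_song(logs)  # only the NextSong event survives
```

Step 2.4 would then look up `songid` and `artistid` by matching title, artist name, and duration against the `songs` and `artists` tables before inserting each songplay row.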


## Project files

- `sql_queries.py` - queries for creating tables and inserting data
- `create_tables.py` - drops and creates tables; used to reset tables prior to running ETL scripts
- `etl.ipynb` - notebook for developing the ETL process; runs insert queries with sample data from `song_data` and `log_data`
- `test.ipynb` - selects and displays data from each table to verify that data is correctly inserted
- `etl.py` - primary ETL script; populates tables from all activity and song metadata files
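
Since `etl.py` must process every file under the data directories, a sketch of how it might discover them follows; the directory-walking approach and directory names are assumptions, only the file descriptions above come from the project:

```python
import glob
import os

def get_files(root):
    """Collect all JSON files under a data directory (e.g. song_data/ or
    log_data/), walking nested subdirectories. A sketch of what etl.py
    might do; the real implementation may differ."""
    files = []
    for dirpath, _, _ in os.walk(root):
        files.extend(glob.glob(os.path.join(dirpath, "*.json")))
    return sorted(files)
```

Each returned path would then be handed to the song-file or log-file processing routine.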

## Steps to run project

1. Create tables

```shell
python create_tables.py
```

2. Execute ETL pipeline

```shell
python etl.py
```