https://github.com/mikeacosta/data-model-songplays
Data modeling and ETL pipeline using Python and PostgreSQL
- Host: GitHub
- URL: https://github.com/mikeacosta/data-model-songplays
- Owner: mikeacosta
- Created: 2019-11-30T03:53:52.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-30T18:27:50.000Z (over 5 years ago)
- Last Synced: 2025-01-10T19:06:17.034Z (4 months ago)
- Topics: data-model, etl, jupyter-notebook, postgresql, python, sql
- Language: Jupyter Notebook
- Size: 414 KB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# data-model-songplays
## Background
The analytics team for music streaming startup Sparkify wants to analyze the song-listening activity of their users. This analysis will be based on JSON user-activity logs generated by their mobile app, together with JSON song metadata.
## Objective
The goal of this project is to design a database schema and build an ETL pipeline that loads data from the activity-log and song-metadata JSON files into tables in a Postgres database, against which the analytics team can run optimized queries for song play analysis.
## Data model
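The ETL summary below names five tables: `songplays` plus `songs`, `artists`, `time`, and `users`. A plausible star schema, with `songplays` as the fact table and the other four as dimensions, could be declared in the style the README describes for `sql_queries.py`. The column names and types here are illustrative assumptions, not the repo's actual schema:

```python
# Sketch of table definitions in the style of sql_queries.py.
# Column names and types are illustrative assumptions, not taken from the repo.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id SERIAL PRIMARY KEY,
    start_time  TIMESTAMP,
    user_id     INT,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id INT PRIMARY KEY, first_name VARCHAR, last_name VARCHAR,
    gender VARCHAR, level VARCHAR
);
"""

song_table_create = """
CREATE TABLE IF NOT EXISTS songs (
    song_id VARCHAR PRIMARY KEY, title VARCHAR, artist_id VARCHAR,
    year INT, duration NUMERIC
);
"""

artist_table_create = """
CREATE TABLE IF NOT EXISTS artists (
    artist_id VARCHAR PRIMARY KEY, name VARCHAR, location VARCHAR,
    latitude NUMERIC, longitude NUMERIC
);
"""

time_table_create = """
CREATE TABLE IF NOT EXISTS time (
    start_time TIMESTAMP PRIMARY KEY, hour INT, day INT,
    week INT, month INT, year INT, weekday INT
);
"""

create_table_queries = [songplay_table_create, user_table_create,
                        song_table_create, artist_table_create, time_table_create]
```

A star schema like this keeps the analytics queries simple: most questions about listening activity join the `songplays` fact table to one or two small dimension tables.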
## ETL summary
- Process song files
  - Insert unique song records
  - Insert unique artist records
- Process log files
  - Filter by "NextSong" action
  - Insert time records
  - Insert unique user records
  - Insert songplay records, getting `songid` and `artistid` from the `songs` and `artists` tables, respectively
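The log-processing steps above can be sketched in plain Python: parse each JSON log line, keep only `"NextSong"` events, and expand the timestamp into the columns of a time record. The sample record and field names here are assumptions modeled on typical event logs, not data from this repo:

```python
import json
from datetime import datetime, timezone

# Hypothetical sample log line (field names and values are invented for illustration).
sample_log_line = json.dumps({
    "page": "NextSong",
    "ts": 1541903636796,          # epoch milliseconds
    "userId": "39",
    "firstName": "Walter",
    "lastName": "Frye",
    "gender": "M",
    "level": "free",
})

def expand_time(ts_ms):
    """Break an epoch-millisecond timestamp into time-dimension columns."""
    t = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return {
        "start_time": t,
        "hour": t.hour,
        "day": t.day,
        "week": t.isocalendar()[1],
        "month": t.month,
        "year": t.year,
        "weekday": t.weekday(),
    }

def process_log_lines(lines):
    """Yield (time_row, user_row) pairs for 'NextSong' events only."""
    for line in lines:
        record = json.loads(line)
        if record.get("page") != "NextSong":   # the filter step from the summary
            continue
        time_row = expand_time(record["ts"])
        user_row = (record["userId"], record["firstName"],
                    record["lastName"], record["gender"], record["level"])
        yield time_row, user_row

rows = list(process_log_lines([sample_log_line]))
```

In the real pipeline these rows would feed `INSERT` statements against the `time` and `users` tables; the sketch only covers the parse-filter-transform portion.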
## Project files
- `sql_queries.py` - queries for creating tables and inserting data
- `create_tables.py` - drops and creates tables, used to reset tables prior to running ETL scripts
- `etl.ipynb` - notebook for developing ETL process, runs insert queries with sample data from `song_data` and `log_data`
- `test.ipynb` - selects and displays data from each table to ensure data is correctly entered
- `etl.py` - primary ETL file, populates tables based on all activity and song metadata files
## Steps to run project
1. Create tables
```
python create_tables.py
```
2. Execute the ETL pipeline
```
python etl.py
```