https://github.com/jeffbrennan/jpmdb

personalized movie database
https://github.com/jeffbrennan/jpmdb

dash data-cleaning data-engineering data-visualization delta oss plotly scraping

Last synced: 8 months ago
JSON representation

personalized movie database

Host: GitHub
URL: https://github.com/jeffbrennan/jpmdb
Owner: jeffbrennan
Created: 2025-01-25T23:02:23.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-11T01:44:47.000Z (over 1 year ago)
Last Synced: 2025-10-23T22:57:10.231Z (8 months ago)
Topics: dash, data-cleaning, data-engineering, data-visualization, delta, oss, plotly, scraping
Language: Python
Homepage: https://jpmdb.jeffbrennan.dev
Size: 1020 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# [jpmdb](https://jpmdb.jeffbrennan.dev/)

a personalized movie database for my friend Juan
![site](docs/site.png)

## Cleaning Process

1. The original source data was a .txt file containing a list of movies/tv shows, the order they were watched that year, and a rating out of 10

2. The .txt file was parsed in `create_silver_jpmdb.py`, including parsing the ratings, seasons, watch order, year specifiers and other metadata

3. Downloaded imdb data from [IMDb Datasets](https://datasets.imdbws.com/) and converted the .gz files into `silver/imdb/title_basics` and `silver/imdb/title_ratings` using `create_silver_imdb.py`

4. The jpmdb and imdb datasets were initially joined using standard string cleaning and fuzzy matching approaches into `stg_jpmdb_combined` using `create_silver_stg_jpmdb_combined.py`

5. Entries were manually reviewed a small CLI tool `review_combined_jpmdb.py`, giving an opportunity to correct fuzzy matching errors and manually add missing entries

6. After all entries were validated, the data was moved to the gold table `gold/jpmdb` in `create_gold_jpmdb.py`

## Dashboard

The dashboard is built using Dash and Plotly. It currently includes 4 visualizations:

1. A virtualized table of all entries in the database
2. A scatter plot of ratings over the watched order to show ratings over time
3. A scatter plot comparing ratings to IMDb ratings
4. A box plot showing distribution of ratings per IMDb genre

## TODOs

- [] incorporate scraped poster images into the dashboard
- [] cross visualization filtering by genre
- [] short summary of top 10 titles per year

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jeffbrennan/jpmdb

Awesome Lists containing this project

README