https://github.com/tonykipkemboi/data_pipeline

Last synced: 4 months ago
JSON representation

Host: GitHub
URL: https://github.com/tonykipkemboi/data_pipeline
Owner: tonykipkemboi
Created: 2022-08-09T18:31:14.000Z (about 3 years ago)
Default Branch: master
Last Pushed: 2023-01-09T03:19:47.000Z (almost 3 years ago)
Last Synced: 2025-06-20T05:45:03.950Z (4 months ago)
Language: TypeScript
Size: 330 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# AWS Data Engineering Program Project

### Movie Database
#### Determining the underlying statistics behind various aspects of the movie industry

This is project works with a public dataset containing information pulled from IMDB and TMDB about various movies.

This dataset shows several segments relevant to the movie industry:

- Film budgets
- Film revenue
- Movie runtime
- Keywords describing the movie topics
- Movie ratings
- Number of user ratings

#### Tech Stack

`NB: I will update this portion with diagrams for better visual`

- AWS CDK
- Amazon Glue
- Amazon Athena
- Amazon S3
- AWS CLI
- QuickSight
- Python & Typescript
- Jupyter Notebooks

#### Challenges

- Dataset was a challenge to wrangle as it had lots of nested objects and tables that were either inaccurate, mislabeled, or missing data
- Joining multiple tables, when not done carefully, resulted in data duplication and uncaught leads erroneous insights.
- Had minor issues with QuickSight getting proper viz; QuickSight often wants to aggregate by count or not aggregate values for certain graph types.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/tonykipkemboi/data_pipeline

Awesome Lists containing this project

README