https://github.com/tonykipkemboi/data_pipeline
- Host: GitHub
- URL: https://github.com/tonykipkemboi/data_pipeline
- Owner: tonykipkemboi
- Created: 2022-08-09T18:31:14.000Z (about 3 years ago)
- Default Branch: master
- Last Pushed: 2023-01-09T03:19:47.000Z (almost 3 years ago)
- Last Synced: 2025-06-20T05:45:03.950Z (4 months ago)
- Language: TypeScript
- Size: 330 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
# AWS Data Engineering Program Project
### Movie Database
#### Determining the underlying statistics behind various aspects of the movie industry

This project works with a public dataset containing information pulled from IMDB and TMDB about various movies.
This dataset shows several segments relevant to the movie industry:
- Film budgets
- Film revenue
- Movie runtime
- Keywords describing the movie topics
- Movie ratings
- Number of user ratings

#### Tech Stack
`NB: I will update this portion with diagrams for better visuals`
- AWS CDK
- AWS Glue
- Amazon Athena
- Amazon S3
- AWS CLI
- QuickSight
- Python & TypeScript
- Jupyter Notebooks

#### Challenges
- The dataset was a challenge to wrangle, as it had lots of nested objects and tables that were inaccurate, mislabeled, or missing data.
- Joining multiple tables, when not done carefully, resulted in data duplication which, if uncaught, leads to erroneous insights.
- Had minor issues getting the right visualizations in QuickSight, which often wants to aggregate by count, or not aggregate values at all, for certain graph types.
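
The first two challenges can be illustrated with a minimal Python sketch. It is not the project's actual code. It assumes a nested column arrives as a JSON-encoded string of `{"id", "name"}` objects, and the column names (`movie_id`, `budget`, `rating`), the helper functions, and the sample rows are all hypothetical. It flattens one nested field defensively, then shows how a duplicated key on one side of a join inflates a total.

```python
import json
from collections import Counter


def flatten_keywords(raw):
    """Parse a JSON-encoded list of {"id", "name"} objects into a list of names.

    Returns an empty list for missing or malformed values instead of raising.
    """
    if not raw:
        return []
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        return []
    if not isinstance(items, list):
        return []
    return [
        item["name"]
        for item in items
        if isinstance(item, dict) and "name" in item
    ]


def duplicated_keys(rows, key):
    """Return the key values that occur more than once in rows (a list of dicts)."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


def inner_join(left, right, key):
    """Naive inner join of two lists of dicts; duplicate keys in right fan out."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


# Hypothetical sample rows, not taken from the project's dataset.
movies = [
    {"movie_id": 1, "budget": 100},
    {"movie_id": 2, "budget": 50},
]
ratings = [
    {"movie_id": 1, "rating": 7.5},
    {"movie_id": 1, "rating": 7.5},  # duplicate row for the same movie
    {"movie_id": 2, "rating": 6.0},
]

joined = inner_join(movies, ratings, "movie_id")
total_budget_before = sum(movie["budget"] for movie in movies)
total_budget_after = sum(row["budget"] for row in joined)

print(duplicated_keys(ratings, "movie_id"))     # keys that would fan out
print(total_budget_before, total_budget_after)  # totals differ after the join
```

Running a check like `duplicated_keys` on each table before joining, and comparing row counts or totals before and after, is one way to catch this kind of duplication before it reaches a dashboard.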