{"id":23706985,"url":"https://github.com/athul64/tmdb-dataset-analysis","last_synced_at":"2026-04-14T06:04:31.530Z","repository":{"id":267755268,"uuid":"902253485","full_name":"Athul64/TMDB-Dataset-Analysis","owner":"Athul64","description":"This data set contains information about 10,000 movies extracted from TMDB. The dataset contains movies from 1960 to 2015. Including user ratings and revenue. Original data from Kaggle.","archived":false,"fork":false,"pushed_at":"2024-12-12T08:19:10.000Z","size":104,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-20T06:19:05.808Z","etag":null,"topics":["data-visualization","dataframe","eda","numpy","pandas","python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Athul64.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-12T07:58:13.000Z","updated_at":"2024-12-12T08:22:22.000Z","dependencies_parsed_at":"2024-12-12T08:34:10.416Z","dependency_job_id":"3a03eeac-bb89-48c2-a2be-f71fbb7e4bbd","html_url":"https://github.com/Athul64/TMDB-Dataset-Analysis","commit_stats":null,"previous_names":["athul64/tmdb-dataset-analysis"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Athul64%2FTMDB-Dataset-Analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Athul64%2FTMDB-Dataset-Analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/Athul64%2FTMDB-Dataset-Analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Athul64%2FTMDB-Dataset-Analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Athul64","download_url":"https://codeload.github.com/Athul64/TMDB-Dataset-Analysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239786251,"owners_count":19696772,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-visualization","dataframe","eda","numpy","pandas","python"],"created_at":"2024-12-30T16:01:52.849Z","updated_at":"2026-02-03T04:30:20.945Z","avatar_url":"https://github.com/Athul64.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# TMDB Movies Dataset Analysis\n\n### Mini Data Analysis Project\n\n| Contents |\n| ------------- |\n| Dataset Description |\n| Columns Description |\n| Questions for Analysis |\n| Data Wrangling |\n| Data Cleaning |\n| Exploratory Data Analysis |\n| Built with |\n\n## Dataset Description:\n\nThis dataset contains information about 10,000 movies extracted from TMDB. It covers movies released from 1960 to 2015 and includes user ratings and revenue. 
The original data comes from Kaggle.\n\n## Columns Description:\n\n* `id`, `imdb_id`: the unique TMDB id and the IMDB id of each movie.\n* `Popularity`: a metric used to measure the popularity of the movie.\n* `Budget`: the total budget of the movie in USD.\n* `Revenue`: the total revenue of the movie in USD.\n* `original_title`: the original title of the movie.\n* `cast`: the names of the cast of the movie separated by \"|\".\n* `homepage`: the website of the movie (if it existed).\n* `director`: name(s) of the director(s) of the movie (separated by \"|\" if there is more than one director).\n* `tagline`: a catchphrase describing the movie.\n* `Keywords`: keywords related to the movie.\n* `Overview`: summary of the plot of the movie.\n* `runtime`: total runtime of the movie in minutes.\n* `genres`: genres of the movie separated by \"|\".\n* `production_companies`: production company or companies of the movie.\n* `release_date`: the movie's release date.\n* `vote_count`: number of votes the movie received.\n* `vote_average`: the average user rating of the movie.\n* `release_year`: release year of the movie (from 1960 to 2015).\n* `budget_adj`: the total budget of the movie in USD in terms of 2010 dollars, accounting for inflation over time.\n* `revenue_adj`: the total revenue of the movie in USD in terms of 2010 dollars, accounting for inflation over time.\n\n## Questions for Analysis:\n\n* Do movies with high popularity achieve high revenue?\n* What are the most filmed genres in this whole dataset?\n* Is there a correlation between a movie's budget and its revenue?\n\n## Data Cleaning:\n\nMain Observations:\n\n1. Our dataset consisted of a total of 10866 rows and 21 columns.\n2. There was only 1 duplicated row, which was dropped.\n3. Some columns were not useful for answering our questions, so they were dropped.\n4. A few columns had many missing values that needed to be handled.\n5. The `cast`, `director`, and `genres` columns had values separated by '|'.\n6. The data type of `release_date` needed to be cast to a date type.\n7. 
We could append a column for the movie profit using the formula: profit = revenue − budget.\n8. `vote_average` is better presented as a categorical variable that groups multiple rating values.\n9. We might also categorize the profit column for better EDA.\n\n## Exploratory Data Analysis:\n\nAfter cleaning the dataset, we ended up with a total of 10840 records and 10 columns. The dataset now has no duplicates or null values, and the data types are consistent, with suitable categorical variables to address our questions. We then ran some analyses and created visualizations to answer our target questions.\n\n\n**Q1: Do movies with high popularity achieve high revenue?**\n\u003e More popular movies earn considerably more revenue than less popular movies.\n\n**Q2: What are the most filmed genres in this whole dataset?**\n\u003e Drama, Comedy, and Action are the three most filmed genres among the 10839 movies in our dataset.\nThe Drama genre alone appears 22.6% of the time in our dataset.\n\n\n**Q3: Is there a correlation between a movie budget and its revenue?**\n\u003e There is a positive correlation between budget and revenue, indicating a relationship between them, with few outliers found.\n\n## Built with:\n* Google Colab\n* Python\n* Pandas\n* NumPy\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fathul64%2Ftmdb-dataset-analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fathul64%2Ftmdb-dataset-analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fathul64%2Ftmdb-dataset-analysis/lists"}