https://github.com/pncnmnp/timdb
The Indian Movie Database - supports content-based and collaborative filtering techniques
- Host: GitHub
- URL: https://github.com/pncnmnp/timdb
- Owner: pncnmnp
- License: mit
- Created: 2019-11-12T13:53:19.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2022-08-26T14:08:01.000Z (about 3 years ago)
- Last Synced: 2023-03-03T13:05:51.378Z (over 2 years ago)
- Topics: bollywood, indian-cinema, movie-database, movie-recommendation
- Language: Python
- Homepage: https://github.com/pncnmnp/TIMDB-analysis
- Size: 4.06 MB
- Stars: 29
- Watchers: 2
- Forks: 10
- Open Issues: 0
Metadata Files:
- Readme: README.txt
- License: LICENSE
README
TIMDB - The Indian Movie Database
An initiative to curate a well-structured database for Indian movies

CURRENT STATUS: movies from 1950-2019
(can be used in both content-based and collaborative filtering approaches)

DATABASE SIZE: 13.7 MB
The project is divided into five directories based on the year of release and type of ML approach:
>> "collaborative" (2.5 MB)
>> "1950-2019" (5.7 MB)
>> "1950-1989" (2.4 MB)
>> "1990-2009" (2.0 MB)
>> "2010-2019" (1.0 MB)

ATTRIBUTES PRESENT:
IN ALL, THE DATABASE PROVIDES 35 UNIQUE ATTRIBUTES TO TINKER WITH!

In "1950-1989", "1990-2009" and "2010-2019":
'./x/bollywood.csv': title, imdb_id, poster_path, wiki_link
'./x/bollywood_meta.csv': imdb_id, title, original_title, is_adult, year_of_release, runtime, genres
'./x/bollywood_ratings.csv': imdb_id, imdb_rating, imdb_votes
'./x/bollywood_text.csv': imdb_id, story, summary, tagline, actors, wins_nominations, release_date
where x = ["1950-1989", "1990-2009", "2010-2019"]
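Every file listed above carries an imdb_id column, so the files of one directory can be joined on it.
The sketch below shows one way to do that with only the Python standard library.
The helper names (read_rows, join_on_imdb_id) and the sample rows are made up for illustration;
they are not part of the repository's scripts, and the real files have more columns and rows.

```python
import csv
import io


def read_rows(handle):
    """Read a CSV file object into a list of dicts keyed by header name."""
    return list(csv.DictReader(handle))


def join_on_imdb_id(base_rows, *other_tables):
    """Inner-join several row lists on 'imdb_id'.

    A movie is kept only if every table has a row for it.
    """
    joined = {row["imdb_id"]: dict(row) for row in base_rows}
    for rows in other_tables:
        lookup = {row["imdb_id"]: row for row in rows}
        # Drop ids that this table does not know about (inner join).
        joined = {k: {**v, **lookup[k]} for k, v in joined.items() if k in lookup}
    return list(joined.values())


# Made-up sample data standing in for bollywood_meta.csv and
# bollywood_ratings.csv from one of the year directories.
meta_csv = io.StringIO(
    "imdb_id,title,year_of_release\n"
    "tt0000001,Example One,1993\n"
    "tt0000002,Example Two,1995\n"
)
ratings_csv = io.StringIO(
    "imdb_id,imdb_rating,imdb_votes\n"
    "tt0000001,7.1,1200\n"
)

merged = join_on_imdb_id(read_rows(meta_csv), read_rows(ratings_csv))
print(merged)
```

With real files, replace the io.StringIO objects with open(path, newline="", encoding="utf-8").
Because this is an inner join, a movie that lacks a row in any one file is dropped from the result.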
UNIQUE (18 attributes):
title(wiki), imdb_id, poster_path, wiki_link, original_title,
title(imdb), is_adult, year_of_release, runtime, genres,
imdb_rating, imdb_votes, story, summary, tagline,
actors, wins_nominations, release_date

IMPORTANT:
For a dataset that merges values from the 3 directories,
SEE "./1950-2019/bollywood_full.csv"
(This dataset is merged ON "imdb_id",
hence if you find a niche movie missing, see the respective year's directory.)

In "1950-2019":
'./x/bollywood_crew.csv': imdb_id, directors, writers
For Director(s) info:
'./x/bollywood_crew_data.csv': crew_id, name, born_year, death_year, profession, known_for
For Writer(s) info:
'./x/bollywood_writers_data.csv': crew_id, name, born_year, death_year, profession, known_for

UNIQUE (7 attributes):
crew_id (the 'directors' and 'writers' columns in bollywood_crew.csv contain their respective crew_id),
imdb_id, name, born_year, death_year, profession, known_for

In "collaborative":
'./x/genome_scores.csv': movie_id, tag_id, relevance
'./x/genome_tags.csv': tag_id, tag
'./x/links.csv': movie_id, imdb_id
'./x/ratings.csv': user_id, movie_id, rating, timestamp
'./x/tags.csv': user_id, movie_id, tag, timestamp
'./x/titles.csv': movie_id, title

UNIQUE (10 attributes):
movie_id, tag (from genome_tags.csv), tag (from tags.csv), tag_id,
relevance, imdb_id, user_id, rating, timestamp, title

NOTE:
From the MovieLens database:
The leading zeros are removed from imdb_id in "collaborative", but they are kept in the rest of the
database (i.e. in "1950-1989", "1990-2009", "2010-2019" and "1950-2019").
Example: if imdb_id is 123456 in links.csv,
it can be tt0123456 in the imdb_id column of the datasets in "1950-1989", "1990-2009" and "2010-2019".

The ratings (MovieLens) for collaborative filtering are from the "Full" dataset,
available at http://files.grouplens.org/datasets/movielens/ml-latest.zip

The genome scores were available for very few movies (64 in total) in the "Full" MovieLens dataset.
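To match rows from links.csv against the other directories, the two imdb_id forms need to be converted.
A minimal sketch, assuming the long form is "tt" followed by the number padded to 7 digits
(as in the 123456 / tt0123456 example above). The function names and the width parameter are
hypothetical, not taken from the repository's scripts.

```python
def to_tt_id(raw_id, width=7):
    """Turn a numeric id as found in links.csv into the 'tt...' form."""
    text = str(raw_id).strip()
    if text.startswith("tt"):
        return text  # already in the long form
    # zfill pads on the left with zeros and leaves longer ids unchanged.
    return "tt" + text.zfill(width)


def to_numeric_id(tt_id):
    """Inverse direction: strip the 'tt' prefix and the leading zeros."""
    text = str(tt_id).strip()
    if text.startswith("tt"):
        text = text[2:]
    return int(text)


print(to_tt_id(123456))            # tt0123456
print(to_numeric_id("tt0123456"))  # 123456
```

Converting once, right after loading links.csv, keeps every later join on a single id format.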
In 'bollywood_ratings.csv':
value is NaN -> the film is yet to be released
value is 0 -> no rating was given to the film

In 'bollywood_meta.csv':
If a title is missing, chances are the title had no info in the IMDb dump,
or the title is yet to be released
(indicated by \N).

As of this commit, all the is_adult values are 0.
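The special values described above for bollywood_ratings.csv and bollywood_meta.csv can be handled
explicitly when loading the data. The sketch below is only an illustration: the function names are
hypothetical, and it assumes that NaN shows up in the CSV either as the text "NaN" or as an empty
cell, which should be checked against the actual files.

```python
import math


def rating_status(value):
    """Classify a raw imdb_rating cell as 'unreleased', 'unrated' or 'rated'."""
    text = str(value).strip()
    # Assumption: NaN is stored as the text "NaN" (any case) or as an empty cell.
    if text == "" or text.lower() == "nan":
        return "unreleased"
    number = float(text)
    if math.isnan(number):
        return "unreleased"
    if number == 0:
        return "unrated"
    return "rated"


def clean_meta_value(value):
    """Map the \\N placeholder to None and leave other values untouched."""
    return None if value == r"\N" else value


print(rating_status("NaN"))     # unreleased
print(rating_status("0"))       # unrated
print(rating_status("7.4"))     # rated
print(clean_meta_value(r"\N"))  # None
```

Keeping "unreleased" and "unrated" apart matters when averaging ratings, since a 0 would otherwise
pull the mean down.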
In 'bollywood_text.csv':
Some inconsistencies have been removed; as of this commit, some are still left to find!

To separate multiple actors, '|' is used.
'text delimiter' has to be 'None' to view the dataset in LibreOffice Calc,
as attributes like story and summary contain "" and '' in them.

In bollywood_crew.csv, bollywood_crew_data.csv and bollywood_writers_data.csv:
To separate multiple directors, writers, known_for titles and professions, '|' is used.

In 'src' directory:
The paths mentioned in the script are relative, see ./src/PATHS.py

FUTURE SCOPE:
Plans on curating movies for other languages, like 'Gujarati', 'Tamil', 'Telugu', etc.

ATTRIBUTION:
> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context.
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19.
> IMDB https://datasets.imdbws.com/
> Wikipedia https://wikimediafoundation.org/support/

LICENSE:
All the scripts are licensed under the MIT License.
For database licensing, see the Attribution section.

CONTRIBUTION:
If you find that a Bollywood movie is missing from the appropriate directory,
please send a pull request appending the movie information at the end of the appropriate file.

If a movie "ABC DEF" was released in year "yyyy",
the pull request should be sent to the directory covering that year.
For example:
A movie released in 1993 belongs to the "1990-2009" directory.