https://github.com/pncnmnp/timdb

The Indian Movie Database - supports content-based and collaborative filtering techniques
bollywood indian-cinema movie-database movie-recommendation

Host: GitHub
URL: https://github.com/pncnmnp/timdb
Owner: pncnmnp
License: mit
Created: 2019-11-12T13:53:19.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2022-08-26T14:08:01.000Z (about 3 years ago)
Last Synced: 2023-03-03T13:05:51.378Z (over 2 years ago)
Topics: bollywood, indian-cinema, movie-database, movie-recommendation
Language: Python
Homepage: https://github.com/pncnmnp/TIMDB-analysis
Size: 4.06 MB
Stars: 29
Watchers: 2
Forks: 10
Open Issues: 0
Metadata Files:
- Readme: README.txt
- License: LICENSE

README

TIMDB - The Indian Movie Database
An initiative to curate a well structured database for Indian movies

CURRENT STATUS: movies from 1950-2019
(can be used in both: content-based and collaborative filtering approaches)

DATABASE SIZE: 13.7 MB

The project is divided into five directories based on the year of release and type of ML approach:
>> "collaborative" (2.5 MB)
>> "1950-2019" (5.7 MB)
>> "1950-1989" (2.4 MB)
>> "1990-2009" (2.0 MB)
>> "2010-2019" (1.0 MB)

ATTRIBUTES PRESENT:
IN ALL, THE DATABASE PROVIDES 35 UNIQUE ATTRIBUTES TO TINKER WITH!

In "1950-1989", "1990-2009" and "2010-2019":
'./x/bollywood.csv': title, imdb_id, poster_path, wiki_link
'./x/bollywood_meta.csv': imdb_id, title, original_title, is_adult, year_of_release, runtime, genres
'./x/bollywood_ratings.csv': imdb_id, imdb_rating, imdb_votes
'./x/bollywood_text.csv': imdb_id, story, summary, tagline, actors, wins_nominations, release_date

where x = ["1950-1989", "1990-2009", "2010-2019"]
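Since every file in a directory carries the imdb_id column, the files can be combined per movie.
Below is a minimal sketch using only Python's csv module. The sample rows are made up for illustration,
and rows_by_id / merge_on_imdb_id are hypothetical helper names, not functions from this repository.

```python
import csv
import io


def rows_by_id(csv_text):
    """Index the rows of a CSV document by their imdb_id column."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return {row["imdb_id"]: row for row in reader}


def merge_on_imdb_id(*tables):
    """Inner-join several {imdb_id: row} tables, keeping ids present in all."""
    common = set(tables[0])
    for table in tables[1:]:
        common &= set(table)
    merged = {}
    for imdb_id in sorted(common):
        combined = {}
        for table in tables:
            combined.update(table[imdb_id])
        merged[imdb_id] = combined
    return merged


# Made-up rows standing in for bollywood_meta.csv and bollywood_ratings.csv.
META = (
    "imdb_id,title,year_of_release\n"
    "tt0000001,Example A,1975\n"
    "tt0000002,Example B,1981\n"
)
RATINGS = "imdb_id,imdb_rating,imdb_votes\ntt0000001,7.5,1200\n"

merged = merge_on_imdb_id(rows_by_id(META), rows_by_id(RATINGS))
print(merged)
```

With the real files, replace io.StringIO(csv_text) with open(path, newline="", encoding="utf-8").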

UNIQUE (18 attributes):
title(wiki), imdb_id, poster_path, wiki_link, original_title,
title(imdb), is_adult, year_of_release, runtime, genres,
imdb_rating, imdb_votes, story, summary, tagline,
actors, wins_nominations, release_date

IMPORTANT:
For a dataset which merges the values from the 3 directories:
SEE "./1950-2019/bollywood_full.csv"
(This dataset is merged ON "imdb_id",
so if a niche movie is missing from it, check the respective year's directory.)

In "1950-2019":
'./x/bollywood_crew.csv': imdb_id, directors, writers
For Director(s) info:
'./x/bollywood_crew_data.csv': crew_id, name, born_year, death_year, profession, known_for
For Writer(s) info:
'./x/bollywood_writers_data.csv': crew_id, name, born_year, death_year, profession, known_for

UNIQUE (7 attributes):
crew_id (the 'directors' and 'writers' columns in bollywood_crew.csv contain their respective crew_id),
imdb_id, name, born_year, death_year, profession, known_for

In "collaborative":
'./x/genome_scores.csv': movie_id, tag_id, relevance
'./x/genome_tags.csv': tag_id, tag
'./x/links.csv': movie_id, imdb_id
'./x/ratings.csv': user_id, movie_id, rating, timestamp
'./x/tags.csv': user_id, movie_id, tag, timestamp
'./x/titles.csv': movie_id, title

UNIQUE (10 attributes):
movie_id, tag (from genome_tags.csv), tag (from tags.csv), tag_id,
relevance, imdb_id, user_id, rating, timestamp, title
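As an illustration of how ratings.csv could feed a collaborative filtering approach, here is a minimal sketch
of item-to-item cosine similarity. It is one common technique, not necessarily the one intended by this project.
The sample ratings are made up, and load_ratings / cosine_similarity are hypothetical helper names.

```python
import csv
import io
import math
from collections import defaultdict


def load_ratings(csv_text):
    """Return {movie_id: {user_id: rating}} from ratings.csv-style text."""
    by_movie = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_movie[row["movie_id"]][row["user_id"]] = float(row["rating"])
    return by_movie


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity of two movies' rating vectors (missing ratings count as 0)."""
    shared_users = set(ratings_a) & set(ratings_b)
    dot = sum(ratings_a[u] * ratings_b[u] for u in shared_users)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up ratings using the column names of ratings.csv.
SAMPLE = (
    "user_id,movie_id,rating,timestamp\n"
    "1,10,4.0,0\n"
    "1,20,4.0,0\n"
    "2,10,3.0,0\n"
    "2,20,3.0,0\n"
    "3,30,5.0,0\n"
)

ratings = load_ratings(SAMPLE)
same_taste = cosine_similarity(ratings["10"], ratings["20"])
no_overlap = cosine_similarity(ratings["10"], ratings["30"])
print(same_taste, no_overlap)
```

Movies 10 and 20 are rated identically by the same users, so they score as highly similar;
movies 10 and 30 share no users, so their similarity is 0.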

NOTE:
In the MovieLens data (links.csv), the leading zeros of imdb_id are removed.
They are kept in the rest of the database (i.e. in "1950-1989", "1990-2009", "2010-2019" and "1950-2019").
Example: if imdb_id is 123456 in links.csv,
it appears as tt0123456 in the imdb_id column of the datasets in "1950-1989", "1990-2009" and "2010-2019".
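To match rows between links.csv and the other directories, the two imdb_id forms have to be converted.
A minimal sketch, assuming ids in the year directories are "tt" followed by digits zero-padded to at least
7 places (consistent with the example above); to_tt_id and from_tt_id are hypothetical helper names.

```python
def to_tt_id(movielens_imdb_id):
    """Convert a links.csv imdb_id (e.g. 123456) to the tt-prefixed form."""
    digits = str(movielens_imdb_id).strip()
    # Assumption: ids are padded with zeros to at least 7 digits.
    return "tt" + digits.zfill(7)


def from_tt_id(tt_id):
    """Convert a tt-prefixed id back to the links.csv form without leading zeros."""
    if not tt_id.startswith("tt"):
        raise ValueError(f"not a tt-prefixed id: {tt_id!r}")
    return tt_id[2:].lstrip("0")


print(to_tt_id(123456))
print(from_tt_id("tt0123456"))
```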

The MovieLens ratings used for collaborative filtering come from the "Full" dataset,
available at http://files.grouplens.org/datasets/movielens/ml-latest.zip

Genome scores were available for very few movies (64 in total) in the "Full" MovieLens dataset.

In 'bollywood_ratings.csv' if:
value is NaN -> it means the film is yet to be released
value is 0 -> No rating was given to the film
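The two special values above can be told apart when reading the file. A minimal sketch, assuming NaN is stored
either as the text "NaN" or as an empty cell (how it is actually written in the CSV is not stated here);
rating_status is a hypothetical helper name.

```python
import math


def rating_status(raw_value):
    """Classify an imdb_rating cell as 'unreleased', 'unrated' or 'rated'."""
    text = str(raw_value).strip()
    if text == "":
        # Assumption: an empty cell is treated the same as NaN.
        return "unreleased"
    value = float(text)  # float() accepts "NaN" / "nan"
    if math.isnan(value):
        return "unreleased"
    if value == 0:
        return "unrated"
    return "rated"


for cell in ["NaN", "0", "7.4"]:
    print(cell, "->", rating_status(cell))
```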

In 'bollywood_meta.csv':
If a title's information is missing (indicated by \N),
chances are the title had no info in the IMDb dump,
or the title is yet to be released.
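When loading bollywood_meta.csv, the \N marker can be turned into a proper missing value.
A minimal sketch with made-up rows; clean_row is a hypothetical helper name.

```python
import csv
import io

# Raw string is required: a plain "\N" is not a valid Python escape sequence.
MISSING = r"\N"


def clean_row(row):
    """Replace the missing-value marker with None in one CSV row."""
    return {key: (None if value == MISSING else value) for key, value in row.items()}


# Made-up rows using some of the bollywood_meta.csv column names.
SAMPLE = "\n".join([
    "imdb_id,title,runtime,genres",
    r"tt0000001,Example A,\N,Drama",
    r"tt0000002,Example B,150,\N",
])

rows = [clean_row(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(rows)
```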

As of this commit, all the is_adult values are 0.

In 'bollywood_text.csv':
Some inconsistencies have been removed; as of this commit, others may still be left to find!

To separate multiple actors: '|' is used
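For a content-based approach, the actors field can be split on '|' and compared between movies.
A minimal sketch using Jaccard similarity, which is one possible measure and not necessarily the one intended
by this project. The actor names are placeholders; split_actors and jaccard are hypothetical helper names.

```python
def split_actors(field):
    """Split a '|'-separated actors cell into a list of names."""
    return [name.strip() for name in field.split("|") if name.strip()]


def jaccard(names_a, names_b):
    """Share of distinct names that two movies have in common (0.0 to 1.0)."""
    set_a, set_b = set(names_a), set(names_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Placeholder cells in the format of the actors column.
movie_1 = split_actors("Actor A|Actor B|Actor C")
movie_2 = split_actors("Actor B|Actor C|Actor D")
overlap = jaccard(movie_1, movie_2)
print(movie_1, movie_2, overlap)
```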

To view the dataset in LibreOffice Calc, 'text delimiter' has to be set to 'None',
as attributes like story and summary contain double and single quote characters.

In bollywood_crew.csv, bollywood_crew_data.csv and bollywood_writers_data.csv:
To separate multiple directors, writers and known_for titles and professions: '|' is used
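The crew_id values in the 'directors' and 'writers' columns can be looked up in the crew data files to get names.
A minimal sketch with made-up rows; the "nm..." id format is an assumption, and load_names / resolve_names
are hypothetical helper names.

```python
import csv
import io


def load_names(csv_text):
    """Map crew_id to name from bollywood_crew_data.csv-style text."""
    return {row["crew_id"]: row["name"] for row in csv.DictReader(io.StringIO(csv_text))}


def resolve_names(id_field, crew_names):
    """Turn a '|'-separated crew_id cell into names; unknown ids are kept as-is."""
    crew_ids = [crew_id for crew_id in id_field.split("|") if crew_id]
    return [crew_names.get(crew_id, crew_id) for crew_id in crew_ids]


# Made-up rows; real ids and names come from bollywood_crew_data.csv.
CREW_DATA = (
    "crew_id,name,born_year,death_year,profession,known_for\n"
    "nm0000001,Director A,1950,\\N,director|writer,tt0000001|tt0000002\n"
    "nm0000002,Director B,1962,\\N,director,tt0000002\n"
)

crew_names = load_names(CREW_DATA)
directors = resolve_names("nm0000001|nm0000002", crew_names)
print(directors)
```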

In 'src' directory:
The paths mentioned in the script are relative, see ./src/PATHS.py

FUTURE SCOPE:
Plan to curate movies in other languages, like 'Gujarati', 'Tamil', 'Telugu', etc.

ATTRIBUTION:
> F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context.
ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19.
> IMDB https://datasets.imdbws.com/
> Wikipedia https://wikimediafoundation.org/support/

LICENSE:
All the scripts are licensed under the MIT License.
For database licensing, see the Attribution section.

CONTRIBUTION:
If you find that a Bollywood movie is missing from the appropriate directory,
please send a pull request appending the movie information at the end of the appropriate file.

If a movie "ABC DEF" was released in the year "yyyy",
the pull request should be sent to the directory covering that year.
For example:
A movie released in 1993 belongs to the "1990-2009" directory.
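The mapping from release year to directory can be sketched as below. The ranges are the three year directories
listed earlier; directory_for_year is a hypothetical helper name, not part of the repository's scripts.

```python
# Year ranges of the three per-period directories.
YEAR_RANGES = [(1950, 1989), (1990, 2009), (2010, 2019)]


def directory_for_year(year):
    """Return the directory name that covers the given release year."""
    for start, end in YEAR_RANGES:
        if start <= year <= end:
            return f"{start}-{end}"
    raise ValueError(f"no directory covers the year {year}")


print(directory_for_year(1993))
```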
