https://github.com/automata/minera-imdb

# README

Web mining for IMDB using Complex Network techniques.

# TODO

## Lucas

* Implement SVD DONE!
* Implement Single-linkage IHAC DONE!

## Vilson

* Create a database (SQLite?) DONE!
* Movie title
* Genre
* Director
* Actors (at least 4, 2 men, 2 women)
* Plot
* Keywords
* Backlinks (more than 50, fewer than 200)
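
One way the fields listed above could be laid out in SQLite, as a minimal sketch. The table and column names, the split into three tables, and the reading of the backlink range as a per-movie constraint are all assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# Hypothetical schema: every table and column name here is an illustrative
# assumption based on the TODO list, not the layout the project used.
SCHEMA = """
CREATE TABLE movie (
    id        INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    genre     TEXT,
    director  TEXT,
    plot      TEXT,
    backlinks INTEGER CHECK (backlinks > 50 AND backlinks < 200)
);
CREATE TABLE actor (
    movie_id INTEGER NOT NULL REFERENCES movie(id),
    name     TEXT NOT NULL,
    gender   TEXT CHECK (gender IN ('M', 'F'))
);
CREATE TABLE keyword (
    movie_id INTEGER NOT NULL REFERENCES movie(id),
    keyword  TEXT NOT NULL
);
"""


def create_db(path=":memory:"):
    """Create the sketch schema and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def has_enough_actors(conn, movie_id):
    """True if the movie has at least 2 male and 2 female actors."""
    counts = dict(conn.execute(
        "SELECT gender, COUNT(*) FROM actor WHERE movie_id = ? GROUP BY gender",
        (movie_id,)))
    return counts.get("M", 0) >= 2 and counts.get("F", 0) >= 2
```

Keeping actors and keywords in their own tables avoids packing lists into a single text column, which makes the "at least 4 actors, 2 men, 2 women" rule a simple grouped count.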

# Installing

We use IMDbPY to download and access IMDb data through an
object-oriented API. This brings in a few dependencies.

## Dependencies

These instructions cover Ubuntu GNU/Linux 11.04. Install the
following packages:

    # MySQL
    sudo apt-get install mysql-client mysql-server

    # Python 2.6+ and some libs
    sudo apt-get install python python-mysqldb

    # IMDbPY
    wget -c http://prdownloads.sourceforge.net/imdbpy/IMDbPY-4.9.tar.gz
    tar -xvzf IMDbPY-4.9.tar.gz
    cd IMDbPY-4.9/
    sudo python setup.py install

## Downloading the IMDB plain files

We used a local copy of the entire IMDb database (as of June 22,
2012). Here are the steps to get your own. The plain files are
downloaded into your ~/tmp/imdb directory. This is time-consuming
(around 1.1 GB of data), so go grab a coffee.

    mkdir -p ~/tmp/imdb
    cd ~/tmp/imdb
    wget -rc ftp://ftp.fu-berlin.de/pub/misc/movies/database/
    mv ftp.fu-berlin.de/pub/misc/movies/database/*.gz ./
    rm -rf ftp.fu-berlin.de

We now have about 1.1 GB of .list.gz files.
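
To confirm the download before moving on, the size of the .list.gz files can be totalled. This is a small sketch; the helper name `list_gz_total` is hypothetical and not part of IMDbPY.

```python
from pathlib import Path


def list_gz_total(directory):
    """Return (file_count, total_bytes) for the *.list.gz files in directory."""
    files = sorted(Path(directory).expanduser().glob("*.list.gz"))
    return len(files), sum(f.stat().st_size for f in files)
```

Calling `list_gz_total("~/tmp/imdb")` returns the number of files and their combined size in bytes. Dividing the second value by 1e9 gives a figure to compare against the roughly 1.1 GB mentioned above.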

## Setting up a local SQL database

First, create a database:

    mysqladmin -u root -p create imdb

With all the .list.gz files in ~/tmp/imdb, run the imdbpy2sql.py
script from the IMDbPY bin directory, replacing PASSWORD with your
MySQL root password:

    cd IMDbPY-4.9/bin/
    python imdbpy2sql.py -d ~/tmp/imdb/ -u mysql://root:PASSWORD@localhost/imdb

This will take a *lot* of time (we spent about 5 hours).
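
If you would rather not type the database password into the command, the call can be assembled from Python instead. This is only a sketch: the helper name `imdbpy2sql_command` and its parameters are hypothetical, and it uses only the `-d` and `-u` options shown above.

```python
import os


def imdbpy2sql_command(list_dir, password, user="root",
                       host="localhost", database="imdb"):
    """Return the argument list for the imdbpy2sql.py call shown above."""
    # The URI is assembled verbatim, so a password containing characters
    # such as '@' or '/' would need extra handling (not covered here).
    uri = "mysql://%s:%s@%s/%s" % (user, password, host, database)
    return ["python", "imdbpy2sql.py",
            "-d", os.path.expanduser(list_dir),
            "-u", uri]
```

The returned list can be passed to `subprocess.check_call(..., cwd="IMDbPY-4.9/bin")`, with the password read at run time from `getpass.getpass()`.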

# Using

# Some interesting information

We downloaded 8.2 GB in 3h 32m 27s (676 KB/s).

We then indexed the entire IMDb plain-file database in 303 min:

    # TIME TOTAL TIME TO INSERT/WRITE DATA : 258min, 17sec (wall) 111min, 21sec (user) 25min, 54sec (system)
    building database indexes (this may take a while)
    # TIME createIndexes() : 21min, 37sec (wall) 0min, 0sec (user) 0min, 0sec (system)
    adding foreign keys (this may take a while)
    # TIME createForeignKeys() : 23min, 7sec (wall) 0min, 0sec (user) 0min, 0sec (system)
    RESTORING imdbIDs values for movies... DONE! (restored 0 entries out of 0)
    # TIME restore movies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)
    RESTORING imdbIDs values for people... DONE! (restored 0 entries out of 0)
    # TIME restore people : 0min, 0sec (wall) 0min, 0sec (user) 0min, 0sec (system)
    RESTORING imdbIDs values for characters... DONE! (restored 0 entries out of 0)
    # TIME restore characters : 0min, 3sec (wall) 0min, 0sec (user) 0min, 0sec (system)
    RESTORING imdbIDs values for companies... DONE! (restored 0 entries out of 0)
    # TIME restore companies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)
    # TIME FINAL : 303min, 6sec (wall) 111min, 21sec (user) 25min, 54sec (system)

Total time: 212 min + 303 min = 515 min ≈ 8.6 h to download and index.
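
The figures above can be cross-checked with a short script. It is a sketch that reads the wall-clock value from each `# TIME` line of the log (truncated here after the wall figure), checks that the phases add up to the `FINAL` line, and adds the download time reported earlier. The names `wall_seconds`, `PHASES` and `FINAL` are introduced here for illustration only.

```python
import re

# "# TIME" lines from the log above, truncated after the wall-clock figure.
PHASES = [
    "# TIME TOTAL TIME TO INSERT/WRITE DATA : 258min, 17sec (wall)",
    "# TIME createIndexes() : 21min, 37sec (wall)",
    "# TIME createForeignKeys() : 23min, 7sec (wall)",
    "# TIME restore movies : 0min, 1sec (wall)",
    "# TIME restore people : 0min, 0sec (wall)",
    "# TIME restore characters : 0min, 3sec (wall)",
    "# TIME restore companies : 0min, 1sec (wall)",
]
FINAL = "# TIME FINAL : 303min, 6sec (wall)"

WALL = re.compile(r":\s*(\d+)min,\s*(\d+)sec \(wall\)")


def wall_seconds(line):
    """Extract the wall-clock time of one '# TIME' log line, in seconds."""
    match = WALL.search(line)
    if match is None:
        raise ValueError("no wall-clock time in: %r" % line)
    return int(match.group(1)) * 60 + int(match.group(2))


# Download time reported above: 3h 32m 27s.
DOWNLOAD_SECONDS = 3 * 3600 + 32 * 60 + 27

phase_total = sum(wall_seconds(line) for line in PHASES)
final_total = wall_seconds(FINAL)
total_hours = (DOWNLOAD_SECONDS + final_total) / 3600.0

print("phases sum to FINAL:", phase_total == final_total)
print("download + index: %.1f h" % total_hours)
```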

For more about the attributes available in the IMDb plain-file
database, see
ftp://ftp.fu-berlin.de/pub/misc/movies/database/tools/movie-database-faq .

# Authors

* Lucas Rodrigues
* Vilson Vieira

IFSC / University of São Paulo / 2012