https://github.com/automata/minera-imdb

Web mining for IMDB using Complex Network techniques
https://github.com/automata/minera-imdb

Last synced: 28 days ago
JSON representation

Web mining for IMDB using Complex Network techniques

Host: GitHub
URL: https://github.com/automata/minera-imdb
Owner: automata
Created: 2012-06-14T22:09:02.000Z (almost 13 years ago)
Default Branch: master
Last Pushed: 2012-06-25T23:06:12.000Z (almost 13 years ago)
Last Synced: 2025-04-03T13:43:56.228Z (about 2 months ago)
Language: Python
Homepage:
Size: 1.01 MB
Stars: 11
Watchers: 3
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

        # README

Web mining for IMDB using Complex Network techniques.

# TODO

## Lucas

* Implement SVD DONE!

* Implement Single-liknage IHAC DONE!

## Vilson

* Create a database (SQLite?) DONE!

    * Movie title

    * Genre

    * Director

    * Actors (at least 4, 2 men, 2 women)

    * Plot

    * Keywords

    * Backlinks (> 50 < 200)

# Installing

We are using IMDb.py to download and access all IMDB information

with an OOP model. In this way, we have some dependencies.

## Dependencies

Currently we are covering a Ubuntu GNU/Linux 11.04 system. Install

the following packages:

    # MySQL

    sudo apt-get install mysql-client mysql-server

    # Python 2.6+ and some libs

    sudo apt-get install python python-mysqldb

    # IMDb.py

    wget -rc http://prdownloads.sourceforge.net/imdbpy/IMDbPY-4.9.tar.gz

    tar -xvzf IMDbPY-4.9.tar.gz

    cd IMDbPY-4.9/

    sudo python setup.py install

## Downloading the IMDB plain files

We used a local copy of the entire IMDB database (until June 22,

2012). Here are the steps to get your own. The plain files will be

downloaded in your ~/tmp/imdb directory. It is a time consuming

action (around 1.1gb of data), so go take a coffee.

    mkdir -f ~/tmp/imdb

    cd ~/tmp/imdb

    wget -rc ftp://ftp.fu-berlin.de/pub/misc/movies/database/

    mv ftp.fu-berlin.de/pub/misc/movies/database/*.gz ./

    rm -rf ftp.fu-berlin.de

So now we have 1.1gb of .list.gz files.

## Setting up a local SQL database

First of all create a database:

    mysqladmin -u root -p create imdb

Having all the .list.gz files at ~/tmp/imdb, run this script, inside of

IMDb.py directory:

    cd IMDbPY-4.9/bin/

    python imdbpy2sql.py -d ~/tmp/imdb/ -u mysql://root:lm2526@localhost/imdb

This will take a *lot* of time (we spent about 5 hours).

# Using

# Some interesting information

We downloaded 8,2G in 3h 32m 27s (676 KB/s).

And we indexed the entire IMDb plain files data base in 303 min:

    # TIME TOTAL TIME TO INSERT/WRITE DATA : 258min, 17sec (wall) 111min, 21sec (user) 25min, 54sec (system)

    building database indexes (this may take a while)

    # TIME createIndexes() : 21min, 37sec (wall) 0min, 0sec (user) 0min, 0sec (system)

    adding foreign keys (this may take a while)

    # TIME createForeignKeys() : 23min, 7sec (wall) 0min, 0sec (user) 0min, 0sec (system)

    RESTORING imdbIDs values for movies... DONE! (restored 0 entries out of 0)

    # TIME restore movies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)

    RESTORING imdbIDs values for people... DONE! (restored 0 entries out of 0)

    # TIME restore people : 0min, 0sec (wall) 0min, 0sec (user) 0min, 0sec (system)

    RESTORING imdbIDs values for characters... DONE! (restored 0 entries out of 0)

    # TIME restore characters : 0min, 3sec (wall) 0min, 0sec (user) 0min, 0sec (system)

    RESTORING imdbIDs values for companies... DONE! (restored 0 entries out of 0)

    # TIME restore companies : 0min, 1sec (wall) 0min, 0sec (user) 0min, 0sec (system)

    # TIME FINAL : 303min, 6sec (wall) 111min, 21sec (user) 25min, 54sec (system)

Total time: 212 min + 303 min = 515 min = 8.5 h to download and index

More about the attributes available on IMDB plain files database,

please refer to

ftp://ftp.fu-berlin.de/pub/misc/movies/database/tools/movie-database-faq .

# Authors

* Lucas Rodrigues

* Vilson Vieira

IFSC / University of São Paulo / 2012

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/automata/minera-imdb

Awesome Lists containing this project

README