Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/mithunsatheesh/imdb-scrapper

nodejs daemon application which will scrap the imdb movie data and insert into the local mongo database. Also has a socket.io dashboard on which you can view the movie details that are getting scrapped.
https://github.com/mithunsatheesh/imdb-scrapper

Last synced: about 1 month ago
JSON representation

nodejs daemon application which will scrap the imdb movie data and insert into the local mongo database. Also has a socket.io dashboard on which you can view the movie details that are getting scrapped.

Awesome Lists containing this project

README

        

imdb-scrapper
=============

nodejs daemon application which will scrap the imdb movie data and insert into the local mongo database. It has a movie dashboard which updates data getting entered into the local database into a dashboard with websockets.

### Requirements

1. Mongo DB installed
2. Node.js installed

### How to use it ?

1. Make sure you have node.js and MongoDB installed.
2. cd to the application root and do `npm install` which installs the dependencies
3. do `node app.js` to start the daemon.
4. take `localhost:3000` to see the scrapped movie data pushed on to dashboard and check your local `mongo` instance for the collected data.

### Features used

1. Mongo Grid FS for storing the images scrapped from imdb.
2. Websockets for realtime dashboard.

### Node package dependencies

1. cheerio - as html parser
2. express - for the dashboard app
3. jade - templating engine
4. mongodb - node driver for mongo
5. socket.io - for the realtime push

### Configuration

The configuration of this application can be done via the config.json file in the application root.
The various config parameters are:

1. **mongodb** : The mongo db connection url which has the ip,port, authentication and the db details. If you are connectiong to the local mongo please leave it as such.
The default database used will be imdb connection to the mongo instance at localhost:27017.

2. **mongocollection** : Name the collection to which the movie data has to be inserted. Defaults to events.

3. **movieId** : The imdb movie id from which we have to scrap the data. You can set it to 1 if you want to scrap all the data.
The movie id refers to the integer id in the imdb url after the `tt`.

4. **application_port** : The port at which the dashboard app should run. Defaults to 3000.

5. **req_pool** : The http request pool size. This refers to the maximum number of http requests that would be initiated in parallel to the imdb website.