Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/mithunsatheesh/imdb-scrapper
A Node.js daemon application that scrapes IMDb movie data and inserts it into a local MongoDB database. It also has a socket.io dashboard where you can view the details of the movies being scraped.
- Host: GitHub
- URL: https://github.com/mithunsatheesh/imdb-scrapper
- Owner: mithunsatheesh
- Created: 2013-11-05T09:51:24.000Z (about 11 years ago)
- Default Branch: master
- Last Pushed: 2014-03-26T19:30:29.000Z (almost 11 years ago)
- Last Synced: 2024-04-15T03:09:30.188Z (9 months ago)
- Language: JavaScript
- Size: 10.4 MB
- Stars: 7
- Watchers: 6
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
imdb-scrapper
=============

A Node.js daemon application that scrapes IMDb movie data and inserts it into a local MongoDB database. It includes a movie dashboard that uses websockets to show, in real time, the data being written to the local database.

### Requirements

1. MongoDB installed
2. Node.js installed

### How to use it?

1. Make sure you have Node.js and MongoDB installed.
2. `cd` to the application root and run `npm install` to install the dependencies.
3. Run `node app.js` to start the daemon.
4. Open `localhost:3000` to see the scraped movie data pushed to the dashboard, and check your local `mongo` instance for the collected data.

### Features used

1. MongoDB GridFS for storing the images scraped from IMDb.
2. Websockets for the real-time dashboard.

### Node package dependencies

1. cheerio - as HTML parser
2. express - for the dashboard app
3. jade - templating engine
4. mongodb - Node.js driver for MongoDB
5. socket.io - for the real-time push

### Configuration

The application is configured via the `config.json` file in the application root.
The config parameters are:

1. **mongodb**: The MongoDB connection URL, which holds the IP, port, authentication, and database details. If you are connecting to a local MongoDB instance, leave it as is. By default the application connects to the `imdb` database on the MongoDB instance at `localhost:27017`.
2. **mongocollection**: The name of the collection into which the movie data is inserted. Defaults to `events`.
3. **movieId**: The IMDb movie id from which to scrape the data. Set it to 1 to scrape all the data. The movie id is the integer id that follows `tt` in the IMDb URL.
4. **application_port**: The port on which the dashboard app runs. Defaults to 3000.
5. **req_pool**: The HTTP request pool size, i.e. the maximum number of HTTP requests made to the IMDb website in parallel.
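
The parameters above could be combined roughly as follows. This is a minimal sketch and not the project's actual code. The helper names `resolveConfig`, `movieUrl`, and `runPool` are hypothetical, the exact connection URL string and the `req_pool` value of 5 are assumptions (the README gives no default for the pool size), and the 7-digit zero padding after `tt` is an assumption about how IMDb title URLs are formed.

```javascript
// Defaults taken from the README where it states them; others are marked as assumed.
const DEFAULTS = {
  mongodb: 'mongodb://localhost:27017/imdb', // assumed URL form: db "imdb" on localhost:27017
  mongocollection: 'events',
  movieId: 1,
  application_port: 3000,
  req_pool: 5, // assumed value; no default is documented
};

// Merge user settings (e.g. the parsed contents of config.json) over the defaults.
function resolveConfig(userConfig = {}) {
  const config = { ...DEFAULTS, ...userConfig };
  for (const key of ['movieId', 'application_port', 'req_pool']) {
    if (!Number.isInteger(config[key]) || config[key] < 1) {
      throw new RangeError(`${key} must be a positive integer`);
    }
  }
  return config;
}

// Build a title URL from the integer id that follows "tt".
// The 7-digit zero padding is an assumption, not taken from the README.
function movieUrl(movieId) {
  return `https://www.imdb.com/title/tt${String(movieId).padStart(7, '0')}/`;
}

// Run `worker` over `items` with at most `limit` calls in flight at once,
// which is the role the README describes for req_pool.
async function runPool(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    // Each lane pulls the next unclaimed item until none remain.
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index]);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane);
  await Promise.all(lanes);
  return results;
}
```

In a real run, the argument to `resolveConfig` would typically be the result of `JSON.parse` on the contents of `config.json`, and the `worker` passed to `runPool` would fetch and parse one movie page per id, with `limit` set to `config.req_pool`.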