Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jiren/dcrawler
A distributed web-spider framework using MongoDB as storage.
- Host: GitHub
- URL: https://github.com/jiren/dcrawler
- Owner: jiren
- License: MIT
- Created: 2012-08-30T06:44:06.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2012-08-30T06:50:13.000Z (over 12 years ago)
- Last Synced: 2024-11-06T19:55:30.482Z (about 2 months ago)
- Language: Ruby
- Size: 116 KB
- Stars: 2
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
Dcrawler
========

Dcrawler is a distributed crawler inspired by Anemone that stores data in MongoDB.
Features
========

* Multi-threaded design for high performance
* Tracks 301 HTTP redirects
* Built-in BFS algorithm for determining page depth
* Allows exclusion of URLs based on regular expressions
* Choose the links to follow on each page with focus_crawl() (see the sketch after this list)
* HTTPS support
* Records response time for each page
* Obeys robots.txt
* Persistent storage of pages during crawl, using MongoDB
* Updates crawler status in the database after a set number of pages are crawled
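Of these, only focus_crawl() is named without usage. The sketch below is a guess at its shape, modeled on Anemone's block-based API (which this project cites as its inspiration); the block form and page.links are assumptions, not Dcrawler's documented interface.

    # Hypothetical usage, modeled on Anemone; Dcrawler's actual signature
    # may differ. Follow only links that stay on the seed host.
    Dcrawler::Core.crawl do |crawler|
      crawler.focus_crawl do |page|
        page.links.select { |link| link.host == 'www.example.com' }
      end
    end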
MongoDB Configuration and Environment
=====================================

Config file
-----------
Add configuration for the crawler status database 'process_admin' and for the page and
link database in each environment. The '&defaults' anchor lets every environment
inherit the shared settings via '<<: *defaults':

    defaults: &defaults
      databases:
        process_admin:
          uri: "mongodb://localhost:27017/process_admin"

    development:
      <<: *defaults
      uri: "mongodb://localhost:27017/test"
      pool_size: 5
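A quick way to check the merge behaviour is to load the file with plain YAML (the mongo.yml path is a placeholder):

    require 'yaml'

    # The &defaults anchor and <<: merge key are resolved on load, so each
    # environment inherits the shared 'databases' settings.
    # (Ruby 3.1+ requires aliases: true; on older Rubies, where load_file
    # has no aliases: keyword, drop it; aliases are resolved by default.)
    config = YAML.load_file('mongo.yml', aliases: true)
    config['development']['databases']['process_admin']['uri']
    # => "mongodb://localhost:27017/process_admin"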
Export config variable 'CRAWLER'
--------------------------------

    export CRAWLER="db_config:mongo.yml,env:development"
db_config is the MongoDB configuration file path.
env is the crawler environment.

The 'CRAWLER' variable can also be defined in a Ruby script:
    ENV['CRAWLER'] = "db_config:#{dir}/config/mongo.yml,env:development"
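The value is a comma-separated list of key:value pairs. A minimal sketch (plain Ruby, not Dcrawler's own code) of how it can be picked apart:

    # "db_config:mongo.yml,env:development" -> {"db_config"=>"mongo.yml", "env"=>"development"}
    settings = ENV['CRAWLER'].split(',').map { |pair| pair.split(':', 2) }.to_h
    settings['db_config']   # => "mongo.yml"
    settings['env']         # => "development"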
Crawler Example
===============

Add domains or links to be crawled
----------------------------------

    Dcrawler::Link.enq(:url => 'http://www.example.com/')
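Multiple seed URLs can be queued before the crawl starts; the loop below only reuses the enq call shown above (the second URL is a placeholder):

    # Each enq call adds one seed link to the MongoDB-backed queue.
    %w[http://www.example.com/ http://www.example.org/].each do |url|
      Dcrawler::Link.enq(:url => url)
    end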
Options
-------
    opts = {:verbose => true,
            :queue_timeout => 20,
            :page_crawl_limit => 10}

- queue_timeout: stop the crawler if it has been idle for queue_timeout.
- page_crawl_limit: maximum number of pages to crawl.

Start crawler
-------------

    Dcrawler::Core.crawl(opts)
Example
=======

A sample script can be found in the example folder.
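For reference, a minimal end-to-end script assembled only from the calls shown above; the require name, config path, and seed URL are assumptions:

    # Point Dcrawler at the MongoDB config and pick the environment
    # before the library is loaded.
    ENV['CRAWLER'] = "db_config:config/mongo.yml,env:development"

    require 'dcrawler'   # assumed library name; adjust to the actual gem/file

    # Seed the queue and start the crawl with the options from above.
    Dcrawler::Link.enq(:url => 'http://www.example.com/')
    Dcrawler::Core.crawl(:verbose => true,
                         :queue_timeout => 20,
                         :page_crawl_limit => 10)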