Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jiren/dcrawler
A distributed web-spider framework using MongoDB as storage.
- Host: GitHub
- URL: https://github.com/jiren/dcrawler
- Owner: jiren
- License: MIT
- Created: 2012-08-30T06:44:06.000Z (over 12 years ago)
- Default Branch: master
- Last Pushed: 2012-08-30T06:50:13.000Z (over 12 years ago)
- Last Synced: 2024-11-06T19:55:30.482Z (about 2 months ago)
- Language: Ruby
- Size: 116 KB
- Stars: 2
- Watchers: 3
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
Dcrawler
========

Dcrawler is a distributed crawler inspired by Anemone that stores data in MongoDB.
Features
========

* Multi-threaded design for high performance
* Tracks 301 HTTP redirects
* Built-in BFS algorithm for determining page depth
* Allows exclusion of URLs based on regular expressions
* Choose the links to follow on each page with focus_crawl() (see the sketch after this list)
* HTTPS support
* Records response time for each page
* Obeys robots.txt
* Persistent storage of pages during crawl, using MongoDB
* Updates crawler status in the database after a set number of pages are crawled
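Of these, only focus_crawl() is named without usage. The sketch below is a guess at its shape, modeled on Anemone's block-based API (which this project cites as its inspiration); the block form and page.links are assumptions, not Dcrawler's documented interface.

    # Hypothetical usage, modeled on Anemone; Dcrawler's actual signature
    # may differ. Follow only links that stay on the seed host.
    Dcrawler::Core.crawl do |crawler|
      crawler.focus_crawl do |page|
        page.links.select { |link| link.host == 'www.example.com' }
      end
    end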
MongoDB Configuration and Environment
=====================================

Config file
-----------
Add configuration for the crawler status database 'process_admin' and for the page and
link database in each environment. The '&defaults' anchor lets every environment
inherit the shared settings via '<<: *defaults':

    defaults: &defaults
      databases:
        process_admin:
          uri: "mongodb://localhost:27017/process_admin"

    development:
      <<: *defaults
      uri: "mongodb://localhost:27017/test"
      pool_size: 5
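A quick way to check the merge behaviour is to load the file with plain YAML (the mongo.yml path is a placeholder):

    require 'yaml'

    # The &defaults anchor and <<: merge key are resolved on load, so each
    # environment inherits the shared 'databases' settings.
    # (Ruby 3.1+ requires aliases: true; on older Rubies, where load_file
    # has no aliases: keyword, drop it; aliases are resolved by default.)
    config = YAML.load_file('mongo.yml', aliases: true)
    config['development']['databases']['process_admin']['uri']
    # => "mongodb://localhost:27017/process_admin"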
Export config variable 'CRAWLER'
--------------------------------

    export CRAWLER="db_config:mongo.yml,env:development"
db_config is the MongoDB configuration file path.
env is the crawler environment.

The 'CRAWLER' variable can also be defined in a Ruby script:
    ENV['CRAWLER'] = "db_config:#{dir}/config/mongo.yml,env:development"
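The value is a comma-separated list of key:value pairs. A minimal sketch (plain Ruby, not Dcrawler's own code) of how it can be picked apart:

    # "db_config:mongo.yml,env:development" -> {"db_config"=>"mongo.yml", "env"=>"development"}
    settings = ENV['CRAWLER'].split(',').map { |pair| pair.split(':', 2) }.to_h
    settings['db_config']   # => "mongo.yml"
    settings['env']         # => "development"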
Crawler Example
===============

Add domains or links to be crawled
----------------------------------

    Dcrawler::Link.enq(:url => 'http://www.example.com/')
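Multiple seed URLs can be queued before the crawl starts; the loop below only reuses the enq call shown above (the second URL is a placeholder):

    # Each enq call adds one seed link to the MongoDB-backed queue.
    %w[http://www.example.com/ http://www.example.org/].each do |url|
      Dcrawler::Link.enq(:url => url)
    end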
Options
-------
    opts = {:verbose => true,
            :queue_timeout => 20,
            :page_crawl_limit => 10}

- queue_timeout: stop the crawler if it has been idle for queue_timeout.
- page_crawl_limit: maximum number of pages to crawl.

Start crawler
-------------

    Dcrawler::Core.crawl(opts)
Example
=======

A sample script can be found in the example folder.
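For reference, a minimal end-to-end script assembled only from the calls shown above; the require name, config path, and seed URL are assumptions:

    # Point Dcrawler at the MongoDB config and pick the environment
    # before the library is loaded.
    ENV['CRAWLER'] = "db_config:config/mongo.yml,env:development"

    require 'dcrawler'   # assumed library name; adjust to the actual gem/file

    # Seed the queue and start the crawl with the options from above.
    Dcrawler::Link.enq(:url => 'http://www.example.com/')
    Dcrawler::Core.crawl(:verbose => true,
                         :queue_timeout => 20,
                         :page_crawl_limit => 10)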