Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/flyerhzm/regexp_crawler

A crawler which uses regular expression to catch data from website.
https://github.com/flyerhzm/regexp_crawler

Last synced: 16 days ago
JSON representation

A crawler which uses regular expression to catch data from website.

Host: GitHub
URL: https://github.com/flyerhzm/regexp_crawler
Owner: flyerhzm
License: mit
Created: 2009-07-08T12:20:26.000Z (over 15 years ago)
Default Branch: master
Last Pushed: 2010-02-06T14:43:04.000Z (almost 15 years ago)
Last Synced: 2024-09-17T20:10:04.545Z (about 2 months ago)
Language: Ruby
Homepage:
Size: 103 KB
Stars: 45
Watchers: 4
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.textile
- License: LICENSE

Awesome Lists containing this project

README

h1. RegexpCrawler

regexp_crawler is a crawler which uses regular expression to catch data from website. It is easy to use and less code if you are familiar with regular expression.

**************************************************************************

h2. Install



sudo gem install regexp_crawler

**************************************************************************

h2. Usage

It's really easy to use, sometime just one line.



RegexpCrawler::Crawler.new(options).start

options is a hash
* :start_page, mandatory, a string to define a website url where crawler start
* :continue_regexp, optional, a regexp to define what website urls the crawler continue to crawl, it is parsed by String#scan and get the first not nil result
* :capture_regexp, mandatory, a regexp to define what contents the crawler crawl, it is parse by Regexp#match and get all group captures
* :named_captures, mandatory, a string array to define the names of captured groups according to :capture_regexp
* :model, optional if :save_method defined, a string of result's model class
* :save_method, optional if :model defined, a proc to define how to save the result which the crawler crawled, the proc accept two parameters, first is one page crawled result, second is the crawled url
* :headers, optional, a hash to define http headers
* :encoding, optional, a string of the coding of crawled page, the results will be converted to utf8
* :need_parse, optional, a proc if parsing the page by regexp or not, the proc accept two parameters, first is the crawled website uri, second is the response body of crawled page
* :logger, optional, true for logging to STDOUT, or a Logger object for logging to that logger

If the crawler define :model no :save_method, the RegexpCrawler::Crawler#start will return an array of results, such as



[{:model_name => {:attr_name => 'attr_value'}, :page => 'website url'}, {:model_name => {:attr_name => 'attr_value'}, :page => 'another website url'}]

**************************************************************************

h2. Example

a script to synchronize your github projects except fork projects, please check example/github_projects.rb



require 'rubygems'

require 'regexp_crawler'

class Project

  attr_accessor :title, :description, :body, :url

  def initialize(options)

    options.each do |k, v|

      self.instance_variable_set("@#{k}", v)

    end

  end

end

projects = []

crawler = RegexpCrawler::Crawler.new(

  :start_page => "http://github.com/flyerhzm",

  :continue_regexp => %r{
[\s\n]*?}m,

  :capture_regexp => %r{(.*?).*?[\s\n]*?(.*?)[\s\n]*?.*?

)}m,

  :named_captures => ['title', 'description', 'body'],

  :logger => true,

  :save_method => Proc.new do |result, page|

    projects << Project.new(result.merge(:url => page))

  end,

  :need_parse => Proc.new do |page, response_body|

    !response_body.index(//)

  end)

crawler.start

projects.each do |project|

  puts project.url

  puts project.title

  puts project.description

end

The results are as follows:



D, [2010-02-06T18:59:32.487885 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm

D, [2010-02-06T18:59:34.877730 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm

D, [2010-02-06T18:59:34.878158 #11387] DEBUG -- : continue_page: /flyerhzm/regexp_crawler

D, [2010-02-06T18:59:34.878462 #11387] DEBUG -- : continue_page: /flyerhzm/css_sprite

D, [2010-02-06T18:59:34.878707 #11387] DEBUG -- : continue_page: /flyerhzm/chinese_permalink

D, [2010-02-06T18:59:34.878991 #11387] DEBUG -- : continue_page: /flyerhzm/contactlist

D, [2010-02-06T18:59:34.879299 #11387] DEBUG -- : continue_page: /flyerhzm/rails_best_practices

D, [2010-02-06T18:59:34.880802 #11387] DEBUG -- : continue_page: /flyerhzm/rfetion

D, [2010-02-06T18:59:34.881232 #11387] DEBUG -- : continue_page: /flyerhzm/bullet

D, [2010-02-06T18:59:34.881644 #11387] DEBUG -- : continue_page: /flyerhzm/metric_fu

D, [2010-02-06T18:59:34.882090 #11387] DEBUG -- : continue_page: /flyerhzm/exception_notification

D, [2010-02-06T18:59:34.882570 #11387] DEBUG -- : continue_page: /flyerhzm/activemerchant_patch_for_china

D, [2010-02-06T18:59:34.883087 #11387] DEBUG -- : continue_page: /flyerhzm/contactlist-client

D, [2010-02-06T18:59:34.883650 #11387] DEBUG -- : continue_page: /flyerhzm/taobao

D, [2010-02-06T18:59:34.884231 #11387] DEBUG -- : continue_page: /flyerhzm/monitor

D, [2010-02-06T18:59:34.884843 #11387] DEBUG -- : continue_page: /flyerhzm/sitemap

D, [2010-02-06T18:59:34.885491 #11387] DEBUG -- : continue_page: /flyerhzm/visual_partial

D, [2010-02-06T18:59:34.886370 #11387] DEBUG -- : continue_page: /flyerhzm/chinese_regions

D, [2010-02-06T18:59:34.887123 #11387] DEBUG -- : continue_page: /flyerhzm/codelinestatistics

D, [2010-02-06T18:59:34.888060 #11387] DEBUG -- : continue_page: /flyerhzm/rack

D, [2010-02-06T19:00:25.245306 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/regexp_crawler

D, [2010-02-06T19:00:27.168275 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/regexp_crawler

D, [2010-02-06T19:00:27.172163 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:27.172349 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/css_sprite

D, [2010-02-06T19:00:29.005109 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/css_sprite

D, [2010-02-06T19:00:29.008690 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:29.008882 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/chinese_permalink

D, [2010-02-06T19:00:30.672890 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/chinese_permalink

D, [2010-02-06T19:00:30.680095 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:30.680453 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/contactlist

D, [2010-02-06T19:00:32.332182 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/contactlist

D, [2010-02-06T19:00:32.336053 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:32.336222 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rails_best_practices

D, [2010-02-06T19:00:34.554523 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rails_best_practices

D, [2010-02-06T19:00:34.564731 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:34.565456 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rfetion

D, [2010-02-06T19:00:36.255873 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rfetion

D, [2010-02-06T19:00:36.260189 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:36.260389 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/bullet

D, [2010-02-06T19:00:39.847604 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/bullet

D, [2010-02-06T19:00:39.858775 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:39.859471 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/metric_fu

D, [2010-02-06T19:00:41.779917 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/metric_fu

D, [2010-02-06T19:00:41.780332 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/exception_notification

D, [2010-02-06T19:00:43.481367 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/exception_notification

D, [2010-02-06T19:00:43.481768 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/activemerchant_patch_for_china

D, [2010-02-06T19:00:45.111665 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/activemerchant_patch_for_china

D, [2010-02-06T19:00:45.114517 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:45.114687 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/contactlist-client

D, [2010-02-06T19:00:46.797493 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/contactlist-client

D, [2010-02-06T19:00:46.801662 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:46.801909 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/taobao

D, [2010-02-06T19:00:49.147218 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/taobao

D, [2010-02-06T19:00:49.147556 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/monitor

D, [2010-02-06T19:00:52.968478 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/monitor

D, [2010-02-06T19:00:52.971288 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:52.971458 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/sitemap

D, [2010-02-06T19:00:58.807052 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/sitemap

D, [2010-02-06T19:00:58.811199 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:00:58.811388 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/visual_partial

D, [2010-02-06T19:01:01.788958 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/visual_partial

D, [2010-02-06T19:01:01.793886 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:01:01.794191 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/chinese_regions

D, [2010-02-06T19:01:04.098727 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/chinese_regions

D, [2010-02-06T19:01:04.103930 #11387] DEBUG -- : response body captured

D, [2010-02-06T19:01:04.104248 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/codelinestatistics

D, [2010-02-06T19:01:06.304536 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/codelinestatistics

D, [2010-02-06T19:01:14.003714 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rack

D, [2010-02-06T19:01:16.551656 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rack

http://github.com/flyerhzm/regexp_crawler

regexp_crawler

A crawler which uses regular expression to catch data from website.

http://github.com/flyerhzm/css_sprite

css_sprite

A rails plugin to generate css sprite image automatically

http://github.com/flyerhzm/chinese_permalink

chinese_permalink

This plugin adds a capability for AR model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.

http://github.com/flyerhzm/contactlist

contactlist

java api to retrieve contact list of email(hotmail, gmail, yahoo, sohu, sina, 163, 126, tom, yeah, 189 and 139) and im(msn)

http://github.com/flyerhzm/rails_best_practices

rails_best_practices

rails_best_practices is a gem to check quality of rails app files according to ihower’s presentation from Kungfu RailsConf in Shanghai China

http://github.com/flyerhzm/rfetion

rfetion

rfetion is a ruby gem for China Mobile fetion service that you can send SMS free.

http://github.com/flyerhzm/bullet

bullet

A rails plugin/gem to kill N+1 queries and unused eager loading

http://github.com/flyerhzm/activemerchant_patch_for_china

activemerchant_patch_for_china

A rails plugin to add an active_merchant patch for china online payment platform including alipay (支付宝), 99bill (快钱) and tenpay (财付通)

http://github.com/flyerhzm/contactlist-client

contactlist-client

The contactlist-client gem is a ruby client to contactlist service which retrieves contact list of email(hotmail, gmail, yahoo, sohu, sina, 163, 126, tom, yeah, 189 and 139) and im(msn)

http://github.com/flyerhzm/monitor

monitor

Monitor gem can display ruby methods call stack on browser based on unroller

http://github.com/flyerhzm/sitemap

sitemap

This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb

http://github.com/flyerhzm/visual_partial

visual_partial

This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.

http://github.com/flyerhzm/chinese_regions

chinese_regions

provides all chinese regions, cities and districts