https://github.com/flyerhzm/regexp_crawler
A crawler which uses regular expressions to extract data from websites.
- Host: GitHub
- URL: https://github.com/flyerhzm/regexp_crawler
- Owner: flyerhzm
- License: mit
- Created: 2009-07-08T12:20:26.000Z (over 15 years ago)
- Default Branch: master
- Last Pushed: 2010-02-06T14:43:04.000Z (almost 15 years ago)
- Last Synced: 2024-09-17T20:10:04.545Z (about 2 months ago)
- Language: Ruby
- Homepage:
- Size: 103 KB
- Stars: 45
- Watchers: 4
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.textile
- License: LICENSE
README
h1. RegexpCrawler
regexp_crawler is a crawler that uses regular expressions to extract data from websites. It is easy to use and requires very little code if you are familiar with regular expressions.
**************************************************************************
h2. Install
sudo gem install regexp_crawler

**************************************************************************
h2. Usage
It's really easy to use, sometimes just one line.

RegexpCrawler::Crawler.new(options).start

options is a hash:

* :start_page, mandatory, a string defining the website URL where the crawler starts
* :continue_regexp, optional, a regexp defining which URLs the crawler continues to crawl; it is applied with String#scan and the first non-nil capture of each match is used
* :capture_regexp, mandatory, a regexp defining what content the crawler captures; it is applied with Regexp#match and all group captures are collected
* :named_captures, mandatory, a string array giving names to the captured groups of :capture_regexp
* :model, optional if :save_method is defined, a string naming the result's model class
* :save_method, optional if :model is defined, a proc defining how to save a crawled result; the proc accepts two parameters, the crawled result of one page and the crawled URL
* :headers, optional, a hash of HTTP headers
* :encoding, optional, a string naming the encoding of the crawled page; results are converted to UTF-8
* :need_parse, optional, a proc deciding whether a page should be parsed by the regexp; the proc accepts two parameters, the crawled page's URI and the response body of the crawled page
* :logger, optional, true to log to STDOUT, or a Logger object to log to that logger

If the crawler defines :model but no :save_method, RegexpCrawler::Crawler#start returns an array of results, such as

[{:model_name => {:attr_name => 'attr_value'}, :page => 'website url'}, {:model_name => {:attr_name => 'attr_value'}, :page => 'another website url'}]
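For illustration only (this is not one of the gem's own examples), here is a minimal sketch of that return-value style. The site URL, the regexps, the model name and the attribute names below are all assumptions about a hypothetical page layout:

require 'rubygems'
require 'regexp_crawler'

# Hypothetical blog whose index page links to /posts/<id> pages containing
# <h1>...</h1> and <div class="body">...</div>. Because :model is given and
# :save_method is not, #start returns an array of result hashes.
results = RegexpCrawler::Crawler.new(
  :start_page => "http://example.com/posts",
  :continue_regexp => %r{<a href="(/posts/\d+)">}m,
  :capture_regexp => %r{<h1>(.*?)</h1>.*?<div class="body">(.*?)</div>}m,
  :named_captures => ['title', 'body'],
  :model => 'post',
  :headers => {'User-Agent' => 'regexp_crawler example'},
  :logger => true
).start

# Expected shape per the README:
# {:post => {:title => '...', :body => '...'}, :page => 'http://example.com/posts/1'}
results.each do |result|
  puts result[:page]
  puts result[:post][:title]
end

When :save_method is defined instead, each crawled result is handed to that proc as it is crawled, as in the GitHub example below.

**************************************************************************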
h2. Example
A script to synchronize your GitHub projects (excluding forked projects); see example/github_projects.rb:
require 'rubygems'
require 'regexp_crawler'

class Project
attr_accessor :title, :description, :body, :url

def initialize(options)
options.each do |k, v|
self.instance_variable_set("@#{k}", v)
end
end
end

projects = []
crawler = RegexpCrawler::Crawler.new(
:start_page => "http://github.com/flyerhzm",
:continue_regexp => %r{[\s\n]*?}m,
:capture_regexp => %r{(.*?).*?[\s\n]*?)}m,(.*?)[\s\n]*?.*?
:named_captures => ['title', 'description', 'body'],
:logger => true,
:save_method => Proc.new do |result, page|
projects << Project.new(result.merge(:url => page))
end,
:need_parse => Proc.new do |page, response_body|
!response_body.index(//)
end)
crawler.start

projects.each do |project|
puts project.url
puts project.title
puts project.description
end

The results are as follows:
D, [2010-02-06T18:59:32.487885 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm
D, [2010-02-06T18:59:34.877730 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm
D, [2010-02-06T18:59:34.878158 #11387] DEBUG -- : continue_page: /flyerhzm/regexp_crawler
D, [2010-02-06T18:59:34.878462 #11387] DEBUG -- : continue_page: /flyerhzm/css_sprite
D, [2010-02-06T18:59:34.878707 #11387] DEBUG -- : continue_page: /flyerhzm/chinese_permalink
D, [2010-02-06T18:59:34.878991 #11387] DEBUG -- : continue_page: /flyerhzm/contactlist
D, [2010-02-06T18:59:34.879299 #11387] DEBUG -- : continue_page: /flyerhzm/rails_best_practices
D, [2010-02-06T18:59:34.880802 #11387] DEBUG -- : continue_page: /flyerhzm/rfetion
D, [2010-02-06T18:59:34.881232 #11387] DEBUG -- : continue_page: /flyerhzm/bullet
D, [2010-02-06T18:59:34.881644 #11387] DEBUG -- : continue_page: /flyerhzm/metric_fu
D, [2010-02-06T18:59:34.882090 #11387] DEBUG -- : continue_page: /flyerhzm/exception_notification
D, [2010-02-06T18:59:34.882570 #11387] DEBUG -- : continue_page: /flyerhzm/activemerchant_patch_for_china
D, [2010-02-06T18:59:34.883087 #11387] DEBUG -- : continue_page: /flyerhzm/contactlist-client
D, [2010-02-06T18:59:34.883650 #11387] DEBUG -- : continue_page: /flyerhzm/taobao
D, [2010-02-06T18:59:34.884231 #11387] DEBUG -- : continue_page: /flyerhzm/monitor
D, [2010-02-06T18:59:34.884843 #11387] DEBUG -- : continue_page: /flyerhzm/sitemap
D, [2010-02-06T18:59:34.885491 #11387] DEBUG -- : continue_page: /flyerhzm/visual_partial
D, [2010-02-06T18:59:34.886370 #11387] DEBUG -- : continue_page: /flyerhzm/chinese_regions
D, [2010-02-06T18:59:34.887123 #11387] DEBUG -- : continue_page: /flyerhzm/codelinestatistics
D, [2010-02-06T18:59:34.888060 #11387] DEBUG -- : continue_page: /flyerhzm/rack
D, [2010-02-06T19:00:25.245306 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/regexp_crawler
D, [2010-02-06T19:00:27.168275 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/regexp_crawler
D, [2010-02-06T19:00:27.172163 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:27.172349 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/css_sprite
D, [2010-02-06T19:00:29.005109 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/css_sprite
D, [2010-02-06T19:00:29.008690 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:29.008882 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/chinese_permalink
D, [2010-02-06T19:00:30.672890 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/chinese_permalink
D, [2010-02-06T19:00:30.680095 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:30.680453 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/contactlist
D, [2010-02-06T19:00:32.332182 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/contactlist
D, [2010-02-06T19:00:32.336053 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:32.336222 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rails_best_practices
D, [2010-02-06T19:00:34.554523 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rails_best_practices
D, [2010-02-06T19:00:34.564731 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:34.565456 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rfetion
D, [2010-02-06T19:00:36.255873 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rfetion
D, [2010-02-06T19:00:36.260189 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:36.260389 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/bullet
D, [2010-02-06T19:00:39.847604 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/bullet
D, [2010-02-06T19:00:39.858775 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:39.859471 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/metric_fu
D, [2010-02-06T19:00:41.779917 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/metric_fu
D, [2010-02-06T19:00:41.780332 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/exception_notification
D, [2010-02-06T19:00:43.481367 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/exception_notification
D, [2010-02-06T19:00:43.481768 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/activemerchant_patch_for_china
D, [2010-02-06T19:00:45.111665 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/activemerchant_patch_for_china
D, [2010-02-06T19:00:45.114517 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:45.114687 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/contactlist-client
D, [2010-02-06T19:00:46.797493 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/contactlist-client
D, [2010-02-06T19:00:46.801662 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:46.801909 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/taobao
D, [2010-02-06T19:00:49.147218 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/taobao
D, [2010-02-06T19:00:49.147556 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/monitor
D, [2010-02-06T19:00:52.968478 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/monitor
D, [2010-02-06T19:00:52.971288 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:52.971458 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/sitemap
D, [2010-02-06T19:00:58.807052 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/sitemap
D, [2010-02-06T19:00:58.811199 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:00:58.811388 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/visual_partial
D, [2010-02-06T19:01:01.788958 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/visual_partial
D, [2010-02-06T19:01:01.793886 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:01:01.794191 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/chinese_regions
D, [2010-02-06T19:01:04.098727 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/chinese_regions
D, [2010-02-06T19:01:04.103930 #11387] DEBUG -- : response body captured
D, [2010-02-06T19:01:04.104248 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/codelinestatistics
D, [2010-02-06T19:01:06.304536 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/codelinestatistics
D, [2010-02-06T19:01:14.003714 #11387] DEBUG -- : crawling page: http://github.com/flyerhzm/rack
D, [2010-02-06T19:01:16.551656 #11387] DEBUG -- : crawling success: http://github.com/flyerhzm/rack
http://github.com/flyerhzm/regexp_crawler
regexp_crawler
A crawler which uses regular expression to catch data from website.
http://github.com/flyerhzm/css_sprite
css_sprite
A rails plugin to generate css sprite image automatically
http://github.com/flyerhzm/chinese_permalink
chinese_permalink
This plugin adds a capability for AR model to create a seo permalink with your chinese text. It will translate your chinese text to english url based on google translate.
http://github.com/flyerhzm/contactlist
contactlist
java api to retrieve contact list of email(hotmail, gmail, yahoo, sohu, sina, 163, 126, tom, yeah, 189 and 139) and im(msn)
http://github.com/flyerhzm/rails_best_practices
rails_best_practices
rails_best_practices is a gem to check quality of rails app files according to ihower’s presentation from Kungfu RailsConf in Shanghai China
http://github.com/flyerhzm/rfetion
rfetion
rfetion is a ruby gem for China Mobile fetion service that you can send SMS free.
http://github.com/flyerhzm/bullet
bullet
A rails plugin/gem to kill N+1 queries and unused eager loading
http://github.com/flyerhzm/activemerchant_patch_for_china
activemerchant_patch_for_china
A rails plugin to add an active_merchant patch for china online payment platform including alipay (支付宝), 99bill (快钱) and tenpay (财付通)
http://github.com/flyerhzm/contactlist-client
contactlist-client
The contactlist-client gem is a ruby client to contactlist service which retrieves contact list of email(hotmail, gmail, yahoo, sohu, sina, 163, 126, tom, yeah, 189 and 139) and im(msn)
http://github.com/flyerhzm/monitor
monitor
Monitor gem can display ruby methods call stack on browser based on unroller
http://github.com/flyerhzm/sitemap
sitemap
This plugin will generate a sitemap.xml from sitemap.rb whose format is very similar to routes.rb
http://github.com/flyerhzm/visual_partial
visual_partial
This plugin provides a way that you can see all the partial pages rendered. So it can prevent you from using partial page too much, which hurts the performance.
http://github.com/flyerhzm/chinese_regions
chinese_regions
provides all chinese regions, cities and districts