https://github.com/bfontaine/lazyscraper
The easy way to make lazy entity-oriented Web scrapers
- Host: GitHub
- URL: https://github.com/bfontaine/lazyscraper
- Owner: bfontaine
- License: mit
- Created: 2013-08-18T21:04:52.000Z (about 12 years ago)
- Default Branch: master
- Last Pushed: 2014-02-18T09:54:25.000Z (over 11 years ago)
- Last Synced: 2025-03-23T23:43:43.599Z (7 months ago)
- Topics: ruby, scrapping
- Language: Ruby
- Homepage:
- Size: 171 KB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# LazyScraper
LazyScraper is the easy way to define lazy, entity-oriented Web scrapers.

Note: This is only a proof-of-concept.
## Usage
Let’s say we want to fetch some reviews from the FooBar website (which doesn’t have
a public API). Reviews are located at `'/review?product_id=something'` (we’ll
leave out the domain part here).

We start by creating a class which inherits from `LazyScraper::Entity`:
```rb
class FooBarReview < LazyScraper::Entity
end
```

Then we’ll add some hooks. A hook maps a set of attributes to a URL with a
parser. This ensures that a webpage is fetched and parsed only once,
and only when one of its attributes is first needed. Here, we’ll assume that each
review has a product id we know, a product name, a score, and a text. They are all
located on the same page, but LazyScraper also supports hooks on multiple URLs.

```rb
class FooBarReview < LazyScraper::Entity
  # one hook: three attributes, all read from the same page
  attr_hook '/review?product_id=:product_id',
            :product_name, :score, :text do |doc, attrs|
    attrs[:product_name] = doc.css('#product .name').text
    attrs[:score] = doc.css('#score').text.to_i
    attrs[:text] = doc.css('#text').text
  end
end
```

Here, `attr_hook` takes the path to the page, with a `:product_id` placeholder
that will later be replaced by the actual `product_id` of a review. Then, we
give it the list of attributes that depend on this webpage. This way, the
page is fetched and parsed *only* the first time we access one of those
attributes. The last argument is a block that takes a Nokogiri document and a
hash, which the block populates with the parsed values.

That’s all; we can now try our class:
```rb
# note how we’re given the product id
lazy_review = FooBarReview.new :product_id => 42
# we haven’t fetched the page yet
lazy_review.text # this fetches the page and returns the text
lazy_review.score # this returns the score without fetching the page again
```

## Requirements

* Ruby 2.x
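
## Sketch: how a lazy hook could work

To make the lazy-loading idea from the Usage section concrete, here is a minimal, self-contained sketch in plain Ruby. It is not LazyScraper’s actual implementation: `LazyEntitySketch`, `ReviewSketch`, `fetch_count`, and the `fetcher` argument are hypothetical names introduced for illustration. A lambda returning a hash stands in for the HTTP request and the Nokogiri document, so the example runs without any network access.

```ruby
# Hypothetical sketch, NOT LazyScraper's real code.
class LazyEntitySketch
  Hook = Struct.new(:path, :attributes, :parser)

  class << self
    # Hooks registered on this class.
    def hooks
      @hooks ||= []
    end

    # Register a hook and define one lazy reader per attribute.
    def attr_hook(path, *attributes, &parser)
      hook = Hook.new(path, attributes, parser)
      hooks << hook
      attributes.each do |name|
        define_method(name) do
          # Run the hook only if the value is not known yet.
          run_hook(hook) unless @attrs.key?(name)
          @attrs[name]
        end
      end
    end
  end

  attr_reader :fetch_count

  # `fetcher` (an assumption of this sketch) is any callable
  # that maps a path to a parsed document.
  def initialize(attrs, fetcher)
    @attrs = attrs.dup
    @fetcher = fetcher
    @fetch_count = 0
  end

  private

  def run_hook(hook)
    # Replace each :placeholder with the matching known attribute.
    path = hook.path.gsub(/:(\w+)/) do
      @attrs.fetch(Regexp.last_match(1).to_sym).to_s
    end
    @fetch_count += 1
    hook.parser.call(@fetcher.call(path), @attrs)
  end
end

class ReviewSketch < LazyEntitySketch
  attr_hook '/review?product_id=:product_id',
            :product_name, :score, :text do |doc, attrs|
    attrs[:product_name] = doc[:name]
    attrs[:score] = doc[:score].to_i
    attrs[:text] = doc[:text]
  end
end

# Stand-in for the network: records each requested path.
requested = []
fake_fetcher = lambda do |path|
  requested << path
  { name: 'Widget', score: '4', text: 'Works fine.' }
end

review = ReviewSketch.new({ product_id: 42 }, fake_fetcher)
review.text  # first access: runs the hook, fetching once
review.score # already populated: no second fetch
```

After these two reads, `requested` holds a single path, `'/review?product_id=42'`. The sketch decides whether to fetch by checking if the attribute is already in the hash, so it relies on the parser block populating every attribute the hook declares.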