{"id":22769768,"url":"https://github.com/bfontaine/lazyscraper","last_synced_at":"2025-03-30T11:29:37.937Z","repository":{"id":144860231,"uuid":"12201773","full_name":"bfontaine/LazyScraper","owner":"bfontaine","description":"The easy way to make lazy entity-oriented Web scrapers","archived":false,"fork":false,"pushed_at":"2014-02-18T09:54:25.000Z","size":175,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-23T23:43:43.599Z","etag":null,"topics":["ruby","scrapping"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bfontaine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-08-18T21:04:52.000Z","updated_at":"2019-10-26T07:16:21.000Z","dependencies_parsed_at":"2023-03-22T13:12:35.379Z","dependency_job_id":null,"html_url":"https://github.com/bfontaine/LazyScraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bfontaine%2FLazyScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bfontaine%2FLazyScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bfontaine%2FLazyScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bfontaine%2FLazyScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bfontaine","download_url":"https://codeload.github.com/bfontaine/LazyScraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github
.com","kind":"github","repositories_count":246313104,"owners_count":20757429,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ruby","scrapping"],"created_at":"2024-12-11T15:15:36.493Z","updated_at":"2025-03-30T11:29:37.915Z","avatar_url":"https://github.com/bfontaine.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# LazyScraper\n\n[![Build Status](https://travis-ci.org/bfontaine/LazyScraper.png)](https://travis-ci.org/bfontaine/LazyScraper)\n[![Coverage Status](https://coveralls.io/repos/bfontaine/LazyScraper/badge.png)](https://coveralls.io/r/bfontaine/LazyScraper)\n\nLazyScraper is the easy way to define lazy entity-oriented Web scrapers.\n\nNote: This is only a proof-of-concept.\n\n## Usage\n\nLet’s say we want to fetch some reviews from the FooBar website (which doesn’t have\na public API). Reviews are located at `'/review?product_id=something'` (we’ll\nleave out the domain part here).\n\nWe start by creating a class which inherits from `LazyScraper::Entity`:\n\n```rb\nclass FooBarReview \u003c LazyScraper::Entity\nend\n```\n\nThen we’ll add some hooks. A hook maps a set of attributes to a URL with a\nparser. This is used to ensure that a webpage is fetched \u0026 parsed only once,\nand only at the right time. Here, we’ll assume that each review has a product id\nwe know, a product name, a score, and a text. 
They are all located on the\nsame page, but LazyScraper also supports hooks on multiple URLs.\n\n```rb\nclass FooBarReview \u003c LazyScraper::Entity\n  attr_hook '/review?product_id=:product_id',\n    :product_name, :score, :text do |doc, attrs|\n\n    attrs[:product_name] = doc.css('#product .name').text\n    attrs[:score]        = doc.css('#score').text.to_i\n    attrs[:text]         = doc.css('#text').text\n  end\nend\n```\n\nHere, `attr_hook` takes the path to the page, with a `:product_id` placeholder,\nwhich will later be replaced by the actual `product_id` of a review. Then, we\ngive it the list of attributes which depend on this webpage. This way, the\npage will be fetched and parsed *only* the first time we access one of the\nattributes. The last argument is a block which takes a Nokogiri document and a\nhash we’ll populate in it.\n\nThat’s all, we can now try our class:\n\n```rb\n# note how we’re given the product id\nlazy_review = FooBarReview.new :product_id =\u003e 42\n\n# we haven’t fetched the page yet\n\nlazy_review.text  # this fetches the page and returns the text\nlazy_review.score # this returns the score without fetching the page again\n```\n\n## Requirements\n\n* Ruby 2.x\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbfontaine%2Flazyscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbfontaine%2Flazyscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbfontaine%2Flazyscraper/lists"}