https://github.com/tylerrick/scraper

A ruby scraping library using Mechanize
https://github.com/tylerrick/scraper

Last synced: about 2 months ago
JSON representation

A ruby scraping library using Mechanize

Host: GitHub
URL: https://github.com/tylerrick/scraper
Owner: TylerRick
Created: 2011-07-26T04:05:14.000Z (almost 15 years ago)
Default Branch: master
Last Pushed: 2011-08-31T18:02:47.000Z (almost 15 years ago)
Last Synced: 2025-01-12T22:28:30.848Z (over 1 year ago)
Language: Ruby
Homepage:
Size: 89.8 KB
Stars: 1
Watchers: 2
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: Readme.md

Awesome Lists containing this project

README

          Scraper

=======

Getting started

---------------

Add to your Gemfile:

    gem 'scraper', :git => 'git://github.com/TylerRick/scraper.git'

Subclass `Scraper::Page` and provide, at a minimum, a `process` and `continue` method.

Example:

    class ThingPage < Scraper::Page

      attr_reader :thing

      def process_page

        thing_id = doc.at('#thing_id').try(:inner_text) or raise UnexpectedPageStructureError.new("Couldn't find thing_id")

        @thing = Thing.find_by_thing_id(thing_id) || Thing.new(thing_id: thing_id)

        get_name

        get_url

        save_record

      end

      def continue

        doc.search('#children_things a').select do |a|

          a['href'] =~ %r(^/things/)

        end.each do |a|

          crawl_child ThingPage, a['href']

        end

      end

    end

`parent` will automatically be available to the next `Page` object when you use `crawl_child`.

To start crawling:

    ThingPage.new(url).crawl

Motivation

----------

After looking at the state of the other existing Ruby scraping libraries, I decided none of them really did what I needed. So I extracted some patterns from some of the existing scrapers I've written with Mechanize and Nokogiri and this library was born!

Other libraries I looked at:

* **scrubyt** (no longer maintained, doesn't even run on Ruby 1.9, but otherwise looked interesting)

* **scrapi** (nice DSL in some ways, but in the end, seemed like too much sugar and not enough meat; it was hard to figure out how to do anything beyond their simple examples; it didn't seem like it could help me do what I was trying to do; and it didn't use Nokogiri)

License

-------

This is free software available under the terms of the MIT license.

To do

-----

* Write tests

* etc.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/tylerrick/scraper

Awesome Lists containing this project

README