https://github.com/afiore/extraloop

Ruby online data extraction toolkit

Last synced: 8 months ago

Host: GitHub
URL: https://github.com/afiore/extraloop
Owner: afiore
Created: 2011-12-22T15:23:36.000Z (over 14 years ago)
Default Branch: master
Last Pushed: 2012-03-27T15:41:42.000Z (about 14 years ago)
Last Synced: 2025-10-12T08:43:03.052Z (8 months ago)
Language: Ruby
Homepage: https://github.com/afiore/extraloop
Size: 211 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: History.txt

README

# Extra Loop

A Ruby library for extracting structured data from websites and web-based APIs.

Supports most common document formats (i.e. HTML, XML, CSV, and JSON), and comes with a handy mechanism for iterating over paginated datasets.

## Installation:

    gem install extraloop

## Usage:

A basic scraper that fetches the top 25 websites from [Alexa's daily top 100](http://www.alexa.com/topsites) list:

    # build the scraper: one record per matched "li.site-listing" element
    alexa_scraper = ExtraLoop::ScraperBase.
      new("http://www.alexa.com/topsites").
      loop_on("li.site-listing").
        extract(:site_name, "h2").
        extract(:url, "h2 a").
        extract(:description, ".description").
      on(:data) { |data| data.each { |record| puts record.site_name } }

    # perform the request and fire the :data hook
    alexa_scraper.run

An iterative scraper that fetches the URL, title, and publisher of some 110 Google News articles mentioning the keyword _'Egypt'_:

    results = []

    ExtraLoop::IterativeScraper.
      new("https://www.google.com/search?tbm=nws&q=Egypt").
      set_iteration(:start, (1..101).step(10)).
      loop_on("h3") { |nodes| nodes.map(&:parent) }.
        extract(:title, "h3.r a").
        extract(:url, "h3.r a", :href).
        extract(:source, "br") { |node| node.next.text.split("-").first }.
      on(:data) { |data, response| data.each { |record| results << record } }.
      run()

## Scraper initialisation signature

    #new(urls, scraper_options, http_options)

- __urls__ - a single URL, or an array of URLs.

- __scraper_options__ - hash of scraper options (see below).

- __http_options__ - hash of request options for `Typhoeus::Request#initialize` (see [API documentation](http://rubydoc.info/github/pauldix/typhoeus/master/Typhoeus/Request#initialize-instance_method) for details).

### scraper options:

* __format__ - Specifies the scraped document format; needed only if the Content-Type in the server response is not the correct one. Supported formats are: 'html', 'xml', 'json', and 'csv'. 

* __async__ - Specifies whether the scraper's HTTP requests should be run in parallel or in series (defaults to false). **Note:** currently only GET requests can be run asynchronously.

* __log__ - Logging options hash:

     * __loglevel__  - a symbol specifying the desired log level (defaults to `:info`).

     * __appenders__ - a list of `Logging.appenders` objects (defaults to `Logging.appenders.stderr`).

## Extractors

ExtraLoop lets you fetch structured data from online documents by looping through a list of elements matching a given selector.

For each matched element, an arbitrary set of fields can be extracted. While the `loop_on` method sets up the loop, the `extract` method extracts a specific piece of information from an element (e.g. a story's title) and stores it in a field of the record.

    # looping over a set of document elements using a CSS3 (or XPath) selector
    loop_on('div.post')

    # looping over the elements returned by a block (the parsed document is passed in as its first argument)
    loop_on { |doc| doc.search('div.post') }

    # using both a selector and a proc (the matched element list is passed in to the proc as its first argument)
    loop_on('div.post') { |posts| posts.reject { |post| post.attr(:class) == 'sticky' } }

Both the `loop_on` and the `extract` methods may be called with a selector, a block, or a combination of the two. By default, when parsing DOM documents, `extract` will call `Nokogiri::XML::Node#text()`. Alternatively, `extract` also accepts an attribute name or a block. The latter is evaluated in the context of the current iteration's element.

    # extract a story's title
    extract(:title, 'h3')

    # extract a story's url
    extract(:url, "a.link-to-story", :href)

    # extract a description text, separating paragraphs with newlines
    extract(:description, "div.description") { |node| node.css("p").map(&:text).join("\n") }

### Extracting data from JSON Documents

While processing an HTTP response, ExtraLoop tries to automatically detect the scraped document format by looking at the `Content-Type` header sent by the server. This value can be overridden by providing a `:format` key in the scraper's initialization options. When the format is JSON, the document is parsed using the `yajl` JSON parser and converted into a hash. In this case, both the `loop_on` and the `extract` methods still behave as illustrated above, except that they do not support CSS3/XPath selectors.
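The detection rule described above can be sketched in plain Ruby. This is only an illustration of the idea, not ExtraLoop's actual implementation: the `detect_format` helper, the `SUPPORTED_FORMATS` constant, and the fallback to `"html"` for unrecognised media types are all assumptions made for this example.

```ruby
# Hypothetical helper, not part of ExtraLoop's API.
SUPPORTED_FORMATS = %w[html xml json csv].freeze

def detect_format(content_type, override = nil)
  # An explicit :format option takes precedence over the response header.
  return override.to_s if override

  # Drop parameters such as "; charset=utf-8" and normalise the case.
  media_type = content_type.to_s.split(";").first.to_s.strip.downcase

  # Match on the end of the media type, so that "application/json"
  # maps to "json"; falling back to "html" is an assumption.
  SUPPORTED_FORMATS.find { |format| media_type.end_with?(format) } || "html"
end

puts detect_format("application/json; charset=utf-8")
puts detect_format("text/html", :json)
```

The second call shows the override case: the header says HTML, but the explicit format wins.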

When working with JSON data, you can just use a block and have it return the document elements you want to loop on.

    # Fetch a portion of a document using a proc

    loop_on { |data| data['query']['categorymembers'] }

Alternatively, the same loop can be defined by passing an array of keys pointing at a hash value located several levels down in the parsed document structure.

    # Same as above, using a hash path

    loop_on(['query', 'categorymembers'])

When fetching fields from a JSON document fragment, `extract` will often not need a block or an array of keys. If called with only one argument, it will try to fetch a hash value using the provided field name as the key.

    # current node:
    #
    # {
    #  'from_user' => "johndoe",
    #  'text' => 'bla bla bla',
    #  'from_user_id'..
    # }
    # >> extract(:from_user)
    # => "johndoe"
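The two JSON lookups described in this section (a hash path given to `loop_on`, and a single-argument `extract`) behave roughly like the plain Ruby below. This is a sketch that uses only the standard `json` library and `Hash#dig`. It does not call ExtraLoop, the sample document is made up, and `fetch_field` is a hypothetical helper.

```ruby
require "json"

# Made-up stand-in for a parsed JSON API response.
document = JSON.parse(<<~JSON)
  {"query": {"categorymembers": [
    {"title": "Cairo", "pageid": 1},
    {"title": "Giza", "pageid": 2}
  ]}}
JSON

# A hash path is a list of keys followed one level at a time.
nodes = document.dig("query", "categorymembers")

# With no selector or block, the field name itself serves as the key.
# Field names are symbols while parsed JSON keys are strings, hence to_s.
def fetch_field(node, field_name)
  node[field_name.to_s]
end

titles = nodes.map { |node| fetch_field(node, :title) }
puts titles.inspect
```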

## Iteration methods

The `IterativeScraper` class comes with two methods that allow scrapers to loop over paginated content.

### set\_iteration

* __iteration_parameter__ - A symbol identifying the request parameter that the scraper will use as offset in order to iterate over the paginated content.

* __array_or_range_or_block__ - Either an explicit set of values or a block of code. If provided, the block is called with the parsed document object as its first argument. The block should return a non-empty array, whose values are assigned in turn to the offset parameter, one per iteration. If the block fails to return a non-empty array, the iteration stops.
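To make the relationship between offset values and requests concrete, the sketch below expands a range into one URL per iteration using only Ruby's standard `uri` library. `paginated_urls` is a hypothetical helper, and appending the parameter to the existing query string is an assumption about how the requests are built, not a description of ExtraLoop's internals.

```ruby
require "uri"

# Hypothetical helper: builds one request URL per offset value.
def paginated_urls(base_url, parameter, offsets)
  offsets.map do |offset|
    uri = URI.parse(base_url)
    # Keep the existing query parameters and append the offset.
    params = URI.decode_www_form(uri.query.to_s) << [parameter.to_s, offset.to_s]
    uri.query = URI.encode_www_form(params)
    uri.to_s
  end
end

urls = paginated_urls("https://www.google.com/search?tbm=nws&q=Egypt",
                      :start, (1..101).step(10))
puts urls.length
puts urls.first
puts urls.last
```

The range `(1..101).step(10)` yields 11 offsets (1, 11, ..., 101), so a scraper configured this way issues 11 requests.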

### continue\_with

The second iteration method, `#continue_with`, continues an iteration for as long as a block of code returns a truthy value (which is then assigned to the iteration parameter).

* __iteration_parameter__ - the scraper's iteration parameter.

* __&block__ - An arbitrary block of Ruby code. Its return value is used as the value of the next iteration's offset parameter.
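The stopping rule can be illustrated without any HTTP traffic. In the sketch below, the `PAGES` hash stands in for server responses and `each_page` is a hypothetical driver loop. Neither is part of ExtraLoop. The only point is to show the block's return value feeding the next request until it becomes falsy.

```ruby
# Made-up paginated responses, keyed by the cursor that requests them.
PAGES = {
  nil  => { "items" => %w[a b], "next_cursor" => "p2" },
  "p2" => { "items" => %w[c d], "next_cursor" => "p3" },
  "p3" => { "items" => %w[e],   "next_cursor" => nil }
}.freeze

# Hypothetical driver: fetches pages until the block returns nil or false.
def each_page(&next_value)
  cursor = nil
  pages = []
  loop do
    document = PAGES.fetch(cursor)
    pages << document
    # The block's return value becomes the next iteration parameter.
    cursor = next_value.call(document)
    break unless cursor
  end
  pages
end

items = each_page { |doc| doc["next_cursor"] }.flat_map { |doc| doc["items"] }
puts items.inspect
```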

## Running tests

ExtraLoop uses `rspec` and `rr` as its testing framework. The test suite can be run by calling the `rspec` executable from within the `spec` directory:

    cd spec

    rspec *
