https://github.com/coryodaniel/klepto

A mean little DSL'd poltergeist (capybara) based web crawler that stuffs data into your Rails app.

Host: GitHub
URL: https://github.com/coryodaniel/klepto
Owner: coryodaniel
License: mit
Created: 2013-04-10T17:38:48.000Z (about 13 years ago)
Default Branch: master
Last Pushed: 2013-07-18T19:14:39.000Z (almost 13 years ago)
Last Synced: 2025-03-18T04:42:51.280Z (over 1 year ago)
Language: Ruby
Size: 309 KB
Stars: 21
Watchers: 2
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE.txt


# Klepto

A mean little DSL'd capybara (poltergeist) based web scraper that structures data into ActiveRecord or wherever(TM).

## Features 

* CSS or XPath Syntax

* Full JavaScript processing via PhantomJS / Poltergeist

* All the fun of capybara

* Scrape multiple pages with a single bot

* Pretty nifty DSL

* Test coverage!

## Installing

You need at least PhantomJS 1.8.1. There are *no other external dependencies* (you don't need Qt, a running X server, etc.).
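To sanity-check the version requirement before wiring anything up, something like the sketch below may help. It is not part of Klepto: `which`, `phantomjs_new_enough?` and `MINIMUM_PHANTOMJS` are hypothetical names made up for this example, and it assumes the binary is called `phantomjs` and answers `--version` with a plain version string.

```ruby
require 'rubygems'

# Minimum version stated in this README.
MINIMUM_PHANTOMJS = Gem::Version.new('1.8.1')

# Hypothetical helper: returns the full path of an executable on PATH, or nil.
def which(name, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Hypothetical helper: true when a version string meets the minimum.
def phantomjs_new_enough?(version_string)
  Gem::Version.new(version_string.strip) >= MINIMUM_PHANTOMJS
end

if $PROGRAM_NAME == __FILE__
  binary = which('phantomjs')
  if binary.nil?
    puts 'phantomjs was not found on your PATH'
  else
    version = `"#{binary}" --version`
    verdict = phantomjs_new_enough?(version) ? 'OK' : 'too old'
    puts "phantomjs #{version.strip}: #{verdict}"
  end
end
```

`Gem::Version` is used for the comparison so the version parts are compared as numbers rather than as plain strings.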

### Mac ###

* *Homebrew*: `brew install phantomjs`

* *MacPorts*: `sudo port install phantomjs`

* *Manual install*: [Download this](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-macosx.zip&can=2&q=)

### Linux ###

* Download the [32 bit](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-linux-i686.tar.bz2&can=2&q=) or [64 bit](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-linux-x86_64.tar.bz2&can=2&q=) binary.

* Extract the tarball and copy `bin/phantomjs` into your `PATH`

### Windows ###

* Download the [precompiled binary](http://phantomjs.org/download.html) for Windows

### Manual compilation ###

Do this as a last resort if the binaries don't work for you. It will take quite a long time as it has to build WebKit.

* Download [the source tarball](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-source.zip&can=2&q=)

* Extract and cd in

* `./build.sh`

(See also the [PhantomJS building guide](http://phantomjs.org/build.html).)

Then add klepto to your Gemfile.

```ruby
gem 'klepto', '>= 0.2.5'
```

## Usage (All your content are belong to us)

Say you want a bunch of Bieb tweets! How is there not profit in that?

```ruby
# Fetch one web site or several. Bot#new takes a *splat!
@bot = Klepto::Bot.new("https://twitter.com/justinbieber"){
  # By default, it uses CSS selectors
  name      'h1.fullname'

  # If you love C# or you are over 40, XPath is an option!
  username "//span[contains(concat(' ',normalize-space(@class),' '),' screen-name ')]", :syntax => :xpath

  # By default Klepto uses the #text method, you can pass an :attr to use instead...
  #   or a block that will receive the Capybara Node or Result set.
  tweet_ids 'li.stream-item', :match => :all, :attr => 'data-item-id'

  # Want to match all the nodes for the selector? Pass :match => :all
  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  # Nested structures? Let klepto know this is a resource
  last_tweet 'li.stream-item', :as => :resource do
    twitter_id do |node|
      node['data-item-id']
    end
    content '.content p'
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :attr => :href
  end

  # Multiple nested structures? Let klepto know this is a collection of resources
  # Does Bieber tweet too much? Maybe. Let's only get the new stuff kids crave.
  tweets    'li.stream-item', :as => :collection, :limit => 10 do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end

  # Set some headers, why not.
  config.headers({
    'Referer'     => 'http://www.twitter.com'
  })

  # on_http_status can take a splat of statuses or status patterns ('4xx', '5xx')
  #   you can also have multiple handlers on a status
  #   Note: Capybara automatically follows redirects, so the 3xx statuses
  #   are never present. If you want to watch for a redirect, pass :redirect
  #   as in the first handler here.
  config.on_http_status(:redirect){
    puts "Something redirected..."
  }
  config.on_http_status(200){
    puts "Expected this, NBD."
  }
  config.on_http_status('5xx','4xx'){
    puts "HOLY CRAP!"
  }

  config.after(:get) do |page|
    # This is fired after each HTTP GET. It receives a Capybara::Node
  end

  # If you want to do something with each resource, like stick it in AR
  #   go for it here...
  config.after do |resource|
    @user = User.new
    @user.name = resource[:name]
    @user.username = resource[:username]
    @user.save

    resource[:tweets].each do |tweet|
      Tweet.create(tweet)
    end
  end #=> Profit!
}

# You can get an array of hashes (resources), so if you wanted to do something else
# you could do it here...
@bot.resources.each do |resource|
  pp resource
end
```
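The `on_http_status` handlers take exact codes like `200` or wildcard patterns like `'4xx'`. How Klepto matches these internally isn't covered here, so the following is only a plain-Ruby sketch of one way such matching could work. `status_matches?` is a hypothetical helper, not a Klepto method, and the special `:redirect` case is left out.

```ruby
# Hypothetical helper (not Klepto's implementation): does an HTTP status
# match any of the given patterns? A pattern is an Integer (exact match)
# or a String where 'x' stands for any single digit, e.g. '4xx'.
def status_matches?(status, *patterns)
  code = status.to_s
  patterns.flatten.any? do |pattern|
    case pattern
    when Integer
      status == pattern
    when String
      pattern.length == code.length &&
        pattern.downcase.chars.zip(code.chars).all? do |expected, actual|
          expected == 'x' || expected == actual
        end
    else
      false
    end
  end
end

# Several handlers can be interested in the same status, so collect them all.
handlers = [
  [[200],          -> { 'Expected this, NBD.' }],
  [%w[5xx 4xx],    -> { 'HOLY CRAP!' }],
  [['40x'],        -> { 'Probably a client-side problem.' }]
]

[200, 404, 503].each do |status|
  messages = handlers.select { |patterns, _| status_matches?(status, *patterns) }
                     .map { |_, handler| handler.call }
  puts "#{status}: #{messages.join(' / ')}"
end
```

Comparing digit by digit keeps the wildcard handling explicit and avoids building regular expressions from user-supplied patterns.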

## Got a string of HTML you don't need to crawl first?

```ruby
@html = Capybara::Node::Simple.new(@html_string)

@structure = Klepto::Structure.build(@html){
  # inside the build method, everything works the same as Bot.new
  name      'h1.fullname'
  username  'span.screen-name'

  links 'span.url a', :match => :all do |node|
    node[:href]
  end

  tweets    'li.stream-item', :as => :collection do
    twitter_id do |node|
      node['data-item-id']
    end
    tweet '.content p', :css
    timestamp '._timestamp', :attr => 'data-time'
    permalink '.time a', :css, :attr => :href
  end
}
```

## Configuration Options

* config.headers - Hash; sets request headers

* config.url    - String; sets the URL to structure

* config.abort_on_failure - Boolean (default: true); whether structuring is aborted on a 4xx or 5xx response

## Callbacks & Processing

* before

  * :get (browser, url)

* after

  * :structure (Hash) - receives the structure from the page

  * :get (browser, url) - called after each HTTP GET

  * :abort (browser, hash(details)) - called after a 4xx or 5xx if config.abort_on_failure is true (default)
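To make the before/after hooks listed above a little more concrete, here is a plain-Ruby sketch of a tiny callback registry. It is not Klepto's code: `CallbackRegistry` and its `run` method are hypothetical names, and treating a bare `after` as `after(:structure)` is an assumption based on the `config.after do |resource|` example in the usage section.

```ruby
# Hypothetical sketch of a before/after callback registry (not Klepto's code).
class CallbackRegistry
  def initialize
    # Each [timing, event] pair maps to a list of blocks.
    @hooks = Hash.new { |hash, key| hash[key] = [] }
  end

  def before(event, &block)
    @hooks[[:before, event]] << block
    self
  end

  # Assumption: with no event given, the hook receives the structured hash.
  def after(event = :structure, &block)
    @hooks[[:after, event]] << block
    self
  end

  # Calls every hook registered for the pair, in registration order,
  # and returns their results.
  def run(timing, event, *args)
    @hooks[[timing, event]].map { |hook| hook.call(*args) }
  end
end

registry = CallbackRegistry.new
registry.before(:get)  { |_browser, url| puts "about to GET #{url}" }
registry.after(:get)   { |_browser, url| puts "finished GET #{url}" }
registry.after         { |structure| puts "structured keys: #{structure.keys.inspect}" }
registry.after(:abort) { |_browser, details| puts "aborted with #{details[:status]}" }

# The browser argument is nil here because no real browser is involved.
registry.run(:before, :get, nil, 'http://example.com')
registry.run(:after, :get, nil, 'http://example.com')
registry.run(:after, :structure, { name: 'placeholder' })
registry.run(:after, :abort, nil, { status: 500 })
```

Keying the hooks by timing and event keeps multiple callbacks per event possible without any extra bookkeeping.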

## Stuff I'm going to add.

* Ensure after(:each) works at resource/collection level as well

* Add after(:all)

* :if, :unless for as: (:collection|:resource) too. The context should be the captured node that the block is run against

* Access to hash from within a block (for bulk assignment of other attributes) ?

* config.allow_rescue_in_block #should exceptions in blocks be auto rescued with nil as the return value

* :default should be able to take a proc

Async
--------

-> https://github.com/igrigorik/em-synchrony

Cookie Stuffing
-------------------

```ruby
cookies({
  'Has Fun' => true
})
```

Pre-req Steps
--------------------

```ruby
prepare [
  [:GET, 'http://example.com'],
  [:POST, 'http://example.com/login', {username: 'cory', password: '123456'}],
]
```

Page Assertions
--------------------

```ruby
assertions do
  #presence and value assertions...
end

on_assertion_failure{ |response, bot| }
```

Structure

:if

unless: lambda{|node| node.class.include?("newsflash")}
