{"id":16701659,"url":"https://github.com/coryodaniel/klepto","last_synced_at":"2025-03-21T19:33:18.626Z","repository":{"id":7955676,"uuid":"9351905","full_name":"coryodaniel/klepto","owner":"coryodaniel","description":"A mean little DSL'd poltergeist (capybara) based web crawler that stuffs data into your Rails app.","archived":false,"fork":false,"pushed_at":"2013-07-18T19:14:39.000Z","size":316,"stargazers_count":21,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-18T04:42:51.280Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/coryodaniel.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-04-10T17:38:48.000Z","updated_at":"2023-03-30T18:05:36.000Z","dependencies_parsed_at":"2022-09-25T05:40:23.392Z","dependency_job_id":null,"html_url":"https://github.com/coryodaniel/klepto","commit_stats":null,"previous_names":[],"tags_count":45,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coryodaniel%2Fklepto","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coryodaniel%2Fklepto/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coryodaniel%2Fklepto/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/coryodaniel%2Fklepto/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/coryodaniel","download_url":"https://codeload.github.com/coryodaniel/klepto/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244855719,"owners_count":20521694,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T18:45:06.410Z","updated_at":"2025-03-21T19:33:18.153Z","avatar_url":"https://github.com/coryodaniel.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Klepto\n\nA mean little DSL'd capybara (poltergeist) based web scraper that structures data into ActiveRecord or wherever(TM).\n\n## Features \n\n* CSS or XPath Syntax\n* Full javascript processing via phantomjs / poltergeist\n* All the fun of capybara\n* Scrape multiple pages with a single bot\n* Pretty nifty DSL\n* Test coverage!\n\n## Installing\nYou need at least PhantomJS 1.8.1.  There are *no other external\ndependencies* (you don't need Qt, or a running X server, etc.)\n\n### Mac ###\n\n* *Homebrew*: `brew install phantomjs`\n* *MacPorts*: `sudo port install phantomjs`\n* *Manual install*: [Download this](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-macosx.zip\u0026can=2\u0026q=)\n\n### Linux ###\n\n* Download the [32\nbit](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-linux-i686.tar.bz2\u0026can=2\u0026q=)\nor [64\nbit](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-linux-x86_64.tar.bz2\u0026can=2\u0026q=)\nbinary.\n* Extract the tarball and copy `bin/phantomjs` into your `PATH`\n\n### Windows ###\n* Download the [precompiled binary](http://phantomjs.org/download.html) for Windows\n\n### Manual compilation ###\n\nDo this as a last resort if the binaries don't work for you. It will\ntake quite a long time as it has to build WebKit.\n\n* Download [the source tarball](http://code.google.com/p/phantomjs/downloads/detail?name=phantomjs-1.8.1-source.zip\u0026can=2\u0026q=)\n* Extract and cd in\n* `./build.sh`\n\n(See also the [PhantomJS building guide](http://phantomjs.org/build.html).)\n\nThen put klepto in your gemfile.\n\n```ruby\ngem 'klepto', '\u003e= 0.2.5'\n```\n\n\n\n## Usage (All your content are belong to us)\nSay you want a bunch of Bieb tweets! How is there not profit in that?\n\n```ruby\n# Fetch a web site or multiple. Bot#new takes a *splat!\n@bot = Klepto::Bot.new(\"https://twitter.com/justinbieber\"){\n  # By default, it uses CSS selectors\n  name      'h1.fullname'\n\n  # If you love C# or you are over 40, XPath is an option!\n  username \"//span[contains(concat(' ',normalize-space(@class),' '),' screen-name ')]\", :syntax =\u003e :xpath\n  \n  # By default Klepto uses the #text method, you can pass an :attr to use instead...\n  #   or a block that will receive the Capybara Node or Result set.\n  tweet_ids 'li.stream-item', :match =\u003e :all, :attr =\u003e 'data-item-id'\n  \n  # Want to match all the nodes for the selector? Pass :match =\u003e :all\n  links 'span.url a', :match =\u003e :all do |node|\n    node[:href]\n  end\n\n  # Nested structures? Let klepto know this is a resource\n  last_tweet 'li.stream-item', :as =\u003e :resource do\n    twitter_id do |node|\n      node['data-item-id']\n    end\n    content '.content p'\n    timestamp '._timestamp', :attr =\u003e 'data-time'\n    permalink '.time a', :attr =\u003e :href\n  end      \n\n  # Multiple Nested structures? Let klepto know this is a collection of resources\n  # Does bieber, tweet to much? Maybe. Lets only get the new stuff kids crave.\n  tweets    'li.stream-item', :as =\u003e :collection, :limit =\u003e 10 do\n    twitter_id do |node|\n      node['data-item-id']\n    end\n    tweet '.content p', :css\n    timestamp '._timestamp', :attr =\u003e 'data-time'\n    permalink '.time a', :css, :attr =\u003e :href\n  end     \n\n  # Set some headers, why not.\n  config.headers({\n    'Referer'     =\u003e 'http://www.twitter.com'\n  })  \n\n  # on_http_status can take a splat of statuses or ~statuses(4xx,5xx)\n  #   you can also have multiple handlers on a status\n  #   Note: Capybara automatically follows redirects, so the statuses 3xx\n  #   are never present. If you want to watch for a redirect pass see below\n  config.on_http_status(:redirect){\n    puts \"Something redirected...\"\n  }\n  config.on_http_status(200){\n    puts \"Expected this, NBD.\"\n  }\n\n  config.on_http_status('5xx','4xx'){\n    puts \"HOLY CRAP!\"\n  }\n\n  config.after(:get) do |page|\n    # This is fired after each HTTP GET. It receives a Capybara::Node\n  end  \n\n  # If you want to do something with each resource, like stick it in AR\n  #   go for it here...\n  config.after do |resource|\n    @user = User.new\n    @user.name = resource[:name]\n    @user.username = resource[:username]\n    @user.save\n\n    resource[:tweets].each do |tweet|\n      Tweet.create(tweet)\n    end\n  end #=\u003e Profit!\n}\n\n# You can get an array of hashes(resources), so if you wanted to do something else \n# you could do it here...\n@bot.resources.each do |resource|\n  pp resource\nend\n```\n\n## Got a string of HTML you don't need to crawl first?\n\n```ruby\n@html = Capybara::Node::Simple.new(@html_string)\n@structure = Klepto::Structure.build(@html){\n  # inside the build method, everything works the same as Bot.new\n  name      'h1.fullname'\n  username  'span.screen-name'\n\n  links 'span.url a', :match =\u003e :all do |node|\n    node[:href]\n  end\n\n  tweets    'li.stream-item', :as =\u003e :collection do\n    twitter_id do |node|\n      node['data-item-id']\n    end\n    tweet '.content p', :css\n    timestamp '._timestamp', :attr =\u003e 'data-time'\n    permalink '.time a', :css, :attr =\u003e :href\n  end       \n}\n```\n\n## Configuration Options\n* config.headers - Hash; Sets request headers\n* config.url    - String; Set URL to structure\n* config.abort_on_failure - Boolean(Default: true); Should structuring be aborted on 4xx or 5xx\n\n## Callbacks \u0026 Processing\n\n* before\n  * :get (browser, url)\n* after\n  * :structure (Hash) - receives the structure from the page\n  * :get (browser, url) - called after each HTTP GET\n  * :abort (browser, hash(details)) - called after a 4xx or 5xx if config.abort_on_failure is true (default)\n\n\n## Stuff I'm going to add.\n* Ensure after(:each) work at resource/collection level as well\n* Add after(:all)\n* :if, :unless for as: (:collection|:resource) to. context should be captured node that block is run against\n* Access to hash from within a block (for bulk assignment of other attributes) ?\n* config.allow_rescue_in_block #should exceptions in blocks be auto rescued with nil as the return value\n* :default should be able to take a proc\n\nAsync \n--------\n-\u003e https://github.com/igrigorik/em-synchrony\n\nCookie Stuffing\n-------------------\n```ruby\ncookies({\n  'Has Fun' =\u003e true\n})  \n```\n\nPre-req Steps\n--------------------  \n```ruby\nprepare [\n  [:GET, 'http://example.com'],\n  [:POST, 'http://example.com/login', {username: 'cory', password: '123456'}],\n]\n```\n\nPage Assertions\n--------------------\n```ruby\nassertions do\n  #presence and value assertions...\nend\non_assertion_failure{ |response, bot| }\n```\n\nStructure\n:if\nunless: lambda{|node| node.class.include?(\"newsflash\")}","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoryodaniel%2Fklepto","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcoryodaniel%2Fklepto","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcoryodaniel%2Fklepto/lists"}