{"id":15405704,"url":"https://github.com/zverok/cobb","last_synced_at":"2025-03-21T15:43:03.845Z","repository":{"id":136288662,"uuid":"20274430","full_name":"zverok/cobb","owner":"zverok","description":"Cobb is Yet Another Web Scraper, named after Firefly's Jayne Cobb","archived":false,"fork":false,"pushed_at":"2015-03-18T18:16:09.000Z","size":172,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-26T11:11:14.482Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zverok.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2014-05-28T22:26:06.000Z","updated_at":"2016-01-23T23:29:57.000Z","dependencies_parsed_at":"2023-03-13T11:04:06.476Z","dependency_job_id":null,"html_url":"https://github.com/zverok/cobb","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zverok%2Fcobb","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zverok%2Fcobb/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zverok%2Fcobb/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zverok%2Fcobb/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zverok","download_url":"https://codeload.github.com/zverok/cobb/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244823921,"owners_count":20516373,"icon_url":"https://git
hub.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-01T16:18:19.049Z","updated_at":"2025-03-21T15:43:03.800Z","avatar_url":"https://github.com/zverok.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Cobb\n\n**Cobb** is \"yet another\"™ web scraper library, extracted from a real project.\n\nIt's named after [Jayne Cobb](http://firefly.wikia.com/wiki/Jayne_Cobb), \nthe infamous Serenity \"public relations guy\" from the \n[Firefly](http://en.wikipedia.org/wiki/Firefly_(TV_series)) series.\n\nSome ideas (though not the code) were taken from \n[sinew](https://github.com/gurgeous/sinew).\n\n## Why the funny names?\n\nYou see, when we work in the web scraping domain, we never come up with\nvery meaningful names; it's always something like `Parser.parse`, \n`Evaluator.evaluate` or similar.\n\nSo, I've decided to use names which, though not domain-specific,\nare at least fun and consistent. So, in my gem, **Cobb** uses his **guns**\nto **fire**, and makes a lot of **victims**. \n\nAnd then there comes **Vera**.\n\nDeal with it. Maybe I'll change my mind in future versions.\n\n## 1. 
Just Gun \n\n**aka \"OK, let's start with something\"**\n\nThe simplest usage of Cobb (also look at [`samples/01_simple.rb`](samples/01_simple.rb)):\n\n```ruby\nclass AmazonBook \u003c Cobb::Gun\n  def mechanizm\n    result(\n      title: html.at!('h1#title #productTitle').text,\n      author: html.at!('#byline .author a.contributorNameID').text,\n      price: html.at!('#MediaMatrix .swatchElement.selected .a-color-price').text_\n    )\n  end\nend\n\nvictim = AmazonBook.fire 'http://www.amazon.com/Cats-Cradle-Novel-Kurt-Vonnegut/dp/038533348X/'\n\npp victim.result \n# =\u003e {\"title\"=\u003e\"Cat's Cradle: A Novel\", \"author\"=\u003e\"Kurt Vonnegut\", \"price\"=\u003e\"$8.75\"}\n```\n\nWhat do we see here?\n\n1. How to **define** a gun: \n  - just inherit from `Cobb::Gun` \n  - and define the `mechanizm` method, and everything will work\n2. What you get **inside** `mechanizm`:\n  - `html` is the `Nokogiri::HTML` of the page \n    (with some `Nokogiri::More`, see below)\n  - `result` is a method that merges some values into the result\n3. How to **use** a gun: \n  - just do `{Gun}.fire({url})`\n4. What you **obtain** from a gun: \n  - a victim, obviously, and its `result` \n  - it's what you `result`ed in the gun. It's a \"mash\", descended from\n  `Hashie::Mash`, so you can use `result['title']` or `result.title` now.\n\nWhat you can't see, yet it's still here: **request caching**. \n\nIt's just so-called \"greedy\" caching: once performed, a request to some URL is never\nrepeated again. You can control it just by removing `tmp/cache`, \nand (in the future) by settings and commands. But for now it seems good enough:\nyou just develop some \"gun\" (parser), and run it as many times as you like,\nand the results are just read from disk.\n\nIt's pretty obvious, yet useful (in my opinion), but it's just a start.\n\n## 2. Array of results\n\nBut what if we have several data items on a page? 
\nOh, that's easy (also look at [`samples/02_array.rb`](samples/02_array.rb)):\n\n```ruby\nclass AmazonAuthorBooks \u003c Cobb::Gun\n  def mechanizm\n    author = html.at!('#EntityName').text\n    html.css('#mainResults .result').each do |row|\n      result_row do\n        result(\n          title: row.at!('h3.title a').text,\n          link: row.at!('h3.title a').href,\n          author: author\n        )\n      end\n    end\n  end\nend\n\n# or:\nclass AmazonAuthorBooks \u003c Cobb::Gun\n  def mechanizm\n    author = html.at!('#EntityName').text\n    html.css('#mainResults .result').each do |row|\n      result_row(\n        title: row.at!('h3.title a').text,\n        link: row.at!('h3.title a').href,\n        author: author\n      )\n    end\n  end\nend\n\nvictim = AmazonAuthorBooks.fire 'http://www.amazon.com/Kurt-Vonnegut/e/B000APYE16/'\npp victim.results\n# =\u003e [{\"title\"=\u003e\"Slaughterhouse-Five\",\n#      \"link\"=\u003e\"http://www.amazon.com/Slaughterhouse-Five-Kurt-Vonnegut/dp/0440180295\",\n#      \"author\"=\u003e\"Kurt Vonnegut\"},\n#     {\"title\"=\u003e\"If This Isn't Nice, What Is?: Advice to the Young-The Graduation Speeches\",\n#      \"link\"=\u003e\"http://www.amazon.com/This-Isnt-Nice-What-Graduation/dp/1609805917\",\n#      \"author\"=\u003e\"Kurt Vonnegut\"},\n#     ... and so on ...\n```\n\nAs simple as that. \n\nYou call `result_row{some code}` or even \n`result_row(some_hash)` and you have `victim.results`. \n\nOne item - `victim.result`, many items - `victim.results`. \nNot too smart, not too dumb, obvious enough.\n\n## 3. Next targets \n\n**aka \"Interesting things start here!\"**\n\nIn my experience, typical real-world site scraping goes like \"scrape a list \nof items from this page, then follow the links and scrape their descriptions\", \nand so on. 
\n\nWith Cobb, you do it exactly like this (also look at [`samples/03_next.rb`](samples/03_next.rb)): \n\n```ruby\nmodule Amazon\n  class AuthorBooks \u003c Cobb::Gun\n    def mechanizm\n      html.css('#mainResults .result').each do |row|\n        next_to Book, row.at!('h3.title a').href\n      end\n    end\n  end\n\n  class Book \u003c Cobb::Gun\n    def mechanizm\n      result(\n        title: html.at!('h1#title #productTitle').text,\n        author: html.at!('#byline .author a.contributorNameID').text,\n        price: html.at!('#MediaMatrix .swatchElement.selected .a-color-price').text_\n      )\n    end\n  end\nend\n\nvictim = Amazon::AuthorBooks.fire 'http://www.amazon.com/Kurt-Vonnegut/e/B000APYE16/'\npp victim.next_targets\n# =\u003e [#\u003cCobb::Target: Amazon::Book.fire(http://www.amazon.com/Slaughterhouse-Five-Kurt-Vonnegut/dp/0440180295)\u003e,\n#     #\u003cCobb::Target: Amazon::Book.fire(http://www.amazon.com/This-Isnt-Nice-What-Graduation/dp/1609805917)\u003e,\n#     ... several of them ...\n#     #\u003cCobb::Target: Amazon::Book.fire(http://www.amazon.com/Suckers-Portfolio-Collection-Previously-Unpublished/dp/1611099587)\u003e]\n\nvictim2 = victim.next_targets.first.fire_at!\npp victim2.result\n# =\u003e {\"title\"=\u003e\"Slaughterhouse-Five\", \"author\"=\u003e\"Kurt Vonnegut\", \"price\"=\u003e\"$4.83\"}\n```\n\nHighlights:\n\n1. `next_to({Gun}, {url})` -- tells the victim the next target and what gun\n  to fire at it\n2. So, the victim has a method `next_targets`, which returns targets \n  (instances of the `Cobb::Target` class)\n3. A target knows which gun should fire and at which URL\n4. A target has the method `Target#fire_at!` to be fired at with the specified gun\n  (cynical enough, no? now you're in love with my naming?..)\n\nGetting cooler, no? It's just the beginning.\n\n## 4. 
Context\n\nWhen you fire at something, you can provide a context, and your gun has\naccess to it:\n\n```ruby\nmodule Amazon\n  class Book \u003c Cobb::Gun\n    def mechanizm\n      result(\n        title: html.at!('h1#title #productTitle').text,\n        author: html.at!('#byline .author a.contributorNameID').text,\n        price: html.at!('#MediaMatrix .swatchElement.selected .a-color-price').text_,\n        author_bio: context.author_bio\n      )\n    end\n  end\nend\n\nvictim = Amazon::Book.fire 'http://www.amazon.com/Cats-Cradle-Novel-Kurt-Vonnegut/dp/038533348X/', \n  author_bio: 'Sample author bio.'\npp victim.data \n# =\u003e {\"title\"=\u003e\"Cat's Cradle: A Novel\",\n#     \"author\"=\u003e\"Kurt Vonnegut\",\n#     \"price\"=\u003e\"$8.75\",\n#     \"author_bio\"=\u003e\"Sample author bio.\"}\n```\n\nIt doesn't look too cool until you mix it with `next_to`:\n\n```ruby\nmodule Amazon\n  class AuthorBooks \u003c Cobb::Gun\n    def mechanizm\n      bio = html.at!('#artistCentralBio_officialFullBioContent').text\n      html.css('#mainResults .result').each do |row|\n        next_to Book, row.at!('h3.title a').href, \n          author_bio: bio # here goes the context!\n      end\n    end\n  end\nend\n\nvictim = Amazon::AuthorBooks.fire(url)\npp victim.next_targets.first\n# =\u003e #\u003cCobb::Target: Amazon::Book.fire(http://www.amazon.com/Slaughterhouse-Five-Kurt-Vonnegut/dp/0440180295) \n#      with #\u003cCobb::Mash \n#              author_bio=\"Kurt Vonnegut was born in Indianapolis in 1922. He studied at the universities of Chicago and Tennessee and later began to write short stories for magazines. 
His first novel, Player Piano, was published in 1951 and since then he has written many novels, among them: The Sirens of Titan (1959), Mother Night (1961), Cat's Cradle (1963), God Bless You Mr Rosewater (1964), Welcome to the Monkey House; a collection of short stories (1968), Breakfast of Champions (1973), Slapstick, or Lonesome No More (1976), Jailbird (1979), Deadeye Dick (1982), Galapagos (1985), Bluebeard (1988) and Hocus Pocus (1990). During the Second World War he was held prisoner in Germany and was present at the bombing of Dresden, an experience which provided the setting for his most famous work to date, Slaughterhouse Five (1969). He has also published a volume of autobiography entitled Palm Sunday (1981) and a collection of essays and speeches, Fates Worse Than Death (1991).\"\n#            \u003e\n#     \u003e\n\nvictim2 = victim.next_targets.first.fire_at!\n\npp victim2.result \n# =\u003e {\"title\"=\u003e\"Slaughterhouse-Five\",\n#     \"author\"=\u003e\"Kurt Vonnegut\",\n#     \"price\"=\u003e\"$4.83\",\n#     \"author_bio\"=\u003e\"Kurt Vonnegut was born in Indianapolis in 1922. He studied at the universities of Chicago and Tennessee and later began to write short stories for magazines. His first novel, Player Piano, was published in 1951 and since then he has written many novels, among them: The Sirens of Titan (1959), Mother Night (1961), Cat's Cradle (1963), God Bless You Mr Rosewater (1964), Welcome to the Monkey House; a collection of short stories (1968), Breakfast of Champions (1973), Slapstick, or Lonesome No More (1976), Jailbird (1979), Deadeye Dick (1982), Galapagos (1985), Bluebeard (1988) and Hocus Pocus (1990). During the Second World War he was held prisoner in Germany and was present at the bombing of Dresden, an experience which provided the setting for his most famous work to date, Slaughterhouse Five (1969). 
He has also published a volume of autobiography entitled Palm Sunday (1981) and a collection of essays and speeches, Fates Worse Than Death (1991).\"\n#    }\n```\n\nLook at [`samples/04_context.rb`](samples/04_context.rb) for the complete sample.\n\n## 5. Auto-fire\n\n**aka \"Don't make me think of targets!\"**\n\nCobb allows you to define a gun's target while describing the gun.\nLike this:\n\n```ruby\nclass Vonnegut \u003c Cobb::Gun\n  target 'http://www.amazon.com/Kurt-Vonnegut/e/B000APYE16/'\n  \n  def mechanizm\n    bio = html.at!('#artistCentralBio_officialFullBioContent').text\n    \n    html.css('#mainResults .result').each do |row|\n      next_to Book, row.at!('h3.title a').href\n    end\n  end\nend\n```\n\nNow, if you really want only Vonnegut's books (and why wouldn't you?),\nyou can just:\n\n```ruby\nvictims = Amazon::Vonnegut.auto_fire # without any URL\npp victims.map(\u0026:next_targets).flatten\n# =\u003e [#\u003cCobb::Target: Amazon::Book.fire(http://www.amazon.com/Slaughterhouse-Five-Kurt-Vonnegut/dp/0440180295) with #\u003cCobb::Mash author_bio=\"Kurt Vonnegut was born in Indianapolis in 1922. He studied at the universities of Chicago and Tennessee and later began to write short stories for magazines. His first novel, Player Piano, was published in 1951 and since then he has written many novels, among them: The Sirens of Titan (1959), Mother Night (1961), Cat's Cradle (1963), God Bless You Mr Rosewater (1964), Welcome to the Monkey House; a collection of short stories (1968), Breakfast of Champions (1973), Slapstick, or Lonesome No More (1976), Jailbird (1979), Deadeye Dick (1982), Galapagos (1985), Bluebeard (1988) and Hocus Pocus (1990). During the Second World War he was held prisoner in Germany and was present at the bombing of Dresden, an experience which provided the setting for his most famous work to date, Slaughterhouse Five (1969). 
He has also published a volume of autobiography entitled Palm Sunday (1981) and a collection of essays and speeches, Fates Worse Than Death (1991).\"\u003e\u003e,\n#     #\u003cCobb::Target: Amazon::Book.fire(http://www.amazon.com/This-Isnt-Nice-What-Graduation/dp/1609805917) with #\u003cCobb::Mash author_bio=\"Kurt Vonnegut was born in Indianapolis in 1922. He studied at the universities of Chicago and Tennessee and later began to write short stories for magazines. His first novel, Player Piano, was published in 1951 and since then he has written many novels, among them: The Sirens of Titan (1959), Mother Night (1961), Cat's Cradle (1963), God Bless You Mr Rosewater (1964), Welcome to the Monkey House; a collection of short stories (1968), Breakfast of Champions (1973), Slapstick, or Lonesome No More (1976), Jailbird (1979), Deadeye Dick (1982), Galapagos (1985), Bluebeard (1988) and Hocus Pocus (1990). During the Second World War he was held prisoner in Germany and was present at the bombing of Dresden, an experience which provided the setting for his most famous work to date, Slaughterhouse Five (1969). He has also published a volume of autobiography entitled Palm Sunday (1981) and a collection of essays and speeches, Fates Worse Than Death (1991).\"\u003e\u003e,\n#     ... and so on ...\n```\n\nThe real miracles come later, when you want the books:\n\n```ruby\nclass Book \u003c Cobb::Gun\n  target Vonnegut # see?\n  \n  def mechanizm\n    # as per previous examples\n  end\nend\n\n# Aaaaand, without any preparations:\nvictims = Amazon::Book.auto_fire\n```\n\nIn the latter case, `auto_fire` first runs `Vonnegut.auto_fire`,\nwhich scrapes its designated URL. Then it processes all the URLs which \nVonnegut put in its `next_targets`.\n\nSee [`samples/05_auto_fire.rb`](samples/05_auto_fire.rb) for the full example.\n\n## 6. Vera: the cutest gun ever\n\n\u003e Six men came to kill me one time. And the best of 'em carried this. 
\n\u003e It's a Callahan full-bore auto-lock. Customized trigger, double \n\u003e cartridge thorough gauge. It is my very favorite gun … This is the best \n\u003e gun made by man. It has extreme sentimental value … I call her Vera\u003cbr/\u003e\n\u003e -- Jayne Cobb\n\n`auto_fire` is almost enough for a simple little demonstration, but in the\nreal, actual, factual world you need more. And here comes \n[Vera](http://firefly.wikia.com/wiki/Vera).\n\nYou just do something like:\n\n```ruby\nCobb::Vera.new(Amazon::Book).birst!(progress: true)\n```\n\nAnd Vera does the things. \n\nShe shows progress bars. She processes cyclic\ndependencies and makes every gun/URL shot only once. \nShe forces every gun in the gun chain to be fired. She is beautiful. \n\nShe is all Cobb loves.\n\nAssume this:\n\n```ruby\nclass Vonnegut \u003c Cobb::Gun\n  target 'http://www.amazon.com/Kurt-Vonnegut/e/B000APYE16/'\n  \n  def mechanizm\n    bio = html.at!('#artistCentralBio_officialFullBioContent').text\n    \n    html.css('#mainResults .result').each do |row|\n      next_to Book, row.at!('h3.title a').href\n    end\n\n    html.at?('#pagn #pagnNextLink').tap{|a|\n      next_to Vonnegut, a.href\n    }\n  end\nend\n\nclass Book \u003c Cobb::Gun\n  # as shown above\nend\n\n# and then, just:\nvictims = Cobb::Vera.new(Amazon::Book).birst!(progress: true)\npp victims.map(\u0026:result)\n```\n\n(The full sample at [`samples/06_vera.rb`](samples/06_vera.rb) is almost \nlike this, but accounts for some of Amazon's gotchas.)\n\nThis example will:\n\n* Grab ALL of Vonnegut's book listing pages (see `next_to Vonnegut`); \n* Then extract books from them all;\n* The result is all books from all the pages _(while `Book.auto_fire` will\n  get you books from as few targets as it can)_.\n\nTry it, I'm serious. Just give it a try.\n\n## 7. 
What else does Jayne Cobb do for me, if I pay enough?\n\n### Nokogiri::More\n\nNokogiri::More is, for now, a part of Cobb, though it will be split\ninto a separate gem in the near future. It's a set of extensions and monkey-patches\nto Nokogiri that make it easier to write complex production-ready parsers.\n\n#### methods with ! and ?\n\nAssume we are looking for `a.title` inside `div#intro`, and there's no\nsuch link.\n\n```ruby\ndiv = html.at('div#intro')\n\n# Nokogiri original method, raises a not very informative\n# NoMethodError: undefined method `[]' for nil:NilClass\ndiv.at('a.title')['href']  \n\n# Bang method: node SHOULD be here\n# Raises a pretty NodeNotFound(\"\u003cdiv\u003e has no node at 'a.title'\")\ndiv.at!('a.title')['href'] \n\n# Question method: node may not be there, and it should not be considered\n# an error.\n# Returns a blackhole NullObject, which eats all further messages silently\ndiv.at?('a.title')['href'] \n```\n\n### Train with your gun\n\n### Settings\n\n### Some useful shortcuts\n\n* `repeat url [, context]` is just like `next_to` with the same class (useful\n  for paging and other such things);\n  * so in example 6 for Vera above we could just `repeat a.href` instead of\n    `next_to Vonnegut, a.href`\n* Cobb is not only for parsing HTML pages! 
\n  * `raw` is already here for every gun, containing the raw request body\n  * `json` is calculated when you first call it and is equivalent to `JSON.parse(raw)`\n  * Also, it seems useful to provide `xml` alongside the `html` method, \n    though I haven't done it yet\n\n## Current state of a gem\n\nOn the one hand, it's a real 0.0.1: no specs, no docs in code, and some\nsolutions are questionable.\n\nOn the other hand, it's an extraction from a large real-world production\nproject, and many different sites have seen my Vera.\n\nOn the third hand, while being extracted, it went through some rewrites \nand renamings, so it may be broken at the moment.\n\nAnd on the fourth hand, all examples are working.\n\nTrust the hand you like more, look at the code, and send your opinions and\nrequests to zverok.offline@gmail.com.\n\n## TODO\n\n* Specs!!!\n* Instead of the dumb custom WebClient, just use Faraday caching middleware\n  and make it customizable\n* Separate Nokogiri::More into another gem\n* More settings\n* More logs\n\n## Dependencies\n\n* naught\n* faraday\n* typhoeus\n* hashie\n* addressable\n\nSoft dependencies (Cobb will not install those himself):\n\n* nokogiri\n* json\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzverok%2Fcobb","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzverok%2Fcobb","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzverok%2Fcobb/lists"}