{"id":13879533,"url":"https://github.com/rubycdp/vessel","last_synced_at":"2025-04-04T07:07:38.602Z","repository":{"id":44797451,"uuid":"208985908","full_name":"rubycdp/vessel","owner":"rubycdp","description":"Fast high-level web crawling Ruby framework","archived":false,"fork":false,"pushed_at":"2024-07-03T14:56:40.000Z","size":94,"stargazers_count":660,"open_issues_count":3,"forks_count":12,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-04-03T11:12:28.788Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://vessel.rubycdp.com","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rubycdp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"rubycdp"}},"created_at":"2019-09-17T07:19:57.000Z","updated_at":"2025-03-27T08:49:27.000Z","dependencies_parsed_at":"2023-01-23T22:01:50.525Z","dependency_job_id":"f3437133-d833-496b-9b29-517a4bec2185","html_url":"https://github.com/rubycdp/vessel","commit_stats":{"total_commits":68,"total_committers":8,"mean_commits":8.5,"dds":0.3529411764705882,"last_synced_commit":"1f94e920000dde79cd4449753c76f6fba6149116"},"previous_names":["route/vessel"],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rubycdp%2Fvessel","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rubycdp%2Fvessel/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rubycdp%2Fvessel/releases","manifests_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/repositories/rubycdp%2Fvessel/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rubycdp","download_url":"https://codeload.github.com/rubycdp/vessel/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247135144,"owners_count":20889421,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-06T08:02:24.173Z","updated_at":"2025-04-04T07:07:38.585Z","avatar_url":"https://github.com/rubycdp.png","language":"Ruby","funding_links":["https://github.com/sponsors/rubycdp"],"categories":["Ruby"],"sub_categories":[],"readme":"# Vessel - high-level web crawling framework\n\n#### Fast as Chrome, dead simple and yet extendable.\n\nIt is a high-level Ruby web crawling framework based on\n[Ferrum](https://github.com/rubycdp/ferrum) for extracting the data you need\nfrom websites. It can be used in a wide range of scenarios, like data mining,\nmonitoring or historical archival. 
For automated testing we recommend\n[Cuprite](https://github.com/rubycdp/cuprite).\n\n\n## Install\n\nAdd this to your Gemfile:\n\n```ruby\ngem \"vessel\"\n```\n\n\n## A look around\n\nTo show you how Vessel works, we are going to crawl a\n[famous quotes website](http://quotes.toscrape.com) together:\n\n```ruby\nrequire \"json\"\nrequire \"vessel\"\n\nclass QuotesToScrapeCom \u003c Vessel::Cargo\n  domain \"quotes.toscrape.com\"\n  start_urls \"https://quotes.toscrape.com/tag/humor/\"\n\n  def parse\n    css(\"div.quote\").each do |quote|\n      yield({\n        author: quote.at_xpath(\"span/small\").text,\n        text: quote.at_css(\"span.text\").text\n      })\n    end\n\n    if next_page = at_xpath(\"//li[@class='next']/a[@href]\")\n      url = absolute_url(next_page.attribute(:href))\n      yield request(url: url, handler: :parse)\n    end\n  end\nend\n\nquotes = []\nQuotesToScrapeCom.run { |q| quotes \u003c\u003c q }\nputs JSON.generate(quotes)\n```\n\nSave this to a `quotes.rb` file and run `bundle exec ruby quotes.rb \u003e quotes.json`.\nWhen it finishes, you will have a list of the quotes in JSON format in the\n`quotes.json` file.\n\nHow does it all work? First, Vessel uses Ferrum to spawn Chrome, which visits one\nor more URLs in `start_urls`; in our case there is only one. After Chrome reports\nthat the page has loaded with all the resources it needs, the default handler\n`parse` is invoked. In the parse handler, we loop through the quote elements\nusing a CSS selector, yield a Hash with the extracted quote text and author, then\nlook for a link to the next page and schedule another request using the same\nparse method as a handler.\n\nNotice that all requests are scheduled and handled concurrently. We use a thread\npool to process your requests, with one page per core by default; to change\nthis, add `threads max: n` to your class. If you yield more than one request,\nRuby sends them to Chrome, which loads the pages in parallel. 
This keeps the crawler lightweight\nand speedy.\n\n\n## Settings\n\n* domain\n* start_urls\n* driver\n* delay\n* [headers](https://github.com/rubycdp/vessel#headers)\n* cookies\n* threads\n* middleware\n* proxy\n* blacklist\n* whitelist\n\n### Headers\n\n```ruby\nclass MyScraper \u003c Vessel::Cargo\n  headers \"Content-Type\" =\u003e \"text/plain\",\n          \"Referer\" =\u003e \"http://example.com\"\nend\n```\n\n### Headful mode\n\nYou can disable headless mode by passing the `driver_options` setting:\n\n```ruby\nMyScraper.run(driver_options: { headless: false })\n```\n\n## Selectors\n\n* at_css\n* css\n* at_xpath\n* xpath\n\n\n## Middleware\n\nTo be continued\n\n\n## License\n\nThe gem is available as open source under the terms of the\n[MIT License](https://opensource.org/licenses/MIT).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frubycdp%2Fvessel","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frubycdp%2Fvessel","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frubycdp%2Fvessel/lists"}