Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/rubycdp/vessel
Fast high-level web crawling Ruby framework
Last synced: 7 days ago
- Host: GitHub
- URL: https://github.com/rubycdp/vessel
- Owner: rubycdp
- License: mit
- Created: 2019-09-17T07:19:57.000Z (about 5 years ago)
- Default Branch: main
- Last Pushed: 2023-12-31T14:03:50.000Z (11 months ago)
- Last Synced: 2024-04-23T21:36:52.763Z (7 months ago)
- Language: Ruby
- Homepage: https://vessel.rubycdp.com
- Size: 89.8 KB
- Stars: 603
- Watchers: 12
- Forks: 11
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Vessel - high-level web crawling framework
#### Fast as Chrome, dead simple and yet extendable.
Vessel is a high-level web crawling framework for Ruby, built on
[Ferrum](https://github.com/rubycdp/ferrum), for extracting the data you need
from websites. It can be used in a wide range of scenarios, such as data mining,
monitoring, or historical archival. For automated testing we recommend
[Cuprite](https://github.com/rubycdp/cuprite).

## Install
Add this to your Gemfile:
```ruby
gem "vessel"
```

## A look around
To show you how Vessel works, we are going to crawl the
[famous quotes website](http://quotes.toscrape.com) together:

```ruby
require "json"
require "vessel"

class QuotesToScrapeCom < Vessel::Cargo
  domain "quotes.toscrape.com"
  start_urls "https://quotes.toscrape.com/tag/humor/"

  # Default handler, invoked once a page has loaded.
  def parse
    # Yield one Hash per quote found on the page.
    css("div.quote").each do |quote|
      yield({
        author: quote.at_xpath("span/small").text,
        text: quote.at_css("span.text").text
      })
    end

    # Follow the "next" link, if any, with the same handler.
    if next_page = at_xpath("//li[@class='next']/a[@href]")
      url = absolute_url(next_page.attribute(:href))
      yield request(url: url, handler: :parse)
    end
  end
end

quotes = []
QuotesToScrapeCom.run { |q| quotes << q }
puts JSON.generate(quotes)
```

Save this to a `quotes.rb` file and run `bundle exec ruby quotes.rb > quotes.json`.
When this finishes you will have a list of the quotes in JSON format in the
`quotes.json` file.

How does it all work? First, Vessel uses Ferrum to spawn Chrome, which goes to
one or more URLs in `start_urls`; in our case there is only one. After Chrome
reports that the page is loaded with all the resources it needs, the default
handler `parse` is invoked. In the parse handler, we loop through the quote
elements using a CSS selector and yield a Hash with the extracted quote text and
author. We then look for a link to the next page and schedule another request
using the same `parse` method as its handler.

Notice that all requests are scheduled and handled concurrently. Vessel uses a
thread pool to process your requests, with one page per core by default; add
`threads max: n` to a class to change that. If you yield more than one request,
Ruby will send them to Chrome, which will load the pages in parallel. This keeps
the crawler lightweight and fast.

## Settings
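
The thread pool described above can be illustrated without Vessel or Chrome. The following is a minimal plain-Ruby sketch of the idea, not Vessel's actual implementation: `fake_fetch` and `crawl` are hypothetical names, no browser or network is involved, and the `max:` keyword only mimics the role of `threads max: n`, defaulting to one worker per core.

```ruby
require "etc"

# Hypothetical stand-in for a page load: no browser or network involved.
def fake_fetch(url)
  sleep 0.01
  "<html>#{url}</html>"
end

# Run every URL through a fixed-size pool of worker threads.
# `max` plays the role of `threads max: n` (an assumption for illustration);
# the default mirrors "one page per core".
def crawl(urls, max: Etc.nprocessors)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # a closed, drained queue makes `pop` return nil

  results = Queue.new
  workers = Array.new(max) do
    Thread.new do
      # Each worker keeps taking URLs until the queue is empty.
      while (url = queue.pop)
        results << [url, fake_fetch(url).bytesize]
      end
    end
  end
  workers.each(&:join)

  # Collect the [url, size] pairs into a Hash.
  Array.new(results.size) { results.pop }.to_h
end

urls = (1..8).map { |n| "https://quotes.toscrape.com/page/#{n}/" }
sizes = crawl(urls, max: 4)
puts sizes.length
```

Raising or lowering `max` changes how many URLs are in flight at once; the set of results stays the same. The available settings are:
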
* domain
* start_urls
* driver
* delay
* [headers](https://github.com/rubycdp/vessel#headers)
* cookies
* threads
* middleware
* proxy
* blacklist
* whitelist

### Headers
```ruby
class MyScraper < Vessel::Cargo
  headers "Content-Type" => "text/plain",
          "Referer" => "http://example.com"
end
```

### Headful mode
You can disable headless mode by passing `driver_options` settings:
```ruby
MyScraper.run(driver_options: { headless: false })
```

## Selectors
* at_css
* css
* at_xpath
* xpath

## Middleware
To be continued
## License
The gem is available as open source under the terms of the
[MIT License](https://opensource.org/licenses/MIT).