Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/gjtorikian/vore
Vore gobbles up webpages and spits out their content.
Last synced: about 1 month ago
- Host: GitHub
- URL: https://github.com/gjtorikian/vore
- Owner: gjtorikian
- License: mit
- Created: 2024-07-17T13:30:39.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-09-04T21:17:32.000Z (2 months ago)
- Last Synced: 2024-09-29T00:41:14.310Z (about 2 months ago)
- Language: Ruby
- Homepage:
- Size: 213 KB
- Stars: 2
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Funding: .github/funding.yml
- License: LICENSE.txt
# Vore
![Vore, by LewdBacon](https://github.com/user-attachments/assets/0923cc84-4cca-4d95-8a0e-4dad650525d2)
Vore quickly crawls websites and spits out text sans tags. It's written in Ruby and powered by Rust.
## Installation
Install the gem and add to the application's Gemfile by executing:

    $ bundle add vore

If bundler is not being used to manage dependencies, install the gem by executing:

    $ gem install vore

## Usage
```ruby
crawler = Vore::Crawler.new
crawler.scrape_each_page("https://choosealicense.com") do |page|
  puts page
end
```

Each `page` is a simple object with the following values:

* `content`: the text of the HTML document, sans tags
* `title`: the title of the HTML document (if any)
* `meta`: the document's meta tags (if any)
* `path`: the document's path

The scraping is managed by [`spider-rs`](https://github.com/spider-rs/spider), so you know it's fast.
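
As a rough sketch of how those four values might be consumed inside the block, the snippet below builds a one-line summary per page. `PageStub` and `summarize` are hypothetical names introduced here, not part of Vore; the stub only stands in for the yielded page object so the example runs without crawling anything, and treating `meta` as a hash is an assumption.

```ruby
# Hypothetical stand-in for the object Vore yields; it mirrors only the
# four readers listed above. Treating `meta` as a Hash is an assumption.
PageStub = Struct.new(:content, :title, :meta, :path, keyword_init: true)

# Build a one-line summary from any object that responds to
# `title`, `path`, and `content`.
def summarize(page, preview_length: 40)
  title = page.title.to_s.empty? ? "(untitled)" : page.title
  preview = page.content.to_s.strip[0, preview_length]
  "#{title} [#{page.path}]: #{preview}"
end

page = PageStub.new(
  content: "Choose an open source license",
  title: "Choose a License",
  meta: { "description" => "Non-judgmental guidance" },
  path: "/index.html",
)
puts summarize(page)
```

With a real crawl, the same method could be called from the block passed to `scrape_each_page`, for example `crawler.scrape_each_page(url) { |page| puts summarize(page) }`.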
### Configuration
| Name | Description | Default |
| ----------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | ------- |
| `delay` | A value (in milliseconds) which introduces an artificial delay when crawling. Useful for situations where there's rate limiting involved. | `0` |
| `output_dir` | Where the resulting HTML files are stored. | `"tmp/vore"` |
| `delete_after_yield` | Whether the downloaded HTML files are deleted after the yield block finishes. | `true` |
| `log_level` | The logging level. | `:warn` |

### Processing pages

Vore processes HTML using handlers. By default, there are two:
* The `MetaExtractor`, which extracts information from your `title` and `meta` tags
* The `TagRemover`, which removes unnecessary elements like `header`, `footer`, `script`

If you wish to process the HTML further, you can provide your own handler:

```ruby
Vore::Crawler.new(handlers: [MySpecialHandler.new])
```

Handlers are defined using [Selma](https://github.com/gjtorikian/selma?tab=readme-ov-file#defining-handlers). Note that the `MetaExtractor` is always included and defined first, but if you pass anything in the `handlers` array, it replaces Vore's other default handlers. You can of course choose to include them manually:

```ruby
# preserve Vore's default content handler while adding your own;
# `MetaExtractor` is prefixed to the front
Vore::Crawler.new(handlers: [Vore::Handlers::TagRemover.new, MySpecialHandler.new])
```

### In tests

Since the actual HTTP calls occur in a separate process, Vore will not integrate with libraries like VCR or WebMock by default. You'll need to `require "vore/minitest_helper"` to get a function that emulates the HTTP `GET` requests in a way Ruby can interpret.

You can override any of the existing methods to suit your application's needs. For example, if you prefer HTML to be generated by Faker, you can create and require a file that looks like the following:

```ruby
require "vore/minitest_helper"

module Vore
  module TestHelperExtension
    DOCUMENT_TITLES = [
      "Hello, I need help",
      "I need to update my payment information",
    ]

    DOCUMENT_CONTENT = [
      "Hey, I'm having trouble with my computer. Can you help me?",
      # v--- always creates three page chunks
      "I need to update my payment information. Like, now. Right now. Now. Can you help me? Please? Now?" + "Can you help me? Please? Now?" * 100,
    ]

    # Returns the HTML for the next fake page; each call advances the counter.
    # The markup in these strings is an assumed minimal HTML skeleton.
    def content
      @counter = -1 unless defined?(@counter)
      @counter += 1

      html = "<!DOCTYPE html><html><head><title>#{DOCUMENT_TITLES[@counter]}</title>"
      meta_tag_count.times do # arbitrarily set to 5
        html += "<meta name=\"keywords\" content=\"help\">" # placeholder meta tag
      end

      html += "</head>"
      html += "<body><p>#{DOCUMENT_CONTENT[@counter]}</p></body>"
      html += "</html>"
      html
    end
  end

  Vore::TestHelper.prepend(Vore::TestHelperExtension)
end
```

## Development

After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake test` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/gjtorikian/vore.
## License
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).