Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/bkeepers/spiderman
your friendly neighborhood web crawler
JSON representation
- Description: your friendly neighborhood web crawler
- Host: GitHub
- URL: https://github.com/bkeepers/spiderman
- Owner: bkeepers
- License: MIT
- Created: 2020-03-22T11:42:18.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2022-07-26T13:15:48.000Z (over 2 years ago)
- Last Synced: 2024-11-30T01:23:13.449Z (28 days ago)
- Topics: crawler, crawler-engine, http, httprb, nokogiri, ruby, spider, spider-framework, web-crawler, web-scraping, webcrawler, webscraping
- Language: Ruby
- Homepage:
- Size: 39.1 KB
- Stars: 18
- Watchers: 5
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
README
your friendly neighborhood web crawler
Spiderman is a Ruby gem for crawling and processing web pages.
## Installation
Add this line to your application's Gemfile:
```ruby
gem 'spiderman'
```

And then execute:

```
$ bundle install
```

Or install it yourself as:

```
$ gem install spiderman
```
## Usage
```ruby
class HackerNewsCrawler
  include Spiderman

  crawl "https://news.ycombinator.com/" do |response|
    response.css('a.storylink').each do |a|
      process! a["href"], :story
    end
  end

  process :story do |response|
    logger.info "#{response.uri} #{response.css('title').text}"
    save_page(response)
  end

  def save_page(page)
    # logic here for saving the page
  end
end
```

Run the crawler:
```ruby
HackerNewsCrawler.crawl!
```
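Because handlers can call `process!` themselves, one handler can feed another to build a multi-stage pipeline. A minimal sketch of that pattern (the `:comments` handler and the `a.comment-link` selector are illustrative, not part of the site's real markup or the gem's documented API):

```ruby
class HackerNewsCrawler
  include Spiderman

  crawl "https://news.ycombinator.com/" do |response|
    # Stage 1: queue every story link for the :story handler.
    response.css('a.storylink').each do |a|
      process! a["href"], :story
    end
  end

  process :story do |response|
    logger.info "story: #{response.css('title').text}"
    # Stage 2: hand any comment links off to a second handler.
    response.css('a.comment-link').each do |a|
      process! a["href"], :comments
    end
  end

  process :comments do |response|
    logger.info "comments: #{response.uri}"
  end
end
```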
### ActiveJob

Spiderman works with [ActiveJob](https://edgeguides.rubyonrails.org/active_job_basics.html) out of the box. If your crawler class inherits from `ActiveJob::Base`, requests are made in your background worker, and each request runs as a separate job.
```ruby
class MyCrawler < ActiveJob::Base
  include Spiderman # mixes in the crawl/process DSL

  queue_as :crawler

  crawl "https://example.com" do |response|
    response.css('a').each { |a| process! a["href"], :link }
  end

  process :link do |response|
    logger.info "Processing #{response.uri}"
  end
end
```
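Kicking off an ActiveJob-backed crawler then goes through whatever queue adapter the app has configured. A minimal sketch, assuming a Rails app and that `crawl!` is the entry point here as in the earlier example (the Sidekiq adapter is just one illustrative choice):

```ruby
# In config/application.rb (illustrative; any ActiveJob adapter works):
#
#   config.active_job.queue_adapter = :sidekiq

# Start the crawl; each fetched URL is processed in its own job
# on the :crawler queue declared by the class above.
MyCrawler.crawl!
```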
## Development

After checking out the repo, run `bin/setup` to install dependencies. Then run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that lets you experiment.
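Inside `bin/console`, a throwaway crawler is an easy way to poke at the DSL. A minimal sketch (the URL is a placeholder):

```ruby
require "spiderman"

class ScratchCrawler
  include Spiderman

  # Fetch one page and log its <title>.
  crawl "https://example.com/" do |response|
    logger.info response.css("title").text
  end
end

ScratchCrawler.crawl!
```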
To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and tags, and push the `.gem` file to [rubygems.org](https://rubygems.org).
## Contributing
Bug reports and pull requests are welcome on GitHub at https://github.com/bkeepers/spiderman.
## License
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).