# Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/postmodern/spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
- Host: GitHub
- URL: https://github.com/postmodern/spidr
- Owner: postmodern
- License: MIT
- Created: 2009-03-08T10:58:50.000Z (almost 16 years ago)
- Default Branch: master
- Last Pushed: 2024-12-18T23:00:18.000Z (23 days ago)
- Last Synced: 2025-01-02T12:01:54.180Z (9 days ago)
- Topics: crawler, ruby, scraper, spider, spider-links, web, web-crawler, web-scraper, web-scraping, web-spider
- Language: Ruby
- Homepage:
- Size: 678 KB
- Stars: 810
- Watchers: 28
- Forks: 107
- Open Issues: 16
Metadata Files:
- Readme: README.md
- Changelog: ChangeLog.md
- License: LICENSE.txt
README
# Spidr
[![CI](https://github.com/postmodern/spidr/actions/workflows/ruby.yml/badge.svg)](https://github.com/postmodern/spidr/actions/workflows/ruby.yml)
* [Homepage](https://github.com/postmodern/spidr#readme)
* [Source](https://github.com/postmodern/spidr)
* [Issues](https://github.com/postmodern/spidr/issues)
* [Mailing List](http://groups.google.com/group/spidr)

## Description
Spidr is a versatile Ruby web spidering library that can spider a site,
multiple domains, certain links or infinitely. Spidr is designed to be fast
and easy to use.

## Features
* Follows:
  * `a` tags.
  * `iframe` tags.
  * `frame` tags.
  * Cookie protected links.
  * HTTP 300, 301, 302, 303 and 307 Redirects.
  * Meta-Refresh Redirects.
  * HTTP Basic Auth protected links.
* Black-list or white-list URLs based upon (see the sketch after this list):
  * URL scheme.
  * Host name.
  * Port number.
  * Full link.
  * URL extension.
* Optional `/robots.txt` support.
* Provides callbacks for:
  * Every visited Page.
  * Every visited URL.
  * Every visited URL that matches a specified pattern.
  * Every origin and destination URI of a link.
  * Every URL that failed to be visited.
* Provides action methods to:
  * Pause spidering.
  * Skip processing of pages.
  * Skip processing of links.
* Restore the spidering queue and history from a previous session.
* Custom User-Agent strings.
* Custom proxy settings.
* HTTPS support.
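The black-list/white-list options, custom User-Agent, and proxy settings can be combined when constructing an agent. A minimal sketch, assuming the `ignore_exts:`, `user_agent:`, and `proxy:` option names (they follow the feature list above, but verify them against `Spidr::Agent`'s documentation):

```ruby
require 'spidr'

# Hedged sketch: the option names are assumptions drawn from the feature
# list above; check Spidr::Agent's docs before relying on them.
Spidr.site(
  'http://company.com/',
  ignore_exts:  %w[pdf zip gz],                             # black-list by URL extension
  ignore_ports: [8000, 8080],                               # black-list by port number
  user_agent:   'MyCrawler/1.0 (+http://example.com/bot)',  # custom User-Agent string
  proxy:        { host: 'proxy.example.com', port: 3128 }   # custom proxy settings
) do |agent|
  agent.every_url { |url| puts url }
end
```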
## Examples

Start spidering from a URL:
```ruby
Spidr.start_at('http://tenderlovemaking.com/') do |agent|
  # ...
end
```

Spider a host:
```ruby
Spidr.host('solnic.eu') do |agent|
  # ...
end
```

Spider a domain (and any sub-domains):
```ruby
Spidr.domain('ruby-lang.org') do |agent|
  # ...
end
```

Spider a site:
```ruby
Spidr.site('http://www.rubyflow.com/') do |agent|
  # ...
end
```

Spider multiple hosts:
```ruby
Spidr.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
  # ...
end
```

Do not spider certain links:
```ruby
Spidr.site('http://company.com/', ignore_links: [%{^/blog/}]) do |agent|
  # ...
end
```

Do not spider links on certain ports:
```ruby
Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
  # ...
end
```

Do not spider links blacklisted in robots.txt:
```ruby
Spidr.site('http://company.com/', robots: true) do |agent|
  # ...
end
```

Print out visited URLs:
```ruby
Spidr.site('http://www.rubyinside.com/') do |spider|
  spider.every_url { |url| puts url }
end
```
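Callbacks can also be restricted to URLs matching a pattern. A short sketch of the "every visited URL that matches a specified pattern" feature, assuming it is exposed as `every_url_like` (confirm against the callback documentation):

```ruby
# Hedged sketch: `every_url_like` is assumed from the pattern-matching
# callback named in the feature list; verify it for your Spidr version.
Spidr.site('http://company.com/') do |spider|
  spider.every_url_like(%r{/blog/\d+}) do |url|
    puts "Blog post: #{url}"
  end
end
```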
Build a URL map of a site:

```ruby
url_map = Hash.new { |hash,key| hash[key] = [] }

Spidr.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end
```

Print out the URLs that could not be requested:
```ruby
Spidr.site('http://company.com/') do |spider|
  spider.every_failed_url { |url| puts url }
end
```

Find all pages which have broken links:
```ruby
url_map = Hash.new { |hash,key| hash[key] = [] }

spider = Spidr.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts "  #{page}" }
end
```

Search HTML and XML pages:
```ruby
Spidr.site('http://company.com/') do |spider|
  spider.every_page do |page|
    puts ">>> #{page.url}"

    page.search('//meta').each do |meta|
      name  = (meta.attributes['name'] || meta.attributes['http-equiv'])
      value = meta.attributes['content']

      puts "  #{name} = #{value}"
    end
  end
end
```

Print out the titles from every page:
```ruby
Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end
```

Print out every HTTP redirect:
```ruby
Spidr.host('company.com') do |spider|
  spider.every_redirect_page do |page|
    puts "#{page.url} -> #{page.headers['Location']}"
  end
end
```

Find what kinds of web servers a host is using, by accessing the headers:
```ruby
require 'set'  # Set is not auto-loaded on older Rubies

servers = Set[]

Spidr.host('company.com') do |spider|
  spider.all_headers do |headers|
    servers << headers['server']
  end
end
```

Pause the spider on a forbidden page:
```ruby
Spidr.host('company.com') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end
```
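A paused agent keeps its queue and history, so spidering can be picked up again later. A small sketch, assuming the `paused?` and `continue!` action methods (they pair with `pause!` in Spidr's agent actions, but verify for your version):

```ruby
# Hedged sketch: resume a paused agent. `paused?`/`continue!` are
# assumptions paired with `pause!` above; confirm against Spidr::Agent.
agent = Spidr.host('company.com') do |spider|
  spider.every_forbidden_page { |page| spider.pause! }
end

agent.continue! if agent.paused?
```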
Skip the processing of a page:

```ruby
Spidr.host('company.com') do |spider|
  spider.every_missing_page do |page|
    spider.skip_page!
  end
end
```

Skip the processing of links:
```ruby
Spidr.host('company.com') do |spider|
  spider.every_url do |url|
    if url.path.split('/').find { |dir| dir.to_i > 1000 }
      spider.skip_link!
    end
  end
end
```
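The feature list also mentions restoring the spidering queue and history from a previous session. A hedged sketch of one way that could look, assuming `Spidr::Agent` exposes `history`, `queue`, their setters, and `run` (verify these against the Agent documentation before relying on them):

```ruby
# Hedged sketch: persist the visited history and pending queue, then
# restore them in a later session. The history=/queue= setters and #run
# are assumptions based on the feature list; verify against Spidr::Agent.
agent = Spidr.site('http://company.com/') do |spider|
  # ... spider until the process is stopped ...
end

saved_history = agent.history.to_a   # URLs already visited
saved_queue   = agent.queue.to_a     # URLs still waiting to be visited

# Later, in a new session:
resumed = Spidr::Agent.new
resumed.history = saved_history
resumed.queue   = saved_queue
resumed.run                          # continue where the last session left off
```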
## Requirements

* [ruby] >= 2.0.0
* [nokogiri] ~> 1.3

## Install
```shell
$ gem install spidr
```
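If you manage the dependency with Bundler, the equivalent Gemfile entry would be (the version pin here is illustrative, not taken from the README):

```ruby
# Gemfile -- the version constraint is illustrative; pin what fits your project.
gem 'spidr', '~> 0.7'
```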
## License

See {file:LICENSE.txt} for license information.
[ruby]: https://www.ruby-lang.org/
[nokogiri]: http://www.nokogiri.org/