# Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/postmodern/spidr

A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
- Host: GitHub
- URL: https://github.com/postmodern/spidr
- Owner: postmodern
- License: MIT
- Created: 2009-03-08T10:58:50.000Z (almost 16 years ago)
- Default Branch: master
- Last Pushed: 2024-12-18T23:00:18.000Z (23 days ago)
- Last Synced: 2025-01-02T12:01:54.180Z (9 days ago)
- Topics: crawler, ruby, scraper, spider, spider-links, web, web-crawler, web-scraper, web-scraping, web-spider
- Language: Ruby
- Homepage:
- Size: 678 KB
- Stars: 810
- Watchers: 28
- Forks: 107
- Open Issues: 16
Metadata Files:
- Readme: README.md
- Changelog: ChangeLog.md
- License: LICENSE.txt
README
# Spidr
[![CI](https://github.com/postmodern/spidr/actions/workflows/ruby.yml/badge.svg)](https://github.com/postmodern/spidr/actions/workflows/ruby.yml)
* [Homepage](https://github.com/postmodern/spidr#readme)
* [Source](https://github.com/postmodern/spidr)
* [Issues](https://github.com/postmodern/spidr/issues)
* [Mailing List](http://groups.google.com/group/spidr)

## Description
Spidr is a versatile Ruby web spidering library that can spider a site,
multiple domains, certain links or infinitely. Spidr is designed to be fast
and easy to use.

## Features
* Follows:
  * `a` tags.
  * `iframe` tags.
  * `frame` tags.
  * Cookie protected links.
  * HTTP 300, 301, 302, 303 and 307 Redirects.
  * Meta-Refresh Redirects.
  * HTTP Basic Auth protected links.
* Black-list or white-list URLs based upon (see the sketch after this list):
  * URL scheme.
  * Host name.
  * Port number.
  * Full link.
  * URL extension.
* Optional `/robots.txt` support.
* Provides callbacks for:
  * Every visited Page.
  * Every visited URL.
  * Every visited URL that matches a specified pattern.
  * Every origin and destination URI of a link.
  * Every URL that failed to be visited.
* Provides action methods to:
  * Pause spidering.
  * Skip processing of pages.
  * Skip processing of links.
* Restore the spidering queue and history from a previous session.
* Custom User-Agent strings.
* Custom proxy settings.
* HTTPS support.
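The black-list/white-list options, custom User-Agent, and proxy settings can be combined when constructing an agent. A minimal sketch, assuming the `ignore_exts:`, `user_agent:`, and `proxy:` option names (they follow the feature list above, but verify them against `Spidr::Agent`'s documentation):

```ruby
require 'spidr'

# Hedged sketch: the option names are assumptions drawn from the feature
# list above; check Spidr::Agent's docs before relying on them.
Spidr.site(
  'http://company.com/',
  ignore_exts:  %w[pdf zip gz],                             # black-list by URL extension
  ignore_ports: [8000, 8080],                               # black-list by port number
  user_agent:   'MyCrawler/1.0 (+http://example.com/bot)',  # custom User-Agent string
  proxy:        { host: 'proxy.example.com', port: 3128 }   # custom proxy settings
) do |agent|
  agent.every_url { |url| puts url }
end
```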
## Examples

Start spidering from a URL:
```ruby
Spidr.start_at('http://tenderlovemaking.com/') do |agent|
  # ...
end
```

Spider a host:
```ruby
Spidr.host('solnic.eu') do |agent|
  # ...
end
```

Spider a domain (and any sub-domains):
```ruby
Spidr.domain('ruby-lang.org') do |agent|
  # ...
end
```

Spider a site:
```ruby
Spidr.site('http://www.rubyflow.com/') do |agent|
  # ...
end
```

Spider multiple hosts:
```ruby
Spidr.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|
  # ...
end
```

Do not spider certain links:
```ruby
Spidr.site('http://company.com/', ignore_links: [%{^/blog/}]) do |agent|
  # ...
end
```

Do not spider links on certain ports:
```ruby
Spidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|
  # ...
end
```

Do not spider links blacklisted in robots.txt:
```ruby
Spidr.site('http://company.com/', robots: true) do |agent|
  # ...
end
```

Print out visited URLs:
```ruby
Spidr.site('http://www.rubyinside.com/') do |spider|
  spider.every_url { |url| puts url }
end
```
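Callbacks can also be restricted to URLs matching a pattern. A short sketch of the "every visited URL that matches a specified pattern" feature, assuming it is exposed as `every_url_like` (confirm against the callback documentation):

```ruby
# Hedged sketch: `every_url_like` is assumed from the pattern-matching
# callback named in the feature list; verify it for your Spidr version.
Spidr.site('http://company.com/') do |spider|
  spider.every_url_like(%r{/blog/\d+}) do |url|
    puts "Blog post: #{url}"
  end
end
```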
Build a URL map of a site:

```ruby
url_map = Hash.new { |hash,key| hash[key] = [] }

Spidr.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end
```

Print out the URLs that could not be requested:
```ruby
Spidr.site('http://company.com/') do |spider|
  spider.every_failed_url { |url| puts url }
end
```

Find all pages which have broken links:
```ruby
url_map = Hash.new { |hash,key| hash[key] = [] }

spider = Spidr.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts "  #{page}" }
end
```

Search HTML and XML pages:
```ruby
Spidr.site('http://company.com/') do |spider|
  spider.every_page do |page|
    puts ">>> #{page.url}"

    page.search('//meta').each do |meta|
      name  = (meta.attributes['name'] || meta.attributes['http-equiv'])
      value = meta.attributes['content']

      puts "  #{name} = #{value}"
    end
  end
end
```

Print out the titles from every page:
```ruby
Spidr.site('https://www.ruby-lang.org/') do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end
```

Print out every HTTP redirect:
```ruby
Spidr.host('company.com') do |spider|
  spider.every_redirect_page do |page|
    puts "#{page.url} -> #{page.headers['Location']}"
  end
end
```

Find what kinds of web servers a host is using, by accessing the headers:
```ruby
require 'set'  # Set is not auto-loaded on older Rubies

servers = Set[]

Spidr.host('company.com') do |spider|
  spider.all_headers do |headers|
    servers << headers['server']
  end
end
```

Pause the spider on a forbidden page:
```ruby
Spidr.host('company.com') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end
```
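A paused agent keeps its queue and history, so spidering can be picked up again later. A small sketch, assuming the `paused?` and `continue!` action methods (they pair with `pause!` in Spidr's agent actions, but verify for your version):

```ruby
# Hedged sketch: resume a paused agent. `paused?`/`continue!` are
# assumptions paired with `pause!` above; confirm against Spidr::Agent.
agent = Spidr.host('company.com') do |spider|
  spider.every_forbidden_page { |page| spider.pause! }
end

agent.continue! if agent.paused?
```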
Skip the processing of a page:

```ruby
Spidr.host('company.com') do |spider|
  spider.every_missing_page do |page|
    spider.skip_page!
  end
end
```

Skip the processing of links:
```ruby
Spidr.host('company.com') do |spider|
  spider.every_url do |url|
    if url.path.split('/').find { |dir| dir.to_i > 1000 }
      spider.skip_link!
    end
  end
end
```
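The feature list also mentions restoring the spidering queue and history from a previous session. A hedged sketch of one way that could look, assuming `Spidr::Agent` exposes `history`, `queue`, their setters, and `run` (verify these against the Agent documentation before relying on them):

```ruby
# Hedged sketch: persist the visited history and pending queue, then
# restore them in a later session. The history=/queue= setters and #run
# are assumptions based on the feature list; verify against Spidr::Agent.
agent = Spidr.site('http://company.com/') do |spider|
  # ... spider until the process is stopped ...
end

saved_history = agent.history.to_a   # URLs already visited
saved_queue   = agent.queue.to_a     # URLs still waiting to be visited

# Later, in a new session:
resumed = Spidr::Agent.new
resumed.history = saved_history
resumed.queue   = saved_queue
resumed.run                          # continue where the last session left off
```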
## Requirements

* [ruby] >= 2.0.0
* [nokogiri] ~> 1.3

## Install
```shell
$ gem install spidr
```
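If you manage the dependency with Bundler, the equivalent Gemfile entry would be (the version pin here is illustrative, not taken from the README):

```ruby
# Gemfile -- the version constraint is illustrative; pin what fits your project.
gem 'spidr', '~> 0.7'
```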
## License

See {file:LICENSE.txt} for license information.
[ruby]: https://www.ruby-lang.org/
[nokogiri]: http://www.nokogiri.org/