{"id":13464609,"url":"https://github.com/postmodern/spidr","last_synced_at":"2025-05-13T19:14:46.992Z","repository":{"id":517641,"uuid":"145726","full_name":"postmodern/spidr","owner":"postmodern","description":"A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.","archived":false,"fork":false,"pushed_at":"2025-02-03T07:58:13.000Z","size":701,"stargazers_count":816,"open_issues_count":16,"forks_count":107,"subscribers_count":27,"default_branch":"master","last_synced_at":"2025-04-27T20:02:06.738Z","etag":null,"topics":["crawler","ruby","scraper","spider","spider-links","web","web-crawler","web-scraper","web-scraping","web-spider"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/postmodern.png","metadata":{"files":{"readme":"README.md","changelog":"ChangeLog.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"postmodern"}},"created_at":"2009-03-08T10:58:50.000Z","updated_at":"2025-04-07T13:01:47.000Z","dependencies_parsed_at":"2024-01-05T21:51:05.707Z","dependency_job_id":"74ef37c2-250d-4247-a559-fc21d78d07af","html_url":"https://github.com/postmodern/spidr","commit_stats":{"total_commits":868,"total_committers":16,"mean_commits":54.25,"dds":0.03917050691244239,"last_synced_commit":"5ab77427ab28b6df0c07d49ddb2bc316edee7547"},"previous_names":[],"tags_count":30,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/postmodern%2Fspidr","tags_url":"https://repos.ecosy
ste.ms/api/v1/hosts/GitHub/repositories/postmodern%2Fspidr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/postmodern%2Fspidr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/postmodern%2Fspidr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/postmodern","download_url":"https://codeload.github.com/postmodern/spidr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254010818,"owners_count":21998995,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","ruby","scraper","spider","spider-links","web","web-crawler","web-scraper","web-scraping","web-spider"],"created_at":"2024-07-31T14:00:47.267Z","updated_at":"2025-05-13T19:14:46.976Z","avatar_url":"https://github.com/postmodern.png","language":"Ruby","readme":"# Spidr\n\n[![CI](https://github.com/postmodern/spidr/actions/workflows/ruby.yml/badge.svg)](https://github.com/postmodern/spidr/actions/workflows/ruby.yml)\n\n* [Homepage](https://github.com/postmodern/spidr#readme)\n* [Source](https://github.com/postmodern/spidr)\n* [Issues](https://github.com/postmodern/spidr/issues)\n* [Mailing List](http://groups.google.com/group/spidr)\n\n## Description\n\nSpidr is a versatile Ruby web spidering library that can spider a site,\nmultiple domains, certain links or infinitely. 
Spidr is designed to be fast\nand easy to use.\n\n## Features\n\n* Follows:\n  * `a` tags.\n  * `iframe` tags.\n  * `frame` tags.\n  * Cookie protected links.\n  * HTTP 300, 301, 302, 303 and 307 Redirects.\n  * Meta-Refresh Redirects.\n  * HTTP Basic Auth protected links.\n* Black-list or white-list URLs based upon:\n  * URL scheme.\n  * Host name.\n  * Port number.\n  * Full link.\n  * URL extension.\n  * Optional `/robots.txt` support.\n* Provides callbacks for:\n  * Every visited Page.\n  * Every visited URL.\n  * Every visited URL that matches a specified pattern.\n  * Every origin and destination URI of a link.\n  * Every URL that failed to be visited.\n* Provides action methods to:\n  * Pause spidering.\n  * Skip processing of pages.\n  * Skip processing of links.\n* Restore the spidering queue and history from a previous session.\n* Custom User-Agent strings.\n* Custom proxy settings.\n* HTTPS support.\n\n## Examples\n\nStart spidering from a URL:\n\n```ruby\nSpidr.start_at('http://tenderlovemaking.com/') do |agent|\n  # ...\nend\n```\n\nSpider a host:\n\n```ruby\nSpidr.host('solnic.eu') do |agent|\n  # ...\nend\n```\n\nSpider a domain (and any sub-domains):\n\n```ruby\nSpidr.domain('ruby-lang.org') do |agent|\n  # ...\nend\n```\n\nSpider a site:\n\n```ruby\nSpidr.site('http://www.rubyflow.com/') do |agent|\n  # ...\nend\n```\n\nSpider multiple hosts:\n\n```ruby\nSpidr.start_at('http://company.com/', hosts: ['company.com', /host[\d]+\.company\.com/]) do |agent|\n  # ...\nend\n```\n\nDo not spider certain links:\n\n```ruby\nSpidr.site('http://company.com/', ignore_links: [%r{^/blog/}]) do |agent|\n  # ...\nend\n```\n\nDo not spider links on certain ports:\n\n```ruby\nSpidr.site('http://company.com/', ignore_ports: [8000, 8010, 8080]) do |agent|\n  # ...\nend\n```\n\nDo not spider links blacklisted in robots.txt:\n\n```ruby\nSpidr.site('http://company.com/', robots: true) do |agent|\n  # ...\nend\n```\n\nPrint out visited 
URLs:\n\n```ruby\nSpidr.site('http://www.rubyinside.com/') do |spider|\n  spider.every_url { |url| puts url }\nend\n```\n\nBuild a URL map of a site:\n\n```ruby\nurl_map = Hash.new { |hash,key| hash[key] = [] }\n\nSpidr.site('http://intranet.com/') do |spider|\n  spider.every_link do |origin,dest|\n    url_map[dest] \u003c\u003c origin\n  end\nend\n```\n\nPrint out the URLs that could not be requested:\n\n```ruby\nSpidr.site('http://company.com/') do |spider|\n  spider.every_failed_url { |url| puts url }\nend\n```\n\nFind all pages which have broken links:\n\n```ruby\nurl_map = Hash.new { |hash,key| hash[key] = [] }\n\nspider = Spidr.site('http://intranet.com/') do |spider|\n  spider.every_link do |origin,dest|\n    url_map[dest] \u003c\u003c origin\n  end\nend\n\nspider.failures.each do |url|\n  puts \"Broken link #{url} found in:\"\n\n  url_map[url].each { |page| puts \"  #{page}\" }\nend\n```\n\nSearch HTML and XML pages:\n\n```ruby\nSpidr.site('http://company.com/') do |spider|\n  spider.every_page do |page|\n    puts \"\u003e\u003e\u003e #{page.url}\"\n\n    page.search('//meta').each do |meta|\n      name = (meta.attributes['name'] || meta.attributes['http-equiv'])\n      value = meta.attributes['content']\n\n      puts \"  #{name} = #{value}\"\n    end\n  end\nend\n```\n\nPrint out the titles from every page:\n\n```ruby\nSpidr.site('https://www.ruby-lang.org/') do |spider|\n  spider.every_html_page do |page|\n    puts page.title\n  end\nend\n```\n\nPrint out every HTTP redirect:\n\n```ruby\nSpidr.host('company.com') do |spider|\n  spider.every_redirect_page do |page|\n    puts \"#{page.url} -\u003e #{page.headers['Location']}\"\n  end\nend\n```\n\nFind what kinds of web servers a host is using by accessing the headers:\n\n```ruby\nservers = Set[]\n\nSpidr.host('company.com') do |spider|\n  spider.all_headers do |headers|\n    servers \u003c\u003c headers['server']\n  end\nend\n```\n\nPause the spider on a forbidden 
page:\n\n```ruby\nSpidr.host('company.com') do |spider|\n  spider.every_forbidden_page do |page|\n    spider.pause!\n  end\nend\n```\n\nSkip the processing of a page:\n\n```ruby\nSpidr.host('company.com') do |spider|\n  spider.every_missing_page do |page|\n    spider.skip_page!\n  end\nend\n```\n\nSkip the processing of links:\n\n```ruby\nSpidr.host('company.com') do |spider|\n  spider.every_url do |url|\n    if url.path.split('/').find { |dir| dir.to_i \u003e 1000 }\n      spider.skip_link!\n    end\n  end\nend\n```\n\n## Requirements\n\n* [ruby] \u003e= 2.0.0\n* [nokogiri] ~\u003e 1.3\n\n## Install\n\n```shell\n$ gem install spidr\n```\n\n## License\n\nSee {file:LICENSE.txt} for license information.\n\n[ruby]: https://www.ruby-lang.org/\n[nokogiri]: http://www.nokogiri.org/\n","funding_links":["https://github.com/sponsors/postmodern"],"categories":["All","Ruby","Resources","Web Crawling"],"sub_categories":["Tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpostmodern%2Fspidr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpostmodern%2Fspidr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpostmodern%2Fspidr/lists"}