{"id":15323297,"url":"https://github.com/krishpranav/spider","last_synced_at":"2026-02-14T02:02:19.780Z","repository":{"id":109908002,"uuid":"387103808","full_name":"krishpranav/spider","owner":"krishpranav","description":"A ruby web spidering tool that can spider a site, multiple domains, certain links or infinitely","archived":false,"fork":false,"pushed_at":"2021-07-19T03:45:41.000Z","size":81,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-09-17T10:02:11.436Z","etag":null,"topics":["crawler","ruby","spider","web-crawler","web-scraping"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/krishpranav.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-07-18T06:07:52.000Z","updated_at":"2021-10-19T12:13:09.000Z","dependencies_parsed_at":"2023-05-11T15:15:48.270Z","dependency_job_id":null,"html_url":"https://github.com/krishpranav/spider","commit_stats":{"total_commits":75,"total_committers":2,"mean_commits":37.5,"dds":"0.013333333333333308","last_synced_commit":"d40bf94d2907d786b8917747f6e5412d3fec9e85"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/krishpranav/spider","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishpranav%2Fspider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishpranav%2Fspider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishpranav%2Fspi
der/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishpranav%2Fspider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/krishpranav","download_url":"https://codeload.github.com/krishpranav/spider/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/krishpranav%2Fspider/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29431593,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-13T22:20:51.549Z","status":"online","status_checked_at":"2026-02-14T02:00:07.626Z","response_time":53,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","ruby","spider","web-crawler","web-scraping"],"created_at":"2024-10-01T09:19:30.160Z","updated_at":"2026-02-14T02:02:19.758Z","avatar_url":"https://github.com/krishpranav.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# spider\nA ruby web spidering tool that can spider a site, multiple domains, certain links or infinitely\n\n[![forthebadge](https://forthebadge.com/images/badges/made-with-ruby.svg)](https://forthebadge.com)\n\n# Installation\n```\ngit clone https://github.com/krishpranav/spider\ncd spider\nbundle install\n```\n\n## Examples\n\nStart spidering from a URL:\n```rb\n    spider.start_at('http://google.com/')\n```\n\nSpider a host:\n```rb\n\n    spider.host('google.com')\n```\n\nSpider a 
site:\n```rb\n    spider.site('http://www.rubyflow.com/')\n```\n\nSpider multiple hosts:\n```rb\n\n    spider.start_at(\n      'http://company.com/',\n      hosts: [\n        'company.com',\n        /host[\\d]+\\.company\\.com/\n      ]\n    )\n```\n\nDo not spider certain links:\n```rb\n    spider.site('http://company.com/', ignore_links: [%r{^/blog/}])\n```\n\nDo not spider links on certain ports:\n```rb\n    spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080])\n```\n\nDo not spider links blacklisted in robots.txt:\n```rb\n    spider.site(\n      'http://company.com/',\n      robots: true\n    )\n```\n\nPrint out visited URLs:\n```rb\n    spider.site('http://www.rubyinside.com/') do |spider|\n      spider.every_url { |url| puts url }\n    end\n```\n\nBuild a URL map of a site:\n```rb\n    url_map = Hash.new { |hash,key| hash[key] = [] }\n\n    spider.site('http://intranet.com/') do |spider|\n      spider.every_link do |origin,dest|\n        url_map[dest] \u003c\u003c origin\n      end\n    end\n```\n\nPrint out the URLs that could not be requested:\n```rb\n    spider.site('http://company.com/') do |spider|\n      spider.every_failed_url { |url| puts url }\n    end\n```\n\nFind all pages that have broken links:\n```rb\n    url_map = Hash.new { |hash,key| hash[key] = [] }\n\n    spider = spider.site('http://intranet.com/') do |spider|\n      spider.every_link do |origin,dest|\n        url_map[dest] \u003c\u003c origin\n      end\n    end\n\n    spider.failures.each do |url|\n      puts \"Broken link #{url} found in:\"\n\n      url_map[url].each { |page| puts \"  #{page}\" }\n    end\n```\n\nSearch HTML and XML pages:\n```rb\n    spider.site('http://company.com/') do |spider|\n      spider.every_page do |page|\n        puts \"\u003e\u003e\u003e #{page.url}\"\n\n        page.search('//meta').each do |meta|\n          name = (meta.attributes['name'] || meta.attributes['http-equiv'])\n          value = meta.attributes['content']\n\n          puts \"  
#{name} = #{value}\"\n        end\n      end\n    end\n```\n\nPrint out the titles from every page:\n```rb\n    spider.site('https://www.ruby-lang.org/') do |spider|\n      spider.every_html_page do |page|\n        puts page.title\n      end\n    end\n```\n\nFind what kinds of web servers a host is using, by accessing the headers:\n```rb\n    require 'set'\n\n    servers = Set[]\n\n    spider.host('company.com') do |spider|\n      spider.all_headers do |headers|\n        servers \u003c\u003c headers['server']\n      end\n    end\n```\n\nPause the spider on a forbidden page:\n```rb\n    spider.host('company.com') do |spider|\n      spider.every_forbidden_page do |page|\n        spider.pause!\n      end\n    end\n```\n\nSkip the processing of a page:\n```rb\n    spider.host('company.com') do |spider|\n      spider.every_missing_page do |page|\n        spider.skip_page!\n      end\n    end\n```\n\nSkip the processing of links:\n```rb\n    spider.host('company.com') do |spider|\n      spider.every_url do |url|\n        if url.path.split('/').find { |dir| dir.to_i \u003e 1000 }\n          spider.skip_link!\n        end\n      end\n    end\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishpranav%2Fspider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkrishpranav%2Fspider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkrishpranav%2Fspider/lists"}