Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/krishpranav/spider
A Ruby web spidering tool that can spider a site, multiple domains, certain links, or crawl indefinitely
- Host: GitHub
- URL: https://github.com/krishpranav/spider
- Owner: krishpranav
- Created: 2021-07-18T06:07:52.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2021-07-19T03:45:41.000Z (over 3 years ago)
- Last Synced: 2024-10-15T22:21:15.920Z (3 months ago)
- Topics: crawler, ruby, spider, web-crawler, web-scraping
- Language: Ruby
- Homepage:
- Size: 79.1 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# spider
A Ruby web spidering tool that can spider a site, multiple domains, certain links, or crawl indefinitely.

[![forthebadge](https://forthebadge.com/images/badges/made-with-ruby.svg)](https://forthebadge.com)
## Installation
```
git clone https://github.com/krishpranav/spider
cd spider
bundle install
```

## Examples
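The examples below invoke methods on a `spider` object. The repository does not document how that object is created, so the snippet here is only a sketch of an assumed setup: requiring the library from the cloned checkout (the `lib/spider` path is a guess based on a conventional gem layout) and running the script with Bundler.

```rb
#!/usr/bin/env ruby
# Assumed setup: run from the repository root, e.g. `bundle exec ruby example.rb`.
# The require path below is a guess based on a conventional gem layout.
require_relative 'lib/spider'

# From here on, use the `spider` interface exactly as in the examples below.
```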
Start spidering from a URL:
```rb
spider.start_at('http://google.com/')
```

Spider a host:
```rb
spider.host('google.com')
```

Spider a site:
```rb
spider.site('http://www.rubyflow.com/')
```

Spider multiple hosts:
```rb
spider.start_at(
  'http://company.com/',
  hosts: [
    'company.com',
    /host[\d]+\.company\.com/
  ]
)
```

Do not spider certain links:
```rb
spider.site('http://company.com/', ignore_links: [%{^/blog/}])
```

Do not spider links on certain ports:
```rb
spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080])
```

Do not spider links blacklisted in robots.txt:
```rb
spider.site(
  'http://company.com/',
  robots: true
)
```

Print out visited URLs:
```rb
spider.site('http://www.rubyinside.com/') do |spider|
  spider.every_url { |url| puts url }
end
```

Build a URL map of a site:
```rb
url_map = Hash.new { |hash,key| hash[key] = [] }

spider.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end
```

Print out the URLs that could not be requested:
```rb
spider.site('http://company.com/') do |spider|
  spider.every_failed_url { |url| puts url }
end
```

Find all pages which have broken links:
```rb
url_map = Hash.new { |hash,key| hash[key] = [] }

spider = spider.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts "  #{page}" }
end
```

Search HTML and XML pages:
```rb
spider.site('http://company.com/') do |spider|
  spider.every_page do |page|
    puts ">>> #{page.url}"

    page.search('//meta').each do |meta|
      name = (meta.attributes['name'] || meta.attributes['http-equiv'])
      value = meta.attributes['content']

      puts "  #{name} = #{value}"
    end
  end
end
```

Print out the titles from every page:
```rb
spider.site('https://www.ruby-lang.org/') do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end
```

Find what kinds of web servers a host is using, by accessing the headers:
```rb
require 'set'

servers = Set[]

spider.host('company.com') do |spider|
  spider.all_headers do |headers|
    servers << headers['server']
  end
end
```

Pause the spider on a forbidden page:
```rb
spider.host('company.com') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end
```

Skip the processing of a page:
```rb
spider.host('company.com') do |spider|
  spider.every_missing_page do |page|
    spider.skip_page!
  end
end
```

Skip the processing of links:
```rb
spider.host('company.com') do |spider|
  spider.every_url do |url|
    if url.path.split('/').find { |dir| dir.to_i > 1000 }
      spider.skip_link!
    end
  end
end
```
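Putting a few of the callbacks above together, here is a sketch of a small crawl report that only uses calls already shown in this README (the target URL is a placeholder):

```rb
# Sketch: crawl a site, print each HTML page title, and collect the URLs
# that could not be requested, then summarise them at the end.
failed = []

spider.site('http://company.com/', robots: true) do |spider|
  spider.every_html_page { |page| puts page.title }
  spider.every_failed_url { |url| failed << url }
end

failed.each { |url| puts "Failed to request #{url}" }
```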