Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/krishpranav/spider
A Ruby web spidering tool that can spider a site, multiple domains, certain links, or crawl indefinitely
- Host: GitHub
- URL: https://github.com/krishpranav/spider
- Owner: krishpranav
- Created: 2021-07-18T06:07:52.000Z (over 3 years ago)
- Default Branch: master
- Last Pushed: 2021-07-19T03:45:41.000Z (over 3 years ago)
- Last Synced: 2024-10-15T22:21:15.920Z (3 months ago)
- Topics: crawler, ruby, spider, web-crawler, web-scraping
- Language: Ruby
- Homepage:
- Size: 79.1 KB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# spider
A Ruby web spidering tool that can spider a site, multiple domains, certain links, or crawl indefinitely.

[![forthebadge](https://forthebadge.com/images/badges/made-with-ruby.svg)](https://forthebadge.com)
## Installation
```
git clone https://github.com/krishpranav/spider
cd spider
bundle install
```

## Examples
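The examples below invoke methods on a `spider` object. The repository does not document how that object is created, so the snippet here is only a sketch of an assumed setup: requiring the library from the cloned checkout (the `lib/spider` path is a guess based on a conventional gem layout) and running the script with Bundler.

```rb
#!/usr/bin/env ruby
# Assumed setup: run from the repository root, e.g. `bundle exec ruby example.rb`.
# The require path below is a guess based on a conventional gem layout.
require_relative 'lib/spider'

# From here on, use the `spider` interface exactly as in the examples below.
```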
Start spidering from a URL:
```rb
spider.start_at('http://google.com/')
```

Spider a host:
```rb
spider.host('google.com')
```

Spider a site:
```rb
spider.site('http://www.rubyflow.com/')
```

Spider multiple hosts:
```rb
spider.start_at(
  'http://company.com/',
  hosts: [
    'company.com',
    /host[\d]+\.company\.com/
  ]
)
```

Do not spider certain links:
```rb
spider.site('http://company.com/', ignore_links: [%{^/blog/}])
```

Do not spider links on certain ports:
```rb
spider.site('http://company.com/', ignore_ports: [8000, 8010, 8080])
```

Do not spider links blacklisted in robots.txt:
```rb
spider.site(
  'http://company.com/',
  robots: true
)
```

Print out visited URLs:
```rb
spider.site('http://www.rubyinside.com/') do |spider|
  spider.every_url { |url| puts url }
end
```

Build a URL map of a site:
```rb
url_map = Hash.new { |hash,key| hash[key] = [] }

spider.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end
```

Print out the URLs that could not be requested:
```rb
spider.site('http://company.com/') do |spider|
  spider.every_failed_url { |url| puts url }
end
```

Find all pages which have broken links:
```rb
url_map = Hash.new { |hash,key| hash[key] = [] }

spider = spider.site('http://intranet.com/') do |spider|
  spider.every_link do |origin,dest|
    url_map[dest] << origin
  end
end

spider.failures.each do |url|
  puts "Broken link #{url} found in:"

  url_map[url].each { |page| puts "  #{page}" }
end
```

Search HTML and XML pages:
```rb
spider.site('http://company.com/') do |spider|
  spider.every_page do |page|
    puts ">>> #{page.url}"

    page.search('//meta').each do |meta|
      name = (meta.attributes['name'] || meta.attributes['http-equiv'])
      value = meta.attributes['content']

      puts "  #{name} = #{value}"
    end
  end
end
```

Print out the titles from every page:
```rb
spider.site('https://www.ruby-lang.org/') do |spider|
  spider.every_html_page do |page|
    puts page.title
  end
end
```

Find what kinds of web servers a host is using, by accessing the headers:
```rb
require 'set'

servers = Set[]

spider.host('company.com') do |spider|
  spider.all_headers do |headers|
    servers << headers['server']
  end
end
```

Pause the spider on a forbidden page:
```rb
spider.host('company.com') do |spider|
  spider.every_forbidden_page do |page|
    spider.pause!
  end
end
```

Skip the processing of a page:
```rb
spider.host('company.com') do |spider|
  spider.every_missing_page do |page|
    spider.skip_page!
  end
end
```

Skip the processing of links:
```rb
spider.host('company.com') do |spider|
  spider.every_url do |url|
    if url.path.split('/').find { |dir| dir.to_i > 1000 }
      spider.skip_link!
    end
  end
end
```
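Putting a few of the callbacks above together, here is a sketch of a small crawl report that only uses calls already shown in this README (the target URL is a placeholder):

```rb
# Sketch: crawl a site, print each HTML page title, and collect the URLs
# that could not be requested, then summarise them at the end.
failed = []

spider.site('http://company.com/', robots: true) do |spider|
  spider.every_html_page { |page| puts page.title }
  spider.every_failed_url { |url| failed << url }
end

failed.each { |url| puts "Failed to request #{url}" }
```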