
Fast Ruby Link Checker
======================

[![Gem Version](http://img.shields.io/gem/v/ruby-link-checker.svg)](http://badge.fury.io/rb/ruby-link-checker)
[![Build Status](https://github.com/dblock/ruby-link-checker/workflows/test/badge.svg?branch=main)](https://github.com/dblock/ruby-link-checker/actions)
[![Code Climate](https://codeclimate.com/github/dblock/ruby-link-checker.svg)](https://codeclimate.com/github/dblock/ruby-link-checker)
[![Test Coverage](https://api.codeclimate.com/v1/badges/164f1e23fc706b6efa63/test_coverage)](https://codeclimate.com/github/dblock/ruby-link-checker/test_coverage)

A fast Ruby link checker with support for multiple HTTP libraries. It does not parse documents; it only checks the links you give it. Anecdotal benchmarking on an M1 Mac with a T1 Internet connection yields ~50 URLs per second with `LinkChecker::Typhoeus::Hydra`.

## Table of Contents

- [Usage](#usage)
  - [Dependencies](#dependencies)
  - [Basic Usage](#basic-usage)
  - [Passing Options](#passing-options)
  - [Checkers](#checkers)
    - [LinkChecker::Typhoeus::Hydra](#linkcheckertyphoeushydra)
    - [LinkChecker::Net::HTTP](#linkcheckernethttp)
  - [Options](#options)
    - [Retries](#retries)
    - [Results](#results)
    - [Methods](#methods)
    - [Logger](#logger)
    - [User-Agent](#user-agent)
  - [Global Configuration](#global-configuration)
  - [Callbacks and Events](#callbacks-and-events)
- [Contributing](#contributing)
- [Copyright and License](#copyright-and-license)

## Usage

### Dependencies

The [`LinkChecker::Typhoeus::Hydra`](lib/ruby-link-checker/typhoeus/hydra/checker.rb) link checker is recommended.

Add `typhoeus` and `ruby-link-checker` to your `Gemfile` and run `bundle install`.

```ruby
gem 'typhoeus'
gem 'ruby-link-checker'
```

### Basic Usage

```ruby
require 'typhoeus'
require 'ruby-link-checker'

# create a new checker instance
checker = LinkChecker::Typhoeus::Hydra::Checker.new

# queue URLs to check
links = [...]
links.each do |url|
  checker.check url
end

# run the checks
checker.run

# display buckets of results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end
```

### Passing Options

You can pipe custom options through `check` and retrieve them in events as follows.

```ruby
checker.check 'https://www.example.org', { location: 'page.html' }

checker.on :success do |result|
  result.options # contains { location: 'page.html' }
end
```
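One use for passed options is reporting broken links per source page. The sketch below does not use the gem: `SketchResult` is a hypothetical stand-in that models only the `url` and `options` properties, and `group_by_location` is an illustrative helper. With a real checker, the same grouping could presumably be done inside an `on :failure, :error` block.

```ruby
# Hypothetical stand-in for a checker result; only `url` and `options` are modeled.
SketchResult = Struct.new(:url, :options)

# Group result URLs by the :location option that was passed to `check`.
def group_by_location(results)
  results.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |result, groups|
    groups[result.options[:location]] << result.url
  end
end

failures = [
  SketchResult.new('https://www.example.org/a', { location: 'page.html' }),
  SketchResult.new('https://www.example.org/b', { location: 'page.html' }),
  SketchResult.new('https://www.example.org/c', { location: 'other.html' })
]

group_by_location(failures).each_pair do |location, urls|
  puts "#{location}: #{urls.join(', ')}"
end
```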

### Checkers

#### [LinkChecker::Typhoeus::Hydra](lib/ruby-link-checker/typhoeus/hydra/checker.rb)

Fast link checker that uses [Typhoeus](https://typhoeus.github.io/).

```ruby
require 'typhoeus'
require 'ruby-link-checker'

# create a new instance of a checker
checker = LinkChecker::Typhoeus::Hydra::Checker.new(
  hydra: {
    # lower than the Typhoeus default of 200, seems to start breaking around 50+
    max_concurrency: 25
  }
)

# log requests and response codes
checker.logger.level = Logger::INFO

links = [...] # array of URLs
links.each do |url|
  checker.check url
end

# examine failures and errors as they come
checker.on :error, :failure do |result|
  puts "FAIL: #{result}"
end

# execute Hydra#run, will block until all requests have completed
checker.run

# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end
```

You can pass `Typhoeus` timeout options into a new instance of a checker, or configure timeouts globally.

```ruby
LinkChecker::Typhoeus::Hydra.configure do |config|
  config.timeout = 5
  config.connecttimeout = 10
end
```

#### [LinkChecker::Net::HTTP](lib/ruby-link-checker/net/http/checker.rb)

Slow, sequential checker.

```ruby
require 'net/http'
require 'ruby-link-checker'

# create a new instance of a checker
checker = LinkChecker::Net::HTTP::Checker.new

# log requests and response codes
checker.logger.level = Logger::INFO

links = [...] # array of URLs
links.each do |url|
  checker.check url
end

# examine results
checker.results.each_pair do |bucket, results|
  puts "#{bucket}: #{results.size}"
end
```

You can pass `Net::HTTP` timeout options into a new instance of a checker, or configure timeouts globally.

```ruby
LinkChecker::Net::HTTP.configure do |config|
  config.read_timeout = 5
  config.open_timeout = 10
end
```

### Options

#### Retries

By default link checkers do not retry. You can set a number of times to retry all errors and failures with `retries`.

```ruby
checker = LinkChecker::Net::HTTP::Checker.new(retries: 1)
```
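To make the retry count concrete, here is a minimal sketch of one plausible reading of `retries: 1`: one initial attempt plus at most one retry. It does not use the gem, `check_with_retries` is a hypothetical helper, and the gem's exact semantics may differ.

```ruby
# Run the block once, then up to `retries` more times while it reports
# anything other than :success. Returns [outcome, attempts].
def check_with_retries(retries: 0)
  attempts = 0
  loop do
    attempts += 1
    outcome = yield(attempts)
    return [outcome, attempts] if outcome == :success || attempts > retries
  end
end

# Stub check that fails on the first attempt and succeeds on the second.
outcome, attempts = check_with_retries(retries: 1) do |attempt|
  attempt < 2 ? :failure : :success
end
puts "#{outcome} after #{attempts} attempt(s)"
```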

#### Results

By default checkers collect results.

```ruby
checker = LinkChecker::Net::HTTP::Checker.new
...
checker.run

checker.results # => { error: [...], failure: [...], success: [...] }
```

You can disable this with `results: false`.

```ruby
checker = LinkChecker::Net::HTTP::Checker.new(results: false)
...
checker.run

checker.results # => nil
```

#### Methods

By default checkers try a `HEAD` request, followed by a `GET` if `HEAD` fails. You can change this behavior by specifying other methods.

The following example disables `GET` and only makes `HEAD` requests.

```ruby
checker = LinkChecker::Net::HTTP::Checker.new(methods: %w[HEAD])
```
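The fallback order can be sketched without any network access. This is an illustration only, not the gem's implementation: `first_successful` is a hypothetical helper, the stub stands in for a real HTTP request, and treating only 2xx codes as success is an assumption.

```ruby
# Try each HTTP method in order and stop at the first 2xx response.
# `request` is any callable that takes a method name and returns a status code.
# Returns [method, code] for the first success, or for the last attempt otherwise.
def first_successful(methods, request)
  last = nil
  methods.each do |http_method|
    code = request.call(http_method)
    last = [http_method, code]
    return last if (200..299).cover?(code)
  end
  last
end

# Stub server that rejects HEAD with 405 but answers GET with 200.
stub = ->(http_method) { http_method == 'HEAD' ? 405 : 200 }

p first_successful(%w[HEAD GET], stub) # => ["GET", 200]
p first_successful(%w[HEAD], stub)     # => ["HEAD", 405]
```

With `methods: %w[HEAD]`, a server that does not support `HEAD` is reported as a failure because no fallback request is made.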

#### Logger

Pass your own logger.

```ruby
checker = LinkChecker::Net::HTTP::Checker.new(logger: Logger.new(STDOUT))
```

#### User-Agent

Pass your own user-agent. Default is `Ruby Link Checker/x.y.z`.

```ruby
checker = LinkChecker::Net::HTTP::Checker.new(user_agent: 'Custom Agent/1.0')
```

### Global Configuration

All options can also be configured globally.

```ruby
LinkChecker.configure do |config|
  config.user_agent = 'Custom Agent/1.0'
  config.methods = ['HEAD', 'GET']
  config.logger = ::Logger.new(STDOUT)
end
```
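For readers unfamiliar with the `configure` block idiom, the following sketch shows how global defaults and per-instance options can be combined. `SketchChecker` and all of its methods are hypothetical, and the assumption that per-instance options take precedence over global ones is not confirmed by this README.

```ruby
# Hypothetical module showing the configure-block pattern; not the gem's code.
module SketchChecker
  Config = Struct.new(:user_agent, :methods)

  # Global defaults, created on first use.
  def self.config
    @config ||= Config.new('Sketch/0.0', %w[HEAD GET])
  end

  # Yield the global config so callers can set options in a block.
  def self.configure
    yield config
  end

  # Assumed precedence: per-instance overrides win over global defaults.
  def self.options_for(**overrides)
    config.to_h.merge(overrides)
  end
end

SketchChecker.configure do |config|
  config.user_agent = 'Custom Agent/1.0'
end

p SketchChecker.options_for(methods: %w[HEAD])
```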

### Callbacks and Events

Events enable processing of results as they become available.

```ruby
checker.on :result do |result|
  puts result # any result
end

checker.on :error, :failure do |result|
  puts result # error or failure
end
```

Checkers support the following events.

| Event | Description |
|----------|----------------------------------------------------------------|
| :retry | A request is being retried on failure or error. |
| :result | A new result, any of success, failure, or error. |
| :success | A valid URL, usually a 2xx response from the server. |
| :failure | A failed URL, usually a 4xx or a 5xx response from the server. |
| :error | An error, such as an invalid URL or a network timeout. |

Events are called with results, which contain the following properties.

| Property | Description |
|-------------------|-----------------------------------------------------------------|
| :url | The original URL before redirects. |
| :result_url | The last URL, different from `url` in case of redirects. |
| :method | The result HTTP method. |
| :code | HTTP response status code. |
| :request_headers | Request headers. |
| :redirect_to | A redirect URL in case of redirects. |
| :error | A raised error in case of errors. |

See [result.rb](lib/ruby-link-checker/result.rb) for more details.
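The callback pattern itself can be sketched in a few lines of plain Ruby. `SketchEmitter` is a hypothetical class written for illustration, not the gem's code, and the order in which handlers fire is an assumption.

```ruby
# Hypothetical event emitter illustrating the callback pattern.
class SketchEmitter
  def initialize
    @handlers = Hash.new { |hash, event| hash[event] = [] }
  end

  # Register one block for one or more events.
  def on(*events, &block)
    events.each { |event| @handlers[event] << block }
  end

  # Fire the bucket-specific handlers, then the catch-all :result handlers.
  def emit(bucket, result)
    (@handlers[bucket] + @handlers[:result]).each { |handler| handler.call(result) }
  end
end

emitter = SketchEmitter.new
seen = []
emitter.on(:result) { |result| seen << "any: #{result}" }
emitter.on(:error, :failure) { |result| seen << "bad: #{result}" }

emitter.emit(:success, 'https://www.example.org')
emitter.emit(:failure, 'https://www.example.org/missing')
puts seen
```

Registering a single block for several events, as in `on(:error, :failure)`, keeps failure handling in one place while `:result` still observes every outcome.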

## Contributing

You're encouraged to contribute to ruby-link-checker. See [CONTRIBUTING](CONTRIBUTING.md) for details.

## Copyright and License

Copyright (c) Daniel Doubrovkine and [Contributors](CHANGELOG.md).

This project is licensed under the [MIT License](LICENSE.md).