https://github.com/tarao/perl5-www-robotrules-parser-multivalue


[![Build Status](https://travis-ci.org/tarao/perl5-WWW-RobotRules-Parser-MultiValue.svg?branch=master)](https://travis-ci.org/tarao/perl5-WWW-RobotRules-Parser-MultiValue)
# NAME

WWW::RobotRules::Parser::MultiValue - Parse robots.txt

# SYNOPSIS

    use WWW::RobotRules::Parser::MultiValue;
    use LWP::Simple qw(get);

    my $url = 'http://example.com/robots.txt';
    my $robots_txt = get $url;

    my $rules = WWW::RobotRules::Parser::MultiValue->new(
        agent => 'TestBot/1.0',
    );
    $rules->parse($url, $robots_txt);

    if ($rules->allows('http://example.com/some/path')) {
        my $delay = $rules->delay_for('http://example.com/');
        sleep $delay;
        ...
    }

    my $hash = $rules->rules_for('http://example.com/');
    my @list_of_allowed_paths = $hash->get_all('allow');
    my @list_of_custom_rule_value = $hash->get_all('some-rule');

# DESCRIPTION

`WWW::RobotRules::Parser::MultiValue` is a parser for `robots.txt`.

Parsed rules for the specified user agent are stored as a
[Hash::MultiValue](https://metacpan.org/pod/Hash::MultiValue), where each key is a lowercase rule name.

The `Request-rate` rule is handled specially: it is normalized to a
`Crawl-delay` rule.
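
As a rough illustration of what such a normalization might look like, the sketch below converts a `Request-rate` value such as `1/5` (one request per five seconds) into an equivalent delay in seconds. The helper name `request_rate_to_delay` and the accepted unit suffixes (`s`, `m`, `h`) are assumptions made for this example; the module's actual parsing may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): convert a Request-rate
# value "<requests>/<period>[s|m|h]" into a delay in seconds per request.
sub request_rate_to_delay {
    my ($value) = @_;
    my %unit = (s => 1, m => 60, h => 3600);

    # e.g. "1/5", "10/1m", "30/1h"; the unit defaults to seconds
    my ($requests, $period, $suffix) =
        $value =~ m{\A\s*(\d+)\s*/\s*(\d+)\s*([smh]?)\s*\z}i
        or return;    # unparsable value: no delay can be derived

    return if $requests == 0;    # avoid division by zero
    return $period * $unit{ lc($suffix) || 's' } / $requests;
}

print request_rate_to_delay('1/5'),   "\n";    # 5
print request_rate_to_delay('10/1m'), "\n";    # 6
```

Returning nothing for an unparsable value lets the caller fall back to an explicit `Crawl-delay` rule or to no delay at all.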

# METHODS

- new

        $rules = WWW::RobotRules::Parser::MultiValue->new(
            agent => $user_agent
        );
        $rules = WWW::RobotRules::Parser::MultiValue->new(
            agent => $user_agent,
            ignore_default => 1,
        );

Creates a new object to handle rules in `robots.txt`. The object
parses the rules that match `$user_agent`. The rules of `User-agent: *`
always match, but have lower precedence than the rules that explicitly
match `$user_agent`. If the `ignore_default` option is
specified, the rules of `User-agent: *` are ignored.

- parse

        $rules->parse($uri, $text);

Parses the text content `$text` of the `robots.txt` located at `$uri`.

- match\_ua

        $rules->match_ua($pattern);

Tests whether the user agent matches `$pattern`.

- rules\_for

        $hash = $rules->rules_for($uri);

Returns a `Hash::MultiValue`, which describes the rules of the domain
of `$uri`.

- allows

        $test = $rules->allows($uri);

Tests whether the user agent is allowed to visit `$uri`. If there is
an 'Allow' rule for the path of `$uri`, the visit is allowed. If
there is a 'Disallow' rule for the path of `$uri`, the visit is not
allowed. Otherwise, the visit is allowed.
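
The decision order described above can be sketched in plain Perl as follows. The function `is_allowed` and the simple prefix matching are assumptions made for illustration; the module's real matching logic may be more involved.

```perl
use strict;
use warnings;

# Hypothetical sketch of the decision order: Allow first, then Disallow,
# otherwise allowed. Rules are matched as plain path prefixes here.
sub is_allowed {
    my ($path, $allow, $disallow) = @_;

    for my $prefix (@$allow) {
        next if $prefix eq '';    # skip empty rule values
        return 1 if index($path, $prefix) == 0;
    }
    for my $prefix (@$disallow) {
        next if $prefix eq '';    # an empty Disallow restricts nothing
        return 0 if index($path, $prefix) == 0;
    }
    return 1;    # no rule applies to this path
}

my @allow    = ('/private/public');
my @disallow = ('/private');

print is_allowed('/private/public/a.html', \@allow, \@disallow), "\n";    # 1
print is_allowed('/private/secret.html',   \@allow, \@disallow), "\n";    # 0
print is_allowed('/index.html',            \@allow, \@disallow), "\n";    # 1
```

Checking 'Allow' before 'Disallow' means a narrower 'Allow' entry can open up part of a disallowed subtree, as in the first call.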

- delay\_for

        $delay = $rules->delay_for($uri);
        $delay_in_milliseconds = $rules->delay_for($uri, 1000);

Calculates the crawl delay for the specified `$uri`. The value is
determined by the 'Crawl-delay' rule or the 'Request-rate' rule. The
optional second argument specifies the base of the return value; for
example, a base of 1000 yields the delay in milliseconds.
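
A minimal usage sketch of the second argument, assuming the base simply scales a delay expressed in seconds (so that 1000 gives milliseconds, as in the example above). `wait_before_fetch` is a hypothetical helper, and `$rules` stands for any object with a `delay_for` method, such as the parser described here; the core module `Time::HiRes` provides the sub-second sleep.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper: pause for the crawl delay of $uri, if there is one.
# Assumes delay_for($uri, $base) returns the delay scaled by $base, so a
# base of 1_000_000 yields microseconds, the unit usleep() expects.
sub wait_before_fetch {
    my ($rules, $uri) = @_;
    my $microseconds = $rules->delay_for($uri, 1_000_000) || 0;
    usleep($microseconds) if $microseconds > 0;
    return $microseconds;
}
```

Requesting the delay in microseconds keeps fractional delays intact, whereas the built-in `sleep` only waits for whole seconds.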

# SEE ALSO

[Hash::MultiValue](https://metacpan.org/pod/Hash::MultiValue)

# LICENSE

Copyright (C) INA Lintaro

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

# AUTHOR

INA Lintaro