Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/jamesfrost/robots.io

Robots.txt parsing library
Last synced: about 1 month ago

robots.io
=========
Robots.io is a Java library designed to make parsing a website's robots.txt file easy.

## How to use

The RobotsParser class provides all the functionality needed to use robots.io.

The Javadoc for Robots.io can be found here.

## Examples

### Connecting
To parse the robots.txt for Google with the User-Agent string "test":
```java
RobotsParser robotsParser = new RobotsParser("test");
robotsParser.connect("http://google.com");
```
Alternatively, to parse with no User-Agent, use the no-argument constructor.

You can also pass a domain with a path.
```java
robotsParser.connect("http://google.com/example.htm"); //This would also be valid
```
Note: Domains can be passed to all methods either as strings or as URL objects.
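Whatever form the domain takes, the robots.txt file itself always lives at the root of the host, so a URL with a path still resolves to the same file. The sketch below is not robots.io's code; it is a self-contained illustration, using only `java.net`, of how a robots.txt location can be derived from any URL on a domain. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not part of robots.io): derive the robots.txt
// location from any URL on a domain. The path component is discarded,
// because robots.txt always lives at the root of the host.
public class RobotsLocation {
    public static String robotsTxtUrl(String url) throws URISyntaxException {
        URI uri = new URI(url);
        return new URI(uri.getScheme(), uri.getAuthority(), "/robots.txt",
                null, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Both inputs resolve to the same robots.txt location.
        System.out.println(robotsTxtUrl("http://google.com"));
        System.out.println(robotsTxtUrl("http://google.com/example.htm"));
    }
}
```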

### Querying
To check if a URL is allowed:
```java
robotsParser.isAllowed("http://google.com/test"); //Returns true if allowed
```

Or, to get all the rules parsed from the file:
```java
robotsParser.getDisallowedPaths(); //This will return an ArrayList of Strings
```

The parsed results are cached in the ```RobotsParser``` object until the ```connect()``` method is called again, which overwrites the previously parsed data.
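To make the query semantics concrete, here is a minimal, self-contained sketch of the prefix matching that a robots.txt allow-check conventionally performs. It is an assumption about the behaviour, not robots.io's actual implementation, and the class name is hypothetical; note the rules carry no leading slash, matching the normalisation convention robots.io uses for disallowed paths.

```java
import java.util.List;

// Self-contained sketch (not robots.io's code) of conventional robots.txt
// matching: a path is disallowed if it starts with any disallowed rule.
public class AllowCheck {
    public static boolean isAllowed(String path, List<String> disallowedPaths) {
        for (String rule : disallowedPaths) {
            if (path.startsWith(rule)) {
                return false; // blocked by this rule
            }
        }
        return true; // no rule matched
    }

    public static void main(String[] args) {
        // Disallowed paths carry no leading slash, as robots.io returns them.
        List<String> rules = List.of("search", "private/");
        System.out.println(isAllowed("search/results", rules)); // false
        System.out.println(isAllowed("about", rules));          // true
    }
}
```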

### Politeness
In the event that all access is denied, a ```RobotsDisallowedException``` will be thrown.

## URL Normalisation
Domains passed to RobotsParser are normalised to always end in a forward slash.
Disallowed paths returned will never begin with a forward slash.
This is so that URLs can easily be constructed. For example:
```java
robotsParser.getDomain() + robotsParser.getDisallowedPaths().get(0); // http://google.com/example.htm
```
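This convention means plain string concatenation always produces exactly one slash between the domain and the path. A self-contained demonstration, with example values assumed rather than fetched from a real site:

```java
// Demonstrates the slash convention: a normalised domain always ends in "/",
// disallowed paths never start with "/", so concatenation yields a
// well-formed URL. Values are illustrative.
public class UrlJoin {
    public static String join(String normalisedDomain, String disallowedPath) {
        return normalisedDomain + disallowedPath;
    }

    public static void main(String[] args) {
        String domain = "http://google.com/"; // shape returned by getDomain()
        String path = "example.htm";          // shape of a disallowed path
        System.out.println(join(domain, path)); // http://google.com/example.htm
    }
}
```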

## Licensing
Robots.io is distributed under the GPL.