Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/jamesfrost/robots.io

Robots.txt parsing library
Last synced: about 1 month ago

robots.io
=========
Robots.io is a Java library designed to make parsing a website's robots.txt file easy.

## How to use

The RobotsParser class provides all the functionality needed to use robots.io.

The Javadoc for Robots.io can be found here.

## Examples

### Connecting
To parse the robots.txt for Google with the User-Agent string "test":
```java
RobotsParser robotsParser = new RobotsParser("test");
robotsParser.connect("http://google.com");
```
Alternatively, to parse with no User-Agent, use the no-argument constructor.

You can also pass a domain with a path.
```java
robotsParser.connect("http://google.com/example.htm"); //This would also be valid
```
Note: Domains can be passed to all methods either as strings or as URL objects.
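Whatever form the domain takes, the robots.txt file itself always lives at the root of the host, so a URL with a path still resolves to the same file. The sketch below is not robots.io's code; it is a self-contained illustration, using only `java.net`, of how a robots.txt location can be derived from any URL on a domain. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper (not part of robots.io): derive the robots.txt
// location from any URL on a domain. The path component is discarded,
// because robots.txt always lives at the root of the host.
public class RobotsLocation {
    public static String robotsTxtUrl(String url) throws URISyntaxException {
        URI uri = new URI(url);
        return new URI(uri.getScheme(), uri.getAuthority(), "/robots.txt",
                null, null).toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // Both inputs resolve to the same robots.txt location.
        System.out.println(robotsTxtUrl("http://google.com"));
        System.out.println(robotsTxtUrl("http://google.com/example.htm"));
    }
}
```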

### Querying
To check if a URL is allowed:
```java
robotsParser.isAllowed("http://google.com/test"); //Returns true if allowed
```

Or, to get all the rules parsed from the file:
```java
robotsParser.getDisallowedPaths(); //This will return an ArrayList of Strings
```

The parsed results are cached in the ```RobotsParser``` object until the ```connect()``` method is called again, which overwrites the previously parsed data.
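To make the query semantics concrete, here is a minimal, self-contained sketch of the prefix matching that a robots.txt allow-check conventionally performs. It is an assumption about the behaviour, not robots.io's actual implementation, and the class name is hypothetical; note the rules carry no leading slash, matching the normalisation convention robots.io uses for disallowed paths.

```java
import java.util.List;

// Self-contained sketch (not robots.io's code) of conventional robots.txt
// matching: a path is disallowed if it starts with any disallowed rule.
public class AllowCheck {
    public static boolean isAllowed(String path, List<String> disallowedPaths) {
        for (String rule : disallowedPaths) {
            if (path.startsWith(rule)) {
                return false; // blocked by this rule
            }
        }
        return true; // no rule matched
    }

    public static void main(String[] args) {
        // Disallowed paths carry no leading slash, as robots.io returns them.
        List<String> rules = List.of("search", "private/");
        System.out.println(isAllowed("search/results", rules)); // false
        System.out.println(isAllowed("about", rules));          // true
    }
}
```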

### Politeness
In the event that all access is denied, a ```RobotsDisallowedException``` will be thrown.

## URL Normalisation
Domains passed to RobotsParser are normalised to always end in a forward slash.
Disallowed paths returned will never begin with a forward slash.
This is so that URLs can easily be constructed. For example:
```java
robotsParser.getDomain() + robotsParser.getDisallowedPaths().get(0); // http://google.com/example.htm
```
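This convention means plain string concatenation always produces exactly one slash between the domain and the path. A self-contained demonstration, with example values assumed rather than fetched from a real site:

```java
// Demonstrates the slash convention: a normalised domain always ends in "/",
// disallowed paths never start with "/", so concatenation yields a
// well-formed URL. Values are illustrative.
public class UrlJoin {
    public static String join(String normalisedDomain, String disallowedPath) {
        return normalisedDomain + disallowedPath;
    }

    public static void main(String[] args) {
        String domain = "http://google.com/"; // shape returned by getDomain()
        String path = "example.htm";          // shape of a disallowed path
        System.out.println(join(domain, path)); // http://google.com/example.htm
    }
}
```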

## Licensing
Robots.io is distributed under the GPL.