Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/jamesfrost/robots.io
Robots.txt parsing library
- Host: GitHub
- URL: https://github.com/jamesfrost/robots.io
- Owner: JamesFrost
- License: gpl-3.0
- Created: 2014-12-24T17:15:23.000Z (almost 10 years ago)
- Default Branch: master
- Last Pushed: 2015-01-22T20:59:30.000Z (almost 10 years ago)
- Last Synced: 2023-02-26T07:55:49.266Z (almost 2 years ago)
- Language: Java
- Homepage:
- Size: 1.35 MB
- Stars: 9
- Watchers: 2
- Forks: 3
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
robots.io
=========
Robots.io is a Java library designed to make parsing a website's `robots.txt` file easy.

## How to use
The `RobotsParser` class provides all the functionality needed to use robots.io.
The Javadoc for Robots.io can be found here.

## Examples
### Connecting
To parse the robots.txt for Google with the User-Agent string "test":
```java
RobotsParser robotsParser = new RobotsParser("test");
robotsParser.connect("http://google.com");
```
Alternatively, to parse with no User-Agent, simply leave the constructor blank:
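```java
// Assuming the no-argument constructor described above:
RobotsParser robotsParser = new RobotsParser(); // No User-Agent string will be sent
```

You can also pass a domain with a path: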
```java
robotsParser.connect("http://google.com/example.htm"); //This would also be valid
```
Note: Domains can be passed either in string form or as a `URL` object to all methods.
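For example, the connection above could equally be made with a `java.net.URL` (a sketch based on the note above; exception handling for the checked `MalformedURLException` is omitted):

```java
import java.net.URL;

robotsParser.connect(new URL("http://google.com")); // Equivalent to passing the string form
```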
### Querying
To check if a URL is allowed:
```java
robotsParser.isAllowed("http://google.com/test"); //Returns true if allowed
```

Or, to get all the rules parsed from the file:
```java
robotsParser.getDisallowedPaths(); //This will return an ArrayList of Strings
```

The parsed results are cached in the `robotsParser` object until the `connect()` method is called again, which overwrites the previously parsed data.
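For instance (the second domain here is hypothetical, purely to illustrate the caching behaviour):

```java
robotsParser.connect("http://google.com");  // Rules for google.com are parsed and cached
robotsParser.connect("http://example.com"); // Previous rules are discarded and replaced
robotsParser.getDisallowedPaths();          // Now reflects example.com's robots.txt
```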
### Politeness
In the event that all access is denied, a `RobotsDisallowedException` will be thrown.
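A minimal way to handle this when connecting (whether the exception is checked or unchecked is not documented here, so the `catch` below is an assumption):

```java
try {
    robotsParser.connect("http://google.com");
} catch (RobotsDisallowedException e) {
    // All access is denied for this User-Agent; back off politely
}
```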
## URL Normalisation
Domains passed to `RobotsParser` are normalised to always end in a forward slash.
Disallowed Paths returned will never begin with a forward slash.
This is so that URLs can easily be constructed. For example:
```java
robotsParser.getDomain() + robotsParser.getDisallowedPaths().get(0); // http://google.com/example.htm
```

## Licensing
Robots.io is distributed under the GPL-3.0 license.