robots.js
=========

robots.js is a parser for [robots.txt](http://www.robotstxt.org) files for node.js.

Installation
------------

It's recommended to install via [npm](https://github.com/isaacs/npm/):

```bash
$ npm install -g robots
```

Usage
-----

Here's an example of using robots.js:

```javascript
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access) {
      if (access) {
        // the URL may be fetched; parse it here
      }
    });
  }
});
```

The default crawler user-agent is:

`Mozilla/5.0 (X11; Linux i686; rv:5.0) Gecko/20100101 Firefox/5.0`

Here's an example of using a different user-agent and a more detailed callback:

```javascript
var robots = require('robots')
  , parser = new robots.RobotsParser(
      'http://nodeguide.ru/robots.txt',
      'Mozilla/5.0 (compatible; RobotTxtBot/1.0)',
      after_parse
    );

function after_parse(parser, success) {
  if (success) {
    parser.canFetch('*', '/doc/dailyjs-nodepad/', function (access, url, reason) {
      if (access) {
        console.log(' url: ' + url + ', access: ' + access);
        // parse url ...
      }
    });
  }
}
```

Here's an example of getting the list of sitemaps:

```javascript
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    parser.getSitemaps(function(sitemaps) {
      // sitemaps is an array of sitemap URLs
    });
  }
});
```

Here's an example of getCrawlDelay usage:

```javascript
var robots = require('robots')
  , parser = new robots.RobotsParser();

// Suppose the remote file contains:
//
// $ curl -s http://nodeguide.ru/robots.txt
//
// User-agent: Google-bot
// Disallow: /
// Crawl-delay: 2
//
// User-agent: *
// Disallow: /
// Crawl-delay: 2

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    var GoogleBotDelay = parser.getCrawlDelay("Google-bot");
    // ...
  }
});
```
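
Crawl-delay values are conventionally given in seconds. Here is a minimal sketch of honoring the delay when scheduling requests; the paths are illustrative, and the seconds-to-milliseconds conversion reflects an assumption about how a caller would use the returned number:

```javascript
var robots = require('robots')
  , parser = new robots.RobotsParser();

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  if (success) {
    // Crawl-delay is given in seconds; setTimeout expects milliseconds.
    var delayMs = parser.getCrawlDelay("Google-bot") * 1000;
    var paths = ['/doc/one/', '/doc/two/', '/doc/three/']; // illustrative paths
    paths.forEach(function (path, i) {
      setTimeout(function () {
        // fetch 'http://nodeguide.ru' + path here, one request per interval
      }, i * delayMs);
    });
  }
});
```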

An example of passing options to the HTTP request:

```javascript
var options = {
  headers: {
    Authorization: "Basic " + Buffer.from("username:password").toString("base64")
  }
};

var robots = require('robots')
  , parser = new robots.RobotsParser(null, options);

parser.setUrl('http://nodeguide.ru/robots.txt', function(parser, success) {
  // ...
});
```

API
---

RobotsParser is the main class. It provides methods to read, parse, and
answer questions about a single robots.txt file.

* **setUrl(url, read)** — sets the URL referring to a robots.txt file.
  By default it invokes the read() method. If read is a function, it is
  called once the remote file has been downloaded and parsed; it takes two
  arguments: the parser itself, and a boolean that is true if the remote
  file was successfully parsed.
* **read(after_parse)** — reads the robots.txt URL and feeds it to the parser.
* **parse(lines)** — parses the input lines from a robots.txt file.
* **canFetch(userAgent, url, callback)** — using the parsed robots.txt, decides
  whether userAgent can fetch url. Callback function:
  ``function callback(access, url, reason) { ... }``
  where:
  * *access* — whether this url can be fetched. true/false.
  * *url* — target url.
  * *reason* — reason for ``access``. Object:
    * type — valid values: 'statusCode', 'entry', 'defaultEntry', 'noRule'
    * entry — an instance of ``lib/Entry.js``. Only for types 'entry' and 'defaultEntry'.
    * statusCode — HTTP response status code for url. Only for type 'statusCode'.
* **canFetchSync(userAgent, url)** — using the parsed robots.txt, decides whether
  userAgent can fetch url. Returns true/false (see the sketch after this list).
* **getCrawlDelay(userAgent)** — returns the Crawl-delay for the given userAgent.
* **getSitemaps(callback)** — passes the array of sitemap URLs from the parsed robots.txt to callback.
* **getDisallowedPaths(userAgent)** — gets the paths explicitly disallowed for the
  given user agent, including rules for the wildcard agent ``*``.
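
A short, self-contained sketch of the synchronous helpers; it assumes parse()
accepts an array of lines (as the signature suggests) and uses an illustrative
user agent and rules:

```javascript
var robots = require('robots')
  , parser = new robots.RobotsParser();

// Feed rules to the parser directly instead of fetching a remote file.
parser.parse([
  'User-agent: *',
  'Disallow: /private/',
  'Disallow: /tmp/'
]);

// Synchronous check: returns true/false instead of using a callback.
console.log(parser.canFetchSync('MyBot/1.0', '/private/page.html')); // expect false
console.log(parser.canFetchSync('MyBot/1.0', '/index.html'));        // expect true

// Paths explicitly disallowed for this agent, including the '*' rules.
console.log(parser.getDisallowedPaths('MyBot/1.0')); // expect ['/private/', '/tmp/']
```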

License
-------

See [LICENSE](https://github.com/ekalinin/robots.js/blob/master/LICENSE)
file.

Resources
---------

* [Robots.txt Specifications by Google](http://code.google.com/web/controlcrawlindex/docs/robots_txt.html)
* [Robots.txt parser for python](http://docs.python.org/library/robotparser.html)
* [A Standard for Robot Exclusion](http://www.robotstxt.org/orig.html)