# robots-parse

[![NPM version][npm-image]][npm-url] [![Build Status][travis-image]][travis-url] [![Dependency Status][daviddm-image]][daviddm-url] [![Coverage percentage][coveralls-image]][coveralls-url]

> A lightweight and simple robots.txt parser in node.

[![NPM](https://nodei.co/npm/robots-parse.png)](https://nodei.co/npm/robots-parse/)

## Installation

```
npm install robots-parse
```

## Usage

You can use the module to scan a **domain** for its robots.txt file, as in the example below:

```js
const robotsParse = require('robots-parse');

robotsParse('github.com', (err, res) => {
  if (err) throw err;
  console.log('Result:', res);
});
```

You can also use it with __promises__ if the callback is not specified:

```js
import robotsParse from 'robots-parse';

(async () => {
  const res = await robotsParse('github.com');
  console.log('Result:', res);
})().catch(console.error);
```

Or you can use the built-in parser to parse an existing robots.txt file, for example a **local file** or a **string**. The parser works __synchronously__, so you don't need callbacks or promises.

```js
const request = require('request');
const {parser} = require('robots-parse');

request('https://google.com/robots.txt', (err, res, body) => {
  if (err) throw err;
  const object = parser(body);
  console.log(object);
});
```

Parsing an existing local robots.txt file:

```js
import fs from 'fs';
import {parser} from 'robots-parse';

const content = fs.readFileSync('./robots.txt', 'utf-8');
const object = parser(content);

console.log(object);
```

## How it works

By default, the module fetches and parses the `robots.txt` file for a given website or domain and searches for the following directives (a short example file follows the list):

- **Agents**: A user-agent identifies a specific spider. The user-agent field is matched against that specific spider's (usually longer) user-agent string.
- **Host**: Supported by Yandex (and not by Google, even though some posts claim otherwise), this directive lets you specify the preferred domain for the search engine to show (for example, `example.com` instead of `www.example.com`).
- **Allow**: The allow directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
- **Disallow**: The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
- **Sitemap**: An absolute URL that points to a Sitemap, a Sitemap Index file, or an equivalent URL.
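
For example, a robots.txt file containing all of these directives might look like this (`example.com` is a placeholder domain):

```
User-agent: *
Allow: /public/
Disallow: /private/

# Host is only honored by Yandex
Host: example.com

Sitemap: https://example.com/sitemap.xml
```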

If the robots file is successfully retrieved and parsed, it returns an object containing the properties listed above. Each agent entry holds that agent's specific **allow** and **disallow** rules, and all rules, regardless of agent, are also collected in the root-level **allow** and **disallow** properties.
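
As a rough illustration, the sketch below parses a small robots.txt string and logs the result. The property names in the trailing comment are an assumption inferred from the description above, not the module's documented output:

```js
const {parser} = require('robots-parse');

// A minimal robots.txt used only to illustrate the shape of the result.
const sample = [
  'User-agent: *',
  'Allow: /public/',
  'Disallow: /private/',
  'Sitemap: https://example.com/sitemap.xml'
].join('\n');

const result = parser(sample);
console.log(result);

// Approximate shape (property names are assumed, not authoritative):
// {
//   agents: { '*': { allow: ['/public/'], disallow: ['/private/'] } },
//   allow: ['/public/'],
//   disallow: ['/private/'],
//   sitemaps: ['https://example.com/sitemap.xml']
// }
```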

You can read more about the robots.txt specification on its [Google reference page](https://developers.google.com/search/reference/robots_txt).

---

## Contributing

1. Create an issue and describe your idea
2. Fork the project
3. Create your feature branch (`git checkout -b my-new-feature`)
4. Commit your changes (`git commit -am 'Add some feature'`)
5. Write tests for your code (`npm run test`)
6. Publish the branch (`git push origin my-new-feature`)
7. Create a new Pull Request

## License

MIT © [b4dnewz](https://b4dnewz.github.io/)

[npm-image]: https://badge.fury.io/js/robots-parse.svg

[npm-url]: https://npmjs.org/package/robots-parse

[travis-image]: https://travis-ci.org/b4dnewz/robots-parse.svg?branch=master

[travis-url]: https://travis-ci.org/b4dnewz/robots-parse

[daviddm-image]: https://david-dm.org/b4dnewz/robots-parse.svg?theme=shields.io

[daviddm-url]: https://david-dm.org/b4dnewz/robots-parse

[coveralls-image]: https://coveralls.io/repos/b4dnewz/robots-parse/badge.svg

[coveralls-url]: https://coveralls.io/r/b4dnewz/robots-parse