Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/b4dnewz/robots-parse
A lightweight and simple robots.txt parser in node
- Host: GitHub
- URL: https://github.com/b4dnewz/robots-parse
- Owner: b4dnewz
- License: mit
- Created: 2017-09-21T22:08:07.000Z (over 7 years ago)
- Default Branch: master
- Last Pushed: 2023-02-27T16:30:52.000Z (almost 2 years ago)
- Last Synced: 2024-10-28T14:53:46.979Z (2 months ago)
- Topics: osint, parser, robots-parser, robots-txt
- Language: TypeScript
- Homepage:
- Size: 647 KB
- Stars: 7
- Watchers: 2
- Forks: 0
- Open Issues: 17
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# robots-parse
[![NPM version][npm-image]][npm-url] [![Build Status][travis-image]][travis-url] [![Dependency Status][daviddm-image]][daviddm-url] [![Coverage percentage][coveralls-image]][coveralls-url]
> A lightweight and simple robots.txt parser in node.
[![NPM](https://nodei.co/npm/robots-parse.png)](https://nodei.co/npm/robots-parse/)
## Installation
```
npm install robots-parse
```

## Usage
You can use the module to scan a **domain** for robots file like in the example below:
```js
const robotsParse = require('robots-parse');

robotsParse('github.com', (err, res) => {
console.log('Result:', res);
});
```

You can also use it with __promises__ if the callback is not specified:
```js
import robotsParse from 'robots-parse';

(async () => {
const res = await robotsParse('github.com');
console.log('Result:', res);
})().catch(console.error)
```

Or you can use the built-in parser to parse an existing robots.txt file, for example a **local file** or a **string**. The parser works __synchronously__, so you don't have to use callbacks or promises.
```js
const {parser} = require('robots-parse');
const request = require('request'); // any HTTP client works; "request" is assumed here

request('google.com/robots.txt', (err, res, body) => {
const object = parser(body);
console.log(object);
});
```

Parsing an existing local robots.txt file:
```js
import fs from 'fs';
import {parser} from 'robots-parse';

const content = fs.readFileSync('./robots.txt', 'utf-8');
const object = parser(content);

console.log(object);
```

## How it works
By default the script fetches and parses the `robots.txt` file for a given website or domain, searching for the following rules:
- **Agents**: A user-agent identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent.
- **Host**: Supported by Yandex (and not by Google, even though some posts say otherwise), this directive lets you specify the preferred domain (host) that the search engine should show.
- **Allow**: The allow directive specifies paths that may be accessed by the designated crawlers. When no path is specified, the directive is ignored.
- **Disallow**: The disallow directive specifies paths that must not be accessed by the designated crawlers. When no path is specified, the directive is ignored.
- **Sitemap**: An absolute URL that points to a Sitemap, Sitemap Index file or equivalent URL.

If the robots file was successfully retrieved and parsed, it returns an object containing the properties mentioned above. Inside every agent found you will find agent-specific **allow** and **disallow** rules, which are also collected in the root **allow** and **disallow** properties containing all of them indistinctly.
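As a rough sketch, the result might look like the object below, assuming the rule names listed above map directly to properties; the exact structure and values depend on the robots.txt being parsed:

```js
// Illustrative only: property names follow the rules described above,
// the paths and URLs are hypothetical.
{
  host: 'example.com',
  sitemaps: ['https://example.com/sitemap.xml'],
  agents: {
    googlebot: {
      allow: ['/public/'],
      disallow: ['/private/']
    }
  },
  allow: ['/public/'],     // all allow rules, regardless of agent
  disallow: ['/private/']  // all disallow rules, regardless of agent
}
```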
You can read more about the specifications of the robots file on its [Google Reference Page](https://developers.google.com/search/reference/robots_txt).
---
## Contributing
1. Create an issue and describe your idea
2. Fork the project
3. Create your feature branch (`git checkout -b my-new-feature`)
4. Commit your changes (`git commit -am 'Add some feature'`)
5. Write tests for your code (`npm run test`)
6. Publish the branch (`git push origin my-new-feature`)
7. Create a new Pull Request

## License
MIT © [b4dnewz](https://b4dnewz.github.io/)
[npm-image]: https://badge.fury.io/js/robots-parse.svg
[npm-url]: https://npmjs.org/package/robots-parse
[travis-image]: https://travis-ci.org/b4dnewz/robots-parse.svg?branch=master
[travis-url]: https://travis-ci.org/b4dnewz/robots-parse
[daviddm-image]: https://david-dm.org/b4dnewz/robots-parse.svg?theme=shields.io
[daviddm-url]: https://david-dm.org/b4dnewz/robots-parse
[coveralls-image]: https://coveralls.io/repos/b4dnewz/robots-parse/badge.svg
[coveralls-url]: https://coveralls.io/r/b4dnewz/robots-parse