# web-tree-crawler

A naive web crawler that builds a tree of URLs under a domain using [web-tree](https://www.npmjs.com/package/web-tree).

**Note:** This software is intended for personal learning and testing purposes.

## How it works

You pass `web-tree-crawler` a URL, and it tries to discover and visit as many URLs under that domain name as it can within a time limit. When time is up or it runs out of URLs to visit, `web-tree-crawler` spits out a tree of the URLs it visited. There are several configuration options; see the usage sections below.
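
In sketch form, the crawl is a time-boxed, queue-based walk over same-domain URLs. Here's a rough illustration, not the package's actual implementation; `fetchAndExtractUrls` is a hypothetical helper, not part of this package:

```js
'use strict'

// Illustrative sketch only. fetchAndExtractUrls() is a hypothetical helper
// that fetches a page and returns the absolute URLs found in its body.

// Naive same-domain check for the sketch
const sameDomain = (url, base) =>
  new URL(url).hostname.endsWith(new URL(base).hostname)

async function crawlSketch (startUrl, timeLimit = 120) {
  const deadline = Date.now() + timeLimit * 1000
  const queue = [startUrl]
  const visited = new Set()

  while (queue.length && Date.now() < deadline) {
    const url = queue.shift()

    if (visited.has(url)) continue

    visited.add(url)

    const found = await fetchAndExtractUrls(url)

    // Only follow URLs under the starting domain
    queue.push(...found.filter(u => sameDomain(u, startUrl)))
  }

  // web-tree then renders the visited URLs as a tree
  return [...visited]
}
```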

## Install

`npm i web-tree-crawler`

## CLI

### Usage

```
Usage: [option=] web-tree-crawler

Options:
  format     , f   The output format of the tree (default="string")
  headers    , h   File containing headers to send with each request
  numRequests, n   The number of requests to send at a time (default=200)
  outFile    , o   Write the tree to file instead of stdout
  pathList   , p   File containing paths to initially crawl
  timeLimit  , t   The max number of seconds to run (default=120)
  verbose    , v   Log info and progress to stdout
```

### Examples

#### Crawl and print tree to stdout

```
$ h=/path/to/file web-tree-crawler

.com
  .domain
    .subdomain1
      /foo
      /bar
      .subdomain-of-subdomain1
        /baz
          ?q=1
    .subdomain2
      ...
```
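
Hostname labels are printed in reverse order (top-level domain first), with paths and query strings nested under the hosts where they were found.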

And to print an HTML tree...

```
$ f=html web-tree-crawler

...
```

#### Crawl and write tree to file

```
$ o=/path/to/file web-tree-crawler

Wrote tree to file!
```

#### Crawl with verbose logging

```
$ v=true web-tree-crawler

Visited ""
Visited ""
...
```
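
#### Crawl with multiple options

Options follow the same `name=value` pattern and can be combined (inferred from the single-option examples above). For example, to send 50 requests at a time with a 60-second limit:

```
$ n=50 t=60 web-tree-crawler
```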

## JS

### Usage

```js
/**
 * This is the main exported function that crawls and resolves the URL tree.
 *
 * @param  {String}   url
 * @param  {Object}   [opts = {}]
 * @param  {Object}   [opts.headers]           - headers to send with each request
 * @param  {Number}   [opts.numRequests = 200] - the number of requests to send at a time
 * @param  {String[]} [opts.startPaths]        - paths to initially crawl
 * @param  {Number}   [opts.timeLimit = 120]   - the max number of seconds to run for
 * @param  {Boolean}  [opts.verbose]           - if true, logs info and progress to stdout
 * @param  {}         [opts....]               - additional options for #lib.request()
 *
 * @return {Promise}
 */
```

### Example

```js
'use strict'

const crawl = require('web-tree-crawler')

const url = 'https://example.com'
const opts = { timeLimit: 60 }

crawl(url, opts)
  .then(tree => console.log(tree))
  .catch(err => console.error(err))
```
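
And a sketch passing the other documented options (the option names come from the signature above; the values are illustrative):

```js
'use strict'

const crawl = require('web-tree-crawler')

crawl('https://example.com', {
  headers: { 'user-agent': 'web-tree-crawler' }, // sent with each request
  numRequests: 50,                               // requests to send at a time
  startPaths: ['/sitemap.xml'],                  // paths to crawl first
  timeLimit: 60,                                 // stop after 60 seconds
  verbose: true                                  // log progress to stdout
})
  .then(tree => console.log(tree))
  .catch(err => console.error(err))
```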

### Test

`npm test`

### Lint

`npm run lint`

### Documentation

`npm run doc`

Generates the docs and opens them in your browser.

## Contributing

Please do!

If you find a bug, want a feature added, or just have a question, feel free to [open an issue](https://github.com/zbo14/web-tree-crawler/issues/new). You're also welcome to [create a pull request](https://github.com/zbo14/web-tree-crawler/compare/develop...) addressing an issue. Push your changes to a feature branch and request a merge into `develop`.

Make sure linting and tests pass and coverage is 💯 before creating a pull request!