https://github.com/zbo14/web-tree-crawler
A web crawler that builds a tree of URLs.
- Host: GitHub
- URL: https://github.com/zbo14/web-tree-crawler
- Owner: zbo14
- License: mit
- Created: 2019-09-13T19:37:46.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2023-03-03T06:42:48.000Z (almost 3 years ago)
- Last Synced: 2024-11-22T20:39:03.655Z (about 1 year ago)
- Topics: http, https, tree, url, web-crawler
- Language: JavaScript
- Homepage:
- Size: 408 KB
- Stars: 0
- Watchers: 3
- Forks: 3
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
# web-tree-crawler
A naive web crawler that builds a tree of URLs under a domain using [web-tree](https://www.npmjs.com/package/web-tree).
**Note:** This software is intended for personal learning and testing purposes.
## How it works
You pass `web-tree-crawler` a URL, and it tries to discover and visit as many URLs under that domain as it can within a time limit. When time's up, or it runs out of URLs, `web-tree-crawler` spits out a tree of the URLs it visited. There are several configuration options; see the usage sections below.
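For intuition only, here's a minimal sketch of this style of time-limited, same-domain crawl loop. It is not `web-tree-crawler`'s actual code: the `crawlNaive` name and the regex-based link extraction are illustrative assumptions.

```js
// Sketch of the general idea: fetch a page, collect same-domain links,
// and repeat until the time limit. Requires Node 18+ for global fetch().
const crawlNaive = async (startUrl, { timeLimitMs = 120000 } = {}) => {
  const domain = new URL(startUrl).hostname
  const visited = new Set()
  const queue = [startUrl]
  const deadline = Date.now() + timeLimitMs

  while (queue.length && Date.now() < deadline) {
    const url = queue.shift()
    if (visited.has(url)) continue
    visited.add(url)

    const resp = await fetch(url).catch(() => null)
    if (!resp || !resp.ok) continue

    // Naive link extraction; keep only URLs under the crawled domain
    const html = await resp.text()

    for (const [, href] of html.matchAll(/href="([^"]+)"/g)) {
      try {
        const next = new URL(href, url)
        if (next.hostname.endsWith(domain)) queue.push(next.href)
      } catch {} // skip malformed hrefs
    }
  }

  return [...visited]
}
```

The real package also batches concurrent requests (see the `numRequests` option below) and feeds the visited URLs into [web-tree](https://www.npmjs.com/package/web-tree) to build the tree.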
## Install
`npm i web-tree-crawler`
## CLI
### Usage
```
Usage: [option=] web-tree-crawler <url>

Options:
  format     , f  The output format of the tree (default="string")
  headers    , h  File containing headers to send with each request
  numRequests, n  The number of requests to send at a time (default=200)
  outFile    , o  Write the tree to file instead of stdout
  pathList   , p  File containing paths to initially crawl
  timeLimit  , t  The max number of seconds to run (default=120)
  verbose    , v  Log info and progress to stdout
```
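Options are plain `key=value` assignments prefixed to the command, so (assuming the `[option=]` syntax composes, which the docs don't show explicitly) several should combine in a single invocation:

```
$ n=100 t=60 v=true web-tree-crawler <url>
```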
### Examples
#### Crawl and print tree to stdout
```
$ h=/path/to/file web-tree-crawler <url>
.com
.domain
.subdomain1
/foo
/bar
.subdomain-of-subdomain1
/baz
?q=1
.subdomain2
...
```
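(Reading the tree: hostname labels nest in reverse order, so `subdomain1.domain.com` appears under `.com` → `.domain` → `.subdomain1`, with paths like `/foo` and query strings like `?q=1` nested beneath the host that served them.)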
And to print an HTML tree...
```
$ f=html web-tree-crawler <url>
...
```
#### Crawl and write tree to file
```
$ o=/path/to/file web-tree-crawler <url>
Wrote tree to file!
```
#### Crawl with verbose logging
```
$ v=true web-tree-crawler <url>
Visited "<url>"
Visited "<url>"
...
```
## JS
### Usage
```js
/**
* This is the main exported function that crawls and resolves the URL tree.
*
* @param {String} url
* @param {Object} [opts = {}]
* @param {Object} [opts.headers] - headers to send with each request
* @param {Number} [opts.numRequests = 200] - the number of requests to send at a time
* @param {String[]} [opts.startPaths] - paths to initially crawl
* @param {Number} [opts.timeLimit = 120] - the max number of seconds to run for
* @param {Boolean} [opts.verbose] - if true, logs info and progress to stdout
* @param {} [opts....] - additional options for #lib.request()
*
* @return {Promise}
*/
```
### Example
```js
'use strict'

const crawl = require('web-tree-crawler')

const url = 'https://example.com' // the URL to crawl
const opts = { timeLimit: 60 }    // any documented options from above

crawl(url, opts)
  .then(tree => { /* handle the URL tree */ })
  .catch(err => { /* handle the error */ })
```
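The same function works with `async`/`await`; a quick sketch using the documented options (the URL and option values here are arbitrary examples):

```js
'use strict'

const crawl = require('web-tree-crawler')

const main = async () => {
  const tree = await crawl('https://example.com', {
    numRequests: 100, // requests sent at a time
    timeLimit: 60,    // max seconds to run
    verbose: true     // log progress to stdout
  })

  console.log(tree)
}

main().catch(console.error)
```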
### Test
`npm test`
### Lint
`npm run lint`
### Documentation
`npm run doc`
Generates the docs and opens them in your browser.
## Contributing
Please do!
If you find a bug, want a feature added, or just have a question, feel free to [open an issue](https://github.com/zbo14/web-tree-crawler/issues/new). In addition, you're welcome to [create a pull request](https://github.com/zbo14/web-tree-crawler/compare/develop...) addressing an issue. You should push your changes to a feature branch and request merge to `develop`.
Make sure linting and tests pass and coverage is 💯 before creating a pull request!