https://github.com/aduth/crawl-domain
Crawl to discover all paths under a given URL domain
https://github.com/aduth/crawl-domain
Last synced: about 1 year ago
JSON representation
Crawl to discover all paths under a given URL domain
- Host: GitHub
- URL: https://github.com/aduth/crawl-domain
- Owner: aduth
- License: mit
- Created: 2020-10-18T19:31:29.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2020-10-21T01:47:32.000Z (over 5 years ago)
- Last Synced: 2025-03-13T22:35:06.415Z (over 1 year ago)
- Language: JavaScript
- Size: 38.1 KB
- Stars: 3
- Watchers: 2
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.md
Awesome Lists containing this project
README
# crawl-domain
Crawl to discover all paths under a given URL domain.
## Example
```js
import crawl from 'crawl-domain';
for await (const url of crawl('http://localhost:3000')) {
console.log(url);
}
// → http://localhost:3000/
// → http://localhost:3000/b
// → http://localhost:3000/c
// → http://localhost:3000/d
```
## Installation
`crawl-domain` is authored as an [ESM module](https://nodejs.org/api/esm.html), and therefore requires Node 12.0 or newer.
Install using NPM or Yarn:
```
npm install crawl-domain
```
```
yarn add crawl-domain
```
## Usage
The default export will return a [Node pipeline stream](https://nodejs.org/api/stream.html#stream_stream) when called, which can be [iterated asynchronously with `for await`](https://2ality.com/2019/11/nodejs-streams-async-iteration.html) to operate on crawled links as soon as they're discovered:
```js
import crawl from 'crawl-domain';
for await (const url of crawl('http://localhost:3000')) {
console.log(url);
}
```
If you'd prefer not to use streams, there are also Node-style callback and promise forms available, where the resolved value will be an array of all discovered URLs:
```js
import crawl from 'crawl-domain';
crawl('http://localhost:3000', (error, urls) => {
console.log(urls);
});
```
```js
import { promise as crawl } from 'crawl-domain';
const urls = await crawl('http://localhost:3000');
console.log(urls);
```
## Options
`crawl` can optionally receive an options object as the second argument.
The following options are supported:
- `concurrency`: HTTP client concurrency. Defaults to `10`.
- `timeout`: HTTP client timeout, in milliseconds. Defaults to `10000`.
## API
```ts
export function stream(
rootURL: string,
options?: Partial | undefined,
callback?: CrawlCallback | undefined
): import('stream').Readable;
export const promise: CrawlPromise;
export default stream;
export type CrawlOptions = {
concurrency: number;
timeout: number;
};
export type CrawlCallback = (error: Error | null, urls: string[]) => void;
export type CrawlPromise = (
rootURL: string,
options?: CrawlOptions | undefined
) => Promise;
```
## License
Copyright 2020 Andrew Duthie
Released under the MIT License. See [LICENSE.md](./LICENSE.md).