https://github.com/stevenvachon/broken-link-checker

Find broken links, missing images, etc within your HTML.
https://github.com/stevenvachon/broken-link-checker
html5 http link-checker links nodejs seo urls whatwg
Last synced: 5 months ago
JSON representation
Find broken links, missing images, etc within your HTML.
Host: GitHub
URL: https://github.com/stevenvachon/broken-link-checker
Owner: stevenvachon
License: mit
Created: 2015-01-10T04:31:07.000Z (over 10 years ago)
Default Branch: main
Last Pushed: 2024-01-08T09:26:30.000Z (over 1 year ago)
Last Synced: 2025-05-14T02:15:28.905Z (5 months ago)
Topics: html5, http, link-checker, links, nodejs, seo, urls, whatwg
Language: JavaScript
Homepage:
Size: 416 KB
Stars: 2,013
Watchers: 36
Forks: 307
Open Issues: 78
Metadata Files:
- Readme: README.md
- Changelog: changelog.md
- License: license
Awesome Lists containing this project

awesome-bbht - broken-link-checker - (Other / XSS)
awesome-bbht - broken-link-checker - (Other / XSS)
awesome-list - broken-link-checker
awesome-hacking-lists - stevenvachon/broken-link-checker - Find broken links, missing images, etc within your HTML. (JavaScript)
README

          # broken-link-checker [![NPM Version][npm-image]][npm-url] ![Build Status][ci-image] [![Coverage Status][coveralls-image]][coveralls-url] [![Dependency Monitor][greenkeeper-image]][greenkeeper-url]

> Find broken links, missing images, etc within your HTML.

* ✅ **Complete**: Unicode, redirects, compression, basic authentication, absolute/relative/local URLs.

* ⚡️ **Fast**: Concurrent, streamed and cached.

* 🍰 **Easy**: Convenient defaults and very configurable.

Other features:

* Support for many HTML elements and attributes; not only `` and ``.

* Support for relative URLs with ``.

* WHATWG specifications-compliant [HTML](https://html.spec.whatwg.org) and [URL](https://url.spec.whatwg.org) parsing.

* Honor robot exclusions (robots.txt, headers and `rel`), optionally.

* Detailed information for reporting and maintenance.

* URL keyword filtering with simple wildcards.

* Pause/Resume at any time.


## Installation

[Node.js](https://nodejs.org) `>= 14` is required. There're two ways to use it:

### Command Line Usage

To install, type this at the command line:

```shell

npm install broken-link-checker -g

```

After that, check out the help for available options:

```shell

blc --help

```

A typical site-wide check might look like:

```shell

blc http://yoursite.com -ro

# or

blc path/to/index.html -ro

```

**Note:** HTTP proxies are not directly supported. If your network is configured incorrectly with no resolution in sight, you could try using a [container with proxy settings](https://docs.docker.com/network/proxy/).

### Programmatic API

To install, type this at the command line:

```shell

npm install broken-link-checker

```

The remainder of this document will assist you in using the API.

## Classes

While all classes have been exposed for custom use, the one that you need will most likely be [`SiteChecker`](#sitechecker).

### `HtmlChecker`

Scans an HTML document to find broken links. All methods from [`EventEmitter`](https://nodejs.org/api/events.html#events_class_eventemitter) are available.

```js

const {HtmlChecker} = require('broken-link-checker');

const htmlChecker = new HtmlChecker(options)

  .on('error', (error) => {})

  .on('html', (tree, robots) => {})

  .on('queue', () => {})

  .on('junk', (result) => {})

  .on('link', (result) => {})

  .on('complete', () => {});

htmlChecker.scan(html, baseURL);

```

#### Methods & Properties

* `.clearCache()` will remove any cached URL responses.

* `.isPaused` returns `true` if the internal link queue is paused and `false` if not.

* `.numActiveLinks` returns the number of links with active requests.

* `.numQueuedLinks` returns the number of links that currently have no active requests.

* `.pause()` will pause the internal link queue, but will not pause any active requests.

* `.resume()` will resume the internal link queue.

* `.scan(html, baseURL)` parses & scans a single HTML document and returns a `Promise`. Calling this function while a previous scan is in progress will result in a thrown error. Arguments:

  * `html` must be either a [`Stream`](https://nodejs.org/api/stream.html) or a string.

  * `baseURL` must be a [`URL`](https://mdn.io/URL). Without this value, links to relative URLs will be given a `BLC_INVALID` reason for being broken (unless an absolute `` is found).

#### Events

* `'complete'` is emitted after the last result or zero results.

* `'error'` is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:

  * `error` is the `Error`.

* `'html'` is emitted after the HTML document has been fully parsed. Arguments:

  * `tree` is supplied by [parse5](https://npmjs.com/parse5).

  * `robots` is an instance of [robot-directives](https://npmjs.com/robot-directives) containing any `` robot exclusions.

* `'junk'` is emitted on each skipped/unchecked link, as configured in options. Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

* `'link'` is emitted with the result of each checked/unskipped link (broken or not). Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

* `'queue'` is emitted when a link is internally queued, dequeued or made active.

### `HtmlUrlChecker`

Scans the HTML content at each queued URL to find broken links. All methods from [`EventEmitter`](https://nodejs.org/api/events.html#events_class_eventemitter) are available.

```js

const {HtmlUrlChecker} = require('broken-link-checker');

const htmlUrlChecker = new HtmlUrlChecker(options)

  .on('error', (error) => {})

  .on('html', (tree, robots, response, pageURL, customData) => {})

  .on('queue', () => {})

  .on('junk', (result, customData) => {})

  .on('link', (result, customData) => {})

  .on('page', (error, pageURL, customData) => {})

  .on('end', () => {});

htmlUrlChecker.enqueue(pageURL, customData);

```

#### Methods & Properties

* `.clearCache()` will remove any cached URL responses.

* `.dequeue(id)` removes a page from the queue. Returns `true` on success or `false` on failure.

* `.enqueue(pageURL, customData)` adds a page to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:

  * `pageURL` must be a [`URL`](https://mdn.io/URL).

  * `customData` is optional data (of any type) that is stored in the queue item for the page.

* `.has(id)` returns `true` if the queue contains an active or queued page tagged with `id` and `false` if not.

* `.isPaused` returns `true` if the queue is paused and `false` if not.

* `.numActiveLinks` returns the number of links with active requests.

* `.numPages` returns the total number of pages in the queue.

* `.numQueuedLinks` returns the number of links that currently have no active requests.

* `.pause()` will pause the queue, but will not pause any active requests.

* `.resume()` will resume the queue.

#### Events

* `'end'` is emitted when the end of the queue has been reached.

* `'error'` is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:

  * `error` is the `Error`.

* `'html'` is emitted after a page's HTML document has been fully parsed. Arguments:

  * `tree` is supplied by [parse5](https://npmjs.com/parse5).

  * `robots` is an instance of [robot-directives](https://npmjs.com/robot-directives) containing any `` and `X-Robots-Tag` robot exclusions.

  * `response` is the full HTTP response for the page, excluding the body.

  * `pageURL` is the `URL` to the current page being scanned.

  * `customData` is whatever was queued.

* `'junk'` is emitted on each skipped/unchecked link, as configured in options. Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

  * `customData` is whatever was queued.

* `'link'` is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

  * `customData` is whatever was queued.

* `'page'` is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:

  * `error` will be an `Error` if such occurred or `null` if not.

  * `pageURL` is the `URL` to the current page being scanned.

  * `customData` is whatever was queued.

* `'queue'` is emitted when a URL (link or page) is queued, dequeued or made active.

### `SiteChecker`

Recursively scans (crawls) the HTML content at each queued URL to find broken links. All methods from [`EventEmitter`](https://nodejs.org/api/events.html#events_class_eventemitter) are available.

```js

const {SiteChecker} = require('broken-link-checker');

const siteChecker = new SiteChecker(options)

  .on('error', (error) => {})

  .on('robots', (robots, customData) => {})

  .on('html', (tree, robots, response, pageURL, customData) => {})

  .on('queue', () => {})

  .on('junk', (result, customData) => {})

  .on('link', (result, customData) => {})

  .on('page', (error, pageURL, customData) => {})

  .on('site', (error, siteURL, customData) => {})

  .on('end', () => {});

siteChecker.enqueue(siteURL, customData);

```

#### Methods & Properties

* `.clearCache()` will remove any cached URL responses.

* `.dequeue(id)` removes a site from the queue. Returns `true` on success or `false` on failure.

* `.enqueue(siteURL, customData)` adds [the first page of] a site to the queue. Queue items are auto-dequeued when their requests are complete. Returns a queue ID on success. Arguments:

  * `siteURL` must be a [`URL`](https://mdn.io/URL).

  * `customData` is optional data (of any type) that is stored in the queue item for the site.

* `.has(id)` returns `true` if the queue contains an active or queued site tagged with `id` and `false` if not.

* `.isPaused` returns `true` if the queue is paused and `false` if not.

* `.numActiveLinks` returns the number of links with active requests.

* `.numPages` returns the total number of pages in the queue.

* `.numQueuedLinks` returns the number of links that currently have no active requests.

* `.numSites` returns the total number of sites in the queue.

* `.pause()` will pause the queue, but will not pause any active requests.

* `.resume()` will resume the queue.

#### Events

* `'end'` is emitted when the end of the queue has been reached.

* `'error'` is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:

  * `error` is the `Error`.

* `'html'` is emitted after a page's HTML document has been fully parsed. Arguments:

  * `tree` is supplied by [parse5](https://npmjs.com/parse5).

  * `robots` is an instance of [robot-directives](https://npmjs.com/robot-directives) containing any `` and `X-Robots-Tag` robot exclusions.

  * `response` is the full HTTP response for the page, excluding the body.

  * `pageURL` is the `URL` to the current page being scanned.

  * `customData` is whatever was queued.

* `'junk'` is emitted on each skipped/unchecked link, as configured in options. Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

  * `customData` is whatever was queued.

* `'link'` is emitted with the result of each checked/unskipped link (broken or not) within the current page. Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

  * `customData` is whatever was queued.

* `'page'` is emitted after a page's last result, on zero results, or if the HTML could not be retrieved. Arguments:

  * `error` will be an `Error` if such occurred or `null` if not.

  * `pageURL` is the `URL` to the current page being scanned.

  * `customData` is whatever was queued.

* `'queue'` is emitted when a URL (link, page or site) is queued, dequeued or made active.

* `'robots'` is emitted after a site's robots.txt has been downloaded. Arguments:

  * `robots` is an instance of [robots-txt-guard](https://npmjs.com/robots-txt-guard).

  * `customData` is whatever was queued.

* `'site'` is emitted after a site's last result, on zero results, or if the *initial* HTML could not be retrieved. Arguments:

  * `error` will be an `Error` if such occurred or `null` if not.

  * `siteURL` is the `URL` to the current site being crawled.

  * `customData` is whatever was queued.

**Note:** the `filterLevel` option is used for determining which links are recursive.

### `UrlChecker`

Requests each queued URL to determine if they are broken. All methods from [`EventEmitter`](https://nodejs.org/api/events.html#events_class_eventemitter) are available.

```js

const {UrlChecker} = require('broken-link-checker');

const urlChecker = new UrlChecker(options)

  .on('error', (error) => {})

  .on('queue', () => {})

  .on('link', (result, customData) => {})

  .on('end', () => {});

urlChecker.enqueue(url, customData);

```

#### Methods & Properties

* `.clearCache()` will remove any cached URL responses.

* `.dequeue(id)` removes a URL from the queue. Returns `true` on success or `false` on failure.

* `.enqueue(url, customData)` adds a URL to the queue. Queue items are auto-dequeued when their requests are completed. Returns a queue ID on success. Arguments:

  * `url` must be a [`URL`](https://mdn.io/URL).

  * `customData` is optional data (of any type) that is stored in the queue item for the URL.

* `.has(id)` returns `true` if the queue contains an active or queued URL tagged with `id` and `false` if not.

* `.isPaused` returns `true` if the queue is paused and `false` if not.

* `.numActiveLinks` returns the number of links with active requests.

* `.numQueuedLinks` returns the number of links that currently have no active requests.

* `.pause()` will pause the queue, but will not pause any active requests.

* `.resume()` will resume the queue.

#### Events

* `'end'` is emitted when the end of the queue has been reached.

* `'error'` is emitted when an error occurs within any of your event handlers and will prevent the current scan from failing. Arguments:

  * `error` is the `Error`.

* `'junk'` is emitted for each skipped/unchecked result, as configured in options. Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

  * `customData` is whatever was queued.

* `'link'` is emitted for each checked/unskipped result (broken or not). Arguments:

  * `result` is a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js).

  * `customData` is whatever was queued.

* `'queue'` is emitted when a URL is queued, dequeued or made active.

## Options

### `cacheMaxAge`

Type: `Number`  

Default Value: `3_600_000` (1 hour)  

The number of milliseconds in which a cached response should be considered valid. This is only relevant if the `cacheResponses` option is enabled.

### `cacheResponses`

Type: `Boolean`  

Default Value: `true`  

URL request results will be cached when `true`. This will ensure that each unique URL will only be checked once.

### `excludedKeywords`

Type: `Array`  

Default value: `[]`  

Will not check links that match the keywords and glob patterns within this list. The only wildcards supported are [`*` and `!`](https://npmjs.com/matcher).

This option does *not* apply to `UrlChecker`.

### `excludeExternalLinks`

Type: `Boolean`  

Default value: `false`  

Will not check external links (different protocol and/or host) when `true`; relative links with a remote `` included.

This option does *not* apply to `UrlChecker`.

### `excludeInternalLinks`

Type: `Boolean`  

Default value: `false`  

Will not check internal links (same protocol and host) when `true`.

This option does *not* apply to `UrlChecker` nor `SiteChecker`'s *crawler*.

### `excludeLinksToSamePage`

Type: `Boolean`  

Default value: `false`  

Will not check links to the same page; relative and absolute fragments/hashes included. This is only relevant if the `cacheResponses` option is disabled.

This option does *not* apply to `UrlChecker`.

### `filterLevel`

Type: `Number`  

Default value: `1`  

The tags and attributes that are considered links for checking, split into the following levels:

* `0`: clickable links

* `1`: clickable links, media, frames, meta refreshes

* `2`: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms

* `3`: clickable links, media, frames, meta refreshes, stylesheets, scripts, forms, metadata

Recursive links have a slightly different filter subset. To see the exact breakdown of both, check out the [tag map](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/tags.js). `` is not listed because it is not a link, though it is always parsed.

This option does *not* apply to `UrlChecker`.

### `honorRobotExclusions`

Type: `Boolean`  

Default value: `true`  

Will not scan pages that search engine crawlers would not follow. Such will have been specified with any of the following:

* ``

* ``

* ``

* ``

* ``

* `X-Robots-Tag: noindex,nofollow,…`

* `X-Robots-Tag: googlebot: noindex,nofollow,…`

* `X-Robots-Tag: otherbot: noindex,nofollow,…`

* `X-Robots-Tag: unavailable_after: …`

* robots.txt


This option does *not* apply to `UrlChecker`.

### `includedKeywords`

Type: `Array`  

Default value: `[]`  

Will only check links that match the keywords and glob patterns within this list, _if any_. The only wildcard supported is `*`.

This option does *not* apply to `UrlChecker`.

### `includeLink`

Type: `Function`  

Default value: `link => true`  

A synchronous callback that is called after all other filters have been performed. Return `true` to include `link` (a [`Link`](https://github.com/stevenvachon/broken-link-checker/blob/master/lib/internal/Link.js)) in the list of links to be checked, or return `false` to have it skipped.

This option does *not* apply to `UrlChecker`.

### `includePage`

Type: `Function`  

Default value: `url => true`  

A synchronous callback that is called after all other filters have been performed. Return `true` to include `url` (a [`URL`](https://mdn.io/URL)) in the list of pages to be crawled, or return `false` to have it skipped.

This option does *not* apply to `UrlChecker` nor `HtmlUrlChecker`.

### `maxSockets`

Type: `Number`  

Default value: `Infinity`  

The maximum number of links to check at any given time.

### `maxSocketsPerHost`

Type: `Number`  

Default value: `2`  

The maximum number of links per host/port to check at any given time. This avoids overloading a single target host with too many concurrent requests. This will not limit concurrent requests to other hosts.

### `rateLimit`

Type: `Number`  

Default value: `0`  

The number of milliseconds to wait before each request.

### `requestMethod`

Type: `String`  

Default value: `'head'`  

The HTTP request method used in checking links. If you experience problems, try using `'get'`, however the `retryHeadFail` option should have you covered.

### `retryHeadCodes`

Type: `Array`  

Default value: `[405]`  

The list of HTTP status codes for the `retryHeadFail` option to reference.

### `retryHeadFail`

Type: `Boolean`  

Default value: `true`  

Some servers do not respond correctly to a `'head'` request method. When `true`, a link resulting in an HTTP status code listed within the `retryHeadCodes` option will be re-requested using a `'get'` method before deciding that it is broken. This is only relevant if the `requestMethod` option is set to `'head'`.

### `userAgent`

Type: `String`  

Default value: `'broken-link-checker/0.8.0 Node.js/14.16.0 (OS X; x64)'` (or similar)  

The HTTP user-agent to use when checking links as well as retrieving pages and robot exclusions.

## Handling Broken/Excluded Links

A broken link will have an `isBroken` value of `true` and a reason code defined in `brokenReason`. A link that was not checked (emitted as `'junk'`) will have a `wasExcluded` value of `true`, a reason code defined in `excludedReason` and a `isBroken` value of `null`.

```js

if (link.get('isBroken')) {

  console.log(link.get('brokenReason'));

  //-> HTTP_406

} else if (link.get('wasExcluded')) {

  console.log(link.get('excludedReason'));

  //-> BLC_ROBOTS

}

```

Additionally, more descriptive messages are available for each reason code:

```js

const {reasons} = require('broken-link-checker');

console.log(reasons.BLC_ROBOTS);       //-> Robots exclusion

console.log(reasons.ERRNO_ECONNRESET); //-> connection reset by peer (ECONNRESET)

console.log(reasons.HTTP_404);         //-> Not Found (404)

// List all

console.log(reasons);

```

Putting it all together:

```js

if (link.get('isBroken')) {

  console.log(reasons[link.get('brokenReason')]);

} else if (link.get('wasExcluded')) {

  console.log(reasons[link.get('excludedReason')]);

}

```

Finally, **it is important** to analyze links excluded with the `BLC_UNSUPPORTED` reason as it's possible for them to be broken.

## Roadmap Features

* `'info'` event with messaging such as 'Site does not support HTTP HEAD method' (regarding `retryHeadFail` option)

* add cheerio support by using parse5's htmlparser2 tree adaptor?

* load sitemap.xml at *start* of each `SiteChecker` site (since cache can expire) to possibly check pages that were not linked to, removing from list as *discovered* links are checked

* change order of checking to: tcp error, 4xx code (broken), 5xx code (undetermined), 200

* abort download of body when `options.retryHeadFail===true`

* option to retry broken links a number of times (default=0)

* option to scrape `response.body` for erroneous sounding text (using [fathom](https://npmjs.com/fathom-web)?), since an error page could be presented but still have code 200

* option to detect parked domain (302 with no redirect?)

* option to check broken link on archive.org for archived version (using [this lib](https://npmjs.com/archive.org))

* option to run `HtmlUrlChecker` checks on page load (using [jsdom](https://npmjs.com/jsdom)) to include links added with JavaScript?

* option to check if hashes exist in target URL document?

* option to parse Markdown in `HtmlChecker` for links

* option to check plain text URLs

* add throttle profiles (0–9, -1 for "custom") for easy configuring

* check [ftp:](https://npmjs.com/ftp), [sftp:](https://npmjs.com/ssh2) (for downloadable files)

* check ~~mailto:~~, news:, nntp:, telnet:?

* check that data URLs are valid (with [valid-data-url](https://www.npmjs.com/valid-data-url))?

* supply CORS error for file:// links on sites with a different protocol

* create an example with http://astexplorer.net

* use [debug](https://npmjs.com/debug)

* use [bunyan](https://npmjs.com/bunyan) with JSON output for CLI

* store request object/headers (or just auth) in `Link`?

* supply basic auth for "page" events?

* add option for `URLCache` normalization profiles

[npm-image]: https://img.shields.io/npm/v/broken-link-checker.svg

[npm-url]: https://npmjs.org/package/broken-link-checker

[ci-image]: https://img.shields.io/github/workflow/status/stevenvachon/broken-link-checker/tests

[coveralls-image]: https://img.shields.io/coveralls/stevenvachon/broken-link-checker.svg

[coveralls-url]: https://coveralls.io/github/stevenvachon/broken-link-checker

[greenkeeper-image]: https://badges.greenkeeper.io/stevenvachon/broken-link-checker.svg

[greenkeeper-url]: https://greenkeeper.io/
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/stevenvachon/broken-link-checker

Awesome Lists containing this project

README