Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/timaschew/link-checker
🚀Superfast link checker for HTML pages
broken-links link-checker
Last synced: 2 months ago
- Host: GitHub
- URL: https://github.com/timaschew/link-checker
- Owner: timaschew
- License: mit
- Created: 2018-04-01T21:50:25.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2023-09-21T07:21:16.000Z (over 1 year ago)
- Last Synced: 2024-11-05T16:39:29.682Z (2 months ago)
- Topics: broken-links, link-checker
- Language: JavaScript
- Homepage:
- Size: 106 KB
- Stars: 18
- Watchers: 3
- Forks: 10
- Open Issues: 16
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- project-awesome - timaschew/link-checker - 🚀Superfast link checker for HTML pages (JavaScript)
README
# Link Checker
Link checker for HTML pages which checks `href` attributes including the anchor in the target.
The command-line interface expects a directory on your local file system, which it scans.

**Why did I write this tool?**
I was using a nice CLI called [html-proofer](https://github.com/gjtorikian/html-proofer), but it needed a preprocessing step to get Javadoc and Scaladoc working because of their iframe setup. At some point that no longer scaled: checking the Scaladoc links with html-proofer took 5 minutes.
`link-checker` uses [cheerio](https://github.com/cheeriojs/cheerio) for parsing HTML, which in turn uses the fastest HTML parser for Node.js: [htmlparser2](https://github.com/fb55/htmlparser2). The same Scaladoc that took 5 minutes with html-proofer now takes 5 seconds with `link-checker`. URL transformation for iframes can also be turned on on the fly via `--javadoc`. In this mode, a link like `/index.html#com.org.company.product.library.Main@init` is checked against an HTML file at the path `com/org/company/product/library/Main.html` and the anchor `init`.
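
The `--javadoc` mapping described above can be sketched in a few lines. This is only an illustration of the transformation, not the tool's actual code: the function name `resolveJavadocDeeplink` and the shape of its return value are assumptions made for this example.

```javascript
// Hypothetical helper, not part of link-checker's API: maps a Javadoc/Scaladoc
// iframe deeplink to the HTML file and anchor that would need to exist.
function resolveJavadocDeeplink(href) {
  const hashIndex = href.indexOf('#');
  if (hashIndex === -1) {
    return null; // no fragment, so there is nothing to transform
  }
  const fragment = href.slice(hashIndex + 1);
  // Assumed fragment form: "<fully.qualified.Name>@<anchor>", anchor optional.
  const atIndex = fragment.indexOf('@');
  const qualifiedName = atIndex === -1 ? fragment : fragment.slice(0, atIndex);
  const anchor = atIndex === -1 ? null : fragment.slice(atIndex + 1);
  if (qualifiedName === '') {
    return null;
  }
  return {
    // Dots in the qualified name become path separators.
    file: qualifiedName.split('.').join('/') + '.html',
    anchor: anchor
  };
}

const target = resolveJavadocDeeplink(
  '/index.html#com.org.company.product.library.Main@init'
);
// target.file is 'com/org/company/product/library/Main.html'
// target.anchor is 'init'
console.log(target.file, target.anchor);
```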
## FAQ
##### I need to check links on a website via http(s)
Just use [website-scraper](https://github.com/website-scraper/node-website-scraper) to download all the pages to your file system.

I've used the module with these options:
```javascript
{
  urls: [urlToScrape],
  directory: outputDirectory,
  recursive: true,
  filenameGenerator: 'bySiteStructure',
  // only download URLs that belong to the site being scraped
  urlFilter: function(url) {
    return url.indexOf(urlToScrape) !== -1;
  }
}
```

## Installation
### NPM
You can install it via npm:
```
npm install -g link-checker
```

You can also install it without `-g`, but then you need to add the binary, located at `node_modules/.bin/link-checker`, to your `$PATH`.

### Docker
```
docker pull timaschew/link-checker
```

## Usage
```
You need to pass exactly one path where to check links
Usage: link-checker path [options]

Options:
  --version             Show version number                            [boolean]
  --allow-hash-href     If `true`, ignores the `href` `#`              [boolean]
  --disable-external    disable checks HTTP links                      [boolean]
  --external-only       check HTTP links only                          [boolean]
  --file-ignore         RegExp to ignore files to scan                   [array]
  --url-ignore          RegExp to ignore URLs                            [array]
  --url-swap            RegExp for URLs which can be replaced on the fly [array]
  --limit-scope         forbid to follow URLs which are out of provided path,
                        like ../somewhere                              [boolean]
  --mkdocs              transforming URLS from foo/#bar to foo/index.html#bar
                                                                       [boolean]
  --javadoc             Enable special URL transforming which allows to check
                        iframe deeplinks for local javadoc and scaladoc [boolean]
  --javadoc-external    Domain or base URL to do URL transformation to check
                        iframe deeplinks                                 [array]
  --http-status-ignore  pass HTTP status code which will be ignore, by default
                        only 2xx are allowed                             [array]
  --json                print errors as JSON                           [boolean]
  --http-redirects      Amount of allowed HTTP redirects            [default: 0]
  --http-timeout        HTTP timeout in milliseconds             [default: 5000]
  --http-always-get     Use always HTTP GET requests, by default HEAD is used
                        for pages without any anchors                  [boolean]
  --warn-name-attr      show warning if name attribute instead of id was used
                        for an anchor                                  [boolean]
  --http-cache          Directory to store the non failing HTTP responses. If
                        none is specified responses won't be cached.    [string]
  --http-cache-max-age  Invalidate the cache after the given period. Allowed
                        values: https://www.npmjs.com/package/ms [default: "1w"]
  -h, --help            Show help                                      [boolean]

Examples:
  link-checker path/to/html/files  checks directory with HTML files for broken
                                   links and anchors
```

## `linkcheckerrc` configuration
The options above can, alternatively or additionally, be provided in a `.linkcheckerrc` file in the project root:

```json
{
  "allow-hash-href": true,
  "disable-external": true,
  ...
}
```

In addition, this format lets you override these settings for URLs that match a regular expression:
```json
{
  "overrides": {
    "https://www\\.google.com/#": {
      "allow-hash-href": true,
      "http-status-ignore": [403, 404]
    },
    "marketplace\\.visualstudio\\.com": {
      "http-always-get": true
    }
  }
}
```
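
To make the override format more concrete, here is a minimal sketch of how such per-URL overrides could be resolved. It assumes that each key under `overrides` is compiled as a regular expression and that the settings of every matching entry are merged over the top-level ones; the README does not spell out the exact matching or precedence rules, and the function name `settingsForUrl` is hypothetical.

```javascript
// Hypothetical sketch: merges per-URL overrides over the base settings.
// It illustrates the config format; it is not link-checker's actual implementation.
function settingsForUrl(config, url) {
  const { overrides = {}, ...base } = config;
  let settings = { ...base };
  for (const [pattern, override] of Object.entries(overrides)) {
    // Assumption: each key is a regular expression tested against the full URL.
    if (new RegExp(pattern).test(url)) {
      settings = { ...settings, ...override };
    }
  }
  return settings;
}

// Same shape as a parsed .linkcheckerrc file.
const config = {
  'disable-external': false,
  'http-always-get': false,
  overrides: {
    'marketplace\\.visualstudio\\.com': { 'http-always-get': true }
  }
};

const marketplace = settingsForUrl(
  config,
  'https://marketplace.visualstudio.com/items?itemName=foo'
);
const other = settingsForUrl(config, 'https://example.com/');
// marketplace['http-always-get'] is true; other['http-always-get'] is false
console.log(marketplace['http-always-get'], other['http-always-get']);
```

Note that the backslashes are doubled in the JSON file because `\` must itself be escaped inside a JSON string; the regular expression that results is `marketplace\.visualstudio\.com`.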