https://github.com/victoriadrake/hydra-link-checker
Hydra: a multithreaded site-crawling link checker in Python standard library
- Host: GitHub
- URL: https://github.com/victoriadrake/hydra-link-checker
- Owner: victoriadrake
- License: mit
- Created: 2020-02-06T15:07:54.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2023-02-20T14:32:51.000Z (over 1 year ago)
- Last Synced: 2024-10-12T00:46:48.731Z (about 1 month ago)
- Topics: ci-cd, link-checker, link-checking, python, website
- Language: Python
- Homepage:
- Size: 30.3 KB
- Stars: 122
- Watchers: 5
- Forks: 26
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
# Hydra: multithreaded site-crawling link checker in Python
![Tests status badge](https://github.com/victoriadrake/hydra-link-checker/workflows/test/badge.svg)
A Python program that ~~crawls~~ slithers 🐍 a website for links and prints a YAML report of broken links.
## Requires
Python 3.6 or higher.
There are no external dependencies, Neo.
## Usage
```sh
$ python hydra.py -h
usage: hydra.py [-h] [--config CONFIG] URL
```
Positional arguments:
- `URL`: The URL of the website to crawl. Ensure `URL` is absolute and includes the scheme, e.g. `https://example.com`.

Optional arguments:
- `-h`, `--help`: Show help message and exit
- `--config CONFIG`, `-c CONFIG`: Path to a configuration file

Hydra prints the broken links report to stdout in [YAML](https://yaml.org/) format. To save the report to a file, redirect the output:
```sh
python hydra.py [URL] > [PATH/TO/FILE.yaml]
```
You can add the current date to the filename using command substitution, such as:
```sh
python hydra.py [URL] > /path/to/$(date '+%Y_%m_%d')_report.yaml
```
To see how long Hydra takes to check your site, add `time`:
```sh
time python hydra.py [URL]
```

### GitHub Action
You can easily incorporate Hydra as part of an automated process using the [link-snitch](https://github.com/victoriadrake/link-snitch) action.
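
Outside of GitHub Actions, the shell commands above could also be scripted from Python. The following is a minimal sketch, not part of Hydra: the function names `report_path`, `build_command`, and `run_hydra` are hypothetical, and it assumes `hydra.py` sits in the current working directory. It uses only the documented `URL` argument and `--config` option.

```python
import datetime
import subprocess
import sys
from pathlib import Path


def report_path(directory, day=None):
    """Return a date-stamped path such as <directory>/2024_01_31_report.yaml."""
    day = day or datetime.date.today()
    return Path(directory) / f"{day:%Y_%m_%d}_report.yaml"


def build_command(url, config=None, script="hydra.py"):
    """Build the argument list for the documented CLI: hydra.py URL [--config CONFIG]."""
    command = [sys.executable, script, url]
    if config is not None:
        command += ["--config", str(config)]
    return command


def run_hydra(url, directory=".", config=None):
    """Run Hydra, redirect its stdout (the YAML report) to a file,
    and return the process exit code together with the report path."""
    destination = report_path(directory)
    with open(destination, "w", encoding="utf-8") as report:
        result = subprocess.run(build_command(url, config), stdout=report)
    return result.returncode, destination
```

Calling `run_hydra("https://example.com", directory="reports")` would write the report to a date-stamped file inside `reports` (the directory must already exist) and return Hydra's exit code alongside the path, which an automated job could use to decide whether to fail.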
## Configuration
Hydra can accept an optional JSON configuration file for specific parameters, for example:
```json
{
    "OK": [
        200,
        999,
        403
    ],
    "attrs": [
        "href"
    ],
    "exclude_scheme_prefixes": [
        "tel"
    ],
    "tags": [
        "a",
        "img"
    ],
    "threads": 25,
    "timeout": 30,
    "graceful_exit": "True"
}
```
To use a configuration file, supply the filename:
```sh
python hydra.py https://example.com --config ./hydra-config.json
```
Possible settings:
- `OK` - [HTTP response codes](https://developer.mozilla.org/en-US/docs/Web/HTTP/Status) to consider as a successful link check. Defaults to `[200, 999]`.
- `attrs` - Attributes of the HTML tags to check for links. Defaults to `["href", "src"]`.
- `exclude_scheme_prefixes` - URL scheme prefixes to exclude from checking. Defaults to `["tel:", "javascript:"]`.
- `tags` - HTML tags to check for links. Defaults to `["a", "link", "img", "script"]`.
- `threads` - Maximum workers to run. Defaults to `50`.
- `timeout` - Maximum seconds to wait for HTTP response. Defaults to `60`.
- `graceful_exit` - If set to `True`, Hydra exits with code `0` even when broken links are found; otherwise broken links cause exit code `1`.

## Test
Run:
```sh
python -m unittest tests/test.py
```