https://github.com/ma3ke/roadtonowhere
Check documents for broken URLs.
- Host: GitHub
- URL: https://github.com/ma3ke/roadtonowhere
- Owner: ma3ke
- License: mit
- Created: 2023-03-03T00:21:15.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-03-17T18:47:15.000Z (over 2 years ago)
- Last Synced: 2025-02-13T10:17:50.148Z (4 months ago)
- Language: Python
- Size: 18.6 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Road to Nowhere 🚶‍♀️
A script to check files for broken HTTP URLs.
It finds the URLs in each of the specified documents and performs a GET request for each one.
It reports the status code of each request to stdout, making it clear which URLs are broken, and how many.
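As a rough illustration, the checking step could look something like the following minimal sketch. It is not the script's actual implementation: the `requests` dependency and the `check_urls` name are assumptions made here for illustration.

```python
import requests  # assumption: the real script may use a different HTTP client

def check_urls(urls):
    """GET each URL and print its status code to stdout; return the broken count."""
    broken = 0
    for url in urls:
        try:
            status = requests.get(url, timeout=10).status_code
        except requests.RequestException:
            # Connection failures, invalid hosts, etc. count as broken too.
            print(f"BROKEN: [error] {url}")
            broken += 1
            continue
        if status < 400:
            print(f"ok: [{status}] {url}")
        else:
            print(f"BROKEN: [{status}] {url}")
            broken += 1
    return broken
```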
## Installation

To install _roadtonowhere_, clone this repository, go into the directory, and install it using _pip_:
```console
git clone https://github.com/ma3ke/roadtonowhere
cd roadtonowhere
pip install .
```

## Usage
```
usage: roadtonowhere [-h] path [path ...]

check a document for broken urls

positional arguments:
  path        a file to be checked

options:
  -h, --help  show this help message and exit
```
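The help text above is what a plain `argparse` setup would produce. A minimal sketch that reproduces it (an assumption, not the script's actual source):

```python
import argparse

# Hypothetical parser matching the help text above.
parser = argparse.ArgumentParser(
    prog="roadtonowhere",
    description="check a document for broken urls",
)
parser.add_argument("path", nargs="+", help="a file to be checked")
args = parser.parse_args()  # args.path is a list of file paths to check
```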
### Example

To check the files in the _examples_ directory, run
```console
$ roadtonowhere examples/*
```

And the output will look something like this:
```
Parsing 'examples/example.html'... found 1 urls. Checking for broken urls...
ok: [200] https://www.iana.org/domains/example
Found 0 broken urls in 'examples/example.html'.

Parsing 'examples/example.md'... found 5 urls. Checking for broken urls...
ok: [200] https://hachyderm.io/@ma3ke
BROKEN: [404] https://example.com/this_page_does_not_exist.html
ok: [200] https://dwangschematiek.nl/
ok: [200] https://twitter.com/
BROKEN: [404] http://example.com/some_more_requests_to_non-existent_pages.html
Found 2 broken urls in 'examples/example.md'.
```

## Timeout
If a request takes more than 10 seconds, it will time out.
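With an HTTP client like `requests`, such a timeout can be enforced per request and caught explicitly. A minimal sketch, assuming `requests` and a hypothetical `fetch_status` helper (the script's actual handling may differ):

```python
import requests  # assumption: the script may use a different HTTP client

TIMEOUT_SECONDS = 10  # hypothetical constant name

def fetch_status(url):
    """Return the response status code, or None if the request timed out."""
    try:
        return requests.get(url, timeout=TIMEOUT_SECONDS).status_code
    except requests.exceptions.Timeout:
        print(f"timeout: request took more than {TIMEOUT_SECONDS} seconds {url}")
        return None
```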
The timeout is reported in the output as such:

```
...
ok: [200] https://github.com/robertdavidgraham/masscan
ok: [200] https://github.com/gvb84/pbscan
timeout: request took more than 10 seconds http://www.hping.org/
ok: [200] https://github.com/traviscross/mtr
ok: [200] https://github.com/mehrdadrad/mylg
...
```

## Filetypes
Currently, the script has good support for extracting URLs from HTML files.
For finding URLs in HTML files, `html.parser.HTMLParser` is used.

For Markdown and other filetypes, there is naive support:
- `.md` **Markdown** (heuristic: starts with `(http{,s}://`, ends with `)`)
- other filetypes (heuristic: starts with `http{,s}://`, ends with a whitespace character)

These heuristics are imperfect, but they do get the job done in most circumstances.
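For a sense of what the two strategies look like, here is a minimal sketch. Only the use of `html.parser.HTMLParser` is taken from the description above; the `LinkExtractor` and `naive_urls` names, and the exact regex, are hypothetical.

```python
import re
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect http(s) URLs from href and src attributes in an HTML document."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value and value.startswith(("http://", "https://")):
                self.urls.append(value)

# Naive heuristic for other filetypes: a URL starts with http(s)://
# and runs until the next whitespace (or a ')', for Markdown links).
URL_PATTERN = re.compile(r"https?://[^\s)]+")

def naive_urls(text):
    return URL_PATTERN.findall(text)
```

After `parser = LinkExtractor()` and `parser.feed(html_text)`, the collected URLs are in `parser.urls`.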
Created by ma3ke/Koen Westendorp, 2023. I hope you have a nice day :)