https://github.com/bootlin/pdf-link-checker

Checks for broken hyperlinks in PDF documents

Host: GitHub
URL: https://github.com/bootlin/pdf-link-checker
Owner: bootlin
License: other
Created: 2019-02-14T06:12:30.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2023-11-15T17:12:50.000Z (over 2 years ago)
Last Synced: 2023-12-17T11:36:11.422Z (over 2 years ago)
Language: Python
Homepage: https://bootlin.com/blog/pdf-link-checker/
Size: 87.9 KB
Stars: 19
Watchers: 8
Forks: 5
Open Issues: 7
Metadata Files:
- Readme: README.rst
- License: LICENSE


================
pdf-link-checker
================

**pdf-link-checker** is a simple tool that parses a PDF document and checks for
broken hyperlinks. It does this by sending a simple HTTP request to each link
found in the document.
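
The per-link check described above can be sketched as follows. This is a minimal illustration for HTTP(S) URLs using only the standard library, not the tool's actual code: the function name ``check_url``, the ``opener`` parameter, and the 10-second timeout are assumptions made for the example.

```python
import urllib.error
import urllib.request


def check_url(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for one HTTP(S) URL.

    `opener` is a hypothetical hook that lets the example be exercised
    without network access; by default it is urllib's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (404, 500, ...).
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # DNS failure, refused connection, timeout, malformed URL, ...
        return False, f"{type(exc).__name__}: {exc}"
    return status < 400, f"HTTP {status}"
```

A link counts as broken here when the request raises an error or returns a status of 400 or above; a real checker may want a different policy for redirects or rate-limit responses.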

Getting it running
==================

::

    pip install git+https://github.com/bootlin/pdf-link-checker.git
    export PATH=$HOME/.local/bin:$PATH
    pdf-link-checker my-awesome-slides.pdf

Options
=======

* --max-threads

  Specifies the maximum number of threads allowed (default: 100).
  To speed up the run, pdf-link-checker launches several threads
  to check several links in parallel.
  This option caps the number of threads.

* --max-requests-per-host

  Specifies the maximum number of requests allowed per host.
  Several URLs may belong to the same host, and since pdf-link-checker
  can check many URLs at the same time, you may want to limit
  the number of requests per host.
  Otherwise, some hosts may mistake the check for a DoS attack.
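
How the two limits can interact is sketched below. This is an illustration under stated assumptions, not pdf-link-checker's implementation: ``check_all`` and its parameters are hypothetical names, the per-host limit is interpreted as a cap on concurrent requests, and the default of 5 per host is made up for the example (only the thread default of 100 comes from the option description above).

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlsplit


def check_all(urls, check, max_threads=100, max_requests_per_host=5):
    """Run check(url) for every URL and return {url: result}.

    At most `max_threads` checks run at once overall, and at most
    `max_requests_per_host` of them target the same host.
    """
    urls = list(urls)
    # One semaphore per host, created on first use.
    host_slots = defaultdict(
        lambda: threading.BoundedSemaphore(max_requests_per_host)
    )
    slots_lock = threading.Lock()

    def limited_check(url):
        host = urlsplit(url).hostname or ""
        with slots_lock:  # guard creation of the per-host semaphore
            slot = host_slots[host]
        with slot:  # blocks while the host is already at its limit
            return check(url)

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        results = list(pool.map(limited_check, urls))
    return dict(zip(urls, results))
```

One trade-off of this shape: a thread waiting for a busy host still occupies a slot in the pool, so a document dominated by one host can leave the other threads idle.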

Getting help
============

You can get support by reporting your issue on this project's GitHub
issue tracker: https://github.com/bootlin/pdf-link-checker/issues

TODO
====

*(...because there's no active project without a TODO list!)*

* Fix: some documents are failing on doc.initialize().

* Fix: if the URL points to a huge document, we should just check that it
  is reachable and not download it entirely.

* Replace the thread array with a nice thread pool.
  Each thread from the pool should take a URL from a (protected) queue.
  We could also have one queue per host and thus handle the
  max-requests-per-host constraint without a separate parameter.
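
The pool-and-queue idea in the last item could look roughly like this. It is a sketch of the proposal only, with a single shared queue; the per-host queues are not shown, and ``run_pool``, ``check``, and ``num_workers`` are hypothetical names chosen for the example.

```python
import queue
import threading


def run_pool(urls, check, num_workers=4):
    """Check every URL with a fixed pool of workers fed from one queue."""
    pending = queue.Queue()  # queue.Queue handles its own locking
    for url in urls:
        pending.put(url)

    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                url = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker is done
            try:
                outcome = check(url)
            except Exception as exc:  # one bad URL must not kill the worker
                outcome = exc
            with results_lock:
                results[url] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Because every URL is queued before the workers start, an empty queue reliably means there is no work left, so no sentinel values are needed.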

Version History
===============

1.2.0
  * Repair breakage against newer versions of pdfminer

1.1.1
  * Remove an extra, leftover print

1.1.0
  * Only allow https and ftp URIs. This avoids failures on mailto:
    and file:// URIs.
  * Add better exception handling to avoid crashing
  * Add better timeout and request exception handling
  * Fix broken thread management
  * Remove stupid double-requests
  * Several small fixes

1.0.2
  * Updated repo location
  * Moved from distutils to setuptools

1.0.1
  * Version bump

1.0
  * Initial release
