https://github.com/gsauthof/adf2pdf

automate the workflow around ADF scanning, OCR and PDF creation
adf duplex-scanning ocr pdf pdf-generation sane scanning tesseract

Host: GitHub
URL: https://github.com/gsauthof/adf2pdf
Owner: gsauthof
License: gpl-3.0
Created: 2017-10-08T20:31:37.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2025-10-11T11:17:56.000Z (8 months ago)
Last Synced: 2025-12-21T23:46:20.337Z (6 months ago)
Topics: adf, duplex-scanning, ocr, pdf, pdf-generation, sane, scanning, tesseract
Language: Python
Homepage:
Size: 35.2 KB
Stars: 7
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: COPYING

README

adf2pdf - a tool that turns a batch of paper pages into a PDF
with a text layer.  By default, it detects empty pages (as they
may easily occur during duplex scanning) and excludes them from
the OCR and the resulting PDF.
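
How adf2pdf decides that a page is empty is not spelled out here. As a minimal sketch, assuming a simple ink-ratio threshold on a black-and-white image (the function name, the threshold value and the pixel encoding are all hypothetical, not taken from adf2pdf):

```python
def is_empty_page(pixels, threshold=0.001):
    """Return True if the fraction of black pixels is below threshold.

    Assumption: `pixels` is a flat sequence where 0 means black (ink)
    and any other value means white (paper), as in a lineart scan.
    The threshold is an illustrative guess, not adf2pdf's value.
    """
    total = len(pixels)
    if total == 0:
        return True  # nothing scanned counts as empty
    black = sum(1 for p in pixels if p == 0)
    return black / total < threshold


# A mostly white page with a few specks of dust vs. a page with text.
dusty = [255] * 9995 + [0] * 5
written = [255] * 9000 + [0] * 1000
print(is_empty_page(dusty), is_empty_page(written))
```

A small non-zero threshold tolerates dust and scanner noise on the blank back sides that duplex scanning produces.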

For that, it uses [Sane's][5] [scanimage][6] for the scanning,
[Tesseract][4] for the [optical character recognition][ocr] (OCR), and
the Python packages [img2pdf][9], [Pillow (PIL)][10] and
[PyPDF2][11] for some image-processing tasks and PDF mangling.

Example:

    $ adf2pdf contract-xyz.pdf

2017, Georg Sauthoff 

## Features

- Automatic document feed (ADF) support
- Fast empty page detection
- Overlapping of scanning, image processing, OCR and PDF creation
  to minimize the total runtime
- Fast creation of small PDFs using the fine [img2pdf][9] package
- Uses only safe compression methods, i.e. no error-prone
  symbol-segmentation-style compression like [JBIG2][12] or JB2,
  as used in [Xerox photocopiers][12] and the DjVu format.
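
The overlapping of stages mentioned in the feature list can be pictured as a producer/consumer pipeline. The following is only a sketch of that idea using the standard library; `scan_pages` and `ocr_page` are hypothetical stand-ins, and adf2pdf's real implementation may be structured differently:

```python
from concurrent.futures import ThreadPoolExecutor


def scan_pages(n):
    """Hypothetical stand-in for the scanner: yields page names one by one."""
    for i in range(1, n + 1):
        yield f"page-{i:03d}.png"


def ocr_page(name):
    """Hypothetical stand-in for OCR: returns the name of a one-page PDF."""
    return name.replace(".png", ".pdf")


def run_pipeline(n_pages, workers=4):
    # Each page is handed to the worker pool as soon as the scanner
    # produces it, so OCR of early pages overlaps with later scanning.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(ocr_page, p) for p in scan_pages(n_pages)]
        # Collect in submission order to keep the page order stable.
        return [f.result() for f in futures]


print(run_pipeline(3))
```

Collecting the results in submission order rather than completion order keeps the pages in the sequence in which they were scanned.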

## Install Instructions

Adf2pdf can be directly installed with [`pip`][13], e.g.

    $ pip3 install --user adf2pdf

or

    $ pip3 install adf2pdf

See also the [PyPI adf2pdf project page][14].

Alternatively, the Python file `adf2pdf.py` can be directly
executed in a cloned repository, e.g.:

    $ ./adf2pdf.py report.pdf

In addition to that, one can install the development version from
a cloned work-tree like this:

    $ pip3 install --user .

## Hardware Requirements

A scanner with automatic document feed (ADF) that is supported by
Sane. For example, the [Fujitsu ScanSnap S1500][1] works
well. That model supports duplex scanning, which is quite
convenient.

## Example continued

Running _adf2pdf_ for a 7-sheet example document takes 150 seconds
on an i7-6600U (Intel Skylake, 2 cores, 4 threads) CPU (using the ADF of the
Fujitsu ScanSnap S1500). With the defaults, _adf2pdf_ calls
`scanimage` for duplex scanning into 600 dpi lineart (black and
white) images. Duplex scanning of 7 sheets yields 14 page images; in
this example, 6 of them are empty and thus automatically excluded,
i.e. the resulting PDF then just contains 8 pages.
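
The defaults described above (duplex, 600 dpi, lineart) could translate into a `scanimage` call roughly like the one built below. This is a sketch, not necessarily the command line adf2pdf uses: `--source`, `--mode` and `--resolution` are backend-specific options, and the values shown are assumptions based on what the SANE fujitsu backend typically accepts. The helper name and the output pattern are hypothetical.

```python
import shlex


def scanimage_args(out_pattern="page-%03d.png", resolution=600):
    """Build a scanimage argument list for duplex ADF lineart scanning.

    Assumption: the option values below suit the SANE fujitsu backend;
    other scanners may name their sources and modes differently.
    """
    return [
        "scanimage",
        "--source", "ADF Duplex",
        "--mode", "Lineart",
        "--resolution", str(resolution),
        "--format=png",
        f"--batch={out_pattern}",
    ]


# Print a copy-and-paste friendly, shell-quoted version of the command.
print(shlex.join(scanimage_args()))
```

Running `scanimage --help` with the scanner attached lists the options the active backend really supports.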

The resulting PDF contains a text layer from the OCR such that
one can search and copy and paste text. It is 1.1 MiB in size,
i.e. a page is stored in 132 KiB, on average.

## Related Work

In case you have existing PDF files without a text layer or a scan
appliance that spits out PDFs but doesn't support OCR,
[OCRmyPDF](https://github.com/ocrmypdf/OCRmyPDF)
may be a good fit.
It takes a PDF file, applies OCR to each page and adds the
result as a text layer to the input PDF file.
It's written in Python and also uses Tesseract for the OCR.

For users that prefer a GUI,
[Skanpage](https://invent.kde.org/utilities/skanpage) may fit the bill.
As the name suggests, it's a KDE application that provides a
clean and modern graphical interface to scanning.
Unlike a few other GUI alternatives, it _does_ also integrate OCR
via Tesseract.
For example, 'Gnome Document Scanner' (a.k.a. simple-scan) and
Skanlite (also KDE) do **not** support OCR, as of 2025.

## Software Requirements

The script assumes Tesseract version 4, by default. Version 3 can
be used as well, but the [new neural network system in Tesseract
4][8] performs orders of magnitude better than the old OCR model.
Tesseract 4.0.0 was released in late 2018; thus, distributions
released in that time frame may still include only version 3 in
their repositories (e.g. Fedora 29, while Fedora 30 features version
4). Since version 4 is so much better at OCR, I can't recommend it
enough over version 3.
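
To find out which major version is installed, one can inspect the output of `tesseract --version`. A minimal sketch, assuming the first line has the form `tesseract 4.1.1` (the helper name and the sample string are made up for illustration):

```python
import re


def tesseract_major(version_output):
    """Extract the major version from `tesseract --version` output.

    Assumption: the output contains a token like 'tesseract 4.1.1';
    returns None if no version number is found.
    """
    m = re.search(r"tesseract\s+v?(\d+)\.", version_output)
    return int(m.group(1)) if m else None


sample = "tesseract 4.1.1\n leptonica-1.82.0\n"  # illustrative sample only
print(tesseract_major(sample))
```

In practice the string would come from running the command, e.g. via `subprocess.run(["tesseract", "--version"], capture_output=True, text=True)`.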

Tesseract 4 notes (in case you need to build it from the sources):

- [Build instructions][2] - warning: if you miss the
  `autoconf-archive` dependency you'll get weird autoconf error
  messages
- [Data files][3] - you need the training data for your
  languages of choice and the OSD data

Python packages:

- [img2pdf][9] (Fedora package: python3-img2pdf)
- [Pillow (PIL)][10] (Fedora package: python3-pillow-devel)
- [PyPDF2][11] (Fedora package: python3-PyPDF2)
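
Whether the three packages are importable can be checked quickly before the first scan. This is a small sketch and not part of adf2pdf; the helper name is hypothetical, and the only subtlety is that Pillow is imported under the module name `PIL`:

```python
from importlib.util import find_spec

# Package name -> import name; note that Pillow is imported as PIL.
REQUIRED = {"img2pdf": "img2pdf", "Pillow": "PIL", "PyPDF2": "PyPDF2"}


def missing_packages(required=REQUIRED):
    """Return the package names whose import module cannot be found."""
    return [pkg for pkg, mod in required.items() if find_spec(mod) is None]


# An empty list means all three packages are available.
print(missing_packages())
```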

[1]: http://www.fujitsu.com/us/products/computing/peripheral/scanners/product/eol/s1500/

[2]: https://github.com/tesseract-ocr/tesseract/wiki/Compiling-–-GitInstallation

[3]: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

[4]: https://en.wikipedia.org/wiki/Tesseract_(software)

[5]: https://en.wikipedia.org/wiki/Scanner_Access_Now_Easy

[6]: http://www.sane-project.org/man/scanimage.1.html

[7]: https://en.wikipedia.org/wiki/Optical_character_recognition

[8]: https://github.com/tesseract-ocr/tesseract/wiki/NeuralNetsInTesseract4.00

[9]: https://pypi.org/project/img2pdf/

[10]: http://python-pillow.github.io/

[11]: https://github.com/mstamy2/PyPDF2

[12]: https://en.wikipedia.org/wiki/JBIG2

[13]: https://en.wikipedia.org/wiki/Pip_(package_manager)

[14]: https://pypi.org/project/adf2pdf/

[ocr]: https://en.wikipedia.org/wiki/Optical_character_recognition
