https://github.com/ocr-d/ocrd_pagetopdf
OCR-D wrapper for prima-pagetopdf
https://github.com/ocr-d/ocrd_pagetopdf
ocr ocr-d prima-pagetopdf
Last synced: 9 months ago
JSON representation
OCR-D wrapper for prima-pagetopdf
- Host: GitHub
- URL: https://github.com/ocr-d/ocrd_pagetopdf
- Owner: OCR-D
- License: apache-2.0
- Created: 2020-03-20T13:22:18.000Z (about 6 years ago)
- Default Branch: master
- Last Pushed: 2025-05-21T11:11:47.000Z (about 1 year ago)
- Last Synced: 2025-08-24T14:39:14.473Z (10 months ago)
- Topics: ocr, ocr-d, prima-pagetopdf
- Language: Python
- Homepage:
- Size: 2.91 MB
- Stars: 9
- Watchers: 5
- Forks: 7
- Open Issues: 3
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# ocrd-pagetopdf
> OCR-D wrapper for prima-page-to-pdf
[](https://github.com/OCR-D/ocrd_pagetopdf/actions/workflows/ci.yml)
[](https://github.com/OCR-D/ocrd_pagetopdf/actions/workflows/docker.yml)
[](https://pypi.org/project/ocrd-pagetopdf/)
Contents:
* [Introduction](#introduction)
* [Requirements](#requirements)
* [Installation](#installation)
* [With Docker](#with-docker)
* [Native, from PyPI](#native-from-pypi)
* [Native, from git](#native-from-git)
* [Usage](#usage)
* [ocrd-pagetopdf](#ocrd-pagetopdf)
* [ocrd-altotopdf](#ocrd-altotopdf)
* [FAQ](#faq)
## Introduction
This package offers [OCR-D](https://ocr-d.de/en/spec) compliant
[workspace processors](https://ocr-d.de/en/spec/cli) for conversion of OCR data
represented in [METS](https://ocr-d.de/en/spec/mets) (on the document level)
and [PAGE](https://ocr-d.de/en/spec/page)
or [ALTO](https://www.loc.gov/standards/alto/)
(on the page level) to PDF.
It transforms both the scan image (_facsimile_) and annotations (_text overlay_),
optionally drawing _polygon outlines_ for text regions / lines / words / glyphs.
Optionally _validates_ the structural annotation and fixes its coordinates before
attempting conversion.
The text layer is generated from the textual annotation on the configured _level_
of the structural hierarchy (region / line / word / glyph). It is rendered with a
configurable _font_ (which is useful to make sure all codepoints are covered by
adequate glyphs, esp. in historic prints and manuscripts).
The _page labels_ can be configured to use various attributes from the
physical pages of the METS.
A _table of contents_ will be added according to the labels of the
recursive `mets:div` logical structure.
## Requirements
- GNU `make`
- Python 3 with `pip` and `venv`
- [OCR-D](https://github.com/OCR-D/core)
- Java runtime (OpenJDK ≥8 works for [PageToPdf](https://github.com/PRImA-Research-Lab/prima-page-to-pdf/releases) 1.1.2)
## Installation
### With Docker
This is the best option if you want to run the software in a container.
You need to have [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/)
docker pull ocrd/pagetopdf
To run with docker:
docker run -v path/to/workspaces:/data ocrd/pagetopdf ocrd-pagetopdf ...
### Native, from PyPI
This is the best option if you want to use the stable, released version.
After installing Python and Java, simply do:
pip install ocrd_pagetopdf
### Native, from git
Use this option if you want to change the source code or install the latest, unpublished changes.
We strongly recommend to use [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).
After installing `make`, assuming you are on a Debian/Ubuntu OS, you can do:
sudo make deps-ubuntu
Otherwise, simulate this step and install requirements with equivalent actions on your system:
make -n deps-ubuntu
...
Finally, to install the Python package, do:
make install
# or equivalently:
pip install .
## Usage
The command-line interface `ocrd-pagetopdf` conforms to [OCR-D processor](https://ocr-d.de/en/spec/cli) specifications.
Assuming you have an [OCR-D workspace](https://ocr-d.de/en/user_guide#preparing-a-workspace) in your current working directory, simply do:
ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP -P textequiv_level word
This will run the script and create PDF files for each page with a text layer based on word-level annotations.
In order to create an additional multipage file for the entire document, named `merged.pdf`,
concatenating the single page PDFs in physical order and with page labels and contents, do:
ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP -P textequiv_level word -P multipage merged
In case your workspace does not contain fulltext in **PAGE** format, but **ALTO**, there is a dedicated
processor CLI `ocrd-altotopdf`, with some limitations compared to the former:
- You need to _manually_ select the fileGrp providing the images which match the annotation coordinates,
passing it as second input fileGrp. (The image references are required by PAGE, but not by ALTO.)
- The images are _not_ generated on-the-fly according to all annotations (from existing `AlternativeImage`s,
or by cropping via coordinates into the higher-level image, and deskewing when applicable), and _not_
chosen via `input_feature_selector` / `input_feature_filter` mechanism. Instead, only the original
images can be used here.
- The annotations are _not_ tested comprehensively regarding validity and consistency of coordinates and
then repaired. Instead, only superficial checks and repairs can be applied (like negative coordinates).
Assuming you have a workspace representing a typical [DFG-conforming](https://dfg-viewer.de/) METS,
with `FULLTEXT` for ALTO and `DEFAULT` for the original images, do:
ocrd-altotopdf -I FULLTEXT,DEFAULT -O PDF-FILEGRP -P textequiv_level word -P multipage merged
For more options and explanations, see below.
### ocrd-pagetopdf
OCR-D CLI
Usage: ocrd-pagetopdf [worker|server] [OPTIONS]
Convert text and layout annotations from PAGE to PDF format (overlaying original image with text layer and polygon outlines)
> Converts all pages of the document to PDF
> For each page, open and deserialize PAGE input file and its
> respective image. Then extract a derived image of the (cropped,
> deskewed, binarized...) page, with features depending on
> ``image_feature_selector`` (a comma-separated list of required image
> features, cf. :py:func:`ocrd.workspace.Workspace.image_from_page`)
> and ``image_feature_filter`` (a comma-separated list of forbidden
> image features).
> Next, generate a temporary PAGE output file for that very image
> (adapting all coordinates if necessary). If ``negative2zero`` is
> set, validate and repair invalid or inconsistent coordinates.
> Convert the PAGE/image pair with PRImA PageToPdf, applying
> - ``textequiv_level`` (i.e. `-text-source`) to retrieve a text layer, if set;
> - ``outlines`` to draw boundary polygons, if set;
> - ``font`` accordingly.
> Copy the resulting PDF file to the output file group and reference
> it in the METS.
> Finally, if ``multipage`` is set, then concatenate all generated
> files to a multi-page PDF file, setting ``pagelabels`` accordingly,
> as well as PDF metadata and bookmarks. Reference it with
> ``multipage`` as ID in the output file group, too. If
> ``multipage_only`` is also set, then remove the single-page PDF
> files afterwards.
Subcommands:
worker Start a processing worker rather than do local processing
server Start a processor server rather than do local processing
Options for processing:
-m, --mets URL-PATH URL or file path of METS to process [./mets.xml]
-w, --working-dir PATH Working directory of local workspace [dirname(URL-PATH)]
-I, --input-file-grp USE File group(s) used as input
-O, --output-file-grp USE File group(s) used as output
-g, --page-id ID Physical page ID(s) to process instead of full document []
--overwrite Remove existing output pages/images
(with "--page-id", remove only those).
Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE
--debug Abort on any errors with full stack trace.
Short-hand for OCRD_MISSING_OUTPUT=ABORT
--profile Enable profiling
--profile-file PROF-PATH Write cProfile stats to PROF-PATH. Implies "--profile"
-p, --parameter JSON-PATH Parameters, either verbatim JSON string
or JSON file path
-P, --param-override KEY VAL Override a single JSON object key-value pair,
taking precedence over --parameter
-U, --mets-server-url URL URL of a METS Server for parallel incremental access to METS
If URL starts with http:// start an HTTP server there,
otherwise URL is a path to an on-demand-created unix socket
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Override log level globally [INFO]
--log-filename LOG-PATH File to redirect stderr logging to (overriding ocrd_logging.conf).
Options for information:
-C, --show-resource RESNAME Dump the content of processor resource RESNAME
-L, --list-resources List names of processor resources
-J, --dump-json Dump tool description as JSON
-D, --dump-module-dir Show the 'module' resource location path for this processor
-h, --help Show this message
-V, --version Show version
Parameters:
"image_feature_selector" [string - ""]
comma-separated list of required image features (e.g.
binarized,despeckled,cropped,deskewed,rotated-90)
"image_feature_filter" [string - ""]
comma-separated list of forbidden image features (e.g.
binarized,despeckled,cropped,deskewed,rotated-90)
"font" [string - ""]
Font file to be used in PDF file. If unset, AletheiaSans.ttf is used.
(Make sure to pick a font which covers all glyphs!)
"outlines" [string - ""]
What segment hierarchy to draw coordinate outlines for. If unset, no
outlines are drawn.
Possible values: ["", "region", "line", "word", "glyph"]
"textequiv_level" [string - ""]
What segment hierarchy level to render text output from. If unset, no
text is rendered.
Possible values: ["", "region", "line", "word", "glyph"]
"negative2zero" [boolean - false]
Repair invalid or inconsistent coordinates before trying to convert.
"ext" [string - ".pdf"]
Output filename extension
"multipage" [string - ""]
Merge all PDFs into one multipage file. The value is used as METS
file ID and file basename for the PDF.
"multipage_only" [boolean - false]
When producing a `multipage`, do not add single-page files into the
output fileGrp (but use a temporary directory for them).
"pagelabel" [string - "pageId"]
Parameter for 'multipage': Set the labels used as page outlines.
- 'pageId': physical page ID,
- 'pagenumber': use consecutive numbers,
- 'pagelabel': use '@ORDERLABEL - @LABEL',
- 'basename': use the name of the input file,
- 'local_filename': use the href relative path of the input file,
- 'url': use the href URL of the input file,
- 'ID': use the file ID of the input file
Possible values: ["pagenumber", "pagelabel", "pageId", "basename",
"basename_without_extension", "local_filename", "ID", "url"]
"script-args" [string - ""]
Extra arguments to PageToPdf (see https://github.com/PRImA-Research-
Lab/prima-page-to-pdf)
### ocrd-altotopdf
OCR-D CLI
Usage: ocrd-altotopdf [worker|server] [OPTIONS]
Convert text and layout annotations from ALTO to PDF format (overlaying original image with text layer and polygon outlines)
> Converts all pages of the document to PDF
> For each page, find the ALTO input file in the first fileGrp,
> together with the image input file in the second fileGrp.
> Then convert ALTO to PAGE with PRImA PageConverter in a temporary
> location.
> Next convert the PAGE/image pair with PRImA PageToPdf in a temporary location,
> applying
> - ``textequiv_level`` (i.e. `-text-source`) to retrieve a text layer, if set;
> - ``outlines`` to draw boundary polygons, if set;
> - ``font`` accordingly;
> - ``negative2zero`` (i.e. `-neg-coords toZero`) to repair negative coordintes.
> Copy to the resulting PDF file to the output file group and
> reference it in the METS.
> Finally, if ``multipage`` is set, then concatenate all generated
> files to a multi-page PDF file, setting ``pagelabels`` accordingly,
> as well as PDF metadata and bookmarks. Reference it with
> ``multipage`` as ID in the output fileGrp, too. If
> ``multipage_only`` is also set, then remove the single-page PDF
> files afterwards.
Subcommands:
worker Start a processing worker rather than do local processing
server Start a processor server rather than do local processing
Options for processing:
-m, --mets URL-PATH URL or file path of METS to process [./mets.xml]
-w, --working-dir PATH Working directory of local workspace [dirname(URL-PATH)]
-I, --input-file-grp USE File group(s) used as input
-O, --output-file-grp USE File group(s) used as output
-g, --page-id ID Physical page ID(s) to process instead of full document []
--overwrite Remove existing output pages/images
(with "--page-id", remove only those).
Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE
--debug Abort on any errors with full stack trace.
Short-hand for OCRD_MISSING_OUTPUT=ABORT
--profile Enable profiling
--profile-file PROF-PATH Write cProfile stats to PROF-PATH. Implies "--profile"
-p, --parameter JSON-PATH Parameters, either verbatim JSON string
or JSON file path
-P, --param-override KEY VAL Override a single JSON object key-value pair,
taking precedence over --parameter
-U, --mets-server-url URL URL of a METS Server for parallel incremental access to METS
If URL starts with http:// start an HTTP server there,
otherwise URL is a path to an on-demand-created unix socket
-l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
Override log level globally [INFO]
--log-filename LOG-PATH File to redirect stderr logging to (overriding ocrd_logging.conf).
Options for information:
-C, --show-resource RESNAME Dump the content of processor resource RESNAME
-L, --list-resources List names of processor resources
-J, --dump-json Dump tool description as JSON
-D, --dump-module-dir Show the 'module' resource location path for this processor
-h, --help Show this message
-V, --version Show version
Parameters:
"font" [string - ""]
Font file to be used in PDF file. If unset, AletheiaSans.ttf is used.
(Make sure to pick a font which covers all glyphs!)
"outlines" [string - ""]
What segment hierarchy to draw coordinate outlines for. If unset, no
outlines are drawn.
Possible values: ["", "region", "line", "word", "glyph"]
"textequiv_level" [string - ""]
What segment hierarchy level to render text output from. If unset, no
text is rendered.
Possible values: ["", "region", "line", "word", "glyph"]
"negative2zero" [boolean - false]
Repair invalid or inconsistent coordinates before trying to convert.
"ext" [string - ".pdf"]
Output filename extension
"multipage" [string - ""]
Merge all PDFs into one multipage file. The value is used as METS
file ID and file basename for the PDF.
"multipage_only" [boolean - false]
When producing a `multipage`, do not add single-page files into the
output fileGrp (but use a temporary directory for them).
"pagelabel" [string - "pageId"]
Parameter for 'multipage': Set the labels used as page outlines.
- 'pageId': physical page ID,
- 'pagenumber': use consecutive numbers,
- 'pagelabel': use '@ORDERLABEL - @LABEL',
- 'basename': use the name of the input file,
- 'local_filename': use the href relative path of the input file,
- 'url': use the href URL of the input file,
- 'ID': use the file ID of the input file
Possible values: ["pagenumber", "pagelabel", "pageId", "basename",
"basename_without_extension", "local_filename", "ID", "url"]
"script-args" [string - ""]
Extra arguments to PageToPdf (see https://github.com/PRImA-Research-
Lab/prima-page-to-pdf)
## FAQ
- `Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner()`
If that appears, try installing OpenJDK 8.
- `java.lang.NullPointerException`
If that appears, try (a little workaround) and set negative coordinates to zero:
ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP ... -P negative2zero true
- Some letters are illegible?
Please note that the standard displayed font ([AletheiaSans.ttf](https://github.com/PRImA-Research-Lab/prima-aletheia-web/raw/master/war/aletheiasans-webfont.ttf)) does not support all Unicode glyphs. In case yours are missing, set a (monospace) Unicode font yourself:
ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP ... -P font /usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf
Fonts can also be referenced by file name if they are installed as [processor resources](https://ocr-d.de/en/spec/cli#processor-resources). A number of options have been preconfigured, cf. `ocrd resmgr list-available -e ocrd-pagetopdf`.
- The multipage file's page labels can be configured, e.g. consecutively via `pagelabel=pagenumber` or from `@ORDERLABEL` and `@LABEL` via `pagelabel=pagelabel`:
ocrd-pagetopdf -I PAGE-FILGRP -O PDF-FILEGRP ... -P pagelabel pagelabel