{"id":27804107,"url":"https://github.com/ocr-d/ocrd_pagetopdf","last_synced_at":"2025-09-05T02:48:24.150Z","repository":{"id":42665112,"uuid":"248760922","full_name":"OCR-D/ocrd_pagetopdf","owner":"OCR-D","description":"OCR-D wrapper for prima-pagetopdf","archived":false,"fork":false,"pushed_at":"2025-05-21T11:11:47.000Z","size":3051,"stargazers_count":9,"open_issues_count":3,"forks_count":7,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-08-24T14:39:14.473Z","etag":null,"topics":["ocr","ocr-d","prima-pagetopdf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OCR-D.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2020-03-20T13:22:18.000Z","updated_at":"2025-05-21T11:11:48.000Z","dependencies_parsed_at":"2025-05-01T08:23:08.890Z","dependency_job_id":"e6912a37-8e8d-4646-be5f-27902b216aa6","html_url":"https://github.com/OCR-D/ocrd_pagetopdf","commit_stats":null,"previous_names":[],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/OCR-D/ocrd_pagetopdf","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_pagetopdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_pagetopdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_pagetopdf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_pagetopdf/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/OCR-D","download_url":"https://codeload.github.com/OCR-D/ocrd_pagetopdf/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_pagetopdf/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273703693,"owners_count":25153001,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-05T02:00:09.113Z","response_time":402,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","ocr-d","prima-pagetopdf"],"created_at":"2025-05-01T08:21:55.267Z","updated_at":"2025-09-05T02:48:24.135Z","avatar_url":"https://github.com/OCR-D.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ocrd-pagetopdf\n\n\u003e OCR-D wrapper for prima-page-to-pdf\n\n[![Python CI](https://github.com/OCR-D/ocrd_pagetopdf/actions/workflows/ci.yml/badge.svg)](https://github.com/OCR-D/ocrd_pagetopdf/actions/workflows/ci.yml)\n[![Docker CD](https://github.com/OCR-D/ocrd_pagetopdf/actions/workflows/docker.yml/badge.svg)](https://github.com/OCR-D/ocrd_pagetopdf/actions/workflows/docker.yml)\n[![PyPI CD](https://img.shields.io/pypi/v/ocrd-pagetopdf.svg)](https://pypi.org/project/ocrd-pagetopdf/)\n\nContents:\n * [Introduction](#introduction)\n * [Requirements](#requirements)\n * [Installation](#installation)\n    * [With Docker](#with-docker)\n    * [Native, from 
PyPI](#native-from-pypi)\n    * [Native, from git](#native-from-git)\n * [Usage](#usage)\n    * [ocrd-pagetopdf](#ocrd-pagetopdf)\n    * [ocrd-altotopdf](#ocrd-altotopdf)\n * [FAQ](#faq)\n\n## Introduction\n\nThis package offers [OCR-D](https://ocr-d.de/en/spec) compliant\n[workspace processors](https://ocr-d.de/en/spec/cli) for conversion of OCR data\nrepresented in [METS](https://ocr-d.de/en/spec/mets) (on the document level)\nand [PAGE](https://ocr-d.de/en/spec/page)\nor [ALTO](https://www.loc.gov/standards/alto/)\n(on the page level) to PDF.\n\nIt transforms both the scan image (_facsimile_) and annotations (_text overlay_),\noptionally drawing _polygon outlines_ for text regions / lines / words / glyphs.\n\nOptionally _validates_ the structural annotation and fixes its coordinates before\nattempting conversion.\n\nThe text layer is generated from the textual annotation on the configured _level_\nof the structural hierarchy (region / line / word / glyph). It is rendered with a\nconfigurable _font_ (which is useful to make sure all codepoints are covered by\nadequate glyphs, esp. 
in historic prints and manuscripts).\n\nThe _page labels_ can be configured to use various attributes from the\nphysical pages of the METS.\n\nA _table of contents_ will be added according to the labels of the\nrecursive `mets:div` logical structure.\n\n## Requirements\n\n- GNU `make`\n- Python 3 with `pip` and `venv`\n- [OCR-D](https://github.com/OCR-D/core)\n- Java runtime (OpenJDK \u0026ge;8 works for [PageToPdf](https://github.com/PRImA-Research-Lab/prima-page-to-pdf/releases) 1.1.2)\n\n## Installation\n\n### With Docker\n\nThis is the best option if you want to run the software in a container.\n\nYou need to have [Docker](https://docs.docker.com/install/linux/docker-ce/ubuntu/) installed. Then pull the image:\n\n\n    docker pull ocrd/pagetopdf\n\n\nTo run with Docker:\n\n\n    docker run -v path/to/workspaces:/data ocrd/pagetopdf ocrd-pagetopdf ...\n\n### Native, from PyPI\n\nThis is the best option if you want to use the stable, released version.\n\nAfter installing Python and Java, simply do:\n\n\n    pip install ocrd_pagetopdf\n\n\n### Native, from git\n\nUse this option if you want to change the source code or install the latest, unpublished changes.\n\nWe strongly recommend using a [venv](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/).\n\nAfter installing `make`, assuming you are on a Debian/Ubuntu OS, you can do:\n\n    sudo make deps-ubuntu\n\nOtherwise, do a dry run of this step and install the requirements with equivalent commands for your system:\n\n    make -n deps-ubuntu\n    ...\n\nFinally, to install the Python package, do:\n\n    make install\n    # or equivalently:\n    pip install .\n\n\n## Usage\n\nThe command-line interface `ocrd-pagetopdf` conforms to the [OCR-D processor](https://ocr-d.de/en/spec/cli) specifications.\n\nAssuming you have an [OCR-D workspace](https://ocr-d.de/en/user_guide#preparing-a-workspace) in your current working directory, simply do:\n\n    ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -P textequiv_level word\n\nThis will run the 
script and create a PDF file for each page with a text layer based on word-level annotations.\n\nIn order to create an additional multipage file for the entire document, named `merged.pdf`,\nconcatenating the single-page PDFs in physical order and with page labels and a table of contents, do:\n\n    ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -P textequiv_level word -P multipage merged\n\nIf your workspace contains fulltext in **ALTO** rather than **PAGE** format, there is a dedicated\nprocessor CLI `ocrd-altotopdf`, with some limitations compared to the former:\n\n- You need to _manually_ select the fileGrp providing the images which match the annotation coordinates,\n  passing it as the second input fileGrp. (The image references are required by PAGE, but not by ALTO.)\n- The images are _not_ generated on-the-fly according to all annotations (from existing `AlternativeImage`s,\n  or by cropping via coordinates into the higher-level image, and deskewing when applicable), and _not_\n  chosen via the `image_feature_selector` / `image_feature_filter` mechanism. Instead, only the original\n  images can be used here.\n- The annotations are _not_ tested comprehensively regarding validity and consistency of coordinates and\n  then repaired. 
Instead, only superficial checks and repairs can be applied (like negative coordinates).\n\nAssuming you have a workspace representing a typical [DFG-conforming](https://dfg-viewer.de/) METS,\nwith `FULLTEXT` for ALTO and `DEFAULT` for the original images, do:\n\n    ocrd-altotopdf -I FULLTEXT,DEFAULT -O PDF-FILEGRP -P textequiv_level word -P multipage merged\n\nFor more options and explanations, see below.\n\n### ocrd-pagetopdf\n\n\u003cdetails\u003e\u003csummary\u003eOCR-D CLI\u003c/summary\u003e\n\n\n\u003cpre\u003e\nUsage: ocrd-pagetopdf [worker|server] [OPTIONS]\n\n  Convert text and layout annotations from PAGE to PDF format (overlaying original image with text layer and polygon outlines)\n\n  \u003e Converts all pages of the document to PDF\n\n  \u003e For each page, open and deserialize PAGE input file and its\n  \u003e respective image. Then extract a derived image of the (cropped,\n  \u003e deskewed, binarized...) page, with features depending on\n  \u003e ``image_feature_selector`` (a comma-separated list of required image\n  \u003e features, cf. :py:func:`ocrd.workspace.Workspace.image_from_page`)\n  \u003e and ``image_feature_filter`` (a comma-separated list of forbidden\n  \u003e image features).\n\n  \u003e Next, generate a temporary PAGE output file for that very image\n  \u003e (adapting all coordinates if necessary). If ``negative2zero`` is\n  \u003e set, validate and repair invalid or inconsistent coordinates.\n\n  \u003e Convert the PAGE/image pair with PRImA PageToPdf, applying\n  \u003e - ``textequiv_level`` (i.e. 
`-text-source`) to retrieve a text layer, if set;\n  \u003e - ``outlines`` to draw boundary polygons, if set;\n  \u003e - ``font`` accordingly.\n\n  \u003e Copy the resulting PDF file to the output file group and reference\n  \u003e it in the METS.\n\n  \u003e Finally, if ``multipage`` is set, then concatenate all generated\n  \u003e files to a multi-page PDF file, setting ``pagelabels`` accordingly,\n  \u003e as well as PDF metadata and bookmarks. Reference it with\n  \u003e ``multipage`` as ID in the output file group, too. If\n  \u003e ``multipage_only`` is also set, then remove the single-page PDF\n  \u003e files afterwards.\n\nSubcommands:\n    worker      Start a processing worker rather than do local processing\n    server      Start a processor server rather than do local processing\n\nOptions for processing:\n  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]\n  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]\n  -I, --input-file-grp USE        File group(s) used as input\n  -O, --output-file-grp USE       File group(s) used as output\n  -g, --page-id ID                Physical page ID(s) to process instead of full document []\n  --overwrite                     Remove existing output pages/images\n                                  (with \"--page-id\", remove only those).\n                                  Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE\n  --debug                         Abort on any errors with full stack trace.\n                                  Short-hand for OCRD_MISSING_OUTPUT=ABORT\n  --profile                       Enable profiling\n  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. 
Implies \"--profile\"\n  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string\n                                  or JSON file path\n  -P, --param-override KEY VAL    Override a single JSON object key-value pair,\n                                  taking precedence over --parameter\n  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS\n                                  If URL starts with http:// start an HTTP server there,\n                                  otherwise URL is a path to an on-demand-created unix socket\n  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]\n                                  Override log level globally [INFO]\n  --log-filename LOG-PATH         File to redirect stderr logging to (overriding ocrd_logging.conf).\n\nOptions for information:\n  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME\n  -L, --list-resources            List names of processor resources\n  -J, --dump-json                 Dump tool description as JSON\n  -D, --dump-module-dir           Show the 'module' resource location path for this processor\n  -h, --help                      Show this message\n  -V, --version                   Show version\n\nParameters:\n   \"image_feature_selector\" [string - \"\"]\n    comma-separated list of required image features (e.g.\n    binarized,despeckled,cropped,deskewed,rotated-90)\n   \"image_feature_filter\" [string - \"\"]\n    comma-separated list of forbidden image features (e.g.\n    binarized,despeckled,cropped,deskewed,rotated-90)\n   \"font\" [string - \"\"]\n    Font file to be used in PDF file. If unset, AletheiaSans.ttf is used.\n    (Make sure to pick a font which covers all glyphs!)\n   \"outlines\" [string - \"\"]\n    What segment hierarchy to draw coordinate outlines for. 
If unset, no\n    outlines are drawn.\n    Possible values: [\"\", \"region\", \"line\", \"word\", \"glyph\"]\n   \"textequiv_level\" [string - \"\"]\n    What segment hierarchy level to render text output from. If unset, no\n    text is rendered.\n    Possible values: [\"\", \"region\", \"line\", \"word\", \"glyph\"]\n   \"negative2zero\" [boolean - false]\n    Repair invalid or inconsistent coordinates before trying to convert.\n   \"ext\" [string - \".pdf\"]\n    Output filename extension\n   \"multipage\" [string - \"\"]\n    Merge all PDFs into one multipage file. The value is used as METS\n    file ID and file basename for the PDF.\n   \"multipage_only\" [boolean - false]\n    When producing a `multipage`, do not add single-page files into the\n    output fileGrp (but use a temporary directory for them).\n   \"pagelabel\" [string - \"pageId\"]\n    Parameter for 'multipage': Set the labels used as page outlines.\n\n    - 'pageId': physical page ID,\n\n    - 'pagenumber': use consecutive numbers,\n\n    - 'pagelabel': use '@ORDERLABEL - @LABEL',\n\n    - 'basename': use the name of the input file,\n\n    - 'local_filename': use the href relative path of the input file,\n\n    - 'url': use the href URL of the input file,\n\n    - 'ID': use the file ID of the input file\n    Possible values: [\"pagenumber\", \"pagelabel\", \"pageId\", \"basename\",\n    \"basename_without_extension\", \"local_filename\", \"ID\", \"url\"]\n   \"script-args\" [string - \"\"]\n    Extra arguments to PageToPdf (see https://github.com/PRImA-Research-\n    Lab/prima-page-to-pdf)\n\u003c/pre\u003e\n\n\u003c/details\u003e\n\n### ocrd-altotopdf\n\n\u003cdetails\u003e\u003csummary\u003eOCR-D CLI\u003c/summary\u003e\n\n\n\u003cpre\u003e\nUsage: ocrd-altotopdf [worker|server] [OPTIONS]\n\n  Convert text and layout annotations from ALTO to PDF format (overlaying original image with text layer and polygon outlines)\n\n  \u003e Converts all pages of the document to PDF\n\n  \u003e For each 
page, find the ALTO input file in the first fileGrp,\n  \u003e together with the image input file in the second fileGrp.\n\n  \u003e Then convert ALTO to PAGE with PRImA PageConverter in a temporary\n  \u003e location.\n\n  \u003e Next convert the PAGE/image pair with PRImA PageToPdf in a temporary location,\n  \u003e applying\n  \u003e - ``textequiv_level`` (i.e. `-text-source`) to retrieve a text layer, if set;\n  \u003e - ``outlines`` to draw boundary polygons, if set;\n  \u003e - ``font`` accordingly;\n  \u003e - ``negative2zero`` (i.e. `-neg-coords toZero`) to repair negative coordintes.\n\n  \u003e Copy to the resulting PDF file to the output file group and\n  \u003e reference it in the METS.\n\n  \u003e Finally, if ``multipage`` is set, then concatenate all generated\n  \u003e files to a multi-page PDF file, setting ``pagelabels`` accordingly,\n  \u003e as well as PDF metadata and bookmarks. Reference it with\n  \u003e ``multipage`` as ID in the output fileGrp, too. If\n  \u003e ``multipage_only`` is also set, then remove the single-page PDF\n  \u003e files afterwards.\n\nSubcommands:\n    worker      Start a processing worker rather than do local processing\n    server      Start a processor server rather than do local processing\n\nOptions for processing:\n  -m, --mets URL-PATH             URL or file path of METS to process [./mets.xml]\n  -w, --working-dir PATH          Working directory of local workspace [dirname(URL-PATH)]\n  -I, --input-file-grp USE        File group(s) used as input\n  -O, --output-file-grp USE       File group(s) used as output\n  -g, --page-id ID                Physical page ID(s) to process instead of full document []\n  --overwrite                     Remove existing output pages/images\n                                  (with \"--page-id\", remove only those).\n                                  Short-hand for OCRD_EXISTING_OUTPUT=OVERWRITE\n  --debug                         Abort on any errors with full stack trace.\n           
                       Short-hand for OCRD_MISSING_OUTPUT=ABORT\n  --profile                       Enable profiling\n  --profile-file PROF-PATH        Write cProfile stats to PROF-PATH. Implies \"--profile\"\n  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string\n                                  or JSON file path\n  -P, --param-override KEY VAL    Override a single JSON object key-value pair,\n                                  taking precedence over --parameter\n  -U, --mets-server-url URL       URL of a METS Server for parallel incremental access to METS\n                                  If URL starts with http:// start an HTTP server there,\n                                  otherwise URL is a path to an on-demand-created unix socket\n  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]\n                                  Override log level globally [INFO]\n  --log-filename LOG-PATH         File to redirect stderr logging to (overriding ocrd_logging.conf).\n\nOptions for information:\n  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME\n  -L, --list-resources            List names of processor resources\n  -J, --dump-json                 Dump tool description as JSON\n  -D, --dump-module-dir           Show the 'module' resource location path for this processor\n  -h, --help                      Show this message\n  -V, --version                   Show version\n\nParameters:\n   \"font\" [string - \"\"]\n    Font file to be used in PDF file. If unset, AletheiaSans.ttf is used.\n    (Make sure to pick a font which covers all glyphs!)\n   \"outlines\" [string - \"\"]\n    What segment hierarchy to draw coordinate outlines for. If unset, no\n    outlines are drawn.\n    Possible values: [\"\", \"region\", \"line\", \"word\", \"glyph\"]\n   \"textequiv_level\" [string - \"\"]\n    What segment hierarchy level to render text output from. 
If unset, no\n    text is rendered.\n    Possible values: [\"\", \"region\", \"line\", \"word\", \"glyph\"]\n   \"negative2zero\" [boolean - false]\n    Repair invalid or inconsistent coordinates before trying to convert.\n   \"ext\" [string - \".pdf\"]\n    Output filename extension\n   \"multipage\" [string - \"\"]\n    Merge all PDFs into one multipage file. The value is used as METS\n    file ID and file basename for the PDF.\n   \"multipage_only\" [boolean - false]\n    When producing a `multipage`, do not add single-page files into the\n    output fileGrp (but use a temporary directory for them).\n   \"pagelabel\" [string - \"pageId\"]\n    Parameter for 'multipage': Set the labels used as page outlines.\n\n    - 'pageId': physical page ID,\n\n    - 'pagenumber': use consecutive numbers,\n\n    - 'pagelabel': use '@ORDERLABEL - @LABEL',\n\n    - 'basename': use the name of the input file,\n\n    - 'local_filename': use the href relative path of the input file,\n\n    - 'url': use the href URL of the input file,\n\n    - 'ID': use the file ID of the input file\n    Possible values: [\"pagenumber\", \"pagelabel\", \"pageId\", \"basename\",\n    \"basename_without_extension\", \"local_filename\", \"ID\", \"url\"]\n   \"script-args\" [string - \"\"]\n    Extra arguments to PageToPdf (see https://github.com/PRImA-Research-\n    Lab/prima-page-to-pdf)\n\u003c/pre\u003e\n\n\u003c/details\u003e\n\n\n## FAQ\n\n- `Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner()`\n  If that appears, try installing OpenJDK 8.\n\n- `java.lang.NullPointerException`\n  If that appears, try setting negative coordinates to zero as a workaround:\n  \n      ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP ... 
-P negative2zero true\n\n- Some letters are illegible?\n  Note that the default font ([AletheiaSans.ttf](https://github.com/PRImA-Research-Lab/prima-aletheia-web/raw/master/war/aletheiasans-webfont.ttf)) does not support all Unicode glyphs. If the glyphs you need are missing, set a (monospace) Unicode font yourself:\n  \n      ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP ... -P font /usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf\n  \n  Fonts can also be referenced by file name if they are installed as [processor resources](https://ocr-d.de/en/spec/cli#processor-resources). A number of options have been preconfigured, cf. `ocrd resmgr list-available -e ocrd-pagetopdf`.\n\n- The multipage file's page labels can be configured, e.g. as consecutive numbers via `pagelabel=pagenumber` or from `@ORDERLABEL` and `@LABEL` via `pagelabel=pagelabel`:\n  \n      ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP ... -P pagelabel pagelabel\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focr-d%2Focrd_pagetopdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Focr-d%2Focrd_pagetopdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focr-d%2Focrd_pagetopdf/lists"}