{"id":22700032,"url":"https://github.com/ub-mannheim/ocrd_pagetopdf","last_synced_at":"2025-08-07T08:31:57.718Z","repository":{"id":42665112,"uuid":"248760922","full_name":"UB-Mannheim/ocrd_pagetopdf","owner":"UB-Mannheim","description":"OCR-D wrapper for prima-pagetopdf","archived":false,"fork":false,"pushed_at":"2024-10-28T19:05:44.000Z","size":2841,"stargazers_count":8,"open_issues_count":4,"forks_count":6,"subscribers_count":6,"default_branch":"master","last_synced_at":"2024-10-28T20:20:00.774Z","etag":null,"topics":["ocr","ocr-d","prima-pagetopdf"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UB-Mannheim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-20T13:22:18.000Z","updated_at":"2024-10-28T19:05:48.000Z","dependencies_parsed_at":"2024-10-28T20:29:57.949Z","dependency_job_id":null,"html_url":"https://github.com/UB-Mannheim/ocrd_pagetopdf","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focrd_pagetopdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focrd_pagetopdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focrd_pagetopdf/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focrd_pagetopdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UB-Mannheim","d
ownload_url":"https://codeload.github.com/UB-Mannheim/ocrd_pagetopdf/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":229013256,"owners_count":18006191,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","ocr-d","prima-pagetopdf"],"created_at":"2024-12-10T06:09:36.354Z","updated_at":"2024-12-10T06:09:36.949Z","avatar_url":"https://github.com/UB-Mannheim.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ocrd-pagetopdf\n\n\u003e OCR-D wrapper for prima-page-to-pdf\n\nTransforms all PAGE-XML+IMG to PDF with text layer and (optionally) polygon outlines.\n\n(Converts original images together with text and layout annotations of all pages in the PAGE input file group to PDF. 
The text is rendered as an overlay.)\n\n### Requirements\n\n- GNU `make`\n- Python 3 with `pip` and `venv`\n- [OCR-D](https://github.com/OCR-D/core)\n- Java runtime (OpenJDK 8 works for [PageToPdf](https://github.com/PRImA-Research-Lab/prima-page-to-pdf/releases) 1.1.2)\n\n### Installation\n\nOnce you have installed Java, make, and Python, and set up your [virtual environment](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/), run:\n\n    make deps # or: pip install ocrd\n    make install # copies into PREFIX or VIRTUAL_ENV\n\n### Usage\n\nThe command-line interface conforms to the [OCR-D processor](https://ocr-d.de/en/spec/cli) specifications.\n\nAssuming you have an [OCR-D workspace](https://ocr-d.de/en/user_guide#preparing-a-workspace) in your current working directory, run:\n\n    ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{\"textequiv_level\" : \"word\"}'\n\nThis will run the script and create one PDF file per page, with a text layer based on word-level annotations.\n\nThere is also an option to create an additional multipage file named `merged.pdf`, which contains all single pages in the correct order:\n\n    ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{\"textequiv_level\" : \"word\", \"multipage\":\"merged\"}'\n\n### FAQ\n\n- `Illegal reflective access by com.itextpdf.text.io.ByteBufferRandomAccessSource$1 to method java.nio.DirectByteBuffer.cleaner()`\n   If that appears, try installing OpenJDK 8.\n\n- `java.lang.NullPointerException` \n  If that appears, try this small workaround, which sets negative coordinates to zero:\n  \n      ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{\"textequiv_level\" : \"word\", \"negative2zero\": true}'\n\n- Some letters are illegible?\n  Please note that the default display font ([AletheiaSans.ttf](https://github.com/PRImA-Research-Lab/prima-aletheia-web/raw/master/war/aletheiasans-webfont.ttf)) does not support all Unicode glyphs. 
If the glyphs you need are missing, set a (monospace) Unicode font yourself:\n\n      ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{\"textequiv_level\" : \"word\", \"font\": \"/usr/share/fonts/truetype/ubuntu/UbuntuMono-R.ttf\"}'\n\n- The page labels of the multipage file can be changed with `pagelabelname`, e.g. to consecutive page numbers (`pagenumber`):\n\n      ocrd-pagetopdf -I PAGE-FILEGRP -O PDF-FILEGRP -p '{\"textequiv_level\" : \"word\", \"multipage\":\"merged\", \"pagelabelname\":\"pagenumber\"}'\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fub-mannheim%2Focrd_pagetopdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fub-mannheim%2Focrd_pagetopdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fub-mannheim%2Focrd_pagetopdf/lists"}