{"id":24951939,"url":"https://github.com/ocr-d/ocrd_fileformat","last_synced_at":"2025-04-10T12:51:54.842Z","repository":{"id":42428617,"uuid":"233099130","full_name":"OCR-D/ocrd_fileformat","owner":"OCR-D","description":"OCR-D wrapper for ocr-fileformat","archived":false,"fork":false,"pushed_at":"2024-10-16T14:14:39.000Z","size":73,"stargazers_count":4,"open_issues_count":6,"forks_count":3,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-24T11:38:26.227Z","etag":null,"topics":["ocr-d"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OCR-D.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-10T17:38:37.000Z","updated_at":"2024-10-16T14:14:33.000Z","dependencies_parsed_at":"2024-01-11T21:27:00.982Z","dependency_job_id":"8eee920c-49c2-4b27-87ba-5cc2cf00b774","html_url":"https://github.com/OCR-D/ocrd_fileformat","commit_stats":null,"previous_names":[],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_fileformat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_fileformat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_fileformat/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OCR-D%2Focrd_fileformat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OCR-D","download_url":"https://codeload.gi
thub.com/OCR-D/ocrd_fileformat/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248220291,"owners_count":21067278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr-d"],"created_at":"2025-02-03T01:32:47.904Z","updated_at":"2025-04-10T12:51:54.821Z","avatar_url":"https://github.com/OCR-D.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ocrd-fileformat\n\n\u003e OCR-D wrapper for [`ocr-fileformat`](https://github.com/UB-Mannheim/ocr-fileformat)\n\n[![CircleCI](https://circleci.com/gh/OCR-D/ocrd_fileformat.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_fileformat)\n\n\n## Prerequisites\n\n* GNU make\n* Python \u0026\u0026 pip\n* OpenJDK (required by submodule)\n* optional: Docker CE for building container images \n\n## Installation\n\nClone the repository and its submodule recursively:\n\n    git clone --recursive https://github.com/OCR-D/ocrd_fileformat.git\n\nFrom the local clone, build and install `ocr-fileformat` and the `ocrd_fileformat` [OCR-D](https://ocr-d.de) wrapper:\n\n    make -C ocrd_fileformat install\n\nAlternatively, for the Docker option, pull the prebuilt image:\n\n    docker pull ocrd/fileformat\n\n\n## Usage\n\nAfter successful installation, run `ocrd-fileformat-transform --help` to see\nwhich conversions are supported:\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ccode\u003eocrd-fileformat-transform -h\u003c/code\u003e\u003c/summary\u003e\n  \u003cpre\u003e\nUsage: ocrd-fileformat-transform [OPTIONS]\n\n  Convert between OCR file 
formats\n\n  \u0026gt; Processor base class and helper functions. A processor is a tool\n  \u0026gt; that implements the uniform OCR-D command-line interface for run-\n  \u0026gt; time data processing. That is, it executes a single workflow step,\n  \u0026gt; or a combination of workflow steps, on the workspace (represented by\n  \u0026gt; local METS). It reads input files for all or requested physical\n  \u0026gt; pages of the input fileGrp(s), and writes output files for them into\n  \u0026gt; the output fileGrp(s). It may take  a number of optional or\n  \u0026gt; mandatory parameters. Process the :py:attr:`workspace`  from the\n  \u0026gt; given :py:attr:`input_file_grp` to the given\n  \u0026gt; :py:attr:`output_file_grp` for the given :py:attr:`page_id` under\n  \u0026gt; the given :py:attr:`parameter`.\n\n  \u0026gt; (This contains the main functionality and needs to be overridden by\n  \u0026gt; subclasses.)\n\nOptions:\n  -I, --input-file-grp USE        File group(s) used as input\n  -O, --output-file-grp USE       File group(s) used as output\n  -g, --page-id ID                Physical page ID(s) to process\n  --overwrite                     Remove existing output pages/images\n                                  (with --page-id, remove only those)\n  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string\n                                  or JSON file path\n  -P, --param-override KEY VAL    Override a single JSON object key-value pair,\n                                  taking precedence over --parameter\n  -s, --server HOST PORT WORKERS  Run web server instead of one-shot processing\n                                  (shifts mets/working-dir/page-id options to\n                                   HTTP request arguments); pass network interface\n                                  to bind to, TCP port, number of worker processes\n  -m, --mets URL-PATH             URL or file path of METS to process\n  -w, --working-dir PATH          Working 
directory of local workspace\n  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]\n                                  Log level\n  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME\n  -L, --list-resources            List names of processor resources\n  -J, --dump-json                 Dump tool description as JSON and exit\n  -h, --help                      This help message\n  -V, --version                   Show version\n\nParameters:\n   \"from-to\" [string - \"page alto\"]\n    Transformation scenario, see ocr-fileformat -L\n    Possible values: [\"abbyy hocr\", \"abbyy page\", \"alto2.0 alto3.0\",\n    \"alto2.0 alto3.1\", \"alto2.0 hocr\", \"alto2.1 alto3.0\", \"alto2.1\n    alto3.1\", \"alto2.1 hocr\", \"alto page\", \"alto text\", \"gcv hocr\", \"gcv\n    page\", \"hocr alto2.0\", \"hocr alto2.1\", \"hocr page\", \"hocr text\",\n    \"page alto\", \"page hocr\", \"page page2019\", \"page text\", \"tei hocr\"]\n   \"ext\" [string - \"\"]\n    Output extension. 
Set to empty string to derive extension from the\n    media type.\n   \"script-args\" [string - \"\"]\n    Arguments to Saxon (for XSLT transformations) or to transformation\n    script\n\u003c/pre\u003e\n\u003c/details\u003e\n\nWith the [OCR-D](https://ocr-d.de/en/spec/intro) [CLI](https://ocr-d.de/en/spec/cli) wrapper\nthe `ocr-fileformat` converter integrates fluently into existing OCR-D tool [workflows](https://ocr-d.de/en/workflows).\n\nGiven a previous step which produces PAGE-XML under the file group `OCR`,\na conversion into plain text under the file group `OCR-TXT` can be achieved with:\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ccode\u003eocrd-fileformat-transform -I OCR -O OCR-TXT -P from-to \"page text\"\u003c/code\u003e\u003c/summary\u003e\n  \u003ch4\u003eWith \u003ca href=\"https://github.com/bertsky/workflow-configuration\"\u003ebertsky/workflow-configuration\u003c/a\u003e\u003c/h4\u003e\n  \u003cpre\u003e\nOCR-TXT: OCR\nOCR-TXT: TOOL = ocrd-fileformat-transform\nOCR-TXT: PARAMS = \"from-to\": \"page text\"\n\u003c/pre\u003e\n\u003c/details\u003e\n\nSince the conversion from PAGE-XML to ALTO-XML (V4.1) is such a common\nrequirement, it is the default value for the parameter `from-to`. Therefore,\nparameters can be omitted completely:\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ccode\u003eocrd-fileformat-transform -I OCR -O OCR-ALTO\u003c/code\u003e\u003c/summary\u003e\n  \u003ch4\u003eWith \u003ca href=\"https://github.com/bertsky/workflow-configuration\"\u003ebertsky/workflow-configuration\u003c/a\u003e\u003c/h4\u003e\n  \u003cpre\u003e\nOCR-ALTO: OCR\nOCR-ALTO: TOOL = ocrd-fileformat-transform\n\u003c/pre\u003e\n\u003c/details\u003e\n\nHowever, typically the ALTO converter itself will require additional parameters\nto be able to cope with the kind of annotations present. 
For example, if you have\nno cropping in the workflow, and OCR text is only annotated on the line level,\nthen you will need to add:\n\n\u003cdetails\u003e\n  \u003csummary\u003e\u003ccode\u003eocrd-fileformat-transform -I OCR -O OCR-ALTO -P script-args \"--no-check-border --no-check-words --dummy-word\"\u003c/code\u003e\u003c/summary\u003e\n  \u003ch4\u003eWith \u003ca href=\"https://github.com/bertsky/workflow-configuration\"\u003ebertsky/workflow-configuration\u003c/a\u003e\u003c/h4\u003e\n  \u003cpre\u003e\nOCR-ALTO: OCR\nOCR-ALTO: TOOL = ocrd-fileformat-transform\nOCR-ALTO: PARAMS = \"script-args\": \"--no-check-border --no-check-words --dummy-word\"\n\u003c/pre\u003e\n\u003c/details\u003e\n\nTo run the program via Docker, just spin up a container analogously:\n\n    docker run --rm -v $PWD:/data ocrd/fileformat ocrd-fileformat-transform -I OCR -O OCR-ALTO\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focr-d%2Focrd_fileformat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Focr-d%2Focrd_fileformat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Focr-d%2Focrd_fileformat/lists"}