https://github.com/ocr-d/ocrd_fileformat

OCR-D wrapper for ocr-fileformat

Host: GitHub
URL: https://github.com/ocr-d/ocrd_fileformat
Owner: OCR-D
License: apache-2.0
Created: 2020-01-10T17:38:37.000Z (over 6 years ago)
Default Branch: master
Last Pushed: 2024-10-16T14:14:39.000Z (over 1 year ago)
Last Synced: 2025-03-24T11:38:26.227Z (about 1 year ago)
Topics: ocr-d
Language: Shell
Size: 71.3 KB
Stars: 4
Watchers: 3
Forks: 3
Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# ocrd-fileformat

> OCR-D wrapper for [`ocr-fileformat`](https://github.com/UB-Mannheim/ocr-fileformat)

[![CircleCI](https://circleci.com/gh/OCR-D/ocrd_fileformat.svg?style=svg)](https://circleci.com/gh/OCR-D/ocrd_fileformat)

## Prerequisites

* GNU make
* Python && pip
* OpenJDK (required by submodule)
* optional: Docker CE for building container images
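
Before building, it can help to confirm that the tools listed above are on the `PATH`. The following is a minimal sketch, not part of the project: the helper name `missing_tools` is hypothetical, and the command names checked (`make`, `python3`, `pip`, `java`) are assumptions that may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: print every given command name that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed command names; e.g. pip may only be available as "pip3".
missing=$(missing_tools make python3 pip java)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names are joined on one line.
    echo "Missing prerequisites:" $missing
else
    echo "All prerequisites found."
fi
```

The check only tests for presence, not for versions or for GNU make specifically.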

## Installation

Clone the repository and its submodule recursively:

    git clone --recursive https://github.com/OCR-D/ocrd_fileformat.git

From the directory that contains the local clone, build and install `ocr-fileformat` and the `ocrd_fileformat` [OCR-D](https://ocr-d.de) wrapper:

    make -C ocrd_fileformat install

Alternatively, for the Docker option, pull the prebuilt image:

    docker pull ocrd/fileformat

## Usage

After installation, run `ocrd-fileformat-transform --help` to see
which conversions are supported:

    ocrd-fileformat-transform -h

    Usage: ocrd-fileformat-transform [OPTIONS]

      Convert between OCR file formats

      > Processor base class and helper functions. A processor is a tool
      > that implements the uniform OCR-D command-line interface for run-
      > time data processing. That is, it executes a single workflow step,
      > or a combination of workflow steps, on the workspace (represented by
      > local METS). It reads input files for all or requested physical
      > pages of the input fileGrp(s), and writes output files for them into
      > the output fileGrp(s). It may take a number of optional or
      > mandatory parameters. Process the :py:attr:`workspace` from the
      > given :py:attr:`input_file_grp` to the given
      > :py:attr:`output_file_grp` for the given :py:attr:`page_id` under
      > the given :py:attr:`parameter`.

      > (This contains the main functionality and needs to be overridden by
      > subclasses.)

    Options:
      -I, --input-file-grp USE        File group(s) used as input
      -O, --output-file-grp USE       File group(s) used as output
      -g, --page-id ID                Physical page ID(s) to process
      --overwrite                     Remove existing output pages/images
                                      (with --page-id, remove only those)
      -p, --parameter JSON-PATH       Parameters, either verbatim JSON string
                                      or JSON file path
      -P, --param-override KEY VAL    Override a single JSON object key-value pair,
                                      taking precedence over --parameter
      -s, --server HOST PORT WORKERS  Run web server instead of one-shot processing
                                      (shifts mets/working-dir/page-id options to
                                      HTTP request arguments); pass network interface
                                      to bind to, TCP port, number of worker processes
      -m, --mets URL-PATH             URL or file path of METS to process
      -w, --working-dir PATH          Working directory of local workspace
      -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]
                                      Log level
      -C, --show-resource RESNAME     Dump the content of processor resource RESNAME
      -L, --list-resources            List names of processor resources
      -J, --dump-json                 Dump tool description as JSON and exit
      -h, --help                      This help message
      -V, --version                   Show version

    Parameters:
      "from-to" [string - "page alto"]
        Transformation scenario, see ocr-fileformat -L
        Possible values: ["abbyy hocr", "abbyy page", "alto2.0 alto3.0",
        "alto2.0 alto3.1", "alto2.0 hocr", "alto2.1 alto3.0", "alto2.1
        alto3.1", "alto2.1 hocr", "alto page", "alto text", "gcv hocr", "gcv
        page", "hocr alto2.0", "hocr alto2.1", "hocr page", "hocr text",
        "page alto", "page hocr", "page page2019", "page text", "tei hocr"]
      "ext" [string - ""]
        Output extension. Set to empty string to derive extension from the
        media type.
      "script-args" [string - ""]
        Arguments to Saxon (for XSLT transformations) or to transformation
        script
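
Because `from-to` only accepts the listed scenarios, a script can check a value before starting the processor. The sketch below is an illustration only: the helper name `is_supported_scenario` is hypothetical, and the list is copied from the help text above, so it may be out of date for other versions (the help text points to `ocr-fileformat -L` for the current list).

```shell
#!/bin/sh
# Scenarios copied from the "Possible values" in the help text, one per line.
SCENARIOS='abbyy hocr
abbyy page
alto2.0 alto3.0
alto2.0 alto3.1
alto2.0 hocr
alto2.1 alto3.0
alto2.1 alto3.1
alto2.1 hocr
alto page
alto text
gcv hocr
gcv page
hocr alto2.0
hocr alto2.1
hocr page
hocr text
page alto
page hocr
page page2019
page text
tei hocr'

# Hypothetical helper: succeed only if $1 matches one list entry exactly.
# -F: fixed string, -x: whole-line match, -q: no output.
is_supported_scenario() {
    printf '%s\n' "$SCENARIOS" | grep -Fxq -- "$1"
}

scenario="page text"
if is_supported_scenario "$scenario"; then
    echo "supported: $scenario"
else
    echo "unsupported: $scenario" >&2
fi
```

Matching whole lines keeps a value such as `page` from being accepted just because it is a prefix of a valid scenario.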

Through the [OCR-D](https://ocr-d.de/en/spec/intro) [CLI](https://ocr-d.de/en/spec/cli) wrapper,
the `ocr-fileformat` converter integrates smoothly into existing OCR-D tool [workflows](https://ocr-d.de/en/workflows).

Given a previous step which produces PAGE-XML under the file group `OCR`,
a conversion into plain text under the file group `OCR-TXT` can be achieved with:

    ocrd-fileformat-transform -I OCR -O OCR-TXT -P from-to "page text"

With bertsky/workflow-configuration:

    OCR-TXT: OCR
    OCR-TXT: TOOL = ocrd-fileformat-transform
    OCR-TXT: PARAMS = "from-to": "page text"
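
According to the help text, `-p` takes the same parameters as a verbatim JSON string, so the `-P from-to "page text"` override can also be written as a JSON object. The sketch below only prints the resulting command line (a dry run); the helper name `from_to_json` is hypothetical, and it assumes the value contains no double quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helper: wrap a from-to value in the JSON object expected by -p.
# No JSON escaping is performed, so the value must not contain " or \.
from_to_json() {
    printf '{"from-to": "%s"}' "$1"
}

json=$(from_to_json "page text")

# Dry run: print the command line instead of executing it.
printf "ocrd-fileformat-transform -I OCR -O OCR-TXT -p '%s'\n" "$json"
```

The JSON form is mainly useful when several parameters are set at once or when they are kept in a file, since `-p` also accepts a JSON file path.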

Since the conversion from PAGE-XML to ALTO-XML (V4.1) is such a common
requirement, it is the default value for the parameter `from-to`. Therefore,
parameters can be omitted completely:

    ocrd-fileformat-transform -I OCR -O OCR-ALTO

With bertsky/workflow-configuration:

    OCR-ALTO: OCR
    OCR-ALTO: TOOL = ocrd-fileformat-transform

However, typically the ALTO converter itself will require additional parameters
to be able to cope with the kind of annotations present. For example, if you have
no cropping in the workflow, and OCR text is only annotated on the line level,
then you will need to add:

    ocrd-fileformat-transform -I OCR -O OCR-ALTO -P script-args "--no-check-border --no-check-words --dummy-word"

With bertsky/workflow-configuration:

    OCR-ALTO: OCR
    OCR-ALTO: TOOL = ocrd-fileformat-transform
    OCR-ALTO: PARAMS = "script-args": "--no-check-border --no-check-words --dummy-word"

To run the program via Docker, start a container with the workspace directory mounted and pass the same arguments:

    docker run --rm -v "$PWD":/data ocrd/fileformat ocrd-fileformat-transform -I OCR -O OCR-ALTO
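
Scripts that should work with either installation method can pick the native command when it is installed and fall back to the container otherwise. This is a minimal sketch under that assumption; the function names `transform_runner` and `run_transform` are hypothetical and not part of the project.

```shell
#!/bin/sh
# Hypothetical helper: report which way of running the processor is available.
transform_runner() {
    if command -v ocrd-fileformat-transform >/dev/null 2>&1; then
        echo native
    elif command -v docker >/dev/null 2>&1; then
        echo docker
    else
        echo none
    fi
}

# Hypothetical wrapper: forward all arguments to the native command or to the
# container, mounting the current directory at /data as in the example above.
run_transform() {
    case "$(transform_runner)" in
        native) ocrd-fileformat-transform "$@" ;;
        docker) docker run --rm -v "$PWD":/data ocrd/fileformat \
                    ocrd-fileformat-transform "$@" ;;
        *) echo "neither ocrd-fileformat-transform nor docker found" >&2
           return 1 ;;
    esac
}

# Example call (not executed here):
#   run_transform -I OCR -O OCR-ALTO
```

Quoting `"$PWD"` and `"$@"` keeps paths and parameter values with spaces, such as `"page text"`, intact when they are forwarded.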
