{"id":13415602,"url":"https://github.com/UB-Mannheim/ocr-fileformat","last_synced_at":"2025-03-14T23:30:56.155Z","repository":{"id":38751977,"uuid":"55786219","full_name":"UB-Mannheim/ocr-fileformat","owner":"UB-Mannheim","description":"Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)","archived":false,"fork":false,"pushed_at":"2024-08-01T09:42:55.000Z","size":824,"stargazers_count":176,"open_issues_count":35,"forks_count":23,"subscribers_count":20,"default_branch":"master","last_synced_at":"2024-08-05T01:07:27.980Z","etag":null,"topics":["alto","finereader","hocr","ocr","ocr-d","page-xml","transformation","validation"],"latest_commit_sha":null,"homepage":"https://digi.bib.uni-mannheim.de/ocr-fileformat/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/UB-Mannheim.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-04-08T14:43:58.000Z","updated_at":"2024-08-05T01:07:30.895Z","dependencies_parsed_at":"2023-01-21T22:48:22.932Z","dependency_job_id":"a76c8858-7442-4bf4-bdb2-ac700b81c569","html_url":"https://github.com/UB-Mannheim/ocr-fileformat","commit_stats":null,"previous_names":["ub-mannheim/ocr-transform"],"tags_count":14,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focr-fileformat","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focr-fileformat/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focr-fileformat/r
eleases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/UB-Mannheim%2Focr-fileformat/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/UB-Mannheim","download_url":"https://codeload.github.com/UB-Mannheim/ocr-fileformat/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243663327,"owners_count":20327299,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alto","finereader","hocr","ocr","ocr-d","page-xml","transformation","validation"],"created_at":"2024-07-30T21:00:50.659Z","updated_at":"2025-03-14T23:30:56.147Z","avatar_url":"https://github.com/UB-Mannheim.png","language":"JavaScript","funding_links":[],"categories":["1. \u003ca name='Software'\u003e\u003c/a\u003eSoftware","File formats and tools","JavaScript"],"sub_categories":["1.3. 
\u003ca name='OCRfileformats'\u003e\u003c/a\u003eOCR file formats","CTPN [paper:2016](https://arxiv.org/pdf/1609.03605.pdf)"],"readme":"# ocr-fileformat\n\n[![Codacy Badge](https://app.codacy.com/project/badge/Grade/1cd1dc54634249aebbe3e157569ed26f)](https://app.codacy.com/gh/UB-Mannheim/ocr-fileformat/dashboard?utm_source=gh\u0026utm_medium=referral\u0026utm_content=\u0026utm_campaign=Badge_grade)\n[![Build Status](https://github.com/UB-Mannheim/ocr-fileformat/actions/workflows/ci.yml/badge.svg)](https://github.com/UB-Mannheim/ocr-fileformat/actions/workflows/ci.yml)\n[![GitHub release](https://img.shields.io/github/release/UB-Mannheim/ocr-fileformat.svg?maxAge=3600)](https://github.com/UB-Mannheim/ocr-fileformat/releases)\n[![ocr-fileformat Docker build](https://img.shields.io/docker/automated/ubma/ocr-fileformat.svg?maxAge=2592000?style=plastic)](https://hub.docker.com/r/ubma/ocr-fileformat)\n\nValidate and transform between OCR file formats (hOCR, ALTO, PAGE, FineReader)\n\n![Screenshot GUI](https://raw.githubusercontent.com/UB-Mannheim/ocr-fileformat/master/screenshot.png)\n\n\u003c!-- BEGIN-MARKDOWN-TOC --\u003e\n* [Installation](#installation)\n  * [Docker](#docker)\n  * [System-wide](#system-wide)\n* [Usage](#usage)\n  * [CLI](#cli)\n  * [GUI](#gui)\n  * [API](#api)\n* [Transformation](#transformation)\n  * [Transformation CLI](#transformation-cli)\n  * [Transformation GUI](#transformation-gui)\n  * [Transformation API](#transformation-api)\n  * [Supported Transformations](#supported-transformations)\n* [Validation](#validation)\n  * [Validation CLI](#validation-cli)\n  * [Validation GUI](#validation-gui)\n  * [Validation API](#validation-api)\n  * [Supported Validation Formats](#supported-validation-formats)\n* [License](#license)\n\n\u003c!-- END-MARKDOWN-TOC --\u003e\n\n## Installation\n\n### Docker\n\nYou can run the [command line scripts](#cli) and [web interface](#gui) as a\n[Docker container](https://hub.docker.com/r/ubma/ocr-fileformat), you only 
need\nDocker installed.\n\nTo start the web interface on [http://localhost:8080](http://localhost:8080):\n\n```sh\ndocker run --rm -it -p 8080:8080 ubma/ocr-fileformat\n```\n\nTo run the command line scripts, mount the directory containing your input\nfiles into the container's `/data` directory:\n\n```sh\ndocker run --rm -it -v \"$PWD\":/data ubma/ocr-fileformat ocr-transform alto2.0 hocr somefile.alto\n```\n\n### System-wide\n\nTo install system-wide to `/usr/local`:\n\n```sh\nsudo make install\n```\n\nTo install without `sudo` to your home directory:\n\n```sh\nmake install PREFIX=$HOME/.local\n```\n\nIf `$HOME/.local/bin` is not in your `PATH`, add this to your shell startup file (e.g. `~/.bashrc` or `~/.zshrc`):\n\n```\nexport PATH=\"$HOME/.local/bin:$PATH\"\n```\n\nThe web application has a PHP backend. You can deploy it on any PHP-capable\nserver by copying the [`web`](./web) folder somewhere below the document root\nof your server, e.g. `/var/www/html` for Apache on Debian/Ubuntu:\n\n```\nsudo -u www-data cp -r web /var/www/html/ocr-fileformat\n```\n\nIn this example the GUI would be available under [http://localhost/ocr-fileformat/](http://localhost/ocr-fileformat/).\n\n## Usage\n\nThe project offers two functions, transformation and validation, which can be accessed via a command line\nscript (CLI), through a web interface (GUI), or from your own tools (API).\n\n### CLI\n\n* [`ocr-transform`](./bin/ocr-transform.sh): Transformation of OCR output between OCR formats\n* [`ocr-validate`](./bin/ocr-validate.sh): Validation of OCR output against OCR format schemas\n\n### GUI\n\nThe web interface is for testing validation and transformations. 
You can upload\na file or select an input file by URL.\n\n### API\n\n* [`$PREFIX/share/ocr-fileformat/xslt`](./xslt) - XSLT stylesheets\n* [`$PREFIX/share/ocr-fileformat/xsd`](./xsd) - XSD schemas\n* [`$PREFIX/share/ocr-fileformat/script/transform`](./script/transform) - Transformation scripts\n* [`$PREFIX/share/ocr-fileformat/script/validate`](./script/validate) - Validation scripts\n\n## Transformation\n\n### Transformation CLI\n\n```\nUsage: ocr-transform [-dl] \u003cinput-fmt\u003e \u003coutput-fmt\u003e [\u003cinput\u003e [\u003coutput\u003e]] [-- \u003csaxon_opts\u003e]\n```\n\nFor example, you can transform an ALTO XML to a hOCR file with:\n\n```sh\nocr-transform alto hocr sample.xml sample.hocr\n```\n\nOr convert from ALTO XML (version 2.1) to hOCR with:\n\n```sh\nocr-transform alto2.1 hocr sample.alto sample.hocr\n```\n\nYou can also pass arguments directly to the Saxon CLI by passing them after a double dash (`--`). For example, to set the `foo` parameter to `bar`:\n\n```sh\nocr-transform alto hocr sample.xml sample.hocr -- foo=bar\n```\n\nTry `ocr-transform -h` to get an overview:\n\n\u003c!-- BEGIN-EVAL echo '```';./bin/ocr-transform.sh -h 2\u003e\u00261;echo '```'  --\u003e\n```\nUsage:\nocr-transform [OPTIONS] \u003cfrom\u003e \u003cto\u003e [\u003cinfile\u003e [\u003coutfile\u003e]] [-- \u003cscript-args\u003e]\nocr-transform [OPTIONS] \u003cfrom\u003e \u003cto\u003e --help-args Show script-args, and exit\nocr-transform [OPTIONS] -h|--help               Show this help, and exit\nocr-transform [OPTIONS] -v|--version            Show version, and exit\nocr-transform [OPTIONS] -L|--list               List available from/to, and exit\n\n    Options:\n        --debug   -d     Increase debug level by 1, can be repeated\n\n    Transformations:\n        abbyy hocr\n        abbyy page\n        alto hocr\n        alto page\n        alto text\n        alto2.0 alto3.0\n        alto2.0 alto3.1\n        alto2.0 hocr\n        alto2.1 alto3.0\n        alto2.1 
alto3.1\n        alto2.1 hocr\n        alto4.2 alto2.1\n        gcv alto\n        gcv hocr\n        gcv page\n        hocr alto\n        hocr alto2.0\n        hocr alto2.1\n        hocr alto3.0\n        hocr alto4.0\n        hocr page\n        hocr tei\n        hocr text\n        mybib alto3.0\n        page alto\n        page alto_legacy\n        page hocr\n        page page2019\n        page text\n        tei hocr\n        textract page\n```\n\n\u003c!-- END-EVAL --\u003e\n\n### Transformation GUI\n\nSelect the `Transform` menu option. Choose a URL, an input and an output\nformat. Click `Transform`.\n\n### Transformation API\n\nThe stylesheets are installed in `$PREFIX/share/ocr-fileformat/xslt` and can be\nused directly in your scripts and software. You will need to use an XSLT 2.0\ncapable stylesheet transformer.\n\n### Supported Transformations\n\n| From ╲ To           | hOCR | ALTO | PAGEXML | TEI | Text |\n| ---:                | ---  | ---  | ---     | --- | ---  |\n| hOCR                | -    | ✓    | ✓       | ✓   | ✓    |\n| ALTO                | ✓    | ✓    | ✓       | -   | ✓    |\n| PAGEXML             | ✓    | ✓    | ✓       | -   | ✓    |\n| ABBYY FineReader    | ✓    | -    | ✓       | -   | -    |\n| Google Cloud Vision | ✓    | ✓    | ✓       | -   | -    |\n| Amazon AWS Textract | -    | -    | ✓       | -   | -    |\n| TEI                 | ✓    | -    | -       | -   | -    |\n\n## Validation\n\n\u003c!-- BEGIN-EVAL echo '```';./bin/ocr-validate.sh -h 2\u003e\u00261;echo '```'  --\u003e\n```\nUsage:\nocr-validate [OPTIONS] \u003cschema\u003e \u003cfile\u003e [\u003cresultsFile\u003e]\nocr-validate [OPTIONS] -h|--help       Show this help, and exit\nocr-validate [OPTIONS] -v|--version    Show version, and exit\nocr-validate [OPTIONS] -L|--list       List available schemas, and exit\n\n    Options:\n        --debug   -d     Increase debug level by 1, can be repeated\n\n    Schemas:\n        hocr\n        alto-1-0 alto-1-1 alto-1-2 alto-1-3 
alto-1-4 alto-2-0 alto-2-1 alto-2-2-draft alto-3-0 alto-3-1 alto-3-2-draft alto-4-0 alto-4-1 alto-4-2 alto-4-3\n        abbyy-6-schema-v1 abbyy-8-schema-v2 abbyy-9-schema-v1 abbyy-10-schema-v1\n        page-2009-03-16 page-2010-01-12 page-2010-03-19 page-2013-07-15 page-2016-07-15 page-2017-07-15 page-2018-07-15 page-2019-07-15\n```\n\n\u003c!-- END-EVAL --\u003e\n\n### Validation CLI\n\nFor example, to validate an XML file against the ALTO 3.1 schema:\n\n```\nocr-validate alto-3-1 myFile.alto\n```\n\n### Validation GUI\n\nSelect the `Validate` menu option. Choose a URL and a schema. Click `Validate`.\n\n### Validation API\n\nThe XSD files are installed under `$PREFIX/share/ocr-fileformat/xsd`.\n\n### Supported Validation Formats\n\n|            | hOCR | ALTO | PAGEXML | FineReader | Google Cloud Vision | Amazon AWS Textract |\n| ---:       | ---  | ---  | ---     | ---        | ---                 | ---                 |\n| Validation | ✓    | ✓    | ✓       | ✓          | -                   | -                   |\n\n\n## License\n\nThis is free software. You may use it under the terms of the [MIT License](LICENSE).\n\nDuring the installation process several projects are included (in [`./vendor`](./vendor)). 
These projects have different licenses:\n\n* [Saxon HE 9.7](http://saxon.sourceforge.net/#F9.7HE), [`MPL`](https://www.mozilla.org/MPL/).\n* [ALTOXML schema](https://github.com/altoxml/schema), [\"Open Source\"](https://github.com/altoxml/schema/issues/37#issuecomment-218730230) for ALTO \u003c= 3.1, [`CC BY SA 4.0`](https://creativecommons.org/licenses/by-sa/4.0/legalcode) since ALTO 4.0\n* [PAGE schemas](http://www.primaresearch.org/schema/PAGE/gts/pagecontent/), `?`\n* [xsd-validator](https://github.com/kba/xsd-validator) by Adrian Mouat [@amouat](https://github.com/amouat), `Apache 2.0`\n* ABBYY FineReader XSD, `?`\n* [hOCR-to-ALTO](https://github.com/filak/hOCR-to-ALTO) by Filip Kriz [@filak](https://github.com/filak), [`MIT`](https://github.com/filak/hOCR-to-ALTO/blob/master/LICENSE.txt)\n* [hocr-spec](https://github.com/kba/hocr-spec-python) by Konstantin Baierer [@kba](https://github.com/kba), [`MIT`](https://github.com/kba/hocr-spec-python/blob/master/LICENSE)\n* [gcv2hocr](https://github.com/dinosauria123/gcv2hocr) by Endo Michiaki, [`CC BY 4.0`](https://creativecommons.org/licenses/by/4.0/legalcode)\n* [format-converters](https://github.com/OCR-D/format-converters) by OCR-D, [`Apache 2.0`](https://github.com/OCR-D/format-converters/blob/master/LICENSE)\n* [prima-page-converter](https://github.com/PRImA-Research-Lab/prima-page-converter/) by PRImA Research Lab, [`Apache 2.0`](https://github.com/PRImA-Research-Lab/prima-page-converter/blob/master/LICENSE)\n* [page-to-alto](https://github.com/kba/page-to-alto/) by Konstantin Baierer @kba, [`Apache 2.0`](https://github.com/kba/page-to-alto/blob/master/LICENSE)\n* [textract2page](https://github.com/slub/textract2page/) by Arne Rümmler @rue-a, [`Apache 
2.0`](https://github.com/slub/textract2page/blob/master/LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUB-Mannheim%2Focr-fileformat","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FUB-Mannheim%2Focr-fileformat","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FUB-Mannheim%2Focr-fileformat/lists"}