https://github.com/ocr-d/ocrd_framework
Docker installation for the OCR-D framework containing all available processors, taverna workflow and local repository.
https://github.com/ocr-d/ocrd_framework
ocr-d
Last synced: about 1 year ago
JSON representation
Docker installation for the OCR-D framework containing all available processors, taverna workflow and local repository.
- Host: GitHub
- URL: https://github.com/ocr-d/ocrd_framework
- Owner: OCR-D
- License: apache-2.0
- Created: 2019-12-13T07:46:28.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-01-08T07:51:00.000Z (over 6 years ago)
- Last Synced: 2025-02-03T01:34:25.538Z (over 1 year ago)
- Topics: ocr-d
- Language: Shell
- Size: 16.6 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# OCR-D Famework
Installation of OCR-D framework containing all available processors, taverna workflow and local research data repository.
## Installation with Docker
### Hardware Requirements
More than 8GB RAM and 20 GB of hard disc.
### Requirements
- Docker (see installation for [Ubuntu](https://github.com/OCR-D/repository_metastore/blob/master/installDocker/installationDocker.md))
- docker
- docker-compose
- git
- sed
- unzip
- wget
### Installation
Choose a directory on a disc with at least 10 GB free space left.
(In our example we use ocrd_framework inside the home directory)
Download and start [installation script](install_OCR-D_framework.sh).
```bash=bash
user@localhost:/home/user/$bash install_OCR-D_framework.sh /home/user/ocrd_framework
[...]
SUCCESS
Now you can start an OCR-D workflow with the following commands:
cd "/home/user/ocrd_framework/taverna"
docker run --network=\"host\" -v `pwd`:/data ocrd/taverna process"
```
Now there exists several folders
- repository - Contains all files of repository and the databases
- taverna - Contains all files workspaces and configuration of workflows
### Prepare hosts for accessing files in repo via browser
```bash=bash
user@localhost:/home/user/$ echo '127.0.0.1 kitdm20' | sudo tee -a /etc/hosts
127.0.0.1 kitdm20
```
### First Test
To check if the installation works fine you can start a first test.
```bash=bash
user@localhost:~/ocrd_framework/taverna$docker run --network="host" -v `pwd`:/data ocrd/taverna testWorkflow
[...]
Outputs will be saved to the directory: /taverna/git/Execute_OCR_D_workfl_output
# The processed workspace should look like this:
user@localhost:~/ocrd_framework/taverna$ls -1 workspace/example/data/
metadata
mets.xml
OCR-D-GT-SEG-BLOCK
OCR-D-GT-SEG-PAGE
OCR-D-IMG
OCR-D-IMG-BIN
OCR-D-IMG-BIN-OCROPY
OCR-D-OCR-CALAMARI_GT4HIST
OCR-D-OCR-TESSEROCR-BOTH
OCR-D-OCR-TESSEROCR-FRAKTUR
OCR-D-OCR-TESSEROCR-GT4HISTOCR
OCR-D-SEG-LINE
OCR-D-SEG-REGION
```
Each sub folder starting with 'OCR-D-OCR' should now
contain 4 files with the detected full text.
#### The metadata sub directory
The subdirectory 'metadata' contains the provenance of the workflow all
intermediate mets files and the stdout and stderr output of all executed processors.
#### Check results in browser
After the workflow all results are ingested to the research data repository.
The repository is available at http://localhost:8080/api/v1/metastore/bagit
### Create your own workflow
For configuration of the workflow see instructions in [README.md](https://github.com/OCR-D/taverna_workflow/blob/master/README.MD).
:information_source: All provided paths inside the parameter and workflow configuration files have to be 'dockerized'. For executing scripts relative paths are also possible.
The commands should look like this:
### Test Processors
For a fast test if a processor is available try the following command:
```bash=bash
# Test if processor is installed e.g. ocrd-cis-ocropy-binarize
user@localhost:~/ocrd_framework/taverna$docker run -v `pwd`:/data ocrd/taverna dump ocrd-cis-ocropy-binarize
{
"executable": "ocrd-cis-ocropy-binarize",
"categories": [
"Image preprocessing"
],
"steps": [
"preprocessing/optimization/binarization",
"preprocessing/optimization/grayscale_normalization",
"preprocessing/optimization/deskewing"
],
"input_file_grp": [
"OCR-D-IMG",
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"output_file_grp": [
"OCR-D-IMG-BIN",
"OCR-D-SEG-BLOCK",
"OCR-D-SEG-LINE"
],
"description": "Binarize (and optionally deskew/despeckle) pages / regions / lines with ocropy",
"parameters": {
"method": {
"type": "string",
"enum": [
"none",
"global",
"otsu",
"gauss-otsu",
"ocropy"
],
"description": "binarization method to use (only ocropy will include deskewing)",
"default": "ocropy"
},
"grayscale": {
"type": "boolean",
"description": "for the ocropy method, produce grayscale-normalized instead of thresholded image",
"default": false
},
"maxskew": {
"type": "number",
"description": "modulus of maximum skewing angle to detect (larger will be slower, 0 will deactivate deskewing)",
"default": 0.0
},
"noise_maxsize": {
"type": "number",
"description": "maximum pixel number for connected components to regard as noise (0 will deactivate denoising)",
"default": 0
},
"level-of-operation": {
"type": "string",
"enum": [
"page",
"region",
"line"
],
"description": "PAGE XML hierarchy level granularity to annotate images for",
"default": "page"
}
}
}
user@localhost:~/ocrd_framework/taverna$
```
### Execute your own Workflow
If workflow is configured it can be started.
```bash=bash
user@localhost:~/ocrd_framework/taverna$docker run --network="host" -v `pwd`:/data ocrd/taverna process my_parameters.txt relative/path/to/workspace/containing/mets
```
## More Information
* [Docker](https://www.docker.com/)