https://github.com/ocr-d/gt-mufilevelrules

OCR-D-Level-Rules can be created automatically with gt-MufiLevelRules from the encodings published by MUFI: The Medieval Unicode Font Initiative.

# gt-MufiLevelRules

Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by [MUFI: The Medieval Unicode Font Initiative](https://mufi.info).

The resulting OCR-D level rules conform to the [OCR-D specification](https://ocr-d.de/en/gt-guidelines/trans/transkription.html).
These rules can be used for substitutions or level checks, among other things.

Note:
- Not every character has a definition on every level, especially on level 1.
- OCR-D will try to fill in these gaps manually or automatically. The automated completion is based on the [unicruft](https://github.com/tboenig/gt-MufiLevelRules/tree/main/unicruft) program.
- For this reason, using the rules for automatic character normalization from level 3 or level 2 to level 1 is currently not recommended until the corresponding rules have been checked and corrected manually.
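
Before relying on a rule file for normalization, it can help to list the rules that still lack a definition on the target level. The following is a minimal sketch, assuming the JSON layout described under "JSON Format" in this README (`rule` holds the level 1, level 2 and level 3 forms in that order, and an empty string means "no definition"). The helper name `rules_missing_level` and the second sample rule are hypothetical.

```python
import json

# Sample data in the README's JSON layout. The first rule is the README's
# own example; the second rule is hypothetical and only illustrates a gap.
SAMPLE_JSON = """
{"ruleset": [
  {"rule": ["\\u00e4", "a\\u0364", ""], "type": "level"},
  {"rule": ["", "ff", "\\ufb00"], "type": "level"}
]}
"""


def rules_missing_level(ruleset, level):
    """Return all rules whose entry for `level` (1, 2 or 3) is empty."""
    index = level - 1
    return [entry for entry in ruleset if entry["rule"][index] == ""]


rules = json.loads(SAMPLE_JSON)["ruleset"]
missing_level_1 = rules_missing_level(rules, 1)
for entry in missing_level_1:
    # These rules need a manual check before any normalization to level 1.
    print("no level 1 definition:", entry["rule"])
```

For a real rule file, the string passed to `json.loads` would be replaced by the content of the downloaded JSON file.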

## Download the Rules

**🚦 You can download the set of rules here. 🚦**
- Select an individual rule file from the [rules directory](https://github.com/tboenig/gt-MufiLevelRules/tree/gh-pages/rules/characters)
- Or download all rules as a zip file from the [latest release](https://github.com/tboenig/gt-MufiLevelRules/releases/latest)

## Recreation of the rules

1. Copy or clone the repository:

   `git clone https://github.com/tboenig/gt-MufiLevelRules.git`
2. Install [Saxon](https://www.saxonica.com/download/download_page.xml) for XSL Transformations v3.0, then run:

   `java -jar saxon-he-XX.jar -xsl:scripts/MufiGTLevelRules2.xsl -s:scripts/MufiGTLevelRules.xsl output=characters merge=yes`

Parameters:
- **output** ``characters``: creates the individual rules; all rules are saved in the directory ``[directory]/rules/characters``
- **merge** ``yes``: creates the megarules (all rules in one file); the megarules are saved in the directory ``[directory]/rules``

The result of the conversion can be found in the directory ``[directory]/rules/characters``.
- Output formats:
  - XML
  - JSON

The script uses:

1. the [MUFI rules](https://gefin.ku.dk/q.php?q=mufiexport) (new version) and the [MUFI rules, old version](https://raw.githubusercontent.com/tboenig/keyboardGT/main/metadata/mufi.json)

2. a summary of the following [**additional rules**](https://github.com/tboenig/gt-MufiLevelRules/blob/main/metadata/megarules.json) from the [OCR-D Ground-Truth Transcription Guide](https://ocr-d.de/en/gt-guidelines/trans/trBeispiele.html), which take precedence over the MUFI rules where applicable:
   - [ruleset_character.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_character.json)
   - [ruleset_hyphenation.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_hyphenation.json)
   - [ruleset_ligature.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_ligature.json)
   - [ruleset_roman_digits.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_roman_digits.json)
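
What "take precedence" can mean in practice is sketched below. This is an illustration under stated assumptions, not the logic of the XSLT script: it assumes that two rules describe the same character when their most detailed non-empty form is identical, and it lets the rule from the Transcription Guide replace the MUFI rule in that case. The function names and the sample rules are hypothetical.

```python
def rule_key(entry):
    """Return the most detailed non-empty form (level 3, else 2, else 1).

    Assumption: this form identifies the character a rule is about.
    """
    for value in reversed(entry["rule"]):
        if value:
            return value
    return ""


def merge_with_priority(mufi_rules, guide_rules):
    """Merge two rule lists; guide rules override MUFI rules with the same key."""
    merged = {rule_key(entry): entry for entry in mufi_rules}
    merged.update({rule_key(entry): entry for entry in guide_rules})
    return list(merged.values())


# Hypothetical input: the MUFI rule has no level 1 form, the guide rule does.
mufi_rules = [
    {"rule": ["", "a\u0364", ""], "type": "level"},
    {"rule": ["", "ff", "\ufb00"], "type": "level"},
]
guide_rules = [
    {"rule": ["\u00e4", "a\u0364", ""], "type": "level"},
]

merged_rules = merge_with_priority(mufi_rules, guide_rules)
```

After the merge, the rule for `aͤ` carries the level 1 form from the guide, while the ligature rule from the MUFI list is kept unchanged.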

## Description of the rules

### JSON Format

All JSON files (both the pure MUFI rules and the final result) follow the same schema.

**Example:**

```JSON
{"ruleset":[
...
{"rule": ["ä", "aͤ", ""], "type": "level"}
...
]}
```

- Each rule has a key `rule` with a list of values.
- The values define the character representation on each of the 3 transcription levels:
  - Level 1 is at the first position
  - Level 2 is at the second position
  - Level 3 is at the third position
- Additional key-value combinations: ...
- Character values can be empty to signify that there is no definition (representation) at that level.
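
The substitution use case can be sketched as follows. This is a minimal example under assumptions, not part of gt-MufiLevelRules: the helper names `build_mapping` and `normalize` are hypothetical, and the second sample rule is invented to show how an empty value is handled. Rules with an empty value on either level are skipped, because an empty string means "no definition" rather than "delete this character".

```python
import re

# First rule: the README's example. Second rule: hypothetical, no level 1 form.
RULESET = [
    {"rule": ["\u00e4", "a\u0364", ""], "type": "level"},
    {"rule": ["", "\ua75b", "\ua75b"], "type": "level"},
]


def build_mapping(ruleset, source_level, target_level):
    """Map each source-level form to its target-level form (levels are 1..3)."""
    src, dst = source_level - 1, target_level - 1
    mapping = {}
    for entry in ruleset:
        values = entry["rule"]
        if values[src] and values[dst]:  # skip rules with a gap on either level
            mapping[values[src]] = values[dst]
    return mapping


def normalize(text, mapping):
    """Replace all source forms in one pass, longest forms first."""
    if not mapping:
        return text
    alternatives = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(form) for form in alternatives))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


level2_to_level1 = build_mapping(RULESET, source_level=2, target_level=1)
result = normalize("Ma\u0364dchen", level2_to_level1)
```

Replacing in a single regular-expression pass avoids one rule rewriting the output of another, and sorting by length keeps multi-character forms such as a base letter plus combining mark intact.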

### XML Format

```XML


AlphPresForm
LATIN SMALL LIGATURE FF
ff
ff

level

```
- **Elements**
  - `` = root element of a gt-MufiLevelRules dataset
  - `` = root element of a ruleset
  - `<range>` = category of characters
  - `<desc>` = general description of the sign or symbol
  - `<rule>`
    - Level 1: rule[position() = 1]
    - Level 2: rule[position() = 2]
    - Level 3: rule[position() = 3]
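
Reading the levels by position can be sketched with Python's standard library. The `rule` element name is taken from the XPath expressions above; the wrapper element name `ruleset` and the sample fragment are assumptions made for this illustration and may differ from the real files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the `rule` element name is taken from the README.
SAMPLE_XML = """
<ruleset>
  <rule>ff</rule>
  <rule>ff</rule>
  <rule/>
</ruleset>
"""


def levels_from_ruleset(ruleset_element):
    """Return {level: text} for levels 1 to 3, using positional lookups."""
    levels = {}
    for level in (1, 2, 3):
        # ElementTree supports the positional predicate `rule[n]`,
        # which corresponds to rule[position() = n] in full XPath.
        element = ruleset_element.find(f"rule[{level}]")
        if element is not None:
            levels[level] = element.text or ""  # empty element: no definition
    return levels


root = ET.fromstring(SAMPLE_XML.strip())
levels = levels_from_ruleset(root)
```

An empty `rule` element is returned as an empty string, which matches the "no definition" convention of the JSON format.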

The category of characters `<range>` and the general description of the sign or symbol `<desc>` were imported from the MUFI dataset.

The JSONPaths are:
- range : `$['..']['range']`
- desc : `$['..']['description']`

## See Also

- MUFI: The Medieval Unicode Font Initiative https://mufi.info/
- MUFI's data as JSON export https://gefin.ku.dk/q.php?q=mufiexport
- OCR-D Ground Truth Transcription Guidelines https://ocr-d.de/en/gt-guidelines/trans/
- Ground Truth level overview https://ocr-d.de/en/gt-guidelines/trans/trLevels.html