https://github.com/ocr-d/gt-mufilevelrules
OCR-D-Level-Rules can be created automatically with gt-MufiLevelRules from the encodings published by MUFI: The Medieval Unicode Font Initiative.
ground-truth guidelines ocr ocr-d transcription
- Host: GitHub
- URL: https://github.com/ocr-d/gt-mufilevelrules
- Owner: OCR-D
- License: gpl-3.0
- Created: 2022-08-18T08:24:48.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2024-04-18T17:44:27.000Z (about 2 years ago)
- Last Synced: 2024-12-21T19:33:11.957Z (over 1 year ago)
- Topics: ground-truth, guidelines, ocr, ocr-d, transcription
- Language: XSLT
- Homepage: https://tboenig.github.io/gt-MufiLevelRules/
- Size: 1.21 MB
- Stars: 2
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff
# gt-MufiLevelRules
Creates OCR-D Ground-Truth Transcription Level Rules automatically from the encodings published by [MUFI: The Medieval Unicode Font Initiative](https://mufi.info).
The resulting OCR-D level rules conform to the [OCR-D specification](https://ocr-d.de/en/gt-guidelines/trans/transkription.html).
These rules can be used for substitutions or level checks, among other things.
Note:
- Not every character has a definition on every level, especially on level 1.
- OCR-D will try to fill in these gaps manually or automatically. The automated completion is based on the [unicruft](https://github.com/tboenig/gt-MufiLevelRules/tree/main/unicruft) program.
- For this reason, using the rules for automatic character normalization from level 3 or level 2 to level 1 is currently not recommended until the corresponding rules have been manually checked and corrected.
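Before relying on a rule file for normalization, it can help to list the rules that lack a definition on the target level. The following is a minimal sketch, assuming the JSON schema described under "JSON Format" below. The sample data and the helper name `rules_missing_level` are hypothetical, and the second sample rule is invented for illustration.

```python
import json

# Hypothetical sample in the README's JSON schema. The first rule is the
# README's own example; the second one is invented and has no level-1 value.
SAMPLE = r'''
{"ruleset": [
  {"rule": ["\u00e4", "a\u0364", ""], "type": "level"},
  {"rule": ["", "\ua75b", "\ua75b"], "type": "level"}
]}
'''


def rules_missing_level(ruleset, level):
    """Return all rules whose value at the given level (1-3) is empty."""
    return [r for r in ruleset["ruleset"] if not r["rule"][level - 1]]


data = json.loads(SAMPLE)
for rule in rules_missing_level(data, 1):
    print("no level-1 definition:", rule["rule"])
```

Rules reported this way are candidates for manual checking before any automatic normalization to that level.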
## Download the Rules
**🚦 You can download the set of rules here. 🚦**
- Select the corresponding rule file: [rules directory](https://github.com/tboenig/gt-MufiLevelRules/tree/gh-pages/rules/characters)
- As a zip release file: [latest release](https://github.com/tboenig/gt-MufiLevelRules/releases/latest)
## Recreation of the rules
1. Copy or clone the repository:
`git clone https://github.com/tboenig/gt-MufiLevelRules.git`
2. Install [Saxon](https://www.saxonica.com/download/download_page.xml) for XSL Transformations (XSLT) 3.0, then run:
`java -jar saxon-he-XX.jar -xsl:scripts/MufiGTLevelRules2.xsl -s:scripts/MufiGTLevelRules.xsl output=characters merge=yes`
Parameters:
- **output** ``characters``: creates the rules; all rules are saved under the directory ``[directory]/rules/characters``
- **merge** ``yes``: creates the megarules (all rules in one file); the megarules are saved under the directory ``[directory]/rules``

The result of the conversion can be found in the directory ``[directory]/rules/characters``.
- Output formats:
  - xml
  - json
The script uses:
1. the [MUFI rules](https://gefin.ku.dk/q.php?q=mufiexport) (new version) and the [MUFI rules, old version](https://raw.githubusercontent.com/tboenig/keyboardGT/main/metadata/mufi.json)
2. a summary of the following [**additional rules**](https://github.com/tboenig/gt-MufiLevelRules/blob/main/metadata/megarules.json) from the [OCR-D Ground-Truth Transcription Guide](https://ocr-d.de/en/gt-guidelines/trans/trBeispiele.html), which take precedence over the MUFI rules where applicable:
- [ruleset_character.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_character.json)
- [ruleset_hyphenation.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_hyphenation.json)
- [ruleset_ligature.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_ligature.json)
- [ruleset_roman_digits.json](https://github.com/tboenig/gt-guidelines/blob/gh-pages/rules/ruleset_roman_digits.json)
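The precedence of the additional rules over the MUFI rules can be illustrated with a small merge. This is only a sketch and not the logic of the XSLT script. The function `merge_rulesets` is hypothetical, and so is the assumption that two rules refer to the same character when their most specific non-empty value matches.

```python
def rule_key(entry):
    """Identify a rule by its highest-level non-empty value (an assumption)."""
    return next((value for value in reversed(entry["rule"]) if value), None)


def merge_rulesets(mufi_rules, additional_rules):
    """Merge two rule lists; additional rules replace MUFI rules with the same key."""
    merged = {rule_key(entry): entry for entry in mufi_rules}
    # Later updates win, so the additional rules take precedence.
    merged.update({rule_key(entry): entry for entry in additional_rules})
    return list(merged.values())


# Placeholder values, chosen only to make the override visible.
mufi = [
    {"rule": ["a", "b", "c"], "type": "level"},
    {"rule": ["x", "y", "z"], "type": "level"},
]
additional = [{"rule": ["A", "B", "c"], "type": "level"}]

for entry in merge_rulesets(mufi, additional):
    print(entry["rule"])
```

A dictionary keeps the position of the first occurrence of a key, so an overridden rule stays where the MUFI rule was.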
## Description of the rules
### JSON Format
All JSON files (both the pure MUFI rules and the final result) follow the same schema.
**Example:**
```JSON
{"ruleset":[
...
{"rule": ["ä", "aͤ", ""], "type": "level"}
...
]}
```
- Each rule has a key `rule` and a list of values
- The values define the character representation on each of the 3 transcription levels:
  - Level 1 is at the first position
  - Level 2 is at the second position
  - Level 3 is at the third position
- Additional key-value combinations: ...
- Character values can be empty to signify there is no definition (representation) at that level.
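A substitution between two levels can be derived directly from the position of the values. The sketch below assumes the schema described above. The functions `substitution_table` and `substitute` are hypothetical helpers, and the code points used for the example rule are an assumption about how its characters are encoded.

```python
def substitution_table(rules, source_level, target_level):
    """Map each source-level value to its target-level value, skipping undefined entries."""
    table = {}
    for entry in rules:
        values = entry["rule"]
        src, dst = values[source_level - 1], values[target_level - 1]
        if src and dst and src != dst:
            table[src] = dst
    return table


def substitute(text, table):
    """Apply the table, replacing the longest source sequences first."""
    for src in sorted(table, key=len, reverse=True):
        text = text.replace(src, table[src])
    return text


# The rule from the example, written with explicit code points:
# level 1 = U+00E4, level 2 = "a" + U+0364, level 3 = undefined.
rules = [{"rule": ["\u00e4", "a\u0364", ""], "type": "level"}]
table = substitution_table(rules, source_level=2, target_level=1)
print(substitute("Ma\u0364dchen", table))
```

Empty values are skipped, so a rule without a definition on either level never produces a substitution.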
### XML Format
```XML
AlphPresForm
LATIN SMALL LIGATURE FF
ff
ff
ff
level
```
- **Elements**
- `` = root element of a gt-MufiLevelRules dataset
- `` = root element of a ruleset
- `<range>` = category of characters
- `<desc>` = general description of the sign or symbol
- `<rule>`
  - Level 1: rule[position() = 1]
  - Level 2: rule[position() = 2]
  - Level 3: rule[position() = 3]

The category of characters `<range>` and the general description of the sign or symbol `<desc>` were imported from the MUFI dataset.
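The positional XPath expressions above can be tried with Python's standard library. This is a minimal sketch: the wrapper element name `ruleset` in the sample is an assumption, the sample values are taken from the JSON example, and `levels` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the `rule` children and their order matter here.
SAMPLE = "<ruleset><rule>\u00e4</rule><rule>a\u0364</rule><rule/></ruleset>"


def levels(ruleset_element):
    """Return {level: text} for the three `rule` children, selected by position."""
    return {
        n: (ruleset_element.findtext(f"rule[{n}]") or "")
        for n in (1, 2, 3)
    }


root = ET.fromstring(SAMPLE)
print(levels(root))
```

`rule[1]` is the abbreviated form of `rule[position() = 1]`, which is the form ElementTree's limited XPath support accepts.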
The JSONPaths are:
- range : `$['..']['range']`
- desc : `$['..']['description']`
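Without a JSONPath library, the same values can be collected by walking the parsed JSON. The sketch below assumes that `range` and `description` are stored as keys next to `rule`. The sample object and the helper `find_key` are hypothetical.

```python
import json

# Hypothetical sample; the placement of `range` and `description` is an assumption.
SAMPLE = '''
{"ruleset": [
  {"rule": ["ff", "ff", "ff"], "type": "level",
   "range": "AlphPresForm", "description": "LATIN SMALL LIGATURE FF"}
]}
'''


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a parsed JSON tree."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


data = json.loads(SAMPLE)
print(list(find_key(data, "range")))
print(list(find_key(data, "description")))
```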
## See Also
- MUFI: The Medieval Unicode Font Initiative https://mufi.info/
- MUFI's data as JSON export https://gefin.ku.dk/q.php?q=mufiexport
- OCR-D Ground Truth Transcription Guidelines https://ocr-d.de/en/gt-guidelines/trans/
- Ground Truth level overview https://ocr-d.de/en/gt-guidelines/trans/trLevels.html