https://github.com/ub-mannheim/dach-gt
Ground truth and full text for selected prints of German libraries
- Host: GitHub
- URL: https://github.com/ub-mannheim/dach-gt
- Owner: UB-Mannheim
- License: cc0-1.0
- Created: 2023-06-03T14:51:01.000Z
- Default Branch: main
- Last Pushed: 2025-01-11T09:06:58.000Z
- Last Synced: 2025-04-13T05:52:18.511Z
- Topics: escriptorium, fraktur, ground-truth, ocr
- Language: Shell
- Homepage:
- Size: 12.1 MB
- Stars: 2
- Watchers: 4
- Forks: 2
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Ground truth and full text for selected prints of German archives and libraries
* [Staatsbibliothek zu Berlin](data/DE-1)
* [Universitätsbibliothek Marburg](data/DE-4)
* [Bayerische Staatsbibliothek](data/DE-12) / Münchener Digitalisierungszentrum
* [Universitäts- und Landesbibliothek Darmstadt](data/DE-17)
* [Herzog August Bibliothek Wolfenbüttel](data/DE-23)
* [Thüringer Universitäts- und Landesbibliothek](data/DE-27)
* [Universitäts- und Stadtbibliothek Köln](data/DE-38)
* [Staats- und Universitätsbibliothek Bremen](data/DE-46)
* [Universitäts- und Landesbibliothek Düsseldorf](data/DE-61)
* [Hochschulbibliothek Fachhochschule Potsdam](data/DE-525)
* [MARCHIVUM Mannheim](data/DE-Mh40)
### Collection of useful commands
```
# Strip carriage returns and remove blank lines from ALTO and PAGE XML (edits the files in place).
perl -i -ne 'tr|\r||d; next if /^\s*$/; print' *.xml
# Remove ALTO files without full text (no line with a non-empty CONTENT attribute).
rm -f $(grep -L 'CONTENT="..*"' *.xml)
# Remove PAGE files without full text (no line with a non-empty Unicode element).
rm -f $(grep -L '<Unicode>..*</Unicode>' *.xml)
```
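The `rm -f $(grep -L ...)` commands above delete immediately and rely on shell word splitting, so file names containing spaces would be mishandled. As a minimal sketch (not part of the repository), the hypothetical helper `files_without_text` below only lists the candidate files so they can be reviewed first. It assumes a POSIX `sh` and `grep`, and the PAGE pattern shown in the comments assumes that the text sits in `<Unicode>` elements on a single line.

```sh
#!/bin/sh
# files_without_text PATTERN FILE...
# Hypothetical helper (not part of the repository): print the name of every
# regular file that has no line matching the basic regular expression PATTERN.
# Files that grep cannot read are listed as well.
files_without_text() {
    pattern=$1
    shift
    for f in "$@"; do
        # Skip unmatched globs and anything that is not a regular file.
        [ -f "$f" ] || continue
        if ! grep -q -e "$pattern" -- "$f"; then
            printf '%s\n' "$f"
        fi
    done
}

# Dry run for ALTO: files without a non-empty CONTENT attribute.
#   files_without_text 'CONTENT="..*"' *.xml
# Dry run for PAGE (assumed pattern): files without a non-empty Unicode element.
#   files_without_text '<Unicode>..*</Unicode>' *.xml
```

Once the list looks right, it could be piped into a loop such as `files_without_text 'CONTENT="..*"' *.xml | while IFS= read -r f; do rm -f -- "$f"; done`, which copes with spaces in file names but not with embedded newlines.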