{"id":17182418,"url":"https://github.com/bertsky/hsbcala","last_synced_at":"2026-03-12T14:36:00.880Z","repository":{"id":118132719,"uuid":"494915211","full_name":"bertsky/hsbcala","owner":"bertsky","description":"train Calamari models for Upper Sorbian (Fraktur and Antiqua) prints on HPC","archived":false,"fork":false,"pushed_at":"2024-09-07T09:15:40.000Z","size":14,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-02-22T00:31:42.376Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bertsky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-21T23:20:49.000Z","updated_at":"2024-09-07T09:15:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"8275fca3-2e61-46f0-8cae-51b18559d6f2","html_url":"https://github.com/bertsky/hsbcala","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fhsbcala","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fhsbcala/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fhsbcala/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fhsbcala/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bertsky","download_url":"https://codeload.github.com/bertsky/hsbcala/tar.gz/refs/heads/ma
ster","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240384319,"owners_count":19792970,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-15T00:37:02.861Z","updated_at":"2026-03-12T14:35:55.829Z","avatar_url":"https://github.com/bertsky.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# hsbcala\n\n\u003e train Calamari models for Upper Sorbian (Fraktur and Antiqua) prints on HPC\n\nScripts for training [Calamari OCR](https://github.com/Calamari-OCR/calamari) models on [ZIH's Power9 NVidia V100 HPC cluster](https://doc.zih.tu-dresden.de/jobs_and_resources/hardware_overview/#ibm-power9-nodes-for-machine-learning) for [Upper Sorbian](https://www.sorabicon.de/en/home/) prints.\n\nThe GT data is [here for Fraktur](https://mrocel.sorbib.de/index.php/s/XstEfxREcf7LQEj) and [here for Antiqua](https://mrocel.sorbib.de/index.php/s/emcjiHz3MZFZtdW). 
Production and rights: [Sorbian Institute](https://www.serbski-institut.de/en).\n\nThe approach was to do **finetuning** on pretrained models:\n- for Fraktur prints (16k lines * 5 kinds of preprocessing):\n  - with Calamari 2: [deep3_fraktur19](https://github.com/Calamari-OCR/calamari_models_experimental)\n  - with Calamari 1: [fraktur_19th_century](https://github.com/Calamari-OCR/calamari_models)\n- for Antiqua prints (16k lines * 5 kinds of preprocessing):\n  - with Calamari 2: [deep3_lsh4](https://github.com/Calamari-OCR/calamari_models_experimental)\n  - with Calamari 1: [antiqua_historical](https://github.com/Calamari-OCR/calamari_models)\n\n(We don't want to have voting during inference, therefore we run `calamari-train` – not `calamari-cross-fold-train` – and pick the first model among the pretrained ensembles, respectively. We use Calamari 2.2.2 / Calamari 1.0.5 CLIs – in an attempt to find similar settings for both versions.)\n\nThis repo provides the [Slurm](https://doc.zih.tu-dresden.de/jobs_and_resources/slurm/) scripts, which:\n1. source an environment script `ocrenv.sh` loading the HPC environment's [modules](https://doc.zih.tu-dresden.de/software/modules/) (an [Lmod](https://www.tacc.utexas.edu/research-development/tacc-projects/lmod) system) and a custom venv (`powerai-kernel2.txt`)\n2. check whether any checkpoints exist in the output directory already –\n   - if yes, then use `calamari-resume-training`\n   - otherwise, start `calamari-train`\n3. set up all parameters\n4. 
wrap the call with [Nvidia Nsight](https://developer.nvidia.com/nsight-systems) for profiling\n\nFor optimal resource allocation (empirically determined via Nsight and the [PIKA system](https://doc.zih.tu-dresden.de/software/pika/) for job monitoring), we use\n- a large batch size (64-80)\n- a large number (10) of cores and data workers\n- a high amount of RAM (32 GB) per core, ~~with~~ _without_ preloading (but data on RAM disk) and data prefetching (32)\n- multiple GPUs (with the `MirroredStrategy` for [distributed training](https://www.tensorflow.org/guide/distributed_training)) on Calamari 2\n\nFor optimal accuracy, we use\n- re-computing the codec (i.e. keeping only shared codepoints, adding new ones)\n- implicit augmentation (5-fold)\n- explicit augmentation (by passing raw colors plus multiple binarization variants)\n- early stopping (at 10 epochs without improvement)\n\n## Results\n\nThe models are simply named…\n- for Fraktur prints:\n  + `hsbfraktur.cala1` (for Calamari 1)\n  + `hsbfraktur.cala` (for Calamari 2)\n- for Antiqua prints:\n  + `hsblatin.cala1` (for Calamari 1)\n  + `hsblatin.cala` (for Calamari 2)\n\nSee [release archives](https://github.com/bertsky/hsbcala/releases) for model files.\n\n**Note**: the models seem to have a soft dependency on\n(meaning the inference quality will be better if)\n- textline segmentation with dewarping or some vertical padding (\u0026gt;4px)\n- binarization with little to no noise (for Antiqua)  \n  raw colors (for Fraktur)\n\n(This needs to be investigated further.)\n\n## Evaluation\n\n...on held-out validation data (used for checkpoint selection, 3.2k / 3.8k lines):\n\n| **model** | **CER** |\n| --- | --- |\n| hsbfraktur.cala1 | 1.82% |\n| hsbfraktur.cala | 0.50% |\n| hsblatin.cala1 | 0.95% |\n| hsblatin.cala | 0.25% |\n\n...on truly representative extra data (771 / 1640 lines):\n\n| **model** | **CER** |\n| --- | --- |\n| hsbfraktur.cala1 | 0.45% |\n| hsbfraktur.cala | 0.47% |\n| hsblatin.cala1 | 1.23% |\n| 
hsblatin.cala | 0.52% |\n\n## Acknowledgement\n\nThe authors are grateful to the [Center for Information Services and High Performance Computing at TU Dresden](https://tu-dresden.de/zih/hochleistungsrechnen)\nfor providing its facilities for high throughput calculations.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertsky%2Fhsbcala","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbertsky%2Fhsbcala","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertsky%2Fhsbcala/lists"}