{"id":17182416,"url":"https://github.com/bertsky/nmalign","last_synced_at":"2025-07-29T19:13:52.092Z","repository":{"id":57677952,"uuid":"490370157","full_name":"bertsky/nmalign","owner":"bertsky","description":"forced alignment of lists of string by fuzzy string matching","archived":false,"fork":false,"pushed_at":"2025-05-01T22:02:40.000Z","size":93,"stargazers_count":10,"open_issues_count":2,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-07-03T04:40:43.811Z","etag":null,"topics":["alignment","ocr-d"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bertsky.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-05-09T16:54:47.000Z","updated_at":"2025-05-01T22:02:43.000Z","dependencies_parsed_at":"2025-04-15T09:01:02.878Z","dependency_job_id":null,"html_url":"https://github.com/bertsky/nmalign","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"purl":"pkg:github/bertsky/nmalign","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fnmalign","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fnmalign/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fnmalign/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fnmalign/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bertsky","download_url":"https://codeload.github
.com/bertsky/nmalign/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bertsky%2Fnmalign/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":267740314,"owners_count":24137073,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-07-29T02:00:12.549Z","response_time":2574,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alignment","ocr-d"],"created_at":"2024-10-15T00:37:02.289Z","updated_at":"2025-07-29T19:13:51.816Z","avatar_url":"https://github.com/bertsky.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![PyPI version](https://badge.fury.io/py/nmalign.svg)](https://badge.fury.io/py/nmalign)\n[![Pytest CI](https://github.com/bertsky/nmalign/actions/workflows/test-python.yml/badge.svg)](https://github.com/bertsky/nmalign/actions/workflows/test-python.yml)\n[![codecov](https://codecov.io/gh/bertsky/nmalign/graph/badge.svg?token=9JQYUC66M1)](https://codecov.io/gh/bertsky/nmalign)\n[![Docker Image CD](https://github.com/bertsky/nmalign/actions/workflows/docker-image.yml/badge.svg)](https://github.com/bertsky/nmalign/actions/workflows/docker-image.yml)\n\n# nmalign\n\n    forced alignment of lists of string by fuzzy string matching\n    \n  * [Introduction](#introduction)\n  * [Installation](#installation)\n  * [Usage](#usage)\n     * [standalone command-line interface 
nmalign](#standalone-command-line-interface-nmalign)\n     * [OCR-D processor interface ocrd-nmalign-merge](#ocr-d-processor-interface-ocrd-nmalign-merge)\n  * [Implementation](#implementation)\n     * [Consistency (monotonicity)](#consistency-monotonicity)\n     * [Splitting (subalignment)](#splitting-subalignment)\n     * [Interactive approval](#interactive-approval)\n\n## Introduction\n\nThis offers **forced alignment** of textlines by fuzzy string matching.\n(The implementation is based on [rapidfuzz cdist](https://maxbachmann.github.io/RapidFuzz/Usage/process.html#cdist).)\n\nIt combines all pairs of strings (i.e. text lines) from either side,\ncalculates their edit distance (assuming some of them are very similar),\nand assigns a mapping from one side to the other by iteratively\nselecting those pairs which have the next-smallest distance (and taking\nthem out of the search). \n\nThe mapping is not necessarily injective or surjective\n(because segments may be split or not match at all).\n\nThis can be used in OCR settings to align (pages or) lines when you have different\nsegmentation. For example, often ground truth data is only transcribed on\nthe page level, but OCR results are available on the line level with precise\ncoordinates. 
If GT and OCR text are close enough to each other, you could then\nmap the GT text to the predicted coordinates.\n\nIt offers:\n- an API (`nmalign.match`)\n- a standalone CLI (for strings / text files / list files)\n- an [OCR-D](https://ocr-d.de) compliant [workspace processor](https://ocr-d.de/en/spec/cli) (for [METS-XML](https://ocr-d.de/en/spec/mets)/[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) documents)\n\n\n## Installation\n\nCreate and activate a [virtual environment](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments) as usual.\n\nTo install Python dependencies:\n\n    pip install -r requirements.txt\n\nTo install this module (along with Python dependencies), do:\n\n    pip install .\n\nAlternatively, download the prebuilt image from Dockerhub:\n\n    docker pull ocrd/nmalign\n\n## Usage\n\n### standalone command-line interface `nmalign`\n\n\n```\nUsage: nmalign [OPTIONS]\n\n  Force-align two lists of strings.\n\n  Computes string alignments between each pair among l1 and l2 (after optionally\n  normalising both sides).\n\n  Then iteratively searches the next closest pair, while trying to maintain\n  local monotonicity.\n\n  If splits are allowed and the score is already low, then searches for more\n  matches among l1 for the pair's right side sequence: If any subset of them can\n  be combined into a path such that the sum score is better than the integral\n  score, then prefers those assignments.\n\n  Stores the assigned result as a mapping from l1 to l2. 
(Unmatched or cut off\n  elements will be assigned -1.)\n\n  Prints the corresponding list indices and match scores [0.0,1.0] as CSV data.\n  (For subsequences, the start and end position will be appended.)\n\nlist to be replaced: [exactly 1 required]\n  --strings1 TUPLE               as strings\n  --files1 TUPLE                 as file paths of strings\n  --filelist1 FILENAME           as text file with file paths of strings\n\nlist of replacements: [exactly 1 required]\n  --strings2 TUPLE               as strings\n  --files2 TUPLE                 as file paths of strings\n  --filelist2 FILENAME           as text file with file paths of strings\n\nOther options:\n  -i, --interactive              prompt for each assigned pair, either proceeding or skipping\n  -j, --processes INTEGER RANGE  number of processes to run in parallel\n                                 [1\u003c=x\u003c=32]\n  -N, --normalization TEXT       JSON object with regex patterns and\n                                 replacements to be applied before comparison\n  -x, --allow-splits             find multiple submatches if replacement scores\n                                 low\n  -s, --show-strings             print strings themselves instead of indices\n  -f, --show-files               print file names themselves instead of indices\n  -S, --separator TEXT           print this string between result columns\n                                 (default: tab)\n  --help                         Show this message and exit.\n```\n\nFor example:\n\n\u003cdetails\u003e\u003csummary\u003efile input, index output\u003c/summary\u003e\n\u003cp\u003e\n\n```\nnmalign --files1 GT.SELECTED/FILE_0094_*.gt.txt --files2 GT/FILE_0094_*.gt.txt\n0    -1    0.0\n1    -1    0.0\n2    0    91.78082\n3    1    100.0\n4    2    90.90909\n5    3    95.945946\n6    4    95.588234\n7    5    89.85507\n8    6    92.64706\n9    7    92.10526\n10    8    93.84615\n11    9    89.61039\n12    10    89.85507\n13    11    89.189186\n14  
  12    92.0\n15    13    96.969696\n16    14    95.38461\n17    15    90.41096\n18    16    91.25\n19    17    96.1039\n20    18    95.89041\n21    19    93.50649\n22    20    93.333336\n23    21    92.68293\n24    22    88.6076\n25    23    95.652176\n26    24    92.85714\n27    25    71.014496\n28    26    92.753624\n29    27    94.5946\n30    28    80.0\n31    29    89.393936\n32    30    83.333336\n33    31    97.01492\n34    32    94.366196\n35    33    89.333336\n36    34    100.0\n37    35    95.588234\n38    36    92.0\n39    37    87.5\n40    38    91.54929\n41    39    92.85714\n42    40    90.0\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\u003csummary\u003efile input, string output\u003c/summary\u003e\n\u003cp\u003e\n\n```\nzo ſo wbohim wotpalenym we jich nuzy a pſchi natwarjenju ze wſchěch ſtro⸗\tzo ſo wbohim wotpalenym we jih nuzy a pihi natwarjenju ze wſchech ftro-\t91.78082\nnow a podwolnje pomhaſche.\tnow a podwolnje pomhaſche.\t100.0\nWe lěcźe wójny mjez pruſkim kralom a rakuſkim khěžorom 1866 na 13.\tWe löcze wöjny mjez pruſkim kralom a rakuſtim khezorom 1866 na 13.\t90.90909\njanuara rano we 4 hodźinach wozjewi ſo we drjewjanej khěžcy we Filipsdorfu\tjanuara rano we 4 hodzinach wozjewi ſo we drjewjanej khezcy we Filipsdorfu\t95.945946\npola Rumburka Macź Boža khorej knježnje Madlenje Kadec, jej prajicy:\tpola Rumburka Macz Boza khorej knjeznje Madlenje Kadec, jej prajicy:95.588234\n„Moje dźěcźo, z nětka zažije\" (twoja bolaca rana). We Filipsdorfu na⸗\t„Moje dzéczo, z nétka zazije“ (twoja bolaca rana). We Filipsdorfu na-\t89.85507\ntwari ſo wot lěta 1870 z darow pobožnych wěriwych wulka rjana cyrkej\ttwari ſo wot leta 1870 z darow pobozuych weriwych wulka rjana eyrkej92.64706\na Serbja k ſwj. Marji tam rad pucźuja.\ta Serbja É fwj. 
Marji tam rad puczuja.\t92.10526\nDo Rumburka (1 hodźinu wot Filipsdorfa daloko) Serbja hižom dlěhe\tDo Rumburka (1 hodjinu wot Filipsdorfa daloko) Serbia hijom dlehe\t93.84615\nhacž 100 lět na porciunkulu (na 1. a 2. auguſcźe) k kapucinarjam z proceſſio⸗\thaci 100 ſet na porciunkulu (na 1. a 2. auquſcze) É kapucinarjam z proceſſiv-\t89.61039\nnom khodźa: do ſtareje a noweje Krupki we Cžechach (k cźeŕpjacomu Jě⸗\tnom khodja: do ſtareje a noweje Krupki we Czechach (f czetpjacomu Jè-\t89.85507\nzuſej na ſwjatym ſkhodźe a k boloſcźiwej Macźeri Božej) dwójcy za lěto, na\tzuſej na ſwjatym ſkhodze a É boloſcziwej Maczeri Bozej) dwöjcy za Leto, na\t89.189186\nſwjatki a na ſwj. dźeń Marijnoho naroda (8. ſeptembra), tež tak dołho. Prě⸗\tſwjatki a na fwj. dzeü Marijnoho naroda (8. ſeptembra), tež tak dolho. Pre-\t92.0\nnja ſerbſka putnica do Krupki bě wěſta Korchowa z Nowoſlic (1754).\tnja ſerbſka putnica do Krupki be weſta Korchowa z Nowoſlic (1754).\t96.969696\nHdyž we ſeptembrje 1865 ſerbſki proceſſion do Krupki dźěſche a do\tHdyi we ſeptembrje 1865 ſerbſki proceſſion do Krupki dzeſche a do\t95.38461\nměſtacžka Gottleuby pſchińdźe, hanjachu a wuſměchowachu pobožnych Serbow.\tmiſtaczka Gottleuby pſchindze, hanjachu a wuſmächowachu pobozuych Serbow.\t90.41096\nKrótki cžas na to, na 4. oktobrje 1865, wulki dźěl měſtacžka ſo wotpali. Młynſki\tKrótki czas na to, na 4. oktobrje 1865, wulki dzel méſtaczka ſo wotpali. Mtynſei\t91.25\nmiſchtr k. Jurij Wawrik z Khanjec, kotryž bě pſchi proceſſionje pobył, nahro⸗\tmiſchtr k. 
Jurij Wawrik z Khanjec, kotryz be pſchi proceſſionje pobyk, nahro⸗\t96.1039\nmadźi za wotpalenych mjez Serbami 110 toleri a pſchipóſła je do Gottleuby\tmadzi za wotpalenych mjez Serbami 110 toleri a pſchipöſta je do Gottleuby\t95.89041\nz liſtom: „Dar luboſcźe za wotpalenych wot katholſkich Serbow, kiž kóžde lěto\tz liſtom: „Dar luboſcze za wotpalenych wot katholſkich Serbow, kiz közde léto\t93.50649\npſchez Gottleubu do Krupki pucźuja.\" Hdyž ſerbſki proceſſion we ſcźehowacym\tpſchez Gottleubu do Krupki puczuja.“ Hdyz ſerbſki proceſſion we ſczehowaeym\t93.333336\nlěcźe 1866 ſo zaſy Gottleubje pſchibliži, cźehnjechu jomu měſchcźanoſta, lutherſki\tlécze 1866 ſo zaſy Gottleubje pſchiblizi, czehnjechu jomu meſchczanoſta, lutherſki\t92.68293\nduchowny a ſchulſke dźěcźi napſchecźo, powitachu jón luboznje z rycžu a z khěr⸗\tduchowny a ſchulſke dzẽczi napſcheczo, powitachu jón luboznje 3 ryczu a 3 kher—\t88.6076\nluſchom a podźakowachu ſo za doſtaty pjenježny dar. Knj. Jurij Wawrik\tluſchom a podzakowachu ſo za doſtaty pjenjezuy dar. Knj. Jurij Wawrik\t95.652176\nnahromadźi tež we lěcźe 1865 mjez Serbami 60 toleri na wobnowjenjo Ma⸗\tnahromadzi te} we lecze 1865 mjez Serbami 60 toleri na wobnowjenjo Ma-\t92.85714\nrijnoho wołtarja we farſkej cyrkwi w Krupcy. — Tónſamyn J. Wawrik na⸗\trijnoho woktarja we farſkej cyrkwi w Krupcy. — Tón\t71.014496\ntwari z kublerjom Jakubom Kocorom na pucźu mjez Khanjecami a Swinjaŕ⸗\ttwari 3 kublerjom Jakubom Kocorom na puczu miez Khanjecami a Swinjat-\t92.753624\nnju rjanu kapałku, kotruž kniez biſkop Ludwik Forwerk na 9. ſeptembra 1870\tnju riann kapalku, kotruz kniez biſkop Ludwik Forwerk na 9. ſeptembra 1870\t94.5946\nwoſwjecźi.\twoſwieczi.\t80.0\n1866 zawjedźechu ſo we Kulowſkej ſarſkej cyrkwi mejſke pobožnoſcźe\t1866 zawjedzechu ſo we Kulowſkej farſkej eyrkwi mejſke poboznofeze\t89.393936\nk cžeſcźi Macźeri Božej.\tk czeſczi Maczeri Bozej.\t83.333336\n1867 na 10. novembra załoži knj. tachantſki vikar Jakub Hermann (we\t1867 na 10. 
novembra zakozi knj. tachantſki vikar Jakub Hermann (we97.01492\nwójnje 1866 katholſki pólny kapłan pola ſakſkoho wójſtwa) katholſke to⸗\twöjnje 1866 katholſki pölny kaplan pola ſakſkoho wöjſtwa) katholſke to⸗\t94.366196\nwarſtwo rjemjeſłniſkich we Budyſchinje, kotrež po tſjóch lětach rjanu khěžu\twarſtwo rjemjeſkniſkich we Budyſchinje, kotrez po tſjöch Itah rjanu khezu\t89.333336\nna garbarſkej haſy kupi.\tna garbarſkej haſy kupi.\t100.0\nPo poſtajenju ſakſkoho miniſterſtwa ſo wot lěta 1868 měſto Budyſchin\tPo poſtajenju ſakſtoho miniſterſtwa ſo wot leta 1868 meſto Budyſchin95.588234\nněmſki wjacy njemjenuje „Budiſſin\", ale „Bautzen\".\tnämſti wjacy njemjenuje „Budiſſin“, ale „Bautzen“.\t92.0\nNa kóncu lěta 1867 płacźeſche kórc pſcheńcy 7 tol. 7½ nſl.; rožki 3 tol.\tNa köncu léta 1867 placzeſche köre pſcheücy 7 tol. 7½ uſl.; rozki 3 tol.\t87.5\n20 nſl.; jecžmjenja 2 tol. 25 nſl.; wowſa 2 tol. 10 nſl.; jahłow 7 tol.\t20 ufl.; jeczmjenja 2 tol. 25 uſl.; wowſa 2 tol. 10 uſl.; jahkow 7 tol.\t91.54929\n20 nſl.; hejduſche 5 tol. 25 nſl.; kana butry 22½ nſl. — Na kóncu lěta\t20 ufl.; hejduſche 5 tol. 25 nſl.; kana butry 22½ nfl. — Na köncu léta\t92.85714\n1868: kórc pſcheńcy 6 tol.; rožki 4 tol. 22½ nfl.; jecžmjenja 4 tol.;\t1868: köre pſcheücy 6 tol.; rozki 4 tol. 
22 ½ nſl.; jeczmjenja 4 tol.;\t90.0\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\u003cdetails\u003e\u003csummary\u003efile input, filename output\u003c/summary\u003e\n\u003cp\u003e\n\n```\nnmalign -f --files1 GT.SELECTED/FILE_0094_*.gt.txt --files2 GT/FILE_0094_*.gt.txt\nGT.SELECTED/FILE_0094_GT.SELECTED_region0000_region0000_line0000.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0013.gt.txt\t0.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0002_region0002_line0000.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0013.gt.txt\t0.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0000.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0013.gt.txt\t0.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0001.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0001.gt.txt\t91.89189\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0002.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0002.gt.txt\t100.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0003.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0003.gt.txt\t91.04478\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0004.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0004.gt.txt\t96.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0005.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0005.gt.txt\t95.652176\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0006.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0006.gt.txt\t90.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0007.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0007.gt.txt\t92.753624\nGT.SELECTE
D/FILE_0094_GT.SELECTED_region0005_region0005_line0008.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0008.gt.txt\t92.30769\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0009.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0009.gt.txt\t93.93939\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0010.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0010.gt.txt\t89.74359\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0011.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0011.gt.txt\t90.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0012.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0012.gt.txt\t89.333336\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0013.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0013.gt.txt\t92.10526\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0014.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0014.gt.txt\t97.01492\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0015.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0015.gt.txt\t95.454544\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0016.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0016.gt.txt\t90.54054\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0017.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0017.gt.txt\t91.358025\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0018.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0018.gt.txt\t96.15385\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0019.gt.txt\tGT/FILE_0094_GT_FILE_009
4_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0019.gt.txt\t95.945946\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0020.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0020.gt.txt\t93.589745\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0021.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0021.gt.txt\t93.42105\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0022.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0022.gt.txt\t92.77109\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0023.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0023.gt.txt\t88.75\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0024.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0024.gt.txt\t95.71429\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0025.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0025.gt.txt\t92.95775\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0026.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0026.gt.txt\t71.42857\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0027.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0029.gt.txt\t92.85714\nGT.SELECTED/FILE_0094_GT.SELECTED_region0005_region0005_line0028.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0002_FILE_0094_CROPPED_region0002_line0030.gt.txt\t94.666664\nGT.SELECTED/FILE_0094_GT.SELECTED_region0011_region0011_line0000.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0001.gt.txt\t81.818184\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0000.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0002.gt.txt\t89.55224\nGT.SELEC
TED/FILE_0094_GT.SELECTED_region0012_region0012_line0001.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0003.gt.txt\t84.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0002.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0004.gt.txt\t97.05882\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0003.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0005.gt.txt\t94.44444\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0004.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0006.gt.txt\t89.47369\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0005.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0007.gt.txt\t100.0\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0006.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0008.gt.txt\t95.652176\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0007.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0009.gt.txt\t92.15686\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0008.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0010.gt.txt\t87.671234\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0009.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0011.gt.txt\t91.666664\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0010.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0012.gt.txt\t92.95775\nGT.SELECTED/FILE_0094_GT.SELECTED_region0012_region0012_line0011.gt.txt\tGT/FILE_0094_GT_FILE_0094_CROPPED_region0003_FILE_0094_CROPPED_region0003_line0013.gt.txt\t90.14085\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n### [OCR-D 
processor](https://ocr-d.de/en/spec/cli) interface `ocrd-nmalign-merge`\n\nTo be used with [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) documents in an [OCR-D](https://ocr-d.de/en/about) annotation workflow.\n\n```\nUsage: ocrd-nmalign-merge [OPTIONS]\n\n  forced alignment of lists of string by fuzzy string matching\n\n  \u003e Force-align the textlines text of both inputs for each page, then\n  \u003e insert the 2nd into the 1st.\n\n  \u003e Find file pairs in both input file groups of the workspace for the\n  \u003e same page IDs.\n\n  \u003e Open and deserialize PAGE input files, then iterate over the element\n  \u003e hierarchy down to the TextLine level, looking at each first\n  \u003e TextEquiv. (If the second input has no TextLines, but newline-\n  \u003e separated TextEquiv at the TextRegion level, then use these instead.\n  \u003e If either side has no lines, then skip that page.)\n\n  \u003e Align character sequences in all pairs of lines for any combination\n  \u003e of textlines from either side.\n\n  \u003e If ``normalization`` is non-empty, then apply each of these regex\n  \u003e replacements to both sides before comparison.\n\n  \u003e Then iteratively search the next closest match pair. Remember the\n  \u003e assigned result as mapping from first to second fileGrp.\n\n  \u003e When all lines of the second fileGrp have been assigned, or the\n  \u003e ``cutoff_dist`` has been reached, apply the mapping by inserting\n  \u003e each line from the second fileGrp into position 0 (and `@index=0`)\n  \u003e at the first fileGrp. Also mark the inserted TextEquiv via\n  \u003e `@dataType=other` and `@dataTypeDetails=GRP`.\n\n  \u003e (Unmatched or cut off lines will stay unchanged, except for their\n  \u003e `@index` now starting at 1.)\n\n  \u003e If ``allow_splits`` is true, then for each long bad match, spend\n  \u003e some extra time searching for subsegmentation candidates, i.e. 
a\n  \u003e sequence of multiple lines from the first fileGrp aligning with a\n  \u003e single line from the second fileGrp. When such a sequence outscores\n  \u003e the bad match, prefer the concatenated sequence over the single\n  \u003e match when inserting results.\n\n  \u003e Produce a new PAGE output file by serialising the resulting\n  \u003e hierarchy.\n\nOptions:\n  -I, --input-file-grp USE        File group(s) used as input\n  -O, --output-file-grp USE       File group(s) used as output\n  -g, --page-id ID                Physical page ID(s) to process\n  --overwrite                     Remove existing output pages/images\n                                  (with --page-id, remove only those)\n  --profile                       Enable profiling\n  --profile-file                  Write cProfile stats to this file. Implies --profile\n  -p, --parameter JSON-PATH       Parameters, either verbatim JSON string\n                                  or JSON file path\n  -P, --param-override KEY VAL    Override a single JSON object key-value pair,\n                                  taking precedence over --parameter\n  -m, --mets URL-PATH             URL or file path of METS to process\n  -w, --working-dir PATH          Working directory of local workspace\n  -l, --log-level [OFF|ERROR|WARN|INFO|DEBUG|TRACE]\n                                  Log level\n  -C, --show-resource RESNAME     Dump the content of processor resource RESNAME\n  -L, --list-resources            List names of processor resources\n  -J, --dump-json                 Dump tool description as JSON and exit\n  -D, --dump-module-dir           Output the 'module' directory with resources for this processor\n  -h, --help                      This help message\n  -V, --version                   Show version\n\nParameters:\n   \"normalization\" [object - {}]\n    replacement pairs (regex patterns and regex backrefs) to be applied\n    prior to matching (but not on the result itself)\n   \"allow_splits\" 
[boolean - false]\n    allow line strings of the first input fileGrp to be matched by\n    multiple line strings of the second input fileGrp (so concatenate\n    all the latter before inserting into the former)\n```\n\nFor example:\n\n\u003cdetails\u003e\u003csummary\u003efile input, index output\u003c/summary\u003e\n\u003cp\u003e\n\n```\nocrd-nmalign-merge -I OCR-D-OCR,OCR-D-GT-SEG-BLOCK -O OCR-D-GT-SEG-LINE\n```\n\n\u003c/p\u003e\n\u003c/details\u003e\n\n## Implementation\n\n1. set up a matrix of shape _N,M_ (where _N_ is the number of strings\non the left-hand side, and _M_ is the number of strings on the right-hand side)\nand compute all pairwise similarity scores – using `rapidfuzz.process.cdist`\nwith metric `rapidfuzz.distance.Levenshtein.normalized_similarity`,\nwhich efficiently calculates global alignments (Needleman-Wunsch) in parallel.\n\n2. iteratively assign pairs _i,j_ (effectively adding a mapping from _i_ to _j_)\nby picking the best scoring pair among the rows and columns not already assigned.\n\n### Consistency (monotonicity)\n\nNaïvely, best score means largest similarity. But it is easier for a short pair\nto be similar by chance than for a long pair of strings. And we want to start\nwith pairs that most certainly belong to each other. So the similarity must be\n**weighted** with the length of the string in question.\n\nMoreover, since realistically two sets of texts will likely differ slightly in\ntheir global _reading order_, but (more or less) retain local _line segmentation_,\nwe prioritise solutions which maintain **monotonicity**. \n\nSince this criterion becomes more important as the matrix gets completed and the\nactual string alignments become worse (and thus less reliable), we attenuate\nthe preference towards monotonicity by a sigmoid over _M_.\n\n### Splitting (subalignment)\n\nSometimes even the local _line segmentation_ is not retained between both sets\nof texts. Thus, strings will appear split on one side. 
To address that, there\nis an option to allow **splitting** the right-hand side:\n\nIf (during step 2.) the score is already too low, then search all remaining rows\nfor partial matches against column _j_ – again using `cdist`, but now with metric\n`rapidfuzz.fuzz.partial_ratio`, which efficiently calculates local alignments\n(if not exactly Smith-Waterman) in parallel.\n\nNote that for a subsegmentation of that column, we need a **spanning sequence**\nof mutually **non-overlapping** matches across some matching rows. To that end,\nfor all matches _i_ above some threshold, now proceed to compute their exact\nsubalignments of the string in column _j_ (again in parallel), and store their\ndistances into a matrix of shape _L,L_ (where _L_ is the length of that string)\nwith the start position as row and the end position as column, respectively.\nLikewise, store row indexes into a sister matrix.\n\nNext, determine the **shortest path** through the distance matrix (spanning\nfrom _0,0_ to _L,L_ monotonically).\n(In order to accommodate the case where subalignment matches are not already\nspanning perfectly, the distance matrix is filled with default distances\ncorresponding to random deletions of characters.)\n\nBacktrack that path among both matrices to determine the overall score, the\nlocal scores and row indexes _i_, and the local column string positions.\nIf the overall score does improve the global score for _j_, then assign\nall rows _i_ to subslices of _j_, respectively. (Otherwise continue with\nthe global assignment.)\n\n### Interactive approval\n\nIf enabled, during step 2. (after one alignment pair or a list of subalignments\nhas been found), prints this pair as a diff and prompts for approval.\n\nIf granted, proceeds normally. 
Else, if this was a subalignment, drops that\nresult and proceeds to the global alignment for that segment (prompting again).\nOtherwise, skips that pair and proceeds to the next-best one.\n\n## Open Tasks\n\nIf OCR confidence data is available on the input, this should be utilised.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertsky%2Fnmalign","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbertsky%2Fnmalign","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbertsky%2Fnmalign/lists"}