{"id":13415587,"url":"https://github.com/cneud/ocr-conversion","last_synced_at":"2026-02-06T03:31:48.848Z","repository":{"id":36742114,"uuid":"41048716","full_name":"cneud/ocr-conversion","owner":"cneud","description":"Conversions between various OCR formats","archived":false,"fork":false,"pushed_at":"2023-05-13T15:11:16.000Z","size":36,"stargazers_count":82,"open_issues_count":0,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-01-23T19:58:43.928Z","etag":null,"topics":["abbyy-xml","alto-xml","hocr","ocr","page-xml","tei-xml"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cneud.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2015-08-19T17:18:38.000Z","updated_at":"2025-11-21T20:26:00.000Z","dependencies_parsed_at":"2024-01-07T21:13:32.979Z","dependency_job_id":"4bb455aa-b3ce-49f6-8b77-e4b7c8b1357a","html_url":"https://github.com/cneud/ocr-conversion","commit_stats":null,"previous_names":["cneud/ocr-conversion-scripts"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/cneud/ocr-conversion","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cneud%2Focr-conversion","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cneud%2Focr-conversion/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cneud%2Focr-conversion/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cneud%2Focr-conversion/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/own
ers/cneud","download_url":"https://codeload.github.com/cneud/ocr-conversion/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cneud%2Focr-conversion/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29148136,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-06T02:39:25.012Z","status":"ssl_error","status_checked_at":"2026-02-06T02:37:22.784Z","response_time":59,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["abbyy-xml","alto-xml","hocr","ocr","page-xml","tei-xml"],"created_at":"2024-07-30T21:00:50.535Z","updated_at":"2026-02-06T03:31:48.823Z","avatar_url":"https://github.com/cneud.png","language":null,"funding_links":[],"categories":["1. \u003ca name='Software'\u003e\u003c/a\u003eSoftware","Others"],"sub_categories":["1.3. \u003ca name='OCRfileformats'\u003e\u003c/a\u003eOCR file formats"],"readme":"OCR conversion\n==============\n\nCollection of scripts and stylesheets for conversion between various OCR formats. 
\n\nYou may also want to check out the excellent [ocr-fileformat](https://github.com/UB-Mannheim/ocr-fileformat) by [@UB-Mannheim](https://github.com/UB-Mannheim).\n\n#### ABBYY\n * [`abbyy2hocr.xsl`](https://gist.github.com/tfmorris/5977784) - ABBYY FineReader XML to hOCR converter by [@Rod Page](http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html#comment-400434491)\n  * [`abbyy2hocr.xsl`](https://github.com/OCR-D/format-converters/blob/master/abbyy2hocr.xsl) - ABBYY FineReader XML to hOCR converter by [@Rod Page](http://iphylo.blogspot.com/2011/07/correcting-ocr-using-hocr-firefox.html#comment-400434491) - updated by [@OCR-D](https://github.com/OCR-D)\n  * [`abbyy-to-hocr`](https://git.archive.org/merlijn/archive-hocr-tools/-/blob/master/bin/abbyy-to-hocr) - ABBYY FineReader XML to hOCR converter by [@merlijn](https://git.archive.org/merlijn)\n * [`teip5-v5.xsl`](http://discoveryspace.upei.ca/islandlives.ca/sites/discoveryspace.upei.ca.islandlives.ca/files/teip5-v5.xsl) - Transform ABBYY FineReader XML into TEI [@UPEI](http://discoveryspace.upei.ca/islandlives.ca/node/130)\n * [`ABBYY_to_TEI_by_XMLReader.php`](http://able.myspecies.info/abbyy-xml-tei-xml) - Convert ABBYY XML to TEI using PHP's XMLReader [@able-project](http://able.myspecies.info/abbyy-xml-tei-xml)\n * [`ocr_to_teifacsimile.xsl`](https://github.com/emory-libraries/readux/blob/master/readux/books/ocr_to_teifacsimile.xsl) - Generate page-level TEI facsimile from Abbyy OCR xml or METS/ALTO [@readux](https://github.com/emory-libraries/readux)\n * [`AbbyyToAlto.php`](https://github.com/ironymark/AbbyyToAlto/blob/master/AbbyyToAlto.php) - PHP5 script to convert ABBYY FineReader XML into ALTO XML [@ironymark](https://github.com/ironymark/AbbyyToAlto)\n * [`AbbyyToAltoConverter.java`](https://github.com/Mewel/abbyy-to-alto) - Java library to convert abbyy.xml (v10) to alto.xml (v2) [@abbyy-to-alto](https://github.com/Mewel/abbyy-to-alto)\n \n#### ALTO\n * 
[`alto2tei.xsl`](https://github.com/INL/OpenConvert/blob/master/resources/xsl/alto2tei.xsl) - Output TEI from ALTO input format [@OpenConvert](https://github.com/INL/OpenConvert) \n * [`AltoToTeiA.xsl`](https://github.com/collex/typewright/blob/master/lib/saxon/AltoToTeiA.xsl) - For Gale OCR XML or 18thConnect Typewright XML files [@typewright](https://github.com/collex/typewright)\n * [`ocr_to_teifacsimile.xsl`](https://github.com/emory-libraries/readux/blob/master/readux/books/ocr_to_teifacsimile.xsl) - Generate page-level TEI facsimile from Abbyy OCR xml or METS/ALTO [@readux](https://github.com/emory-libraries/readux)\n * [`alto2hocr.xsl`](https://github.com/filak/hOCR-to-ALTO/blob/master/alto2hocr.xsl) - Convert ALTO 2.0 / ALTO 2.1 to hOCR [@filak](https://github.com/filak/hOCR-to-ALTO)\n * [`alto2text.xsl`](https://github.com/filak/hOCR-to-ALTO/blob/master/alto2text.xsl) - Convert ALTO 2.0 / ALTO 2.1 to plain text [@filak](https://github.com/filak/hOCR-to-ALTO)\n * [`alto_ocr_text.py`](https://github.com/cneud/alto-ocr-text/blob/master/alto_ocr_text.py) - Extracts the text from an ALTO file and writes it to stdout [@cneud](https://github.com/cneud/alto-ocr-text)\n * [`ALTO2HTML.bat`](https://github.com/altomator/ALTO-HTML) - Batch script to convert ALTO files to HTML [@altomator](https://github.com/altomator/ALTO-HTML)\n * [`dinglehopper-extract`](https://github.com/qurator-spk/dinglehopper) - Extracts the text from ALTO and PAGE XML files [@qurator-spk](https://github.com/qurator-spk/)\n \n#### hOCR\n * [`hOCR2ALTO.xsl`](https://github.com/ONB-RD/hOCRTools/blob/master/xsl/hOCR2ALTO.xsl) - Utilities to process and handle hOCR [@ONB-RD](https://github.com/ONB-RD/hOCRTools)\n * [`hocr2alto2.0.xsl`](https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.0.xsl) - Convert hOCR to ALTO 2.0 [@filak](https://github.com/filak/hOCR-to-ALTO)\n * [`hocr2alto2.1.xsl`](https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2alto2.1.xsl) - Convert hOCR to ALTO 2.1 
[@filak](https://github.com/filak/hOCR-to-ALTO)\n * [`hocr2tei.xsl`](https://github.com/TEIC/Hackathon/blob/master/DH2015/xsl/hocr2tei.xsl) - Convert hOCR from Tesseract to basic TEI output [@DH2015](https://github.com/TEIC/Hackathon/tree/master/DH2015)\n  * [`hocr2tei.xsl`](https://github.com/OCR-D/format-converters/blob/master/hocr2tei.xsl) - Convert hOCR from Tesseract to basic TEI output from [@DH2015](https://github.com/TEIC/Hackathon/tree/master/DH2015) - updated by [@OCR-D](https://github.com/OCR-D)\n * [`hocr2text.xsl`](https://github.com/filak/hOCR-to-ALTO/blob/master/hocr2text.xsl) - Convert hOCR to plain text [@filak](https://github.com/filak/hOCR-to-ALTO)\n * [`HocrConverter.py`](https://github.com/jbrinley/HocrConverter/blob/master/HocrConverter.py) - Create a PDF from an hOCR file and an image [@jbrinley](https://github.com/jbrinley/HocrConverter)\n \n#### PAGE\n * [`PageConverter.java`](https://github.com/PRImA-Research-Lab/prima-page-converter) - Convert ALTO XML, FineReader XML, Google CV, and hOCR to the latest PAGE XML format [@prima](https://github.com/PRImA-Research-Lab/prima-page-converter)\n * [`xml_to_box.xsl`](https://github.com/idhmc-tamu/eMOP/blob/master/xml_to_box.xsl) - Convert PAGE XML to Tesseract box file [@eMOP](https://github.com/idhmc-tamu/eMOP)\n * [`page_to_text.py`](https://github.com/cneud/page-to-text/blob/master/page_to_text.py) - Extracts the text from a PAGE file and writes it to stdout [@cneud](https://github.com/cneud/page-to-text)\n * [`PageToPdfConverter.java`](https://github.com/PRImA-Research-Lab/prima-page-to-pdf) - Convert PAGE XML files with layout and text content to PDF [@prima](https://github.com/PRImA-Research-Lab/prima-page-to-pdf)\n * [`page2tei-0.xsl`](https://github.com/dariok/page2tei/blob/master/page2tei-0.xsl) - Convert PAGE XML to TEI [@dariok](https://github.com/dariok/page2tei)\n * [`PageToAlto.xsl`](https://github.com/Transkribus/TranskribusCore/blob/master/src/main/resources/xslt/PageToAlto.xsl) - 
Convert PAGE XML to ALTO [@Transkribus](https://github.com/Transkribus)\n * [`page-to-alto`](https://github.com/kba/page-to-alto) - Convert PAGE XML to ALTO (all versions) [@kba](https://github.com/kba/page-to-alto)\n * [`dinglehopper-extract`](https://github.com/qurator-spk/dinglehopper) - Extracts the text from ALTO and PAGE XML files [@qurator-spk](https://github.com/qurator-spk/)\n \n#### TEI\n * [`tei2txt.xsl`](https://github.com/haoess/dta-tools/blob/master/tei2txt/share/xslt/tei2txt.xsl) - Convert DTA TEI-P5 to plain text [@haoess](https://github.com/haoess/dta-tools)\n * [`tei2hocr.xsl`](https://github.com/jbaiter/tei2hocr/blob/master/tei2hocr.xsl) - Convert DTA TEI-P5 to hOCR [@jbaiter](https://github.com/jbaiter/tei2hocr)\n\n#### Other\n* [`iw2alto.xsl`](https://github.com/karkraeg/im2alto/blob/main/iw2alto.xsl) - Convert [ImageWare MyBib eL OCR](https://www.imageware.de/produkte/mybib-el-allgemein/) to ALTO [@karkraeg](https://github.com/karkraeg/im2alto)\n* [`transkribus-xslt`](https://gitlab.com/readcoop/transkribus/TranskribusCore/-/tree/master/src/main/resources/xslt) - Various stylesheets from Transkribus [@readcoop](https://gitlab.com/readcoop/)\n* [`transkribus-to-prima`](https://github.com/kba/transkribus-to-prima) - Convert Transkribus dialect to official PAGE XML format [@kba](https://github.com/kba/transkribus-to-prima)\n* [`textract2page`](https://github.com/slub/textract2page) - Convert [Amazon AWS Textract](https://docs.aws.amazon.com/textract/latest/dg/how-it-works-document-layout.html) to PAGE XML [@slub](https://github.com/slub/textract2page)\n* [`gcv2hocr`](https://github.com/dinosauria123/gcv2hocr) - Convert [Google Cloud Vision](https://cloud.google.com/vision/docs/) to hOCR 
[@dinosauria123](https://github.com/dinosauria123/gcv2hocr)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcneud%2Focr-conversion","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcneud%2Focr-conversion","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcneud%2Focr-conversion/lists"}