Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


awesome-ocr

Links to awesome OCR projects
https://github.com/kba/awesome-ocr


Literature
- Blog Posts and Tutorials
- Academic articles
- OCR-related publication and link lists
  - IMPACT: Tools for text digitisation - List of tools and software projects for text digitisation, some related to OCR
  - OCR-D - List of OCR-related academic articles in the context of the [OCR-D](http://www.ocr-d.de/) project (in German)
  - Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
  - eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
  - Wikipedia: Comparison of optical character recognition software
  - OCR and Deep Learning
  - Ocropus Wiki: Publications
Software
- OCR as a Service
  - Konfuzio - Free Online OCR for up to 2,000 pages per month and OCR API by @atraining, see https://youtu.be/NZKUrKyFVA8 (code is not open)
  - ocr.space - Free Online OCR and OCR API by [@a9t9](https://github.com/A9T9) based on Tesseract (code is not open)
- OCR engines
  - gocr - OCR engine released under the GNU General Public License; development led by Joerg Schulenburg.
  - Ocrad - The GNU OCR. `GPL`
  - RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
  - tesseract - The definitive Open Source OCR engine `Apache 2.0`
  - EasyOCR - OCR engine built on PyTorch by JaidedAI, `Apache 2.0`
  - ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
  - attention-ocr - OCR engine using visual attention mechanisms
  - kraken - Ocropus fork with sane defaults
  - ocular - Machine-learning OCR for historic documents
  - simple-ocr-opencv - A simple pythonic OCR engine using OpenCV and NumPy
  - Calamari - OCR Engine based on OCRopy and Kraken
  - doctr - A seamless & high-performing OCR library powered by Deep Learning
  - ocropus - OCR engine based on LSTM, `Apache 2.0`
  - SwiftOCR - fast and simple OCR library written in Swift
- Older and possibly abandoned OCR engines
  - Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
  - Eye - an experimental Java OCR (image-to-text) application
  - kognition - An omnifont OCR software for KDE
  - OCRchie - Modular Optical Character Recognition Software
  - ocre - o.c.r. easy
  - xplab - A GTK 2 tool for pattern matching
  - hebOCR - Hebrew character recognition library (previously named hocr, see [Wikipedia article](https://de.wikipedia.org/wiki/HebOCR)) `GPL`
- OCR file formats
  - abby2hocr.xslt XSLT script
  - TEI SIG on Libraries - Best Practices for TEI in Libraries
  - GDZ - METS/TEI-based GDZ document format
  - PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
  - omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
  - hocr-spec - hOCR 1.2 specification
  - hocr-parser - hOCR Specification Python Parser
  - hOCRTools - hOCR to ALTO conversion XSLT
  - ALTO XML Schema - XML Schema and development of the ALTO XML format
  - ALTO XML Documentation - Documentation and use cases for ALTO
  - alto-tools - Various tools to work with ALTO files, Python
  - AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
  - TEI-OCR - TEI customization for OCR generated layout and content information
  - py-pagexml - Python library for handling PAGE XML and OPF files.
  - ocr-conversion-scripts
  - hocr-tools - Tools for doing various useful things with hOCR files, `Apache 2.0`
  - ocr-transform - CLI tool to convert between hOCR and ALTO, `MIT`
- OCR CLI
  - OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
  - Ocrocis - Project manager interface for Ocropy, see also [external project homepage](http://cistern.cis.lmu.de/ocrocis/)
  - Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
  - tesseract-recognize - Tesseract-based tool that outputs result in Page XML format ([docker image](https://hub.docker.com/r/mauvilsa/tesseract-recognize)).
- OCR GUI
  - VietOCR - A Java/.NET GUI frontend for the Tesseract OCR engine, including [jTessBoxEditor](http://vietocr.sourceforge.net/training.html), a graphical Tesseract [box data](https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files) editor
  - OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
  - moz-hocr-editor - Firefox Addon for editing hOCR files **Discontinued**
  - gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
  - qt-box-editor - QT4 editor of tesseract-ocr box files.
  - ocr-gt-tools - Client-Server application for editing OCR ground truth.
  - Paperwork - Using scanners and OCR to grep paper documents the easy way.
  - Paperless - Scan, index, and archive all of your paper documents.
- OCR Preprocessing
  - NoiseRemove.java in MathOCR - Java implementation of "Adaptive degraded document image binarization" by B. Gatos, I. Pratikakis, and S.J. Perantonis
  - binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
  - Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
  - textcleaner - Processes a scanned document of text to clean the text background
  - localcontrast - Fast O(1) local contrast optimization
- OCR evaluation
  - ISRI OCR Evaluation Tools - see the user guide (ocr-evaluation-tools/blob/HEAD/user-guide.pdf)
  - ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
- OCR libraries by programming language
  - leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
Datasets
- Ground Truth
  - CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo)
  - CLTK - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0`
  - DIVA-HisDB - 150 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) of three medieval manuscripts `CC-BY-NC 3.0`
  - EEBO-TCP - 25,363 EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0`
  - ECCO-TCP - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0`
  - Evans-TCP - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/)
  - FDHN - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms)
  - GERMANA - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only`
  - GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin `CC-BY 4.0`
  - IMPACT-BHL - 2,418 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the Biodiversity Heritage Library, [XML@GitHub](https://github.com/impactcentre/groundtruth-bhl) `CC-BY 3.0`
  - IMPACT-BL - 294 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0`
  - IMPACT-KB - 142 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the National Library of the Netherlands `CC-BY 4.0`
  - IMPACT-NKC - 187 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
  - IMPACT-NLB - 19 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0`
  - IMPACT-NUK - 209 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
  - IMPACT-PSNC - 478 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0`
  - MJSynth - 9 million synthetic images covering 90k English words
  - OCR19thSAC - 19,000 pages of Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0`
  - OCR-D - 180 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) of German historical prints from [OCR-D](http://ocr-d.de/) `CC-BY-SA 4.0`
  - PRImA-ENP - 528 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) of historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0`
  - RODRIGO - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only`
  - IMPACT-BNF - 151 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`

Programming Languages

Python (12), C++ (6), HTML (2), Swift (1), Jupyter Notebook (1), JavaScript (1), PHP (1), XSLT (1), C (1), Java (1)
