Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-ocr

Links to awesome OCR projects
https://github.com/kba/awesome-ocr

tesseract - The definitive Open Source OCR engine `Apache 2.0`
EasyOCR - OCR engine built on PyTorch by JaidedAI, `Apache 2.0`
ocropus - OCR engine based on LSTM, `Apache 2.0`
ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
kraken - Ocropus fork with sane defaults
gocr - OCR engine under the GNU Public License led by Joerg Schulenburg.
Ocrad - The GNU OCR. `GPL`
ocular - Machine-learning OCR for historic documents
SwiftOCR - fast and simple OCR library written in Swift
attention-ocr - OCR engine using visual attention mechanisms
RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
simple-ocr-opencv - A simple pythonic OCR engine using opencv and numpy
Calamari - OCR Engine based on OCRopy and Kraken
doctr - A seamless & high-performing OCR library powered by Deep Learning
Clara OCR - Open source OCR in C `GPL`
Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
Eye - an experimental Java OCR (image-to-text) application
kognition - An omnifont OCR software for KDE
OCRchie - Modular Optical Character Recognition Software
ocre - o.c.r. easy
xplab - A GTK 2 tool for pattern matching
hebOCR - Hebrew character recognition library (previously named hocr, see [Wikipedia article](https://de.wikipedia.org/wiki/HebOCR)) `GPL`
abby2hocr.xslt XSLT script
ocr-conversion-scripts
hocr-tools - Tools for doing various useful things with hOCR files, `Apache 2.0`
hocr-spec - hOCR 1.2 specification
ocr-transform - CLI tool to convert between hOCR and ALTO, `MIT`
hocr-parser - hOCR Specification Python Parser
hOCRTools - hOCR to ALTO conversion XSLT
ALTO XML Schema - XML Schema and development of the ALTO XML format
ALTO XML Documentation - Documentation and use cases for ALTO
alto-tools - Various tools to work with ALTO files, Python
AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
TEI-OCR - TEI customization for OCR generated layout and content information
TEI SIG on Libraries - Best Practices for TEI in Libraries
GDZ - METS/TEI-based GDZ document format
PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
py-pagexml - Python library for handling PAGE XML and OPF files.
OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
Ocrocis - Project manager interface for Ocropy, see also [external project homepage](http://cistern.cis.lmu.de/ocrocis/)
tesseract-recognize - Tesseract-based tool that outputs result in Page XML format ([docker image](https://hub.docker.com/r/mauvilsa/tesseract-recognize)).
moz-hocr-editor - Firefox Addon for editing hOCR files **Discontinued**
qt-box-editor - QT4 editor of tesseract-ocr box files.
ocr-gt-tools - Client-Server application for editing OCR ground truth.
Paperwork - Using scanners and OCR to grep paper documents the easy way.
Paperless - Scan, index, and archive all of your paper documents.
gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
VietOCR - A Java/.NET GUI frontend for Tesseract OCR engine, including [jTessBoxEditor](http://vietocr.sourceforge.net/training.html) a graphical Tesseract [box data](https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files) editor
PoCoTo - Fast interactive batch corrections of complete OCR error series in OCR'ed historical documents.
OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
PRImA PAGE Viewer - Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR.
LAREX - A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
archiscribe - Web application for transcribing OCR ground truth from Archive.org. Deployed instance available at https://archiscribe.jbaiter.de/, results are available in [@jbaiter/archiscribe-corpus](https://github.com/jbaiter/archiscribe-corpus).
nw-page-editor - Simple app for visual editing of Page XML files. Provides desktop and [server docker-based](https://hub.docker.com/r/mauvilsa/nw-page-editor-web) versions.
NoiseRemove.java in MathOCR - Java implementation of Adaptive degraded document image binarization by B. Gatos, I. Pratikakis, S.J. Perantonis
binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
typeface-corpus - A repository for typefaces to train Tesseract and OCRopus for natural history collections and digital humanities.
binarizewolfjolion - Comparison of binarization algorithms. [Blog post](http://zp-j.github.io/2013/10/04/document-binarization/)
`crop_morphology.py` in oldnyc - Cropping a page to just the text block
Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
textcleaner - Processes a scanned document of text to clean the text background
localcontrast - Fast O(1) local contrast optimization
Open OCR - Run Tesseract in Docker containers
tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
docker-ocropy - A Docker container for running the [ocropy OCR system](https://github.com/tmbdev/ocropy).
ABBYY Cloud OCR SDK Code samples - Code samples for using the proprietary commercial ABBYY OCR API.
nidaba - An expandable and scalable OCR pipeline
gamera - A meta-framework for building document processing applications, e.g. OCR
ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
ocrad-docker - Run the [ocrad](http://www.gnu.org/software/ocrad/) OCR engine in a docker container
kraken-docker - Run the [kraken](https://github.com/mittagessen/kraken) OCR engine in a docker container
Konfuzio - Free Online OCR up to 2,000 pages per month and OCR API by @atraining, see https://youtu.be/NZKUrKyFVA8 (code is not open)
ocr.space - Free Online OCR and OCR API by [@a9t9](https://github.com/A9T9) based on Tesseract (code is not open)
OCR4all - Provides OCR services through web applications. Included Projects: [LAREX](https://github.com/chreul/LAREX), [OCRopus](https://github.com/tmbdev/ocropy), [calamari](https://github.com/ChWick/calamari) and [nashi](https://github.com/andbue/nashi).
ISRI OCR Evaluation Tools - user guide (…ocr-evaluation-tools/blob/HEAD/user-guide.pdf)
isri-ocr-evaluation-tools - further development by [@eddieantonio](https://github.com/eddieantonio) (2015, 2016)
ancientgreekocr-evaluation-tools - further development by [@nickjwhite](https://github.com/nickjwhite) (2013, 2014)
ocrevalUAtion - Cross-format evaluation, CLI and GUI
ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
quack - Quality-Assurance-tool for scans with corresponding ALTO-files
tesseract-ocr - A Crystal wrapper for tesseract-ocr.
tesseract_ocr - Elixir library wrapping the tesseract executable.
gosseract - Golang OCR library, wrapping Tesseract-ocr.
Tess4J - Java Native Access bindings to Tesseract.
tess-two - Tools for compiling Tesseract on Android and Java API.
tesseract for .net - A .Net wrapper for tesseract-ocr.
TTesseractOCR4 - Object Pascal binding for tesseract-ocr 4.x.
Tesseract OCR for PHP - Tesseract PHP bindings.
pytesseract - A Python wrapper for Google Tesseract.
pyocr - A Python wrapper for Tesseract and Cuneiform.
ocrodjvu - A library and standalone tool for doing OCR on DjVu documents, wrapping Cuneiform, gocr, ocrad, ocropus and tesseract
tesserocr - A Python wrapper for the tesseract-ocr API
ocracy - pure javascript lstm rnn implementation based on ocropus
gocr.js - Javascript port (emscripten) of gocr
ocrad.js - Javascript port (emscripten) of ocrad
tesseract.js - Javascript port (emscripten) of Tesseract
node-tesseract-ocr - A simple wrapper for the Tesseract OCR package.
node-tesseract-native - C++ module for node providing OCR with tesseract and leptonica.
rtesseract - Ruby library wrapping the tesseract and imagemagick executables.
ruby-tesseract - Native Tesseract bindings for Ruby MRI and JRuby
ocr_space - API wrapper for free ocr service ocr.space. Includes CLI
tesseract.rs - Rust bindings for tesseract OCR.
leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
Tesseract OCR iOS - Swift and Objective-C wrapper for Tesseract OCR.
SwiftOCR - Fast and simple OCR library written in Swift. Optimized for recognizing short, one line long alphanumeric codes.
glyph-miner - A system for extracting glyphs from early typeset prints
ocrodeg - Document image degradation for OCR data augmentation
archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via [archiscribe](https://archiscribe.jbaiter.de/) `CC-BY 4.0`
CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo)
Rescribe - Transcriptions of Caroline Minuscule Manuscripts `PDM 1.0`
CLTK - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0`
DIVA-HisDB - 150 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of three medieval manuscripts `CC-BY-NC 3.0`
EarlyPrintedBooks - ~8,800 lines from several early printed books `CC-BY-NC-SA 4.0`
EEBO-TCP - 25,363 EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0`
ECCO-TCP - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0`
eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by [eMOP](http://emop.tamu.edu/) `PDM 1.0`
Evans-TCP - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/)
FDHN - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms)
FROC-MSS - 4 Old French Medieval Manuscripts `CC-BY 4.0`
GERMANA - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only`
GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin `CC-BY 4.0`
imagessan - Sanskrit images & ground truth (Devanagari script)
IMPACT-BHL - 2,418 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the Biodiversity Heritage Library, [XML@GitHub](https://github.com/impactcentre/groundtruth-bhl) `CC-BY 3.0`
IMPACT-BL - 294 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0`
IMPACT-BNE - 215 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Spain, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required, [XML@GitHub](https://github.com/impactcentre/groundtruth-spa) `CC-BY-NC-SA 4.0`
IMPACT-BNF - 151 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
IMPACT-KB - 142 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of the Netherlands `CC-BY 4.0`
IMPACT-NKC - 187 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
IMPACT-NLB - 19 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0`
IMPACT-NUK - 209 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
IMPACT-PSNC - 478 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0`
LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
MJSynth - 9m synthetic images covering 90k English words
OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0`
OCR-D - 180 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of German historical prints from [OCR-D](http://ocr-d.de/) `CC-BY-SA 4.0`
OCR_GS_Data - Double-checked Arabic Gold Standard from [OpenITI](https://github.com/OpenITI)
old-books - 322 old books from [Project Gutenberg](https://www.gutenberg.org/) `GPL 3.0`
PRImA-ENP - 528 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0`
RODRIGO - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only`
Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch
IMPACT: Tools for text digitisation - List of text digitisation tools and software projects, some related to OCR
OCR-D - List of OCR-related academic articles in the context of the [OCR-D](http://www.ocr-d.de/) project (in German).
Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
Wikipedia: Comparison of optical character recognition software
OCR and Deep Learning
Ocropus Wiki: Publications
Tesseract Blends Old and New OCR Technology
What You Always Wanted To Know About Tesseract
Extracting text from an image using Ocropus
Training an Ocropus OCR model
Ocropus Wiki: Compute errors and confusions
Ocropus Wiki: Working with Ground Truth
OCRopus
10 Tips for making your OCR project succeed
Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
Extracting Text from PDFs; Doing OCR; all within R
R programming environment
Tutorial: Command-line OCR on a Mac
Practical Experience with OCRopus Model Training
Homemade Manuscript OCR (1): OCRopy - by [@Jean-Baptiste-Camps](https://github.com/Jean-Baptiste-Camps)
Optimizing Binarization for OCRopus
Prototype demo for OCR postfix in Danish Newspapers
How Can I OCR My Dictionary?
"Needlessly complex" blog - how-tos (Python based), particularly:
Page dewarping
Compressing and enhancing hand-written notes
Unprojecting text with ellipses
(Open-Source-)OCR-Workflows - [OCR-D](https://github.com/OCR-D) project.
A gentle introduction to OCR
Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR
abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
MathOCR - A printed scientific document recognition system, **pre-alpha**
High performance document layout analysis
Adaptive degraded document image binarization
Internship Report
OCRopus Addons (Internship Report)
Local Logistic Classifiers for Large Scale Learning
High Performance OCR for Printed English and Fraktur using LSTM Networks - Hasan, Mayce Al Azawi, Shafait
Can we build language-independent OCR using LSTM networks? - Hasan, Breuel
Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks - Hasan, Ahmed, Rashid, Shafait, Breuel
OCR of historical printings of Latin texts: Problems, Prospects, Progress.
Correcting Noisy OCR: Context beats Confusion
TypeWright: An Experiment in Participatory Curation
Benchmarking of LSTM Networks
Recognition of Historical Greek Polytonic Scripts Using LSTM - Hassan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki
A Segmentation-Free Approach for Printed Devanagari Script Recognition - Hasan, Breuel
A Sequence Learning Approach for Multiple Script Identification - Hasan, Afzal, Shafait, Liwicki, Breuel
Important New Developments in Arabographic Optical Character Recognition (OCR)
OpenArabic/OCR_GS_Data
OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus
Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents
Generic Text Recognition using Long Short-Term Memory Networks - Hasan -- Ph.D Thesis
OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters - Hasan, Bukhari
Recursive Recurrent Nets with Attention Modeling for OCR in the Wild
Telugu OCR Framework using Deep Learning
TeluguOCR - see issue 49 (…ocr/issues/49)
A Two-Stage Method for Text Line Detection in Historical Documents - ARU-Net

Programming Languages

Python (18), C++ (11), Java (8), HTML (8), JavaScript (7), C (6), Ruby (3), Go (3), PHP (2), Shell (2)

Keywords

ocr (23), tesseract (12), tesseract-ocr (9), optical-character-recognition (8), hocr (5), machine-learning (4), python (4), alto-xml (4), ruby (3), alto (3), page-xml (3), deep-learning (3), docker-image (3), pagexml (3), text-recognition (3), image-to-text (2), scanner (2), lstm (2), pdf (2), dataset (2), pytorch (2), gtk (2), text-detection (2), ground-truth (2), document-recognition (2), cnn (2), ocr-d (1), finereader (1), transformation (1), tei-xml (1), abbyy-xml (1), text-detection-recognition (1), tensorflow2 (1), validation (1), alto-xml-schema (1), schema (1), digital-library (1), annotation-processing (1), document-representation (1), docker (1), pdftk (1), cli (1), transcription (1), ocr-engine (1), crnn (1), data-mining (1), easyocr (1), image-processing (1), information-retrieval (1), scene-text (1)