awesome-ocr
Links to awesome OCR projects
https://github.com/kba/awesome-ocr
-
Literature
-
Blog Posts and Tutorials
- A gentle introduction to OCR
- Tesseract Blends Old and New OCR Technology
- What You Always Wanted To Know About Tesseract
- Extracting text from an image using Ocropus
- Training an Ocropus OCR model
- Ocropus Wiki: Compute errors and confusions
- Ocropus Wiki: Working with Ground Truth
- OCRopus
- 10 Tips for making your OCR project succeed
- Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- Extracting Text from PDFs; Doing OCR; all within R
- R programming environment
- Tutorial: Command-line OCR on a Mac
- Practical Experience with OCRopus Model Training
- Homemade Manuscript OCR (1): OCRopy - by [@Jean-Baptiste-Camps](https://github.com/Jean-Baptiste-Camps)
- Optimizing Binarization for OCRopus
- Prototype demo for OCR postfix in Danish Newspapers
- How Can I OCR My Dictionary?
- "Needlessly complex" blog - how-tos (Python based), particularly:
- Page dewarping
- Compressing and enhancing hand-written notes
- Unprojecting text with ellipses
- (Open-Source-)OCR-Workflows - [OCR-D](https://github.com/OCR-D) project.
- Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR
-
Academic articles
- Adaptive degraded document image binarization
- High performance document layout analysis
- OCRopus Addons (Internship Report)
- Local Logistic Classifiers for Large Scale Learning
- High Performance OCR for Printed English and Fraktur using LSTM Networks - Hasan, Mayce Al Azawi, Shafait
- Can we build language-independent OCR using LSTM networks? - Hasan, Breuel
- Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks - Hasan, Ahmed, Rashid, Shafait, Breuel
- OCR of historical printings of Latin texts: Problems, Prospects, Progress.
- TypeWright: An Experiment in Participatory Curation
- Benchmarking of LSTM Networks
- Recognition of Historical Greek Polytonic Scripts Using LSTM - Hassan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki
- A Segmentation-Free Approach for Printed Devanagari Script Recognition - Hasan, Breuel
- A Sequence Learning Approach for Multiple Script Identification - Hasan, Afzal, Shafait, Liwicki, Breuel
- Important New Developments in Arabographic Optical Character Recognition (OCR)
- OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus
- Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents
- Generic Text Recognition using Long Short-Term Memory Networks - Hasan -- Ph.D Thesis
- OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters - Hasan, Bukhari
- Recursive Recurrent Nets with Attention Modeling for OCR in the Wild
- Telugu OCR Framework using Deep Learning
- TeluguOCR - ocr/issues/49)
- A Two-Stage Method for Text Line Detection in Historical Documents - Net
- Correcting Noisy OCR: Context beats Confusion
-
OCR-related publication and link lists
- IMPACT: Tools for text digitisation - List of tools and software projects for text digitisation, some related to OCR
- OCR-D - List of OCR-related academic articles in the context of the [OCR-D](http://www.ocr-d.de/) project (in German).
- Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
- eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
- Wikipedia: Comparison of optical character recognition software
- OCR and Deep Learning
- Ocropus Wiki: Publications
-
-
Software
-
OCR as a Service
- Konfuzio - Free online OCR for up to 2,000 pages per month, plus an OCR API, by @atraining; see https://youtu.be/NZKUrKyFVA8 (code is not open)
- ocr.space - Free Online OCR and OCR API by [@a9t9](https://github.com/A9T9) based on Tesseract (code is not open)
-
OCR engines
- gocr - OCR engine released under the GNU General Public License, with development led by Joerg Schulenburg.
- Ocrad - The GNU OCR. `GPL`
- RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
- tesseract - The definitive Open Source OCR engine `Apache 2.0`
- EasyOCR - OCR engine built on PyTorch by JaidedAI, `Apache 2.0`
- ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
- attention-ocr - OCR engine using visual attention mechanisms
- kraken - Ocropus fork with sane defaults
- ocular - Machine-learning OCR for historic documents
- simple-ocr-opencv - A simple pythonic OCR engine using opencv and numpy
- Calamari - OCR Engine based on OCRopy and Kraken
- doctr - A seamless & high-performing OCR library powered by Deep Learning
- ocropus - OCR engine based on LSTM, `Apache 2.0`
- SwiftOCR - fast and simple OCR library written in Swift
-
Older and possibly abandoned OCR engines
- Cuneiform - CuneiForm OCR was developed by Cognitive Technologies
- Eye - an experimental Java OCR (image-to-text) application
- kognition - An omnifont OCR software for KDE
- OCRchie - Modular Optical Character Recognition Software
- ocre - o.c.r. easy
- xplab - A GTK 2 tool for pattern matching
- hebOCR - Hebrew character recognition library (previously named hocr, see [Wikipedia article](https://de.wikipedia.org/wiki/HebOCR)) `GPL`
-
OCR file formats
- abby2hocr.xslt - XSLT script
- TEI SIG on Libraries - Best Practices for TEI in Libraries
- GDZ - METS/TEI-based GDZ document format
- PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
- omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
- hocr-spec - hOCR 1.2 specification
- hocr-parser - hOCR Specification Python Parser
- hOCRTools - hOCR to ALTO conversion XSLT
- ALTO XML Schema - XML Schema and development of the ALTO XML format
- ALTO XML Documentation - Documentation and use cases for ALTO
- alto-tools - Various tools to work with ALTO files, Python
- AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
- TEI-OCR - TEI customization for OCR generated layout and content information
- py-pagexml - Python library for handling PAGE XML and OPF files.
- ocr-conversion-scripts
- hocr-tools - Tools for doing various useful things with hOCR files, `Apache 2.0`
- ocr-transform - CLI tool to convert between hOCR and ALTO, `MIT`
-
OCR CLI
- OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
- Ocrocis - Project manager interface for Ocropy, see also [external project homepage](http://cistern.cis.lmu.de/ocrocis/)
- Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
- tesseract-recognize - Tesseract-based tool that outputs result in Page XML format ([docker image](https://hub.docker.com/r/mauvilsa/tesseract-recognize)).
-
OCR GUI
- VietOCR - A Java/.NET GUI frontend for Tesseract OCR engine, including [jTessBoxEditor](http://vietocr.sourceforge.net/training.html) a graphical Tesseract [box data](https://github.com/tesseract-ocr/tesseract/wiki/Make-Box-Files) editor
- OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
- moz-hocr-editor - Firefox Addon for editing hOCR files **Discontinued**
- gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
- qt-box-editor - QT4 editor of tesseract-ocr box files.
- ocr-gt-tools - Client-Server application for editing OCR ground truth.
- Paperwork - Using scanners and OCR to grep paper documents the easy way.
- Paperless - Scan, index, and archive all of your paper documents.
-
OCR Preprocessing
- NoiseRemove.java in MathOCR - Java implementation of "Adaptive degraded document image binarization" by B. Gatos, I. Pratikakis, S.J. Perantonis
- binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
- Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
- textcleaner - Processes a scanned document of text to clean the text background
- localcontrast - Fast O(1) local contrast optimization
-
OCR evaluation
- ISRI OCR Evaluation Tools - user guide: `user-guide.pdf`
- ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
-
OCR libraries by programming language
- leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
-
-
Datasets
-
Ground Truth
- CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo)
- CLTK - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0`
- DIVA-HisDB - 150 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> of three medieval manuscripts `CC-BY-NC 3.0`
- EEBO-TCP - 25,363 EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0`
- ECCO-TCP - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0`
- Evans-TCP - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/)
- FDHN - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms)
- GERMANA - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only`
- GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin `CC-BY 4.0`
- IMPACT-BHL - 2,418 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the Biodiversity Heritage Library, [XML@GitHub](https://github.com/impactcentre/groundtruth-bhl) `CC-BY 3.0`
- IMPACT-BL - 294 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0`
- IMPACT-KB - 142 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of the Netherlands `CC-BY 4.0`
- IMPACT-NKC - 187 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
- IMPACT-NLB - 19 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0`
- IMPACT-NUK - 209 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
- IMPACT-PSNC - 478 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0`
- MJSynth - 9m synthetic images covering 90k English words
- OCR19thSAC - 19,000 pages of Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0`
- OCR-D - 180 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> of German historical prints from [OCR-D](http://ocr-d.de/) `CC-BY-SA 4.0`
- PRImA-ENP - 528 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0`
- RODRIGO - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only`
- IMPACT-BNF - 151 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`