awesome-ocr
Links to awesome OCR projects
https://github.com/kba/awesome-ocr
Datasets
Ground Truth
- CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo)
- CLTK - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0`
- DIVA-HisDB - 150 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> of three medieval manuscripts `CC-BY-NC 3.0`
- EEBO-TCP - 25,363 EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0`
- ECCO-TCP - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0`
- Evans-TCP - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/)
- FDHN - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms)
- GERMANA - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only`
- GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin `CC-BY 4.0`
- IMPACT-BHL - 2,418 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the Biodiversity Heritage Library, [XML@GitHub](https://github.com/impactcentre/groundtruth-bhl) `CC-BY 3.0`
- IMPACT-BNF - 151 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
- IMPACT-KB - 142 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of the Netherlands `CC-BY 4.0`
- IMPACT-PSNC - 478 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0`
- MJSynth - 9 million synthetic images covering 90k English words
- OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0`
- OCR-D - 180 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> of German historical prints from [OCR-D](http://ocr-d.de/) `CC-BY-SA 4.0`
- PRImA-ENP - 528 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0`
- RODRIGO - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only`
- archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via [archiscribe](https://archiscribe.jbaiter.de/) `CC-BY 4.0`
- Rescribe - Transcriptions of Caroline Minuscule Manuscripts `PDM 1.0`
- EarlyPrintedBooks - ~8,800 lines from several early printed books `CC-BY-NC-SA 4.0`
- eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by [eMOP](http://emop.tamu.edu/) `PDM 1.0`
- FROC-MSS - 4 Old French Medieval Manuscripts `CC-BY 4.0`
- imagessan - Sanskrit images & ground truth (Devanagari script)
- LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
- OCR_GS_Data - Double-checked Arabic Gold Standard from [OpenITI](https://github.com/OpenITI)
- old-books - 322 old books from [Project Gutenberg](https://www.gutenberg.org/) `GPL 3.0`
- Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch
- IMPACT-BL - 294 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0`
- IMPACT-BNE - 215 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of Spain, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required, [XML@GitHub](https://github.com/impactcentre/groundtruth-spa) `CC-BY-NC-SA 4.0`
- IMPACT-NKC - 187 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
- IMPACT-NLB - 19 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0`
- IMPACT-NUK - 209 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
Literature
Academic articles
- High performance document layout analysis
- Adaptive degraded document image binarization
- OCRopus Addons (Internship Report)
- Local Logistic Classifiers for Large Scale Learning
- High Performance OCR for Printed English and Fraktur using LSTM Networks - Hasan, Mayce Al Azawi, Shafait
- Can we build language-independent OCR using LSTM networks? - Hasan, Breuel
- Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks - Hasan, Ahmed, Rashid, Shafait, Breuel
- OCR of historical printings of Latin texts: Problems, Prospects, Progress.
- TypeWright: An Experiment in Participatory Curation
- Benchmarking of LSTM Networks
- Recognition of Historical Greek Polytonic Scripts Using LSTM - Hassan, Papavassiliou, Basilis Gatos, Katsouros, Liwicki
- A Segmentation-Free Approach for Printed Devanagari Script Recognition - Hasan, Breuel
- A Sequence Learning Approach for Multiple Script Identification - Hasan, Afzal, Shafait, Liwicki, Breuel
- Important New Developments in Arabographic Optical Character Recognition (OCR)
- OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus
- Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents
- Generic Text Recognition using Long Short-Term Memory Networks - Hasan (Ph.D. thesis)
- OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters - Hasan, Bukhari
- Recursive Recurrent Nets with Attention Modeling for OCR in the Wild
- Telugu OCR Framework using Deep Learning
- TeluguOCR
- A Two-Stage Method for Text Line Detection in Historical Documents
- Correcting Noisy OCR: Context beats Confusion
- OpenArabic/OCR_GS_Data
Blog Posts and Tutorials
- Tesseract Blends Old and New OCR Technology
- What You Always Wanted To Know About Tesseract
- Extracting text from an image using Ocropus
- Training an Ocropus OCR model
- Ocropus Wiki: Compute errors and confusions
- Ocropus Wiki: Working with Ground Truth
- OCRopus
- 10 Tips for making your OCR project succeed
- Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- Extracting Text from PDFs; Doing OCR; all within R
- R programming environment
- Tutorial: Command-line OCR on a Mac
- Practical Experience with OCRopus Model Training
- Homemade Manuscript OCR (1): OCRopy - [Jean-Baptiste-Camps](https://github.com/Jean-Baptiste-Camps)
- Optimizing Binarization for OCRopus
- Prototype demo for OCR postfix in Danish Newspapers
- How Can I OCR My Dictionary?
- "Needlessly complex" blog - how-tos (Python based), particularly:
- Page dewarping
- Compressing and enhancing hand-written notes
- Unprojecting text with ellipses
- (Open-Source-)OCR-Workflows - [OCR-D](https://github.com/OCR-D) project.
- A gentle introduction to OCR
- Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR
OCR-related publication and link lists
- OCR-D - List of OCR-related academic articles in the context of the [OCR-D](http://www.ocr-d.de/) project. :de:
- Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
- eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
- Wikipedia: Comparison of optical character recognition software
- OCR and Deep Learning
- Ocropus Wiki: Publications
- IMPACT: Tools for text digitisation - List of software tools and projects, some related to OCR
OCR Showcases
- abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
- cvOCR - An OCR system for recognizing resume or cv text, implemented in Python and C and based on tesseract
- MathOCR - A printed scientific document recognition system, **pre-alpha**
Software
OCR as a Service
- ocr.space - Free Online OCR and OCR API by [@a9t9](https://github.com/A9T9) based on Tesseract (code is not open)