Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-ocr
https://github.com/techiewonk/awesome-ocr
-
1. <a name='Software'></a>Software
-
1.1. <a name='OCRengines'></a>OCR engines
- tesseract - The definitive Open Source OCR engine `Apache 2.0`
- EasyOCR - OCR engine built on PyTorch by JaidedAI, `Apache 2.0`
- ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
- kraken - Ocropus fork with sane defaults
- ocular - Machine-learning OCR for historic documents
- attention-ocr - OCR engine using visual attention mechanisms
- simple-ocr-opencv - A simple pythonic OCR engine using opencv and numpy
- Calamari - OCR Engine based on OCRopy and Kraken
- doctr - A seamless & high-performing OCR library powered by Deep Learning
- gocr - OCR engine under the GNU Public License led by Joerg Schulenburg.
- Ocrad - The GNU OCR. `GPL`
- RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
- ocropus - OCR engine based on LSTM, `Apache 2.0`
- SwiftOCR - fast and simple OCR library written in Swift
-
1.2. <a name='OlderandpossiblyabandonedOCRengines'></a>Older and possibly abandoned OCR engines
- hebOCR - Hebrew character recognition library (previously named hocr, see [Wikipedia article](https://de.wikipedia.org/wiki/HebOCR)) `GPL`
- Eye - an experimental Java OCR (image-to-text) application
- kognition - An omnifont OCR software for KDE
- OCRchie - Modular Optical Character Recognition Software
- ocre - o.c.r. easy
- xplab - A GTK 2 tool for pattern matching
- Clara OCR - Open source OCR in C `GPL`
-
1.3. <a name='OCRfileformats'></a>OCR file formats
- ocr-conversion-scripts
- hocr-tools - Tools for doing various useful things with hOCR files, `Apache 2.0`
- hocr-spec - hOCR 1.2 specification
- ocr-transform - CLI tool to convert between hOCR and ALTO, `MIT`
- hocr-parser - hOCR Specification Python Parser
- hOCRTools - hOCR to ALTO conversion XSLT
- ALTO XML Schema - XML Schema and development of the ALTO XML format
- ALTO XML Documentation - Documentation and use cases for ALTO
- alto-tools - Various tools to work with ALTO files, Python
- AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
- TEI-OCR - TEI customization for OCR generated layout and content information
- py-pagexml - Python library for handling PAGE XML and OPF files.
- abby2hocr.xslt XSLT script
- TEI SIG on Libraries - Best Practices for TEI in Libraries
- GDZ - METS/TEI-based GDZ document format
- PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
- omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
-
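Several of the tools above read or write hOCR, which stores layout information in HTML attributes. As a minimal sketch, assuming word elements are `span` tags with class `ocrx_word` and a `bbox x0 y0 x1 y1` property in their `title` attribute (the shape Tesseract typically emits), word boxes can be pulled out with the Python standard library alone. The names `HocrWordParser`, `hocr_words` and `SAMPLE` are illustrative and do not come from any of the listed projects.

```python
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs for span elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []      # finished (text, (x0, y0, x1, y1)) tuples
        self._bbox = None    # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" not in (attrs.get("class") or "").split():
            return
        # title holds ';'-separated properties, e.g. "bbox 36 92 96 116; x_wconf 95"
        for prop in (attrs.get("title") or "").split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                self._bbox = tuple(int(p) for p in parts[1:])
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word span contains no nested span elements.
        if self._bbox is not None and tag == "span":
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) for every word in an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


# Hypothetical two-word page used only for illustration.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 200 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 200 116; x_wconf 91'>world</span>
  </span>
</div>
"""

print(hocr_words(SAMPLE))
```

For real-world files, a dedicated library such as the hocr-tools or hocr-parser entries above is likely the more robust choice.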
1.4. <a name='OCRCLI'></a>OCR CLI
- OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
- Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
- tesseract-recognize - Tesseract-based tool that outputs result in Page XML format ([docker image](https://hub.docker.com/r/mauvilsa/tesseract-recognize)).
- Ocrocis - Project manager interface for Ocropy, see also [external project homepage](http://cistern.cis.lmu.de/ocrocis/)
-
-
2. <a name='DeskewingandDewarping'></a>Deskewing and Dewarping
-
- MORAN_v2 - A Multi-Object Rectified Attention Network for Scene Text Recognition
- unproject_text - Perspective recovery of text using transformed ellipses
- unpaper - a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies.
- deskew - Library used to deskew a scanned document
- deskewing - Contains code to deskew images using MLPs, LSTMs and LLS transformations
- skew_correction - De-skewing images with slanted content by finding the deviation using Canny Edge Detection.
- page_dewarp - Page dewarping and thresholding using a "cubic sheet" model
- text_deskewing - Rotate text images if they are not straight for better text detection and recognition.
- galfar/deskew - Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.
- xellows1305/Document-Image-Dewarping - No code :(
- Alyn
- DewarpNet
- Docuwarp
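Some of the deskewing tools above describe detecting text lines with a Hough transform and rotating the image until those lines are horizontal. The sketch below illustrates the voting idea only, under simplifying assumptions: the input is a list of foreground pixel coordinates, only near-horizontal angles are searched, and the accumulator is scored by the sum of squared bin counts. It is not the implementation of any listed project, and `estimate_skew` and its parameters are hypothetical names.

```python
import math
from collections import Counter


def estimate_skew(points, max_angle=10.0, step=0.5):
    """Estimate the dominant text-line angle in degrees from (x, y) foreground pixels."""
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for s in range(steps + 1):
        angle = -max_angle + s * step
        theta = math.radians(angle)
        sin_t, cos_t = math.sin(theta), math.cos(theta)
        # Each pixel votes for the line at this angle that passes through it,
        # identified by its rounded distance rho from the origin.
        votes = Counter(round(y * cos_t - x * sin_t) for x, y in points)
        # Aligned text lines pile their votes into few rho bins, so the sum of
        # squared bin counts is largest near the true skew angle.
        score = sum(c * c for c in votes.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic "page": three text lines drawn with a 3 degree slope (image coordinates).
slope = math.tan(math.radians(3.0))
points = [(x, round(y0 + x * slope)) for y0 in (20, 40, 60) for x in range(200)]
print(estimate_skew(points))
```

Rotating the image by the negative of the returned angle would then level the lines; the sign convention depends on the image library used.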
-
2.1. <a name='OCRGUI'></a>OCR GUI
- moz-hocr-editor - Firefox Addon for editing hOCR files **Discontinued**
- qt-box-editor - QT4 editor of tesseract-ocr box files.
- ocr-gt-tools - Client-Server application for editing OCR ground truth.
- Paperwork - Using scanners and OCR to grep paper documents the easy way.
- gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
- OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
- Paperless - Scan, index, and archive all of your paper documents.
-
-
22. <a name='Literature'></a>Literature
-
22.2. <a name='BlogPostsandTutorials'></a>Blog Posts and Tutorials
- A gentle introduction to OCR
- What You Always Wanted To Know About Tesseract
- OCRopus
- 10 Tips for making your OCR project succeed
- Overview of LEADTOOLS Image Cleanup and Pre-processing SDK Technology
- Extracting Text from PDFs; Doing OCR; all within R
- R programming environment
- Tutorial: Command-line OCR on a Mac
- Practical Experience with OCRopus Model Training
- Optimizing Binarization for OCRopus
- Prototype demo for OCR postfix in Danish Newspapers
- How Can I OCR My Dictionary?
- "Needlessly complex" blog (Python based), particularly:
- Page dewarping
- Compressing and enhancing hand-written notes
- Unprojecting text with ellipses
- (Open-Source-)OCR-Workflows - [OCR-D](https://github.com/OCR-D) project.
- What can I rely on? Working with digitized sources, part 1: OCR (German original: "Worauf kann ich mich verlassen? Arbeiten mit digitalisierten Quellen, Teil 1: OCR")
-
22.4. <a name='Academicarticles'></a>Academic articles
- Adaptive degraded document image binarization
- High performance document layout analysis
- OCRopus Addons (Internship Report)
- High Performance OCR for Printed English and Fraktur using LSTM Networks - Hasan, Mayce Al Azawi, Shafait
- Can we build language-independent OCR using LSTM networks? - Hasan, Breuel
- Offline Printed Urdu Nastaleeq Script Recognition with Bidirectional LSTM Networks - Hasan, Ahmed, Rashid, Shafait, Breuel
- OCR of historical printings of Latin texts: Problems, Prospects, Progress.
- TypeWright: An Experiment in Participatory Curation
- Benchmarking of LSTM Networks
- A Segmentation-Free Approach for Printed Devanagari Script Recognition - Hasan, Breuel
- A Sequence Learning Approach for Multiple Script Identification - Hasan, Afzal, Shafait, Liwicki, Breuel
- Important New Developments in Arabographic Optical Character Recognition (OCR)
- OCR of historical printings with an application to building diachronic corpora: A case study using the RIDGES herbal corpus
- Automatic quality evaluation and (semi-) automatic improvement of mixed models for OCR on historical documents
- OCRoRACT: A Sequence Learning OCR System Trained on Isolated Characters - Hasan, Bukhari
- Recursive Recurrent Nets with Attention Modeling for OCR in the Wild
- paper:2016
- Telugu OCR Framework using Deep Learning
- TeluguOCR
- A Two-Stage Method for Text Line Detection in Historical Documents
- paper:2018
- paper:2019
- paper:2020
- Correcting Noisy OCR: Context beats Confusion
-
22.1. <a name='OCR-relatedpublicationandlinklists'></a>OCR-related publication and link lists
- IMPACT: Tools for text digitisation - List of tools and related software projects, some related to OCR
- Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
- Wikipedia: Comparison of optical character recognition software
-
-
3. <a name='Textdetectionandlocalization'></a>Text detection and localization
-
- ocr_attention - Robust Scene Text Recognition with Automatic Rectification.
-
3.1. <a name='OCRPreprocessing'></a>OCR Preprocessing
- NoiseRemove.java in MathOCR - Java implementation of Adaptive degraded document image binarization by B. Gatos , I. Pratikakis, S.J. Perantonis
- binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
- Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
- textcleaner - Processes a scanned document of text to clean the text background
- localcontrast - Fast O(1) local contrast optimization
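Two entries above mention Sauvola-based binarization. As a rough illustration of the technique itself (not of the ZBar or MathOCR code), Sauvola's method computes a per-pixel threshold T = m * (1 + k * (s / R - 1)) from the local mean m and local standard deviation s. The sketch below is an unoptimized pure-Python version; the function name, the 3x3 window, and the values k = 0.5 and R = 128 are assumptions chosen for the example.

```python
import math


def sauvola_binarize(gray, window=3, k=0.5, r=128.0):
    """Return a 0/1 image where 1 marks pixels at or below the local Sauvola threshold."""
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Gather the window around (x, y), clipped at the image border.
            values = [
                gray[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            threshold = mean * (1.0 + k * (std / r - 1.0))
            out[y][x] = 1 if gray[y][x] <= threshold else 0
    return out


# Hypothetical 5x5 grayscale patch: light paper (200) with one dark ink pixel (20).
patch = [[200] * 5 for _ in range(5)]
patch[2][2] = 20
for row in sauvola_binarize(patch):
    print(row)
```

Production implementations usually compute the local mean and deviation with integral images so that the cost per pixel does not grow with the window size.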
-
-
4. <a name='Segmentation'></a>Segmentation
-
4.4. <a name='DocumentSegmentation'></a>Document Segmentation
-
-
6. <a name='Tabledetection'></a>Table detection
-
4.5. <a name='FormSegmentation'></a>Form Segmentation
-
-
7. <a name='Languagedetection'></a>Language detection
-
-
7.1. <a name='OCRasaService'></a>OCR as a Service
-
7.2. <a name='OCRevaluation'></a>OCR evaluation
- ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
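To give a feel for n-gram based evaluation, the following sketch compares the character n-grams of an OCR result with those of a ground-truth transcription and reports precision, recall and F1. How ngram-ocr-eval itself scores output is not described here, so this is only one plausible approach; `char_ngrams` and `ngram_scores` are hypothetical helper names.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count all overlapping character n-grams in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_scores(ocr_text, truth_text, n=3):
    """Precision, recall and F1 of OCR n-grams measured against ground-truth n-grams."""
    ocr, truth = char_ngrams(ocr_text, n), char_ngrams(truth_text, n)
    matched = sum((ocr & truth).values())  # multiset intersection
    precision = matched / sum(ocr.values()) if ocr else 0.0
    recall = matched / sum(truth.values()) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# One misread character ("l" recognized as "1") breaks one of the five trigrams.
print([round(s, 2) for s in ngram_scores("optica1", "optical")])
```

Unlike edit-distance metrics, this needs no alignment between the two texts, which keeps it simple at the cost of ignoring where in the page an error occurred.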
-
7.3. <a name='OCRlibrariesbyprogramminglanguage'></a>OCR libraries by programming language
- leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
-
-
8. <a name='Datasets'></a>Datasets
-
8.1. <a name='GroundTruth'></a>Ground Truth
- CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo)
- CLTK - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0`
- DIVA-HisDB - 150 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> of three medieval manuscripts `CC-BY-NC 3.0`
- EEBO-TCP - 25,363 EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0`
- ECCO-TCP - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0`
- Evans-TCP - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/)
- FDHN - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms)
- GERMANA - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only`
- IMPACT-BL - 294 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0`
- IMPACT-NKC - 187 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
- IMPACT-NLB - 19 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0`
- IMPACT-NUK - 209 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
- IMPACT-PSNC - 478 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0`
- OCR19thSAC - 19,000 pages Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0`
- PRImA-ENP - 528 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0`
- RODRIGO - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only`
- IMPACT-BNF - 151 pages<sup>[PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)</sup> from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
-
-
9. <a name='VideoTextSpotting'></a>Video Text Spotting
-
-
-
11. <a name='OpticalCharacterRecognitionEnginesandFrameworks'></a>Optical Character Recognition Engines and Frameworks
-
- OCR-D
- Deeplearning-OCR
- FastText - Library for efficient text classification and representation learning
- Ocrad
-
-
12. <a name='Awesomelists'></a>Awesome lists
-
-
-
13. <a name='ProprietaryOCREngines'></a>Proprietary OCR Engines
-
14. <a name='CloudbasedOCREnginesSaaS'></a>Cloud based OCR Engines (SaaS)
-
-
-
15. <a name='Fileformatsandtools'></a>File formats and tools
-
17. <a name='DataaugmentationandSyntheticdatageneration'></a>Data augmentation and Synthetic data generation
-
- DocCreator - DIAR software for synthetic document image and groundtruth generation, with various degradation models for data augmentation.
- SynthText_Chinese_version
-
-
19. <a name='PostOCRCorrection'></a>Post OCR Correction
-
-
-
21. <a name='misc'></a>misc
-
- cosc428-structor - ~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.
-
Categories
- 22. Literature: 105
- 1. Software: 46
- 2. Deskewing and Dewarping: 20
- 8. Datasets: 17
- 14. Cloud based OCR Engines (SaaS): 7
- 3. Text detection and localization: 6
- 7. Language detection: 5
- 13. Proprietary OCR Engines: 4
- 11. Optical Character Recognition Engines and Frameworks: 4
- 15. File formats and tools: 2
- 17. Data augmentation and Synthetic data generation: 2
- 4. Segmentation: 2
- 19. Post OCR Correction: 1
- 12. Awesome lists: 1
- 21. misc: 1
- 6. Table detection: 1
- 9. Video Text Spotting: 1
Sub Categories
- 22.4. Academic articles: 55
- 22.2. Blog Posts and Tutorials: 47
- 8.1. Ground Truth: 40
- 1.3. OCR file formats: 20
- 1.4. OCR CLI: 18
- 1.1. OCR engines: 14
- 2.1. OCR GUI: 8
- 1.2. Older and possibly abandoned OCR engines: 7
- 3.1. OCR Preprocessing: 5
- 22.1. OCR-related publication and link lists: 3
- 4.4. Document Segmentation: 2
- 4.5. Form Segmentation: 2
- 7.1. OCR as a Service: 2
- 7.2. OCR evaluation: 1
- 7.3. OCR libraries by programming language: 1
Keywords
- ocr: 19
- optical-character-recognition: 7
- python: 6
- hocr: 5
- tesseract: 5
- machine-learning: 4
- pdf: 4
- alto-xml: 4
- image-processing: 4
- page-xml: 3
- ocr-recognition: 2
- document-recognition: 2
- text-detection: 2
- alto: 2
- docker-image: 2
- pagexml: 2
- deskew: 2
- deskewing: 2
- scanning: 2
- scene-text-recognition: 2
- scene-text: 2
- pytorch: 2
- deep-learning: 2
- cnn: 2
- tesseract-ocr: 2
- ocr-engine: 2
- lstm: 2
- gtk: 2
- scanner: 2
- c-plus-plus: 1
- htr: 1
- layout-analysis: 1
- neural-networks: 1
- google-cloud: 1
- google-cloud-ml: 1
- image-recognition: 1
- ml: 1
- sane: 1
- seq2seq: 1
- tensorflow: 1
- knn-algorithm: 1
- machine-vision: 1
- machinelearning: 1
- machinevision: 1
- opencv: 1
- python-ocr: 1
- supervised-learning: 1
- python3: 1
- pix2pix: 1
- gan: 1