awesome-ocr

https://github.com/techiewonk/awesome-ocr

1. <a name='Software'></a>Software
- 1.1. <a name='OCRengines'></a>OCR engines
 - tesseract - The definitive Open Source OCR engine `Apache 2.0`
 - EasyOCR - OCR engine built on PyTorch by JaidedAI, `Apache 2.0`
 - ocropus 0.4 - Older v0.4 state of Ocropus, with tesseract 2.04 and iulib, C++
 - kraken - Ocropus fork with sane defaults
 - ocular - Machine-learning OCR for historic documents
 - attention-ocr - OCR engine using visual attention mechanisms
 - Calamari - OCR Engine based on OCRopy and Kraken
 - doctr - A seamless & high-performing OCR library powered by Deep Learning
 - gocr - OCR engine under the GNU Public License led by Joerg Schulenburg.
 - Ocrad - The GNU OCR. `GPL`
 - RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
 - ocropus - OCR engine based on LSTM, `Apache 2.0`
 - SwiftOCR - fast and simple OCR library written in Swift
- 1.2. <a name='OlderandpossiblyabandonedOCRengines'></a>Older and possibly abandoned OCR engines
 - hebOCR - Hebrew character recognition library (previously named hocr, see [Wikipedia article](https://de.wikipedia.org/wiki/HebOCR)) `GPL`
 - Eye - an experimental Java OCR (image-to-text) application
 - kognition - An omnifont OCR software for KDE
 - OCRchie - Modular Optical Character Recognition Software
 - ocre - o.c.r. easy
 - xplab - A GTK 2 tool for pattern matching
- 1.3. <a name='OCRfileformats'></a>OCR file formats
 - ocr-conversion-scripts
 - hocr-tools - Tools for doing various useful things with hOCR files, `Apache 2.0`
 - hocr-spec - hOCR 1.2 specification
 - ocr-transform - CLI tool to convert between hOCR and ALTO, `MIT`
 - hocr-parser - hOCR Specification Python Parser
 - hOCRTools - hOCR to ALTO conversion XSLT
 - ALTO XML Schema - XML Schema and development of the ALTO XML format
 - ALTO XML Documentation - Documentation and use cases for ALTO
 - alto-tools - Various tools to work with ALTO files, Python
 - AbbyyToAlto - PHP script converting from Abbyy 6 to ALTO XML
 - TEI-OCR - TEI customization for OCR generated layout and content information
 - py-pagexml - Python library for handling PAGE XML and OPF files.
 - abby2hocr.xslt - XSLT script
 - TEI SIG on Libraries - Best Practices for TEI in Libraries
 - GDZ - METS/TEI-based GDZ document format
 - PAGE-XML Schema - XML schema of the PAGE XML format along with documentation and examples
 - omni:us Pages Format (OPF) - XML schema very similar to PAGE XML that has some additional features.
- 1.4. <a name='OCRCLI'></a>OCR CLI
 - OCRmyPDF - OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
 - Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
 - tesseract-recognize - Tesseract-based tool that outputs result in Page XML format ([docker image](https://hub.docker.com/r/mauvilsa/tesseract-recognize)).
 - Ocrocis - Project manager interface for Ocropy, see also [external project homepage](http://cistern.cis.lmu.de/ocrocis/)
2. <a name='DeskewingandDewarping'></a>Deskewing and Dewarping
 - MORAN_v2 - A Multi-Object Rectified Attention Network for Scene Text Recognition
 - unproject_text - Perspective recovery of text using transformed ellipses
 - unpaper - a post-processing tool for scanned sheets of paper, especially for book pages that have been scanned from previously created photocopies.
 - deskew - Library used to deskew a scanned document
 - deskewing - Contains code to deskew images using MLPs, LSTMs and LLS transformations
 - skew_correction - De-skewing images with slanted content by finding the deviation using Canny Edge Detection.
 - page_dewarp - Page dewarping and thresholding using a "cubic sheet" model
 - text_deskewing - Rotate text images if they are not straight for better text detection and recognition.
 - galfar/deskew - Deskew is a command line tool for deskewing scanned text documents. It uses Hough transform to detect "text lines" in the image. As an output, you get an image rotated so that the lines are horizontal.
 - xellows1305/Document-Image-Dewarping - No code :(
 - Alyn
 - DewarpNet
 - Docuwarp
- 2.1. <a name='OCRGUI'></a>OCR GUI
 - moz-hocr-editor - Firefox Addon for editing hOCR files **Discontinued**
 - qt-box-editor - QT4 editor of tesseract-ocr box files.
 - ocr-gt-tools - Client-Server application for editing OCR ground truth.
 - Paperwork - Using scanners and OCR to grep paper documents the easy way.
 - gImageReader - gImageReader is a simple Gtk/Qt front-end to tesseract-ocr.
 - OCRFeeder - GTK graphical user interface that allows the users to correct characters or bounding boxes, ODT export and more.
 - Paperless - Scan, index, and archive all of your paper documents.
22. <a name='Literature'></a>Literature
- 22.2. <a name='BlogPostsandTutorials'></a>Blog Posts and Tutorials
- 22.4. <a name='Academicarticles'></a>Academic articles
- 22.1. <a name='OCR-relatedpublicationandlinklists'></a>OCR-related publication and link lists
 - IMPACT: Tools for text digitisation - List of software tools and projects, some related to OCR
 - Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
 - Wikipedia: Comparison of optical character recognition software
3. <a name='Textdetectionandlocalization'></a>Text detection and localization
 - ocr_attention - Robust Scene Text Recognition with Automatic Rectification.
- 3.1. <a name='OCRPreprocessing'></a>OCR Preprocessing
 - NoiseRemove.java in MathOCR - Java implementation of Adaptive degraded document image binarization by B. Gatos , I. Pratikakis, S.J. Perantonis
 - binarize.c in ZBar - C implementations of two binarization algorithms, based on Sauvola
 - Whiteboard Picture Cleaner - Shell one-liner/script to clean up and beautify photos of whiteboards
 - textcleaner - Processes a scanned document of text to clean the text background
 - localcontrast - Fast O(1) local contrast optimization
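
Several of the preprocessing entries above are described as Sauvola-based binarization. As a rough illustration only, and not the code used in ZBar or MathOCR, the sketch below applies the commonly cited Sauvola rule `T = m * (1 + k * (s / R - 1))`, where `m` and `s` are the local mean and standard deviation. The function name `sauvola_binarize` and the defaults `window=3`, `k=0.5` and `r=128` are assumptions chosen for this example.

```python
from math import sqrt


def sauvola_binarize(image, window=3, k=0.5, r=128.0):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    Returns rows of 1 (dark foreground) and 0 (background).
    The name and default parameters are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the local neighbourhood, clipped at the image border.
            values = [
                image[j][i]
                for j in range(max(0, y - half), min(height, y + half + 1))
                for i in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            # Sauvola threshold: T = m * (1 + k * (s / R - 1))
            threshold = mean * (1 + k * (std / r - 1))
            row.append(1 if image[y][x] < threshold else 0)
        result.append(row)
    return result


if __name__ == "__main__":
    page = [
        [200, 200, 200],
        [200, 20, 200],
        [200, 200, 200],
    ]
    for line in sauvola_binarize(page):
        print(line)
```

This pure-Python loop recomputes the statistics for every pixel, so it is only suitable for tiny images. Practical implementations usually rely on integral images or an image-processing library.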
4. <a name='Segmentation'></a>Segmentation
- 4.4. <a name='DocumentSegmentation'></a>Document Segmentation
 - LayoutLM
 - LayoutLMv2
6. <a name='Tabledetection'></a>Table detection
- 4.5. <a name='FormSegmentation'></a>Form Segmentation
 - table_layout_detection_research
7. <a name='Languagedetection'></a>Language detection
 - langdetect
- 7.1. <a name='OCRasaService'></a>OCR as a Service
 - ocr.space - Free Online OCR and OCR API by [@a9t9](https://github.com/A9T9) based on Tesseract (code is not open)
 - Konfuzio - Free Online OCR up to 2,000 pages per month and OCR API by [@atraining], see https://youtu.be/NZKUrKyFVA8 (code is not open)
- 7.2. <a name='OCRevaluation'></a>OCR evaluation
 - ngram-ocr-eval - Brute and simple OCR evaluation using ngrams
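
To give a feel for n-gram based OCR evaluation, here is a minimal sketch that compares the character n-grams of an OCR result with those of a ground-truth transcription. It is not taken from ngram-ocr-eval, whose actual method may differ. The helper names `char_ngrams` and `ngram_scores` and the default `n=3` are assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams of a string (hypothetical helper)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_scores(ocr_text, ground_truth, n=3):
    """Return (precision, recall) of OCR n-grams against ground-truth n-grams."""
    ocr = char_ngrams(ocr_text, n)
    truth = char_ngrams(ground_truth, n)
    # Multiset intersection: each n-gram counts at most as often as in both texts.
    overlap = sum((ocr & truth).values())
    precision = overlap / sum(ocr.values()) if ocr else 0.0
    recall = overlap / sum(truth.values()) if truth else 0.0
    return precision, recall


if __name__ == "__main__":
    print(ngram_scores("Tbe quick brown fox", "The quick brown fox"))
```

Because it ignores where each n-gram occurs, this kind of score is cheap and needs no alignment, but it cannot tell you which characters were misrecognized.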
- 7.3. <a name='OCRlibrariesbyprogramminglanguage'></a>OCR libraries by programming language
 - leptess - Productive and safe Rust bindings/wrappers for tesseract and leptonica.
8. <a name='Datasets'></a>Datasets
- 8.1. <a name='GroundTruth'></a>Ground Truth
 - CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo)
 - CLTK - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0`
 - DIVA-HisDB - 150 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) of three medieval manuscripts `CC-BY-NC 3.0`
 - EEBO-TCP - 25,363 EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0`
 - ECCO-TCP - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0`
 - Evans-TCP - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/)
 - FDHN - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms)
 - GERMANA - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only`
 - IMPACT-BL - 294 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0`
 - IMPACT-NKC - 187 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
 - IMPACT-NLB - 19 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0`
 - IMPACT-NUK - 209 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
 - IMPACT-PSNC - 478 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0`
 - OCR19thSAC - 19,000 pages of Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0`
 - PRImA-ENP - 528 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) of historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0`
 - RODRIGO - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only`
 - IMPACT-BNF - 151 pages ([PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML)) from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
9. <a name='VideoTextSpotting'></a>Video Text Spotting
 - YORO
11. <a name='OpticalCharacterRecognitionEnginesandFrameworks'></a>Optical Character Recognition Engines and Frameworks
 - OCR-D
 - Deeplearning-OCR
 - FastText - Library for efficient text classification and representation learning
 - Ocrad
12. <a name='Awesomelists'></a>Awesome lists
 - perfectspr/awesome-ocr
13. <a name='ProprietaryOCREngines'></a>Proprietary OCR Engines
 - ABBYY
 - Omnipage
 - Clova.ai
 - Konfuzio
14. <a name='CloudbasedOCREnginesSaaS'></a>Cloud based OCR Engines (SaaS)
 - thehive.ai
 - impira
 - AWS Textract
 - Nanonets
 - docparser
 - ocrolus
 - Butler Labs
15. <a name='Fileformatsandtools'></a>File formats and tools
 - hocr
 - alto
17. <a name='DataaugmentationandSyntheticdatageneration'></a>Data augmentation and Synthetic data generation
 - DocCreator - DIAR software for synthetic document image and groundtruth generation, with various degradation models for data augmentation.
 - SynthText_Chinese_version
19. <a name='PostOCRCorrection'></a>Post OCR Correction
 - afterscan
21. <a name='misc'></a>misc
 - cosc428-structor - ~1000 book pages + OpenCV + python = page regions identified as paragraphs, lines, images, captions, etc.

Programming Languages

Python 22 C++ 6 JavaScript 2 HTML 2 C 2 Swift 1 Pascal 1 PHP 1 XSLT 1 Java 1
