awesome-ocr

Links to awesome OCR projects
https://github.com/kba/awesome-ocr

Datasets
- Ground Truth
 - CIS OCR Test Set - 2 example documents each in German/Latin/Greek with ground truth for [PoCoTo](https://github.com/cisocrgroup/PoCoTo)
 - CLTK - Corpora from [Classical Language Toolkit](http://cltk.org/) `PDM 1.0`
 - DIVA-HisDB - 150 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of three medieval manuscripts `CC-BY-NC 3.0`
 - EEBO-TCP - 25,363 EEBO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-eebo/) `PDM 1.0`
 - ECCO-TCP - 2,188 ECCO documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-ecco/) `PDM 1.0`
 - Evans-TCP - 4,977 Evans documents transcribed by [TCP](http://www.textcreationpartnership.org/tcp-evans/)
 - FDHN - Finnish Digitised Historical Newspapers, [Paper](http://doi.org/10.1045/july2016-paakkonen), (free) [registration](https://digi.kansalliskirjasto.fi/opendata/submit?set_language=en) required, [Terms of Use](https://digi.kansalliskirjasto.fi/terms)
 - GERMANA - 764 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-germana-corpus) required `non-commercial use only`
 - GT4HistOCR - Ground Truth for German Fraktur and Early Modern Latin `CC-BY 4.0`
 - IMPACT-BHL - 2,418 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the Biodiversity Heritage Library, [XML@GitHub](https://github.com/impactcentre/groundtruth-bhl) `CC-BY 3.0`
 - IMPACT-BNF - 151 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of France, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
 - IMPACT-KB - 142 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of the Netherlands `CC-BY 4.0`
 - IMPACT-PSNC - 478 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from four Polish digital libraries, [XML@GitHub](https://github.com/impactcentre/groundtruth-pol) `CC-BY 3.0`
 - MJSynth - 9 million synthetic images covering 90k English words
 - OCR19thSAC - 19,000 pages of Swiss Alpine Club yearbooks transcribed via [Text+Berg digital](http://textberg.ch/site/en/welcome/) `CC-BY 4.0`
 - OCR-D - 180 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of German historical prints from [OCR-D](http://ocr-d.de/) `CC-BY-SA 4.0`
 - PRImA-ENP - 528 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) of historic newspapers from [Europeana Newspapers](http://www.europeana-newspapers.eu/), (free) [registration](http://www.primaresearch.org/register) required `PDM 1.0`
 - RODRIGO - 853 Spanish manuscript pages, (free) [registration](https://www.prhlt.upv.es/wp/resource/the-rodrigo-corpus) required `non-commercial use only`
 - archiscribe-corpus - >4,200 lines transcribed from 19th Century German prints via [archiscribe](https://archiscribe.jbaiter.de/) `CC-BY 4.0`
 - Rescribe - Transcriptions of Caroline Minuscule Manuscripts `PDM 1.0`
 - EarlyPrintedBooks - ~8,800 lines from several early printed books `CC-BY-NC-SA 4.0`
 - eMOP-TCP - 2,188 ECCO-TCP documents, cleaned up by [eMOP](http://emop.tamu.edu/) `PDM 1.0`
 - FROC-MSS - 4 Old French Medieval Manuscripts `CC-BY 4.0`
 - imagessan - Sanskrit images & ground truth (Devanagari script)
 - LascivaRoma/lexical - Transcription of 19th century lexical resources for Latin learning
 - OCR_GS_Data - Double-checked Arabic Gold Standard from [OpenITI](https://github.com/OpenITI)
 - old-books - 322 old books from [Project Gutenberg](https://www.gutenberg.org/) `GPL 3.0`
 - Toebler-OCR - (Kraken) Ground Truth transcription of a few pages of the Tobler-Lommatzsch: Altfranzösisches Wörterbuch
 - IMPACT-BL - 294 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the British Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `PDM 1.0`
 - IMPACT-BNE - 215 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Spain, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required, [XML@GitHub](https://github.com/impactcentre/groundtruth-spa) `CC-BY-NC-SA 4.0`
 - IMPACT-NKC - 187 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the Czech National Library, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
 - IMPACT-NLB - 19 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Bulgaria, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-ND 4.0`
 - IMPACT-NUK - 209 pages [PAGE-XML](https://github.com/PRImA-Research-Lab/PAGE-XML) from the National Library of Slovenia, (free) [registration](https://www.digitisation.eu/wp-login.php?action=register) required `CC-BY-NC-SA 4.0`
Literature
- Academic articles
- Blog Posts and Tutorials
- OCR-related publication and link lists
  - OCR-D - List of OCR-related academic articles in the context of the [OCR-D](http://www.ocr-d.de/) project. :de:
  - Mendeley Group "OCR - Optical Character Recognition" - Collection of 34 papers on OCR
  - eadh.org projects - List of Digital Humanities-related projects in Europe, some related to OCR
  - Wikipedia: Comparison of optical character recognition software
  - OCR and Deep Learning
  - Ocropus Wiki: Publications
  - IMPACT: Tools for text digitisation - List of tools and software projects for text digitisation, some related to OCR
- OCR Showcases
  - abbyy-finereader-ocr-senate - Using OCR to parse scanned Senate Financial Disclosure forms.
  - cvOCR - An OCR system for recognizing résumé or CV text, implemented in Python and C and based on Tesseract
  - MathOCR - A printed scientific document recognition system, **pre-alpha**
Software
- OCR as a Service
  - ocr.space - Free Online OCR and OCR API by [@a9t9](https://github.com/A9T9) based on Tesseract (code is not open)
  - Konfuzio - Free Online OCR for up to 2,000 pages per month and OCR API by @atraining, see https://youtu.be/NZKUrKyFVA8 (code is not open)
  - Open OCR - Run Tesseract in Docker containers
  - tesseract-web-service - An implementation of RESTful web service for tesseract-OCR using tornado.
  - docker-ocropy - A Docker container for running the [ocropy OCR system](https://github.com/tmbdev/ocropy).
  - nidaba - An expandable and scalable OCR pipeline
  - gamera - A meta-framework for building document processing applications, e.g. OCR
  - ocr-tools - Project to provide CLI and web service interfaces to common OCR engines
  - ocrad-docker - Run the [ocrad](http://www.gnu.org/software/ocrad/) OCR engine in a Docker container
  - kraken-docker - Run the [kraken](https://github.com/mittagessen/kraken) OCR engine in a Docker container
  - OCR4all - Provides OCR services through web applications. Included Projects: [LAREX](https://github.com/chreul/LAREX), [OCRopus](https://github.com/tmbdev/ocropy), [calamari](https://github.com/ChWick/calamari) and [nashi](https://github.com/andbue/nashi).
- OCR CLI
  - Ocrocis - Project manager interface for Ocropy, see also [external project homepage](http://cistern.cis.lmu.de/ocrocis/)
  - Pdf2PdfOCR - A tool to OCR a PDF (or supported images) and add a text "layer" (a "pdf sandwich") in the original file making it a searchable PDF. GUI included. Tesseract and cuneiform supported.
  - tesseract-recognize - Tesseract-based tool that outputs result in Page XML format ([docker image](https://hub.docker.com/r/mauvilsa/tesseract-recognize)).
  - OCRmyPDF - Adds an OCR text layer to scanned PDF files, allowing them to be searched
- OCR engines
  - gocr - OCR engine released under the GNU General Public License; the project is led by Joerg Schulenburg.
  - Ocrad - The GNU OCR. `GPL`
  - RWTH-OCR - The RWTH Aachen University Optical Character Recognition System
  - tesseract - The definitive Open Source OCR engine `Apache 2.0`
  - EasyOCR - OCR engine built on PyTorch by JaidedAI `Apache 2.0`
