https://github.com/opensemanticsearch/tesseract-ocr-cache

Tesseract OCR wrapper for Apache Tika and/or Open Semantic ETL that caches OCR results, so Tika-Server or Open Semantic ETL does not have to rerun slow and expensive OCR on the same images.

cache caching ocr python tesseract tesseract-ocr tika tika-server

Host: GitHub
URL: https://github.com/opensemanticsearch/tesseract-ocr-cache
Owner: opensemanticsearch
License: gpl-3.0
Created: 2020-02-03T19:29:08.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2021-12-20T16:54:20.000Z (over 3 years ago)
Last Synced: 2025-04-08T03:01:39.216Z (3 months ago)
Topics: cache, caching, ocr, python, tesseract, tesseract-ocr, tika, tika-server
Language: Python
Homepage:
Size: 32.2 KB
Stars: 5
Watchers: 3
Forks: 1
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE

# Tesseract OCR Cache

Tesseract OCR wrapper that caches OCR results, so Apache Tika-Server does not have to rerun slow and expensive OCR on the same images.

This helps, for example, with identical images (logos or corporate identity elements) that appear in many PDF or Word documents.

It also helps when you reindex or reanalyze your documents because of changed ETL settings or new analysis features.

For this purpose there is a Tesseract wrapper that is called with the same parameters as the original Tesseract command-line interface (CLI):

# tesseract_cache

The command-line tool [tesseract_cache](tesseract_cache)/tesseract calls Tesseract OCR and caches the result in a file directory before returning the recognized text.

If you OCR the same image again, it doesn't call Tesseract OCR again but returns the result text from the cache.
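The caching behavior described above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code: the function names (`cache_key`, `run_tesseract`, `ocr_cached`), the cache file layout, and the choice of SHA-256 over the image bytes plus the CLI options as cache key are all assumptions made for this example.

```python
import hashlib
import os
import subprocess


def cache_key(image_path, options):
    """Hash the image bytes together with the CLI options (assumed key scheme)."""
    digest = hashlib.sha256()
    with open(image_path, "rb") as image_file:
        for chunk in iter(lambda: image_file.read(65536), b""):
            digest.update(chunk)
    # Different options (e.g. another language) must not share a cache entry.
    digest.update("\0".join(options).encode("utf-8"))
    return digest.hexdigest()


def run_tesseract(image_path, options):
    """Call the real Tesseract CLI and return the recognized text."""
    result = subprocess.run(
        ["tesseract", image_path, "stdout", *options],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_cached(image_path, options, cache_dir, ocr=run_tesseract):
    """Return cached text if present; otherwise run OCR and store the result."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_file = os.path.join(cache_dir, cache_key(image_path, options) + ".txt")
    if os.path.exists(cache_file):
        with open(cache_file, encoding="utf-8") as cached:
            return cached.read()
    text = ocr(image_path, options)
    with open(cache_file, "w", encoding="utf-8") as cached:
        cached.write(text)
    return text
```

Keying on the image content rather than the file name is what would let the same logo embedded in many documents hit a single cache entry. The `ocr` parameter exists only so the sketch can be exercised without Tesseract installed.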

# tesseract_fake

The command-line tool [tesseract_fake](tesseract_fake)/tesseract does not forward the call to Tesseract OCR.

It returns OCR results only if they were already cached by earlier runs of tesseract_cache/tesseract.

If the image has not been processed by OCR yet, it returns only the string `[Image (no OCR yet)]`.
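A cache-only lookup of this kind might look like the sketch below. The names `cache_path` and `ocr_fake` and the cache layout (one text file named after a SHA-256 hash of the image bytes and CLI options) are hypothetical; only the placeholder string is taken from the description above.

```python
import hashlib
import os

# Placeholder text returned when no cached OCR result exists yet.
PLACEHOLDER = "[Image (no OCR yet)]"


def cache_path(image_path, options, cache_dir):
    """Assumed cache layout: <cache_dir>/<sha256 of image bytes + options>.txt"""
    digest = hashlib.sha256()
    with open(image_path, "rb") as image_file:
        digest.update(image_file.read())
    digest.update("\0".join(options).encode("utf-8"))
    return os.path.join(cache_dir, digest.hexdigest() + ".txt")


def ocr_fake(image_path, options, cache_dir):
    """Never run OCR: return cached text if available, else the placeholder."""
    cached_file = cache_path(image_path, options, cache_dir)
    if os.path.exists(cached_file):
        with open(cached_file, encoding="utf-8") as cached:
            return cached.read()
    return PLACEHOLDER
```

For the lookup to find earlier results, the fake wrapper has to derive cache file names exactly as the caching wrapper does.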

Because OCR consumes most of the processing resources while often adding only a little extra information, this approach indexes most document content without expensive OCR, so that most content becomes searchable much earlier.

The placeholder text (a temporary status) tells us that Apache Tika found images in the document, so such documents are added to a separate task queue for expensive OCR, with lower priority than the standard text extraction tasks.
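The routing step can be illustrated with a small priority queue. This sketch is an assumption about how such scheduling could be done, not how Open Semantic ETL implements it: the priority values, the task names, and the helper functions `enqueue` and `schedule_ocr_if_needed` are invented for the example.

```python
import heapq
import itertools

PLACEHOLDER = "[Image (no OCR yet)]"
PRIORITY_TEXT = 0   # standard text extraction, processed first
PRIORITY_OCR = 10   # expensive OCR, processed later

# Tie-breaker so tasks with equal priority keep their insertion order.
_sequence = itertools.count()


def enqueue(queue, priority, task, doc_id):
    """Push a task onto the heap; lower priority numbers are popped first."""
    heapq.heappush(queue, (priority, next(_sequence), task, doc_id))


def schedule_ocr_if_needed(queue, doc_id, extracted_text):
    """Queue a low-priority OCR task if the text contains the placeholder."""
    if PLACEHOLDER in extracted_text:
        enqueue(queue, PRIORITY_OCR, "ocr", doc_id)
        return True
    return False
```

With this layout, a text extraction task queued after an OCR task is still popped before it, which mirrors the idea of making most content searchable first and filling in OCR text later.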

# Setup

Configure Apache Tika to use the `tesseract` command in the `tesseract_cache` directory instead of the directory containing the original Tesseract binary.
