https://github.com/opensemanticsearch/tesseract-ocr-cache
Tesseract OCR wrapper for Apache Tika and/or Open Semantic ETL that caches the OCR results, so Tika-Server or Open Semantic ETL does not have to repeat slow and expensive OCR on the same images
- Host: GitHub
- URL: https://github.com/opensemanticsearch/tesseract-ocr-cache
- Owner: opensemanticsearch
- License: gpl-3.0
- Created: 2020-02-03T19:29:08.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-12-20T16:54:20.000Z (over 3 years ago)
- Last Synced: 2025-04-08T03:01:39.216Z (3 months ago)
- Topics: cache, caching, ocr, python, tesseract, tesseract-ocr, tika, tika-server
- Language: Python
- Homepage:
- Size: 32.2 KB
- Stars: 5
- Watchers: 3
- Forks: 1
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Tesseract OCR Cache
Tesseract OCR wrapper that caches the OCR results, so Apache Tika-Server does not have to repeat slow and expensive OCR on the same images.
For example, the same images (logos or corporate identity elements) often appear in many PDF or Word documents.
The cache also helps when you reindex or reanalyze your documents because of changed ETL settings or new analysis features.
For this purpose there is a Tesseract wrapper that is called with the same parameters as the original Tesseract command line interface (CLI):
# tesseract_cache
The command line tool [tesseract_cache](tesseract_cache)/tesseract calls Tesseract OCR and caches the results to a file directory before returning the resulting text. If you OCR the same image again, it doesn't call Tesseract OCR again but returns the result text from the cache.
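The caching idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`cache_key`, `run_tesseract`, `ocr_cached`), the SHA-256 key built from the image bytes plus the CLI options, and the one-`.txt`-file-per-key cache layout are all assumptions made for the example.

```python
import hashlib
import subprocess
from pathlib import Path


def cache_key(image_path, args):
    """Hash the image bytes together with the CLI options (e.g. the language)."""
    digest = hashlib.sha256()
    digest.update(Path(image_path).read_bytes())
    digest.update("\0".join(args).encode("utf-8"))
    return digest.hexdigest()


def run_tesseract(image_path, args):
    """Call the real Tesseract CLI and return the text it prints to stdout."""
    result = subprocess.run(
        ["tesseract", str(image_path), "stdout", *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def ocr_cached(image_path, args, cache_dir, ocr=run_tesseract):
    """Return cached OCR text if present, otherwise run OCR and store the result."""
    # Assumed layout: <cache_dir>/<key>.txt
    cache_file = Path(cache_dir) / (cache_key(image_path, args) + ".txt")
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    text = ocr(image_path, args)
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

Including the options in the key means that the same image processed with a different language setting (for example `["-l", "deu"]` instead of `["-l", "eng"]`) gets its own cache entry. The `ocr` parameter exists only so the sketch can be exercised without Tesseract installed.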
# tesseract_fake
The command line tool [tesseract_fake](tesseract_fake)/tesseract does not forward the call to Tesseract OCR. It returns OCR results only if they were already cached by earlier runs of `tesseract_cache/tesseract`. If the image has not been processed by OCR yet, it returns only the string `[Image (no OCR yet)]`.
Since OCR consumes most of the resources but often adds only a little extra information, this approach is used to index most document content without expensive OCR processing, so that most content becomes searchable much earlier.
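A cache-only lookup of this kind might look like the sketch below. Only the placeholder string comes from this README; the function names and the cache layout (a SHA-256 key over image bytes and options, one `.txt` file per key) are hypothetical and chosen for illustration.

```python
import hashlib
from pathlib import Path

NO_OCR_YET = "[Image (no OCR yet)]"  # placeholder string named in this README


def cache_file_for(image_path, args, cache_dir):
    """Locate the cache entry for an image (assumed layout: <cache_dir>/<key>.txt)."""
    digest = hashlib.sha256()
    digest.update(Path(image_path).read_bytes())
    digest.update("\0".join(args).encode("utf-8"))
    return Path(cache_dir) / (digest.hexdigest() + ".txt")


def ocr_fake(image_path, args, cache_dir):
    """Never run OCR: return the cached text, or the placeholder if none exists."""
    cache_file = cache_file_for(image_path, args, cache_dir)
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    return NO_OCR_YET
```

Because the lookup only hashes the file and checks for a cache entry, it stays cheap even for documents with many embedded images.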
The fake OCR text acts as a temporary status: it tells us that Apache Tika found images in the document, so such documents are added to another task queue for expensive OCR with lower priority than the standard text extraction tasks.
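The routing described above could be sketched like this. It is an in-memory illustration only: the two `deque` queues, the function names and the document IDs are assumptions, and a real deployment would more likely use a persistent task queue.

```python
from collections import deque

NO_OCR_YET = "[Image (no OCR yet)]"  # placeholder returned when no OCR result is cached

text_queue = deque()  # standard text extraction tasks (higher priority)
ocr_queue = deque()   # expensive OCR tasks (lower priority)


def handle_extracted_text(doc_id, text):
    """Queue the document for real OCR if the extracted text contains the placeholder."""
    needs_ocr = NO_OCR_YET in text
    if needs_ocr:
        ocr_queue.append(doc_id)
    return needs_ocr


def next_task():
    """Serve text extraction tasks first; run OCR tasks only when none are waiting."""
    if text_queue:
        return ("extract", text_queue.popleft())
    if ocr_queue:
        return ("ocr", ocr_queue.popleft())
    return None
```

With this ordering a worker drains all pending text extraction before it starts any OCR, which matches the goal of making most content searchable early.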
# Setup
Just configure Apache Tika to use the command `tesseract` in the directory `tesseract_cache` instead of the directory of the original tesseract binary.