https://github.com/oshekharo/clean-image-extractor
Python program that uses the OpenCV library to clean bulk images, and then uses the Tesseract OCR library to extract text from the cleaned images
- Host: GitHub
- URL: https://github.com/oshekharo/clean-image-extractor
- Owner: OshekharO
- Created: 2023-02-16T17:24:37.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2025-02-04T04:50:36.000Z (over 1 year ago)
- Last Synced: 2025-02-04T05:26:52.952Z (over 1 year ago)
- Topics: opencv, python, tesseract-ocr
- Language: Python
- Homepage:
- Size: 11.7 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Clean Image Extractor
A Python tool that uses OpenCV and Tesseract OCR to clean images and extract text from them in bulk.
## Prerequisite
Before running the program, make sure the [Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract/wiki) is installed on your system. Tesseract is an open-source optical character recognition (OCR) engine that recognizes text in images.
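
One way to confirm the prerequisite is to look for the Tesseract executable on the `PATH`. The sketch below is not part of this repository: the helper name `tesseract_version` is hypothetical, and it assumes the executable is called `tesseract`, which is the usual name but may differ on your system.

```python
import shutil
import subprocess


def tesseract_version(binary="tesseract"):
    """Return the first line printed by `<binary> --version`.

    Returns None when the executable is not found on PATH, and an empty
    string when it runs but prints nothing.
    """
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the version banner to stderr instead of stdout.
    lines = (result.stdout or result.stderr).strip().splitlines()
    return lines[0] if lines else ""
```

Calling `tesseract_version()` before processing a large batch lets the program stop early with a clear message if the engine is missing.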
## How it Works
The program runs in two main steps:
1. **Image Cleaning**: Using OpenCV, the program processes each image to reduce noise and improve image quality before text extraction.
2. **Text Extraction**: Using the Tesseract OCR engine, the program extracts text from the cleaned images and writes the result for each image to its own text file.
## Usage
Here's a breakdown of the core functions and how they interact:
- `clean_image()`: This function accepts an image as input, applying several image processing techniques via OpenCV to clean the image and eliminate noise.
- `extract_text()`: This function takes two parameters: the path to an image file and the path to an output text file. It loads the image, cleans it using the `clean_image()` function, and then uses Tesseract OCR to extract text from the cleaned image. The extracted text is then saved to the specified output text file.
- `main()`: This function serves as the orchestrator. It retrieves a list of image files in a specified directory, processing each image file using the `extract_text()` function. The resulting output text files are saved in a separate directory, with names following the format `text1.txt`, `text2.txt`, and so on.
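
The orchestration that `main()` performs could look roughly like the sketch below. It is an assumption-laden illustration rather than the repository's implementation: the helper names `list_images` and `run_batch`, the set of file extensions, and the alphabetical processing order are all hypothetical. The `extract_text` argument is any callable that takes an image path and an output path, matching the signature described above.

```python
from pathlib import Path

# Assumed set of image extensions; the repository may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def list_images(input_dir):
    """Return the image files in input_dir, sorted by path."""
    return sorted(
        p
        for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def run_batch(input_dir, output_dir, extract_text):
    """Call extract_text(image_path, output_path) for every image.

    Output files are named text1.txt, text2.txt, ... in processing order.
    Returns the list of output paths.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, image_path in enumerate(list_images(input_dir), start=1):
        output_path = out_dir / f"text{index}.txt"
        extract_text(image_path, output_path)
        written.append(output_path)
    return written
```

Passing the extraction function as an argument keeps the file handling separate from the OCR stack, so the naming and directory logic can be checked with a stand-in function.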
### Disclaimer
Image quality affects the accuracy of text extraction: higher-quality images generally yield more accurate results. Post-processing such as spell-checking may also be needed to correct occasional OCR recognition errors.
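
As a rough illustration of such post-processing, the sketch below replaces tokens that closely resemble a word from a known vocabulary, using Python's standard `difflib` module. It is not part of this repository: the function name `correct_tokens`, the vocabulary, and the similarity cutoff are assumptions, and a dedicated spell-checking library would likely do better on real text.

```python
import difflib


def correct_tokens(text, vocabulary, cutoff=0.75):
    """Replace tokens that closely match a vocabulary word.

    Tokens already in the vocabulary, or with no close match, are left
    unchanged. Whitespace is normalized to single spaces.
    """
    vocab = sorted({word.lower() for word in vocabulary})
    corrected = []
    for token in text.split():
        if token.lower() in vocab:
            corrected.append(token)
            continue
        # get_close_matches returns the best candidates at or above cutoff.
        matches = difflib.get_close_matches(
            token.lower(), vocab, n=1, cutoff=cutoff
        )
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


# Hypothetical OCR output with typical confusions (l/i and 1/l).
print(correct_tokens("lnvoice tota1 amount 42", ["invoice", "total", "amount"]))
```

A lower cutoff corrects more aggressively but risks rewriting words that were recognized correctly, so it is worth tuning against a sample of real output.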