https://github.com/oshekharo/clean-image-extractor

Python program that uses the OpenCV library to clean bulk images, and then uses the Tesseract OCR library to extract text from the cleaned images
opencv python tesseract-ocr

Host: GitHub
URL: https://github.com/oshekharo/clean-image-extractor
Owner: OshekharO
Created: 2023-02-16T17:24:37.000Z (over 3 years ago)
Default Branch: main
Last Pushed: 2025-02-04T04:50:36.000Z (over 1 year ago)
Last Synced: 2025-02-04T05:26:52.952Z (over 1 year ago)
Topics: opencv, python, tesseract-ocr
Language: Python
Homepage:
Size: 11.7 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# Clean Image Extractor

A Python tool that uses OpenCV to clean images in bulk and Tesseract OCR to extract text from the cleaned images.

## Prerequisite

Before proceeding, ensure that the [Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract/wiki) is installed on your system. Tesseract OCR is an open-source Optical Character Recognition engine used to recognize textual data from images.
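One way to confirm the prerequisite before running the tool is to look for the Tesseract executable on the `PATH`. The sketch below is not part of the project; the helper name `tesseract_version` is hypothetical, and it assumes the binary is called `tesseract`, which is the usual name but may differ on some systems.

```python
import shutil
import subprocess
from typing import Optional


def tesseract_version(executable: str = "tesseract") -> Optional[str]:
    """Return the first line of `<executable> --version`, or None if not found.

    Hypothetical helper: the default executable name is an assumption.
    """
    path = shutil.which(executable)
    if path is None:
        return None  # Tesseract is not on PATH
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version banner to stderr.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else "unknown version"


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "Tesseract not found; install it first.")
```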

## How it Works

The program runs in two steps:

1. **Image Cleaning**: Using OpenCV, the program processes each image, reducing noise and enhancing image quality to improve text extraction.

2. **Text Extraction**: Using the Tesseract OCR engine, the program extracts text from each cleaned image and writes the result to its own text file.

## Usage

Here's a breakdown of the core functions and how they interact:

- `clean_image()`: This function accepts an image as input, applying several image processing techniques via OpenCV to clean the image and eliminate noise.

- `extract_text()`: This function takes two parameters: the path to an image file and the path to an output text file. It loads the image, cleans it using the `clean_image()` function, and then uses Tesseract OCR to extract text from the cleaned image. The extracted text is then saved to the specified output text file.

- `main()`: This function serves as the orchestrator. It retrieves a list of image files in a specified directory, processing each image file using the `extract_text()` function. The resulting output text files are saved in a separate directory, with names following the format `text1.txt`, `text2.txt`, and so on.
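The orchestration described for `main()` can be sketched as a loop that pairs each image with a numbered output file. This is an illustration under stated assumptions, not the project's code: `process_directory`, the `IMAGE_SUFFIXES` set, and the sorted processing order are all hypothetical, and the OCR step is passed in as a callable so that the project's own `extract_text()` (or any stand-in) can be plugged in.

```python
from pathlib import Path
from typing import Callable, List

# Assumed set of image extensions; the project may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def process_directory(
    input_dir: str,
    output_dir: str,
    extract: Callable[[Path, Path], None],
) -> List[Path]:
    """Run `extract(image_path, text_path)` for every image in `input_dir`.

    Output files are named text1.txt, text2.txt, ... in `output_dir`.
    """
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    # Sorting makes the image-to-number mapping reproducible (an assumption).
    images = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    outputs = []
    for index, image in enumerate(images, start=1):
        target = out_path / f"text{index}.txt"
        extract(image, target)
        outputs.append(target)
    return outputs
```

Passing the extractor as a parameter keeps the file-naming logic testable without OpenCV or Tesseract installed.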

### Disclaimer

Please note that image quality affects the accuracy of text extraction: higher-quality images generally yield more accurate results. Post-processing such as spell-checking may also be necessary to correct occasional OCR recognition errors.
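As one possible form of post-processing, OCR output can be compared against a list of expected words and near misses replaced. The sketch below uses Python's standard `difflib` module; the function name `correct_tokens`, the vocabulary, and the `0.75` cutoff are illustrative assumptions and are not part of this project.

```python
import difflib
from typing import Iterable


def correct_tokens(
    text: str, vocabulary: Iterable[str], cutoff: float = 0.75
) -> str:
    """Replace tokens that closely resemble a vocabulary word.

    Tokens are split on whitespace, so runs of whitespace collapse to
    single spaces in the result. Tokens with no close match are kept.
    """
    words = list(vocabulary)
    known = set(words)
    corrected = []
    for token in text.split():
        if token in known:
            corrected.append(token)
            continue
        # Best single candidate whose similarity ratio is at least `cutoff`.
        matches = difflib.get_close_matches(token, words, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)
```

A fixed vocabulary only suits documents with predictable wording, such as forms or invoices; free text would likely need a proper spell-checking library.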
