{"id":19251171,"url":"https://github.com/oshekharo/clean-image-extractor","last_synced_at":"2025-04-15T21:39:11.554Z","repository":{"id":117668510,"uuid":"602663173","full_name":"OshekharO/Clean-Image-Extractor","owner":"OshekharO","description":"Python program that uses the OpenCV library to clean bulk images, and then uses the Tesseract OCR library to extract text from the cleaned images","archived":false,"fork":false,"pushed_at":"2025-02-04T04:50:36.000Z","size":12,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-02-04T05:26:52.952Z","etag":null,"topics":["opencv","python","tesseract-ocr"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OshekharO.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-02-16T17:24:37.000Z","updated_at":"2025-02-04T04:50:39.000Z","dependencies_parsed_at":"2024-11-09T18:22:39.759Z","dependency_job_id":"df3530c0-9fa1-4ac7-b5ef-f386ad1ba2d8","html_url":"https://github.com/OshekharO/Clean-Image-Extractor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OshekharO%2FClean-Image-Extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OshekharO%2FClean-Image-Extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OshekharO%2FClean-Image-Extractor/releases","manifests_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/OshekharO%2FClean-Image-Extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/OshekharO","download_url":"https://codeload.github.com/OshekharO/Clean-Image-Extractor/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240347763,"owners_count":19787230,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["opencv","python","tesseract-ocr"],"created_at":"2024-11-09T18:20:33.604Z","updated_at":"2025-02-23T16:39:29.524Z","avatar_url":"https://github.com/OshekharO.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Clean Image Extractor\n\nA Python-based tool that uses OpenCV and Tesseract OCR to clean images in bulk and extract text from them.\n\n## Prerequisite\n\nBefore proceeding, ensure that the [Tesseract OCR engine](https://github.com/tesseract-ocr/tesseract/wiki) is installed on your system. Tesseract OCR is an open-source Optical Character Recognition engine that recognizes text in images.\n\n## How it Works\n\nThe program runs in two steps:\n\n1. **Image Cleaning**: Through OpenCV, the program processes each image, reducing noise and enhancing the image quality to improve text extraction.\n\n2. 
**Text Extraction**: Using the Tesseract OCR engine, the program extracts text from the cleaned images and writes the result to individual text files.\n\n## Usage\n\nHere's a breakdown of the core functions and how they interact:\n\n- `clean_image()`: This function accepts an image as input and applies several OpenCV image processing techniques to clean the image and reduce noise.\n\n- `extract_text()`: This function takes two parameters: the path to an image file and the path to an output text file. It loads the image, cleans it using the `clean_image()` function, and then uses Tesseract OCR to extract text from the cleaned image. The extracted text is then saved to the specified output text file.\n\n- `main()`: This function serves as the orchestrator. It retrieves a list of image files in a specified directory and processes each one using the `extract_text()` function. The resulting output text files are saved in a separate directory, with names following the format `text1.txt`, `text2.txt`, and so on.\n\n### Disclaimer\n\nThe quality of the input image affects the accuracy of text extraction; higher-quality images generally yield more accurate results. Post-processing such as spell-checking might also be necessary to handle OCR's occasional recognition errors.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foshekharo%2Fclean-image-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Foshekharo%2Fclean-image-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Foshekharo%2Fclean-image-extractor/lists"}