{"id":19556049,"url":"https://github.com/thecomputeguy/pdfocrtool","last_synced_at":"2026-06-09T01:05:13.631Z","repository":{"id":184776691,"uuid":"370268565","full_name":"TheComputeGuy/PDFOCRtool","owner":"TheComputeGuy","description":"Add an OCR layer to *any* PDF","archived":false,"fork":false,"pushed_at":"2021-09-01T17:42:03.000Z","size":47384,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-26T07:43:42.101Z","etag":null,"topics":["ocr","ocrmypdf","pdf","pdftopng","python","tesseract"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TheComputeGuy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2021-05-24T07:40:38.000Z","updated_at":"2024-07-16T02:26:20.000Z","dependencies_parsed_at":"2023-07-30T07:50:36.814Z","dependency_job_id":null,"html_url":"https://github.com/TheComputeGuy/PDFOCRtool","commit_stats":null,"previous_names":["thecomputeguy/pdfocrtool"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/TheComputeGuy/PDFOCRtool","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheComputeGuy%2FPDFOCRtool","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheComputeGuy%2FPDFOCRtool/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheComputeGuy%2FPDFOCRtool/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheComputeGuy%2FPDFOCRtool/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TheComputeGu
y","download_url":"https://codeload.github.com/TheComputeGuy/PDFOCRtool/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TheComputeGuy%2FPDFOCRtool/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34086670,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ocr","ocrmypdf","pdf","pdftopng","python","tesseract"],"created_at":"2024-11-11T04:36:34.571Z","updated_at":"2026-06-09T01:05:13.608Z","avatar_url":"https://github.com/TheComputeGuy.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF OCR Tool\n\nPDF OCR Tool is a Python-based tool for adding OCR to PDFs. It builds on the great work done at [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF) and [pdftopng](https://github.com/vinayak-mehta/pdftopng) and acts as a wrapper for them on Windows. It converts the input PDF to images and back to a PDF to strip any existing data or text, then adds the OCR layer on top.\n\n## Installation\n\nDownload the appropriate ZIP file from [releases](https://github.com/Shubham-272/PDFOCRtool/releases) and unzip it.\n\nThe downloaded ZIP file contains two batch scripts:\n1. setup.bat\n2. 
install-lib.bat\n\nRun both scripts in that order **with administrative privileges**. This installs the prerequisites: Chocolatey as the package manager, Python 3.8, the Tesseract OCR engine, and Ghostscript, plus the Python packages ocrmypdf, pdftopng, and PyPDF2.\n\nOR\n\n1. Clone the repository\n```\ngit clone git@github.com:Shubham-272/PDFOCRtool.git\n```\n2. Ensure you have Python 3 installed (tested on Python 3.8)\n3. Install dependencies and libraries (preferably via Chocolatey)\n```\nchoco install --pre tesseract -y\nchoco install ghostscript -y\npip install ocrmypdf pdftopng PyPDF2\n```\n4. The src folder contains the Python scripts for the individual operations as well as for the overall conversion.\n\n## Usage\n\n**PDF OCR Tool.exe** runs the application. It takes two inputs: the input PDF file and the folder where the output PDF should be saved. The output file is named paper.pdf.\n\nIf you cloned the repository via git, the script pdfOCRtool.py drives the overall operation. You can run it from a terminal:\n```\npython pdfOCRtool.py\n```\n\nConversion takes time, and progress is shown in the console that opens alongside. Once the conversion is complete, the UI shows a message in green saying \"File converted and saved successfully!\"\n\nIf an error occurs, the message is shown in the UI with a red background. Contact the maintainers with the error message for resolution.\n\n## Additional Language Support\n\nPDF OCR Tool is based on the Tesseract OCR engine. Tesseract supports a wide range of languages (you can check the list [here](https://tesseract-ocr.github.io/tessdoc/Data-Files-in-different-versions.html)).\n\nPDF OCR Tool installs only English by default. 
To add support for languages other than English, download the corresponding language pack (.traineddata file) from [here](https://github.com/tesseract-ocr/tessdata/) and place it in **C:\\\\Program Files\\\\Tesseract-OCR\\\\tessdata** (or wherever Tesseract OCR is installed).\n\nTo perform OCR on a PDF in a language other than English, specify the language(s) to use at run time as a comma-separated list.\n\n## Changelog\nv1.0.0 - Initial release, supports English OCR\n\nv1.1.0 - Added language support via arguments\n\n## Contributing\nPull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.\n\n## License\n[GNU General Public License v3.0](https://choosealicense.com/licenses/gpl-3.0/)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthecomputeguy%2Fpdfocrtool","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthecomputeguy%2Fpdfocrtool","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthecomputeguy%2Fpdfocrtool/lists"}