{"id":18947150,"url":"https://github.com/qedsoftware/multipage-ocr","last_synced_at":"2025-04-15T22:31:33.095Z","repository":{"id":7357570,"uuid":"8682175","full_name":"qedsoftware/multipage-ocr","owner":"qedsoftware","description":"(Python) Execute tesseract OCR on a multi-page PDF.","archived":false,"fork":false,"pushed_at":"2023-06-30T22:19:44.000Z","size":12,"stargazers_count":18,"open_issues_count":1,"forks_count":11,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-29T03:51:16.686Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/qedsoftware.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-03-10T07:37:18.000Z","updated_at":"2024-05-29T18:31:00.000Z","dependencies_parsed_at":"2022-09-06T09:40:05.903Z","dependency_job_id":null,"html_url":"https://github.com/qedsoftware/multipage-ocr","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qedsoftware%2Fmultipage-ocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qedsoftware%2Fmultipage-ocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qedsoftware%2Fmultipage-ocr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/qedsoftware%2Fmultipage-ocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/qedsoftware","download_url":"https://codeload.github.com/qedsoftware/multipage-ocr/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"htt
ps://github.com","kind":"github","repositories_count":249166109,"owners_count":21223384,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-08T13:09:08.943Z","updated_at":"2025-04-15T22:31:28.087Z","avatar_url":"https://github.com/qedsoftware.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Multipage-OCR\n===============\n\nDescription\n---------------\n\nThis is a simple Python script that runs tesseract OCR on a multi-page PDF.\n\nEach page of the PDF is converted into an image, each image is converted to text, and all text files are concatenated to produce the final output.\n\nThe script lets you specify ImageMagick parameters for the image conversion, along with some tesseract parameters for the OCR.\n\nWilliam Wu (w@qed.ai), 2013 March 9\n\nPython 3\n---------\nThe original script was updated on 2018 May 20 by [Ian Watt](http://github.com/watty62) to work with Python 3.6+ on Mac OS X. The new version is [multipage-ocr_p3.py](multipage-ocr_p3.py).\n\nIf you get errors at line 121 (the step that converts the PDF to image format), running \"brew install imagemagick\" fixed them for me. See the most popular answer to this [StackOverflow question](https://stackoverflow.com/questions/28627473/error-for-convert-command-in-command-line).\n\nPython 2\n---------\nThe [original script](multipage-ocr.py) created by William Wu is unchanged.
\n\nDemo\n---------------\n\n\t$ python multipage-ocr.py -i input.pdf\n\n\tNumber of pages: 3\n\t/tmp/ocr_44XR0WPHN6\n\tConvert PDF to image: convert -density 300 -depth 8 input.pdf[0] -background white /tmp/ocr_44XR0WPHN6/0.jpg\n\tOCR on image: tesseract -psm 3 /tmp/ocr_44XR0WPHN6/0.jpg /tmp/ocr_44XR0WPHN6/0 quiet\n\tConvert PDF to image: convert -density 300 -depth 8 input.pdf[1] -background white /tmp/ocr_44XR0WPHN6/1.jpg\n\tOCR on image: tesseract -psm 3 /tmp/ocr_44XR0WPHN6/1.jpg /tmp/ocr_44XR0WPHN6/1 quiet\n\tConvert PDF to image: convert -density 300 -depth 8 input.pdf[2] -background white /tmp/ocr_44XR0WPHN6/2.jpg\n\tOCR on image: tesseract -psm 3 /tmp/ocr_44XR0WPHN6/2.jpg /tmp/ocr_44XR0WPHN6/2 quiet\n\tConcatenate OCR outputs: cat /tmp/ocr_44XR0WPHN6/0.txt /tmp/ocr_44XR0WPHN6/1.txt /tmp/ocr_44XR0WPHN6/2.txt \u003e input_ocr.txt\n\tCleanup temporary files: rm -r /tmp/ocr_44XR0WPHN6\n\n\nRequirements\n---------------\nSystem requirements: tesseract, ImageMagick, pypdf\n\nTo install tesseract on Mac OS X:\n\n\t$ brew install tesseract\n\nTo install pypdf:\n\n\t$ pip install pypdf\n\nTo install ImageMagick on Mac OS X:\n\n\t$ brew install imagemagick\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqedsoftware%2Fmultipage-ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fqedsoftware%2Fmultipage-ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fqedsoftware%2Fmultipage-ocr/lists"}