{"id":30062610,"url":"https://github.com/r0mb0/batch_pdf_ocr_processor","last_synced_at":"2025-08-08T03:38:42.816Z","repository":{"id":308705323,"uuid":"1033776419","full_name":"R0mb0/Batch_PDF_OCR_Processor","owner":"R0mb0","description":"Batch process all PDF files in a folder to make them searchable with OCR using ocrmypdf and a simple PowerShell script. Output files are saved in an 'output' subfolder. Perfect for Windows users needing fast PDF text recovery.","archived":false,"fork":false,"pushed_at":"2025-08-07T10:58:14.000Z","size":15,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-07T12:33:34.774Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"PowerShell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/R0mb0.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"SUPPORT.md","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":null,"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"otechie":null,"lfx_crowdfunding":null,"custom":"PayPal.Me/R0mb0"}},"created_at":"2025-08-07T10:25:23.000Z","updated_at":"2025-08-07T10:58:18.000Z","dependencies_parsed_at":"2025-08-07T12:33:36.615Z","dependency_job_id":"62849b65-ba64-43f4-b692-ad9501192911","html_url":"https://github.com/R0mb0/Batch_PDF_OCR_Processor","commit_stats":null,"previous_names":["r0mb0/batch_pdf_ocr_processor"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/R0mb0/Batch_PDF_OCR_Processor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R0mb0%2FBatch_PDF_OCR_Processor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R0mb0%2FBatch_PDF_OCR_Processor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R0mb0%2FBatch_PDF_OCR_Processor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R0mb0%2FBatch_PDF_OCR_Processor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/R0mb0","download_url":"https://codeload.github.com/R0mb0/Batch_PDF_OCR_Processor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/R0mb0%2FBatch_PDF_OCR_Processor/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269360965,"owners_count":24404295,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-08T02:00:09.200Z","response_time":72,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-08T03:38:40.528Z","updated_at":"2025-08-08T03:38:42.801Z","avatar_url":"https://github.com/R0mb0.png","language":"PowerShell","funding_links":["PayPal.Me/R0mb0"],"categories":[],"sub_categories":[],"readme":"# Batch PDF OCR Processor for Windows\n\n**Batch process all PDF files in a folder to make them searchable with OCR using [ocrmypdf](https://ocrmypdf.readthedocs.io/en/latest/) and a simple PowerShell script. Output files are saved in an `output` subfolder. Perfect for Windows users needing fast PDF text recovery.**\n\n---\n\n## Features\n\n- Processes all PDF files in the current folder\n- Runs OCR to make PDFs searchable (text layer added)\n- Outputs processed PDFs to an `output` subfolder\n\n---\n\n## Prerequisites\n\n- Windows 10/11\n- PowerShell (already included in Windows)\n- [Chocolatey](https://chocolatey.org/) package manager (for easy installation)\n- [Python 3](https://www.python.org/) (with pip)\n- [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract)\n- [Ghostscript](https://www.ghostscript.com/)\n- [ocrmypdf](https://pypi.org/project/ocrmypdf/)\n\n### Optional but Recommended\n\n- **pngquant** (for better image compression)\n- **jbig2** (for advanced PDF compression, but see important Windows note below)\n\n---\n\n## Step-by-Step Installation (Stupid-Proof)\n\n### 1. Install Chocolatey\n\n**Chocolatey** lets you install Windows programs from the command line.\n\n1. Open **PowerShell as Administrator** (Right click PowerShell \u003e \"Run as Administrator\").\n2. Paste this command and press Enter:\n\n    ```powershell\n    Set-ExecutionPolicy Bypass -Scope Process -Force; `\n      [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; `\n      iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))\n    ```\n\n3. Close and reopen PowerShell **(as normal user is fine for next steps)**.\n\n---\n\n### 2. Install Python and Pip\n\nUsing Chocolatey (in PowerShell):\n\n```powershell\nchoco install python -y\n```\n\n- This will install Python **and** pip.\n- Close and reopen PowerShell after installation.\n- Test with:\n    ```powershell\n    python --version\n    pip --version\n    ```\n\n---\n\n### 3. Install Required Packages (ocrmypdf, tesseract, ghostscript)\n\n**Install Tesseract and Ghostscript** using Chocolatey:\n\n```powershell\nchoco install tesseract -y\nchoco install ghostscript -y\n```\n\n**Install ocrmypdf** (using pip):\n\n```powershell\npip install ocrmypdf\n```\n\n---\n\n### 4. (Optional) Install Additional Recommended Packages\n\n#### pngquant\n\n**For better image compression, install:**\n\n```powershell\nchoco install pngquant -y\n```\n\n#### jbig2 (Advanced, Optional, Not Directly Supported on Windows)\n\n**jbig2** is an optional dependency that can improve PDF compression.\n- **Important:** There is **no official Windows binary** and it is **not available via Chocolatey**.\n- If you require jbig2, you will need to manually compile it from source or find a trusted third-party binary for Windows. For most users, this step can be skipped.\n\n---\n\n### 5. Enable PowerShell Script Execution\n\n\u003e **IMPORTANT:**  \n\u003e By default, Windows may prevent running scripts.  \n\u003e Before running the script, in PowerShell, execute:\n\n```powershell\nSet-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass\n```\n\nThis change is **temporary** and only for the current PowerShell window.\n\n---\n\n## Usage\n\n1. **Place `ocr_batch.ps1` in the same folder as your PDFs.**\n2. **Open PowerShell in that folder** (Shift + Right Click in the folder \u003e \"Open PowerShell window here\").\n3. **Run the script:**\n\n    ```powershell\n    .\\ocr_batch.ps1\n    ```\n\n4. **Processed PDFs will appear in the `output` subfolder.**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fr0mb0%2Fbatch_pdf_ocr_processor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fr0mb0%2Fbatch_pdf_ocr_processor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fr0mb0%2Fbatch_pdf_ocr_processor/lists"}