https://github.com/r0mb0/batch_pdf_ocr_processor

Batch process all PDF files in a folder to make them searchable with OCR using ocrmypdf and a simple PowerShell script. Output files are saved in an 'output' subfolder. Perfect for Windows users needing fast PDF text recovery.

Last synced: 10 months ago

Host: GitHub
URL: https://github.com/r0mb0/batch_pdf_ocr_processor
Owner: R0mb0
License: mit
Created: 2025-08-07T10:25:23.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-08-07T10:58:14.000Z (10 months ago)
Last Synced: 2025-08-07T12:33:34.774Z (10 months ago)
Language: PowerShell
Homepage:
Size: 14.6 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Support: SUPPORT.md

# Batch PDF OCR Processor for Windows

**Batch process all PDF files in a folder to make them searchable with OCR using [ocrmypdf](https://ocrmypdf.readthedocs.io/en/latest/) and a simple PowerShell script. Output files are saved in an `output` subfolder. Perfect for Windows users needing fast PDF text recovery.**

---

## Features

- Processes all PDF files in the current folder
- Runs OCR to make PDFs searchable (text layer added)
- Outputs processed PDFs to an `output` subfolder

---

## Prerequisites

- Windows 10/11
- PowerShell (already included in Windows)
- [Chocolatey](https://chocolatey.org/) package manager (for easy installation)
- [Python 3](https://www.python.org/) (with pip)
- [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract)
- [Ghostscript](https://www.ghostscript.com/)
- [ocrmypdf](https://pypi.org/project/ocrmypdf/)

### Optional but Recommended

- **pngquant** (for better image compression)
- **jbig2** (for advanced PDF compression, but see important Windows note below)

---

## Step-by-Step Installation (Stupid-Proof)

### 1. Install Chocolatey

**Chocolatey** lets you install Windows programs from the command line.

1. Open **PowerShell as Administrator** (right-click PowerShell > "Run as Administrator").
2. Paste this command and press Enter:

```powershell
Set-ExecutionPolicy Bypass -Scope Process -Force; `
[System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; `
iex ((New-Object System.Net.WebClient).DownloadString('https://community.chocolatey.org/install.ps1'))
```

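3. Close and reopen PowerShell. The `choco install` commands in the next steps generally need an **Administrator** window; `pip install` and the script itself run fine as a normal user.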

---

### 2. Install Python and Pip

Using Chocolatey (in PowerShell):

```powershell
choco install python -y
```

- This will install Python **and** pip.
- Close and reopen PowerShell after installation.
- Test with:
```powershell
python --version
pip --version
```

---

### 3. Install Required Packages (ocrmypdf, tesseract, ghostscript)

**Install Tesseract and Ghostscript** using Chocolatey:

```powershell
choco install tesseract -y
choco install ghostscript -y
```

**Install ocrmypdf** (using pip):

```powershell
pip install ocrmypdf
```
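Before moving on, it can help to confirm that each tool actually resolves on your `PATH`. The snippet below is a minimal sketch, not part of the repository: `Test-Prerequisite` is a hypothetical helper name, and `gswin64c` is assumed to be the Ghostscript executable to look for (it is the 64-bit console binary on Windows, but whether your installer added it to `PATH` may vary).

```powershell
# Hypothetical helper (not shipped with this project): reports which tools resolve on PATH.
function Test-Prerequisite {
    param([string[]]$Names)

    foreach ($name in $Names) {
        # With SilentlyContinue, Get-Command returns nothing when the tool is missing.
        $cmd = Get-Command -Name $name -ErrorAction SilentlyContinue | Select-Object -First 1

        $path = $null
        if ($cmd) { $path = $cmd.Source }

        [pscustomobject]@{
            Tool  = $name
            Found = [bool]$cmd
            Path  = $path
        }
    }
}

# Assumption: gswin64c is the Ghostscript executable name on 64-bit Windows.
Test-Prerequisite -Names 'python', 'pip', 'tesseract', 'gswin64c', 'ocrmypdf' |
    Format-Table -AutoSize
```

If a tool shows `Found = False` right after installation, closing and reopening PowerShell usually refreshes `PATH`.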

---

### 4. (Optional) Install Additional Recommended Packages

#### pngquant

**For better image compression, install:**

```powershell
choco install pngquant -y
```

#### jbig2 (Advanced, Optional, Not Directly Supported on Windows)

**jbig2** (provided by the `jbig2enc` encoder) is an optional dependency that can improve PDF compression.
- **Important:** There is **no official Windows binary** and it is **not available via Chocolatey**.
- If you require jbig2, you will need to manually compile it from source or find a trusted third-party binary for Windows. For most users, this step can be skipped.

---

### 5. Enable PowerShell Script Execution

> **IMPORTANT:**
> By default, Windows may prevent running scripts.
> Before running the script, in PowerShell, execute:

```powershell
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
```

This change is **temporary** and only for the current PowerShell window.
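To check that the setting took effect, you can inspect the policy with the built-in `Get-ExecutionPolicy` cmdlet. This is only a quick sanity check under the assumption that no Group Policy overrides the process scope; the variable names are illustrative.

```powershell
# Policy set for this PowerShell session only (the scope changed by the command above).
$processPolicy = Get-ExecutionPolicy -Scope Process

# Effective policy after all scopes are combined.
$effectivePolicy = Get-ExecutionPolicy

Write-Host "Process scope: $processPolicy; effective: $effectivePolicy"

# Full per-scope overview, useful if a machine or user policy takes precedence.
Get-ExecutionPolicy -List
```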

---

## Usage

1. **Place `ocr_batch.ps1` in the same folder as your PDFs.**
2. **Open PowerShell in that folder** (Shift + Right Click in the folder > "Open PowerShell window here").
3. **Run the script:**

```powershell
.\ocr_batch.ps1
```

4. **Processed PDFs will appear in the `output` subfolder.**
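For orientation, the steps above can be sketched roughly as follows. This is an illustrative approximation, not the actual contents of `ocr_batch.ps1`: the `Get-OcrJob` function and the variable names are hypothetical, and the only assumption about `ocrmypdf` is its basic `ocrmypdf <input> <output>` invocation.

```powershell
# Hypothetical helper: pairs each PDF in a folder with its target path in an 'output' subfolder.
function Get-OcrJob {
    param(
        [string]$Folder = '.',
        [string]$OutputName = 'output'
    )

    $outDir = Join-Path -Path $Folder -ChildPath $OutputName

    # Non-recursive, so PDFs already inside the output subfolder are not picked up again.
    Get-ChildItem -LiteralPath $Folder -Filter '*.pdf' -File | ForEach-Object {
        [pscustomobject]@{
            Source = $_.FullName
            Target = Join-Path -Path $outDir -ChildPath $_.Name
        }
    }
}

$ocrJobs = @(Get-OcrJob -Folder (Get-Location).Path)

foreach ($job in $ocrJobs) {
    # Create the output folder on demand; -Force makes this a no-op if it already exists.
    New-Item -ItemType Directory -Path (Split-Path -Path $job.Target -Parent) -Force | Out-Null

    # Basic invocation: input file first, output file second.
    & ocrmypdf $job.Source $job.Target

    if ($LASTEXITCODE -ne 0) {
        Write-Warning "ocrmypdf failed for $($job.Source) (exit code $LASTEXITCODE)"
    }
}
```

Checking `$LASTEXITCODE` after each file lets the loop continue past a single bad PDF instead of stopping the whole batch. If some of your PDFs already contain a text layer, `ocrmypdf` may refuse them by default; its `--skip-text` option is one way to handle that case.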
