https://github.com/jfilter/pdf-scripts

📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs

bash bash-script compress crop-image ocr pdf python repair verify

Last synced: 8 months ago

Host: GitHub
URL: https://github.com/jfilter/pdf-scripts
Owner: jfilter
License: gpl-3.0
Created: 2020-03-29T20:37:32.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2024-05-03T20:07:47.000Z (over 1 year ago)
Last Synced: 2025-04-25T07:51:28.408Z (8 months ago)
Topics: bash, bash-script, compress, crop-image, ocr, pdf, python, repair, verify
Language: Shell
Homepage:
Size: 114 KB
Stars: 68
Watchers: 2
Forks: 6
Open Issues: 5
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-rainmana - jfilter/pdf-scripts - 📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs (Shell)

README

# PDF Scripts

Scripts (mostly Bash) to repair, verify, OCR, compress (etc.) PDFs.

*Currently in beta status, so expect backward-incompatible changes.*

## Install

You need to have [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) installed.

The scripts use several software libraries. [setup.sh](./setup.sh) installs them for macOS (via brew) or Ubuntu/Debian.

## Usage

1. Go to the root of this repository: `cd pdf-scripts`

2. Execute a script, e.g., `./pipeline.sh -l deu /path/to/document-in-german.pdf`

Please refer to the scripts for the command-line arguments and options. NB: It's not possible to combine options, e.g., use `-x -y` instead of `-xy`.
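To illustrate why combined flags are rejected, here is a minimal sketch of a hand-written option loop. It is an assumption about how such a script might parse its arguments, not code taken from this repository; `parse_args`, `-x`, and `-y` are hypothetical names.

```bash
# Hypothetical option loop: -x and -y are placeholder flags, not real options.
parse_args() {
  opt_x=false
  opt_y=false
  while [ $# -gt 0 ]; do
    case "$1" in
      -x) opt_x=true ;;
      -y) opt_y=true ;;
      -*)
        # "-xy" matches neither "-x" nor "-y", so it ends up here
        echo "unknown option: $1" >&2
        return 1
        ;;
      *) break ;; # first non-option argument: stop parsing
    esac
    shift
  done
}
```

With a loop like this, `parse_args -x -y file.pdf` sets both flags, while `parse_args -xy file.pdf` fails because each argument is matched as a whole.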

Most scripts work on individual PDFs as well as on folders full of PDFs.
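The file-or-folder behaviour can be sketched as follows. This is a minimal illustration under the assumption that a script simply checks whether its argument is a directory; `for_each_pdf` is a hypothetical helper, not a function from this repository.

```bash
# Hypothetical helper: run a command on one PDF, or on every PDF in a folder.
# Usage: for_each_pdf <file-or-folder> <command> [args...]
for_each_pdf() {
  local target=$1
  shift
  if [ -d "$target" ]; then
    # Folder: visit every *.pdf below it; -print0 keeps names with spaces intact
    find "$target" -type f -name '*.pdf' -print0 |
      while IFS= read -r -d '' pdf; do
        # </dev/null stops the command from consuming the file list on stdin
        "$@" "$pdf" </dev/null
      done
  elif [ -f "$target" ]; then
    "$@" "$target"
  else
    echo "not a file or folder: $target" >&2
    return 1
  fi
}
```

For example, `for_each_pdf ./scans echo` would print the path of every PDF found under the (hypothetical) `./scans` folder.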

## Overview

### [ocr_pdf.sh](./ocr_pdf.sh)

OCR PDFs with [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF).

### [repair_pdf.sh](./repair_pdf.sh)

Uses `pdftocairo` from [poppler](https://poppler.freedesktop.org/), `mutool clean` from [MuPDF](https://en.wikipedia.org/wiki/MuPDF), and [qpdf](https://en.wikipedia.org/wiki/QPDF).

Caveat: May remove text in OCRd PDFs. Use `--check` to check for OCRd text in order to preserve it.
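One way to combine the three tools is a fallback chain that stops at the first tool that succeeds. The sketch below is an assumption: the order of the tools and the lack of extra flags are guesses, and `repair_pdf` is a hypothetical function name rather than the logic of the actual script.

```bash
# Hypothetical fallback chain: try each tool until one rewrites the file.
# Usage: repair_pdf <in.pdf> <out.pdf>
repair_pdf() {
  local in=$1 out=$2
  # Re-render the PDF through poppler's cairo backend
  pdftocairo -pdf "$in" "$out" 2>/dev/null && return 0
  # Let MuPDF rewrite the file structure
  mutool clean "$in" "$out" 2>/dev/null && return 0
  # Last resort: let qpdf rewrite the file; its exit status is returned
  qpdf "$in" "$out"
}
```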

### [verify_pdf.sh](./verify_pdf.sh)

Checks whether text can be extracted from a PDF, i.e., whether the PDF already contains text.
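A check of this kind can be sketched with `pdftotext` from poppler. This is a minimal illustration and an assumption about the approach, not the script's actual logic; `has_text` and `pdf_has_text` are hypothetical names.

```bash
# Succeeds if stdin contains at least one letter or digit.
has_text() {
  grep -q '[[:alnum:]]'
}

# Assumed approach: dump the PDF's text to stdout ("-") and test it.
pdf_has_text() {
  pdftotext "$1" - 2>/dev/null | has_text
}
```

With these definitions, `pdf_has_text in.pdf && echo "text found"` would report whether any text came out. Page-break characters and whitespace alone do not count as text.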

### [compress_pdf.sh](./compress_pdf.sh)

Uses [Ghostscript](https://askubuntu.com/a/256449) to compress images in PDFs.

### [reduce_size_pdf.sh](reduce_size_pdf.sh)

Uses [compress_pdf.sh](./compress_pdf.sh) and additionally [pdfsizeopt](https://github.com/pts/pdfsizeopt) to reduce the file size of PDFs.

### [clean_metadata_pdf.sh](./clean_metadata_pdf.sh)

Remove metadata with [exiftool](https://exiftool.org/).

### [is_ocrd_pdf.sh](./is_ocrd_pdf.sh)

Detect OCRd PDFs. See also [sort_ocrd_pdfs.sh](sort_by/sort_ocrd_pdfs.sh) to sort PDFs.

### [pipeline.sh](./pipeline.sh)

Combines several of the above scripts.

## FAQ

### Why Bash?

Bash is still the most-used shell, and the scripts consist mostly of simple conditionals and sequences of CLI commands. This could also be done with Python's `psutil`, but this would add yet another layer. However, at some point, I will most probably port the scripts to plain POSIX shell.

## Related Work

- https://github.com/NicolasBernaerts/ubuntu-scripts/blob/master/pdf/

- [more tools for PDF processing in my blog post](https://johannesfilter.com/python-and-pdf-a-review-of-existing-tools/)

- https://github.com/baltpeter/scanprep

## Development

- focus on Bash v4+

- write Python 3.6+ scripts if Bash gets too complicated

- use Docker images if available

- should run on the major Unix-like OSs (Linux (e.g. Ubuntu), macOS)

- format code with [shfmt](https://github.com/mvdan/sh#shfmt), e.g., extension for [VS Code](https://github.com/foxundermoon/vs-shell-format)

- lint scripts with [shellcheck](https://github.com/koalaman/shellcheck), e.g., extension for [VS Code](https://github.com/timonwong/vscode-shellcheck)

## Common Commands

### Concat PDFs into one PDF

```bash
# Merge all PDFs in the current directory (in glob order) into out.pdf
qpdf --empty --pages *.pdf -- out.pdf
```

### Images to PDF

```bash
# Requires ImageMagick; each JPEG becomes one page
convert *.jpg pictures.pdf
```

### Rotate PDFs

```bash
# Rotate all pages by 90 degrees clockwise
qpdf in.pdf out.pdf --rotate=+90
```

## License

GPLv3.
