{"id":13587689,"url":"https://github.com/jfilter/pdf-scripts","last_synced_at":"2025-04-27T20:32:41.857Z","repository":{"id":40697873,"uuid":"251129143","full_name":"jfilter/pdf-scripts","owner":"jfilter","description":"📑 Scripts to repair, verify, OCR, compress, wrangle, crop (etc.) PDFs","archived":false,"fork":false,"pushed_at":"2024-05-03T20:07:47.000Z","size":117,"stargazers_count":68,"open_issues_count":5,"forks_count":6,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-04-25T07:51:28.408Z","etag":null,"topics":["bash","bash-script","compress","crop-image","ocr","pdf","python","repair","verify"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jfilter.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-03-29T20:37:32.000Z","updated_at":"2025-03-17T13:29:54.000Z","dependencies_parsed_at":"2024-05-05T03:42:25.662Z","dependency_job_id":null,"html_url":"https://github.com/jfilter/pdf-scripts","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fpdf-scripts","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fpdf-scripts/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fpdf-scripts/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jfilter%2Fpdf-scripts/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/owners/jfilter","download_url":"https://codeload.github.com/jfilter/pdf-scripts/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251204618,"owners_count":21552253,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bash","bash-script","compress","crop-image","ocr","pdf","python","repair","verify"],"created_at":"2024-08-01T15:06:19.213Z","updated_at":"2025-04-27T20:32:41.392Z","avatar_url":"https://github.com/jfilter.png","language":"Shell","funding_links":[],"categories":["Shell"],"sub_categories":[],"readme":"# PDF Scripts\n\nScripts (mostly Bash) to repair, verify, OCR, compress (etc.) PDFs.\n\n*Currently in beta status, so expect backward-incompatible changes.*\n\n## Install\n\nYou need to have [Bash](https://en.wikipedia.org/wiki/Bash_(Unix_shell)) installed.\n\nThe scripts use several software libraries. [setup.sh](./setup.sh) installs them for macOS (via brew) or Ubuntu/Debian.\n\n\n## Usage\n\n1. Go to the root of this repository: `cd pdf-scripts`\n2. Execute a script, e.g., `./pipeline.sh -l deu /path/to/document-in-german.pdf`\n\nPlease refer to the scripts for the command-line arguments and options. 
NB: Options cannot be combined, e.g., use `-x -y` instead of `-xy`.\n\nMost scripts work on individual PDFs as well as on folders full of PDFs.\n\n## Overview\n\n### [ocr_pdf.sh](./ocr_pdf.sh)\n\nOCR PDFs with [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF).\n\n### [repair_pdf.sh](./repair_pdf.sh)\n\nUsing: `pdftocairo` from [poppler](\u003chttps://en.wikipedia.org/wiki/Poppler_(software)\u003e), `mutool clean` from [MuPDF](https://en.wikipedia.org/wiki/MuPDF), [qpdf](https://en.wikipedia.org/wiki/QPDF)\n\nCaveat: May remove text in OCRd PDFs. Use `--check` to check for OCRd text in order to preserve it.\n\n\n### [verify_pdf.sh](./verify_pdf.sh)\n\nChecks whether text can be extracted (i.e., whether the PDF already contains text).\n\n### [compress_pdf.sh](./compress_pdf.sh)\n\nUsing [Ghostscript](https://askubuntu.com/a/256449) to compress images in PDFs.\n\n### [reduce_size_pdf.sh](reduce_size_pdf.sh)\n\nUses [compress_pdf.sh](./compress_pdf.sh) and additionally [pdfsizeopt](https://github.com/pts/pdfsizeopt) to reduce the file size of PDFs.\n\n### [clean_metadata_pdf.sh](./clean_metadata_pdf.sh)\n\nRemove metadata with [exiftool](https://exiftool.org/).\n\n### [is_ocrd_pdf.sh](./is_ocrd_pdf.sh)\n\nDetect OCRd PDFs. See also [sort_ocrd_pdfs.sh](sort_by/sort_ocrd_pdfs.sh) to sort PDFs.\n\n### [pipeline.sh](./pipeline.sh)\n\nCombines several of the above scripts.\n\n## FAQ\n\n### Why Bash?\n\nBash is still the most-used shell, and the scripts consist mostly of simple conditionals and sequences of CLI commands. This could also be done with Python's `psutil`, but that would add yet another layer. 
However, at some point, I will most probably port the scripts to plain POSIX shell.\n\n## Related Work\n\n- https://github.com/NicolasBernaerts/ubuntu-scripts/blob/master/pdf/\n- [more tools for PDF processing in my blog post](https://johannesfilter.com/python-and-pdf-a-review-of-existing-tools/)\n- https://github.com/baltpeter/scanprep\n\n## Development\n\n- focus on Bash v4+\n- write Python 3.6+ scripts if Bash gets too complicated\n- use Docker images if available\n- should run on the major Unix-like OSs (Linux (e.g. Ubuntu), macOS)\n- format code with [shfmt](https://github.com/mvdan/sh#shfmt), e.g., with the extension for [VS Code](https://github.com/foxundermoon/vs-shell-format)\n- lint scripts with [shellcheck](https://github.com/koalaman/shellcheck), e.g., with the extension for [VS Code](https://github.com/timonwong/vscode-shellcheck)\n\n## Common Commands\n\n### Concatenate PDFs into one PDF\n```bash\nqpdf --empty --pages *.pdf -- out.pdf\n```\n\n### Images to PDF\n```bash\nconvert *.jpg pictures.pdf\n```\n\n### Rotate PDFs\n```bash\nqpdf in.pdf out.pdf --rotate=+90\n```\n\n## License\n\nGPLv3.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fpdf-scripts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjfilter%2Fpdf-scripts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjfilter%2Fpdf-scripts/lists"}