{"id":20296947,"url":"https://github.com/nbehrnd/pdf_rewriter","last_synced_at":"2025-04-11T12:13:27.325Z","repository":{"id":154641988,"uuid":"228653431","full_name":"nbehrnd/pdf_rewriter","owner":"nbehrnd","description":"Optimize .pdf by «reprint to pdf» by ghostcript in color, or gray; if present, text layer, internal crosslinks (e.g., TOC) and hyperlinks (e.g., to websites) may be preserved.","archived":false,"fork":false,"pushed_at":"2025-01-22T10:52:52.000Z","size":2795,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-25T08:38:20.829Z","etag":null,"topics":["ghostscript","pdf","reduction","rewrite"],"latest_commit_sha":null,"homepage":"","language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/nbehrnd.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-12-17T16:08:21.000Z","updated_at":"2025-01-22T10:46:22.000Z","dependencies_parsed_at":"2024-11-14T15:54:55.213Z","dependency_job_id":null,"html_url":"https://github.com/nbehrnd/pdf_rewriter","commit_stats":null,"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nbehrnd%2Fpdf_rewriter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nbehrnd%2Fpdf_rewriter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nbehrnd%2Fpdf_rewriter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/nbehrnd%2Fpdf_rew
riter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nbehrnd","download_url":"https://codeload.github.com/nbehrnd/pdf_rewriter/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248398830,"owners_count":21097294,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ghostscript","pdf","reduction","rewrite"],"created_at":"2024-11-14T15:42:17.944Z","updated_at":"2025-04-11T12:13:27.314Z","avatar_url":"https://github.com/nbehrnd.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Background\n\nDepending on the pdf creator/engine used, a `.pdf` file may include\ncontent irrelevant to a human reader. One example is the embedding\nof complete font sets even though only a few glyphs (or just one) are\nused on the \"electronic paper\". Printing the `.pdf` with `ghostscript`\ninto a new `.pdf` may reduce the file size, e.g., before filing a\npublication in a reference manager.\n\nThis bash script *does not* claim to be the first to collect the\nnuts and bolts needed to address the issue. It rather serves as an aide-memoire\nof findings encountered earlier, and as a means to steer `ghostscript` on Linux\naccordingly. 
Within reason, the snippets were joined as provided; thus,\nthe credit belongs to those already in the field.\n\n## Intended Use\n\nOne way to use the script is to 1) set its executable bit\n(`chmod +x pdf_rewrite.sh`), and 2) define an alias for it in your `.bashrc` file.\nThen, the functionality described below is available across your\nsystem.\n\n- Otherwise, to reprint the `.pdf` while retaining the color, run any one\n  of the following equivalent commands\n\n  ``` shell\n  bash ./pdf_rewrite.sh --reprint input.pdf\n  bash ./pdf_rewrite.sh -r input.pdf\n  bash ./pdf_rewrite.sh --colour input.pdf\n  bash ./pdf_rewrite.sh --color input.pdf\n  bash ./pdf_rewrite.sh -c input.pdf\n  ```\n\n  *to replace* the original file `input.pdf` with a new version of smaller\n  file size. A short note reports whether the attempt was successful;\n  if so, the improvement relative to the original `input.pdf` is\n  reported (as a percentage). Though you may process the file multiple times\n  with this method, the additional savings quickly become\n  insignificant in comparison to the remaining file size.\n\n  The credit for the underlying approach and implementation belongs to\n  Evan Langlois.[^1]\n\n- Often, an additional reduction of file size may be obtained by\n  reprinting the `.pdf` in gray-scale only. Any one of the following\n  commands is equivalent\n\n  ``` shell\n  bash ./pdf_rewrite.sh --gray input.pdf\n  bash ./pdf_rewrite.sh --grey input.pdf\n  bash ./pdf_rewrite.sh -g input.pdf\n  ```\n\n  to replace file `input.pdf` with its rewritten form. 
The credit for this\n  approach belongs to user `slm` on the Unix stackexchange.[^2]\n\n- To process multiple `.pdf` files, a for-loop in your shell could follow the\n  pattern of\n\n  ``` shell\n  for file in *.pdf\n  do\n      echo \"$file\"\n      bash ./pdf_rewrite.sh -r \"$file\"\n  done\n  ```\n\n  The `echo` line also provides a brief progress report.\n\nNote, this script's primary aim is to obtain a file of small size,\ne.g., as an email attachment, while keeping the text easy to read\nand – if present – keeping the text layer searchable. The\nreprint in half-tones in particular, however, may render illustrations less\nintelligible. It is up to the creators of figures to use easily\ndiscernible markers in diagrams, as well as color scales that are suitable\nfor the color blind and safe for this mimicked \"photocopying\". For the\nlatter, a service like \u003chttps://colorbrewer2.org/\u003e may guide your\nselection.\n\nKeep a backup of the .pdf to be processed. Though the script may report\nproblems while processing the data (or even crash, which may destroy the\n.pdf), it *is not* a PDF validator such as veraPDF.[^3]\n\n## Benchmark\n\nInitially written for Linux Xubuntu 18.04.3 LTS and ghostscript\n(version 9.26), the script is known to work well, e.g., with\nDebian 13/trixie (currently *testing*) and GPL Ghostscript\n(version 10.02.1, published 2023-11-01).\n\n- File `link2web.pdf` was compiled with pdfLaTeX based on an example\n  provided by\n  [www.texample.net](http://www.texample.net/tikz/examples/lune-of-hippocrates/).\n  This .pdf contains a color figure and a link to an external reference.\n  Note, the simplification into half-tones (option `-g`) affects the\n  document as printed; depending on the pdf viewer used, the box around the\n  link may remain colored *for the display on screen*.\n\n- The performance of the utility was tested on a set of recent\n  publications in chemistry. 
To ease a potential replication,\n  publications used for the benchmark are – within reason – available\n  open access/CC.\n\nIn the table below, the *saved* columns report the difference in file size\nbefore and after processing with the respective option, after a single run\nof optimization, as a percentage of the size of the originally\nsubmitted file.\n\nTypically, the simple reprint `-r`, which retains the color, is the fastest\napproach and achieves most of the possible reduction in one run; it is hence often already\n*good enough*. How much file size is saved seems to vary not only with the\nrelative amount of special (mathematical) characters, but also among journals\nof the same publisher; see for instance the small savings of a `-r` reprint\nfor *J. Appl. Cryst.* vs. *Helv. Chim. Acta*, both published by Wiley.\n\n| source | publisher | original | reprint `-r` | saved `%` | reprint `-g` | saved `%` |\n|:---|---:|---:|---:|---:|---:|---:|\n| [2023ACR3640](https://doi.org/10.1021/acs.accounts.3c00588) | ACS | 7.0 MB | 4.5 MB | 35.7 | 3.2 MB | 54.3 |\n| [2023ACR3654](https://doi.org/10.1021/acs.accounts.3c00595) | ACS | 2.4 MB | 1.6 MB | 33.3 | 1.6 MB | 33.3 |\n| [2023CrystGrowthDes8469](https://doi.org/10.1021/acs.cgd.3c00985) | ACS | 3.7 MB | 0.9 MB | 75.7 | 0.9 MB | 75.7 |\n| [2024CrystGrowthDes71](https://doi.org/10.1021/acs.cgd.3c00476) | ACS | 10.5 MB | 1.5 MB | 85.7 | 1.4 MB | 86.7 |\n| [2023CRV12135](https://doi.org/10.1021/acs.chemrev.3c00372) | ACS | 9.4 MB | 6.1 MB | 35.1 | 5.3 MB | 43.6 |\n| [2023CRV13291](https://doi.org/10.1021/acs.chemrev.3c00241) | ACS | 25.5 MB | 4.0 MB | 84.3 | 3.7 MB | 85.5 |\n| [2023CRV13713](https://doi.org/10.1021/acs.chemrev.3c00489) | ACS | 12.0 MB | 4.5 MB | 62.5 | 4.2 MB | 65.0 |\n| [2023JCE4674](https://doi.org/10.1021/acs.jchemed.3c00845) | ACS | 2.8 MB | 2.2 MB | 21.4 | 2.1 MB | 25.0 |\n| [2023JCE4728](https://doi.org/10.1021/acs.jchemed.3c00306) | ACS | 2.6 MB | 1.0 MB | 61.5 | 1.0 MB | 61.5 |\n| 
[2023JOC16679](https://doi.org/10.1021/acs.joc.3c01753) | ACS | 4.7 MB | 3.2 MB | 31.9 | 3.0 MB | 36.2 |\n| [2023JOC16719](https://doi.org/10.1021/acs.joc.3c00815) | ACS | 9.9 MB | 2.4 MB | 75.8 | 2.1 MB | 78.8 |\n| [2023OL9002](https://doi.org/10.1021/acs.orglett.3c03590) | ACS | 2.4 MB | 1.2 MB | 50.0 | 1.1 MB | 54.2 |\n| [2023OL9243](https://doi.org/10.1021/acs.orglett.3c03993) | ACS | 2.2 MB | 1.4 MB | 36.4 | 1.4 MB | 36.4 |\n| [2023Tetrahedron133750](https://doi.org/10.1016/j.tet.2023.133750) | Elsevier | 1.2 MB | 1.0 MB | 16.7 | 0.6 MB | 50.0 |\n| [2024Tetrahedron133787](https://doi.org/10.1016/j.tet.2023.133787) | Elsevier | 1.9 MB | 1.8 MB | 5.3 | 1.6 MB | 15.8 |\n| [2023TL154433](https://doi.org/10.1016/j.tetlet.2023.154433) | Elsevier | 831 kB | 721 kB | 13.2 | 497 kB | 40.2 |\n| [2024TL154885](https://doi.org/10.1016/j.tetlet.2023.154885) | Elsevier | 1.6 MB | 0.9 MB | 43.8 | 0.9 MB | 43.8 |\n| [2024PCCP713](https://doi.org/10.1039/d3cp05084j) | RSC | 1.0 MB | 1.0 MB | 0.0 | 0.5 MB | 50.0 |\n| [2024PCCP770](https://doi.org/10.1039/d3cp03800a) | RSC | 2.3 MB | 2.1 MB | 8.7 | 0.8 MB | 65.2 |\n| [2024TheorChemAcc4](https://doi.org/10.1007/s00214-023-03077-7) | Springer | 1.7 MB | 0.8 MB | 52.9 | 0.7 MB | 58.8 |\n| [2023TheorChemAcc133](https://doi.org/10.1007/s00214-023-03069-7) | Springer | 1.7 MB | 1.0 MB | 41.2 | 1.0 MB | 41.2 |\n| [2024JSulfurChem138](https://doi.org/10.1080/17415993.2023.2255711) | Taylor \u0026 Francis | 469 kB | 248 kB | 47.1 | 247 kB | 47.3 |\n| [2023JSulfurChem269](https://doi.org/10.1080/17415993.2022.2164196) | Taylor \u0026 Francis | 5.6 MB | 2.9 MB | 48.2 | 2.0 MB | 64.3 |\n| [2023Synthesis3777](https://doi.org/10.1055/a-2126-3774) | Thieme | 976 kB | 936 kB | 4.1 | 528 kB | 45.9 |\n| [2023Synthesis3947](https://doi.org/10.1055/s-0042-1751502) | Thieme | 2.2 MB | 2.2 MB | 0.0 | 1.9 MB | 13.6 |\n| [2024ACIEe202310983](https://doi.org/10.1002/anie.202310983) | Wiley | 877 kB | 800 kB | 8.8 | 507 kB | 42.2 |\n| 
[2024ACIEe202314446](https://doi.org/10.1002/anie.202314446) | Wiley | 2.5 MB | 2.4 MB | 4.0 | 1.2 MB | 52.0 |\n| [2023HCAe202300110](https://doi.org/10.1002/hlca.202300110) | Wiley | 10.4 MB | 5.5 MB | 47.1 | 2.9 MB | 72.1 |\n| [2023HCAe202300154](https://doi.org/10.1002/hlca.202300154) | Wiley | 10.4 MB | 5.7 MB | 45.2 | 2.2 MB | 78.8 |\n| [2023JApplCryst1618](https://doi.org/10.1107/S1600576723008324) | Wiley | 1.1 MB | 1.1 MB | 0.0 | 0.9 MB | 18.2 |\n| [2023JApplCryst1639](https://doi.org/10.1107/S1600576723008439) | Wiley | 2.8 MB | 2.7 MB | 3.6 | 1.2 MB | 57.1 |\n| link2web.pdf | pdflatex | 38.0 kB | 9.8 kB | 74.2 | 9.8 kB | 74.2 |\n\n## Disclaimer\n\nWhile rewriting the pdf file in question, the pdf metadata `Producer`\n(which can be an entry like `LaTeX with hyperref`), `CreationDate`, and\n`ModDate` are overwritten. Other metadata such as `TITLE`, `SUBJECT`,\n`KEYWORDS`, and `AUTHOR` are retained. This however does not seem to\naffect the retrieval of bibliographic metadata with a reference manager\nlike [zotero](https://www.zotero.org/); presumably, it accesses\nand relies on the DOI string in the pdf of a journal publication instead.\n\nOn occasion, typographic ligatures like `fi` (as in the string *file*),\nor `fl` (as in *fluid*) are not faithfully processed – the searchable\ntext layer of the reprinted pdf might split them, or drop them\naltogether. Apparently, this issue depends both on the version of\nghostscript installed, and on the font / pdf-engine of the pdf to be processed,\nbecause recent journal publications (like those by ACS, a member of the STIX\nproject[^4]) tend to be less frequently affected by this. This
This\npdf-reprinter is not tested on pdf about documents predominantly written\nin other scripts than Latin.\n\n## Footnotes\n\n[^1]: \u003chttps://tex.stackexchange.com/questions/18987/how-to-make-the-pdfs-produced-by-pdflatex-smaller?rq=1\u003e\n\n[^2]: \u003chttps://unix.stackexchange.com/questions/93959/how-to-convert-a-color-pdf-to-black-white\u003e\n\n[^3]: \u003chttps://openpreservation.org/tools/verapdf/\u003e\n\n[^4]: \u003chttps://en.wikipedia.org/wiki/STIX_Fonts_project\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnbehrnd%2Fpdf_rewriter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnbehrnd%2Fpdf_rewriter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnbehrnd%2Fpdf_rewriter/lists"}