{"id":13648144,"url":"https://github.com/cascremers/pdfdiff","last_synced_at":"2026-01-18T00:45:59.962Z","repository":{"id":46450248,"uuid":"9791548","full_name":"cascremers/pdfdiff","owner":"cascremers","description":"Command-line tool to inspect the difference between (the text in) two PDF files","archived":false,"fork":false,"pushed_at":"2022-03-31T11:18:25.000Z","size":10,"stargazers_count":223,"open_issues_count":3,"forks_count":17,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-11-09T22:37:53.470Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cascremers.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2013-05-01T13:04:34.000Z","updated_at":"2024-10-05T19:08:35.000Z","dependencies_parsed_at":"2022-08-12T12:50:55.627Z","dependency_job_id":null,"html_url":"https://github.com/cascremers/pdfdiff","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cascremers%2Fpdfdiff","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cascremers%2Fpdfdiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cascremers%2Fpdfdiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cascremers%2Fpdfdiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cascremers","download_url":"https://codeload.github.com/cascremers/pdfdiff/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"gi
thub","repositories_count":250195033,"owners_count":21390230,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T01:04:00.163Z","updated_at":"2026-01-18T00:45:59.950Z","avatar_url":"https://github.com/cascremers.png","language":"Python","readme":"pdfdiff\n=======\n\nCommand-line tool to inspect the difference between (the text in) two PDF files.\n\n\nPurpose and function\n--------------------\n\n`pdfdiff` takes two arguments, each being the filename of a PDF file,\nand generates a textual diff between the two. It visualises this diff\nusing the first diff-viewer it finds on the system.\n\n`pdfdiff` relies on `pdftotext` to extract the plaintext from a PDF\nfile.  However, small changes in the text between two PDF files can make\na huge difference in the resulting extracted text. More often than not,\nthe difference is so large that doing a `diff` on the output does not\nyield a sensible result.\n\nThe main function of this program, `pdfdiff`, is to normalize the output\nof `pdftotext`, such that the result is suitable for diff viewing. To\nachieve this, it attempts to detect sentence endings to reformat\nparagraphs and lines.  Along the way, it removes some ligature encodings\nto give `diff` viewers an easier time. 
After this normalisation\nprocedure, `diff` viewers commonly yield a substantially better\ncomparison between the contents of the files.\n\nNote that if a single file is provided as input, `pdfdiff` will directly\noutput the normalised text, enabling its use as a preprocessor for other\ntools.\n\n\nRunning `pdfdiff.py`\n--------------------\n\nAfter downloading, either run it through Python or make it executable\n(`chmod +x pdfdiff.py`) to use it directly from the command line.\n\n\nRequirements\n------------\n\n\n- `pdftotext`, which is part of the `xpdf` package.\n\n- `Python`\n\n- A diff viewer, preferably one that supports Unicode, like `kdiff3` or\n  `meld`. If these don't work for you, you can use `xxdiff`, `tkdiff`,\n  `opendiff`, `vimdiff`, or even good old `diff`.  You only need one of\n  these to use `pdfdiff.py`.\n\nNote that for most Linux distributions, installing `xpdf` is\nusually sufficient to get it working. (Afterwards, one might\nwant to upgrade to a better diff viewer, though.)\n\n\nCaveats\n-------\n\n- `pdfdiff` ignores many elements of PDF files, such as figures. As a\n  result, if the (textual) difference between two files is empty, there\n  is no guarantee that the PDF files are identical.\n\n- Some PDF files do not contain embedded text. In this case `pdftotext`\n  cannot extract any text, and the resulting diff will be empty. You\n  would then need to resort to OCR (Optical Character Recognition) to extract\n  the text. This is outside the scope of this program.\n\n\nUse cases\n---------\n\n- (Scientific) Reviews: you reviewed version A of a paper, receive\n  version B, and wonder what has changed.\n\n\nLicense\n-------\n\nCurrently the `pdfdiff` sources are licensed under the GPL 2, as indicated\nin the source code. 
Contact Cas Cremers if you have any questions.\n\n","funding_links":[],"categories":["Python","Cross-Platform"],"sub_categories":["JavaScript"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcascremers%2Fpdfdiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcascremers%2Fpdfdiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcascremers%2Fpdfdiff/lists"}