{"id":14036921,"url":"https://github.com/JoshData/pdf-redactor","last_synced_at":"2025-07-27T04:32:50.127Z","repository":{"id":43612185,"uuid":"70952670","full_name":"JoshData/pdf-redactor","owner":"JoshData","description":"A general purpose PDF text-layer redaction tool for Python 2/3.","archived":false,"fork":false,"pushed_at":"2024-06-13T06:03:12.000Z","size":149,"stargazers_count":183,"open_issues_count":22,"forks_count":61,"subscribers_count":7,"default_branch":"primary","last_synced_at":"2024-11-13T02:03:12.143Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc0-1.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JoshData.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-10-14T22:54:41.000Z","updated_at":"2024-11-11T05:51:27.000Z","dependencies_parsed_at":"2024-10-30T05:24:15.285Z","dependency_job_id":null,"html_url":"https://github.com/JoshData/pdf-redactor","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshData%2Fpdf-redactor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshData%2Fpdf-redactor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshData%2Fpdf-redactor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JoshData%2Fpdf-redactor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JoshData","download_url":"https://codeload.github.com/Jo
shData/pdf-redactor/tar.gz/refs/heads/primary","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227762323,"owners_count":17816010,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-12T03:02:19.516Z","updated_at":"2024-12-02T16:31:08.092Z","avatar_url":"https://github.com/JoshData.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"pdf-redactor\n============\n\nA general-purpose PDF text-layer redaction tool, in pure Python, by Joshua Tauberer and Antoine McGrath.\n\npdf-redactor uses [pdfrw](https://github.com/pmaupin/pdfrw) under the hood to parse and write out the PDF.\n\n* * *\n\nThis Python module is a general tool to help you automatically redact text from PDFs. The tool operates on:\n\n* the text layer of the document's pages (content stream text)\n* plain text annotations\n* link target URLs\n* the Document Information Dictionary, a.k.a. the PDF metadata like Title and Author\n* embedded XMP metadata, if present\n\nGraphical elements, images, and other embedded resources are not touched.\n\nYou can:\n\n* Use regular expressions to perform text substitution on the text layer (e.g. replace social security numbers with \"XXX-XX-XXXX\").\n* Rewrite, remove, or add new metadata fields on a field-by-field basis (e.g. wipe out all metadata except for certain fields).\n* Rewrite, remove, or add XML metadata using functions that operate on the parsed XMP DOM (e.g. 
wipe out XMP metadata).\n\n## How to use pdf-redactor\n\nGet this module and then install its dependencies with:\n\n\tpip3 install -r requirements.txt\n\n`pdf_redactor.py` processes a PDF given on standard input and writes a new, redacted PDF to standard output:\n\n\tpython3 pdf_redactor.py \u003c document.pdf \u003e document-redacted.pdf\n\nHowever, you should use the `pdf_redactor` module as a library and pass in text filtering functions written in Python, since the command-line version of the tool does not yet actually do anything to the PDF. The [example.py](example.py) script shows how to redact Social Security Numbers:\n\n\tpython3 example.py \u003c tests/test-ssns.pdf \u003e document-redacted.pdf\n\n## Limitations\n\n### Not all content may be redacted\n\nThe PDF format is an incredibly complex data standard that has hundreds, if not thousands,\nof exotic capabilities used rarely or in specialized circumstances. Besides a document's text layer, metadata, and other components of a PDF document which this tool scans and can redact text from, there are many other components of PDF documents that this tool **does not look at**, such as:\n\n* embedded files, multimedia, and scripts\n* rich text annotations\n* forms\n* internal object names\n* digital signatures\n\nThere are so many exotic capabilities in PDF documents that it would be difficult to list them all, so this list is a very partial list. It would take a lot more effort to write a redaction tool that scanned all possible places content can be hidden inside a PDF besides the places that this tool looks at, so please be aware that it is **your responsibility** to ensure that the PDFs you use this tool on only use the capabilities of the PDF format that this tool knows how to redact.\n\n### Character replacement\n\nOne of the PDF format's strengths is that it embeds font information so that documents can be displayed even if the fonts used to create the PDF aren't available when the PDF is viewed. 
Most PDFs are optimized to only embed the font information for characters that are actually used in the document. So if a document doesn't contain a particular letter or symbol, information for rendering the letter or symbol is not stored in the PDF.\n\nThis has an unfortunate consequence for redaction in the text layer. Since redaction in the text layer works by performing simple text substitution in the text stream, you may create replacement text that contains characters that were _not_ previously in the PDF. Those characters simply won't show up when the PDF is viewed because the PDF didn't contain any information about how to display them.\n\nTo get around this problem, pdf_redactor checks your replacement text for new characters and replaces them with characters from the `content_replacement_glyphs` list (defaulting to `?`, `#`, `*`, and a space) if any of those characters _are_ present in the font information already stored in the PDF. Hopefully at least one of those characters _is_ present (maybe none are!), and in that case your replacement text will at least show up as something and not disappear.\n\n### Content stream compression\n\nBecause pdfrw doesn't support all content stream compression methods, you should use a tool like [qpdf](http://qpdf.sourceforge.net/) to decompress the PDF prior to using this tool, and then to re-compress and web-optimize (linearize) the PDF after. The full command would be something like:\n\n\tqpdf --stream-data=uncompress document.pdf - \\\n\t | python3 pdf_redactor.py \u003e /tmp/temp.pdf \\\n\t \u0026\u0026 qpdf --linearize /tmp/temp.pdf document-redacted.pdf\n\n(qpdf's first argument can't be standard input, unfortunately, so a one-liner isn't possible.)\n\n### Exotic fonts\n\nThis tool has a limited understanding of glyph-to-Unicode codepoint mappings. 
Some unusual fonts may not be processed correctly, in which case text layer redaction regular expressions may not match or substitution text may not render correctly.\n\n## Testing that it worked\n\nIf you're redacting metadata, you should check the output using `pdfinfo` from the `poppler-utils` package:\n\n\t# check that the metadata is fully redacted\n\tpdfinfo -meta document-redacted.pdf\n\n## Developing/testing the library\n\nTests require some additional packages:\n\n\tpip install -r requirements-dev.txt\n\tpython tests/run_tests.py\n\nThe file `tests/test-ssns.pdf` was generated by converting the file `tests/test-ssns.odft` to PDF in LibreOffice with the `Archive PDF/A-1a` option turned on so that it generates XMP metadata and `Export comments` turned on to export the comment.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJoshData%2Fpdf-redactor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FJoshData%2Fpdf-redactor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FJoshData%2Fpdf-redactor/lists"}