{"id":38914709,"url":"https://github.com/dipietrantonio/pdf4py","last_synced_at":"2026-01-17T15:25:00.562Z","repository":{"id":57451401,"uuid":"212579595","full_name":"dipietrantonio/pdf4py","owner":"dipietrantonio","description":"A PDF parser written in Python 3 with no external dependencies.","archived":false,"fork":false,"pushed_at":"2020-05-28T15:13:09.000Z","size":11815,"stargazers_count":58,"open_issues_count":0,"forks_count":3,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-01-13T16:41:31.178Z","etag":null,"topics":["information-extraction","parser","pdf","pdf-parsing","python"],"latest_commit_sha":null,"homepage":"https://pdf4py.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dipietrantonio.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-10-03T12:53:28.000Z","updated_at":"2025-11-03T19:49:24.000Z","dependencies_parsed_at":"2022-09-04T10:40:11.750Z","dependency_job_id":null,"html_url":"https://github.com/dipietrantonio/pdf4py","commit_stats":null,"previous_names":["halolegend94/pdf4py"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/dipietrantonio/pdf4py","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dipietrantonio%2Fpdf4py","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dipietrantonio%2Fpdf4py/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dipietrantonio%2Fpdf4py/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dipietrantonio%2Fpdf4py/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/dipietrantonio","download_url":"https://codeload.github.com/dipietrantonio/pdf4py/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dipietrantonio%2Fpdf4py/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28511420,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-17T13:38:16.342Z","status":"ssl_error","status_checked_at":"2026-01-17T13:37:44.060Z","response_time":85,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["information-extraction","parser","pdf","pdf-parsing","python"],"created_at":"2026-01-17T15:25:00.487Z","updated_at":"2026-01-17T15:25:00.554Z","avatar_url":"https://github.com/dipietrantonio.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf4py\n\n[![Build Status](https://travis-ci.org/Halolegend94/pdf4py.svg?branch=master)](https://travis-ci.org/Halolegend94/pdf4py) [![Documentation Status](https://readthedocs.org/projects/pdf4py/badge/?version=latest)](https://pdf4py.readthedocs.io/en/latest/?badge=latest) [![PyPI version](https://badge.fury.io/py/pdf4py.svg)](https://badge.fury.io/py/pdf4py) ![PyPI - Downloads](https://img.shields.io/pypi/dm/pdf4py?color=brightgreen)\n\nA PDF parser written in Python 3 with no external dependencies.\n\nThe package `pdf4py` allows the user to analyze a PDF file at a very low level 
and in a very\nflexible way by giving access to its atomic components, the PDF objects, all through a very\nsimple API that can be used to build higher-level functionality (e.g. text and/or image\nextraction). In particular, it defines the class `Parser` that reads the *Cross Reference Table*\nof a PDF document and uses its entries to give the user the ability to locate PDF objects within\nthe file and parse them into suitable Python objects.\n\n**DISCLAIMER**: this package hasn't reached a stable version (\u003e= 1.0.0) yet. Although the parser\nAPI is quite simple, it may change suddenly from one release to the next. All breaking changes\nwill be documented in the release notes.\n\n\n## Quick example\n\nHere is a quick demonstration of how to use pdf4py. You can find more on the [tutorials page](https://pdf4py.readthedocs.io/en/latest/tutorials.html).\n\n```python\n\u003e\u003e\u003e from pdf4py.parser import Parser\n\u003e\u003e\u003e fp = open('tests/pdfs/0000.pdf', 'rb')\n\u003e\u003e\u003e parser = Parser(fp)\n\u003e\u003e\u003e info_ref = parser.trailer['Info']\n\u003e\u003e\u003e print(info_ref)\nPDFReference(object_number=114, generation_number=0)\n\u003e\u003e\u003e info = parser.parse_reference(info_ref)\n\u003e\u003e\u003e print(info)\n{'Creator': PDFLiteralString(value=b'PaperCept Conference Management System'),\n    ... 
, 'Producer': PDFLiteralString(value=b'PDFlib+PDI 7.0.3 (Perl 5.8.0/Linux)')}\n\u003e\u003e\u003e creator = info['Creator'].value.decode('utf8')\n\u003e\u003e\u003e print(creator)\nPaperCept Conference Management System\n```\n\n## Installation and updates\n\nYou can install `pdf4py` using pip:\n\n```\npython3 -m pip install pdf4py\n```\n\nor download one of the releases and use the `setup.py` script.\n\nThe `master` branch is used for development and it is not advised to use it in production.\n\nThis package adopts semantic versioning (specification 2.0.0).\n\n## Extracting text or images\n\nExtracting text from a PDF and other higher-level analysis tasks are not natively supported as of now \nfor two reasons:\n\n- they are far from trivial and would require a considerable amount of work, which for now I prefer\nto invest in developing a complete and reliable parser;\n- they are conceptually different tasks from PDF parsing, since the PDF format does not define the concept of\na document as a sequence of paragraphs, images, and other objects that would normally be considered *content*.\n\nTherefore, they require a separate implementation built on top of `pdf4py`. I don't exclude that in\nthe future this functionality will be made available as modules in this package, but I am not planning\nto do it anytime soon.\n\n\n## Why this package\n\nOne day at work I was asked to analyze some PDF files. To my surprise, I discovered that\nthere was no established Python module to easily parse a PDF document. To understand\nwhy, I delved into the PDF 1.7 specification; since then I have become more and more interested\nin the inner workings of one of the most important and ubiquitous file formats. 
And what\nbetter way to understand PDF than to write a parser for it?\n\n\n## Documentation\n\nYou can read the documentation on [readthedocs.io](https://pdf4py.readthedocs.io/en/latest/).\n\n\n## Contributing\n\nContributions are more than welcome! When writing code or documentation for this package, please remember:\n\n- to use the [numpy docstring conventions](https://numpydoc.readthedocs.io/en/latest/format.html) for documenting code.\n- to follow the [Python guideline (PEP 8)](https://www.python.org/dev/peps/pep-0008/) when writing code.\n- that `pdf4py` is designed to be readable and easy to work with. I prefer readability over (not so significant)\n  performance improvements.\n- that `pdf4py` is designed to be modular and flexible, but also easy to use. It shouldn't be complicated for the user\n  to perform one particular task.\n- to adopt a test-driven development process as much as possible. Each contribution must be accompanied by a \n  new or modified test.\n\nIf you are wondering how you can help, check the [TODO list](https://github.com/Halolegend94/pdf4py/blob/master/TODO.md). For now it will do as a simple \"road map\".  \n\nIf you have found a bug, please file a new issue here on GitHub. Fixes, changes and additions can\nbe proposed through a pull request.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdipietrantonio%2Fpdf4py","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdipietrantonio%2Fpdf4py","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdipietrantonio%2Fpdf4py/lists"}