https://github.com/dipietrantonio/pdf4py
A PDF parser written in Python 3 with no external dependencies.
information-extraction parser pdf pdf-parsing python
- Host: GitHub
- URL: https://github.com/dipietrantonio/pdf4py
- Owner: dipietrantonio
- License: mit
- Created: 2019-10-03T12:53:28.000Z (over 6 years ago)
- Default Branch: master
- Last Pushed: 2020-05-28T15:13:09.000Z (about 6 years ago)
- Last Synced: 2026-01-13T16:41:31.178Z (5 months ago)
- Topics: information-extraction, parser, pdf, pdf-parsing, python
- Language: Python
- Homepage: https://pdf4py.readthedocs.io/en/latest/
- Size: 11.3 MB
- Stars: 58
- Watchers: 5
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# pdf4py
A PDF parser written in Python 3 with no external dependencies.
The package `pdf4py` lets you analyze a PDF file at a very low level and in a very
flexible way by giving access to its atomic components, the PDF objects. It does so through a
simple API that can be used to build higher-level functionality (e.g. text and/or image
extraction). In particular, it defines the class `Parser`, which reads the *Cross Reference Table*
of a PDF document and uses its entries to locate PDF objects within the file and parse them
into suitable Python objects.
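To illustrate what locating the *Cross Reference Table* involves, here is a minimal standalone sketch. It is not `pdf4py`'s implementation: the function name `find_startxref` and the sample bytes are made up for this example. It relies only on the file layout defined by the PDF specification, where the `startxref` keyword near the end of the file is followed by the byte offset of the cross-reference section.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset written after the last 'startxref' keyword."""
    # The keyword sits near the end of the file, so search from the end.
    pos = data.rfind(b"startxref")
    if pos == -1:
        raise ValueError("no 'startxref' keyword found")
    match = re.match(rb"startxref\s+(\d+)", data[pos:])
    if match is None:
        raise ValueError("'startxref' is not followed by an offset")
    return int(match.group(1))


# Hand-written fragment for illustration only, not a complete PDF.
body = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\n"
xref_offset = len(body)  # the cross-reference section starts right after the body
tail = (
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n" + str(xref_offset).encode("ascii") + b"\n%%EOF\n"
)
sample = body + tail

offset = find_startxref(sample)
print(sample[offset:offset + 4])  # b'xref'
```

In real files the offset may point to a cross-reference stream instead of a plain `xref` table, which is one reason to leave this work to a parser.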
**DISCLAIMER**: this package hasn't reached a stable version (>= 1.0.0) yet. Although the parser
API is quite simple, it may change from one release to the next. All breaking changes
will be announced in the release notes.
## Quick example
Here is a quick demonstration of how to use `pdf4py`. You can find more on the [tutorials page](https://pdf4py.readthedocs.io/en/latest/tutorials.html).
```python
>>> from pdf4py.parser import Parser
>>> fp = open('tests/pdfs/0000.pdf', 'rb')
>>> parser = Parser(fp)
>>> info_ref = parser.trailer['Info']
>>> print(info_ref)
PDFReference(object_number=114, generation_number=0)
>>> info = parser.parse_reference(info_ref)
>>> print(info)
{'Creator': PDFLiteralString(value=b'PaperCept Conference Management System'),
... , 'Producer': PDFLiteralString(value=b'PDFlib+PDI 7.0.3 (Perl 5.8.0/Linux)')}
>>> creator = info['Creator'].value.decode('utf8')
>>> print(creator)
PaperCept Conference Management System
```
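The `Creator` value above is plain ASCII, so decoding it as UTF-8 works. The PDF 1.7 specification, however, stores text strings either in UTF-16BE (marked by a leading `FE FF` byte order mark) or in PDFDocEncoding. The helper below is a minimal sketch of handling both cases; `decode_pdf_text` is a hypothetical name, not part of `pdf4py`, and Latin-1 is used only as an approximation of PDFDocEncoding, which differs from it for a few code points.

```python
def decode_pdf_text(raw: bytes) -> str:
    """Decode the bytes of a PDF text string into a Python str."""
    # UTF-16BE text strings start with the byte order mark FE FF.
    if raw.startswith(b"\xfe\xff"):
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 stands in for PDFDocEncoding here.
    return raw.decode("latin-1")


print(decode_pdf_text(b"PaperCept Conference Management System"))
print(decode_pdf_text(b"\xfe\xff\x00H\x00i"))
```

With the objects from the example above, it would be called as `decode_pdf_text(info['Creator'].value)`.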
## Installation and updates
You can install `pdf4py` using pip:
```
python3 -m pip install pdf4py
```
or download one of the releases and use the `setup.py` script.
The `master` branch is used for development; using it in production is not advised.
This package follows Semantic Versioning 2.0.0.
## Extracting text or images
Extracting text from a PDF and other higher-level analysis tasks are not natively supported as of now,
for two reasons:
- they are far from trivial and would require a considerable amount of work, which for now I prefer
to invest in developing a complete and reliable parser;
- they are conceptually different tasks from PDF parsing, since the PDF format does not define a
document as a sequence of paragraphs, images, and other objects that would normally be considered *content*.
Therefore, they require a separate implementation built on top of `pdf4py`. I don't exclude that
these features will be made available as modules in this package in the future, but I am not planning
to do it anytime soon.
## Why this package
One day at work I was asked to analyze some PDF files. To my surprise, I discovered that
there was no established Python module for easily parsing a PDF document. To understand
why, I delved into the PDF 1.7 specification. Since then I have become more and more interested
in the inner workings of one of the most important and ubiquitous file formats. And what
better way to understand PDF than writing a parser for it?
## Documentation
You can read the documentation on [readthedocs.io](https://pdf4py.readthedocs.io/en/latest/).
## Contributing
Contributions are more than welcome! When writing code or documentation for this package, please remember:
- to use the [numpy docstring conventions](https://numpydoc.readthedocs.io/en/latest/format.html) for documenting code;
- to follow the [Python style guide (PEP 8)](https://www.python.org/dev/peps/pep-0008/) when writing code;
- that `pdf4py` is designed to be readable and easy to work with: I prefer readability over (not so significant)
performance improvements;
- that `pdf4py` is designed to be modular and flexible, but also easy to use: it shouldn't be complicated for the user
to perform a particular task;
- to adopt a test-driven development process as much as possible: each contribution must be accompanied by a
new or modified test.
If you are wondering how you can help, check the [TODO list](https://github.com/Halolegend94/pdf4py/blob/master/TODO.md). For now it serves as a simple "road map".
If you have found a bug, please file a new issue here on GitHub. Fixes, changes, and additions can
be proposed through a pull request.