https://github.com/lebedov/python-pdfbox

Python interface to Apache PDFBox command-line tools.

.. -*- rst -*-

python-pdfbox
=============

Package Description
-------------------
Provides a simple Python 3 interface to the
`Apache PDFBox <https://pdfbox.apache.org/>`_
command-line tools.

.. image:: https://img.shields.io/pypi/v/python-pdfbox.svg
    :target: https://pypi.python.org/pypi/python-pdfbox
    :alt: Latest Version

Requirements
------------
Aside from Python 3 and the packages specified in ``setup.py``,
python-pdfbox requires ``java`` to be present in the system path.

Some users have reported issues on macOS with certain versions of Java.
If you encounter such issues, try a recent release of OpenJDK
(14 or later).
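
Because python-pdfbox relies on an external ``java`` executable, it can help
to confirm that one is reachable before using the package. The following is a
minimal sketch that uses only the Python standard library; the helper names
``find_java`` and ``java_version_banner`` are illustrative and are not part of
python-pdfbox: ::

    import shutil
    import subprocess


    def find_java():
        """Return the full path of ``java`` on the system path, or None."""
        return shutil.which("java")


    def java_version_banner(java_path):
        """Return the text printed by ``java -version``."""
        result = subprocess.run(
            [java_path, "-version"],
            capture_output=True,
            text=True,
            check=False,
        )
        # Java runtimes conventionally print the version banner to stderr.
        return (result.stderr or result.stdout).strip()


    if __name__ == "__main__":
        java = find_java()
        if java is None:
            print("java was not found on the system path")
        else:
            print(java_version_banner(java))

If the reported version is one that misbehaves on your platform, installing a
newer OpenJDK and placing it first on the path is the likely remedy.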

Installation
------------
The package may be installed as follows: ::

    pip install python-pdfbox

The location of the PDFBox jar file may be specified via the ``PDFBOX``
environment variable. If the variable is not set, python-pdfbox looks for the
jar file in the platform-specific user cache directory; if no jar is present
there, it automatically downloads the latest available version below 3.0.0
and caches it.
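
To pin a specific local jar from within Python rather than from the shell, the
``PDFBOX`` variable can be set before the interface is created. This is a
sketch under two assumptions: that the variable is read when
``pdfbox.PDFBox()`` is constructed, and that the jar already exists on disk.
The helper ``use_local_jar`` is hypothetical and not part of python-pdfbox: ::

    import os
    from pathlib import Path


    def use_local_jar(jar_path):
        """Point python-pdfbox at a local jar via the PDFBOX variable."""
        jar = Path(jar_path).expanduser()
        # Fail early with a clear message instead of a later Java error.
        if not jar.is_file():
            raise FileNotFoundError(f"PDFBox jar not found: {jar}")
        os.environ["PDFBOX"] = str(jar)
        return jar

Call ``use_local_jar`` with the path to your own jar before creating the
``PDFBox`` object; leaving the variable unset keeps the automatic download
and caching behavior described above.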

Usage
-----
The interface currently exposes only a few PDFBox features (text extraction,
conversion to images, and extraction of images): ::

    import pdfbox
    p = pdfbox.PDFBox()
    p.extract_text('/path/to/my_file.pdf') # writes text to /path/to/my_file.txt
    p.pdf_to_images('/path/to/my_file.pdf') # writes images to /path/to/my_file1.jpg, /path/to/my_file2.jpg, etc.
    p.extract_images('/path/to/my_file.pdf') # writes images to /path/to/my_file-1.png, /path/to/my_file-2.png, etc.
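
Since ``extract_text`` writes its output next to the input file, processing a
whole directory amounts to a loop plus a little path arithmetic. The sketch
below assumes the ``.txt`` naming shown above; ``expected_text_path`` and
``extract_all`` are hypothetical helpers, not part of python-pdfbox: ::

    from pathlib import Path


    def expected_text_path(pdf_path):
        """Return the .txt path that sits alongside the given PDF."""
        return Path(pdf_path).with_suffix(".txt")


    def extract_all(pdf_dir):
        """Extract text from every PDF in pdf_dir; return the output paths."""
        import pdfbox  # imported here so the path helper works without Java

        p = pdfbox.PDFBox()
        outputs = []
        for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
            p.extract_text(str(pdf))
            outputs.append(expected_text_path(pdf))
        return outputs

Creating one ``PDFBox`` object and reusing it for all files avoids repeating
the jar lookup for each document.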

Notes
-----
Owing to a change in the PDFBox command-line interface, python-pdfbox cannot
currently use PDFBox 3.0.0.
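
When a jar is supplied manually, a quick check of its version can catch an
incompatible release early. This sketch assumes that the jar file name embeds
a version of the form ``X.Y.Z`` (for example ``pdfbox-app-2.0.27.jar``); the
helpers ``jar_version`` and ``is_supported`` are illustrative and not part of
python-pdfbox: ::

    import re
    from pathlib import Path


    def jar_version(jar_path):
        """Parse the first X.Y.Z version number found in the jar file name."""
        name = Path(jar_path).name  # ignore digits in parent directories
        match = re.search(r"(\d+)\.(\d+)\.(\d+)", name)
        if match is None:
            raise ValueError(f"no version number found in {name!r}")
        return tuple(int(part) for part in match.groups())


    def is_supported(jar_path):
        """Return True if the jar version is below 3.0.0."""
        return jar_version(jar_path) < (3, 0, 0)

Comparing integer tuples rather than strings keeps the ordering correct for
multi-digit components such as ``2.0.27`` versus ``2.0.9``.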

Development
-----------
The latest release of the package may be obtained from
`GitHub <https://github.com/lebedov/python-pdfbox>`_.

Author
------
See the included ``AUTHORS.rst`` file for more information.

License
-------
This software is licensed under the
`Apache 2.0 License <https://www.apache.org/licenses/LICENSE-2.0>`_.
See the included ``LICENSE.rst`` file for more information.