https://github.com/lebedov/python-pdfbox
Python interface to Apache PDFBox command-line tools.
- Host: GitHub
- URL: https://github.com/lebedov/python-pdfbox
- Owner: lebedov
- License: other
- Created: 2017-11-09T04:22:04.000Z (over 8 years ago)
- Default Branch: master
- Last Pushed: 2023-01-24T13:51:20.000Z (over 3 years ago)
- Last Synced: 2025-04-09T19:17:52.716Z (about 1 year ago)
- Topics: pdf, pdfbox, python, python3
- Language: Python
- Size: 106 KB
- Stars: 75
- Watchers: 4
- Forks: 24
- Open Issues: 10
Metadata Files:
- Readme: README.rst
- License: LICENSE.rst
.. -*- rst -*-
python-pdfbox
=============
Package Description
-------------------
Provides a simple Python 3 interface to the
`Apache PDFBox <https://pdfbox.apache.org/>`_
command-line tools.

.. image:: https://img.shields.io/pypi/v/python-pdfbox.svg
    :target: https://pypi.python.org/pypi/python-pdfbox
    :alt: Latest Version

Requirements
------------
Aside from Python 3 and the packages specified in ``setup.py``,
python-pdfbox requires ``java`` to be present in the system path.

Some users have reported issues on macOS with certain
versions of Java. If you encounter such issues, try a recent release of OpenJDK
(14 or later).
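
Since a missing ``java`` executable is an easy thing to overlook, the requirement above can be checked before using the package. The following is a minimal sketch that uses only the standard library; the helper name ``find_java`` is hypothetical and not part of python-pdfbox.

```python
import shutil


def find_java(executable="java"):
    """Return the full path of the Java launcher on the system path, or None."""
    # shutil.which searches the directories listed in PATH, much as a shell would.
    return shutil.which(executable)


if find_java() is None:
    print("java was not found on the system path; install a JDK first")
```

This only confirms that an executable named ``java`` can be found; it does not check which Java version is installed.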
Installation
------------
The package may be installed as follows: ::

    pip install python-pdfbox

One may specify the location of the PDFBox jar file via the ``PDFBOX``
environment variable. If it is not set, python-pdfbox looks for the jar file
in the platform-specific user cache directory; if no jar is present there, it
automatically downloads the latest available version below 3.0.0 and caches it.
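
To use a jar file that is already on disk, the ``PDFBOX`` variable can also be set from Python. The sketch below assumes that the variable needs to be in place before ``pdfbox.PDFBox()`` is created; the helper ``use_local_pdfbox_jar`` and the jar location in the closing comment are hypothetical.

```python
import os
from pathlib import Path


def use_local_pdfbox_jar(jar_path, environ=os.environ):
    """Set PDFBOX to an existing jar file and return the resolved path."""
    jar = Path(jar_path).expanduser().resolve()
    if not jar.is_file():
        # Fail early instead of leaving a dangling path in the environment.
        raise FileNotFoundError(f"PDFBox jar not found: {jar}")
    environ["PDFBOX"] = str(jar)
    return str(jar)


# Intended use (hypothetical jar location):
#   use_local_pdfbox_jar("~/jars/pdfbox-app-2.0.27.jar")
#   import pdfbox
#   p = pdfbox.PDFBox()
```

The ``environ`` parameter exists only so the helper can be tried against a plain dictionary without touching the real environment.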
Usage
-----
The interface currently exposes only a few PDFBox features (text extraction,
conversion of pages to images, and extraction of embedded images): ::

    import pdfbox
    p = pdfbox.PDFBox()
    p.extract_text('/path/to/my_file.pdf') # writes text to /path/to/my_file.txt
    p.pdf_to_images('/path/to/my_file.pdf') # writes images to /path/to/my_file1.jpg, /path/to/my_file2.jpg, etc.
    p.extract_images('/path/to/my_file.pdf') # writes images to /path/to/my_file-1.png, /path/to/my_file-2.png, etc.

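
Because each call handles a single file, processing a whole folder takes a small loop. The sketch below is one way to do that, assuming ``extract_text`` writes its output next to the input with a ``.txt`` suffix as described above; the helper ``extract_text_from_folder`` is hypothetical, and it accepts the extraction function as an argument so it can be tried without Java installed.

```python
from pathlib import Path


def extract_text_from_folder(folder, extract_text):
    """Call extract_text on every PDF in folder; return the expected .txt paths."""
    outputs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        extract_text(str(pdf))
        # Assumed output location: same directory and stem, .txt suffix.
        outputs.append(pdf.with_suffix(".txt"))
    return outputs


# Intended use (requires python-pdfbox and java):
#   import pdfbox
#   extract_text_from_folder("/path/to/pdfs", pdfbox.PDFBox().extract_text)
```

The returned paths are where the text files are expected to appear; the helper does not verify that they were actually written.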
Notes
-----
Owing to a change in the PDFBox command-line interface, python-pdfbox cannot
currently use PDFBox 3.0.0.
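
When supplying a jar by hand, it may help to confirm that it predates 3.0.0. The sketch below guesses the version from the file name, assuming the jar follows the ``pdfbox-app-X.Y.Z.jar`` naming used for the standalone PDFBox application; both helper names are hypothetical, and a renamed jar will simply be reported as unsupported.

```python
import re
from pathlib import Path

# Assumed file name pattern, e.g. pdfbox-app-2.0.27.jar
_JAR_VERSION = re.compile(r"pdfbox-app-(\d+)\.(\d+)\.(\d+)\.jar$")


def jar_version(jar_path):
    """Return the (major, minor, patch) version parsed from the jar name, or None."""
    match = _JAR_VERSION.search(Path(jar_path).name)
    return tuple(int(part) for part in match.groups()) if match else None


def is_supported_jar(jar_path):
    """Return True only if the jar name carries a version below 3.0.0."""
    version = jar_version(jar_path)
    return version is not None and version < (3, 0, 0)
```

The check looks only at the file name and never opens the jar, so it is a convenience rather than a guarantee.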
Development
-----------
The latest release of the package may be obtained from
`GitHub <https://github.com/lebedov/python-pdfbox>`_.
Author
------
See the included ``AUTHORS.rst`` file for more
information.
License
-------
This software is licensed under the
`Apache 2.0 License <https://www.apache.org/licenses/LICENSE-2.0>`_.
See the included ``LICENSE.rst`` file for more
information.