https://github.com/alephdata/pdflib
Binary Python bindings for poppler utils for content extraction
pdflib poppler python-bindings
- Host: GitHub
- URL: https://github.com/alephdata/pdflib
- Owner: alephdata
- Created: 2018-04-11T10:51:29.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2021-05-12T12:23:17.000Z (about 4 years ago)
- Last Synced: 2025-05-07T03:03:38.300Z (29 days ago)
- Topics: pdflib, poppler, python-bindings
- Language: Python
- Homepage:
- Size: 2.33 MB
- Stars: 42
- Watchers: 18
- Forks: 5
- Open Issues: 5
Metadata Files:
- Readme: README.md
pdflib
------
Python binding for poppler.
## Installation
Using pip: `pip install pdflib`
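After installing, it can be useful to confirm that Python can actually locate the package before trying the usage examples. The following is a minimal sketch using only the standard library; `is_importable` is a hypothetical helper name, not part of pdflib.

```python
import importlib.util


def is_importable(module_name):
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Hypothetical check: report whether the pdflib package can be found.
if is_importable("pdflib"):
    print("pdflib is importable")
else:
    print("pdflib not found; install it with pip or build it from source")
```

This only checks that the module can be located; it does not load the compiled extension or verify that it was built against a working poppler.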
From source:
- Clone poppler source code and compile it:
```
git clone --branch poppler-0.63.0 --depth 1 https://anongit.freedesktop.org/git/poppler/poppler.git poppler_src
cd poppler_src/
cmake -DENABLE_SPLASH=OFF -DBUILD_GTK_TESTS=OFF -DENABLE_UTILS=OFF -DENABLE_LIBOPENJPEG=none .
make
```
- Set the `POPPLER_ROOT` environment variable:
```
export POPPLER_ROOT=/pdflib/poppler_src/
```
- Install Cython:
```
pip install cython
```
- Build the extension:
```
python setup.py build_ext --inplace
```

## Usage
```
>>> from pdflib import Document
>>> doc = Document("path/to/file.pdf")
```
Getting metadata:
```
>>> print(doc.metadata)
>>> print(doc.xmp_metadata)
```
Getting text content of each page:
```
>>> for page in doc:
...     print(' \n'.join(page.lines).strip())
```
Getting images from each page:
```
>>> for page in doc:
...     page.extract_images(path='images', prefix='img')
```

LICENSE
-------
pdflib is available under GPL v3 (poppler is GPL).
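
As a follow-up to the Usage section, the per-page text loop can be wrapped in a small helper. This is a minimal sketch, not part of pdflib: `page_text`, `document_text`, and `load_document` are hypothetical names, and the only pdflib behaviour assumed is what the usage examples show (iterating a `Document` yields pages with a `lines` attribute). Stand-in page objects are used so the snippet runs without a PDF or a pdflib install.

```python
from types import SimpleNamespace


def page_text(page):
    """Join a page's lines the same way the usage example does."""
    return " \n".join(page.lines).strip()


def document_text(pages, separator="\n\n"):
    """Concatenate the text of every page, separated by a blank line."""
    return separator.join(page_text(page) for page in pages)


def load_document(path):
    """Open a PDF with pdflib (imported lazily so the helpers work without it)."""
    from pdflib import Document  # import shown in the usage examples
    return Document(path)


# Stand-in pages that mimic the assumed `lines` attribute.
fake_pages = [
    SimpleNamespace(lines=["Hello", "world"]),
    SimpleNamespace(lines=["Second page"]),
]
combined = document_text(fake_pages)
print(combined)
```

With a real file, the assumed call would be `document_text(load_document("path/to/file.pdf"))`.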