https://github.com/py-pdf/pdf
A modern pure-Python library for reading PDF files
- Host: GitHub
- URL: https://github.com/py-pdf/pdf
- Owner: py-pdf
- License: bsd-3-clause
- Created: 2022-04-04T19:26:24.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-04-05T19:29:48.000Z (almost 4 years ago)
- Last Synced: 2025-04-10T18:10:40.350Z (9 months ago)
- Language: Python
- Size: 4.74 MB
- Stars: 11
- Watchers: 3
- Forks: 3
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# pdf
A modern pure-Python library for reading PDF files.
The goal is to have a modern interface to handle PDF files which is consistent
with itself and typical Python syntax.
The library itself should be pure Python (no C extensions) but allow the
backend to be swapped, similar in concept to [matplotlib backends](https://matplotlib.org/2.0.2/faq/usage_faq.html#what-is-a-backend) and [Keras backends](https://faroit.com/keras-docs/1.2.0/backend/).
The default backend could be PyPDF2.
Possible other backends could be [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/)
(using [MuPDF](https://mupdf.com/))
and [PikePDF](https://github.com/pikepdf/pikepdf) (using [QPDF](https://github.com/qpdf/qpdf)).
> **WARNING**: This library is UNSTABLE at the moment! Expect many changes!
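The swappable-backend idea described above could look roughly like the sketch below. It is only an illustration of the concept, not this library's actual design: `Backend`, `register_backend`, `get_backend`, and `DummyBackend` are hypothetical names, and the stand-in backend returns canned text instead of parsing a file.

```python
from typing import Callable, Dict, List, Protocol


class Backend(Protocol):
    """Hypothetical minimal interface that every backend would implement."""

    def load_pages(self, path: str) -> List[str]:
        """Return the text of each page in the file at `path`."""
        ...


# Registry mapping a backend name to a factory that creates the backend.
_BACKENDS: Dict[str, Callable[[], Backend]] = {}


def register_backend(name: str, factory: Callable[[], Backend]) -> None:
    """Make a backend selectable by name."""
    _BACKENDS[name] = factory


def get_backend(name: str) -> Backend:
    """Instantiate a registered backend, or fail with a helpful message."""
    try:
        factory = _BACKENDS[name]
    except KeyError:
        available = sorted(_BACKENDS)
        raise ValueError(f"unknown backend {name!r}; available: {available}") from None
    return factory()


class DummyBackend:
    """Stand-in backend: returns canned text, does not read any file."""

    def load_pages(self, path: str) -> List[str]:
        return [f"page 1 of {path}"]


register_backend("dummy", DummyBackend)
pages = get_backend("dummy").load_pages("example.pdf")
```

A real backend would wrap PyPDF2, PyMuPDF, or PikePDF behind the same interface, so user code would not change when the backend does.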
## Installation
```bash
pip install pdffile
```
## Usage
### Retrieve Metadata
```pycon
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> len(doc)
1
>>> doc.metadata
Metadata(
    title=None,
    producer='pdfTeX-1.40.23',
    creator='TeX',
    creation_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    modification_date=datetime.datetime(2022, 4, 3, 18, 5, 42),
    other={
        '/CreationDate': "D:20220403180542+02'00'",
        '/ModDate': "D:20220403180542+02'00'",
        '/Trapped': '/False',
        '/PTEX.Fullbanner': 'This is pdfTeX, V...'})
```
```
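The raw `/CreationDate` string and the parsed `creation_date` in the output above are related by the PDF date format (`D:YYYYMMDDHHmmSS` followed by an optional UTC offset). As a minimal sketch of that conversion, and not the library's own parser, the hypothetical helper `parse_pdf_date` below turns such a string into a `datetime`. Unlike the naive value shown above, it keeps the offset as `tzinfo` when one is present.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional "Z" or "+HH'mm'" offset.
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tz_hour>\d{2})'?(?P<tz_minute>\d{2})?'?)?)?$"
)


def parse_pdf_date(value):
    """Parse a PDF date string; hypothetical helper, not part of the library."""
    match = _PDF_DATE.match(value)
    if match is None:
        raise ValueError(f"not a PDF date string: {value!r}")
    parts = match.groupdict()
    # Missing month/day default to 1, missing time fields default to 0.
    naive = datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
    )
    sign = parts["sign"]
    if sign is None:
        return naive  # no offset given: return a naive datetime
    if sign == "Z":
        return naive.replace(tzinfo=timezone.utc)
    offset = timedelta(
        hours=int(parts["tz_hour"] or 0),
        minutes=int(parts["tz_minute"] or 0),
    )
    if sign == "-":
        offset = -offset
    return naive.replace(tzinfo=timezone(offset))


created = parse_pdf_date("D:20220403180542+02'00'")
```

Calling `created.replace(tzinfo=None)` drops the offset and yields the same naive value as the `creation_date` field shown above.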
### Encrypted PDFs
If the PDF is encrypted, pass its password when opening the file:
```python
doc = pdf.PdfFile(pdf_path, password=password)
```
All subsequent operations work exactly as they do for unencrypted files.
### Get Outline
```pycon
>>> import pdf
>>> doc = pdf.PdfFile(pdf_path, password=password)
>>> doc.outline
[
    Links(page=5, text='1 Header'),
    Links(page=5, text='1.1 A section'),
    Links(page=9, text='2 Foobar'),
    Links(page=108, text='References')
]
```
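The outline comes back as a flat list, so nesting has to be inferred if you want an indented table of contents. The sketch below does that from the section-number prefix in `text`. It assumes entries expose `page` and `text` as in the output above; the `Links` namedtuple here is only a stand-in for the library's class, and `outline_depth` and `format_outline` are hypothetical helpers.

```python
import re
from collections import namedtuple

# Stand-in mirroring the fields shown above; the real class lives in the library.
Links = namedtuple("Links", ["page", "text"])

# Matches a leading section number such as "1" or "1.1" followed by whitespace.
_SECTION_NUMBER = re.compile(r"^(\d+(?:\.\d+)*)\s")


def outline_depth(text):
    """Return the nesting depth implied by the section number (0 if none)."""
    match = _SECTION_NUMBER.match(text)
    if match is None:
        return 0
    return match.group(1).count(".")


def format_outline(entries, indent="  "):
    """Render outline entries as indented lines with their page numbers."""
    return [
        f"{indent * outline_depth(entry.text)}{entry.text} (p. {entry.page})"
        for entry in entries
    ]


outline = [
    Links(page=5, text="1 Header"),
    Links(page=5, text="1.1 A section"),
    Links(page=9, text="2 Foobar"),
    Links(page=108, text="References"),
]
toc_lines = format_outline(outline)
```

Entries without a numeric prefix, such as `References`, are treated as top level.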
### Extract Text
```pycon
>>> import pdf
>>> doc = pdf.PdfFile("001-trivial/minimal-document.pdf")
>>> doc[0]
>>> doc[0].text
'Loremipsumdolorsitamet,consetetursadipscingelitr,seddiamnonumyeirmod\ntemporinviduntutlaboreetdoloremagnaaliquyamerat,seddiamvoluptua.Atvero\neosetaccusametjustoduodoloresetearebum.Stetclitakasdgubergren,noseataki-\nmatasanctusestLoremipsumdolorsitamet.Loremipsumdolorsitamet,consetetur\nsadipscingelitr,seddiamnonumyeirmodtemporinviduntutlaboreetdoloremagna\naliquyamerat,seddiamvoluptua.Atveroeosetaccusametjustoduodoloresetea\nrebum.Stetclitakasdgubergren,noseatakimatasanctusestLoremipsumdolorsit\namet.\n1\n'
```
Alternatively, you can use `doc.text` to get the text of all pages.
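Building on the `len(doc)` and `doc[i].text` calls shown above, a simple page search could be sketched as follows. `pages_containing` is a hypothetical helper, not part of the library. Because extracted text can lose its spaces, as in the sample output above, the sketch removes all whitespace and lowercases both sides before comparing. The `fake_doc` list stands in for a real `PdfFile` so the example runs without a PDF.

```python
from types import SimpleNamespace


def _squash(text):
    """Remove all whitespace and lowercase, to tolerate missing spaces."""
    return "".join(text.split()).lower()


def pages_containing(doc, phrase):
    """Return zero-based indices of pages whose text contains `phrase`.

    `doc` only needs to support len() and integer indexing, and each
    page needs a `.text` attribute.
    """
    needle = _squash(phrase)
    return [
        index
        for index in range(len(doc))
        if needle in _squash(doc[index].text)
    ]


# Stand-in for a PdfFile: two "pages" with space-less extracted text.
fake_doc = [
    SimpleNamespace(text="Loremipsumdolorsitamet,consetetur\n1\n"),
    SimpleNamespace(text="Stetclitakasdgubergren\n2\n"),
]
matches = pages_containing(fake_doc, "Lorem ipsum")
```

Words hyphenated across line breaks in the extracted text would still be missed by this approach.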