https://github.com/pymupdf/pymupdf
PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
data-science epub extract-data font mupdf ocr pdf pdf-documents pymupdf python table-extraction tesseract text-processing text-shaping xps
- Host: GitHub
- URL: https://github.com/pymupdf/pymupdf
- Owner: pymupdf
- License: agpl-3.0
- Created: 2012-10-06T18:54:25.000Z (about 12 years ago)
- Default Branch: main
- Last Pushed: 2024-08-30T11:04:39.000Z (2 months ago)
- Last Synced: 2024-08-31T09:58:30.577Z (2 months ago)
- Topics: data-science, epub, extract-data, font, mupdf, ocr, pdf, pdf-documents, pymupdf, python, table-extraction, tesseract, text-processing, text-shaping, xps
- Language: Python
- Homepage: https://pymupdf.readthedocs.io
- Size: 296 MB
- Stars: 4,980
- Watchers: 60
- Forks: 480
- Open Issues: 42
Metadata Files:
- Readme: README.md
- Changelog: changes.txt
- License: COPYING
- Support: docs/supported-files-table.rst
README
# PyMuPDF
**PyMuPDF** is a high performance **Python** library for data extraction, analysis, conversion & manipulation of [PDF (and other) documents](https://pymupdf.readthedocs.io/en/latest/the-basics.html#supported-file-types).
# Community
Join us on **Discord** here: [#pymupdf](https://discord.gg/TSpYGBW4eq)
# Installation
**PyMuPDF** requires **Python 3.9 or later**. Install it using **pip**:
`pip install PyMuPDF`
There are **no mandatory** external dependencies. However, some [optional features](#pymupdf-optional-features) become available only if additional packages are installed.
You can also try without installing by visiting [PyMuPDF.io](https://pymupdf.io/#examples).
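The requirements above can be confirmed with a small environment check. The sketch below is illustrative only: the helper name `check_environment` is hypothetical, and `fontTools` and `pymupdf_fonts` are assumed import names for the optional fontTools and pymupdf-fonts packages. It uses only the standard library, so it runs whether or not PyMuPDF is installed.

```python
import importlib.util
import sys

# Minimum interpreter version stated in the installation notes.
MIN_PYTHON = (3, 9)

# Import names to probe. "pymupdf" is the name used in the usage example;
# the other two are ASSUMED import names for the optional PyPI packages.
MODULES = ("pymupdf", "fontTools", "pymupdf_fonts")


def check_environment(modules=MODULES):
    """Return a dict reporting interpreter and package availability."""
    report = {"python_ok": sys.version_info[:2] >= MIN_PYTHON}
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, ok in check_environment().items():
        print(f"{key}: {'ok' if ok else 'missing'}")
```

A `missing` entry for one of the optional packages only means the related optional feature is unavailable. PyMuPDF itself has no mandatory external dependencies.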
# Usage
Basic usage is as follows:
```python
import pymupdf # imports the pymupdf library
doc = pymupdf.open("example.pdf") # open a document
for page in doc: # iterate the document pages
    text = page.get_text() # get plain text encoded as UTF-8
```
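Building on the basic loop, the per-page text can be collected into a single string. This is a minimal sketch, not part of the PyMuPDF API: `join_page_texts` and `extract_document_text` are hypothetical helper names, and the form-feed page separator is an arbitrary choice. It relies only on `pymupdf.open()`, `page.get_text()`, and closing the document afterwards.

```python
def join_page_texts(pages, separator="\f"):
    """Concatenate the plain text of each page, separated by `separator`.

    `pages` is any iterable of objects with a get_text() method,
    such as an opened PyMuPDF document.
    """
    return separator.join(page.get_text() for page in pages)


def extract_document_text(path):
    """Open the file at `path` with PyMuPDF and return the text of all pages."""
    import pymupdf  # imported lazily so join_page_texts works without PyMuPDF

    doc = pymupdf.open(path)
    try:
        return join_page_texts(doc)
    finally:
        doc.close()  # release the underlying file
```

For example, `extract_document_text("example.pdf")` would return the text of every page of `example.pdf` as one string, with pages separated by form-feed characters.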
# Documentation
Full documentation can be found on [pymupdf.readthedocs.io](https://pymupdf.readthedocs.io).
# Optional Features
* [fontTools](https://pypi.org/project/fonttools/) for creating font subsets.
* [pymupdf-fonts](https://pypi.org/project/pymupdf-fonts/) contains some nice fonts for your text output.
* [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition in images and document pages.
# About
**PyMuPDF** adds **Python** bindings and abstractions to [MuPDF](https://mupdf.com/), a lightweight **PDF**, **XPS**, and **eBook** viewer, renderer, and toolkit. Both **PyMuPDF** and **MuPDF** are maintained and developed by [Artifex Software, Inc](https://artifex.com).
**PyMuPDF** was originally written by Jorj X. McKie.
# License and Copyright
**PyMuPDF** is available under [open-source AGPL](https://www.gnu.org/licenses/agpl-3.0.html) and commercial license agreements. If you determine you cannot meet the requirements of the **AGPL**, please contact [Artifex](https://artifex.com/contact/pymupdf-inquiry.php) for more information regarding a commercial license.