https://github.com/weareprestatech/hotpdf

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six
https://github.com/weareprestatech/hotpdf

pdf python text-extraction text-search

Last synced: 4 months ago
JSON representation

hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six

Host: GitHub
URL: https://github.com/weareprestatech/hotpdf
Owner: weareprestatech
License: mit
Created: 2024-01-12T09:16:14.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-12-15T11:15:09.000Z (over 1 year ago)
Last Synced: 2025-04-22T22:51:14.251Z (about 1 year ago)
Topics: pdf, python, text-extraction, text-search
Language: Python
Homepage: https://hotpdf.readthedocs.io/en/latest/
Size: 16.5 MB
Stars: 187
Watchers: 3
Forks: 9
Open Issues: 7
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # hotpdf

[![Documentation Status](https://readthedocs.org/projects/hotpdf/badge/?version=latest)](https://hotpdf.readthedocs.io/en/latest/?badge=latest)

[![latest](https://github.com/weareprestatech/hotpdf/actions/workflows/python-publish.yml/badge.svg)](https://github.com/weareprestatech/hotpdf/actions/workflows/python-publish.yml)

[![build](https://github.com/weareprestatech/hotpdf/actions/workflows/build-badge.yml/badge.svg)](https://github.com/weareprestatech/hotpdf/actions/workflows/build-badge.yml)

[![Coverage Status](https://coveralls.io/repos/github/weareprestatech/hotpdf/badge.svg?branch=main)](https://coveralls.io/github/weareprestatech/hotpdf?branch=main)

[![Unit tests](https://github.com/weareprestatech/hotpdf/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/weareprestatech/hotpdf/actions/workflows/test.yml)

This project was started as an internal project @ [Prestatech](http://prestatech.com/) to parse PDF files in a fast and memory-efficient way to overcome the difficulties we were having while parsing big PDF files using libraries such as [pdfquery](https://github.com/jcushman/pdfquery) [[Comparison](https://imgur.com/a/5XuwEqq)].

hotpdf is a wrapper around [pdfminer.six](https://github.com/pdfminer/pdfminer.six) focusing on text extraction and text search operations on PDFs.

hotpdf can be used to find and extract text from PDFs.

Please [read the docs](https://hotpdf.readthedocs.io/en/latest/) to understand how the library can help you!

## Installation

The latest version of hotpdf can be installed directly from [PyPI](https://pypi.org/project/hotpdf/) with pip.

```bash

pip install hotpdf

```

## Local Setup

First, install the dependencies required by hotpdf

```bash

python3 -m pip install -e .

```

### Contributing

You should install the [pre-commit](https://github.com/weareprestatech/hotpdf/blob/main/.pre-commit-config.yaml) hooks with `pre-commit install`. This will run the linter, mypy, and ruff formatting before each commit.

Remember to run `pip install -e '.[dev]'` to install the extra dependencies for development.

For more examples of how to run the full test suite please refer to the [CI workflow](https://github.com/weareprestatech/hotpdf/blob/main/.github/workflows/test.yml).

We strive to keep the test coverage at 100% (but can't due to certain reasons - e.g., test file not available): if you want your contributions accepted please write tests for them :D

Some examples of running tests locally:

```bash

python3 -m pip install -e '.[dev]'               # install extra deps for testing

python3 -m pytest -n=auto tests/                      # run the test suite

# run tests with coverage

python3 -m pytest --cov-fail-under=96 -n=auto --cov=hotpdf --cov-report term-missing

```

### Documentation

We use [sphinx](https://www.sphinx-doc.org/en/master/) for generating our docs and host them on [readthedocs](https://readthedocs.org/)

Please update and add documentation if required, with your contributions.

Update the `.rst` files, rebuild them, and commit them along with your PRs.

```bash

cd docs

make clean

make html

```

This will generate the necessary documentation files. Once merged to `main` the docs will be updated automatically.

## Usage

**To view more detailed usage information, please [read the docs](https://hotpdf.readthedocs.io/en/latest/)**

Basic usage is as follows:

```python

from hotpdf import HotPdf

pdf_file_path = "test.pdf"

# Load pdf file into memory

hotpdf_document = HotPdf(pdf_file_path)

# Alternatively, you can also pass an opened PDF stream to be loaded

with open(pdf_file_path, "rb") as f:

   hotpdf_document_2 = HotPdf(f)

# You can also merge multiple HotPdf objects to get one single HotPdf object

merged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[hotpdf1, hotpdf2])

# Get the number of pages

print(len(hotpdf_document.pages))

# Find text

text_occurences = hotpdf_document.find_text("foo")

# Find text and its full span

text_occurences_full_span = hotpdf_document.find_text("foo", take_span=True)

# Extract text in the region

text_in_bbox = hotpdf_document.extract_text(

   x0=0,

   y0=0,

   x1=100,

   y1=10,

   page=0,

)

# Extract spans in the region

spans_in_bbox = hotpdf_document.extract_spans(

   x0=0,

   y0=0,

   x1=100,

   y1=10,

   page=0,

)

# Extract spans text in the region

spans_text_in_bbox = hotpdf_document.extract_spans_text(

   x0=0,

   y0=0,

   x1=100,

   y1=10,

   page=0,

)

# Extract full-page text

full_page_text = hotpdf_document.extract_page_text(page=0)

```

## Known Issues

1. (cid:x) characters in text - In some pdfs when extracted, some symbols like `€` might not be properly decoded, and instead be extracted as `(cid:128)`. 

This is a problem with the `pdfminer.six` library. We have fixed it from our side on our [fork](https://github.com/weareprestatech/pdfminer.six), and you can install it using pip. Until we can merge it to pdfminer.six repo and it gets released, we recommend that you install our fork with the fixes manually.

```bash

pip install --no-cache-dir git+https://github.com/weareprestatech/pdfminer.six.git@20240222#egg=pdfminer-six

```

## License

This project is licensed under the terms of the MIT license.

---

with ❤️ from the team @ [Prestatech GmbH](https://prestatech.com/)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/weareprestatech/hotpdf

Awesome Lists containing this project

README