{"id":13688731,"url":"https://github.com/weareprestatech/hotpdf","last_synced_at":"2026-03-03T16:16:06.246Z","repository":{"id":221381042,"uuid":"742346070","full_name":"weareprestatech/hotpdf","owner":"weareprestatech","description":"hotpdf is a fast PDF parsing library to extract text and find text within PDF documents built on top of pdfminer.six","archived":false,"fork":false,"pushed_at":"2024-12-15T11:15:09.000Z","size":17346,"stargazers_count":187,"open_issues_count":7,"forks_count":9,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-22T22:51:14.251Z","etag":null,"topics":["pdf","python","text-extraction","text-search"],"latest_commit_sha":null,"homepage":"https://hotpdf.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/weareprestatech.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-12T09:16:14.000Z","updated_at":"2025-04-22T13:26:30.000Z","dependencies_parsed_at":"2024-02-28T11:47:21.319Z","dependency_job_id":"a079259f-572b-4549-a4b9-53c9ca32e6eb","html_url":"https://github.com/weareprestatech/hotpdf","commit_stats":null,"previous_names":["weareprestatech/hotpdf"],"tags_count":25,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weareprestatech%2Fhotpdf","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weareprestatech%2Fhotpdf/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weareprestatech%2Fhotpdf/releases","manifests_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/weareprestatech%2Fhotpdf/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/weareprestatech","download_url":"https://codeload.github.com/weareprestatech/hotpdf/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":251940395,"owners_count":21668530,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf","python","text-extraction","text-search"],"created_at":"2024-08-02T15:01:21.384Z","updated_at":"2026-03-03T16:16:06.241Z","avatar_url":"https://github.com/weareprestatech.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# hotpdf\n\n[![Documentation Status](https://readthedocs.org/projects/hotpdf/badge/?version=latest)](https://hotpdf.readthedocs.io/en/latest/?badge=latest)\n[![latest](https://github.com/weareprestatech/hotpdf/actions/workflows/python-publish.yml/badge.svg)](https://github.com/weareprestatech/hotpdf/actions/workflows/python-publish.yml)\n[![build](https://github.com/weareprestatech/hotpdf/actions/workflows/build-badge.yml/badge.svg)](https://github.com/weareprestatech/hotpdf/actions/workflows/build-badge.yml)\n[![Coverage Status](https://coveralls.io/repos/github/weareprestatech/hotpdf/badge.svg?branch=main)](https://coveralls.io/github/weareprestatech/hotpdf?branch=main)\n[![Unit tests](https://github.com/weareprestatech/hotpdf/actions/workflows/test.yml/badge.svg?branch=main)](https://github.com/weareprestatech/hotpdf/actions/workflows/test.yml)\n\n\nThis project was started 
as an internal project at [Prestatech](http://prestatech.com/) to parse PDF files quickly and memory-efficiently, after we ran into difficulties parsing large PDF files with libraries such as [pdfquery](https://github.com/jcushman/pdfquery) [[Comparison](https://imgur.com/a/5XuwEqq)].\n\nhotpdf is a wrapper around [pdfminer.six](https://github.com/pdfminer/pdfminer.six) focusing on text extraction and text search operations on PDFs.\n\nPlease [read the docs](https://hotpdf.readthedocs.io/en/latest/) to understand how the library can help you!\n\n## Installation\n\nThe latest version of hotpdf can be installed directly from [PyPI](https://pypi.org/project/hotpdf/) with pip.\n\n```bash\npip install hotpdf\n```\n\n## Local Setup\n\nFirst, install hotpdf and its dependencies in editable mode:\n\n```bash\npython3 -m pip install -e .\n```\n\n### Contributing\n\nYou should install the [pre-commit](https://github.com/weareprestatech/hotpdf/blob/main/.pre-commit-config.yaml) hooks with `pre-commit install`. 
This will run the linter, mypy, and ruff formatting before each commit.\n\nRemember to run `pip install -e '.[dev]'` to install the extra dependencies for development.\n\nFor more examples of how to run the full test suite, please refer to the [CI workflow](https://github.com/weareprestatech/hotpdf/blob/main/.github/workflows/test.yml).\n\nWe aim for 100% test coverage, although we cannot always reach it (for example, when a test file is not available). If you want your contributions accepted, please write tests for them :D\n\nSome examples of running tests locally:\n\n```bash\npython3 -m pip install -e '.[dev]'               # install extra deps for testing\npython3 -m pytest -n=auto tests/                 # run the test suite\n# run tests with coverage\npython3 -m pytest --cov-fail-under=96 -n=auto --cov=hotpdf --cov-report term-missing\n```\n\n### Documentation\n\nWe use [sphinx](https://www.sphinx-doc.org/en/master/) to generate our docs and host them on [readthedocs](https://readthedocs.org/).\n\nPlease update or add documentation as needed along with your contributions.\n\nUpdate the `.rst` files, rebuild the docs, and commit them along with your PRs.\n\n```bash\ncd docs\nmake clean\nmake html\n```\n\nThis will generate the necessary documentation files. 
Once your changes are merged to `main`, the docs will be updated automatically.\n\n## Usage\n\n**To view more detailed usage information, please [read the docs](https://hotpdf.readthedocs.io/en/latest/)**\n\nBasic usage is as follows:\n\n```python\nfrom hotpdf import HotPdf\n\npdf_file_path = \"test.pdf\"\n\n# Load pdf file into memory\nhotpdf_document = HotPdf(pdf_file_path)\n\n# Alternatively, you can also pass an opened PDF stream to be loaded\nwith open(pdf_file_path, \"rb\") as f:\n    hotpdf_document_2 = HotPdf(f)\n\n# You can also merge multiple HotPdf objects to get one single HotPdf object\nmerged_hotpdf_object = HotPdf.merge_multiple(hotpdfs=[hotpdf_document, hotpdf_document_2])\n\n# Get the number of pages\nprint(len(hotpdf_document.pages))\n\n# Find text\ntext_occurrences = hotpdf_document.find_text(\"foo\")\n\n# Find text and its full span\ntext_occurrences_full_span = hotpdf_document.find_text(\"foo\", take_span=True)\n\n# Extract text in the region\ntext_in_bbox = hotpdf_document.extract_text(\n    x0=0,\n    y0=0,\n    x1=100,\n    y1=10,\n    page=0,\n)\n\n# Extract spans in the region\nspans_in_bbox = hotpdf_document.extract_spans(\n    x0=0,\n    y0=0,\n    x1=100,\n    y1=10,\n    page=0,\n)\n\n# Extract spans text in the region\nspans_text_in_bbox = hotpdf_document.extract_spans_text(\n    x0=0,\n    y0=0,\n    x1=100,\n    y1=10,\n    page=0,\n)\n\n# Extract full-page text\nfull_page_text = hotpdf_document.extract_page_text(page=0)\n```\n\n## Known Issues\n\n1. `(cid:x)` characters in text - When text is extracted from some PDFs, symbols such as `€` might not be decoded properly and are instead extracted as `(cid:128)`.\n\nThis is a problem with the `pdfminer.six` library. We have fixed it in our [fork](https://github.com/weareprestatech/pdfminer.six), which you can install using pip. 
Until the fix is merged into the pdfminer.six repository and released, we recommend that you install our fork manually.\n\n```bash\npip install --no-cache-dir git+https://github.com/weareprestatech/pdfminer.six.git@20240222#egg=pdfminer-six\n```\n\n## License\n\nThis project is licensed under the terms of the MIT license.\n\n---\nwith ❤️ from the team @ [Prestatech GmbH](https://prestatech.com/)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fweareprestatech%2Fhotpdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fweareprestatech%2Fhotpdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fweareprestatech%2Fhotpdf/lists"}