https://github.com/anotherbyte-net/leaf-focus
Extract structured text from pdf files.
- Host: GitHub
- URL: https://github.com/anotherbyte-net/leaf-focus
- Owner: anotherbyte-net
- License: apache-2.0
- Created: 2022-08-28T09:03:20.000Z
- Default Branch: main
- Last Pushed: 2025-01-20T00:46:12.000Z
- Last Synced: 2025-09-28T10:19:34.591Z
- Topics: data-science, machine-learning, parser, pdf, utility
- Language: Python
- Homepage: https://anotherbyte-net.github.io/leaf-focus/
- Size: 2.59 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS
# leaf-focus
Extract structured text from pdf files.
## Install
Install from PyPI using pip:
```bash
pip install leaf-focus
```
leaf-focus also needs the Xpdf executables. Download the [Xpdf command line tools](https://www.xpdfreader.com/download.html) and extract the executable files.
Pass the directory containing the executable files to `leaf-focus` as `--exe-dir`.
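Before passing a directory as `--exe-dir`, it can help to confirm that the extracted executables are really there. The following is a minimal sketch, not part of leaf-focus: the helper `find_xpdf_tools` is hypothetical, and the tool names checked (`pdfinfo`, `pdftotext`, `pdftopng`) are programs shipped with Xpdf that are assumed, not confirmed by this README, to be the ones leaf-focus calls.

```python
from pathlib import Path

# Assumption: these are tools shipped with Xpdf; the README does not say
# which of them leaf-focus actually calls.
XPDF_TOOLS = ("pdfinfo", "pdftotext", "pdftopng")


def find_xpdf_tools(exe_dir, names=XPDF_TOOLS):
    """Map each tool name to its path inside exe_dir, or None if missing."""
    exe_dir = Path(exe_dir)
    found = {}
    for name in names:
        # Accept the bare name (Linux/macOS) and the .exe name (Windows).
        candidates = [exe_dir / name, exe_dir / f"{name}.exe"]
        found[name] = next((p for p in candidates if p.is_file()), None)
    return found
```

Any entry that maps to `None` suggests the directory is not the one the archive's executables were extracted to.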
## Usage
```text
usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical
```
```
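When calling the tool from a Python script, the options above can be assembled into an argument list. This is a sketch under stated assumptions: `build_leaf_focus_command` is a hypothetical helper, not an API of leaf-focus, and it only uses the flags listed in the help text.

```python
def build_leaf_focus_command(input_pdf, output_dir, exe_dir, *,
                             page_images=False, ocr=False,
                             first=None, last=None, log_level=None):
    """Build the leaf-focus argument list from the documented options."""
    cmd = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        cmd.append("--page-images")
    if ocr:
        cmd.append("--ocr")
    # Page limits are only added when given, so the tool's defaults apply.
    if first is not None:
        cmd += ["--first", str(first)]
    if last is not None:
        cmd += ["--last", str(last)]
    if log_level is not None:
        cmd += ["--log-level", log_level]
    # The two positional arguments come last, as in the usage line.
    cmd += [str(input_pdf), str(output_dir)]
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`. Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.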
### Examples
```bash
# Extract the pdf information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages
# Extract the pdf information, embedded text, an image of each page, and Optical Character Recognition results of each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
```
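The command takes one pdf at a time, so handling several files means one invocation per file. The sketch below plans those invocations; `plan_batch` is a hypothetical helper, and the `<name>-pages` output directory naming simply mirrors the `file.pdf file-pages` pattern in the examples above rather than anything the tool requires.

```python
from pathlib import Path


def plan_batch(pdf_paths, exe_dir, out_root="."):
    """Return one leaf-focus command (as an argument list) per pdf file."""
    commands = []
    for pdf in map(Path, pdf_paths):
        # Hypothetical naming: report.pdf -> <out_root>/report-pages
        output_dir = Path(out_root) / f"{pdf.stem}-pages"
        commands.append(
            ["leaf-focus", "--exe-dir", str(exe_dir), str(pdf), str(output_dir)]
        )
    return commands
```

Each planned command can then be run in turn, for example with `subprocess.run(command, check=True)`.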
## Dependencies
- [Xpdf](https://www.xpdfreader.com/download.html)
- [keras-ocr](https://github.com/faustomorales/keras-ocr)
- [TensorFlow](https://www.tensorflow.org) (can optionally be run more efficiently [using one or more GPUs](https://www.tensorflow.org/install/pip#hardware_requirements))
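To see whether TensorFlow can use a GPU on a given machine, the visible devices can be listed. This is a small sketch, separate from leaf-focus itself: `visible_gpu_count` is a hypothetical helper built on TensorFlow's `tf.config.list_physical_devices`, and it returns `None` when TensorFlow is not installed.

```python
def visible_gpu_count():
    """Return the number of GPUs TensorFlow can see, or None if it is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return None
    return len(tf.config.list_physical_devices("GPU"))
```

A result of `0` means TensorFlow is installed but will run on the CPU only.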