https://github.com/anotherbyte-net/leaf-focus

Extract structured text from pdf files.

Host: GitHub
URL: https://github.com/anotherbyte-net/leaf-focus
Owner: anotherbyte-net
License: apache-2.0
Created: 2022-08-28T09:03:20.000Z (almost 4 years ago)
Default Branch: main
Last Pushed: 2025-01-20T00:46:12.000Z (over 1 year ago)
Last Synced: 2025-09-28T10:19:34.591Z (10 months ago)
Topics: data-science, machine-learning, parser, pdf, utility
Language: Python
Homepage: https://anotherbyte-net.github.io/leaf-focus/
Size: 2.59 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 6
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Codeowners: .github/CODEOWNERS

# leaf-focus

Extract structured text from pdf files.

## Install

Install from PyPI using pip:

```bash
pip install leaf-focus
```

[![PyPI](https://img.shields.io/pypi/v/leaf-focus)](https://pypi.org/project/leaf-focus/)

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/leaf-focus)

[![GitHub Workflow Status (branch)](https://img.shields.io/github/actions/workflow/status/anotherbyte-net/leaf-focus/test-package.yml?branch=main)](https://github.com/anotherbyte-net/leaf-focus/actions)

Download the [Xpdf command line tools](https://www.xpdfreader.com/download.html) and extract the executable files.

Provide the directory containing the executable files as `--exe-dir`.
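
Before passing a directory as `--exe-dir`, it can help to confirm that the extracted executables are actually there. The following is a minimal sketch, not part of leaf-focus. The helper name `find_missing_tools` is hypothetical, and the default tool names (`pdfinfo`, `pdftotext`, `pdftopng`) are an assumption: they are tools shipped with Xpdf, but this README does not state which ones leaf-focus calls.

```python
from pathlib import Path


def find_missing_tools(exe_dir, tool_names=("pdfinfo", "pdftotext", "pdftopng")):
    """Return the tool names that have no matching file in exe_dir.

    The default tool_names are an assumption for illustration only;
    adjust them to the tools your leaf-focus version needs.
    """
    exe_path = Path(exe_dir)
    missing = []
    for name in tool_names:
        # Accept the bare name (Linux, macOS) and the .exe name (Windows).
        candidates = [exe_path / name, exe_path / f"{name}.exe"]
        if not any(candidate.is_file() for candidate in candidates):
            missing.append(name)
    return missing
```

An empty list means every listed tool was found in the directory.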

## Usage

```text
usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical
```

### Examples

```bash
# Extract the pdf information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages

# Extract the pdf information, embedded text, an image of each page, and Optical Character Recognition results of each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
```
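
The same commands can be driven from a Python script, for example to process several files in a loop. The sketch below only assembles the command line documented in the usage text above and runs it with `subprocess`. It does not use any leaf-focus Python API, and the helper names `build_command` and `run_leaf_focus` are hypothetical.

```python
import subprocess


def build_command(exe_dir, input_pdf, output_dir, *, page_images=False,
                  ocr=False, first=None, last=None, log_level=None):
    """Build the leaf-focus argument list from the documented CLI options."""
    command = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        command.append("--page-images")
    if ocr:
        command.append("--ocr")
    if first is not None:
        command.extend(["--first", str(first)])
    if last is not None:
        command.extend(["--last", str(last)])
    if log_level is not None:
        command.extend(["--log-level", log_level])
    # The two positional arguments come last, as in the usage text.
    command.extend([str(input_pdf), str(output_dir)])
    return command


def run_leaf_focus(exe_dir, input_pdf, output_dir, **options):
    """Run leaf-focus and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(exe_dir, input_pdf, output_dir, **options),
                          check=True)
```

For example, `run_leaf_focus("path-to-xpdf-exe-dir", "file.pdf", "file-pages", ocr=True)` would run the second command shown above, assuming `leaf-focus` is installed and on the `PATH`. Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.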

## Dependencies

- [xpdf](https://www.xpdfreader.com/download.html)

- [keras-ocr](https://github.com/faustomorales/keras-ocr)

- [TensorFlow](https://www.tensorflow.org) (can optionally be run more efficiently [using one or more GPUs](https://www.tensorflow.org/install/pip#hardware_requirements))
