# pdf-fmt

A PDF Text Extractor, Processor, and Formatter.

`pdf-fmt` is a powerful utility designed to extract text from PDF
documents and then clean, filter, and structure the output.

It is useful for converting raw PDF dumps into clean, formatted text.

Note that `pdf-fmt` is **under active development**, so you might encounter bugs
and issues.

### Project Status

`pdf-fmt` is currently undergoing a major rewrite. Stay tuned.

> The script installer in the main branch will not work; use the compiled binary
> from the releases page.

### Features

* Raw text extraction
* Copy to clipboard and/or write to file
* Extensive configuration schema
  * See [configuration](#configuration)
* Supports numerous formats
  * See [handling non-PDF formats](#handling-non-pdf-formats)
* Image extraction
  * PNG, WEBP, SVG, etc. supported
* Table extraction
  * Experimental; a configuration file entry to configure behaviour will be added
* And many others to come...

### Why I made this

There is plenty of PDF tooling out there, but it seems to be geared towards
OCR and generally does not help with extracting and processing the output text.

Personally, I use it to collate lecture slides for note taking and knowledge
management. I hope that it will be useful for you as well.

### What `pdf-fmt` is not

This is **not an OCR** (Optical Character Recognition) tool. It only processes
text that you can select with your cursor, i.e. text stored in the PDF structure.
It can also extract images and tables, though the output might not be perfect
every time.

If your file contains images of text, you can use the image extraction feature
and then pass the output images to your OCR tool.

### Handling non-PDF formats

To convert non-PDF files (like `.docx`, `.pptx`, `.odt`) to PDF before
extraction, one of the following **dependencies** needs to be installed and
accessible in your `$PATH`:

* [**LibreOffice's CLI** \(`soffice` or similar\)](https://www.libreoffice.org/)
* [**Pandoc**](https://pandoc.org/)
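
To check ahead of time whether one of these converters is reachable, a small sketch like the following may help. It is not part of `pdf-fmt`; the function name `find_converter` is hypothetical, and `libreoffice` is an assumed alternative executable name that some Linux packages use.

```python
import shutil
from typing import Callable, Iterable, Optional, Tuple

# Executable names to probe. `soffice` and `pandoc` come from the list above;
# `libreoffice` is an assumed alternative name used by some Linux packages.
CANDIDATES = ("soffice", "libreoffice", "pandoc")


def find_converter(
    candidates: Iterable[str] = CANDIDATES,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[Tuple[str, str]]:
    """Return (name, full_path) for the first candidate found on PATH, else None."""
    for name in candidates:
        path = which(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    found = find_converter()
    if found:
        print(f"Converter available: {found[0]} at {found[1]}")
    else:
        print("No converter found; install LibreOffice or Pandoc.")
```

The `which` parameter is injectable only so the lookup can be tested without touching the real `$PATH`.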

### Known issues

> Inaccurate locale enforcement, e.g. `localization` stays `localization`
> (instead of becoming `localisation`) even with UK locale enforcement enabled.

Upstream locale enforcement libraries may yield inaccurate words. I am working
on adding a configuration option to define your own locale mappings to override
Breame's.
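
Until that option exists, the idea of user-defined locale mappings can be sketched roughly as below. This is an illustration only, not `pdf-fmt`'s implementation: the mapping `UK_OVERRIDES` and the function `apply_overrides` are hypothetical names, and the mapping keys are assumed to be lowercase.

```python
import re
from typing import Dict

# Hypothetical user-defined mappings; not part of pdf-fmt's current schema.
# Keys are assumed to be lowercase.
UK_OVERRIDES: Dict[str, str] = {
    "localization": "localisation",
    "color": "colour",
}


def apply_overrides(text: str, overrides: Dict[str, str]) -> str:
    """Replace whole words using the mapping, keeping a leading capital."""
    if not overrides:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in overrides) + r")\b",
        re.IGNORECASE,
    )

    def swap(match: re.Match) -> str:
        word = match.group(0)
        replacement = overrides[word.lower()]
        # Preserve capitalisation of the first letter, e.g. "Color" -> "Colour".
        return replacement.capitalize() if word[0].isupper() else replacement

    return pattern.sub(swap, text)
```

Word boundaries keep the substitution from touching longer words such as `colorful`, which would need their own entries in the mapping.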

# Quick Start

## Prerequisites

* You need to have [Git](https://git-scm.com/install) and
  [Python 3.10 or above](https://www.python.org/downloads/) installed
  * To confirm, run `which git` and `which python` in a Linux/macOS terminal
  * For Windows users, run `where git` and `where python` in Command Prompt
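
The same check can be done from Python on any platform. The sketch below is a hypothetical helper (`check_prerequisites` is not shipped with `pdf-fmt`) that reports a Python version below 3.10 and a missing `git` executable.

```python
import shutil
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the prerequisites


def check_prerequisites(version=sys.version_info, which=shutil.which):
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {version[0]}.{version[1]} found, 3.10 or above required"
        )
    if which("git") is None:
        problems.append("git not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print(problem)
```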

If you are **only downloading the compiled binaries**, you can skip this part.

These prerequisites also apply to compiling from source.

* Other prerequisites are documented in the section on [compiling from source](#compile-from-source)

## Install with uv

Requires [uv](https://github.com/astral-sh/uv).

```
uv tool install git+https://github.com/bladeacer/pdf-fmt
pdf-fmt
```

Or, if you prefer a specific version:

```
uv tool install git+https://github.com/bladeacer/pdf-fmt@0.7.3
pdf-fmt
```

This should work for most platforms and architectures which are supported
by `uv`.

## Download from Release Page

You can get the compiled binary from
[the latest release](https://github.com/bladeacer/pdf-fmt/releases/latest).

We recommend also downloading the associated `.sha256` files to verify checksums.
Place these and the executable in the same folder.

After downloading, open PowerShell on Windows or a terminal on Linux/macOS.

On Windows, run:

```ps1
cd ~/Downloads
CertUtil -hashfile pdf-fmt--.exe SHA256
mv pdf-fmt--.exe pdf-fmt.exe
./pdf-fmt.exe
```

After running `CertUtil`, open the `.sha256` file in your
favourite text editor. If the hash printed in the terminal matches
the hash in the file, your download is intact.

On Linux, run:

```bash
cd ~/Downloads
sha256sum --check pdf-fmt--.sha256
chmod +x pdf-fmt--
mv pdf-fmt-- pdf-fmt
./pdf-fmt
```

If you see OK after calling `sha256sum`, the file is verified.

On macOS, run:

```
cd ~/Downloads
shasum -a 256 --check pdf-fmt--.sha256
chmod +x pdf-fmt--
mv pdf-fmt-- pdf-fmt
xattr -d com.apple.quarantine pdf-fmt
./pdf-fmt
```

If you see OK after calling `shasum`, the file is verified.
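
As a cross-platform alternative to the commands above, the comparison can be sketched with Python's standard library. This assumes the `.sha256` file uses the `<hex digest>  <filename>` layout that `sha256sum --check` expects; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large binaries do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(binary: Path, checksum_file: Path) -> bool:
    """Compare the binary's digest with the first token of the checksum file."""
    # Assumed layout: "<hex digest>  <filename>", as written by `sha256sum`.
    expected = checksum_file.read_text().split()[0].lower()
    return sha256_of(binary) == expected
```

Call `verify(Path(...), Path(...))` with the paths of the downloaded binary and its `.sha256` file; a return value of `True` corresponds to the `OK` printed by `sha256sum`.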

You can also choose to do the following after this step:

* Add it to your system `$PATH`
* Set an alias pointing to the binary, or rename it manually
* Create the [configuration file](#configuration)

### Available architectures for binaries

| Platform | Architecture |
| --- | --- |
| Windows | x86-64 |
| Linux | x86-64 |
| Linux | arm64 |
| macOS | x86-64 |
| macOS | arm64 |

For other platforms or architectures, we recommend using `uv tool install`,
the script installer or compiling from source.

## About Downloaded Binaries

* Choose the binary **corresponding to your operating system**
* macOS is not supported.

If you wish to get an updated version of the executable, download the
latest version and remove the old executable file.

> If you wish to use `pdf-fmt` on macOS, you can use the other methods.
### About Versioning

The version number might be different from the one in the above example.

* We encourage using the latest version, especially when major new features are added

## Script Installer

You can also use `pdf-fmt` via the script installer,
which sets up an isolated
[Python Virtual Environment](https://docs.python.org/3/library/venv.html)
to manage all dependencies.

### Reviewing the scripts

* The script will prompt for confirmation before starting the installation

**Before running scripts, please review their contents by opening the URL they
call in a browser.** E.g. `https://raw.githubusercontent.com/...`

* Alternatively, you can view them [here](./scripts/)

### Windows

[Set execution policy to RemoteSigned.](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy)

Then, open PowerShell.

```ps1
Invoke-RestMethod -Uri 'https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.ps1' -OutFile install.ps1
Get-Content install.ps1

.\install.ps1
```

### Linux or macOS

Open a terminal.

```bash
curl -o install.sh https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.sh
cat install.sh

chmod +x install.sh
./install.sh
```

## Using the Script Installer

The installer places the Python script inside your new `.venv` folder.
Activate the environment and run the script:

For Linux or macOS

```bash
source .venv/bin/activate
chmod +x ./pdf-fmt.py
./pdf-fmt.py
```

You might find the use of the [Makefile](./Makefile) helpful in this regard.

For Windows

```ps1
.venv\Scripts\activate
pdf-fmt
```

The output is printed to the terminal and **copied to your clipboard** by default.

To update the script, run **`git pull`** inside the repository that the
installer clones into the `pdf-fmt` directory.

## Compile from Source

Requires running the script installer or the following commands. This example
assumes the use of Linux. See the [script usage example](#using-the-script-installer)
on how to activate the virtual environment for each OS.

We recommend using [pyenv](https://github.com/pyenv/pyenv) to manage
different versions of Python, and installing [ccache](https://github.com/ccache/ccache)
so that compilation results are cached between builds. You will also need to
meet [Nuitka's requirements](https://github.com/Nuitka/Nuitka).

You might find the use of the [Makefile](./Makefile) helpful in this regard.

### Pyenv setup (optional)

After installing pyenv, follow its instructions on configuring with `pyenv init`.

Then, run the following immediately after you change directory into the cloned repository.

```bash
pyenv install 3.11
pyenv local 3.11
```

You can use any other target Python version, though `pdf-fmt` primarily supports
Python 3.10 or above.

### Linux/macOS

```bash
# Either clone the repository or change directory to it if you have used the
# script installer prior
git clone --depth 1 https://github.com/bladeacer/pdf-fmt
cd pdf-fmt
chmod +x ./scripts/compile.sh
./scripts/compile.sh
```

The [script](./scripts/compile.sh) creates a separate virtual environment for
compiling from source. It outputs the binary to the `build/` directory once
compilation is done.

> Compilation too slow? Increase the number specified in the jobs count.
> **Only do this if you have sufficient CPU cores and hardware.**
> Remove the `--low-memory` flag at your own risk.
>
> If the compilation takes up too much memory, it will crash and exit without completing.

Compilation logs are written to `nuitka-build.log`.
Crash reports are written to `nuitka-crash-report.xml`.

Alternatively, you can call [this script on Linux or macOS](./scripts/compile.sh).

## Configuration

The configuration options available are documented in the
[`pdf-fmt.yaml`](./pdf-fmt.yaml) file.

* **`filters`**: Regex rules for character exclusion and pattern-based filtering
  * Excludes footers matching a regex pattern.
  * Includes optional spelling enforcement (UK or US English).
* **`conversion`**: Lists supported non-PDF formats (see
  [handling non\-PDF formats](#handling-non-pdf-formats)).
* **`formatting`**: Controls line re-wrapping and indentation conversion
  * Converts single-space indents to Markdown lists.
  * Enforces capitalisation at the start of each line.
* **`actions`**: Defines post-extraction behaviour
  * Copies to the system clipboard and/or writes to an output file.
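
To make the `filters` and `formatting` behaviours more concrete, here is a rough Python sketch of three of them: footer exclusion by regex, single-space indents becoming Markdown list items, and capitalisation at the start of each line. It is an illustration under assumptions, not `pdf-fmt`'s actual code: the footer pattern and all function names are hypothetical and are not taken from `pdf-fmt.yaml`.

```python
import re
from typing import List

# Hypothetical footer pattern, e.g. "Page 3 of 12"; not taken from pdf-fmt.yaml.
FOOTER_PATTERN = re.compile(r"^\s*Page \d+ of \d+\s*$")


def drop_footers(lines: List[str], pattern: re.Pattern = FOOTER_PATTERN) -> List[str]:
    """Remove lines that match the footer pattern in full."""
    return [line for line in lines if not pattern.match(line)]


def indent_to_list(line: str) -> str:
    """Turn a line indented by exactly one space into a Markdown list item."""
    if line.startswith(" ") and not line.startswith("  ") and line.strip():
        return "* " + line[1:]
    return line


def capitalise_start(line: str) -> str:
    """Upper-case the first alphabetic character of the line, if there is one."""
    for index, char in enumerate(line):
        if char.isalpha():
            return line[:index] + char.upper() + line[index + 1:]
    return line


def clean(lines: List[str]) -> List[str]:
    """Apply the three steps in order: filter, convert indents, capitalise."""
    return [capitalise_start(indent_to_list(line)) for line in drop_footers(lines)]
```

Filtering runs before formatting here so that footer lines are never reformatted; whether `pdf-fmt` uses the same order is not stated in this README.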

For extensive customisation, consider creating your own
configuration file. If you do, ensure that it is named `pdf-fmt.yaml`.

### Where to place the configuration file

`pdf-fmt` will look for the configuration file in the following locations.

* `$PDF_FMT_CONFIG_PATH` environment variable
* Default configuration directory
  * `APPDATA` if you are on Windows
  * `$XDG_CONFIG_HOME` or `~/.config` if you are on Linux
* The current working directory of the script
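
The lookup can be pictured with the following sketch. It rests on several assumptions that this README does not confirm: that the list order is the precedence order, that `$PDF_FMT_CONFIG_PATH` points directly at the file, and that the file sits at the top level of the configuration directory. The function names are hypothetical.

```python
import os
import sys
from pathlib import Path
from typing import List, Mapping, Optional

CONFIG_NAME = "pdf-fmt.yaml"


def candidate_paths(
    env: Mapping[str, str] = os.environ,
    platform: str = sys.platform,
    cwd: Optional[Path] = None,
    home: Optional[Path] = None,
) -> List[Path]:
    """List config locations in the order given above (assumed to be precedence)."""
    cwd = cwd or Path.cwd()
    home = home or Path.home()
    paths: List[Path] = []

    # 1. Explicit override; assumed here to point directly at the file.
    if env.get("PDF_FMT_CONFIG_PATH"):
        paths.append(Path(env["PDF_FMT_CONFIG_PATH"]))

    # 2. Default configuration directory for the platform.
    if platform.startswith("win"):
        if env.get("APPDATA"):
            paths.append(Path(env["APPDATA"]) / CONFIG_NAME)
    else:
        xdg = env.get("XDG_CONFIG_HOME")
        base = Path(xdg) if xdg else home / ".config"
        paths.append(base / CONFIG_NAME)

    # 3. Current working directory.
    paths.append(cwd / CONFIG_NAME)
    return paths


def find_config(**kwargs) -> Optional[Path]:
    """Return the first candidate that exists as a file, or None."""
    return next((p for p in candidate_paths(**kwargs) if p.is_file()), None)
```

Printing `candidate_paths()` on your own machine is a quick way to see where a configuration file could be placed under these assumptions.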

### Development status

Note: the configuration schema in this repository reflects the development branch.

The released binaries might not support some options yet. These are indicated
with `[DEV]`.

## Supported platforms

This table documents the currently supported platforms for `pdf-fmt` and
highlights platforms where we are seeking community confirmation of functionality.

* Primarily, we aim to support the latest, most widely used version of each platform
  * This means that LTS or stable versions of a platform are sometimes preferred
    when testing for compatibility

We welcome your contributions! Please help us by:

* Opening a pull request (PR) to confirm that `pdf-fmt` works on your platform,
noting any specific setup caveats or workarounds.
* Creating an issue if you encounter problems with the installer script or
compiling from source.

| Platform | Display Protocol | C Standard Library | Known to work? | Comments |
| :--- | :--- | :--- | :--- | :--- |
| **Alpine Linux x64 (musl-based)** | X11 | `musl` | Untested | Contributions are welcome |
| **Arch Linux x64** | Wayland | `glibc` | Untested | Contributions are welcome |
| **Arch Linux x64** | X11 | `glibc` | Untested | Contributions are welcome |
| **Debian x64 (glibc)** | Wayland | `glibc` | Untested | Contributions are welcome |
| **Debian x86 (glibc)** | X11 | `glibc` | Untested | Contributions are welcome |
| **EndeavourOS x64 (Arch-based)** | Wayland | `glibc` | Partial | Script works out of the box. Contributions are welcome for binary/compiling from source. |
| **EndeavourOS x64 (Arch-based)** | X11 | `glibc` | Yes | Binary/script/compiling from source works. |
| **Fedora x64 (RPM-based)** | Wayland | `glibc` | Partial | Binary works out of the box. Contributions are welcome for script/compiling from source |
| **Fedora x64 (RPM-based)** | X11 | `glibc` | Untested | Contributions are welcome |
| **FreeBSD stable x64** | X11 | `BSD libc` | Untested | Contributions are welcome |
| **NetBSD x64** | X11 | `BSD libc` | Untested | Contributions are welcome |
| **OpenBSD x64** | X11 | `BSD libc` | Untested | Contributions are welcome |
| **Ubuntu LTS x64 (Debian-based)** | Wayland | `glibc` | Untested | Contributions are welcome |
| **Ubuntu LTS x64 (Debian-based)** | X11 | `glibc` | Untested | Contributions are welcome |
| **macOS 14 (Sonoma)** | N/A | `libSystem` (BSD `libc`) | Untested | Contributions are welcome |
| **Windows 10 x64** | N/A | `MSVCRT` (via `MSVC`/`MinGW`) | Untested | Contributions are welcome |
| **Windows 11 x64** | N/A | `MSVCRT` (via `MSVC`/`MinGW`) | Partial | Binary works out of the box. Contributions are welcome for script/compiling from source |
| **Windows Subsystem for Linux (WSL) 2 x64** | N/A | `glibc`/`musl`| Untested | Contributions are welcome |

### Note: Linux users

To check the C Standard Library used on Linux, run `ldd --version`.

To check the Display Protocol currently used on Linux, run `echo $XDG_SESSION_TYPE`.

You may need to install [patchelf](https://github.com/NixOS/patchelf).

* See [Compile from source](#compile-from-source) for more details.
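
When reporting that `pdf-fmt` works (or does not work) on your platform, the details asked for in the table can be gathered with a short Python sketch like this one. `describe_platform` is a hypothetical helper, not part of `pdf-fmt`.

```python
import os
import platform
from typing import Dict, Mapping


def describe_platform(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect the details asked for in the supported-platforms table."""
    libc_name, libc_version = platform.libc_ver()
    return {
        "system": platform.system(),
        "machine": platform.machine(),
        # libc_ver() typically reports glibc; it may return empty strings on
        # musl or non-Linux systems, where `ldd --version` is more reliable.
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
        "session": env.get("XDG_SESSION_TYPE", "unknown"),
        "python": platform.python_version(),
    }


if __name__ == "__main__":
    for key, value in describe_platform().items():
        print(f"{key}: {value}")
```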

## Supported Python Versions

| Python Version | Known to work? | Comments |
| --- | --- | --- |
| 3.10 | Yes | Compiling from source, script works. Used as default compilation/script version. |
| 3.11 | Yes | Compiling from source, script works. |
| 3.12 | Yes | Compiling from source, script works. Used in GitHub Actions. |
| 3.13 | Partial | Compiling from source, script works. |
| 3.14 | Untested | PRs welcome |

## Contributing

Create your own fork or clone the repository. The example below shows cloning
this repository on Linux.

Do note that this repository has its own [Code of Conduct](./CODE_OF_CONDUCT.md)
and [Contributing Guide](./CONTRIBUTING.md).

### Setup

```bash
git clone https://github.com/bladeacer/pdf-fmt
# Enter the cloned repository so the relative script paths resolve
cd pdf-fmt
chmod +x scripts/setup.sh
./scripts/dev.sh
```

## Benchmarks

TBC

### A note on Compatibility

The script, compiled binaries and compiling from source should work for all major
operating systems that support `Git`, `Python`,
[`pdfplumber`](https://github.com/jsvine/pdfplumber) and
[`pyperclip`](https://github.com/asweigart/pyperclip).

> Note: These dependencies are slightly larger than their C equivalents, though this
> is a calculated trade-off.

## Tests

### Unit Tests

Unit tests use `unittest`, which is part of Python's standard library. You can
make use of the script installer for cloning the repository.

```bash
# -v enables verbose output; -s sets the start directory for discovery
python -m unittest discover -v -s tests
```

Alternatively, you can run the [script](./scripts/tests.sh).

## License

GPLv3. See the [license file](./LICENSE) for details.

### License Notice

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see https://www.gnu.org/licenses/.

## Credits

Existing PDF tooling for inspiration, LibreOffice CLI.
Nuitka for compilation, GitHub for hosting and CI.

My friend Potato for testing the binary on Windows.

My friend [Floodlight](https://github.com/Gonzalo-D-Sales) for testing the
binary on Fedora.

The code of conduct was adopted from the
[Contributor Covenant](https://www.contributor-covenant.org/).

The contributing guide was adopted from [conduct](https://github.com/sindresorhus/conduct).