{"id":49414154,"url":"https://github.com/bladeacer/pdf-fmt","last_synced_at":"2026-04-29T02:12:58.370Z","repository":{"id":319067670,"uuid":"1077419389","full_name":"bladeacer/pdf-fmt","owner":"bladeacer","description":"A PDF extractor, processor and formatter. Supports regex based exclusions and other niceties.","archived":false,"fork":false,"pushed_at":"2026-04-29T00:44:01.000Z","size":226,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-29T01:16:27.884Z","etag":null,"topics":["pdf","pdf-image-extractor","pdf-table-extraction","pdf-text-extraction","python","text-formatting"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/bladeacer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"bladeacer","patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2025-10-16T08:19:45.000Z","updated_at":"2026-04-29T00:43:55.000Z","dependencies_parsed_at":"2026-04-13T17:01:25.444Z","dependency_job_id":null,"html_url":"https://github.com/bladeacer/pdf-fmt","commit_stats":null,"previous_names":["bladeacer/pdf-fmt"],"tags_count":12,"template":false,"template_full_name":null,"purl
":"pkg:github/bladeacer/pdf-fmt","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bladeacer%2Fpdf-fmt","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bladeacer%2Fpdf-fmt/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bladeacer%2Fpdf-fmt/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bladeacer%2Fpdf-fmt/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/bladeacer","download_url":"https://codeload.github.com/bladeacer/pdf-fmt/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/bladeacer%2Fpdf-fmt/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32407236,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-28T19:38:08.556Z","status":"online","status_checked_at":"2026-04-29T02:00:06.602Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["pdf","pdf-image-extractor","pdf-table-extraction","pdf-text-extraction","python","text-formatting"],"created_at":"2026-04-29T02:12:57.113Z","updated_at":"2026-04-29T02:12:58.365Z","avatar_url":"https://github.com/bladeacer.png","language":"Python","funding_links":["https://github.com/sponsors/bladeacer"],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003ca 
href=\"https://github.com/bladeacer/pdf-fmt/releases/latest\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/v/release/bladeacer/pdf-fmt?style=for-the-badge\u0026sort=semver\u0026logo=semantic-release\" referrerpolicy=\"noreferrer\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/bladeacer/pdf-fmt/blob/master/LICENSE\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/license/bladeacer/pdf-fmt?style=for-the-badge\" referrerpolicy=\"noreferrer\"\u003e\n    \u003c/a\u003e\n    \u003ca href=\"https://github.com/bladeacer/pdf-fmt/actions\"\u003e\n        \u003cimg src=\"https://img.shields.io/github/actions/workflow/status/bladeacer/pdf-fmt/release.yml?style=for-the-badge\u0026logo=github\" referrerpolicy=\"noreferrer\"\u003e\n    \u003c/a\u003e\n\u003c/p\u003e\n\n# pdf-fmt\n\nA PDF Text Extractor, Processor, and Formatter.\n\n`pdf-fmt` is a powerful utility designed to extract text from PDF\ndocuments and then clean, filter, and structure the output.\n\nIt is useful for converting raw PDF dumps into clean, formatted text.\n\nNote that `pdf-fmt` is **under active development**, you might encounter bugs\nand issues.\n\n### Project Status\n\n`pdf-fmt` is currently undergoing a major rewrite. Stay tuned.\n\u003e The script installer in the main branch will not work, use the compiled binary\n\u003e under the releases page.\n\n### Features\n\n* Raw text extraction\n  * Copy to clipboard and/or write to file\n* Extensive configuration schema\n  * See [configuration](#configuration)\n* Supports numerous formats\n  * See [handling non-PDF formats](#handling-non-pdf-formats)\n* Image extraction\n  * PNG, WEBP, SVG, etc. 
supported\n* Table extraction\n  * Experimental, will add a configuration file entry to configure behaviour\n* and many others to come...\n\n### Why I made this\n\nThere is plenty of PDF tooling out there, but it seems to be geared towards\nOCR and generally does not help with extracting and processing the output text.\n\nPersonally, I use it to collate lecture slides for note taking and knowledge\nmanagement. I hope that it will be useful for you as well.\n\n### What `pdf-fmt` is not\n\nThis is **not an OCR** (Optical Character Recognition) tool. It only processes\ntext in the PDF structure that you can select with your cursor. It is also able\nto extract images and tables, though the output might not be perfect every time.\n\nIf your file contains images of text, you can use the image extraction feature\nbefore passing the output images to your OCR tool.\n\n### Handling non-PDF formats\n\nFor converting non-PDF files (like `.docx`, `.pptx`, `.odt`) to PDF before\nextraction, at least one of these **dependencies** needs to be installed and accessible in your `$PATH`:\n\n* [**LibreOffice's CLI** \\(`soffice` or similar\\)](https://www.libreoffice.org/)\n* [**Pandoc**](https://pandoc.org/)\n\n### Known issues\n\n\u003e Inaccurate locale enforcement e.g. localization -\u003e localization even\n\u003e with UK locale enforcement enabled.\n\nUpstream locale enforcement libraries may yield inaccurate words. 
I am working\non adding a configuration option to define your own locale mappings to override\nBreame's.\n\n# Quick Start\n\n## Prerequisites\n\n* You need to have [Git](https://git-scm.com/install) and\n[Python 3.10 or above](https://www.python.org/downloads/) installed\n  * To confirm, run `which git` and `which python` in a Linux/macOS terminal\n  * For Windows users, run `where git` and `where python` in Command Prompt\n\nIf you are **only downloading the compiled binaries**, you can ignore this part.\n\nThese prerequisites also apply to compiling from source.\n\n* Other prerequisites are documented in the section on [compiling from source](#compile-from-source)\n\n## Install with uv\n\nRequires [uv](https://github.com/astral-sh/uv).\n\n```\nuv tool install git+https://github.com/bladeacer/pdf-fmt\npdf-fmt\n```\n\nOr, if you prefer a specific version:\n\n```\nuv tool install git+https://github.com/bladeacer/pdf-fmt@0.7.3\npdf-fmt\n```\n\nThis should work for most platforms and architectures which are supported\nby `uv`.\n\n## Download from Release Page\n\nYou can get the compiled binary from\n[the latest release](https://github.com/bladeacer/pdf-fmt/releases/latest).\n\nWe recommend also downloading the associated `.sha256` files to verify checksums.\nPlace these and the executable in the same folder.\n\nAfter downloading, open PowerShell on Windows or a terminal on Linux/macOS.\n\nOn Windows, run:\n\n```ps1\ncd ~/Downloads\nCertUtil -hashfile pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e.exe SHA256\nmv pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e.exe pdf-fmt.exe\n./pdf-fmt.exe\n```\n\nAfter running `CertUtil`, open the `.sha256` file in your\nfavourite text editor. 
If the string in the terminal matches\nthe string in the file, your download matches the published checksum.\n\nOn Linux, run:\n\n```bash\ncd ~/Downloads\nsha256sum --check pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e.sha256\nchmod +x pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e\nmv pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e pdf-fmt\n./pdf-fmt\n```\n\nIf you see OK after calling `sha256sum`, the file is verified.\n\nOn macOS, run:\n\n```bash\ncd ~/Downloads\nshasum -a 256 --check pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e.sha256\nchmod +x pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e\nmv pdf-fmt-\u003carch\u003e-\u003cversion-no\u003e pdf-fmt\nxattr -d com.apple.quarantine pdf-fmt\n./pdf-fmt\n```\n\nIf you see OK after calling `shasum`, the file is verified.\n\nYou can also choose to do the following after this step:\n\n* Add it to your system `$PATH`\n* Set an alias pointing to the binary, or rename it manually\n* Create the [configuration file](#configuration)\n\n### Available architectures for binaries\n\n| Platform | Architecture |\n| --- | --- |\n| Windows | x86-64 |\n| Linux | x86-64 |\n| Linux | arm64 |\n| macOS | x86-64 |\n| macOS | arm64 |\n\nFor other platforms or architectures, we recommend using `uv tool install`,\nthe script installer or compiling from source.\n\n## About Downloaded Binaries\n\n* Choose the binary **corresponding to your operating system**\n* macOS is not supported.\n\nIf you wish to get an updated version of the executable, download the\nlatest version and remove the old executable file.\n\u003e If you wish to use `pdf-fmt` on macOS, you can use the other methods\n\n### About Versioning\n\nThe version number might be different from the one in the above example.\n\n* We encourage using the latest version, especially when major new features are added\n\n## Script Installer\n\nYou can also use `pdf-fmt` via the script installer,\nwhich sets up an isolated\n[Python Virtual Environment](https://docs.python.org/3/library/venv.html)\nto 
manage all dependencies.\n\n### Reviewing the scripts\n\n* The script will prompt for confirmation before starting the installation\n\n**Before running scripts, please review their contents by opening the URL they\ncall in a browser.** E.g. `https://raw.githubusercontent.com/...`\n\n* Alternatively, you can view them [here](./scripts/)\n\n### Windows\n\n[Set execution policy to RemoteSigned.](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.security/set-executionpolicy)\n\nThen, open PowerShell.\n\n```ps1\nInvoke-RestMethod -Uri 'https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.ps1' -OutFile install.ps1\nGet-Content install.ps1\n\n.\\install.ps1\n```\n\n### Linux or macOS\n\nOpen a terminal.\n\n```bash\ncurl -o install.sh https://raw.githubusercontent.com/bladeacer/pdf-fmt/refs/heads/main/scripts/install.sh\ncat install.sh\n\nchmod +x install.sh\n./install.sh\n```\n\n## Using the Script Installer\n\nThe installer places the Python script inside your new `.venv` folder.\nActivate the environment and run the script:\n\nFor Linux or macOS\n\n```bash\nsource .venv/bin/activate\nchmod +x ./pdf-fmt.py\n./pdf-fmt.py\n```\n\nYou might find the use of the [Makefile](./Makefile) helpful in this regard.\n\nFor Windows\n\n```ps1\n.venv\\Scripts\\activate\npdf-fmt\n```\n\nThe output is printed to the terminal and **copied to your clipboard** by default.\n\nTo update the script, run **`git pull`** in the repository the script creates\nunder the `pdf-fmt` directory.\n\n## Compile from Source\n\nRequires running the script installer or the following commands. This example\nassumes the use of Linux. See the [script usage example](#using-the-script-installer)\non how to activate virtual environment for each OS.\n\nIt is recommended to use [pyenv](https://github.com/pyenv/pyenv) to manage\ndifferent versions of Python. 
It is also recommended to install [ccache](https://github.com/ccache/ccache)\nfor compiled binaries to be cached. You would also need [the following `nuitka` requirements](https://github.com/Nuitka/Nuitka).\n\nYou might find the use of the [Makefile](./Makefile) helpful in this regard.\n\n### Pyenv setup (optional)\n\nAfter installing pyenv, follow its instructions on configuring with `pyenv init`.\n\nThen, run the following immediately after you change directory into the cloned repository.\n\n```bash\npyenv install 3.11\npyenv local 3.11\n```\n\nYou can use any other target Python version, though `pdf-fmt` primarily supports\nPython 3.10 or above.\n\n### Linux/macOS\n\n```bash\n# Either clone the repository or change directory to it if you have used the\n# script installer prior\ngit clone --depth 1 https://github.com/bladeacer/pdf-fmt\ncd pdf-fmt\nchmod +x ./scripts/compile.sh\n./scripts/compile.sh\n```\n\nThe [script](./scripts/compile.sh) creates a separate virtual environment for\ncompiling from source. It would output the binary to the `build/` directory once\ncompiling is done.\n\u003e Compilation too slow? 
Increase the number specified in the jobs count.\n\u003e **Only do this if you have sufficient CPU cores and hardware.**\n\u003e Remove the `--low-memory` flag at your own risk.\n\u003e\n\u003e If the compilation takes up too much memory, it will crash and exit without completing.\n\nCompilation logs are written to `nuitka-build.log`.\nCrash reports are written to `nuitka-crash-report.xml`.\n\nAlternatively, you can call [this script on Linux or macOS](./scripts/compile.sh).\n\n## Configuration\n\nThe configuration options available are documented in the\n[`pdf-fmt.yaml`](./pdf-fmt.yaml) file.\n\n* **`filters`**: Regex rules for character exclusion and pattern-based filtering\n  * excluding footers that match a regex pattern\n  * optional spelling enforcement (UK or US English)\n* **`conversion`**: Lists supported non-PDF formats (see\n[handling non\\-PDF formats](#handling-non-pdf-formats)).\n* **`formatting`**: Controls line re-wrapping and indentation conversion\n  * converting single-space indents to Markdown lists\n  * enforcing capitalisation at the start of each line\n* **`actions`**: Defines post-extraction behaviour\n  * copying to the system clipboard and/or writing to an output file\n\nFor extensive customisation, consider creating your own\nconfiguration file. If you do, ensure that it is named `pdf-fmt.yaml`.\n\n### Where to place the configuration file\n\n`pdf-fmt` will look for the configuration file in the following locations.\n\n* `$PDF_FMT_CONFIG_PATH` environment variable\n* Default configuration directory\n  * `APPDATA` if you are on Windows\n  * `$XDG_CONFIG_HOME` or `~/.config` if you are on Linux\n* The current working directory of the script\n\n### Development status\n\nNote: the configuration schema in this repository reflects the development branch.\n\nThe released binaries might not support some options yet. 
These are indicated\nwith `[DEV]`.\n\n## Supported platforms\n\nThis table documents the currently supported platforms for `pdf-fmt` and\nhighlights platforms where we are seeking community confirmation of functionality.\n\n* Primarily, we aim to support the latest, most widely used version of each platform\n* This means that LTS or stable versions of a platform are sometimes preferred\nwhen testing for compatibility\n\nWe welcome your contributions! Please help us by:\n\n* Opening a pull request (PR) to confirm that `pdf-fmt` works on your platform,\nnoting any specific setup caveats or workarounds.\n* Creating an issue if you encounter problems with the installer script or\ncompiling from source.\n\n| Platform | Display Protocol | C Standard Library | Known to work? | Comments |\n| :--- | :--- | :--- | :--- | :--- |\n| **Alpine Linux x64 (musl-based)** | X11 | `musl` | Untested | Contributions are welcome |\n| **Arch Linux x64** | Wayland | `glibc` | Untested | Contributions are welcome |\n| **Arch Linux x64** | X11 | `glibc` | Untested | Contributions are welcome |\n| **Debian x64 (glibc)** | Wayland | `glibc` | Untested | Contributions are welcome |\n| **Debian x86 (glibc)** | X11 | `glibc` | Untested | Contributions are welcome |\n| **EndeavourOS x64 (Arch-based)** | Wayland | `glibc` | Partial | Script works out of the box. Contributions are welcome for binary/compiling from source. |\n| **EndeavourOS x64 (Arch-based)** | X11 | `glibc` | Yes | Binary/script/compiling from source works. |\n| **Fedora x64 (RPM-based)** | Wayland | `glibc` | Partial | Binary works out of the box. 
Contributions are welcome for script/compiling from source |\n| **Fedora x64 (RPM-based)** | X11 | `glibc` | Untested | Contributions are welcome |\n| **FreeBSD stable x64** | X11 | `BSD libc` | Untested | Contributions are welcome |\n| **NetBSD x64** | X11 | `BSD libc` | Untested | Contributions are welcome |\n| **OpenBSD x64** | X11 | `BSD libc` | Untested | Contributions are welcome |\n| **Ubuntu LTS x64 (Debian-based)** | Wayland | `glibc` | Untested | Contributions are welcome |\n| **Ubuntu LTS x64 (Debian-based)** | X11 | `glibc` | Untested | Contributions are welcome |\n| **macOS 14 (Sonoma)** | N/A | `libSystem` (BSD `libc`) | Untested | Contributions are welcome |\n| **Windows 10 x64** | N/A | `MSVCRT` (via `MSVC`/`MinGW`) | Untested | Contributions are welcome |\n| **Windows 11 x64** | N/A | `MSVCRT` (via `MSVC`/`MinGW`) | Partial | Binary works out of the box. Contributions are welcome for script/compiling from source |\n| **Windows Subsystem for Linux (WSL) 2 x64** | N/A | `glibc`/`musl`| Untested | Contributions are welcome |\n\n### Note: Linux users\n\nTo check the C Standard Library used on Linux, run `ldd --version`.\n\nTo check the Display Protocol currently used on Linux, run `echo $XDG_SESSION_TYPE`.\n\nYou may need to install [patchelf](https://github.com/NixOS/patchelf)\n\n* See [Compile from source](#compile-from-source) for more details.\n\n## Supported Python Versions\n\n| Python Version | Known to work? | Comments |\n| --- | --- | --- |\n| 3.10 | Yes | Compiling from source, script works. Used as default compilation/script version. |\n| 3.11 | Yes | Compiling from source, script works. |\n| 3.12 | Yes | Compiling from source, script works. Used in GitHub Actions. |\n| 3.13 | Partial | Compiling from source, script works. |\n| 3.14 | Untested | PRs welcome |\n\n## Contributing\n\nCreate your own fork or clone the repository. 
The example below shows cloning\nthis repository on Linux.\n\nDo note that this repository has its own [Code of Conduct](./CODE_OF_CONDUCT.md)\nand [Contributing Guide](./CONTRIBUTING.md).\n\n### Setup\n\n```bash\ngit clone https://github.com/bladeacer/pdf-fmt\n# Enter the cloned repository so the relative script paths resolve\ncd pdf-fmt\nchmod +x scripts/setup.sh\n./scripts/dev.sh\n```\n\n## Benchmarks\n\nTBC\n\n### A note on Compatibility\n\nThe script, compiled binaries and compiling from source should work for all major\noperating systems that support `Git`, `Python`,\n[`pdfplumber`](https://github.com/jsvine/pdfplumber) and\n[`pyperclip`](https://github.com/asweigart/pyperclip).\n\n\u003e Note: These dependencies are slightly larger than their C equivalents, though this\n\u003e is a calculated trade-off.\n\n## Tests\n\n### Unit Tests\n\nTests use `unittest`, which is part of Python's standard library. You can make use of the\nscript installer for cloning the repository.\n\n```bash\n# -v enables verbose output; -s sets the start directory for discovery\npython -m unittest discover -v -s tests\n```\n\nAlternatively, you can run the [script](./scripts/tests.sh).\n\n## License\n\nGPLv3. See the [license file](./LICENSE) for details.\n\n### License Notice\n\nThis program is free software: you can redistribute it and/or modify it under\nthe terms of the GNU General Public License as published by the Free Software\nFoundation, either version 3 of the License, or (at your option) any later version.\n\nThis program is distributed in the hope that it will be useful, but WITHOUT ANY\nWARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A\nPARTICULAR PURPOSE. See the GNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License along with\nthis program. 
If not, see https://www.gnu.org/licenses/.\n\n## Credits\n\nExisting PDF tooling for inspiration, LibreOffice CLI.\nNuitka for compilation, GitHub for hosting and CI.\n\nMy friend Potato for testing the binary on Windows.\n\nMy friend [Floodlight](https://github.com/Gonzalo-D-Sales) for testing the\nbinary on Fedora.\n\nThe code of conduct was adopted from the\n[Contributor Covenant](https://www.contributor-covenant.org/).\n\nThe contributing guide was adopted from [conduct](https://github.com/sindresorhus/conduct).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbladeacer%2Fpdf-fmt","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbladeacer%2Fpdf-fmt","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbladeacer%2Fpdf-fmt/lists"}