https://github.com/eli64s/pdflex
✁ Python tools for PDF automation.
pdf-automation pdf-converter pdf-data-extraction pdf-document pdf-document-parser pdf-document-processor pdf-extractor pdf-generator pdf-library pdf-manipulation pdf-parser pdf-processor pdf-python pdf-regex pdf-search pdf-text-extraction pdf-tools python-pdf python-pdf-tools
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/eli64s/pdflex
- Owner: eli64s
- License: mit
- Created: 2020-12-16T09:49:47.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2025-02-18T15:57:04.000Z (4 months ago)
- Last Synced: 2025-03-19T17:18:16.468Z (3 months ago)
- Topics: pdf-automation, pdf-converter, pdf-data-extraction, pdf-document, pdf-document-parser, pdf-document-processor, pdf-extractor, pdf-generator, pdf-library, pdf-manipulation, pdf-parser, pdf-processor, pdf-python, pdf-regex, pdf-search, pdf-text-extraction, pdf-tools, python-pdf, python-pdf-tools
- Language: Python
- Homepage:
- Size: 457 KB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
## What is `PDFlex`?
PDFlex is a PDF processing toolkit for Python. It provides tools for PDF validation, text extraction, merging (with custom separator pages), and filename-based searching, all aimed at streamlining PDF automation workflows.
## Features
- **PDF Validation:** Quickly verify if a file is a valid PDF.
- **Text Extraction:** Extract text from PDFs using either PyMuPDF or PyPDF.
- **Directory Processing:** Process entire directories of PDFs for text extraction.
- **PDF Merging:** Merge multiple PDF files into one, automatically inserting a custom separator page between documents.
  - The separator page displays the title (derived from the filename) with underscores and hyphens removed.
  - Supports both portrait and landscape separator pages (ideal for lecture slides).
- **PDF Searching:** Recursively search for PDFs in a directory based on filename patterns (e.g., numeric float prefixes).

---
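The validation feature listed above is not demonstrated elsewhere in this README. As a rough illustration of what a quick validity check can look like, the sketch below tests only for the `%PDF-` header marker using the standard library. The function name `looks_like_pdf` is hypothetical and is not part of the PDFlex API; the library's own check may be more thorough.

```python
from pathlib import Path

# A well-formed PDF file begins with this header marker.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if `path` is a file that starts with the PDF header marker.

    Hypothetical helper for illustration only; not a PDFlex function.
    """
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC
```

A header check like this is cheap but shallow: a file can pass it and still fail to parse, so treat it as a first filter rather than full validation.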
## Quick Start
## Installation
PDFlex is available on PyPI. To install using pip:
```bash
pip install -U pdflex
```

Alternatively, install in an isolated environment with pipx:
```bash
pipx install pdflex
```

For the fastest installation, use uv:
```bash
uv tool install pdflex
```

---
## Usage
### Command-Line Interface (CLI)
PDFlex provides a convenient CLI for merging and searching PDFs. The CLI supports two primary commands: `merge` and `search`.
#### Merge Command
Merge multiple PDF files into a single document while automatically inserting a separator page before each document.
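According to the feature list, each separator page shows a title derived from the filename with underscores and hyphens removed. A minimal sketch of that derivation follows. It assumes the separators are replaced by spaces rather than deleted outright, and `separator_title` is a hypothetical name, not a PDFlex function.

```python
import re
from pathlib import Path


def separator_title(pdf_path):
    """Derive a readable title from a PDF filename (illustrative only)."""
    stem = Path(pdf_path).stem  # drop the directory and the ".pdf" extension
    # Assumption: runs of underscores and hyphens become a single space.
    words = re.sub(r"[_-]+", " ", stem)
    return " ".join(words.split())  # collapse any leftover whitespace


print(separator_title("/path/to/linear_algebra-notes.pdf"))  # linear algebra notes
```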
**Usage:**
```bash
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf
```

Add the `--landscape` flag to create separator pages in landscape orientation:
```bash
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf --landscape
```

#### Search and Merge Command
Search for PDF files in a directory based on filename filters (or search for lecture slides with numeric float prefixes) and merge them into one PDF.
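To make the "numeric float prefix" filter concrete, here is a standard-library sketch of how such a search could be written. The regular expression, the function name `find_numeric_prefixed`, and the choice to sort prefixes as `(major, minor)` integer pairs are all assumptions for illustration; PDFlex's own matching and ordering rules may differ.

```python
import re
from pathlib import Path

# Assumed pattern: digits, a dot, digits, then an underscore (e.g. "1.2_").
NUMERIC_PREFIX = re.compile(r"^(\d+)\.(\d+)_")


def find_numeric_prefixed(root):
    """Recursively collect PDFs named like '1.2_Intro.pdf', sorted by prefix."""
    matches = []
    for path in Path(root).rglob("*.pdf"):
        m = NUMERIC_PREFIX.match(path.name)
        if m:
            # Compare (major, minor) as integers so 1.10 sorts after 1.2.
            key = (int(m.group(1)), int(m.group(2)))
            matches.append((key, path))
    # Break ties on the path string to keep the ordering deterministic.
    matches.sort(key=lambda item: (item[0], str(item[1])))
    return [path for _, path in matches]
```

Sorting on integer pairs rather than on `float("1.10")` is a deliberate choice in this sketch: as floats, `1.10` and `1.1` are equal and would sort before `1.2`, which is usually not the intended lecture order.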
**Usage:**
- **General Search:**
```bash
pdflex search /path/to/search -o merged_output.pdf --prefix "Chapter" --suffix ".pdf"
```

- **Lecture Slides Merge:**
  (Merges all PDFs whose filenames start with a numeric float prefix like `1.2_`, `3.2_`, etc., in sorted order. Separator pages will be in landscape orientation.)

```bash
pdflex search /path/to/algorithms-and-computation -o merged_lectures.pdf --lecture
```

### Python API Usage
You can also use PDFlex directly from your Python code. Below are examples for some common tasks.
#### Merging PDFs with Separator Pages
```python
from pathlib import Path
from pdflex.merge import merge_pdfs

# List of PDF file paths to merge
pdf_files = [
    "/path/to/document1.pdf",
    "/path/to/document2.pdf",
]

# Merge files, using landscape separator pages (ideal for lecture slides)
merge_pdfs(pdf_files, output_path="merged_output.pdf", landscape=True)
```

#### Searching for PDFs by Filename
```python
from pdflex.search import search_pdfs, search_numeric_prefixed_pdfs

# General search: Find PDFs that start with a prefix and/or end with a suffix
pdf_list = search_pdfs("/path/to/search", prefix="Chapter", suffix=".pdf")
print("Found PDFs:", pdf_list)

# Lecture slides: Find PDFs with numeric float prefixes (e.g., "1.2_Intro.pdf")
lecture_slides = search_numeric_prefixed_pdfs("/path/to/algorithms-and-computation")
print("Found lecture slides:", lecture_slides)
```

---
## Contributing
Contributions are welcome! Whether it's bug reports, feature requests, or code contributions, please feel free to:
1. Open an [issue][github-issues]
2. Submit a [pull request][github-pulls]
3. Improve documentation.
4. Share your ideas!

---
## Acknowledgments
This project is built upon several awesome PDF open-source projects:
- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [pdfplumber](https://github.com/jsvine/pdfplumber)
- [reportlab](https://www.reportlab.com/opensource/)

---
## License
PDFlex is released under the [MIT][mit-license] license.
Copyright (c) 2020 to present [PDFlex][pdflex] and contributors.
[pypi]: https://pypi.org/project/pdflex/
[pdflex]: https://github.com/eli64s/pdflex
[github-issues]: https://github.com/eli64s/pdflex/issues
[github-pulls]: https://github.com/eli64s/pdflex/pulls
[mit-license]: https://github.com/eli64s/pdflex/blob/main/LICENSE
[examples]: https://github.com/eli64s/pdflex/tree/main/docs/examples
[python]: https://www.python.org/
[pip]: https://pip.pypa.io/en/stable/
[pipx]: https://pipx.pypa.io/stable/
[uv]: https://docs.astral.sh/uv/
[mkdocs]: https://www.mkdocs.org/
[mkdocs.yml]: https://www.mkdocs.org/user-guide/configuration/