https://github.com/eli64s/pdflex

✁ Python tools for PDF automation.
pdf-automation pdf-converter pdf-data-extraction pdf-document pdf-document-parser pdf-document-processor pdf-extractor pdf-generator pdf-library pdf-manipulation pdf-parser pdf-processor pdf-python pdf-regex pdf-search pdf-text-extraction pdf-tools python-pdf python-pdf-tools

Last synced: about 2 months ago

Host: GitHub
URL: https://github.com/eli64s/pdflex
Owner: eli64s
License: mit
Created: 2020-12-16T09:49:47.000Z (over 4 years ago)
Default Branch: main
Last Pushed: 2025-02-18T15:57:04.000Z (4 months ago)
Last Synced: 2025-03-19T17:18:16.468Z (3 months ago)
Topics: pdf-automation, pdf-converter, pdf-data-extraction, pdf-document, pdf-document-parser, pdf-document-processor, pdf-extractor, pdf-generator, pdf-library, pdf-manipulation, pdf-parser, pdf-processor, pdf-python, pdf-regex, pdf-search, pdf-text-extraction, pdf-tools, python-pdf, python-pdf-tools
Language: Python
Homepage:
Size: 457 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md

README

## What is `PDFlex`?

PDFlex is a PDF processing toolkit for Python. It provides tools for PDF validation, text extraction, merging (with custom separator pages), and filename-based searching, all built to streamline your PDF automation workflows.

## Features

- **PDF Validation:** Quickly verify if a file is a valid PDF.
- **Text Extraction:** Extract text from PDFs using either PyMuPDF or PyPDF.
- **Directory Processing:** Process entire directories of PDFs for text extraction.
- **PDF Merging:** Merge multiple PDF files into one, automatically inserting a custom separator page between documents.
  - The separator page displays the title (derived from the filename) with underscores and hyphens removed.
  - Supports both portrait and landscape separator pages (ideal for lecture slides).
- **PDF Searching:** Recursively search for PDFs in a directory based on filename patterns (e.g., numeric float prefixes).
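
As a rough illustration of the validation feature, the sketch below checks whether a file begins with the `%PDF-` magic bytes. The function name `looks_like_pdf` is hypothetical and is not part of the PDFlex API. A header check is only a cheap first test, and it assumes the marker sits at the very start of the file.

```python
from pathlib import Path
from typing import Union


def looks_like_pdf(path: Union[str, Path]) -> bool:
    """Return True if the file starts with the PDF magic bytes (hypothetical helper)."""
    try:
        with open(path, "rb") as fh:
            # PDF files conventionally begin with "%PDF-" followed by a version number.
            return fh.read(5) == b"%PDF-"
    except OSError:
        # Missing or unreadable files are treated as "not a PDF".
        return False
```

A file that passes this check can still be corrupt; a full validation would also need to parse the document with a PDF library.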

---

## Quick Start

### Installation

PDFlex is available on PyPI. To install using pip:

```bash
pip install -U pdflex
```

Alternatively, install in an isolated environment with pipx:

```bash
pipx install pdflex
```

For the fastest installation, use uv:

```bash
uv tool install pdflex
```

---

## Usage

### Command-Line Interface (CLI)

PDFlex provides a convenient CLI for merging and searching PDFs. The CLI supports two primary commands: `merge` and `search`.

#### Merge Command

Merge multiple PDF files into a single document while automatically inserting a separator page before each document.

**Usage:**

```bash
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf
```

Add the `--landscape` flag to create separator pages in landscape orientation:

```bash
pdflex merge /path/to/file1.pdf /path/to/file2.pdf -o merged_output.pdf --landscape
```
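
The separator page title is described above as the filename with underscores and hyphens removed. A minimal sketch of one way to derive such a title follows. The helper `separator_title` is hypothetical, and replacing the characters with spaces (rather than deleting them) is an assumption, not confirmed PDFlex behavior.

```python
from pathlib import Path


def separator_title(pdf_path: str) -> str:
    """Derive a readable title from a PDF filename (hypothetical helper)."""
    # Drop the directory and the ".pdf" extension.
    stem = Path(pdf_path).stem
    # Assumption: underscores and hyphens become spaces; repeated spaces collapse.
    return " ".join(stem.replace("_", " ").replace("-", " ").split())


print(separator_title("/path/to/lecture_notes-week-1.pdf"))
```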

#### Search and Merge Command

Search for PDF files in a directory based on filename filters (or search for lecture slides with numeric float prefixes) and merge them into one PDF.

**Usage:**

- **General Search:**

```bash
pdflex search /path/to/search -o merged_output.pdf --prefix "Chapter" --suffix ".pdf"
```

- **Lecture Slides Merge:**
(Merges all PDFs whose filenames start with a numeric float prefix like `1.2_`, `3.2_`, etc., in sorted order. Separator pages will be in landscape orientation.)

```bash
pdflex search /path/to/algorithms-and-computation -o merged_lectures.pdf --lecture
```
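
To make the numeric float prefix rule concrete, here is a small sketch of how filenames such as `1.2_Intro.pdf` could be matched and ordered. The regular expression and the function names are assumptions for illustration and may differ from what PDFlex does internally. In particular, this sketch sorts the two numbers as integers, so `1.10_` comes after `1.2_`.

```python
import re

# Assumed pattern: digits, a dot, digits, then an underscore (e.g. "1.2_").
NUMERIC_PREFIX = re.compile(r"^(\d+)\.(\d+)_")


def numeric_prefix_key(filename):
    """Return (major, minor) for a numeric-prefixed filename, else None."""
    match = NUMERIC_PREFIX.match(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sort_lecture_files(filenames):
    """Keep only numeric-prefixed filenames and sort them by their prefix."""
    keyed = [(numeric_prefix_key(name), name) for name in filenames]
    return [name for key, name in sorted(pair for pair in keyed if pair[0] is not None)]


print(sort_lecture_files(["3.2_Graphs.pdf", "1.10_Heaps.pdf", "notes.pdf", "1.2_Intro.pdf"]))
```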

### Python API Usage

You can also use PDFlex directly from your Python code. Below are examples for some common tasks.

#### Merging PDFs with Separator Pages

```python
from pdflex.merge import merge_pdfs

# List of PDF file paths to merge
pdf_files = [
    "/path/to/document1.pdf",
    "/path/to/document2.pdf",
]

# Merge files, using landscape separator pages (ideal for lecture slides)
merge_pdfs(pdf_files, output_path="merged_output.pdf", landscape=True)
```

#### Searching for PDFs by Filename

```python
from pdflex.search import search_pdfs, search_numeric_prefixed_pdfs

# General search: Find PDFs that start with a prefix and/or end with a suffix
pdf_list = search_pdfs("/path/to/search", prefix="Chapter", suffix=".pdf")
print("Found PDFs:", pdf_list)

# Lecture slides: Find PDFs with numeric float prefixes (e.g., "1.2_Intro.pdf")
lecture_slides = search_numeric_prefixed_pdfs("/path/to/algorithms-and-computation")
print("Found lecture slides:", lecture_slides)
```
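
For readers curious how a recursive prefix/suffix search can work, the following is a standard-library sketch built on `pathlib`. The function `find_pdfs` is hypothetical and is not the implementation behind `search_pdfs`. It assumes matching is case-sensitive and applies to the filename only, not the full path.

```python
from pathlib import Path


def find_pdfs(root, prefix="", suffix=".pdf"):
    """Recursively collect files under root whose names match prefix and suffix."""
    matches = [
        path
        for path in Path(root).rglob("*")
        if path.is_file()
        and path.name.startswith(prefix)
        and path.name.endswith(suffix)
    ]
    # Sort for a stable, reproducible order.
    return sorted(matches)
```

Calling `find_pdfs("/path/to/search", prefix="Chapter")` would return a sorted list of `Path` objects, which can be converted to strings before being passed to other tools.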

---

## Contributing

Contributions are welcome! Whether it's bug reports, feature requests, or code contributions, please feel free to:

1. Open an [issue][github-issues]
2. Submit a [pull request][github-pulls]
3. Improve documentation.
4. Share your ideas!

---

## Acknowledgments

This project is built upon several awesome PDF open-source projects:

- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
- [pdfplumber](https://github.com/jsvine/pdfplumber)
- [reportlab](https://www.reportlab.com/opensource/)

---

## License

PDFlex is released under the [MIT][mit-license] license.

Copyright (c) 2020 to present [PDFlex][pdflex] and contributors.

[pypi]: https://pypi.org/project/pdflex/
[pdflex]: https://github.com/eli64s/pdflex
[github-issues]: https://github.com/eli64s/pdflex/issues
[github-pulls]: https://github.com/eli64s/pdflex/pulls
[mit-license]: https://github.com/eli64s/pdflex/blob/main/LICENSE
[examples]: https://github.com/eli64s/pdflex/tree/main/docs/examples

[python]: https://www.python.org/
[pip]: https://pip.pypa.io/en/stable/
[pipx]: https://pipx.pypa.io/stable/
[uv]: https://docs.astral.sh/uv/
[mkdocs]: https://www.mkdocs.org/
[mkdocs.yml]: https://www.mkdocs.org/user-guide/configuration/
