https://github.com/mathieubuisson/unpdf
https://github.com/mathieubuisson/unpdf
Last synced: 26 days ago
JSON representation
- Host: GitHub
- URL: https://github.com/mathieubuisson/unpdf
- Owner: MathieuBuisson
- Created: 2026-04-26T16:00:12.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-04-26T22:11:24.000Z (about 2 months ago)
- Last Synced: 2026-04-26T22:15:47.678Z (about 2 months ago)
- Language: Python
- Size: 4.88 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Agents: AGENTS.md
Awesome Lists containing this project
README
# unpdf
`unpdf` is a minimalist command-line utility designed to convert PDF documents into Markdown format, optimized for use with LLMs (Large Language Models). It leverages the power of `pymupdf4llm` to extract text, tables, and images while preserving the document structure.
## Features
- **Batch Processing**: Convert single files or entire directories.
- **Directory Preservation**: Recursively scans folders and mirrors the input structure in the output directory.
- **Smart Skip**: Automatically skips existing output files unless forced to overwrite.
- **Error Handling**: Gracefully handles encrypted or corrupt PDFs by logging warnings and continuing the batch.
## Technology Stack
- **Language**: Python 3.13+
- **PDF Engine**: [PyMuPDF4LLM](https://github.com/pymupdf/pymupdf4llm)
- **CLI**: Standard `argparse`
## Installation
### Prerequisites
- Python 3.13 or higher.
### Setup
1. Clone the repository:
```bash
git clone
cd unpdf
```
2. Install dependencies:
```bash
pip install .
```
## Usage
```text
usage: unpdf [-h] INPUT -o DIR [--recurse] [--force] [--version]
positional arguments:
INPUT PDF file or folder containing PDFs
options:
-o DIR, --output DIR Output folder (required)
--recurse Recursively scan subfolders
--force Overwrite existing output files
--version Show version information and exit
```
### Examples
**Convert a single file:**
```bash
unpdf document.pdf -o ./output
```
**Convert an entire folder recursively:**
```bash
unpdf ./docs -o ./markdown_docs --recurse
```
## Project Structure
```text
unpdf/
├── src/unpdf/ # Source package
│ ├── __init__.py
│ ├── __main__.py # Entry point: python -m unpdf
│ ├── cli.py # CLI argument parsing
│ ├── scanner.py # PDF discovery and path mapping
│ ├── converter.py # Single-file PDF→Markdown conversion
│ └── runner.py # Batch orchestration and statistics
├── tests/ # Test suite
│ ├── test_cli.py
│ ├── test_scanner.py
│ ├── test_converter.py
│ └── test_runner.py
├── pyproject.toml # Centralized configuration
├── SPEC.md # Technical specifications
└── README.md # User documentation
```
## Testing
Run the test suite using `pytest`:
```bash
pytest tests/
```
## Code Quality
This project uses several tools to maintain code quality:
```bash
black src/ tests/ # Format code
mypy src/ tests/ # Type checking
bandit -r src/ tests/ # Security analysis
```
## Usage
```bash
# After pip install .
unpdf document.pdf -o ./output
# OR without installation
python -m unpdf document.pdf -o ./output
```
# Options