https://github.com/funkatron/pdf_to_md

CLI tool to convert PDF files to Markdown with OCR capabilities

# PDF to Markdown

A command-line tool to convert PDF files to Markdown, with OCR capabilities for image-based content.

## Features

- Converts searchable PDF text to Markdown
- Uses OCR (Optical Character Recognition) for image-based PDF content
- Simple page separation with Markdown headers
- Configurable OCR resolution
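
To make the features above concrete, here is a minimal sketch of how per-page text could be assembled into Markdown with an OCR fallback. This is not the tool's actual implementation: the function name `pages_to_markdown`, the `ocr_page` callback, and the `## Page N` header format are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def pages_to_markdown(
    embedded_texts: Sequence[str],
    ocr_page: Callable[[int], str],
) -> str:
    """Build one Markdown document from per-page text.

    embedded_texts: text extracted from each page's text layer ("" if none).
    ocr_page: hypothetical callback that returns OCR text for a 0-based page index.
    """
    sections = []
    for index, embedded in enumerate(embedded_texts):
        text = embedded.strip()
        if not text:
            # No searchable text on this page: fall back to OCR.
            text = ocr_page(index).strip()
        # Assumed header format; the real tool's headers may differ.
        sections.append(f"## Page {index + 1}\n\n{text}\n")
    return "\n".join(sections)
```

In this sketch, OCR is only invoked for pages without a text layer, so fully searchable PDFs never pay the OCR cost.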

## Installation

### Prerequisites

- Python 3.6 or higher
- Tesseract OCR (for OCR functionality)

#### Installing Tesseract OCR

**macOS (using Homebrew):**
```bash
brew install tesseract
```

**Linux (Ubuntu/Debian):**
```bash
sudo apt-get install tesseract-ocr
```

**Windows:**
Download and run the installer from the [UB Mannheim Tesseract wiki](https://github.com/UB-Mannheim/tesseract/wiki).
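
Once the prerequisites are installed, a quick check like the following can confirm them before you install pdf_to_md. It is a sketch using only the Python standard library; the helper name `check_prerequisites` is hypothetical, and it assumes the Tesseract executable is named `tesseract` and is on your `PATH`.

```python
import shutil
import sys


def check_prerequisites(min_version=(3, 6), binary="tesseract"):
    """Return a list of problems; the list is empty if everything looks fine."""
    problems = []
    if sys.version_info < min_version:
        wanted = ".".join(str(part) for part in min_version)
        problems.append(f"Python {wanted} or higher is required")
    if shutil.which(binary) is None:
        # shutil.which returns None when the executable is not on PATH.
        problems.append(f"'{binary}' was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```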

### Installing pdf_to_md

#### From PyPI (recommended)

```bash
pip install pdf_to_md
```

#### From source

```bash
git clone https://github.com/funkatron/pdf_to_md.git
cd pdf_to_md
pip install -e .
```

## Usage

```bash
pdf_to_md input.pdf [-o output.md] [--dpi 300]
```

### Options

- `input.pdf`: Path to the PDF file to convert
- `-o, --output`: Path to the output Markdown file (default: the input path with its extension replaced by `.md`)
- `--dpi`: DPI for OCR rendering (default: 300)
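
The options above map naturally onto Python's `argparse`. The sketch below mirrors the documented interface, including the default output path, but it is an illustration only and not necessarily how pdf_to_md parses its arguments. The `parse_args` helper is a hypothetical name.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse arguments shaped like the documented pdf_to_md options."""
    parser = argparse.ArgumentParser(
        prog="pdf_to_md",
        description="Convert a PDF file to Markdown.",
    )
    parser.add_argument("input", type=Path, help="Path to the PDF file to convert")
    parser.add_argument("-o", "--output", type=Path, default=None,
                        help="Path to the output Markdown file")
    parser.add_argument("--dpi", type=int, default=300,
                        help="DPI for OCR rendering")
    args = parser.parse_args(argv)
    if args.output is None:
        # Default: same name as the input, with a .md extension.
        args.output = args.input.with_suffix(".md")
    return args
```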

## Examples

Convert a PDF file to Markdown:

```bash
pdf_to_md document.pdf
```

Specify an output file:

```bash
pdf_to_md document.pdf -o output.md
```

Use a higher resolution for OCR (this may improve accuracy, but is slower):

```bash
pdf_to_md document.pdf --dpi 600
```
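
To see why a higher DPI is slower, it helps to work out the size of the image that OCR has to process. The sketch below assumes a US Letter page (8.5 × 11 inches) and that each page is rendered to a bitmap at the chosen DPI; `rendered_size` is a hypothetical helper, not part of pdf_to_md.

```python
def rendered_size(width_in, height_in, dpi):
    """Return the (width, height) in pixels of a page rendered at the given DPI."""
    return round(width_in * dpi), round(height_in * dpi)


# Assumed page size: US Letter, 8.5 x 11 inches.
w300, h300 = rendered_size(8.5, 11, 300)  # (2550, 3300)
w600, h600 = rendered_size(8.5, 11, 600)  # (5100, 6600)

# Doubling the DPI doubles each dimension, so the pixel count grows 4x.
ratio = (w600 * h600) / (w300 * h300)  # 4.0
print(f"300 DPI: {w300}x{h300}, 600 DPI: {w600}x{h600}, pixel ratio: {ratio}")
```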

### Programmatic Usage

You can also use pdf_to_md as a library in your Python code:

```python
from pathlib import Path
from pdf_to_md.converter import convert

# Convert a PDF file to Markdown
pdf_path = Path("document.pdf")
md_path = Path("output.md")
dpi = 300

# Perform the conversion
convert(pdf_path, md_path, dpi)
```

See the [example script](https://github.com/funkatron/pdf_to_md/blob/main/examples/example.py) for a complete example.
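
If you need to convert many files, the `convert(pdf_path, md_path, dpi)` call shown above can be wrapped in a small loop. The following is a sketch: `convert_directory` is a hypothetical helper, and the conversion function is passed in as a parameter so the loop does not depend on anything beyond the call signature shown in this README.

```python
from pathlib import Path
from typing import Callable, List


def convert_directory(
    src_dir: Path,
    out_dir: Path,
    convert_fn: Callable[[Path, Path, int], None],
    dpi: int = 300,
) -> List[Path]:
    """Convert every *.pdf in src_dir, writing one .md file per PDF into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        md_path = out_dir / (pdf_path.stem + ".md")
        # convert_fn is expected to behave like pdf_to_md.converter.convert.
        convert_fn(pdf_path, md_path, dpi)
        written.append(md_path)
    return written
```

Passing the `convert` function imported from `pdf_to_md.converter` as `convert_fn` would run the real conversion for each file. The helper does not recurse into subdirectories.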

## License

MIT License