https://github.com/funkatron/pdf_to_md

CLI tool to convert PDF files to Markdown with OCR capabilities
https://github.com/funkatron/pdf_to_md

Last synced: 10 months ago
JSON representation

CLI tool to convert PDF files to Markdown with OCR capabilities

Host: GitHub
URL: https://github.com/funkatron/pdf_to_md
Owner: funkatron
License: mit
Created: 2025-05-15T15:32:53.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-05-15T15:33:09.000Z (about 1 year ago)
Last Synced: 2025-08-16T12:34:16.455Z (11 months ago)
Language: Python
Size: 3.91 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # PDF to Markdown

A command-line tool to convert PDF files to Markdown, with OCR capabilities for image-based content.

## Features

- Converts searchable PDF text to Markdown

- Uses OCR (Optical Character Recognition) for image-based PDF content

- Simple page separation with Markdown headers

- Configurable OCR resolution

## Installation

### Prerequisites

- Python 3.6 or higher

- Tesseract OCR (for OCR functionality)

#### Installing Tesseract OCR

**macOS (using Homebrew):**

```bash

brew install tesseract

```

**Linux (Ubuntu/Debian):**

```bash

sudo apt-get install tesseract-ocr

```

**Windows:**

Download and install from [Tesseract GitHub page](https://github.com/UB-Mannheim/tesseract/wiki).

### Installing pdf_to_md

#### From PyPI (recommended)

```bash

pip install pdf_to_md

```

#### From source

```bash

git clone https://github.com/funkatron/pdf_to_md.git

cd pdf_to_md

pip install -e .

```

## Usage

```bash

pdf_to_md input.pdf [-o output.md] [--dpi 300]

```

### Options

- `input.pdf`: Path to the PDF file to convert

- `-o, --output`: Path to the output Markdown file (default: same name as input with .md extension)

- `--dpi`: DPI for OCR rendering (default: 300)

## Examples

Convert a PDF file to Markdown:

```bash

pdf_to_md document.pdf

```

Specify an output file:

```bash

pdf_to_md document.pdf -o output.md

```

Use higher resolution for OCR (may improve accuracy but slower):

```bash

pdf_to_md document.pdf --dpi 600

```

### Programmatic Usage

You can also use pdf_to_md as a library in your Python code:

```python

from pathlib import Path

from pdf_to_md.converter import convert

# Convert a PDF file to Markdown

pdf_path = Path("document.pdf")

md_path = Path("output.md")

dpi = 300

# Perform the conversion

convert(pdf_path, md_path, dpi)

```

See the [example script](https://github.com/funkatron/pdf_to_md/blob/main/examples/example.py) for a complete example.

## License

MIT License

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/funkatron/pdf_to_md

Awesome Lists containing this project

README