https://github.com/carloscasalar/pdf2md-ocr

A Python CLI tool to convert PDFs to markdown using OCR
https://github.com/carloscasalar/pdf2md-ocr

cli markdown-generator pdf-converter python

Last synced: 5 months ago
JSON representation

A Python CLI tool to convert PDFs to markdown using OCR

Host: GitHub
URL: https://github.com/carloscasalar/pdf2md-ocr
Owner: carloscasalar
License: other
Created: 2025-11-16T19:40:22.000Z (7 months ago)
Default Branch: main
Last Pushed: 2025-11-16T19:47:59.000Z (7 months ago)
Last Synced: 2025-11-16T21:21:29.474Z (7 months ago)
Topics: cli, markdown-generator, pdf-converter, python
Language: Python
Homepage:
Size: 33.2 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Awesome Lists containing this project

README

# pdf2md-ocr

Simple CLI tool to convert PDFs to Markdown using [Marker AI](https://github.com/VikParuchuri/marker).

## Quick Start

**Recommended (no installation needed):**

```bash
uvx pdf2md-ocr input.pdf -o output.md
```

**Traditional installation:**

```bash
pip install pdf2md-ocr
pdf2md-ocr input.pdf -o output.md
```

## Usage

```bash
# Convert PDF to Markdown (output same name with .md extension)
pdf2md-ocr document.pdf

# Specify output file
pdf2md-ocr document.pdf -o result.md

# Convert specific page range (page numbering starts at 1)
pdf2md-ocr document.pdf --start-page 2 --end-page 5

# Convert from page 3 to the end
pdf2md-ocr document.pdf --start-page 3

# Convert from the beginning to page 10
pdf2md-ocr document.pdf --end-page 10

# Show cache location and size
pdf2md-ocr document.pdf --show-cache-info

# Show help
pdf2md-ocr --help

# Show version
pdf2md-ocr --version
```

### Page Range Options

- `--start-page N`: Starting page number (1-based, inclusive). If omitted, starts from page 1.
- `--end-page M`: Ending page number (1-based, inclusive). If omitted, goes to the last page.

Both options are optional and can be combined:

- Use only `--start-page` to convert from a specific page to the end.
- Use only `--end-page` to convert from the beginning to a specific page.
- Use both to convert a specific range.

**Important: Page numbering starts at 1** (not 0).

## First Run

The first time you run pdf2md-ocr, it will download ~2-3GB of AI models. These models are cached locally and reused for all future conversions.

**To see where models are cached:**

```bash
pdf2md-ocr input.pdf --show-cache-info
```

This will show the cache location and size after conversion. Cache locations, typically:

- macOS: `~/Library/Caches/datalab/models/`
- Linux: `~/.cache/datalab/models/`
- Windows: `%LOCALAPPDATA%\datalab\models\`

**To clear the cache:** Simply delete the cache directory shown in the info above, or use `make clean-cache` if developing locally.

Subsequent runs will be much faster since the models are already cached.

## Requirements

- Python 3.10 or higher
- ~2GB disk space for AI models (one-time download)

## System Requirements

pdf2md-ocr requires native system libraries for PDF processing. These need to be installed separately on your system:

### macOS (Homebrew)

```bash
brew install gobject-introspection pango
export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
```

For permanent setup, add the export line to your shell profile (`~/.zshrc`, `~/.bash_profile`, etc.):

```bash
echo 'export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"' >> ~/.zshrc
```

### Linux (Ubuntu/Debian)

```bash
sudo apt-get update
sudo apt-get install libgobject-2.0-0 libpango-1.0-0
```

### Linux (Fedora/RHEL)

```bash
sudo dnf install gobject-introspection pango
```

### Windows

Download and install GTK+ 3 from the [GTK-for-Windows-Runtime-Environment-Installer](https://github.com/tschoonj/GTK-for-Windows-Runtime-Environment-Installer).

## Development

For development, a Makefile is provided with common tasks:

```bash
# Install dependencies
make install-dev

# Run tests
make test

# Run tests with verbose output
make test-verbose

# Clean build artifacts
make clean

# Clear AI model cache (frees ~3GB disk space)
make clean-cache

# Build distribution packages
make build

# See all available commands
make help
```

## How It Works

This tool is a minimal wrapper around the excellent [marker-pdf](https://github.com/VikParuchuri/marker) library, which uses AI models to:

1. Detect text, tables, and equations in PDFs
2. Extract content with proper formatting
3. Convert to clean Markdown

## License

GPL-3.0-or-later

This project is licensed under the GNU General Public License v3.0 or later to comply with the [marker-pdf](https://github.com/VikParuchuri/marker) library license (GPL-3.0-or-later).

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/carloscasalar/pdf2md-ocr

Awesome Lists containing this project

README