https://github.com/carloscasalar/pdf2md-ocr
A Python CLI tool to convert PDFs to markdown using OCR
https://github.com/carloscasalar/pdf2md-ocr
cli markdown-generator pdf-converter python
Last synced: 5 months ago
JSON representation
A Python CLI tool to convert PDFs to markdown using OCR
- Host: GitHub
- URL: https://github.com/carloscasalar/pdf2md-ocr
- Owner: carloscasalar
- License: other
- Created: 2025-11-16T19:40:22.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-11-16T19:47:59.000Z (7 months ago)
- Last Synced: 2025-11-16T21:21:29.474Z (7 months ago)
- Topics: cli, markdown-generator, pdf-converter, python
- Language: Python
- Homepage:
- Size: 33.2 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project
README
# pdf2md-ocr
Simple CLI tool to convert PDFs to Markdown using [Marker AI](https://github.com/VikParuchuri/marker).
## Quick Start
**Recommended (no installation needed):**
```bash
uvx pdf2md-ocr input.pdf -o output.md
```
**Traditional installation:**
```bash
pip install pdf2md-ocr
pdf2md-ocr input.pdf -o output.md
```
## Usage
```bash
# Convert PDF to Markdown (output same name with .md extension)
pdf2md-ocr document.pdf
# Specify output file
pdf2md-ocr document.pdf -o result.md
# Convert specific page range (page numbering starts at 1)
pdf2md-ocr document.pdf --start-page 2 --end-page 5
# Convert from page 3 to the end
pdf2md-ocr document.pdf --start-page 3
# Convert from the beginning to page 10
pdf2md-ocr document.pdf --end-page 10
# Show cache location and size
pdf2md-ocr document.pdf --show-cache-info
# Show help
pdf2md-ocr --help
# Show version
pdf2md-ocr --version
```
### Page Range Options
- `--start-page N`: Starting page number (1-based, inclusive). If omitted, starts from page 1.
- `--end-page M`: Ending page number (1-based, inclusive). If omitted, goes to the last page.
Both options are optional and can be combined:
- Use only `--start-page` to convert from a specific page to the end.
- Use only `--end-page` to convert from the beginning to a specific page.
- Use both to convert a specific range.
**Important: Page numbering starts at 1** (not 0).
## First Run
The first time you run pdf2md-ocr, it will download ~2-3GB of AI models. These models are cached locally and reused for all future conversions.
**To see where models are cached:**
```bash
pdf2md-ocr input.pdf --show-cache-info
```
This will show the cache location and size after conversion. Cache locations, typically:
- macOS: `~/Library/Caches/datalab/models/`
- Linux: `~/.cache/datalab/models/`
- Windows: `%LOCALAPPDATA%\datalab\models\`
**To clear the cache:** Simply delete the cache directory shown in the info above, or use `make clean-cache` if developing locally.
Subsequent runs will be much faster since the models are already cached.
## Requirements
- Python 3.10 or higher
- ~2GB disk space for AI models (one-time download)
## System Requirements
pdf2md-ocr requires native system libraries for PDF processing. These need to be installed separately on your system:
### macOS (Homebrew)
```bash
brew install gobject-introspection pango
export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"
```
For permanent setup, add the export line to your shell profile (`~/.zshrc`, `~/.bash_profile`, etc.):
```bash
echo 'export DYLD_LIBRARY_PATH="/opt/homebrew/lib:$DYLD_LIBRARY_PATH"' >> ~/.zshrc
```
### Linux (Ubuntu/Debian)
```bash
sudo apt-get update
sudo apt-get install libgobject-2.0-0 libpango-1.0-0
```
### Linux (Fedora/RHEL)
```bash
sudo dnf install gobject-introspection pango
```
### Windows
Download and install GTK+ 3 from the [GTK-for-Windows-Runtime-Environment-Installer](https://github.com/tschoonj/GTK-for-Windows-Runtime-Environment-Installer).
## Development
For development, a Makefile is provided with common tasks:
```bash
# Install dependencies
make install-dev
# Run tests
make test
# Run tests with verbose output
make test-verbose
# Clean build artifacts
make clean
# Clear AI model cache (frees ~3GB disk space)
make clean-cache
# Build distribution packages
make build
# See all available commands
make help
```
## How It Works
This tool is a minimal wrapper around the excellent [marker-pdf](https://github.com/VikParuchuri/marker) library, which uses AI models to:
1. Detect text, tables, and equations in PDFs
2. Extract content with proper formatting
3. Convert to clean Markdown
## License
GPL-3.0-or-later
This project is licensed under the GNU General Public License v3.0 or later to comply with the [marker-pdf](https://github.com/VikParuchuri/marker) library license (GPL-3.0-or-later).