https://github.com/funkatron/pdf_to_md
CLI tool to convert PDF files to Markdown with OCR capabilities
https://github.com/funkatron/pdf_to_md
Last synced: 10 months ago
JSON representation
CLI tool to convert PDF files to Markdown with OCR capabilities
- Host: GitHub
- URL: https://github.com/funkatron/pdf_to_md
- Owner: funkatron
- License: mit
- Created: 2025-05-15T15:32:53.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-05-15T15:33:09.000Z (about 1 year ago)
- Last Synced: 2025-08-16T12:34:16.455Z (11 months ago)
- Language: Python
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# PDF to Markdown
A command-line tool to convert PDF files to Markdown, with OCR capabilities for image-based content.
## Features
- Converts searchable PDF text to Markdown
- Uses OCR (Optical Character Recognition) for image-based PDF content
- Simple page separation with Markdown headers
- Configurable OCR resolution
## Installation
### Prerequisites
- Python 3.6 or higher
- Tesseract OCR (for OCR functionality)
#### Installing Tesseract OCR
**macOS (using Homebrew):**
```bash
brew install tesseract
```
**Linux (Ubuntu/Debian):**
```bash
sudo apt-get install tesseract-ocr
```
**Windows:**
Download and install from [Tesseract GitHub page](https://github.com/UB-Mannheim/tesseract/wiki).
### Installing pdf_to_md
#### From PyPI (recommended)
```bash
pip install pdf_to_md
```
#### From source
```bash
git clone https://github.com/funkatron/pdf_to_md.git
cd pdf_to_md
pip install -e .
```
## Usage
```bash
pdf_to_md input.pdf [-o output.md] [--dpi 300]
```
### Options
- `input.pdf`: Path to the PDF file to convert
- `-o, --output`: Path to the output Markdown file (default: same name as input with .md extension)
- `--dpi`: DPI for OCR rendering (default: 300)
## Examples
Convert a PDF file to Markdown:
```bash
pdf_to_md document.pdf
```
Specify an output file:
```bash
pdf_to_md document.pdf -o output.md
```
Use higher resolution for OCR (may improve accuracy but slower):
```bash
pdf_to_md document.pdf --dpi 600
```
### Programmatic Usage
You can also use pdf_to_md as a library in your Python code:
```python
from pathlib import Path
from pdf_to_md.converter import convert
# Convert a PDF file to Markdown
pdf_path = Path("document.pdf")
md_path = Path("output.md")
dpi = 300
# Perform the conversion
convert(pdf_path, md_path, dpi)
```
See the [example script](https://github.com/funkatron/pdf_to_md/blob/main/examples/example.py) for a complete example.
## License
MIT License