https://github.com/expectedparrot/paper2md

Convert academic papers (PDF) to markdown

Last synced: about 1 month ago

Host: GitHub
URL: https://github.com/expectedparrot/paper2md
Owner: expectedparrot
Created: 2026-04-06T08:35:55.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-04-06T10:33:07.000Z (2 months ago)
Last Synced: 2026-04-25T03:59:04.441Z (about 1 month ago)
Language: Python
Size: 2.63 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

# paper2md

Convert academic-paper PDFs to clean Markdown with extracted PNG figures — ready for LLM / agent ingestion.

```
paper.pdf  →  paper_md/
               ├── paper.md      # full text with ![Figure N](img_N.png) references
               ├── img_1.png
               ├── img_2.png
               └── …
```
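
To confirm that an output folder matches the layout above, a short check can compare the image references in the Markdown file against the files on disk. This is a minimal sketch, not part of paper2md: `check_output` is a hypothetical helper, and the default file name `paper.md` is taken from the diagram (it is an assumption that the name does not vary with the input PDF).

```python
import re
from pathlib import Path

# Matches Markdown image references such as ![Figure 1](img_1.png)
IMAGE_REF = re.compile(r"!\[[^\]]*\]\(([^)\s]+)\)")


def check_output(output_dir, md_name="paper.md"):
    """Return (referenced, missing) image filenames for one output folder.

    Hypothetical helper: md_name defaults to the name shown in the README.
    """
    out = Path(output_dir)
    text = (out / md_name).read_text(encoding="utf-8")
    # Unique targets of all image references, in sorted order
    referenced = sorted(set(IMAGE_REF.findall(text)))
    # References whose target file does not exist next to the Markdown file
    missing = [name for name in referenced if not (out / name).is_file()]
    return referenced, missing
```

An empty `missing` list means every figure referenced in the text was written alongside it.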

## Install

```bash
# Lightweight (PyMuPDF backend only)
pip install "paper2md @ git+https://github.com/expectedparrot/paper2md.git"

# With marker-pdf for academic-layout awareness (recommended)
pip install "paper2md[marker] @ git+https://github.com/expectedparrot/paper2md.git"

# With Reducto cloud API
pip install "paper2md[reducto] @ git+https://github.com/expectedparrot/paper2md.git"
```

## Quick start

### CLI

```bash
# Auto-selects best available backend (prefers marker)
paper2md paper.pdf

# Explicit backend, custom output dir
paper2md paper.pdf --backend pymupdf --output ./out

# Adjust figure DPI (pymupdf backend only, default 150)
paper2md paper.pdf --backend pymupdf --dpi 200

# Print markdown to stdout
paper2md paper.pdf --print
```
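
For batch jobs, the CLI can be driven from a script. The sketch below assembles command lines from the four flags shown above and runs one conversion per PDF. `build_command` and `convert_folder` are hypothetical helpers, and writing each paper to its own subfolder of `out_root` is an assumption about how you might want to organise the output, not documented behaviour.

```python
import subprocess
from pathlib import Path


def build_command(pdf, backend=None, output=None, dpi=None, print_markdown=False):
    """Assemble a paper2md command line from the flags documented above."""
    cmd = ["paper2md", str(pdf)]
    if backend is not None:
        cmd += ["--backend", backend]
    if output is not None:
        cmd += ["--output", str(output)]
    if dpi is not None:
        cmd += ["--dpi", str(dpi)]  # documented as pymupdf-only
    if print_markdown:
        cmd.append("--print")
    return cmd


def convert_folder(folder, out_root, **options):
    """Run paper2md once per PDF in `folder` (needs paper2md on PATH)."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        target = Path(out_root) / pdf.stem  # assumed layout: one folder per paper
        subprocess.run(build_command(pdf, output=target, **options), check=True)
```

With `check=True`, a failed conversion raises `subprocess.CalledProcessError` and stops the batch rather than silently skipping the paper.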

### Python API

```python
from paper2md.converter import convert

result = convert("paper.pdf")

result.markdown      # full markdown string
result.images        # {"img_1.png": Path(...), ...}
result.output_dir    # Path to folder with paper.md + PNGs
result.backend_used  # "marker" or "pymupdf"
```
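
When feeding results to an LLM or agent pipeline, it can help to flatten the result into a plain dictionary. The following is a minimal sketch that relies only on the four attributes listed above; `result_manifest` is a hypothetical helper, and the `SimpleNamespace` stand-in exists only so the example runs without paper2md installed.

```python
import json
from pathlib import Path
from types import SimpleNamespace


def result_manifest(result):
    """Summarise a conversion result as a JSON-serialisable dict."""
    return {
        "backend": result.backend_used,
        "output_dir": Path(result.output_dir).as_posix(),
        "characters": len(result.markdown),
        "images": {
            name: Path(path).as_posix()
            for name, path in sorted(result.images.items())
        },
    }


# Stand-in object with the documented attributes (not a real conversion).
demo = SimpleNamespace(
    markdown="# Title\n\n![Figure 1](img_1.png)\n",
    images={"img_1.png": Path("paper_md/img_1.png")},
    output_dir=Path("paper_md"),
    backend_used="pymupdf",
)
print(json.dumps(result_manifest(demo), indent=2))
```

In real use, pass the object returned by `convert("paper.pdf")` instead of `demo`.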

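Choose a specific backend or output directory:

```python
from paper2md.converter import convert

result = convert("paper.pdf", backend="pymupdf")           # force PyMuPDF
result = convert("paper.pdf", backend="marker")            # force marker
result = convert("paper.pdf", output_dir="/tmp/my_paper")  # custom output directory
```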

### Reducto (cloud API)

The Reducto backend is a separate module (not wired into the main `convert()` function):

```bash
pip install "paper2md[reducto] @ git+https://github.com/expectedparrot/paper2md.git"
export REDUCTO_API_KEY=your_key_here
```

```python
from pathlib import Path

from paper2md.reducto import convert_with_reducto

md, images = convert_with_reducto(Path("paper.pdf"), Path("./out"))
```
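
Because this path depends on an environment variable, a thin wrapper that fails early with a clear message can save a confusing API error later. This is a sketch under assumptions: `reducto_convert` is a hypothetical wrapper, and the `env` parameter only affects the pre-flight check here (the library presumably reads the real environment itself).

```python
import os
from pathlib import Path


def reducto_convert(pdf, out_dir, env=None):
    """Call the Reducto backend, failing early if the API key is not set."""
    env = os.environ if env is None else env
    if not env.get("REDUCTO_API_KEY"):
        raise RuntimeError(
            "REDUCTO_API_KEY is not set; export it before using the Reducto backend"
        )
    # Imported lazily so the key check works without the [reducto] extra installed.
    from paper2md.reducto import convert_with_reducto

    return convert_with_reducto(Path(pdf), Path(out_dir))
```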

## How it works

1. The chosen backend converts the PDF to Markdown text and extracts any embedded images.

2. All extracted figures are renamed to `img_1.png`, `img_2.png`, etc. and saved to the output directory.

3. Image references (`![…](…)`) in the Markdown are rewritten to match the canonical filenames.

4. If the backend produced images that aren't referenced in the text (common with pymupdf), they are appended in an **Extracted Figures** section so no figure is lost.

5. Non-ASCII characters common in academic PDFs (Greek letters, math symbols, special dashes) are transliterated to ASCII equivalents for LaTeX compatibility.

6. The final `paper.md` + all PNGs are written to the output directory.
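
Steps 2 to 5 above can be sketched in a few lines. This is an illustration of the described behaviour, not the project's actual code: `canonicalize`, `to_ascii`, the `## Extracted Figures` heading level, the `![Figure](...)` alt text, and the small transliteration table are all assumptions. The sketch only computes names and text; converting and saving the image files themselves is left out.

```python
import re
import unicodedata

# Groups: "![alt](" prefix, link target, closing ")"
IMAGE_REF = re.compile(r"(!\[[^\]]*\]\()([^)\s]+)(\))")

# Tiny illustrative table; the real mapping is not documented in the README.
ASCII_MAP = {"\u03b1": "alpha", "\u03b2": "beta", "\u2264": "<=",
             "\u2013": "-", "\u2212": "-"}


def canonicalize(markdown, image_names):
    """Rewrite references to img_N.png and append unreferenced images.

    `image_names` lists the filenames a backend produced, in order.
    Returns (new_markdown, mapping of old name -> canonical name).
    """
    mapping = {old: f"img_{i}.png" for i, old in enumerate(image_names, start=1)}
    referenced = set()

    def rewrite(match):
        new = mapping.get(match.group(2))
        if new is None:
            return match.group(0)  # not an extracted image, leave untouched
        referenced.add(new)
        return f"{match.group(1)}{new}{match.group(3)}"

    text = IMAGE_REF.sub(rewrite, markdown)
    orphans = [new for new in mapping.values() if new not in referenced]
    if orphans:
        text = text.rstrip("\n") + "\n\n## Extracted Figures\n\n"
        text += "\n\n".join(f"![Figure]({name})" for name in orphans) + "\n"
    return text, mapping


def to_ascii(text):
    """Transliterate known symbols, then strip accents and other non-ASCII."""
    for src, dst in ASCII_MAP.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into base letter + combining mark;
    # encoding with errors="ignore" then drops whatever is still non-ASCII.
    return unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
```

Doing the rewrite in a single regex pass avoids clobbering a reference twice when a backend's original filename happens to look like a canonical one.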

## Backends

| Backend | Quality | Speed | Cost | Install |
|---------|---------|-------|------|---------|
| **marker** | Best | Slow | Free | `pip install "paper2md[marker] @ git+https://github.com/expectedparrot/paper2md.git"` |
| **pymupdf** | Good | Fast | Free | bundled |
| **Reducto** | Best | Fast | Paid | `pip install "paper2md[reducto] @ git+https://github.com/expectedparrot/paper2md.git"` + API key |

**marker-pdf** is recommended for academic papers — it handles multi-column layouts, LaTeX equations, figure captions, and tables well. It is the default backend when installed.

**pymupdf** is bundled and works out of the box. Good for simple layouts.
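
The "marker when installed, otherwise pymupdf" preference can be pictured as follows. This is a guess at the logic, not the project's implementation: `pick_backend` and the `"auto"` keyword are hypothetical, and it assumes marker-pdf is importable as the top-level module `marker`.

```python
import importlib.util


def pick_backend(requested="auto", is_installed=None):
    """Return a backend name, preferring marker when it is importable."""
    if is_installed is None:
        # Assumption: the marker-pdf package exposes a top-level `marker` module.
        def is_installed(module):
            return importlib.util.find_spec(module) is not None

    if requested != "auto":
        return requested  # an explicit choice always wins
    return "marker" if is_installed("marker") else "pymupdf"
```

Passing `is_installed` explicitly makes the choice easy to test without installing or uninstalling anything.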

**Reducto** is a cloud API option for high-volume or high-quality needs.
