https://github.com/zemse/pdfly
Fast, dependency-light PDF → Markdown CLI in pure Rust (also JSON/HTML/text). Tables, images, reading order — single static binary.
- Host: GitHub
- URL: https://github.com/zemse/pdfly
- Owner: zemse
- License: MIT
- Created: 2026-06-05T21:13:27.000Z (21 days ago)
- Default Branch: main
- Last Pushed: 2026-06-06T14:50:13.000Z (20 days ago)
- Last Synced: 2026-06-06T16:20:58.406Z (20 days ago)
- Language: Rust
- Size: 27 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
- Notice: NOTICE
# pdfly
A fast, dependency-light **PDF → Markdown** command-line tool written in pure Rust.
It also emits JSON (with bounding boxes), HTML, and plain text, and can split a
document into one Markdown file per chapter.
Pure Rust, no native libraries, no GPU, no network — a single static binary.
## Install / build
```bash
# install the `pdfly` binary from crates.io
cargo install pdfly
# ...or from git
cargo install --git https://github.com/zemse/pdfly
# ...or build locally
cargo build --release # binary at target/release/pdfly
```
## Usage
`pdfly read` takes the path to a PDF, converts it, and prints the result to **stdout** by default.
Pass `--out` (`-o`) with a path to write a file instead; the format is inferred from the
extension (`.md`, `.json`, `.html`, `.txt`) unless you override it with `--format` (`-f`).
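The precedence described above (explicit `--format` first, then the output extension, then Markdown) can be sketched in a few lines of Rust. This is an illustration only, not pdfly's source: the `Format` enum, `parse_format`, `resolve_format`, and the accepted names other than `json` are assumptions made for the example.

```rust
use std::path::Path;

/// Output formats named in the README (hypothetical enum, not pdfly's type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Format {
    Markdown,
    Json,
    Html,
    Text,
}

/// Map a format name or file extension to a `Format`, case-insensitively.
/// Only `json` is shown as a `--format` value in the README; the other
/// spellings are assumptions.
fn parse_format(name: &str) -> Option<Format> {
    match name.to_ascii_lowercase().as_str() {
        "md" | "markdown" => Some(Format::Markdown),
        "json" => Some(Format::Json),
        "html" => Some(Format::Html),
        "txt" | "text" => Some(Format::Text),
        _ => None,
    }
}

/// Resolve the output format: an explicit override wins, then the extension
/// of the output path, then Markdown as the fallback.
/// Assumption: an unrecognised override silently falls through here; a real
/// CLI would more likely report an error.
pub fn resolve_format(out: Option<&Path>, explicit: Option<&str>) -> Format {
    explicit
        .and_then(parse_format)
        .or_else(|| {
            out.and_then(|p| p.extension())
                .and_then(|e| e.to_str())
                .and_then(parse_format)
        })
        .unwrap_or(Format::Markdown)
}
```

Under these assumptions, `resolve_format(Some(Path::new("report.md")), Some("json"))` yields `Format::Json`, mirroring how `-f` overrides the extension.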
```bash
# PDF -> Markdown on stdout
pdfly read report.pdf
# write to a file (format inferred from the extension)
pdfly read report.pdf -o report.md
pdfly read report.pdf -o report.json
# pick a format explicitly (still stdout)
pdfly read report.pdf -f json
# only some pages
pdfly read report.pdf --pages 1,3,5-7
# encrypted PDF
pdfly read secret.pdf -p mypassword
# split a book into one Markdown file per chapter (+ index.md) in a directory
pdfly read book.pdf -o out/ --split
pdfly read book.pdf -o out/ --split --split-level 2 # split on H1 and H2
# images: extract to files (default), embed as base64, or drop
# (external images require --out; stdout output drops images)
pdfly read report.pdf -o report.md --image-output external --image-format png
pdfly read report.pdf -o report.md --image-output embedded
pdfly read report.pdf --image-output off
# use the PDF's own tags (tagged PDFs) instead of layout heuristics
pdfly read tagged.pdf --use-struct-tree
# write a tagged PDF (adds a structure tree) or an annotated debug PDF (both need --out)
pdfly read report.pdf -o report.md --tagged-pdf
pdfly read report.pdf -o report.md --annotate
# redact sensitive data; detect strikethrough; HTML tables in Markdown
pdfly read report.pdf --sanitize --detect-strikethrough --markdown-with-html
# use multiple threads on big PDFs (output stays deterministic)
pdfly read big.pdf --threads 8
# report processing time and throughput (pages/sec)
pdfly read big.pdf --timing
```
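A page selection such as `--pages 1,3,5-7` is a comma-separated list of single pages and inclusive ranges. The sketch below shows one way such a spec could be parsed; it is not taken from pdfly, and the function names, the 1-based numbering, and the rejection of descending ranges are assumptions.

```rust
use std::collections::BTreeSet;

/// Parse a page spec such as "1,3,5-7" into sorted, de-duplicated page numbers.
/// Assumes 1-based pages and inclusive ranges (hypothetical helper).
pub fn parse_pages(spec: &str) -> Result<Vec<u32>, String> {
    let mut pages = BTreeSet::new();
    for part in spec.split(',') {
        let part = part.trim();
        let (start, end) = match part.split_once('-') {
            // "5-7" is an inclusive range
            Some((a, b)) => (parse_page(a)?, parse_page(b)?),
            // "3" is a single page
            None => {
                let p = parse_page(part)?;
                (p, p)
            }
        };
        if start > end {
            return Err(format!("descending range: {part}"));
        }
        pages.extend(start..=end);
    }
    Ok(pages.into_iter().collect())
}

/// Parse one page number, rejecting zero and non-numeric input.
fn parse_page(s: &str) -> Result<u32, String> {
    match s.trim().parse::<u32>() {
        Ok(0) | Err(_) => Err(format!("invalid page number: {s:?}")),
        Ok(n) => Ok(n),
    }
}
```

With these rules, `parse_pages("1,3,5-7")` returns pages 1, 3, 5, 6, and 7. Collecting into a `BTreeSet` keeps the result ordered and drops repeats such as `2,2`.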
### OCR for scanned PDFs (optional)
OCR is a pure-Rust optional feature (no native deps). Build with it enabled and
point to [ocrs](https://github.com/robertknight/ocrs) `.rten` model files:
```bash
cargo build --release --features ocr
export PDFRS_OCR_DETECTION_MODEL=/path/to/text-detection.rten
export PDFRS_OCR_RECOGNITION_MODEL=/path/to/text-recognition.rten
pdfly read scanned.pdf # image-only pages are OCR'd automatically
```
The default build omits OCR entirely, keeping the binary small.
Run `pdfly read --help` for all options.
## What it does
- **Text extraction**: a content-stream interpreter over `lopdf` recovers positioned
text runs with fonts, sizes, weights, and colors (ToUnicode / encoding / CID width
decoding).
- **Layout analysis**: line assembly, multi-column line splitting, body-font
statistics, heading detection (relative font-size ranking → levels 1–6), list
detection (bulleted/numbered), border-based table detection, and **XY-Cut++**
reading order.
- **Header/footer** removal (repeated running content), **content-safety**
filtering (tiny / off-page text), and optional **sanitization**.
- **Renderers**: GFM Markdown, schema-aligned JSON with bounding boxes, standalone
HTML, plain text, and chapter-wise Markdown.
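To make "relative font-size ranking → levels 1–6" concrete, here is a minimal sketch of the idea. It is not pdfly's implementation: it assumes the body size is simply the most frequent size, that every distinct larger size is a heading tier, and that sizes arrive as integers in tenths of a point. All names are hypothetical.

```rust
use std::collections::HashMap;

/// Take the most common font size (in tenths of a point) as the body size.
/// Ties break toward the smaller size so the result is deterministic.
pub fn body_size(sizes: &[u32]) -> Option<u32> {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &s in sizes {
        *counts.entry(s).or_insert(0) += 1;
    }
    counts
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(&a.0)))
        .map(|(size, _)| size)
}

/// Assign a heading level to each line: `None` for body-sized or smaller
/// text, `Some(1..=6)` for larger text, ranked by size (largest is level 1).
pub fn heading_levels(sizes: &[u32]) -> Vec<Option<u8>> {
    let Some(body) = body_size(sizes) else {
        return Vec::new();
    };
    // Distinct sizes above the body size, largest first.
    let mut larger: Vec<u32> = sizes.iter().copied().filter(|&s| s > body).collect();
    larger.sort_unstable_by(|a, b| b.cmp(a));
    larger.dedup();
    sizes
        .iter()
        .map(|&s| {
            larger
                .iter()
                .position(|&l| l == s)
                // Ranks beyond the sixth tier collapse into level 6.
                .map(|rank| (rank.min(5) + 1) as u8)
        })
        .collect()
}
```

For line sizes of 24, 10, 10, 18, and 10 pt (passed as `[240, 100, 100, 180, 100]`), the sketch marks the 24 pt line as level 1, the 18 pt line as level 2, and the rest as body text. A real detector would likely also weigh font weight and line length, which this sketch ignores.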
## Origins
A from-scratch Rust reimplementation of the data-extraction core of
[opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)
(Apache-2.0). Algorithms were studied and reimplemented clean-room; no code was
copied. See [ARCHITECTURE.md](./ARCHITECTURE.md) for how the original works and
[TASKS.md](./TASKS.md) for open issues and remaining work. The XY-Cut++ reading order follows
opendataloader's `XYCutPlusPlusSorter`; layout heuristics are informed by
veraPDF's `wcag-algorithms`.
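For readers unfamiliar with the XY-cut family, the sketch below shows a plain recursive XY-cut over bounding boxes: find the widest empty gap in the projection onto either axis, split there, and recurse. It is the classic baseline only, not XY-Cut++ and not pdfly's code; the refinements of the `++` variant are not reproduced. The `BBox` type, the function names, and the top-left origin with y growing downward are assumptions.

```rust
/// Axis-aligned box; assumes a top-left origin with y growing downward.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BBox {
    pub x0: f32,
    pub y0: f32,
    pub x1: f32,
    pub y1: f32,
}

/// Widest empty gap in a set of 1-D intervals.
/// Returns (gap width, extent covered before the gap), or None if there is no gap.
fn widest_gap(mut intervals: Vec<(f32, f32)>) -> Option<(f32, f32)> {
    intervals.sort_by(|a, b| a.0.total_cmp(&b.0));
    let mut best: Option<(f32, f32)> = None;
    let mut reach = intervals.first()?.1; // furthest extent covered so far
    for &(lo, hi) in intervals.iter().skip(1) {
        if lo > reach && best.map_or(true, |(g, _)| lo - reach > g) {
            best = Some((lo - reach, reach));
        }
        reach = reach.max(hi);
    }
    best
}

/// Pick the cut with the widest gap: (true, y) for a horizontal cut,
/// (false, x) for a vertical one. Ties favour the horizontal cut.
fn choose_cut(boxes: &[BBox], idx: &[usize]) -> Option<(bool, f32)> {
    let gap_y = widest_gap(idx.iter().map(|&i| (boxes[i].y0, boxes[i].y1)).collect());
    let gap_x = widest_gap(idx.iter().map(|&i| (boxes[i].x0, boxes[i].x1)).collect());
    match (gap_y, gap_x) {
        (Some((gy, cy)), Some((gx, _))) if gy >= gx => Some((true, cy)),
        (_, Some((_, cx))) => Some((false, cx)),
        (Some((_, cy)), None) => Some((true, cy)),
        (None, None) => None,
    }
}

fn cut(boxes: &[BBox], mut idx: Vec<usize>, out: &mut Vec<usize>) {
    let split = if idx.len() > 1 { choose_cut(boxes, &idx) } else { None };
    match split {
        Some((horizontal, at)) => {
            // Boxes ending at or before the cut come first (top or left part).
            let (first, second): (Vec<usize>, Vec<usize>) = idx
                .into_iter()
                .partition(|&i| if horizontal { boxes[i].y1 <= at } else { boxes[i].x1 <= at });
            cut(boxes, first, out);
            cut(boxes, second, out);
        }
        None => {
            // No gap left: fall back to top-to-bottom, then left-to-right.
            idx.sort_by(|&a, &b| {
                boxes[a].y0.total_cmp(&boxes[b].y0).then(boxes[a].x0.total_cmp(&boxes[b].x0))
            });
            out.extend(idx);
        }
    }
}

/// Return the indices of `boxes` in reading order.
pub fn reading_order(boxes: &[BBox]) -> Vec<usize> {
    let mut out = Vec::with_capacity(boxes.len());
    cut(boxes, (0..boxes.len()).collect(), &mut out);
    out
}
```

Choosing the widest gap on either axis is what lets a column gutter win over ordinary line spacing, so a full-width title followed by two columns reads title, left column, then right column. It also hints at why dense layouts remain hard: when gutters are no wider than line gaps, the baseline picks the wrong cut.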
## Known limitations
- Dense multi-column academic papers (full-width abstract over a two-column body)
can still interleave in reading order (improved, not perfect).
- Type1 (`FontFile`) subset fonts with non-standard built-in encodings and no
`/ToUnicode` may still mis-decode (embedded TrueType/CFF and standard glyph
names now decode).
- Borderless (column-aligned) table detection is on by default; pass
`--table-method ruled` to restrict detection to ruled-border tables only.
- `--tagged-pdf` writes marked content + a structure tree (round-trips via
`--use-struct-tree`) but does not yet emit a `/ParentTree` or run formal
PDF/UA conformance validation.
- LaTeX formulas and chart/image descriptions need local ML models (not built).
## Tests
```bash
cargo test
```
Tests run against a committed corpus (`tests/corpus/`) using snapshot/invariant
checks (no external Java oracle required).