{"id":50706109,"url":"https://github.com/zemse/pdfly","last_synced_at":"2026-06-09T12:01:16.631Z","repository":{"id":362929450,"uuid":"1260767795","full_name":"zemse/pdfly","owner":"zemse","description":"Fast, dependency-light PDF → Markdown CLI in pure Rust (also JSON/HTML/text). Tables, images, reading order — single static binary.","archived":false,"fork":false,"pushed_at":"2026-06-06T14:50:13.000Z","size":28338,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-06T16:20:58.406Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/zemse.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-05T21:13:27.000Z","updated_at":"2026-06-06T14:50:17.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/zemse/pdfly","commit_stats":null,"previous_names":["zemse/pdfly"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/zemse/pdfly","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zemse%2Fpdfly","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zemse%2Fpdfly/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zemse%2Fpdfly/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zemse%2Fpdfly/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/zemse","download_url":"https://codeload.github.com/zemse/pdfly/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/zemse%2Fpdfly/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34105565,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-09T12:01:16.003Z","updated_at":"2026-06-09T12:01:16.616Z","avatar_url":"https://github.com/zemse.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdfly\n\nA fast, dependency-light **PDF → Markdown** command-line tool written in pure Rust.\nIt also emits JSON (with bounding boxes), HTML, and plain text, and can split a\ndocument into one Markdown file per chapter.\n\nPure Rust, no native libraries, no GPU, no network — a single static binary.\n\n## Install / build\n\n```bash\n# install the `pdfly` binary from crates.io\ncargo install pdfly\n\n# ...or from git\ncargo install --git https://github.com/zemse/pdfly\n\n# ...or build locally\ncargo build --release   # binary at target/release/pdfly\n```\n\n## Usage\n\n`pdfly read \u003cfile\u003e` converts a PDF and prints the result to **stdout** by default.\nPass `--out \u003cpath\u003e` to write a file instead; the format is inferred from the\nextension (`.md`, `.json`, `.html`, `.txt`) unless you override it with `--format`.\n\n```bash\n# PDF -\u003e Markdown on stdout\npdfly read report.pdf\n\n# write to a file (format inferred from the extension)\npdfly read report.pdf -o report.md\npdfly read report.pdf -o report.json\n\n# pick a format explicitly (still stdout)\npdfly read report.pdf -f json\n\n# only some pages\npdfly read report.pdf --pages 1,3,5-7\n\n# encrypted PDF\npdfly read secret.pdf -p mypassword\n\n# split a book into one Markdown file per chapter (+ index.md) in a directory\npdfly read book.pdf -o out/ --split\npdfly read book.pdf -o out/ --split --split-level 2   # split on H1 and H2\n\n# images: extract to files (default), embed as base64, or drop\n# (external images require --out; stdout output drops images)\npdfly read report.pdf -o report.md --image-output external --image-format png\npdfly read report.pdf -o report.md --image-output embedded\npdfly read report.pdf --image-output off\n\n# use the PDF's own tags (tagged PDFs) instead of layout heuristics\npdfly read tagged.pdf --use-struct-tree\n\n# write a tagged PDF (adds a structure tree) / an annotated debug PDF (need --out)\npdfly read report.pdf -o report.md --tagged-pdf\npdfly read report.pdf -o report.md --annotate\n\n# redact sensitive data; detect strikethrough; HTML tables in Markdown\npdfly read report.pdf --sanitize --detect-strikethrough --markdown-with-html\n\n# faster on big PDFs (deterministic)\npdfly read big.pdf --threads 8\n\n# report processing time and throughput (pages/sec)\npdfly read big.pdf --timing\n```\n\n### OCR for scanned PDFs (optional)\n\nOCR is a pure-Rust optional feature (no native deps). Build with it enabled and\npoint to [ocrs](https://github.com/robertknight/ocrs) `.rten` model files:\n\n```bash\ncargo build --release --features ocr\nexport PDFRS_OCR_DETECTION_MODEL=/path/to/text-detection.rten\nexport PDFRS_OCR_RECOGNITION_MODEL=/path/to/text-recognition.rten\npdfly read scanned.pdf          # image-only pages are OCR'd automatically\n```\n\nThe default build omits OCR entirely, keeping the binary small.\n\nRun `pdfly read --help` for all options.\n\n## What it does\n\n- **Text extraction**: a content-stream interpreter over `lopdf` recovers positioned\n  text runs with fonts, sizes, weights, and colors (ToUnicode / encoding / CID width\n  decoding).\n- **Layout analysis**: line assembly, multi-column line splitting, body-font\n  statistics, heading detection (relative font-size ranking → levels 1–6), list\n  detection (bulleted/numbered), border-based table detection, and **XY-Cut++**\n  reading order.\n- **Header/footer** removal (repeated running content), **content-safety**\n  filtering (tiny / off-page text), and optional **sanitization**.\n- **Renderers**: GFM Markdown, schema-aligned JSON with bounding boxes, standalone\n  HTML, plain text, and chapter-wise Markdown.\n\n## Origins\n\nA from-scratch Rust reimplementation of the data-extraction core of\n[opendataloader-pdf](https://github.com/opendataloader-project/opendataloader-pdf)\n(Apache-2.0). Algorithms were studied and reimplemented clean-room; no code was\ncopied. See [ARCHITECTURE.md](./ARCHITECTURE.md) for how the original works and\n[TASKS.md](./TASKS.md) for open issues and remaining work. The XY-Cut++ reading order follows\nopendataloader's `XYCutPlusPlusSorter`; layout heuristics are informed by\nveraPDF's `wcag-algorithms`.\n\n## Known limitations\n\n- Dense multi-column academic papers (full-width abstract over a two-column body)\n  can still interleave in reading order (improved, not perfect).\n- Type1 (`FontFile`) subset fonts with non-standard built-in encodings and no\n  `/ToUnicode` may still mis-decode (embedded TrueType/CFF and standard glyph\n  names now decode).\n- Borderless (column-aligned) table detection is on by default; pass\n  `--table-method ruled` to restrict detection to ruled-border tables only.\n- `--tagged-pdf` writes marked content + a structure tree (round-trips via\n  `--use-struct-tree`) but does not yet emit a `/ParentTree` or run formal\n  PDF/UA conformance validation.\n- LaTeX formulas and chart/image descriptions need local ML models (not built).\n\n## Tests\n\n```bash\ncargo test\n```\n\nTests run against a committed corpus (`tests/corpus/`) using snapshot/invariant\nchecks (no external Java oracle required).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzemse%2Fpdfly","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fzemse%2Fpdfly","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fzemse%2Fpdfly/lists"}