{"id":50294378,"url":"https://github.com/tdiprima/pdfcraft","last_synced_at":"2026-05-28T08:01:14.468Z","repository":{"id":358720059,"uuid":"1242789943","full_name":"tdiprima/pdfcraft","owner":"tdiprima","description":"Python CLI to convert Markdown → PDF and PDF → text, with table detection, image OCR, and optional GPT summarization.","archived":false,"fork":false,"pushed_at":"2026-05-18T19:13:20.000Z","size":11,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-18T21:11:35.394Z","etag":null,"topics":["document-conversion","markdown","ocr","pdf","weasyprint"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tdiprima.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-18T18:58:01.000Z","updated_at":"2026-05-18T19:13:38.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/tdiprima/pdfcraft","commit_stats":null,"previous_names":["tdiprima/pdfcraft"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/tdiprima/pdfcraft","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fpdfcraft","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fpdfcraft/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fpdfcraft/releases","manifests_url":"https:
//repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fpdfcraft/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tdiprima","download_url":"https://codeload.github.com/tdiprima/pdfcraft/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tdiprima%2Fpdfcraft/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33599465,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-28T02:00:06.440Z","response_time":99,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-conversion","markdown","ocr","pdf","weasyprint"],"created_at":"2026-05-28T08:01:13.097Z","updated_at":"2026-05-28T08:01:14.462Z","avatar_url":"https://github.com/tdiprima.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdfcraft\n\nA Python CLI for converting between Markdown and PDF, and extracting structured text from PDFs — including tables and embedded images, with optional AI summarization.\n\n## PDFs and Markdown don't play nicely with standard tools\n\nMarkdown-to-PDF converters often produce plain, unstyled output that ignores tables, code blocks, and emoji. 
Going the other direction is messier: basic PDF extractors drop tables, ignore images entirely, and silently discard content, leaving you with a `.txt` file that looks fine until you notice half the data is missing. Research papers, reports, and technical documents are especially bad — they're full of multi-column layouts, figures with captions, and data tables that most tools mangle or skip.\n\n## Three passes, nothing left behind\n\n`pdfcraft` handles both directions cleanly.\n\nWhen converting **Markdown to PDF**, it renders through an HTML intermediate using WeasyPrint, applying a typographically sensible stylesheet with emoji-compatible fonts, syntax-highlighted code blocks, tables, and a generated table of contents. The output is a properly formatted document, not a wall of unstyled text.\n\nWhen converting **PDF to text**, it runs three passes per page: plain text via `pdfplumber`, table detection rendered as pipe-delimited rows, and OCR on embedded images via PyMuPDF and Tesseract. Everything is assembled in page order so context is preserved. Pass `--summarize` and the extracted text goes to GPT-5.5, which returns the key points as a Markdown document and picks a descriptive filename automatically.\n\n## Example\n\nA research paper with body text, a data table on page 4, and a chart image on page 9:\n\n```\n$ pdfcraft pdf-to-text quarterly-report.pdf --summarize\n\nText saved to: quarterly-report.txt\nSummary saved to: q3-revenue-highlights-summary.md\n```\n\n`quarterly-report.txt` contains the full extraction, page-by-page, with tables and OCR'd image text inline. `q3-revenue-highlights-summary.md` is the GPT-5.5 summary, with a filename it chose based on the content.\n\nA folder of Markdown notes:\n\n```\n$ pdfcraft md-to-pdf ./docs --output-dir ./pdfs\n\nINFO converted docs/architecture.md -\u003e pdfs/architecture.pdf\nINFO converted docs/runbook.md -\u003e pdfs/runbook.pdf\nINFO Done. 
Converted 2 file(s), 0 error(s).\n```\n\n## Usage\n\n### Install\n\n```bash\nuv sync\n```\n\nSystem dependencies are also required:\n\n```bash\n# macOS\nbrew install tesseract pango\n\n# Ubuntu/Debian\nsudo apt install tesseract-ocr libpango-1.0-0\n```\n\nFor emoji rendering on Linux:\n\n```bash\nsudo apt install fonts-noto-color-emoji\n```\n\nAlternatively, install with pip in editable mode:\n\n```sh\npip install -e .\n```\n\n### Markdown to PDF\n\n```bash\n# PDFs written alongside the .md files\npdfcraft md-to-pdf /path/to/docs\n\n# PDFs written to a separate output folder\npdfcraft md-to-pdf /path/to/docs --output-dir /path/to/pdfs\n```\n\n### PDF to text\n\n```bash\n# Extract text, tables, and image OCR\npdfcraft pdf-to-text path/to/file.pdf\n\n# Extract and summarize with GPT-5.5\nOPENAI_API_KEY=sk-... pdfcraft pdf-to-text path/to/file.pdf --summarize\n\n# Enable debug logging\npdfcraft pdf-to-text path/to/file.pdf --verbose\n```\n\n### Flags\n\n| Command | Flag | Description |\n|---|---|---|\n| `pdf-to-text` | `--summarize` | Send extracted text to GPT-5.5 and write a Markdown summary |\n| `pdf-to-text` | `--verbose`, `-v` | Enable debug logging |\n| `md-to-pdf` | `--output-dir` | Directory for output PDFs (default: same as input) |\n\n**Log level** can also be set via the `LOG_LEVEL` environment variable (`DEBUG`, `INFO`, `WARNING`).\n\n\u003c!--\nTests:\n\nuv sync\nuv run pytest\n\nVerbose output:\n\nuv run pytest -v\n\nRun one file:\n\nuv run pytest tests/test_md_to_pdf.py -v\n\nNote: test_convert_file_produces_pdf and test_convert_directory_converts_all require WeasyPrint + Pango installed. All utils/pure-logic tests run without system deps.\n--\u003e\n\n\u003cbr\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdiprima%2Fpdfcraft","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftdiprima%2Fpdfcraft","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftdiprima%2Fpdfcraft/lists"}