{"id":47974211,"url":"https://github.com/seonghobae/docvert","last_synced_at":"2026-04-04T10:52:29.260Z","repository":{"id":348182890,"uuid":"1196811285","full_name":"seonghobae/docvert","owner":"seonghobae","description":"DOCX/PDF to Markdown conversion LLM Agent","archived":false,"fork":false,"pushed_at":"2026-04-02T06:35:41.000Z","size":483,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-04-04T10:52:22.263Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/seonghobae.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-31T04:09:22.000Z","updated_at":"2026-04-02T06:35:47.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/seonghobae/docvert","commit_stats":null,"previous_names":["seonghobae/docvert"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/seonghobae/docvert","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seonghobae%2Fdocvert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seonghobae%2Fdocvert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seonghobae%2Fdocvert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seonghobae%2Fdocvert/manifests","owner_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/owners/seonghobae","download_url":"https://codeload.github.com/seonghobae/docvert/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/seonghobae%2Fdocvert/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31397055,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-04T10:20:44.708Z","status":"ssl_error","status_checked_at":"2026-04-04T10:20:06.846Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-04-04T10:52:28.642Z","updated_at":"2026-04-04T10:52:29.249Z","avatar_url":"https://github.com/seonghobae.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DocVert\n\n**DocVert** is an intelligent, LLM-powered CLI tool and agent for converting DOCX and PDF documents into clean, semantic Markdown. It focuses heavily on preserving document structure, headings, lists, and visual elements, while extracting key document metadata into sidecar JSON files.\n\n## Features\n\n- **Robust DOCX Parsing**: Primary parsing using `python-docx` with heuristic heading detection. Fallback to `mammoth` for difficult layouts.\n- **Advanced PDF Parsing**: High-fidelity PDF extraction using `docling` as the primary engine. 
Fallback to `unstructured` for edge cases.\n- **Rich Output Format**:\n  - Generates clean, semantic `.md` files.\n  - Produces a sidecar `.json` file containing extraction confidence scores, metadata, and parsing warnings.\n  - Automatically extracts and saves images referenced in the source documents.\n- **Batch Processing \u0026 Caching**: Efficiently process large directories of files with built-in caching to avoid redundant parsing.\n- **Provider-Agnostic LLM Refinement**: Uses `litellm` under the hood, natively supporting OpenAI, Vertex AI, Anthropic, Bedrock, and local models via Ollama.\n- **Air-Gapped / Offline Deployment**: Pre-built Docker images via [GitHub Releases](https://github.com/seonghobae/docvert/releases) for secure, offline environments.\n- **Developer Ready**:\n  - 100% test coverage.\n  - Robust type hints powered by `pydantic`.\n  - Built-in CLI using modern Python tooling (`uv`).\n\n## Quick Start (Docker — Recommended)\n\n```bash\n# Build from source\ngit clone https://github.com/seonghobae/docvert.git\ncd docvert\ndocker build -t docvert:offline .\n\n# Convert a file (the quotes keep the bind mount intact if the current path contains spaces)\ndocker run --rm -v \"$(pwd)\":/data \\\n    docvert:offline convert /data/input.pdf --output-dir /data/out\n```\n\nOr install natively — see the full [Installation Guide](https://seonghobae.github.io/docvert/installation-guide/).\n\n## Documentation\n\nFull documentation is available at **[seonghobae.github.io/docvert](https://seonghobae.github.io/docvert/)**\n\n- [Installation Guide (English)](https://seonghobae.github.io/docvert/installation-guide/) — Per-OS zero-setup, no Homebrew required\n- [설치 가이드 (한국어)](https://seonghobae.github.io/docvert/web-manual-ko/) — OS별 제로셋업 설치 안내\n- [User Manual \u0026 CLI Reference](https://seonghobae.github.io/docvert/manual/) — CLI usage, LLM configuration, architecture\n- [Offline Deployment Runbook](https://seonghobae.github.io/docvert/operations/offline-release-runbook/) — Air-gapped setup guide\n- [Architecture Decision 
Records](https://seonghobae.github.io/docvert/architecture-decision-records/0001-parser-choices/) — Parser implementation choices\n\n## GitHub Releases (Offline Bundles)\n\nPre-built Docker images for air-gapped environments:\n\n**[github.com/seonghobae/docvert/releases](https://github.com/seonghobae/docvert/releases)**\n\n## License\n\nMIT License\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseonghobae%2Fdocvert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fseonghobae%2Fdocvert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fseonghobae%2Fdocvert/lists"}