{"id":50939449,"url":"https://github.com/patelvivekdev/pdf-textlayer","last_synced_at":"2026-06-17T12:30:57.820Z","repository":{"id":357656720,"uuid":"1237573132","full_name":"patelvivekdev/pdf-textlayer","owner":"patelvivekdev","description":null,"archived":false,"fork":false,"pushed_at":"2026-05-13T09:59:17.000Z","size":17,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-13T19:28:40.143Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/patelvivekdev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-13T09:58:43.000Z","updated_at":"2026-05-13T09:58:52.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/patelvivekdev/pdf-textlayer","commit_stats":null,"previous_names":["patelvivekdev/pdf-textlayer"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/patelvivekdev/pdf-textlayer","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patelvivekdev%2Fpdf-textlayer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patelvivekdev%2Fpdf-textlayer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patelvivekdev%2Fpdf-textlayer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patelvivekdev%2Fpdf-text
layer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/patelvivekdev","download_url":"https://codeload.github.com/patelvivekdev/pdf-textlayer/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/patelvivekdev%2Fpdf-textlayer/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34449277,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-17T02:00:05.408Z","response_time":127,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-17T12:30:56.930Z","updated_at":"2026-06-17T12:30:57.811Z","avatar_url":"https://github.com/patelvivekdev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf-textlayer\n\nAdd an invisible, searchable text layer to a PDF from OCR JSON. Pages without OCR text are left exactly as in the input — native text, embedded fonts, and annotations are preserved. Built on **[PyMuPDF](https://pymupdf.readthedocs.io/)**.\n\nPublished on **[PyPI](https://pypi.org/project/pdf-textlayer/)** · Developed on **[GitHub](https://github.com/patelvivekdev/pdf-textlayer)**.\n\n## What it does\n\nGiven a PDF and a JSON file from an OCR pipeline, `pdf-textlayer` writes a new PDF where every OCR-detected word is drawn at its bounding box in **PDF render mode 3** (invisible glyphs). 
The original page content is unchanged; the new layer is what `Ctrl+F`, screen readers, and indexers see.\n\n- **Mixed PDFs are handled correctly.** A 200-page document with two scanned exhibits gets the overlay only on the scanned pages. Pages with native text are passed through untouched.\n- **Adapter-based.** Out of the box it understands [liteparse](https://github.com/run-llama/liteparse) JSON. Other sources (Azure Document Intelligence, Textract, hOCR, MinerU, your own pipeline) can be added by writing a small adapter; see [Adapters](#adapters).\n- **Minimal dependencies.** Only `pymupdf`.\n\n## Installation\n\nRequires Python **3.11+**.\n\n```bash\npip install pdf-textlayer\n# or\nuv add pdf-textlayer\n```\n\nThis installs the `pdf-textlayer` CLI and the `pdf_textlayer` Python package.\n\n## CLI\n\n```bash\n# 1. Produce OCR JSON for your PDF (example: liteparse)\nliteparse input.pdf --format json -o parsed.json\n\n# 2. Build the searchable PDF\npdf-textlayer input.pdf parsed.json output.pdf\n```\n\nOptions:\n\n```bash\n# --from: adapter name (default: liteparse)\n# --min-confidence: drop low-confidence OCR items\n# --font: CJK font for Asian-language documents\npdf-textlayer input.pdf parsed.json output.pdf \\\n    --from liteparse \\\n    --min-confidence 0.5 \\\n    --font china-s\n```\n\n## Python API\n\n```python\nfrom pdf_textlayer import make_searchable_pdf, Options\n\nstats = make_searchable_pdf(\n    \"input.pdf\",\n    \"parsed.json\",                       # path, or already-loaded dict\n    \"output.pdf\",\n    adapter=\"liteparse\",                 # default\n    options=Options(min_confidence=0.5, font=\"helv\"),\n)\nprint(stats)\n# {'pages': 19, 'ocr_pages': [1, 2, ..., 19], 'items_drawn': 7643}\n```\n\nIf you already have the OCR results in memory (e.g. 
from another library), build `ParsedPage` objects directly and skip the adapter:\n\n```python\nfrom pdf_textlayer import ParsedPage, TextBox, write_textlayer\n\npages = [\n    ParsedPage(\n        page_number=1,\n        boxes=[\n            TextBox(text=\"Hello\", x=72.0, y=108.0, width=46.0, height=12.0, confidence=0.98),\n            TextBox(text=\"world\", x=124.0, y=108.0, width=48.0, height=12.0, confidence=0.97),\n        ],\n    ),\n]\nwrite_textlayer(\"input.pdf\", pages, \"output.pdf\")\n```\n\nCoordinates use PyMuPDF's convention: origin at the **top-left** of the page, `x` right, `y` down, in PDF user-space points. `(x, y)` is the top-left corner of the box.\n\n## How mixed pages are handled\n\nFor each page in the input PDF, the adapter decides whether the page has OCR content to overlay:\n\n1. If the adapter yields a `ParsedPage` with non-empty `boxes` → the page receives an invisible overlay (subject to `min_confidence`).\n2. If the adapter omits the page, or yields it with no boxes → the page is left exactly as-is.\n\nFor the bundled liteparse adapter, \"has OCR content\" means the page contains at least one `textItems` entry with `fontName == \"OCR\"`. Native-text items keep the real font name and are skipped, so the original PDF text is never duplicated.\n\n## Adapters\n\nAn adapter is a small class that turns a source's JSON into normalized `ParsedPage` objects. 
The protocol is one method:\n\n```python\nfrom collections.abc import Iterable\nfrom pdf_textlayer import ParsedPage, register_adapter\n\nclass MyAdapter:\n    def parse(self, data) -\u003e Iterable[ParsedPage]:\n        for page in data[\"pages\"]:\n            yield ParsedPage(\n                page_number=page[\"index\"] + 1,        # 1-indexed\n                boxes=[...],                          # list[TextBox]\n            )\n\nregister_adapter(\"mine\", MyAdapter())\n# pdf-textlayer in.pdf in.json out.pdf --from mine\n```\n\nBundled adapters:\n\n| Name | Source | Notes |\n|------|--------|-------|\n| `liteparse` | [liteparse](https://github.com/run-llama/liteparse) `--format json` | Items with `fontName == \"OCR\"` per the [OCR API spec](https://github.com/run-llama/liteparse/blob/main/OCR_API_SPEC.md). |\n\nPRs adding more adapters (Azure DI, Textract, hOCR, …) are welcome.\n\n## Fonts and non-Latin text\n\nThe default font (`helv` = Helvetica) covers Latin-1. Unsupported glyphs are replaced with `?` so each item still emits a usable text run; set `Options(sanitize_unsupported=False)` to drop those items instead.\n\nFor CJK documents, pass a bundled PyMuPDF font:\n\n```bash\npdf-textlayer in.pdf parsed.json out.pdf --font china-s\n# china-s, china-t, japan, korea — bundled with PyMuPDF\n```\n\n## How the overlay is drawn\n\nFor each box, the font is scaled so the rendered text width matches the box width. The insertion point is the glyph baseline, placed at `(x, y + height)`. 
Text is written with PDF render mode 3 — present in the content stream, never painted — so search, copy/paste, and assistive tech work, but the visible page is unchanged.\n\n## Limitations\n\n- **Rotated text is drawn upright.** OCR sources that don't include rotation metadata can't be rendered rotated; in practice most pipelines de-rotate before emitting boxes.\n- **Invisible text uses PDF render mode 3.** All modern viewers and search indexers handle this correctly; very old tooling may not extract it.\n- **No de-duplication across overlay and native text.** If you point the tool at native-text pages but the adapter treats them as OCR, you'll get duplicated search hits. The bundled `liteparse` adapter avoids this by filtering on `fontName`.\n\n## Contributing\n\nBug reports, docs improvements, and new adapters are welcome — open an issue or PR on [GitHub](https://github.com/patelvivekdev/pdf-textlayer/issues).\n\n## License\n\nReleased under the **MIT License**. See [`LICENSE`](LICENSE) for the full text.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpatelvivekdev%2Fpdf-textlayer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpatelvivekdev%2Fpdf-textlayer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpatelvivekdev%2Fpdf-textlayer/lists"}