{"id":30240452,"url":"https://github.com/hypothesis/pdf-text-quality","last_synced_at":"2025-08-15T04:39:01.547Z","repository":{"id":40430222,"uuid":"487312206","full_name":"hypothesis/pdf-text-quality","owner":"hypothesis","description":"Tool for measuring the quality of PDF text layers","archived":false,"fork":false,"pushed_at":"2024-04-05T12:58:53.000Z","size":3078,"stargazers_count":3,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-04-14T13:53:02.619Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hypothesis.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2022-04-30T15:26:05.000Z","updated_at":"2024-04-14T13:53:02.620Z","dependencies_parsed_at":"2024-04-05T13:59:45.492Z","dependency_job_id":null,"html_url":"https://github.com/hypothesis/pdf-text-quality","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hypothesis/pdf-text-quality","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hypothesis%2Fpdf-text-quality","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hypothesis%2Fpdf-text-quality/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hypothesis%2Fpdf-text-quality/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hypothesis%2Fpdf-text-quality/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hypothesis","download_u
rl":"https://codeload.github.com/hypothesis/pdf-text-quality/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hypothesis%2Fpdf-text-quality/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":270524460,"owners_count":24600196,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-15T02:00:12.559Z","response_time":110,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-08-15T04:38:45.435Z","updated_at":"2025-08-15T04:39:01.535Z","avatar_url":"https://github.com/hypothesis.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pdf-text-quality\n\nTools for evaluating the quality of the text layer in PDFs.\n\n## Background\n\nPages in PDF files which have been created from bitmap images rather than being\ncreated from a digital source (e.g. by rendering a Word or LaTeX document to a\nPDF) typically combine the input bitmaps, as background graphics, with hidden\nPDF text drawing commands overlaid on top. 
The hidden text enables text\nselection to work.\n\nIf the words in the hidden text layer are not well aligned with the visible text\nin the bitmap image, or if the characters were mis-recognized, text selection\nwill be difficult, or produce incorrect results when text is copied.\n\nPoor text layers present a problem for text-centric annotation tools like\n[Hypothesis](https://web.hypothes.is). Detecting low-quality text layers can\nbe used as part of manual or automated processes to re-OCR affected pages\nusing a more modern/better-configured OCR tool.\n\n## Installation\n\n1. Install [Pipenv](https://pipenv.pypa.io/en/latest/)\n\n2. Install [Tesseract](https://github.com/tesseract-ocr/tesseract) and\n   [Poppler](https://poppler.freedesktop.org) utilities.\n\n   On macOS, these can be installed via Homebrew:\n\n   ```\n   brew install poppler tesseract\n   ```\n\n   On Linux, using apt:\n\n   ```\n   sudo apt-get install poppler-utils tesseract-ocr\n   ```\n\n   After this step, `pdftotext`, `pdftocairo` and `tesseract` binaries should\n   be available on your system.\n\n3. Clone this repository and install Python dependencies with:\n\n   ```\n   pipenv install --dev\n   ```\n\n## Usage\n\n```\npipenv run check-pdf [args] \u003cpdf_file\u003e\n```\n\nThis will check all pages in `pdf_file`. The range of pages to check can be\nadjusted using the `--first-page` and `--last-page` options.\n\nFor each checked page, a set of metrics is displayed. 
For example:\n\n```shellsession\n$ pipenv run check-pdf test-data/implementing-quicksort.pdf --first-page=1 --last-page=5\n\nChecking text pages 1 to 5\nPage 1 IoU: iou: 0.73, iou_x: 0.96, iou_y: 0.76, iou_weighted: 0.90\nPage 2 IoU: iou: 0.19, iou_x: 0.33, iou_y: 0.41, iou_weighted: 0.35\nPage 3 IoU: iou: 0.65, iou_x: 0.89, iou_y: 0.70, iou_weighted: 0.83\nPage 4 IoU: iou: 0.65, iou_x: 0.90, iou_y: 0.70, iou_weighted: 0.84\nPage 5 IoU: iou: 0.64, iou_x: 0.86, iou_y: 0.72, iou_weighted: 0.82\n```\n\nThe `iou_weighted` metric is the main indicator to look at. \"iou\" stands for\n[Intersection over Union](https://en.wikipedia.org/wiki/Jaccard_index) and\nindicates how well bounding boxes for words in the text layer match up to text\nfound in the page image via OCR. Values above 0.7 are generally \"good\", values\nfrom 0.5 to 0.7 are \"fair\", and values below 0.5 are \"poor\".\n\nIn the above example, most pages have a \"good\" text layer, but page 2 does not.\nThe quality of pages within a PDF can vary because they have been generated by\ndifferent means, or because certain aspects of a particular page affected the\ntext layer construction. 
For example, a page that was skewed when placed in\nthe scanner may have OCR-ed poorly when originally processed.\n\nFor contrast, here is the output from a PDF file which has a digitally created\ncover page, with an accurate text layer, followed by a series of badly OCR-ed\npages with poor text layers:\n\n```shellsession\n$ pipenv run check-pdf bad-file.pdf\n\nChecking text pages 1 to 5\nPage 1 IoU: iou: 0.54, iou_x: 0.91, iou_y: 0.57, iou_weighted: 0.81\nPage 2 IoU: iou: 0.20, iou_x: 0.41, iou_y: 0.36, iou_weighted: 0.39\nPage 3 IoU: iou: 0.17, iou_x: 0.31, iou_y: 0.28, iou_weighted: 0.30\nPage 4 IoU: iou: 0.19, iou_x: 0.36, iou_y: 0.31, iou_weighted: 0.35\nPage 5 IoU: iou: 0.20, iou_x: 0.40, iou_y: 0.34, iou_weighted: 0.38\n```\n\nTo see the full set of options, run:\n\n```\npipenv run check-pdf --help\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhypothesis%2Fpdf-text-quality","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhypothesis%2Fpdf-text-quality","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhypothesis%2Fpdf-text-quality/lists"}