{"id":24337608,"url":"https://github.com/jsv4/pawlsparser","last_synced_at":"2026-04-18T16:38:33.512Z","repository":{"id":272231347,"uuid":"915905571","full_name":"JSv4/PawlsParser","owner":"JSv4","description":"Extract PAWLS tokens from PDFs","archived":false,"fork":false,"pushed_at":"2025-01-13T04:36:29.000Z","size":1120,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-11T18:54:18.610Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JSv4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-13T04:27:17.000Z","updated_at":"2025-01-13T04:36:44.000Z","dependencies_parsed_at":"2025-01-13T05:26:33.333Z","dependency_job_id":null,"html_url":"https://github.com/JSv4/PawlsParser","commit_stats":null,"previous_names":["jsv4/pawlsparser"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/JSv4/PawlsParser","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPawlsParser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPawlsParser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPawlsParser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPawlsParser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JSv4","download_url":"https://codeload.github.com/JSv4/PawlsParser/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPawlsParser/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31976800,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-18T16:27:12.723Z","status":"ssl_error","status_checked_at":"2026-04-18T16:27:11.140Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-18T06:16:32.879Z","updated_at":"2026-04-18T16:38:33.496Z","avatar_url":"https://github.com/JSv4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PdfTokenizer\n\n[![PyPI - Version](https://img.shields.io/pypi/v/pdftokenizer.svg)](https://pypi.org/project/pdftokenizer)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/pdftokenizer.svg)](https://pypi.org/project/pdftokenizer)\n\nA Python library for extracting text from PDFs with automatic OCR detection.\n\n## Features\n\n- 🔍 **Smart OCR Detection**: Automatically determines if OCR is needed by analyzing text extractability\n- 🔄 **Dual Extraction Methods**: Uses PdfPlumber for native PDFs and Tesseract for scanned documents\n- 🪟 **Windows Support**: Automatic Poppler download and setup for Windows users\n\n## Installation\n\n```console\npip install pdftokenizer\n```\n\n## Quick Start\n\n```python\nfrom pdftokenizer import extract_tokens_from_pdf\n\n# Read your PDF file\nwith open(\"document.pdf\", \"rb\") as f:\n    pdf_bytes = f.read()\n\n# Extract tokens - OCR will be used automatically if needed\npages = extract_tokens_from_pdf(pdf_bytes)\n\n# Force OCR if desired\npages_ocr = extract_tokens_from_pdf(pdf_bytes, force_ocr=True)\n```\n\n## How It Works\n\nThe library automatically determines whether to use OCR based on text extractability:\n\n1. Attempts to extract text from the PDF using PyPDF\n2. If the extracted text contains fewer than 10 characters (configurable threshold), the PDF is considered to need OCR\n3. Based on this detection:\n   - Text-based PDFs: Processed using PdfPlumber for efficient extraction\n   - Scanned/Image PDFs: Processed using Tesseract OCR\n\n## Requirements\n\n### Poppler\nPDF processing backend:\n- **Windows**: Automatically downloaded and configured\n- **Linux**: `apt-get install poppler-utils`\n- **macOS**: `brew install poppler`\n\n### Tesseract\nRequired for OCR functionality:\n- **Windows**: Download from [UB Mannheim](https://github.com/UB-Mannheim/tesseract/wiki)\n- **Linux**: `apt-get install tesseract-ocr`\n- **macOS**: `brew install tesseract`\n\n## License\n\n`pdftokenizer` is distributed under the terms of the [MIT](https://spdx.org/licenses/MIT.html) license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Fpawlsparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjsv4%2Fpawlsparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Fpawlsparser/lists"}