{"id":21047182,"url":"https://github.com/jsv4/pdf-preprocessors","last_synced_at":"2025-03-13T22:25:20.290Z","repository":{"id":194382052,"uuid":"690867960","full_name":"JSv4/PDF-Preprocessors","owner":"JSv4","description":"Collection of tools to provide text extract outputs for all PDFs that include x,y coordinate data as well as text","archived":false,"fork":false,"pushed_at":"2023-09-13T04:16:41.000Z","size":15,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-01-20T17:50:39.820Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/JSv4.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-09-13T03:37:34.000Z","updated_at":"2024-11-02T11:04:58.000Z","dependencies_parsed_at":"2023-09-13T04:39:21.361Z","dependency_job_id":null,"html_url":"https://github.com/JSv4/PDF-Preprocessors","commit_stats":null,"previous_names":["jsv4/pdf-preprocessors"],"tags_count":5,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPDF-Preprocessors","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPDF-Preprocessors/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPDF-Preprocessors/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/JSv4%2FPDF-Preprocessors/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/JSv4","download_url":"https://codeload.github.com/JSv4/
PDF-Preprocessors/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243491338,"owners_count":20299291,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-19T14:35:51.880Z","updated_at":"2025-03-13T22:25:20.269Z","avatar_url":"https://github.com/JSv4.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Text Annotation Extractor with Tesseract\n\n## Description\nThis Python script is based on Allen AI's PAWLs project and is designed to perform OCR on PDF files and extract the textual annotations on each page. It uses pytesseract, a Python wrapper for the Tesseract OCR engine. The output is intended to be consistent across documents, making it suitable for machine learning training data, analytics, and other applications that require standardized text extraction.\n\nWhile the PAWLs project offers a powerful preprocessor for handling PDFs, it is not actively maintained. This separate project was started to fill that gap and to keep providing a reliable text extraction solution. 
Future updates are planned to introduce additional preprocessors based on other underlying PDF engines.\n\n## Requirements\n\n- Python 3.9\n- pytesseract\n- pdf2image\n- pandas\n\n### Installation\n\nInstall the required Python packages using pip (from the repo root):\n\n```commandline\ncd pdfpreprocessor\npip install .\n```\n\n## Development\n\nRun the linter (after installing dependencies):\n\n```commandline\nhatch run lint:fmt\n```\n\n## Usage\n\nTo run the script, call the `process_tesseract` function, passing the PDF file's path as an argument:\n\n```python\nfrom preprocessors.tesseract import process_tesseract\n\nannotations = process_tesseract(\"path/to/your/pdf/file.pdf\")\n```\n\n## Testing\n\nUnit tests should be written to cover the OCR extraction and scaling logic, to help ensure that both work as expected.\n\n## Contributing\n\nPlease read `CONTRIBUTING.md` for details on our code of conduct and the process for submitting pull requests.\n\n## License\n\nThis project is licensed under the Apache 2.0 License; see the `LICENSE.txt` file for details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Fpdf-preprocessors","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjsv4%2Fpdf-preprocessors","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjsv4%2Fpdf-preprocessors/lists"}