{"id":18273591,"url":"https://github.com/rapidai/rapidocrpdf","last_synced_at":"2025-04-07T10:19:24.178Z","repository":{"id":153282734,"uuid":"628494961","full_name":"RapidAI/RapidOCRPDF","owner":"RapidAI","description":"Based on RapidOCR, extract the PDF content.","archived":false,"fork":false,"pushed_at":"2025-03-23T14:12:08.000Z","size":1160,"stargazers_count":156,"open_issues_count":0,"forks_count":18,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-31T09:03:13.504Z","etag":null,"topics":["convert","extract-pdf-data","ocr","ocr-pdf"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RapidAI.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-16T05:45:44.000Z","updated_at":"2025-03-28T00:56:54.000Z","dependencies_parsed_at":null,"dependency_job_id":"5a10dcea-5d83-4559-aa0b-7249be61b8e4","html_url":"https://github.com/RapidAI/RapidOCRPDF","commit_stats":null,"previous_names":[],"tags_count":4,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRPDF","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRPDF/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRPDF/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RapidAI%2FRapidOCRPDF/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RapidAI","download_url":"https://c
odeload.github.com/RapidAI/RapidOCRPDF/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247631834,"owners_count":20970069,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["convert","extract-pdf-data","ocr","ocr-pdf"],"created_at":"2024-11-05T12:06:59.518Z","updated_at":"2025-04-07T10:19:24.154Z","avatar_url":"https://github.com/RapidAI.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n    \u003cdiv align=\"center\"\u003e\n    \u003ch1\u003e\u003cb\u003e\u003ci\u003eRapidOCR 📄 PDF\u003c/i\u003e\u003c/b\u003e\u003c/h1\u003e\n    \u003c/div\u003e\n\n\u003ca href=\"\"\u003e\u003cimg src=\"https://img.shields.io/badge/Python-\u003e=3.6,\u003c3.12-aff.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"\"\u003e\u003cimg src=\"https://img.shields.io/badge/OS-Linux%2C%20Win%2C%20Mac-pink.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.org/project/rapidocr-pdf/\"\u003e\u003cimg alt=\"PyPI\" src=\"https://img.shields.io/pypi/v/rapidocr-pdf\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pepy.tech/project/rapidocr-pdf\"\u003e\u003cimg src=\"https://static.pepy.tech/personalized-badge/rapidocr-pdf?period=total\u0026units=abbreviation\u0026left_color=grey\u0026right_color=blue\u0026left_text=Downloads\"\u003e\u003c/a\u003e\n\u003ca href=\"https://semver.org/\"\u003e\u003cimg alt=\"SemVer2.0\" src=\"https://img.shields.io/badge/SemVer-2.0-brightgreen\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg 
src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"\u003e\u003c/a\u003e\n\u003ca href=\"https://choosealicense.com/licenses/apache-2.0/\"\u003e\u003cimg alt=\"GitHub\" src=\"https://img.shields.io/github/license/RapidAI/RapidOCRPDF\"\u003e\u003c/a\u003e\n\n\u003c/div\u003e\n\n### Introduction\n\nThis repository builds on [RapidOCR](https://github.com/RapidAI/RapidOCR) to quickly extract text from PDFs, including scanned PDFs, encrypted PDFs, and PDFs whose text can be copied directly.\n\n🔥🔥🔥 For layout recovery, see the project [RapidLayoutRecover](https://github.com/RapidAI/RapidLayoutRecover)\n\n### Overall workflow\n\n```mermaid\nflowchart LR\n\nA(PDF) --\u003e B{Can the content be extracted directly} --Yes--\u003e C(PyMuPDF)\nB --No--\u003e D(RapidOCR)\n\nC \u0026 D --\u003e E(Result)\n```\n\n### Installation\n\n```bash\n# CPU-based, depends on rapidocr_onnxruntime\npip install rapidocr_pdf[onnxruntime]\n\n# CPU-based, depends on rapidocr_openvino (faster)\npip install rapidocr_pdf[openvino]\n\n# GPU-based, depends on rapidocr_paddle\n# 1. Install the GPU build of the PaddlePaddle framework, see: https://www.paddlepaddle.org.cn/\n# 2. Install rapidocr_pdf[paddle]\npip install rapidocr_pdf[paddle]\n```\n\n### Usage\n\nScript usage\n\n```python\nfrom rapidocr_pdf import PDFExtracter\n\npdf_extracter = PDFExtracter()\n\npdf_path = 'tests/test_files/direct_and_image.pdf'\ntexts = pdf_extracter(pdf_path, force_ocr=False)\nprint(texts)\n```\n\nCommand-line usage\n\n```bash\n$ rapidocr_pdf -h\nusage: rapidocr_pdf [-h] [-path FILE_PATH] [-f]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  -path FILE_PATH, --file_path FILE_PATH\n                        File path, PDF or images\n  -f, --force_ocr       Whether to use ocr for all pages.\n\n$ rapidocr_pdf -path tests/test_files/direct_and_image.pdf\n```\n\n### Input and output\n\n**Input**: `Union[str, Path, bytes]`\n\n**Output**: a `List` of \\[**page number**, **text content**, **confidence**\\] entries; see the example below:\n\n```python\n[\n    ['0', '人之初，性本善。性相近，习相远。', '0.8969868'],\n    ['1', 'Men at their birth, are naturally good.', '0.8969868'],\n]\n```\n\n### [Changelog](https://github.com/RapidAI/RapidOCRPDF/releases)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frapidai%2Frapidocrpdf","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frapidai%2Frapidocrpdf","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frapidai%2Frapidocrpdf/lists"}