https://github.com/rapidai/rapidocrpdf
Based on RapidOCR, extract the PDF content.
convert extract-pdf-data ocr ocr-pdf
- Host: GitHub
- URL: https://github.com/rapidai/rapidocrpdf
- Owner: RapidAI
- License: apache-2.0
- Created: 2023-04-16T05:45:44.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-23T14:12:08.000Z (4 months ago)
- Last Synced: 2025-03-31T09:03:13.504Z (4 months ago)
- Topics: convert, extract-pdf-data, ocr, ocr-pdf
- Language: Python
- Homepage:
- Size: 1.11 MB
- Stars: 156
- Watchers: 3
- Forks: 18
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
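### Introduction
This repository builds on [RapidOCR](https://github.com/RapidAI/RapidOCR) to quickly extract text from PDFs, including scanned PDFs, encrypted PDFs, and PDFs whose text can be copied directly.

🔥🔥🔥 For layout recovery, see the project [RapidLayoutRecover](https://github.com/RapidAI/RapidLayoutRecover).
### Overall workflow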
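In code terms, the workflow in this section is a routing decision: use the embedded text layer when one exists, and fall back to OCR otherwise. The following is a minimal sketch of that idea only, not the library's actual implementation. It assumes the decision is made page by page, and `route_pages`, `read_text_layer`, and `run_ocr` are hypothetical names; in the real project those two roles are played by PyMuPDF and RapidOCR.

```python
from typing import Callable, List, Sequence, Tuple


def route_pages(
    pages: Sequence[object],
    read_text_layer: Callable[[object], str],
    run_ocr: Callable[[object], str],
    force_ocr: bool = False,
) -> List[Tuple[int, str, str]]:
    """Return (page_number, text, method) for each page.

    Hypothetical helper: both callables are stand-ins supplied by the caller,
    and the per-page granularity is an assumption.
    """
    results = []
    for page_number, page in enumerate(pages):
        # Skip the text layer entirely when OCR is forced.
        text = "" if force_ocr else read_text_layer(page).strip()
        if text:
            results.append((page_number, text, "direct"))
        else:
            # No usable text layer (or OCR forced): recognise the page image.
            results.append((page_number, run_ocr(page).strip(), "ocr"))
    return results


# Stand-in pages: plain dicts instead of real PDF page objects.
demo_pages = [
    {"text_layer": "Selectable text", "scan": "ignored"},
    {"text_layer": "", "scan": "Text recovered by OCR"},
]
print(route_pages(demo_pages, lambda p: p["text_layer"], lambda p: p["scan"]))
```

Passing `force_ocr=True` skips the text-layer check, mirroring the `force_ocr` option shown under Usage. The diagram below shows the same decision.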
```mermaid
flowchart LR

A(PDF) --> B{Can the content be extracted directly?} --Yes--> C(PyMuPDF)
B --No--> D(RapidOCR)

C & D --> E(Result)
```

### Installation
```bash
# CPU-based, depends on rapidocr_onnxruntime
pip install rapidocr_pdf[onnxruntime]

# CPU-based, depends on rapidocr_openvino (faster)
pip install rapidocr_pdf[openvino]

# GPU-based, depends on rapidocr_paddle
# 1. Install the GPU build of the PaddlePaddle framework; see https://www.paddlepaddle.org.cn/
# 2. Install rapidocr_pdf[paddle]
pip install rapidocr_pdf[paddle]
```

### Usage
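Script usage:
```python
from rapidocr_pdf import PDFExtracter

# Create the extractor once and reuse it across files.
pdf_extracter = PDFExtracter()

pdf_path = 'tests/test_files/direct_and_image.pdf'

# force_ocr=False: do not force OCR on every page.
texts = pdf_extracter(pdf_path, force_ocr=False)
print(texts)
```

Command-line usage: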
```bash
$ rapidocr_pdf -h
usage: rapidocr_pdf [-h] [-path FILE_PATH] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -path FILE_PATH, --file_path FILE_PATH
                        File path, PDF or images
  -f, --force_ocr       Whether to use ocr for all pages.

$ rapidocr_pdf -path tests/test_files/direct_and_image.pdf
```

### Input and output
**Input**: `Union[str, Path, bytes]`

**Output**: a `List` of \[**page number**, **text content**, **confidence**\] entries; a sample appears below.
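
Going by the sample output in this section, every field is returned as a string, including the page number and the confidence. Below is a minimal sketch, assuming that shape holds for all rows, of converting the fields and grouping text by page. `group_text_by_page` and `min_confidence` are hypothetical names, not part of `rapidocr_pdf`, and the rows in the demo are made-up placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence


def group_text_by_page(
    rows: Iterable[Sequence[str]],
    min_confidence: float = 0.0,
) -> Dict[int, List[str]]:
    """Group text by integer page number, dropping low-confidence rows.

    Assumes each row is [page, text, confidence] with all three as strings.
    """
    pages: Dict[int, List[str]] = defaultdict(list)
    for page, text, confidence in rows:
        # The confidence arrives as a string, so convert before comparing.
        if float(confidence) >= min_confidence:
            pages[int(page)].append(text)
    return dict(pages)


# Placeholder rows in the documented [page, text, confidence] shape.
demo_rows = [
    ["0", "first page text", "0.95"],
    ["1", "second page text", "0.42"],
]
print(group_text_by_page(demo_rows, min_confidence=0.5))
```

Keeping the conversion in one small helper means the rest of a script can work with integer page numbers and never has to parse strings. For reference, the sample output from the README: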
```python
[
    ['0', '人之初,性本善。性相近,习相远。', '0.8969868'],
    ['1', 'Men at their birth, are naturally good.', '0.8969868'],
]
```

### [Changelog](https://github.com/RapidAI/RapidOCRPDF/releases)