https://github.com/rapidai/rapidocrpdf

Based on RapidOCR, extract the PDF content.
Host: GitHub
URL: https://github.com/rapidai/rapidocrpdf
Owner: RapidAI
License: apache-2.0
Created: 2023-04-16T05:45:44.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2025-03-23T14:12:08.000Z (4 months ago)
Last Synced: 2025-03-31T09:03:13.504Z (4 months ago)
Topics: convert, extract-pdf-data, ocr, ocr-pdf
Language: Python
Homepage:
Size: 1.11 MB
Stars: 156
Watchers: 3
Forks: 18
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

RapidOCR 📄 PDF

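### Introduction

This repository builds on [RapidOCR](https://github.com/RapidAI/RapidOCR) to quickly extract text from PDFs, including scanned PDFs, encrypted PDFs, and PDFs whose text can be copied directly.

🔥🔥🔥 For layout recovery, see the [RapidLayoutRecover](https://github.com/RapidAI/RapidLayoutRecover) project.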

### Overall workflow

```mermaid
flowchart LR

A(PDF) --> B{"Can the content be extracted directly?"} --Yes--> C(PyMuPDF)
B --No--> D(RapidOCR)
C & D --> E(Result)
```
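
The routing in the flowchart can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the library's actual implementation: `extract_pages`, `direct_extract`, and `ocr_extract` are hypothetical names, and the two callables merely stand in for PyMuPDF text extraction and RapidOCR.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical names throughout: this mirrors the flowchart only,
# it is not rapidocr_pdf's real code.
PageResult = Tuple[int, str, str]  # (page number, text, source)


def extract_pages(
    pages: Sequence[object],
    direct_extract: Callable[[object], str],
    ocr_extract: Callable[[object], str],
    force_ocr: bool = False,
) -> List[PageResult]:
    """Use direct extraction when a page yields text, otherwise fall back to OCR."""
    results: List[PageResult] = []
    for page_num, page in enumerate(pages):
        # Skip direct extraction entirely when OCR is forced.
        text = "" if force_ocr else direct_extract(page).strip()
        if text:
            results.append((page_num, text, "direct"))
        else:
            results.append((page_num, ocr_extract(page).strip(), "ocr"))
    return results


# Stand-in pages: a dict with optional embedded text and an image payload.
demo_pages = [
    {"text": "Hello", "image": "scan-0"},
    {"text": "", "image": "scan-1"},
]
demo = extract_pages(
    demo_pages,
    direct_extract=lambda p: p["text"],
    ocr_extract=lambda p: f"ocr({p['image']})",
)
print(demo)
```

Passing the extractors in as callables keeps the decision logic independent of any particular PDF or OCR backend, which also makes it easy to exercise with stand-in data as above.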

### Installation

```bash
# CPU: depends on rapidocr_onnxruntime
pip install "rapidocr_pdf[onnxruntime]"

# CPU: depends on rapidocr_openvino (faster)
pip install "rapidocr_pdf[openvino]"

# GPU: depends on rapidocr_paddle
# 1. Install the GPU build of the PaddlePaddle framework; see https://www.paddlepaddle.org.cn/
# 2. Install rapidocr_pdf[paddle]
pip install "rapidocr_pdf[paddle]"
```
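
To check which OCR backend ended up in the current environment, something like the following may help. It is a sketch that assumes each extra installs a module importable under the dependency name given in the install comments (`rapidocr_onnxruntime`, `rapidocr_openvino`, `rapidocr_paddle`); `installed_backends` is a hypothetical helper, not part of `rapidocr_pdf`.

```python
from importlib.util import find_spec
from typing import List

# Assumption: the import names match the dependency names from the
# install comments; adjust if your environment differs.
BACKENDS = ("rapidocr_onnxruntime", "rapidocr_openvino", "rapidocr_paddle")


def installed_backends() -> List[str]:
    """Return the backend modules that can be located without importing them."""
    return [name for name in BACKENDS if find_spec(name) is not None]


print(installed_backends() or "no RapidOCR backend found")
```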

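### Usage

Use in a script

```python
from rapidocr_pdf import PDFExtracter

pdf_extracter = PDFExtracter()

# Sample PDF from the repository's test files
pdf_path = 'tests/test_files/direct_and_image.pdf'

# Set force_ocr=True to run OCR on all pages
texts = pdf_extracter(pdf_path, force_ocr=False)
print(texts)
```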

Use from the command line

```bash
$ rapidocr_pdf -h
usage: rapidocr_pdf [-h] [-path FILE_PATH] [-f]

optional arguments:
  -h, --help            show this help message and exit
  -path FILE_PATH, --file_path FILE_PATH
                        File path, PDF or images
  -f, --force_ocr       Whether to use ocr for all pages.

$ rapidocr_pdf -path tests/test_files/direct_and_image.pdf
```

### Input and output

**Input**: `Union[str, Path, bytes]`

**Output**: a `List` of \[**page number**, **text content**, **confidence**\] entries, as in the example below:

```python
[
    ['0', '人之初，性本善。性相近，习相远。', '0.8969868'],
    ['1', 'Men at their birth, are naturally good.', '0.8969868'],
]
```
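
Because every field in the example output is a string, downstream code will usually want to convert the page number and confidence before using them. The sketch below groups text by page and drops low-confidence entries; `texts_by_page` and its `min_confidence` threshold are hypothetical, and it assumes the output has exactly the shape shown above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def texts_by_page(
    results: Sequence[Sequence[str]], min_confidence: float = 0.5
) -> Dict[int, List[str]]:
    """Group text by integer page number, skipping low-confidence entries."""
    pages: Dict[int, List[str]] = defaultdict(list)
    for page_num, text, confidence in results:
        # Fields arrive as strings in the documented example, so convert them.
        if float(confidence) >= min_confidence:
            pages[int(page_num)].append(text)
    return dict(pages)


sample = [
    ['0', '人之初，性本善。性相近，习相远。', '0.8969868'],
    ['1', 'Men at their birth, are naturally good.', '0.8969868'],
]
pages = texts_by_page(sample)
print(pages)
```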

### [Changelog](https://github.com/RapidAI/RapidOCRPDF/releases)
