https://github.com/NoEdgeAI/pdfdeal

A Python wrapper for the Doc2X API that also comes with local text processing (to improve PDF recall in RAG).


Host: GitHub
URL: https://github.com/NoEdgeAI/pdfdeal
Owner: NoEdgeAI
License: mit
Created: 2024-05-28T15:11:11.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-02-10T15:49:26.000Z (about 1 year ago)
Last Synced: 2025-02-10T16:33:35.725Z (about 1 year ago)
Topics: doc2x, ocr, pdf, rag
Language: Python
Homepage: https://noedgeai.github.io/pdfdeal-docs/
Size: 497 KB
Stars: 221
Watchers: 2
Forks: 12
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

awesome-hacking-lists - NoEdgeAI/pdfdeal - A Python wrapper for the Doc2X API that also comes with local text processing (to improve PDF recall in RAG). (Python)

README

# pdfdeal

[![Downloads](https://static.pepy.tech/badge/pdfdeal)](https://pepy.tech/project/pdfdeal) ![GitHub License](https://img.shields.io/github/license/Menghuan1918/pdfdeal) ![PyPI - Version](https://img.shields.io/pypi/v/pdfdeal) ![GitHub Repo stars](https://img.shields.io/github/stars/Menghuan1918/pdfdeal)




[📄Documentation](https://menghuan1918.github.io/pdfdeal-docs/guide/)




🗺️ ENGLISH | [简体中文](README_CN.md)



Handle PDFs more easily, using Doc2X's document conversion capabilities for format-preserving file conversion and RAG enhancement.







## Introduction

### Doc2X Support

[Doc2X](https://doc2x.com/) is a new universal document OCR tool that can convert images or PDF files into Markdown/LaTeX text while keeping formulas and text formatting. It performs better than similar tools in most scenarios. `pdfdeal` provides wrapper classes for sending requests to Doc2X.

### Processing PDFs

`pdfdeal` can use various OCR or PDF recognition tools to recognize the images in a PDF and add the recognized text to the original text. If you set the output format to PDF, the recognized text keeps the same page numbers in the new PDF as in the original. It also offers various practical file processing tools.

After converting and pre-processing PDFs with Doc2X, you can achieve better recognition rates when using them with knowledge base applications such as [graphrag](https://github.com/microsoft/graphrag), [Dify](https://github.com/langgenius/dify), and [FastGPT](https://github.com/labring/FastGPT).

### Markdown Document Processing Features

`pdfdeal` also provides a series of powerful tools to handle Markdown documents:

- **Convert HTML tables to Markdown format**: Converts HTML-formatted tables to Markdown tables for easy use in Markdown documents.

- **Upload images to remote storage services**: Uploads local or online images referenced in Markdown documents to remote storage services, keeping the images persistent and accessible.

- **Convert online images to local images**: Downloads online images referenced in Markdown documents and saves them locally for offline use.

- **Document splitting and separator addition**: Splits Markdown documents by headings or adds separators within documents for better organization and management.

For a detailed introduction to these features and their usage, see the [documentation](https://menghuan1918.github.io/pdfdeal-docs/guide/Tools/).
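
To make the HTML-table conversion listed above more concrete, here is a minimal sketch of the general idea using only the Python standard library. It is not `pdfdeal`'s implementation: the class `_TableParser` and the function `html_table_to_markdown` are hypothetical names chosen for illustration, and the sketch assumes a single simple table without `rowspan`/`colspan`, treating the first row as the header.

```python
from html.parser import HTMLParser


class _TableParser(HTMLParser):
    """Collect the cell text of a simple HTML table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row currently being read, if any
        self._cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse whitespace and escape pipes so the cell stays on one line
            text = " ".join("".join(self._cell).split())
            self._row.append(text.replace("|", "\\|"))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_markdown(html: str) -> str:
    """Convert one simple HTML table to a Markdown table (first row = header)."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    if not rows:
        return ""
    # Pad short rows so every row has the same number of columns
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]
    lines = ["| " + " | ".join(rows[0]) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = "<table><tr><th>Name</th><th>Value</th></tr><tr><td>a</td><td>1</td></tr></table>"
    print(html_table_to_markdown(sample))
```

For real documents, prefer the tools described in the documentation linked above; this sketch only illustrates what such a conversion involves.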

## Cases

### graphrag

See [how to use it with graphrag](https://menghuan1918.github.io/pdfdeal-docs/demo/graphrag.html). graphrag itself [does not support recognizing PDFs](https://github.com/microsoft/graphrag), but you can use the CLI tool `doc2x` to convert a PDF into a txt document that it can use.







### FastGPT/Dify or other RAG systems

For knowledge base applications, you can use `pdfdeal`'s built-in document enhancements, such as uploading images to remote storage services or adding breaks by paragraph. See [Integration with RAG applications](https://menghuan1918.github.io/pdfdeal-docs/demo/RAG_pre.html).
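
As a rough illustration of what adding breaks to a document can look like before it is loaded into a knowledge base, the sketch below inserts a separator line before each Markdown heading and then splits on it. It is a standalone approximation, not `pdfdeal`'s own code: the function names and the separator string `=+=+=+=` are assumptions made for this example, and it breaks on headings rather than on paragraphs.

```python
import re

# Matches ATX headings such as "# Title" or "### Sub-section"
HEADING = re.compile(r"^#{1,6}\s+\S")


def add_separators(markdown: str, separator: str = "=+=+=+=") -> str:
    """Insert `separator` on its own line before every heading except one on
    the first line, ignoring lines inside fenced code blocks."""
    out = []
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on opening and closing fences
        elif not in_fence and HEADING.match(line) and out:
            out.append(separator)
        out.append(line)
    return "\n".join(out)


def split_by_separator(markdown: str, separator: str = "=+=+=+=") -> list:
    """Split a separated document into non-empty, stripped chunks."""
    return [chunk.strip() for chunk in markdown.split(separator) if chunk.strip()]


if __name__ == "__main__":
    doc = "# A\ntext\n## B\n```\n# not a heading\n```\n## C\nmore"
    for chunk in split_by_separator(add_separators(doc)):
        print(repr(chunk))
```

Many knowledge base applications let you choose a custom chunk delimiter, so a distinctive separator that is unlikely to occur in normal text is a reasonable choice.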









## Documentation

For details, please refer to the [documentation](https://menghuan1918.github.io/pdfdeal-docs/).

Or check out the [documentation repository pdfdeal-docs](https://github.com/Menghuan1918/pdfdeal-docs).

## Quick Start

For details, please refer to the [documentation](https://menghuan1918.github.io/pdfdeal-docs/).

### Installation

Install using pip:

```bash
pip install --upgrade pdfdeal
```

If you need [document processing tools](https://menghuan1918.github.io/pdfdeal-docs/guide/Tools/):

```bash
pip install --upgrade "pdfdeal[rag]"
```

### Use the Doc2X PDF API to process all PDF files in a specified folder

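```python
from pdfdeal import Doc2X

# Replace the placeholder with your own Doc2X API key
client = Doc2X(apikey="Your API key", debug=True)

# pdf_file points to a folder here, so the PDF files inside it are processed
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf",
    output_path="./Output",
    output_format="docx",
)

# Inspect the three returned values
print(success)
print(failed)
print(flag)
```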

### Use the Doc2X PDF API to process a specified PDF file and set the name of the exported file

```python
from pdfdeal import Doc2X

# Replace the placeholder with your own Doc2X API key
client = Doc2X(apikey="Your API key", debug=True)

# pdf_file points to a single PDF; output_names sets the exported file name
success, failed, flag = client.pdf2file(
    pdf_file="tests/pdf/sample.pdf",
    output_path="./Output/test/single/pdf2file",
    output_names=["sample1.zip"],
    output_format="md_dollar",
)

# Inspect the three returned values
print(success)
print(failed)
print(flag)
```
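
The examples above pass the API key as a literal string. If you would rather not keep the key in source code, one option is to read it from an environment variable and pass the result to the `apikey` parameter. The helper below is a small sketch of that idea; `load_api_key` is a hypothetical function and the variable name `DOC2X_APIKEY` is an assumption for this example, so check the documentation for what `pdfdeal` itself supports.

```python
import os


def load_api_key(env_var: str = "DOC2X_APIKEY") -> str:
    """Return the API key stored in `env_var`, or raise a clear error if unset.

    The default variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your Doc2X API key"
        )
    return key


# Usage (requires pdfdeal to be installed and the variable to be set):
# from pdfdeal import Doc2X
# client = Doc2X(apikey=load_api_key())
```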

See the online documentation for details.
