https://github.com/opendataloader-project/opendataloader-pdf

PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU
ai dataloader document-parser document-parsing documents html json markdown ocr-recognition pdf pdf-converter pdf-parser pdf-to-html pdf-to-json pdf-to-markdown rag recognition sdk tables
Last synced: about 1 month ago
Host: GitHub
URL: https://github.com/opendataloader-project/opendataloader-pdf
Owner: opendataloader-project
License: mpl-2.0
Created: 2025-05-13T05:48:02.000Z (10 months ago)
Default Branch: main
Last Pushed: 2026-01-29T07:24:13.000Z (about 1 month ago)
Last Synced: 2026-01-29T19:11:34.080Z (about 1 month ago)
Topics: ai, dataloader, document-parser, document-parsing, documents, html, json, markdown, ocr-recognition, pdf, pdf-converter, pdf-parser, pdf-to-html, pdf-to-json, pdf-to-markdown, rag, recognition, sdk, tables
Language: Java
Homepage: https://opendataloader.org
Size: 78.9 MB
Stars: 833
Watchers: 4
Forks: 47
Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: .github/SECURITY.md
- Support: SUPPORT.md
- Notice: NOTICE.md
# OpenDataLoader PDF

**PDF Parsing for RAG** — Convert to Markdown & JSON, Fast, Local, No GPU

[![License](https://img.shields.io/pypi/l/opendataloader-pdf.svg)](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)

[![PyPI version](https://img.shields.io/pypi/v/opendataloader-pdf.svg)](https://pypi.org/project/opendataloader-pdf/)

[![npm version](https://img.shields.io/npm/v/@opendataloader/pdf.svg)](https://www.npmjs.com/package/@opendataloader/pdf)

[![Maven Central](https://img.shields.io/maven-central/v/org.opendataloader/opendataloader-pdf-core.svg)](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)

[![GHCR Version](https://ghcr-badge.egpl.dev/opendataloader-project/opendataloader-pdf-cli/latest_tag?trim=major&label=docker)](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)

[![Java](https://img.shields.io/badge/Java-11%2B-blue.svg)](https://github.com/opendataloader-project/opendataloader-pdf#java)

Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.

**Why developers choose OpenDataLoader:**

- **Deterministic** — Same input always produces same output (no LLM hallucinations)

- **Fast** — Process 100+ pages per second on CPU

- **Private** — 100% local, zero data transmission

- **Accurate** — Bounding boxes for every element, correct multi-column reading order

```bash
pip install -U opendataloader-pdf
```

```python
import opendataloader_pdf

# Convert a PDF to Markdown and JSON for RAG
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="markdown,json"
)
```




## Why OpenDataLoader?

Building RAG pipelines? You've probably hit these problems:

| Problem | How We Solve It |
|---------|-----------------|
| **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
| **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
| **Headers/footers pollute context** | Auto-filtered before output |
| **No coordinates for citations** | Bounding box for every element |
| **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
| **GPU required** | Pure CPU, rule-based — runs anywhere |
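
To make the reading-order row concrete, here is a minimal sketch of the classic recursive XY-cut idea that XY-Cut++ builds on. It is not the XY-Cut++ variant and not OpenDataLoader's implementation: the function names, the `min_gap` threshold, and the sample boxes are assumptions made for illustration. Boxes are `(left, bottom, right, top)` in PDF points, with the origin at the bottom-left of the page.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top)


def _widest_gap(boxes: List[Box], lo: int, hi: int,
                min_gap: float) -> Optional[Tuple[float, List[Box], List[Box]]]:
    """Find the widest empty strip along one axis (lo/hi index into a box).

    Returns (gap, boxes_before, boxes_after), or None if no strip is at
    least min_gap wide.
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    best = None
    reach = ordered[0][hi]  # farthest extent covered so far
    for i in range(1, len(ordered)):
        gap = ordered[i][lo] - reach
        if gap >= min_gap and (best is None or gap > best[0]):
            best = (gap, ordered[:i], ordered[i:])
        reach = max(reach, ordered[i][hi])
    return best


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively cutting at the widest whitespace strip."""
    if len(boxes) <= 1:
        return list(boxes)
    x_cut = _widest_gap(boxes, 0, 2, min_gap)  # vertical strip: columns
    y_cut = _widest_gap(boxes, 1, 3, min_gap)  # horizontal strip: bands
    if x_cut is None and y_cut is None:
        # No clean cut: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (-b[3], b[0]))
    if y_cut is None or (x_cut is not None and x_cut[0] >= y_cut[0]):
        _, left, right = x_cut
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)
    _, lower, upper = y_cut
    return xy_cut(upper, min_gap) + xy_cut(lower, min_gap)  # upper band first


# Hypothetical page: a full-width title above two columns of two blocks each.
names = {
    (72, 700, 540, 730): "title",
    (72, 600, 290, 680): "left-1",
    (72, 500, 290, 590): "left-2",
    (322, 600, 540, 680): "right-1",
    (322, 500, 540, 590): "right-2",
}
shuffled = [(322, 500, 540, 590), (72, 600, 290, 680), (72, 700, 540, 730),
            (322, 600, 540, 680), (72, 500, 290, 590)]
reading_order = [names[box] for box in xy_cut(shuffled)]
print(reading_order)
```

A naive top-to-bottom, left-to-right sort would interleave `left-1` and `right-1`; cutting at the column gap first keeps each column together.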




## Key Features

### For RAG & LLM Pipelines

- **Structured Output** — JSON with semantic types (heading, paragraph, table, list, caption)

- **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations

- **Reading Order** — XY-Cut++ algorithm handles multi-column layouts correctly

- **Noise Filtering** — Headers, footers, hidden text, watermarks auto-removed

- **LangChain Integration** — [Official document loader](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)

### Performance & Privacy

- **No GPU** — Fast, rule-based heuristics

- **Local-First** — Your documents never leave your machine

- **High Throughput** — Process thousands of PDFs efficiently

- **Multi-Language SDK** — Python, Node.js, Java, Docker

### Document Understanding

- **Tables** — Detects borders, handles merged cells

- **Lists** — Numbered, bulleted, nested

- **Headings** — Auto-detects hierarchy levels

- **Images** — Extracts with captions linked

- **Tagged PDF Support** — Uses native PDF structure when available

- **AI Safety** — Auto-filters prompt injection content




## Output Formats

| Format | Use Case |
|--------|----------|
| **JSON** | Structured data with bounding boxes, semantic types |
| **Markdown** | Clean text for LLM context, RAG chunks |
| **HTML** | Web display with styling |
| **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
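
As one possible way to turn the Markdown output into RAG chunks, the sketch below splits a Markdown string at ATX headings (`#` to `######`). This is a generic post-processing step, not part of OpenDataLoader; `chunk_by_heading` and the sample text are hypothetical. It does not treat fenced code blocks specially, so a `# comment` line inside a fence would be read as a heading.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into (heading, body) chunks, one per heading."""
    chunks: List[Tuple[str, str]] = []
    title, lines = "", []

    def flush() -> None:
        # Skip the empty chunk that precedes the first heading.
        if title or any(line.strip() for line in lines):
            chunks.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = "# Intro\n\nHello.\n\n## Details\n\nMore text.\n"
chunks = chunk_by_heading(sample)
for heading, body in chunks:
    print(f"{heading!r}: {body!r}")
```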




## JSON Output Example

```json
{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "text color": "[0.0]",
  "content": "Introduction"
}
```

| Field | Description |
|-------|-------------|
| `type` | Element type: heading, paragraph, table, list, image, caption |
| `id` | Unique identifier for cross-referencing |
| `page number` | 1-indexed page reference |
| `bounding box` | `[left, bottom, right, top]` in PDF points |
| `heading level` | Heading depth (1+) |
| `font`, `font size` | Typography info |
| `content` | Extracted text |

[Full JSON Schema →](https://opendataloader.org/docs/json-schema)
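
The fields above are enough to build a simple citation string. The following is a minimal sketch, assuming only the field names shown in the example (`type`, `page number`, `bounding box`, `content`). The surrounding `"elements"` wrapper and both helper functions are hypothetical; the walker deliberately avoids relying on any container key, so check the full schema for the real document layout.

```python
from typing import Any, Dict, Iterator


def iter_elements(node: Any) -> Iterator[Dict[str, Any]]:
    """Yield every dict that has a "type" key, at any nesting depth."""
    if isinstance(node, dict):
        if "type" in node:
            yield node
        for value in node.values():
            yield from iter_elements(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_elements(item)


def citation(element: Dict[str, Any]) -> str:
    """Format an element's text with its page and bounding box."""
    left, bottom, right, top = element["bounding box"]
    return (f'"{element["content"]}" (page {element["page number"]}, '
            f"box {left:g},{bottom:g} to {right:g},{top:g} pt)")


# Hypothetical wrapper around two elements copied from this README.
document = {
    "elements": [
        {
            "type": "heading",
            "id": 42,
            "page number": 1,
            "bounding box": [72.0, 700.0, 540.0, 730.0],
            "heading level": 1,
            "content": "Introduction",
        },
        {
            "type": "formula",
            "page number": 1,
            "bounding box": [226.2, 144.7, 377.1, 168.7],
            "content": "\\frac{f(x+h) - f(x)}{h}",
        },
    ]
}

headings = [e for e in iter_elements(document) if e["type"] == "heading"]
for heading in headings:
    print(citation(heading))
```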




## Quick Start

- [Python](https://opendataloader.org/docs/quick-start-python)

- [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)

- [Docker](https://opendataloader.org/docs/quick-start-docker)

- [Java](https://opendataloader.org/docs/quick-start-java)




## Advanced Options

```python
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="json,markdown,pdf",
    # Image output mode: "off", "embedded" (Base64), or "external" (default)
    image_output="embedded",
    # Image format: "png" or "jpeg"
    image_format="jpeg",
    # Tagged PDF: use the native PDF structure tree
    use_struct_tree=True,
)
```

[Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)




## AI Safety

PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:

- Hidden text (transparent, zero-size)

- Off-page content

- Suspicious invisible layers

This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
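
To illustrate the kind of geometric check involved, here is a rough sketch that flags zero-size and off-page bounding boxes. It is not OpenDataLoader's filter: the function name, the `min_size` threshold, and the page size are assumptions, and it does not cover transparency or invisible layers. Boxes follow the `[left, bottom, right, top]` convention in PDF points.

```python
from typing import Sequence


def is_suspicious(bbox: Sequence[float], page_width: float,
                  page_height: float, min_size: float = 0.5) -> bool:
    """Return True for boxes that are effectively invisible on the page."""
    left, bottom, right, top = bbox
    # Zero-size: narrower or shorter than min_size points.
    zero_size = (right - left) < min_size or (top - bottom) < min_size
    # Off-page: no overlap with the page rectangle at all.
    off_page = (right <= 0 or top <= 0
                or left >= page_width or bottom >= page_height)
    return zero_size or off_page


# Assumed US Letter page: 612 x 792 points.
PAGE_W, PAGE_H = 612.0, 792.0
print(is_suspicious([72.0, 700.0, 540.0, 730.0], PAGE_W, PAGE_H))   # visible heading
print(is_suspicious([700.0, 100.0, 900.0, 120.0], PAGE_W, PAGE_H))  # right of the page
print(is_suspicious([100.0, 100.0, 100.1, 100.1], PAGE_W, PAGE_H))  # near zero size
```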




## Tagged PDF Support

**Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.

**OpenDataLoader leverages this:**

- When a PDF has structure tags, we extract the **exact layout** the author intended

- Headings, lists, tables, reading order — all preserved from the source

- No guessing, no heuristics needed — **pixel-perfect semantic extraction**

```python
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure tags
)
```

Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.

[Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)




## Hybrid Mode

For documents with complex tables or OCR needs, enable hybrid mode to route challenging pages to an AI backend while keeping simple pages fast and local.

**Results**: Table accuracy rises from 0.49 to 0.93 (+90%), while processing time goes from 0.05 to 0.43 s/page (see Benchmarks).

```bash
pip install -U "opendataloader-pdf[hybrid]"
```

Terminal 1: Start the backend server

```bash
opendataloader-pdf-hybrid --port 5002
```

Terminal 2: Process PDFs with hybrid mode

```bash
opendataloader-pdf --hybrid docling-fast input.pdf
```

Or use in Python:

```python
import opendataloader_pdf

opendataloader_pdf.convert(
    input_path="complex_tables.pdf",
    output_dir="output/",
    hybrid="docling-fast"  # Routes complex pages to AI backend
)
```

- **Local-first**: Simple pages processed locally, complex pages routed to backend

- **Fallback**: If backend unavailable, gracefully falls back to local processing

- **Privacy**: Run the backend locally in Docker for 100% on-premise

### Formula Extraction (LaTeX)

For PDFs containing mathematical formulas, enable formula enrichment to extract LaTeX representations:

```bash
# Start backend with formula enrichment
opendataloader-pdf-hybrid --enrich-formula

# Process with full backend mode (required for formula extraction)
opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
```

Output in JSON:

```json
{
  "type": "formula",
  "page number": 1,
  "bounding box": [226.2, 144.7, 377.1, 168.7],
  "content": "\\frac{f(x+h) - f(x)}{h}"
}
```

Output in Markdown:

```markdown
$$
\frac{f(x+h) - f(x)}{h}
$$
```

Output in HTML (MathJax/KaTeX compatible):

```html
$\frac{f(x+h) - f(x)}{h}$
```

> **Note**: Formula extraction requires `--hybrid-mode full` to route all pages to the backend where the formula enrichment model runs.

### Picture / Chart Description (Alt Text)

Generate AI-powered descriptions for images and charts in your PDFs. Useful for accessibility (alt text) and making visual content searchable in RAG pipelines.

```bash
# Start backend with picture description
opendataloader-pdf-hybrid --enrich-picture-description

# Process with full backend mode (required for picture description)
opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf
```

Output in JSON:

```json
{
  "type": "picture",
  "page number": 1,
  "bounding box": [72.0, 400.0, 540.0, 650.0],
  "description": "A bar chart showing waste generation by region from 2016 to 2030..."
}
```

Output in Markdown:

```markdown
![image 1](document_images/imageFile1.png)

*A bar chart showing waste generation by region from 2016 to 2030...*
```

Output in HTML:

```html



A bar chart showing waste generation by region from 2016 to 2030...

```

You can also customize the prompt for better results with specific document types:

```bash
opendataloader-pdf-hybrid --enrich-picture-description \
  --picture-description-prompt "Describe this scientific figure in detail."
```

> **Note**: Picture description uses SmolVLM (256M), a lightweight vision model. Results are suitable for general context but may not capture precise data values from complex charts.

[Hybrid Mode Guide →](https://opendataloader.org/docs/hybrid-mode)




## LangChain Integration

OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.

```bash
pip install -U langchain-opendataloader-pdf
```

```python
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf"],
    format="text"
)
documents = loader.load()

# Use with any LangChain pipeline
for doc in documents:
    print(doc.page_content[:100])
```

- [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)

- [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)

- [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)




## Benchmarks

We continuously benchmark against real-world documents.

[View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)

### Quick Comparison

| Engine                      | Overall  | Reading Order | Table    | Heading  | Speed (s/page) |
|-----------------------------|----------|---------------|----------|----------|----------------|
| **opendataloader**          | 0.72     | 0.91          | 0.49     | 0.76     | **0.05**       |
| **opendataloader [hybrid]** | **0.90** | **0.94**      | **0.93** | **0.83** | 0.43           |
| docling                     | 0.86     | 0.90          | 0.89     | 0.80     | 0.73           |
| marker                      | 0.83     | 0.89          | 0.81     | 0.80     | 53.93          |
| mineru                      | 0.82     | 0.86          | 0.87     | 0.74     | 5.96           |
| pymupdf4llm                 | 0.57     | 0.89          | 0.40     | 0.41     | 0.09           |
| markitdown                  | 0.29     | 0.88          | 0.00     | 0.00     | **0.04**       |

> Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. **Bold** indicates best performance.
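
The arithmetic behind these numbers is easy to reproduce. The sketch below uses only values copied from the table above; the helper names are made up for the example. It converts seconds per page into pages per second for a single run and derives the relative table-accuracy gain and the speed cost of hybrid mode.

```python
# Values copied from the Quick Comparison table.
SECONDS_PER_PAGE = {
    "opendataloader": 0.05,
    "opendataloader [hybrid]": 0.43,
    "docling": 0.73,
    "marker": 53.93,
    "mineru": 5.96,
    "pymupdf4llm": 0.09,
    "markitdown": 0.04,
}
TABLE_SCORE = {"opendataloader": 0.49, "opendataloader [hybrid]": 0.93}


def pages_per_second(seconds_per_page: float) -> float:
    """Invert a per-page latency into a throughput figure."""
    return 1.0 / seconds_per_page


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before


table_gain = relative_gain(TABLE_SCORE["opendataloader"],
                           TABLE_SCORE["opendataloader [hybrid]"])
slowdown = (SECONDS_PER_PAGE["opendataloader [hybrid]"]
            / SECONDS_PER_PAGE["opendataloader"])

print(f"table accuracy gain with hybrid: {table_gain:+.0%}")
print(f"hybrid time per page vs. local:  {slowdown:.1f}x")
for engine, seconds in sorted(SECONDS_PER_PAGE.items(), key=lambda kv: kv[1]):
    print(f"{engine:<24} {pages_per_second(seconds):8.2f} pages/s")
```

The gain works out to (0.93 - 0.49) / 0.49, roughly 0.90, which matches the +90% quoted in the Hybrid Mode section.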

### Visual Comparison

[![Benchmark](https://github.com/opendataloader-project/opendataloader-bench/raw/refs/heads/main/charts/benchmark.png)](https://github.com/opendataloader-project/opendataloader-bench)




## Roadmap

See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)




## Documentation

- [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)

- [JSON Schema Reference](https://opendataloader.org/docs/json-schema)

- [CLI Options](https://opendataloader.org/docs/cli-options-reference)

- [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)

- [AI Safety Features](https://opendataloader.org/docs/ai-safety)




## Frequently Asked Questions

### What is the best PDF parser for RAG?

For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.

### How do I extract tables from PDF for LLM?

OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.

### Can I use this without sending data to the cloud?

Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.

### What makes OpenDataLoader unique?

OpenDataLoader takes a different approach from many PDF parsers:

- **Rule-based extraction** — Deterministic output without GPU requirements

- **Bounding boxes for all elements** — Essential for citation systems

- **XY-Cut++ reading order** — Handles multi-column layouts correctly

- **Built-in AI safety filters** — Protects against prompt injection

- **Native Tagged PDF support** — Leverages accessibility metadata

This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.

### How do I get better accuracy for complex tables?

Enable hybrid mode with `pip install -U "opendataloader-pdf[hybrid]"`. This routes pages with complex tables to an AI backend (like docling-serve) while keeping simple pages fast and local. Table accuracy improves from 0.49 to 0.93 — matching or exceeding dedicated AI parsers while remaining faster and more cost-effective.




## Contributing

We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.




## License

[Mozilla Public License 2.0](LICENSE)

---

**Found this useful?** Give us a star to help others discover OpenDataLoader.