https://github.com/opendataloader-project/opendataloader-pdf
PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU
https://github.com/opendataloader-project/opendataloader-pdf
ai dataloader document-parser document-parsing documents html json markdown ocr-recognition pdf pdf-converter pdf-parser pdf-to-html pdf-to-json pdf-to-markdown rag recognition sdk tables
Last synced: 6 days ago
JSON representation
PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU
- Host: GitHub
- URL: https://github.com/opendataloader-project/opendataloader-pdf
- Owner: opendataloader-project
- License: mpl-2.0
- Created: 2025-05-13T05:48:02.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2026-01-12T15:12:40.000Z (14 days ago)
- Last Synced: 2026-01-12T18:03:16.833Z (14 days ago)
- Topics: ai, dataloader, document-parser, document-parsing, documents, html, json, markdown, ocr-recognition, pdf, pdf-converter, pdf-parser, pdf-to-html, pdf-to-json, pdf-to-markdown, rag, recognition, sdk, tables
- Language: Java
- Homepage: https://opendataloader.org
- Size: 78.9 MB
- Stars: 821
- Watchers: 4
- Forks: 43
- Open Issues: 5
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Codeowners: .github/CODEOWNERS
- Security: .github/SECURITY.md
- Support: SUPPORT.md
- Notice: NOTICE.md
Awesome Lists containing this project
README
# OpenDataLoader PDF
**PDF Parsing for RAG** — Convert to Markdown & JSON, Fast, Local, No GPU
[](https://github.com/opendataloader-project/opendataloader-pdf/blob/main/LICENSE)
[](https://pypi.org/project/opendataloader-pdf/)
[](https://www.npmjs.com/package/@opendataloader/pdf)
[](https://search.maven.org/artifact/org.opendataloader/opendataloader-pdf-core)
[](https://github.com/opendataloader-project/opendataloader-pdf/pkgs/container/opendataloader-pdf-cli)
[](https://github.com/opendataloader-project/opendataloader-pdf#java)
Convert PDFs into **LLM-ready Markdown and JSON** with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.
**Why developers choose OpenDataLoader:**
- **Deterministic** — Same input always produces same output (no LLM hallucinations)
- **Fast** — Process 100+ pages per second on CPU
- **Private** — 100% local, zero data transmission
- **Accurate** — Bounding boxes for every element, correct multi-column reading order
```bash
pip install -U opendataloader-pdf
```
```python
import opendataloader_pdf
# PDF to Markdown for RAG
opendataloader_pdf.convert(
input_path="document.pdf",
output_dir="output/",
format="markdown,json"
)
```
## Why OpenDataLoader?
Building RAG pipelines? You've probably hit these problems:
| Problem | How We Solve It |
|---------|-----------------|
| **Multi-column text reads left-to-right incorrectly** | XY-Cut++ algorithm preserves correct reading order |
| **Tables lose structure** | Border + cluster detection keeps rows/columns intact |
| **Headers/footers pollute context** | Auto-filtered before output |
| **No coordinates for citations** | Bounding box for every element |
| **Cloud APIs = privacy concerns** | 100% local, no data leaves your machine |
| **GPU required** | Pure CPU, rule-based — runs anywhere |
## Key Features
### For RAG & LLM Pipelines
- **Structured Output** — JSON with semantic types (heading, paragraph, table, list, caption)
- **Bounding Boxes** — Every element includes `[x1, y1, x2, y2]` coordinates for citations
- **Reading Order** — XY-Cut++ algorithm handles multi-column layouts correctly
- **Noise Filtering** — Headers, footers, hidden text, watermarks auto-removed
- **LangChain Integration** — [Official document loader](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
### Performance & Privacy
- **No GPU** — Fast, rule-based heuristics
- **Local-First** — Your documents never leave your machine
- **High Throughput** — Process thousands of PDFs efficiently
- **Multi-Language SDK** — Python, Node.js, Java, Docker
### Document Understanding
- **Tables** — Detects borders, handles merged cells
- **Lists** — Numbered, bulleted, nested
- **Headings** — Auto-detects hierarchy levels
- **Images** — Extracts with captions linked
- **Tagged PDF Support** — Uses native PDF structure when available
- **AI Safety** — Auto-filters prompt injection content
## Output Formats
| Format | Use Case |
|--------|----------|
| **JSON** | Structured data with bounding boxes, semantic types |
| **Markdown** | Clean text for LLM context, RAG chunks |
| **HTML** | Web display with styling |
| **Annotated PDF** | Visual debugging — see detected structures ([sample](https://opendataloader.org/demo/samples/01030000000000?view1=annot&view2=json)) |
## JSON Output Example
```json
{
"type": "heading",
"id": 42,
"level": "Title",
"page number": 1,
"bounding box": [72.0, 700.0, 540.0, 730.0],
"heading level": 1,
"font": "Helvetica-Bold",
"font size": 24.0,
"text color": "[0.0]",
"content": "Introduction"
}
```
| Field | Description |
|-------|-------------|
| `type` | Element type: heading, paragraph, table, list, image, caption |
| `id` | Unique identifier for cross-referencing |
| `page number` | 1-indexed page reference |
| `bounding box` | `[left, bottom, right, top]` in PDF points |
| `heading level` | Heading depth (1+) |
| `font`, `font size` | Typography info |
| `content` | Extracted text |
[Full JSON Schema →](https://opendataloader.org/docs/json-schema)
## Quick Start
- [Python](https://opendataloader.org/docs/quick-start-python)
- [Node.js / TypeScript](https://opendataloader.org/docs/quick-start-nodejs)
- [Docker](https://opendataloader.org/docs/quick-start-docker)
- [Java](https://opendataloader.org/docs/quick-start-java)
## Advanced Options
```python
opendataloader_pdf.convert(
input_path="document.pdf",
output_dir="output/",
format="json,markdown,pdf",
# Image output mode: "off", "embedded" (Base64), or "external" (default)
image_output="embedded",
# Image format: "png" or "jpeg"
image_format="jpeg",
# Tagged PDF
use_struct_tree=True, # Use native PDF structure
)
```
[Full CLI Options Reference →](https://opendataloader.org/docs/cli-options-reference)
## AI Safety
PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:
- Hidden text (transparent, zero-size)
- Off-page content
- Suspicious invisible layers
This is **enabled by default**. [Learn more →](https://opendataloader.org/docs/ai-safety)
## Tagged PDF Support
**Why it matters:** The [European Accessibility Act (EAA)](https://commission.europa.eu/strategy-and-policy/policies/justice-and-fundamental-rights/disability/union-equality-strategy-rights-persons-disabilities-2021-2030/european-accessibility-act_en) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.
**OpenDataLoader leverages this:**
- When a PDF has structure tags, we extract the **exact layout** the author intended
- Headings, lists, tables, reading order — all preserved from the source
- No guessing, no heuristics needed — **pixel-perfect semantic extraction**
```python
opendataloader_pdf.convert(
input_path="accessible_document.pdf",
use_struct_tree=True # Use native PDF structure tags
)
```
Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.
[Learn more about Tagged PDF →](https://opendataloader.org/docs/tagged-pdf)
## Hybrid Mode
For documents with complex tables or OCR needs, enable hybrid mode to route challenging pages to an AI backend while keeping simple pages fast and local.
**Results**: Table accuracy jumps from 0.49 → 0.93 (+90%) with acceptable speed trade-off.
```bash
pip install -U "opendataloader-pdf[hybrid]"
```
Terminal 1: Start the backend server
```bash
opendataloader-pdf-hybrid --port 5002
```
Terminal 2: Process PDFs with hybrid mode
```bash
opendataloader-pdf --hybrid docling-fast input.pdf
```
Or use in Python:
```python
opendataloader_pdf.convert(
input_path="complex_tables.pdf",
output_dir="output/",
hybrid="docling-fast" # Routes complex pages to AI backend
)
```
- **Local-first**: Simple pages processed locally, complex pages routed to backend
- **Fallback**: If backend unavailable, gracefully falls back to local processing
- **Privacy**: Run the backend locally in Docker for 100% on-premise
[Hybrid Mode Guide →](https://opendataloader.org/docs/hybrid-mode)
## LangChain Integration
OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.
```bash
pip install -U langchain-opendataloader-pdf
```
```python
from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader
loader = OpenDataLoaderPDFLoader(
file_path=["document.pdf"],
format="text"
)
documents = loader.load()
# Use with any LangChain pipeline
for doc in documents:
print(doc.page_content[:100])
```
- [LangChain Documentation](https://python.langchain.com/docs/integrations/document_loaders/opendataloader_pdf/)
- [GitHub Repository](https://github.com/opendataloader-project/langchain-opendataloader-pdf)
- [PyPI Package](https://pypi.org/project/langchain-opendataloader-pdf/)
## Benchmarks
We continuously benchmark against real-world documents.
[View full benchmark results →](https://github.com/opendataloader-project/opendataloader-bench)
### Quick Comparison
| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) |
|-----------------------------|----------|---------------|----------|----------|----------------|
| **opendataloader** | 0.68 | 0.91 | 0.49 | 0.65 | **0.05** |
| **opendataloader [hybrid]** | **0.88** | **0.93** | **0.93** | 0.78 | 0.48 |
| docling | 0.86 | 0.90 | 0.89 | **0.80** | 0.73 |
| marker | 0.83 | 0.89 | 0.81 | **0.80** | 53.93 |
| mineru | 0.82 | 0.86 | 0.87 | 0.74 | 5.96 |
| pymupdf4llm | 0.57 | 0.89 | 0.40 | 0.41 | 0.09 |
| markitdown | 0.29 | 0.88 | 0.00 | 0.00 | **0.04** |
> Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. **Bold** indicates best performance.
### Visual Comparison
[](https://github.com/opendataloader-project/opendataloader-bench)
## Roadmap
See our [upcoming features and priorities →](https://opendataloader.org/docs/upcoming-roadmap)
## Documentation
- [Quick Start Guide](https://opendataloader.org/docs/quick-start-python)
- [JSON Schema Reference](https://opendataloader.org/docs/json-schema)
- [CLI Options](https://opendataloader.org/docs/cli-options-reference)
- [Tagged PDF Support](https://opendataloader.org/docs/tagged-pdf)
- [AI Safety Features](https://opendataloader.org/docs/ai-safety)
## Frequently Asked Questions
### What is the best PDF parser for RAG?
For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
### How do I extract tables from PDF for LLM?
OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.
### Can I use this without sending data to the cloud?
Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.
### What makes OpenDataLoader unique?
OpenDataLoader takes a different approach from many PDF parsers:
- **Rule-based extraction** — Deterministic output without GPU requirements
- **Bounding boxes for all elements** — Essential for citation systems
- **XY-Cut++ reading order** — Handles multi-column layouts correctly
- **Built-in AI safety filters** — Protects against prompt injection
- **Native Tagged PDF support** — Leverages accessibility metadata
This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.
### How do I get better accuracy for complex tables?
Enable hybrid mode with `pip install -U "opendataloader-pdf[hybrid]"`. This routes pages with complex tables to an AI backend (like docling-serve) while keeping simple pages fast and local. Table accuracy improves from 0.49 to 0.93 — matching or exceeding dedicated AI parsers while remaining faster and more cost-effective.
## Contributing
We welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.
## License
[Mozilla Public License 2.0](LICENSE)
---
**Found this useful?** Give us a star to help others discover OpenDataLoader.