https://github.com/raphaelmansuy/edgeparse

EdgeParse converts any digital PDF into Markdown, JSON (with bounding boxes), HTML, or plain text — deterministically, without a JVM, without a GPU, and with best-in-class accuracy on the 200-document benchmark suite included in this repository.
Host: GitHub
URL: https://github.com/raphaelmansuy/edgeparse
Owner: raphaelmansuy
License: other
Created: 2026-03-22T08:02:45.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-29T01:33:07.000Z (4 months ago)
Last Synced: 2026-04-03T00:41:44.056Z (4 months ago)
Topics: markdown, markitdown, pdf, pdf-converter
Language: Rust
Homepage: https://edgeparse.com
Size: 62.9 MB
Stars: 73
Watchers: 1
Forks: 10
Open Issues: 9
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# EdgeParse

**Fastest PDF extraction engine. Rust-native. Zero GPU, zero JVM, zero OCR models.**

[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)

[![Rust](https://img.shields.io/badge/Rust-1.85%2B-orange.svg)](https://www.rust-lang.org/)

[![crates.io](https://img.shields.io/crates/v/edgeparse-cli.svg)](https://crates.io/crates/edgeparse-cli)

[![PyPI](https://img.shields.io/pypi/v/edgeparse.svg)](https://pypi.org/project/edgeparse/)

[![npm](https://img.shields.io/npm/v/edgeparse.svg)](https://www.npmjs.com/package/edgeparse)

Extract Markdown, JSON (with bounding boxes), and HTML from any born-digital PDF, deterministically, in milliseconds, on CPU.

- **How accurate is it?** — **0.787 overall** on the latest `opendataloader.org` PDF-to-Markdown benchmark, with the best score in every reported metric: reading order, tables, headings, paragraphs, text quality, table detection, and speed. [Benchmark details](#benchmark)

- **How fast?** — **0.064 s/doc** on the 200-document benchmark corpus on Apple M4 Max. Faster than OpenDataLoader, Docling, PyMuPDF4LLM, MarkItDown, and LiteParse.

- **Does it need a GPU or Java?** — No. No JVM, no GPU, no OCR models, no Python runtime for the CLI. Single ~15 MB binary.

- **RAG / LLM pipelines?** — Yes. Outputs structured Markdown for chunking, JSON with bounding boxes for citations, preserves reading order across multi-column layouts. [See integration examples](#rag--llm-integration)

Available as a **Rust library**, **CLI binary**, **Python package**, **Node.js package**, and **WebAssembly module** for in-browser PDF parsing.

---

## Table of Contents

- [Get Started in 30 Seconds](#get-started-in-30-seconds)

- [What Problems Does This Solve?](#what-problems-does-this-solve)

- [Release Channels](#release-channels)

- [Benchmark](#benchmark)

- [Capability Matrix](#capability-matrix)

- [Installation](#installation)

- [Python SDK](#python-sdk)

- [Node.js SDK](#nodejs-sdk)

- [WebAssembly SDK](#webassembly-sdk)

- [CLI Reference](#cli-reference)

- [Output Formats](#output-formats)

- [RAG / LLM Integration](#rag--llm-integration)

- [Agent Skill](#agent-skill)

- [Architecture](#architecture)

- [FAQ](#faq)

- [Tutorials](#tutorials)

- [Documentation](#documentation)

- [Project Layout](#project-layout)

- [Contributing](#contributing)

- [License](#license)

---

## Get Started in 30 Seconds

**Python** (Python 3.9+):

```bash

pip install edgeparse

```

```python

import edgeparse

# Markdown — ready for LLM context or RAG chunking

md = edgeparse.convert("report.pdf", format="markdown")

# JSON with bounding boxes — for citations and element-level control

data = edgeparse.convert("report.pdf", format="json")

```

**CLI** (Rust 1.85+):

```bash

cargo install edgeparse-cli

edgeparse report.pdf --format markdown --output-dir output/

```

**Node.js** (Node 18+):

```bash

npm install edgeparse

```

```js

import { convert } from 'edgeparse';

const md = convert('report.pdf', { format: 'markdown' });

```

---

## Release Channels

Tagged releases publish every supported distribution target through GitHub Actions:

| Channel | Artifact | Install / Pull |
|---------|----------|----------------|
| Rust crates | `pdf-cos`, `edgeparse-core`, `edgeparse-cli` | `cargo install edgeparse-cli` |
| Python SDK | `edgeparse` wheels + sdist | `pip install edgeparse` |
| Node.js SDK | `edgeparse` + 5 platform addons | `npm install edgeparse` |
| WebAssembly SDK | `edgeparse-wasm` | `npm install edgeparse-wasm` |
| CLI binaries | GitHub Release archives for macOS, Linux, Windows | [GitHub Releases](https://github.com/raphaelmansuy/edgeparse/releases) |
| Homebrew | `raphaelmansuy/edgeparse` tap | `brew tap raphaelmansuy/edgeparse && brew install edgeparse` |
| Containers | GHCR + Docker Hub multi-arch images | `docker pull ghcr.io/raphaelmansuy/edgeparse:0.2.1` |

Release automation and registry details: [docs/07-cicd-publishing.md](docs/07-cicd-publishing.md)

---

## What Problems Does This Solve?

| Problem | EdgeParse Solution | Status |
|---------|--------------------|--------|
| PDF text loses reading order in multi-column layouts | XY-Cut++ algorithm preserves correct reading sequence across columns, sidebars, and mixed layouts | ✅ Shipped |
| Table extraction is broken (merged cells, borderless tables) | Ruling-line table detection + borderless cluster method; `--table-method cluster` for complex cases | ✅ Shipped |
| OCR/ML tools add 500 MB+ of dependencies to a simple PDF pipeline | Zero GPU, zero OCR models, zero JVM: a single ~15 MB binary, pure Rust | ✅ Shipped |
| Heading hierarchy is lost (all text looks the same) | Font-metric + geometry-based heading classifier; MHS score 0.553 on the current benchmark | ✅ Shipped |
| PDFs can carry hidden prompt injection payloads | AI safety filters: hidden text, off-page content, tiny-text, invisible OCG layers detected and stripped | ✅ Shipped |
| Need bounding boxes to cite sources in RAG answers | Every element (`paragraph`, `heading`, `table`, `image`) has `[left, bottom, right, top]` coordinates in PDF points | ✅ Shipped |
| In-browser PDF parsing uploads data to a server | WebAssembly build: the full Rust engine runs in the browser, so PDF data never leaves the device | ✅ Shipped |

---

## Benchmark

Evaluated on **200 real-world PDFs** — academic papers, financial reports, multi-column layouts, complex tables, mixed-language documents — running on Apple M4 Max.

### Current comparison set

| Engine | NID ↑ | TEDS ↑ | MHS ↑ | PBF ↑ | TQS ↑ | TD F1 ↑ | Speed ↓ | Overall ↑ |
|--------|------:|-------:|------:|------:|------:|--------:|--------:|----------:|
| **EdgeParse** | **0.889** | **0.596** | **0.553** | **0.559** | **0.920** | **0.901** | **0.064 s/doc** | **0.787** |
| OpenDataLoader | 0.873 | 0.326 | 0.442 | 0.544 | 0.916 | 0.636 | 0.094 s/doc | 0.733 |
| Docling | 0.867 | 0.540 | 0.438 | 0.530 | 0.908 | 0.891 | 0.768 s/doc | 0.745 |
| PyMuPDF4LLM | 0.852 | 0.323 | 0.407 | 0.538 | 0.888 | 0.744 | 0.439 s/doc | 0.710 |
| EdgeParse (pre-frontier baseline) | 0.859 | 0.493 | 0.500 | 0.482 | 0.891 | 0.849 | 0.232 s/doc | 0.751 |
| MarkItDown | 0.808 | 0.193 | 0.001 | 0.362 | 0.861 | 0.558 | 0.149 s/doc | 0.564 |
| LiteParse | 0.815 | 0.000 | 0.001 | 0.383 | 0.887 | N/A | 0.196 s/doc | 0.564 |

EdgeParse now leads the entire comparison set on every reported benchmark metric, including speed. Relative to the previous EdgeParse baseline, the current pipeline increases reading-order accuracy, table structure similarity, paragraph boundaries, text quality, table-detection F1, and overall score while cutting latency from `0.232` to `0.064 s/doc`.
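As a quick sanity check on the latency column, the ratios can be recomputed from the table. The sketch below only divides the published seconds-per-document figures; it measures nothing itself, and the function name is illustrative.

```python
# Seconds per document, copied from the comparison table above.
SECONDS_PER_DOC = {
    "EdgeParse": 0.064,
    "OpenDataLoader": 0.094,
    "Docling": 0.768,
    "PyMuPDF4LLM": 0.439,
    "EdgeParse (pre-frontier baseline)": 0.232,
    "MarkItDown": 0.149,
    "LiteParse": 0.196,
}


def slowdown_vs(engine: str, reference: str = "EdgeParse") -> float:
    """Return how many times longer `engine` takes per document than `reference`."""
    return SECONDS_PER_DOC[engine] / SECONDS_PER_DOC[reference]


for name in SECONDS_PER_DOC:
    if name != "EdgeParse":
        print(f"{name}: {slowdown_vs(name):.1f}x the time per document")
```

On these figures the previous baseline takes about 3.6 times as long per document as the current pipeline, and Docling about 12 times as long.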

**When to choose what:**

| Use case | Recommendation |
|----------|---------------|
| Born-digital PDFs, latency matters, production deployment | **EdgeParse**: best accuracy/speed, zero dependencies |
| Complex scanned tables, GPU available, batch offline | Consider Docling or MinerU |
| Scanned documents requiring full OCR | Dedicated OCR pipeline |

### Metrics

| Metric | What it measures |
|--------|-----------------|
| **NID** | Reading order accuracy (normalized index distance) |
| **TEDS** | Table structure accuracy (tree-edit distance vs. ground truth) |
| **MHS** | Heading hierarchy accuracy |
| **PBF** | Paragraph boundary F1 |
| **TQS** | Text quality score |
| **TD F1** | Table detection F1 |
| **Overall** | Normalized aggregate benchmark score |
| **Speed** | Wall-clock seconds per document (full pipeline, 200 docs, parallel) |
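Two of these metrics (PBF and TD F1) are F1 scores, that is, the harmonic mean of precision and recall. The sketch below shows only that generic formula. How the benchmark decides what counts as a match is not described here, so the counts in the example are made up.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when either is undefined or zero."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0 or true_positives == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 9 tables found correctly, 1 spurious, 1 missed.
print(round(f1_score(9, 1, 1), 3))
```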

### Running the benchmark

```bash

cargo build --release

cd benchmark

uv sync

uv run python run.py          # EdgeParse on all 200 docs

uv run python compare_all.py  # Compare against 9 engines

```

Results → `benchmark/prediction/edgeparse/` · HTML reports → `benchmark/reports/`

### Regression thresholds

`benchmark/thresholds.json` defines the limits enforced in CI: minimum acceptable scores for the accuracy metrics and a maximum for `elapsed_per_doc`:

```json
{
  "nid": 0.85,
  "teds": 0.40,
  "mhs": 0.55,
  "table_detection_f1": 0.55,
  "elapsed_per_doc": 2.0
}
```
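A regression gate over these thresholds could look like the sketch below. It is not the repository's CI script: the function name is hypothetical, and it assumes every key is a floor except `elapsed_per_doc`, which is treated as a ceiling. The scores are the EdgeParse row from the comparison table.

```python
import json

# Thresholds copied from the JSON shown above.
THRESHOLDS = json.loads("""
{"nid": 0.85, "teds": 0.40, "mhs": 0.55,
 "table_detection_f1": 0.55, "elapsed_per_doc": 2.0}
""")

# Assumption: latency is the only metric where a lower value is better.
LOWER_IS_BETTER = {"elapsed_per_doc"}


def find_regressions(scores: dict, thresholds: dict) -> list:
    """Return one message per metric that violates its threshold."""
    failures = []
    for metric, limit in thresholds.items():
        value = scores[metric]
        passed = value <= limit if metric in LOWER_IS_BETTER else value >= limit
        if not passed:
            failures.append(f"{metric}: {value} vs limit {limit}")
    return failures


current = {"nid": 0.889, "teds": 0.596, "mhs": 0.553,
           "table_detection_f1": 0.901, "elapsed_per_doc": 0.064}
print(find_regressions(current, THRESHOLDS))  # [] when nothing regressed
```

With the published scores the list is empty, although the heading score (0.553) sits only just above its 0.55 floor.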

---

## Capability Matrix

| Capability | Available | Notes |
|-----------|-----------|-------|
| **Extraction** | | |
| Text with correct reading order | ✅ | XY-Cut++ across columns and sidebars |
| Bounding boxes for every element | ✅ | `[left, bottom, right, top]` in PDF points |
| Table extraction: ruling-line borders | ✅ | Default mode |
| Table extraction: borderless/cluster | ✅ | `--table-method cluster` |
| Heading hierarchy detection | ✅ | Numbered + unnumbered, all levels |
| List detection (numbered, bulleted, nested) | ✅ | |
| Image extraction with coordinates | ✅ | PNG or JPEG, embedded or external |
| Header / footer / watermark filtering | ✅ | |
| Multi-column layout support | ✅ | |
| CMap / ToUnicode font decoding | ✅ | Handles non-standard encodings |
| Tagged PDF structure tree | ✅ | `--use-struct-tree` preserves author intent |
| **Safety** | | |
| Hidden text filtering | ✅ | Prompt injection protection |
| Off-page content filtering | ✅ | |
| Tiny-text / invisible OCG layer filtering | ✅ | |
| PII sanitization | ✅ | `--sanitize` flag |
| **Output** | | |
| Markdown (GFM tables) | ✅ | |
| JSON with bounding boxes | ✅ | |
| HTML5 | ✅ | |
| Plain text (reading order preserved) | ✅ | |
| **SDKs** | | |
| Python (PyO3 native extension) | ✅ | Python 3.9+, pre-built wheels |
| Node.js (NAPI-RS native addon) | ✅ | Node 18+, pre-built addons |
| WebAssembly | ✅ | In-browser, no server required |
| Rust library | ✅ | `edgeparse-core` crate |
| CLI binary | ✅ | `edgeparse-cli` on crates.io |
| **Runtime** | | |
| GPU required | ❌ No | CPU only |
| JVM required | ❌ No | Pure Rust |
| OCR models required | ❌ No | Born-digital PDFs only |
| Parallel processing | ✅ | Rayon per-page parallelism |
| Deterministic / reproducible | ✅ | Same input → same output, always |

---

## Installation

### Python

```bash

pip install edgeparse

```

Python 3.9+. Pre-built wheels for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).

### CLI

```bash

cargo install edgeparse-cli

```

Requires [Rust 1.85+](https://rustup.rs/).

### Rust library

```toml

[dependencies]

edgeparse-core = "0.2.1"

```

Docs: [docs.rs/edgeparse-core](https://docs.rs/edgeparse-core) · [docs.rs/edgeparse-cli](https://docs.rs/edgeparse-cli)

### Node.js

```bash

npm install edgeparse

```

Node 18+. Pre-built native addons for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).

### Build from source

```bash

git clone https://github.com/raphaelmansuy/edgeparse.git

cd edgeparse

cargo build --release

# Binary: target/release/edgeparse

```

### System requirements

- macOS 12+, Linux (glibc 2.17+), or Windows 10+

- ~15 MB binary (stripped release build)

- No Java, no Python (for the CLI), no GPU

---

## Python SDK

**Package:** `edgeparse` · **Requires:** Python 3.9+ · **Source:** [`sdks/python/`](sdks/python/)

### `edgeparse.convert()`

```python
def convert(
    input_path: str | Path,
    *,
    format: str = "markdown",       # "markdown", "json", "html", "text"
    pages: str | None = None,       # e.g. "1,3,5-7"
    password: str | None = None,
    reading_order: str = "xycut",   # "xycut" or "off"
    table_method: str = "default",  # "default" or "cluster"
    image_output: str = "off",      # "off", "embedded", "external"
) -> str: ...
```
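The `pages` string mixes single pages and ranges. To make the syntax concrete, here is a small stand-alone expander. It is a hypothetical helper, not EdgeParse's own parser, and it assumes pages are 1-indexed and ranges are inclusive.

```python
def expand_pages(spec: str) -> list:
    """Expand a page spec such as "1,3,5-7" into a sorted list of page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start < 1 or end < start:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(start, end + 1))  # inclusive on both ends
        else:
            page = int(part)
            if page < 1:
                raise ValueError(f"invalid page number: {part!r}")
            pages.add(page)
    return sorted(pages)


print(expand_pages("1,3,5-7"))  # [1, 3, 5, 6, 7]
```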

### `edgeparse.convert_file()`

```python
def convert_file(
    input_path: str | Path,
    output_dir: str | Path = "output",
    *,
    format: str = "markdown",
    pages: str | None = None,
    password: str | None = None,
) -> str: ...  # returns output file path
```

### Examples

```python
import edgeparse

# Basic Markdown extraction
md = edgeparse.convert("report.pdf", format="markdown")

# JSON with bounding boxes
json_str = edgeparse.convert("report.pdf", format="json")

# Password-protected, specific pages, borderless tables
md = edgeparse.convert(
    "secure.pdf",
    format="markdown",
    pages="1,3,5-7",
    password="secret",
    reading_order="xycut",
    table_method="cluster",
)

# Write output file
out_path = edgeparse.convert_file("report.pdf", output_dir="output/", format="markdown")
```

### CLI entry point (Python package)

```bash

edgeparse report.pdf -f markdown -o output/

edgeparse *.pdf --format json --output-dir out/ --pages "1-3"

```

---

## Node.js SDK

**Package:** `edgeparse` · **Requires:** Node.js 18+ · **Source:** [`sdks/node/`](sdks/node/)

### `convert()`

```ts
import { convert } from 'edgeparse';

// Markdown
const md = convert('report.pdf', { format: 'markdown' });

// JSON with bounding boxes
const json = convert('report.pdf', { format: 'json' });

// With options
const result = convert('report.pdf', {
  format: 'markdown',
  pages: '1-5',
  readingOrder: 'xycut',
  tableMethod: 'cluster',
});
```

### `ConvertOptions`

```ts
interface ConvertOptions {
  format?: 'markdown' | 'json' | 'html' | 'text';  // default: "markdown"
  pages?: string;                                   // e.g. "1,3,5-7"
  password?: string;
  readingOrder?: 'xycut' | 'off';                   // default: "xycut"
  tableMethod?: 'default' | 'cluster';              // default: "default"
  imageOutput?: 'off' | 'embedded' | 'external';    // default: "off"
}
```

### CLI (npm)

```bash

npx edgeparse report.pdf -f markdown -o output.md

npx edgeparse report.pdf --format json --pages "1-5"

```

---

## WebAssembly SDK

EdgeParse compiles to WebAssembly — **client-side PDF extraction in any modern browser with no server, no uploads, and no backend infrastructure.**

- Same Rust engine, identical output to CLI/Python/Node

- PDF data never leaves the user's device (privacy by design)

- Works offline after initial WASM load (~4 MB cached)

- Zero infrastructure cost — static hosting only

### Quick start

```typescript
import init, { convert_to_string } from 'edgeparse-wasm';

await init();  // load WASM binary once

const bytes = new Uint8Array(await file.arrayBuffer());

const markdown = convert_to_string(bytes, 'markdown');
const json     = convert_to_string(bytes, 'json');
const html     = convert_to_string(bytes, 'html');
```

### API

| Function | Returns | Description |
|----------|---------|-------------|
| `convert(bytes, format?, pages?, readingOrder?, tableMethod?)` | JS object | Structured `PdfDocument` with pages, elements, bounding boxes |
| `convert_to_string(bytes, format?, pages?, readingOrder?, tableMethod?)` | `string` | Formatted output (Markdown, JSON, HTML, or text) |
| `version()` | `string` | EdgeParse version |

### Live demo

**[edgeparse.com/demo/](https://edgeparse.com/demo/)** — drag-and-drop any PDF, all processing runs locally in your browser.

### Build from source

```bash

curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh

cd crates/edgeparse-wasm

wasm-pack build --target web --release

# Output: crates/edgeparse-wasm/pkg/

```

Full documentation: [docs/09-wasm-sdk.md](docs/09-wasm-sdk.md)

---

## CLI Reference

```

edgeparse [OPTIONS] ...

```

### Core options

| Flag | Default | Description |
|------|---------|-------------|
| `-o, --output-dir ` | (none) | Write output files to this directory |
| `-f, --format ` | `json` | Output format(s), comma-separated |
| `-p, --password ` | (none) | Password for encrypted PDFs |
| `--pages ` | (none) | Page range e.g. `"1,3,5-7"` |
| `-q, --quiet` | false | Suppress log output |

### Output format values

| Value | Description |
|-------|-------------|
| `json` | Structured JSON with bounding boxes and element types (default) |
| `markdown` | Standard Markdown with GFM tables |
| `markdown-with-html` | Markdown with HTML table fallback for complex tables |
| `markdown-with-images` | Markdown with embedded or linked images |
| `html` | Full HTML5 document with semantic elements |
| `text` | Plain UTF-8 text, reading order preserved |

Multiple formats: `--format markdown,json`

### Layout & extraction options

| Flag | Default | Description |
|------|---------|-------------|
| `--reading-order ` | `xycut` | Reading order: `xycut` or `off` |
| `--table-method ` | `default` | Table detection: `default` (ruling lines) or `cluster` (borderless) |
| `--keep-line-breaks` | false | Preserve original line breaks within paragraphs |
| `--use-struct-tree` | false | Use tagged PDF structure tree when available |
| `--include-header-footer` | false | Include headers and footers in output |
| `--sanitize` | false | Enable PII sanitization |
| `--replace-invalid-chars ` | `" "` | Replacement for invalid Unicode characters |

### Image options

| Flag | Default | Description |
|------|---------|-------------|
| `--image-output ` | `external` | `off`, `embedded` (base64), or `external` (files) |
| `--image-format ` | `png` | `png` or `jpeg` |
| `--image-dir ` | (none) | Directory for extracted image files |

### Separator options

| Flag | Description |
|------|-------------|
| `--markdown-page-separator ` | String inserted between Markdown pages |
| `--text-page-separator ` | String inserted between plain-text pages |
| `--html-page-separator ` | String inserted between HTML pages |

### Content safety options

| Flag | Default | Description |
|------|---------|-------------|
| `--content-safety-off ` | (none) | Disable safety filters: `all`, `hidden-text`, `off-page`, `tiny`, `hidden-ocg` |
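To give a feel for what two of these filters (`off-page` and `tiny`) guard against, the sketch below applies simple geometric checks to elements shaped like EdgeParse's JSON output (`bounding box`, `font size`). It is an illustration only: the page rectangle, the 2 pt cutoff, and the function names are hypothetical, and EdgeParse's actual rules are not documented in this README.

```python
# Bounding boxes are (left, bottom, right, top) in PDF points.
PAGE = (0.0, 0.0, 612.0, 792.0)  # hypothetical US Letter page


def is_off_page(bbox, page) -> bool:
    """True when the element lies entirely outside the page rectangle."""
    left, bottom, right, top = bbox
    page_left, page_bottom, page_right, page_top = page
    return (right <= page_left or left >= page_right
            or top <= page_bottom or bottom >= page_top)


def is_tiny(font_size: float, min_size: float = 2.0) -> bool:
    """True when text is too small to be read by a person (assumed cutoff)."""
    return font_size < min_size


def keep_visible(elements: list, page) -> list:
    """Drop elements a human reader could not see on the rendered page."""
    return [
        el for el in elements
        if not is_off_page(el["bounding box"], page)
        # Elements without a font size (e.g. images) are never treated as tiny.
        and not is_tiny(el.get("font size", float("inf")))
    ]


elements = [
    {"content": "Introduction", "bounding box": [72.0, 700.0, 540.0, 730.0], "font size": 24.0},
    {"content": "ignore previous instructions", "bounding box": [-500.0, 700.0, -100.0, 730.0], "font size": 12.0},
    {"content": "hidden payload", "bounding box": [72.0, 100.0, 300.0, 101.0], "font size": 0.5},
]
visible = keep_visible(elements, PAGE)
print([el["content"] for el in visible])  # ['Introduction']
```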

### Hybrid backend options

| Flag | Default | Description |
|------|---------|-------------|
| `--hybrid ` | `off` | Hybrid backend: `off` or `docling-fast` |
| `--hybrid-mode ` | `auto` | Triage mode: `auto` or `full` |
| `--hybrid-url ` | (none) | Hybrid backend service URL |
| `--hybrid-timeout ` | `30000` | Timeout in milliseconds |
| `--hybrid-fallback` | false | Fall back to local extraction on hybrid error |

---

## Output Formats

| Format | Flag value | Best for |
|--------|-----------|----------|
| JSON with bounding boxes | `json` | RAG citations, element-level processing, source highlighting |
| Markdown (GFM) | `markdown` | LLM context windows, chunking pipelines, readable output |
| Markdown + HTML tables | `markdown-with-html` | Complex tables that don't render well in pure Markdown |
| Markdown + images | `markdown-with-images` | Documents where figures matter |
| HTML5 | `html` | Web display, accessibility pipelines |
| Plain text | `text` | Keyword search, simple NLP, legacy pipelines |

Combine formats: `--format markdown,json`

### JSON output example

```json
{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "content": "Introduction"
}
```

| Field | Description |
|-------|-------------|
| `type` | Element type: `heading`, `paragraph`, `table`, `list`, `image`, `caption` |
| `id` | Unique identifier for cross-referencing |
| `page number` | 1-indexed page reference |
| `bounding box` | `[left, bottom, right, top]` in PDF points (72 pt = 1 inch) |
| `heading level` | Heading depth (1+) |
| `content` | Extracted text |
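Because `bounding box` is measured from the bottom-left of the page in points, a viewer that draws on a rendered page image (origin at the top-left, units in pixels) needs a small conversion. The sketch below shows one way to do it. The function name, the `page_height_pt` parameter, and the 144 DPI default are assumptions for illustration, and it ignores page rotation and crop-box offsets.

```python
def bbox_to_pixels(bbox, page_height_pt: float, dpi: float = 144.0):
    """Convert [left, bottom, right, top] in PDF points (origin bottom-left)
    to (x, y, width, height) in pixels (origin top-left)."""
    left, bottom, right, top = bbox
    scale = dpi / 72.0                    # 72 pt = 1 inch
    x = left * scale
    y = (page_height_pt - top) * scale    # flip the vertical axis
    width = (right - left) * scale
    height = (top - bottom) * scale
    return (x, y, width, height)


# The heading from the JSON example, on a hypothetical 792 pt tall page.
print(bbox_to_pixels([72.0, 700.0, 540.0, 730.0], page_height_pt=792.0))
# (144.0, 124.0, 936.0, 60.0)
```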

---

## RAG / LLM Integration

EdgeParse is designed for AI pipelines. Every element has a `bounding box` and `page number`, so you can cite exact sources in answers.

### Extract Markdown for chunking

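```python
import edgeparse

md = edgeparse.convert("report.pdf", format="markdown")

# Feed directly into an LLM.
# `llm` is a placeholder for a chat-model client you have already constructed
# that exposes `.invoke()` (for example, a LangChain chat model).
response = llm.invoke(f"Summarize this document:\n\n{md}")
```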

### Extract JSON for source citations

```python
import json

import edgeparse

data = json.loads(edgeparse.convert("report.pdf", format="json"))

for element in data["kids"]:
    if element["type"] == "paragraph":
        # element["bounding box"] → highlight location in original PDF
        # element["page number"] → link back to source page
        print(f"p.{element['page number']}: {element['content'][:80]}")
```
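The loop above visits only top-level elements. If container elements such as lists or tables carry their children under their own `kids` key, a recursive walk reaches them too. That nesting is an assumption inferred from the top-level `kids` array and the pipeline's element-nesting stage; the schema is not spelled out in this README, and the sample data below is invented.

```python
from typing import Iterator


def iter_elements(node: dict) -> Iterator[dict]:
    """Yield every element under `node`, depth-first, in document order."""
    for child in node.get("kids", []):
        yield child
        yield from iter_elements(child)


# Invented sample shaped like the JSON output, with one nested container.
sample = {
    "kids": [
        {"type": "heading", "page number": 1, "content": "Introduction"},
        {"type": "list", "page number": 1, "kids": [
            {"type": "paragraph", "page number": 1, "content": "First item"},
            {"type": "paragraph", "page number": 2, "content": "Second item"},
        ]},
    ]
}

paragraphs = [el["content"] for el in iter_elements(sample) if el["type"] == "paragraph"]
print(paragraphs)  # ['First item', 'Second item']
```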

### LangChain integration

EdgeParse has no official LangChain loader yet, but integrates trivially:

```python
import json

import edgeparse
# `langchain_core.documents` is the current home of Document;
# `langchain.schema` is the legacy import path.
from langchain_core.documents import Document


def load_pdf(path: str) -> list[Document]:
    """Turn each paragraph and heading into a Document with citation metadata."""
    data = json.loads(edgeparse.convert(path, format="json"))
    docs = []
    for el in data["kids"]:
        if el["type"] in ("paragraph", "heading"):
            docs.append(Document(
                page_content=el["content"],
                metadata={
                    "source": path,
                    "page": el["page number"],
                    "bbox": el["bounding box"],
                    "type": el["type"],
                },
            ))
    return docs
```

### LlamaIndex integration

```python
import json

import edgeparse
from llama_index.core import Document


def edgeparse_reader(path: str) -> list[Document]:
    data = json.loads(edgeparse.convert(path, format="json"))
    return [
        Document(
            text=el["content"],
            metadata={"page": el["page number"], "source": path},
        )
        for el in data["kids"]
        if el.get("content")
    ]
```

### Which output format for which use case?

| Use case | Recommended format | Why |
|----------|--------------------|-----|
| Feed PDF to LLM | `markdown` | Clean structure, fits in context window |
| RAG with source citations | `json` | Bounding boxes enable "click-to-source" UX |
| Semantic chunking by section | `markdown` | Headings make natural chunk boundaries |
| Element-level filtering | `json` | Filter by `type`, `page number`, `heading level` |
| Web display | `html` | Styled output with semantic elements |

---

## Agent Skill

EdgeParse ships as a **Claude agent skill** — a structured description that teaches Claude (and any compatible AI agent) how to extract PDF content on behalf of users.

```bash

# Add the EdgeParse skill to your agent environment

npx skills add raphaelmansuy/edgeparse --skill edgeparse

# Install the Python package

pip install edgeparse

```

The `npx skills add` command registers the skill in `skills-lock.json`:

```json

{

  "version": 1,

  "skills": {

    "edgeparse": {

      "source": "raphaelmansuy/edgeparse",

      "sourceType": "github"

    }

  }

}

```

Once installed, the agent reads `skills/edgeparse/SKILL.md` and knows when to call `edgeparse.convert()`, which format to use for different tasks, and how to handle edge cases like encrypted PDFs, borderless tables, and multi-column layouts.

See [docs/08-agent-skill.md](docs/08-agent-skill.md) for the full integration guide (LangChain, LlamaIndex, MCP, CrewAI patterns).

---

## Architecture

### Crate structure

```
edgeparse/
├── crates/
│   ├── pdf-cos/            # Low-level PDF object model (fork of lopdf 0.39)
│   ├── edgeparse-core/     # Core extraction engine (~90 source files)
│   │   └── src/
│   │       ├── api/        # ProcessingConfig, FilterConfig, BatchResult
│   │       ├── pdf/        # Loader, content stream parser, font/CMap decoding
│   │       ├── models/     # ContentElement, BoundingBox, TextChunk, PdfDocument
│   │       ├── pipeline/   # 20-stage orchestrator + Rayon parallel helpers
│   │       ├── output/     # Renderers: JSON, Markdown, HTML, text, CSV
│   │       ├── tagged/     # Tagged-PDF structure tree → McidMap
│   │       └── utils/      # XY-Cut++ algorithm, sanitizer, layout analysis
│   ├── edgeparse-cli/      # CLI binary (clap derive, 25+ flags)
│   ├── edgeparse-wasm/     # WebAssembly build for browsers (wasm-bindgen)
│   ├── edgeparse-python/   # PyO3 native extension
│   └── edgeparse-node/     # NAPI-RS native addon
└── sdks/
    ├── python/             # Python packaging (maturin, pyproject.toml)
    └── node/               # npm packaging (TypeScript wrapper, tsup)
```

### Processing pipeline (20 stages)

```
PDF file
    │
    ▼
pdf-cos                   ← xref parsing, object graph, encrypted streams
    │
    ▼
edgeparse-core::pdf       ← page tree, content stream operators, font decoding,
    │                       CMap/ToUnicode, image extraction, tagged PDF
    ▼
edgeparse-core::pipeline  ← 20 sequential/parallel stages:
    │  [page range] → [watermark] → [filter] → [table borders] →
    │  [cell assignment] → [boxed headings] → [column detection] →
    │  [TextLine assembly] → [TextBlock grouping] → [table clustering] →
    │  [header/footer] → [list detection] → [paragraph] → [figure] →
    │  [heading classification] → [XY-Cut++ reading order] →
    │  [list pass 2] → [caption/footnote/TOC] → [cross-page tables] →
    │  [element nesting] → [final reading order] → [sanitize]
    ▼
edgeparse-core::output    ← render to JSON / Markdown / HTML / text
```

Per-page stages run in parallel via Rayon (`par_map_pages`); cross-page stages run sequentially.

---

## FAQ

### What is the best PDF parser for RAG?

For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. EdgeParse outputs structured JSON with bounding boxes for every element, handles multi-column layouts with XY-Cut++, and runs locally on CPU without a GPU or JVM. On the current 200-document benchmark it leads the comparison set in both overall score (`0.787`) and latency (`0.064 s/doc`). [See RAG integration examples](#rag--llm-integration).

### How do I cite PDF sources in RAG answers?

Every element in JSON output includes a `bounding box` (`[left, bottom, right, top]` in PDF points, 72 pt = 1 inch) and `page number`. Map the source chunk back to its bounding box to highlight the exact location in the original PDF — enabling "click-to-source" UX. No other non-OCR open-source parser provides bounding boxes for every element by default.

### How do I extract tables from PDF?

EdgeParse detects tables using border (ruling-line) analysis by default. For complex or borderless tables, add `--table-method cluster` (CLI) or `table_method="cluster"` (Python). This uses a text-clustering algorithm to detect table structure without visible borders. On the current benchmark it reaches `0.596` TEDS and `0.901` table-detection F1, both best in the published comparison set.

### Does it work without sending data to the cloud?

Yes. EdgeParse runs 100% locally. No API calls, no data transmission — your documents never leave your environment. The WebAssembly SDK also runs entirely in the browser: the PDF is processed client-side and never uploaded to any server. Ideal for legal, healthcare, and financial documents.

### Does it handle multi-column layouts?

Yes. XY-Cut++ reading order analysis correctly sequences text across multi-column pages, sidebars, and mixed layouts. This works without any configuration change; it is the default reading order algorithm.
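XY-Cut++ itself is not described in this README. As background, the sketch below implements a simplified version of the classic recursive XY-cut idea that it builds on: split the page at empty vertical gutters into columns, otherwise peel off the top horizontal band and recurse. It is not EdgeParse's implementation; the `min_gap` value, the function names, and the sample page are illustrative assumptions.

```python
# Boxes are (left, bottom, right, top) in PDF points, origin at the bottom-left.

def _split(boxes, lo, hi, min_gap):
    """Group boxes along one axis wherever an empty gap of at least min_gap exists.
    `lo`/`hi` are the tuple indices of the box's low and high edge on that axis."""
    ordered = sorted(boxes, key=lambda b: b[lo])
    groups = [[ordered[0]]]
    reach = ordered[0][hi]  # furthest high edge seen so far
    for box in ordered[1:]:
        if box[lo] - reach >= min_gap:
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[hi])
    return groups


def xy_cut(boxes, min_gap=5.0):
    """Return boxes in reading order: columns left to right, bands top to bottom."""
    if len(boxes) <= 1:
        return list(boxes)
    columns = _split(boxes, 0, 2, min_gap)      # cut along x: vertical gutters
    if len(columns) > 1:
        return [b for column in columns for b in xy_cut(column, min_gap)]
    bands = _split(boxes, 1, 3, min_gap)        # cut along y: sorted bottom to top
    if len(bands) > 1:
        top_band = bands[-1]
        rest = [b for band in bands[:-1] for b in band]
        # Peel only the top band so the remainder can still split into columns.
        return xy_cut(top_band, min_gap) + xy_cut(rest, min_gap)
    return sorted(boxes, key=lambda b: (-b[3], b[0]))  # no gap: top-down, left-right


title = (72.0, 700.0, 540.0, 730.0)      # spans both columns
left_1 = (72.0, 600.0, 290.0, 690.0)
left_2 = (72.0, 500.0, 290.0, 590.0)
right_1 = (320.0, 600.0, 540.0, 690.0)
right_2 = (320.0, 500.0, 540.0, 590.0)

ordered = xy_cut([right_2, left_1, title, right_1, left_2])
print(ordered == [title, left_1, left_2, right_1, right_2])  # True
```

Trying the column cut before the band cut is what keeps the left column together here; sorting purely by vertical position would interleave `left_1` with `right_1`.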

### Does it need a GPU or Java?

No. EdgeParse is a pure Rust implementation. It requires no JVM, no GPU, no OCR models, and no Python runtime for the CLI binary. The CLI binary is ~15 MB stripped. On Apple M4 Max, it processes the 200-document benchmark corpus in about `12.7` seconds total.

### How does it compare to OpenDataLoader, Docling, Marker, and PyMuPDF4LLM?

| vs. | EdgeParse advantage | Tradeoff |
|-----|---------------------|----------|
| OpenDataLoader | Faster (`0.064` vs `0.094 s/doc`) with stronger table structure and heading recovery | OpenDataLoader remains close on text quality and paragraph boundaries |
| IBM Docling | Faster (`0.064` vs `0.768 s/doc`) with better TEDS and overall score in the current benchmark snapshot | Docling remains a viable OCR-heavy fallback for scanned documents |
| Marker | Faster and materially better on every published metric in this benchmark family | Marker supports scanned PDFs via Surya OCR |
| PyMuPDF4LLM | Faster (`0.064` vs `0.439 s/doc`) with stronger tables, headings, and reading order | PyMuPDF4LLM is simpler if you only need lightweight text extraction |

### Does it support scanned PDFs?

Not directly. EdgeParse is built for born-digital PDFs with embedded fonts and does not include an OCR engine. For scanned documents, use the hybrid backend option (`--hybrid docling-fast`), which routes complex pages to a locally running Docling-Fast backend.

### What does "deterministic" mean?

Same input PDF → same output, every time. No stochastic models, no floating-point non-determinism from ML inference. This makes EdgeParse safe to use in CI pipelines, regression tests, and compliance workflows where reproducibility is required.
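One way to check this property in a test suite is to convert the same file several times and compare digests of the output. The helper below is a generic sketch with hypothetical names; it uses a stand-in converter so it runs without a PDF. In practice you would pass something like `lambda p: edgeparse.convert(p, format="markdown")`.

```python
import hashlib
from typing import Callable


def output_digest(text: str) -> str:
    """SHA-256 of the converter output, used as a compact fingerprint."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_deterministic(convert: Callable[[str], str], path: str, runs: int = 3) -> bool:
    """Run `convert` several times and report whether every output was identical."""
    digests = {output_digest(convert(path)) for _ in range(runs)}
    return len(digests) == 1


# Stand-in converter so the sketch is runnable without a PDF on disk.
def fake_convert(path: str) -> str:
    return f"# Extracted from {path}\n"


print(is_deterministic(fake_convert, "report.pdf"))  # True
```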

### How do I chunk PDFs for semantic search?

Use `format="markdown"`. EdgeParse preserves heading hierarchy and table structure in Markdown output — headings make natural chunk boundaries for `RecursiveCharacterTextSplitter` (LangChain) or heading-based splitters. For element-level control, use `format="json"` and split on `heading level` boundaries or `page number` changes.
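For a dependency-free starting point, heading-based chunking of the Markdown output can be sketched in a few lines. The function below is an illustration, not part of EdgeParse: `chunk_by_heading` and `max_level` are hypothetical names, and the sample text stands in for real converter output.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker; lines inside fences are never headings


def chunk_by_heading(markdown: str, max_level: int = 2) -> list:
    """Start a new chunk at every heading whose level is <= max_level."""
    chunks = []
    heading, lines = None, []
    in_fence = False

    def flush():
        # Skip the empty preamble that precedes the first heading.
        if heading is not None or any(ln.strip() for ln in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks


sample = (
    "# Report\nIntro text.\n\n"
    "## Methods\nWe measured.\n\n### Details\nFine print.\n\n"
    "## Results\nIt worked.\n"
)
chunks = chunk_by_heading(sample)
print([c["heading"] for c in chunks])  # ['Report', 'Methods', 'Results']
```

Deeper headings (here `### Details`) stay inside their parent chunk, which keeps small subsections together with their context.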

### Does the Python SDK run on Windows?

Yes. Pre-built wheels are available for Windows (x64). The Python package installs from PyPI with no compilation needed:

```bash

pip install edgeparse

```

---

## Tutorials

Step-by-step guides with working examples live in [`tutorials/`](tutorials/):

| Tutorial | Description |
|----------|-------------|
| [tutorials/01-cli.md](tutorials/01-cli.md) | All CLI flags with working examples and output samples |
| [tutorials/02-python-sdk.md](tutorials/02-python-sdk.md) | `pip install edgeparse`: full API, batch processing, JSON parsing |
| [tutorials/03-nodejs-sdk.md](tutorials/03-nodejs-sdk.md) | `npm install edgeparse`: TypeScript, CJS, and worker threads |
| [tutorials/04-rust-library.md](tutorials/04-rust-library.md) | `edgeparse-core` in your Rust project: config, models, Rayon |
| [tutorials/05-output-formats.md](tutorials/05-output-formats.md) | JSON schema, bounding boxes, Markdown variants, HTML, plain text |

---

## Documentation

Technical documentation lives in [`docs/`](docs/):

| Document | Description |
|----------|-------------|
| [docs/00-overview.md](docs/00-overview.md) | Project overview, goals, and design philosophy |
| [docs/01-architecture.md](docs/01-architecture.md) | Crate structure, module map, data-flow diagram |
| [docs/02-pipeline.md](docs/02-pipeline.md) | All 20 pipeline stages with ASCII diagrams |
| [docs/03-data-model.md](docs/03-data-model.md) | Type hierarchy: `ContentElement`, `BoundingBox`, `PdfDocument` |
| [docs/04-pdf-extraction.md](docs/04-pdf-extraction.md) | PDF loader, chunk parser, font/CMap decoding |
| [docs/05-output-formats.md](docs/05-output-formats.md) | JSON schema, Markdown renderer, HTML/text/CSV output |
| [docs/06-sdk-integration.md](docs/06-sdk-integration.md) | CLI flag reference, Python SDK API, Node.js SDK API, Batch API |
| [docs/07-cicd-publishing.md](docs/07-cicd-publishing.md) | CI/CD publishing pipeline: how it works and how to configure it |
| [docs/08-agent-skill.md](docs/08-agent-skill.md) | EdgeParse agent skill: `npx skills add`, SKILL.md structure, SDK patterns |
| [docs/09-wasm-sdk.md](docs/09-wasm-sdk.md) | WebAssembly SDK: objectives, API, use cases, build instructions |

---

## Project Layout

```
edgeparse/
├── LICENSE
├── CONTRIBUTING.md
├── README.md
├── Cargo.toml               # Rust workspace (5 members)
├── Cargo.lock
│
├── crates/
│   ├── pdf-cos/             # lopdf 0.39 fork: low-level PDF object model
│   ├── edgeparse-core/      # Core extraction engine (~90 source files)
│   ├── edgeparse-cli/       # CLI binary (clap, 25+ flags)
│   ├── edgeparse-wasm/      # WebAssembly build for browsers (wasm-bindgen)
│   ├── edgeparse-python/    # PyO3 native Python extension
│   └── edgeparse-node/      # NAPI-RS native Node.js addon
│
├── sdks/
│   ├── python/              # Python wheel packaging (maturin + pyproject.toml)
│   │   └── edgeparse/       # Pure-Python wrapper + CLI entry point
│   └── node/                # npm packaging (TypeScript + tsup + vitest)
│       └── src/             # index.ts, types.ts, cli.ts
│
├── benchmark/               # Evaluation suite
│   ├── run.py               # Benchmark runner (EdgeParse)
│   ├── compare_all.py       # Multi-engine comparison (9 engines)
│   ├── pyproject.toml
│   ├── thresholds.json      # Regression thresholds
│   ├── pdfs/                # Benchmark PDFs (200 docs)
│   ├── ground-truth/        # Reference Markdown and JSON annotations
│   ├── prediction/          # Per-engine output directories
│   ├── reports/             # HTML benchmark reports
│   └── src/                 # Python evaluators and engine adapters
│
├── docs/                    # Technical documentation (Markdown)
│
├── demo/                    # Interactive WASM demo (Vite + TypeScript)
│   └── src/                 # Demo application source
│
├── examples/
│   └── pdf/                 # Sample PDFs for quick testing
│       ├── lorem.pdf
│       ├── 1901.03003.pdf   # Academic paper (multi-column)
│       ├── 2408.02509v1.pdf # Academic paper
│       └── chinese_scan.pdf # CJK + scan example
│
├── benches/                 # Rust micro-benchmarks (criterion)
├── docker/                  # Dockerfile and Dockerfile.dev
├── scripts/                 # bench.sh, publish-crates.sh
└── tests/
    └── fixtures/            # Rust integration test fixtures
```

---

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). In short:

1. Fork, branch from `main`

2. `cargo fmt && cargo clippy -- -D warnings`

3. Run the benchmark to check for regressions: `cd benchmark && uv run python run.py --engine edgeparse`

4. Open a PR

---

## License

EdgeParse is licensed under the **Apache License 2.0**. See [LICENSE](LICENSE) for the full text.

The `crates/pdf-cos/` directory is a fork of [lopdf](https://github.com/J-F-Liu/lopdf) (MIT/Apache-2.0 dual-licensed).  

Benchmark PDF documents (`benchmark/pdfs/`) are sourced from publicly available documents and are used solely for evaluation purposes.