{"id":47666061,"url":"https://github.com/raphaelmansuy/edgeparse","last_synced_at":"2026-04-13T14:01:26.301Z","repository":{"id":346113001,"uuid":"1188533892","full_name":"raphaelmansuy/edgeparse","owner":"raphaelmansuy","description":"EdgeParse converts any digital PDF into Markdown, JSON (with bounding boxes), HTML, or plain text — deterministically, without a JVM, without a GPU, and with best-in-class accuracy on the 200-document benchmark suite included in this repository.","archived":false,"fork":false,"pushed_at":"2026-03-29T01:33:07.000Z","size":65940,"stargazers_count":73,"open_issues_count":9,"forks_count":10,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-03T00:41:44.056Z","etag":null,"topics":["markdown","markitdown","pdf","pdf-converter"],"latest_commit_sha":null,"homepage":"https://edgeparse.com","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/raphaelmansuy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-22T08:02:45.000Z","updated_at":"2026-04-02T22:51:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/raphaelmansuy/edgeparse","commit_stats":null,"previous_names":["raphaelmansuy/edgeparse"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/raphaelmansuy/edgeparse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelmansuy%2Fedgeparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelmansuy%2Fedgeparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelmansuy%2Fedgeparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelmansuy%2Fedgeparse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/raphaelmansuy","download_url":"https://codeload.github.com/raphaelmansuy/edgeparse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/raphaelmansuy%2Fedgeparse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31755536,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T13:27:56.013Z","status":"ssl_error","status_checked_at":"2026-04-13T13:21:23.512Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["markdown","markitdown","pdf","pdf-converter"],"created_at":"2026-04-02T11:57:36.079Z","updated_at":"2026-04-13T14:01:26.277Z","avatar_url":"https://github.com/raphaelmansuy.png","language":"Rust","readme":"# EdgeParse\n\n**Fastest PDF extraction engine. Rust-native. Zero GPU, zero JVM, zero OCR models.**\n\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)\n[![Rust](https://img.shields.io/badge/Rust-1.85%2B-orange.svg)](https://www.rust-lang.org/)\n[![crates.io](https://img.shields.io/crates/v/edgeparse-cli.svg)](https://crates.io/crates/edgeparse-cli)\n[![PyPI](https://img.shields.io/pypi/v/edgeparse.svg)](https://pypi.org/project/edgeparse/)\n[![npm](https://img.shields.io/npm/v/edgeparse.svg)](https://www.npmjs.com/package/edgeparse)\n\nExtract Markdown, JSON (with bounding boxes), and HTML from any born-digital PDF\n— deterministically, in milliseconds, on CPU.\n\n- **How accurate is it?** — **0.787 overall** on the latest `opendataloader.org` PDF-to-Markdown benchmark, with the best score in every reported metric: reading order, tables, headings, paragraphs, text quality, table detection, and speed. [Benchmark details](#benchmark)\n- **How fast?** — **0.064 s/doc** on the 200-document benchmark corpus on Apple M4 Max. Faster than OpenDataLoader, Docling, PyMuPDF4LLM, MarkItDown, and LiteParse.\n- **Does it need a GPU or Java?** — No. No JVM, no GPU, no OCR models, no Python runtime for the CLI. Single ~15 MB binary.\n- **RAG / LLM pipelines?** — Yes. Outputs structured Markdown for chunking, JSON with bounding boxes for citations, preserves reading order across multi-column layouts. [See integration examples](#rag--llm-integration)\n\nAvailable as a **Rust library**, **CLI binary**, **Python package**, **Node.js package**, and **WebAssembly module** for in-browser PDF parsing.\n\n---\n\n## Table of Contents\n\n- [Get Started in 30 Seconds](#get-started-in-30-seconds)\n- [What Problems Does This Solve?](#what-problems-does-this-solve)\n- [Release Channels](#release-channels)\n- [Benchmark](#benchmark)\n- [Capability Matrix](#capability-matrix)\n- [Installation](#installation)\n- [Python SDK](#python-sdk)\n- [Node.js SDK](#nodejs-sdk)\n- [WebAssembly SDK](#webassembly-sdk)\n- [CLI Reference](#cli-reference)\n- [Output Formats](#output-formats)\n- [RAG / LLM Integration](#rag--llm-integration)\n- [Agent Skill](#agent-skill)\n- [Architecture](#architecture)\n- [FAQ](#faq)\n- [Tutorials](#tutorials)\n- [Documentation](#documentation)\n- [Project Layout](#project-layout)\n- [Contributing](#contributing)\n- [License](#license)\n\n---\n\n## Get Started in 30 Seconds\n\n**Python** (Python 3.9+):\n\n```bash\npip install edgeparse\n```\n\n```python\nimport edgeparse\n\n# Markdown — ready for LLM context or RAG chunking\nmd = edgeparse.convert(\"report.pdf\", format=\"markdown\")\n\n# JSON with bounding boxes — for citations and element-level control\ndata = edgeparse.convert(\"report.pdf\", format=\"json\")\n```\n\n**CLI** (Rust 1.85+):\n\n```bash\ncargo install edgeparse-cli\nedgeparse report.pdf --format markdown --output-dir output/\n```\n\n**Node.js** (Node 18+):\n\n```bash\nnpm install edgeparse\n```\n\n```js\nimport { convert } from 'edgeparse';\nconst md = convert('report.pdf', { format: 'markdown' });\n```\n\n---\n\n## Release Channels\n\nTagged releases publish every supported distribution target through GitHub Actions:\n\n| Channel | Artifact | Install / Pull |\n|---------|----------|----------------|\n| Rust crates | `pdf-cos`, `edgeparse-core`, `edgeparse-cli` | `cargo install edgeparse-cli` |\n| Python SDK | `edgeparse` wheels + sdist | `pip install edgeparse` |\n| Node.js SDK | `edgeparse` + 5 platform addons | `npm install edgeparse` |\n| WebAssembly SDK | `edgeparse-wasm` | `npm install edgeparse-wasm` |\n| CLI binaries | GitHub Release archives for macOS, Linux, Windows | [GitHub Releases](https://github.com/raphaelmansuy/edgeparse/releases) |\n| Homebrew | `raphaelmansuy/edgeparse` tap | `brew tap raphaelmansuy/edgeparse \u0026\u0026 brew install edgeparse` |\n| Containers | GHCR + Docker Hub multi-arch images | `docker pull ghcr.io/raphaelmansuy/edgeparse:0.2.1` |\n\nRelease automation and registry details: [docs/07-cicd-publishing.md](docs/07-cicd-publishing.md)\n\n---\n\n## What Problems Does This Solve?\n\n| Problem | EdgeParse Solution | Status |\n|---------|--------------------|--------|\n| PDF text loses reading order in multi-column layouts | XY-Cut++ algorithm preserves correct reading sequence across columns, sidebars, and mixed layouts | ✅ Shipped |\n| Table extraction is broken (merged cells, borderless tables) | Ruling-line table detection + borderless cluster method; `--table-method cluster` for complex cases | ✅ Shipped |\n| OCR/ML tools add 500 MB+ of dependencies to a simple PDF pipeline | Zero GPU, zero OCR models, zero JVM — single 15 MB binary, pure Rust | ✅ Shipped |\n| Heading hierarchy is lost (all text looks the same) | Font-metric + geometry-based heading classifier; MHS score 0.553 on the current benchmark | ✅ Shipped |\n| PDFs can carry hidden prompt injection payloads | AI safety filters: hidden text, off-page content, tiny-text, invisible OCG layers detected and stripped | ✅ Shipped |\n| Need bounding boxes to cite sources in RAG answers | Every element (`paragraph`, `heading`, `table`, `image`) has `[left, bottom, right, top]` coordinates in PDF points | ✅ Shipped |\n| In-browser PDF parsing uploads data to a server | WebAssembly build — full Rust engine in the browser, PDF data never leaves the device | ✅ Shipped |\n\n---\n\n## Benchmark\n\nEvaluated on **200 real-world PDFs** — academic papers, financial reports, multi-column layouts, complex tables, mixed-language documents — running on Apple M4 Max.\n\n### Current comparison set\n\n| Engine | NID ↑ | TEDS ↑ | MHS ↑ | PBF ↑ | TQS ↑ | TD F1 ↑ | Speed ↓ | Overall ↑ |\n|--------|------:|-------:|------:|------:|------:|--------:|--------:|----------:|\n| **EdgeParse** | **0.889** | **0.596** | **0.553** | **0.559** | **0.920** | **0.901** | **0.064 s/doc** | **0.787** |\n| OpenDataLoader | 0.873 | 0.326 | 0.442 | 0.544 | 0.916 | 0.636 | 0.094 s/doc | 0.733 |\n| Docling | 0.867 | 0.540 | 0.438 | 0.530 | 0.908 | 0.891 | 0.768 s/doc | 0.745 |\n| PyMuPDF4LLM | 0.852 | 0.323 | 0.407 | 0.538 | 0.888 | 0.744 | 0.439 s/doc | 0.710 |\n| EdgeParse (pre-frontier baseline) | 0.859 | 0.493 | 0.500 | 0.482 | 0.891 | 0.849 | 0.232 s/doc | 0.751 |\n| MarkItDown | 0.808 | 0.193 | 0.001 | 0.362 | 0.861 | 0.558 | 0.149 s/doc | 0.564 |\n| LiteParse | 0.815 | 0.000 | 0.001 | 0.383 | 0.887 | N/A | 0.196 s/doc | 0.564 |\n\nEdgeParse now leads the entire comparison set on every reported benchmark metric, including speed. Relative to the previous EdgeParse baseline, the current pipeline increases reading-order accuracy, table structure similarity, paragraph boundaries, text quality, table-detection F1, and overall score while cutting latency from `0.232` to `0.064 s/doc`.\n\n**When to choose what:**\n\n| Use case | Recommendation |\n|----------|---------------|\n| Born-digital PDFs, latency matters, production deployment | **EdgeParse** — best accuracy/speed, zero dependencies |\n| Complex scanned tables, GPU available, batch offline | Consider Docling or MinerU |\n| Scanned documents requiring full OCR | Dedicated OCR pipeline |\n\n### Metrics\n\n| Metric | What it measures |\n|--------|-----------------|\n| **NID** | Reading order accuracy — normalised index distance |\n| **TEDS** | Table structure accuracy — tree-edit distance vs. ground truth |\n| **MHS** | Heading hierarchy accuracy |\n| **PBF** | Paragraph boundary F1 |\n| **TQS** | Text quality score |\n| **TD F1** | Table detection F1 |\n| **Overall** | Normalized aggregate benchmark score |\n| **Speed** | Wall-clock seconds per document (full pipeline, 200 docs, parallel) |\n\n### Running the benchmark\n\n```bash\ncargo build --release\ncd benchmark\nuv sync\nuv run python run.py          # EdgeParse on all 200 docs\nuv run python compare_all.py  # Compare against 9 engines\n```\n\nResults → `benchmark/prediction/edgeparse/` · HTML reports → `benchmark/reports/`\n\n### Regression thresholds\n\n`benchmark/thresholds.json` defines minimum acceptable scores for CI:\n\n```json\n{\n  \"nid\": 0.85,\n  \"teds\": 0.40,\n  \"mhs\": 0.55,\n  \"table_detection_f1\": 0.55,\n  \"elapsed_per_doc\": 2.0\n}\n```\n\n---\n\n## Capability Matrix\n\n| Capability | Available | Notes |\n|-----------|-----------|-------|\n| **Extraction** | | |\n| Text with correct reading order | ✅ | XY-Cut++ across columns and sidebars |\n| Bounding boxes for every element | ✅ | `[left, bottom, right, top]` in PDF points |\n| Table extraction — ruling-line borders | ✅ | Default mode |\n| Table extraction — borderless/cluster | ✅ | `--table-method cluster` |\n| Heading hierarchy detection | ✅ | Numbered + unnumbered, all levels |\n| List detection (numbered, bulleted, nested) | ✅ | |\n| Image extraction with coordinates | ✅ | PNG or JPEG, embedded or external |\n| Header / footer / watermark filtering | ✅ | |\n| Multi-column layout support | ✅ | |\n| CMap / ToUnicode font decoding | ✅ | Handles non-standard encodings |\n| Tagged PDF structure tree | ✅ | `--use-struct-tree` preserves author intent |\n| **Safety** | | |\n| Hidden text filtering | ✅ | Prompt injection protection |\n| Off-page content filtering | ✅ | |\n| Tiny-text / invisible OCG layer filtering | ✅ | |\n| PII sanitization | ✅ | `--sanitize` flag |\n| **Output** | | |\n| Markdown (GFM tables) | ✅ | |\n| JSON with bounding boxes | ✅ | |\n| HTML5 | ✅ | |\n| Plain text (reading order preserved) | ✅ | |\n| **SDKs** | | |\n| Python (PyO3 native extension) | ✅ | Python 3.9+, pre-built wheels |\n| Node.js (NAPI-RS native addon) | ✅ | Node 18+, pre-built addons |\n| WebAssembly | ✅ | In-browser, no server required |\n| Rust library | ✅ | `edgeparse-core` crate |\n| CLI binary | ✅ | `edgeparse-cli` on crates.io |\n| **Runtime** | | |\n| GPU required | ❌ No | CPU only |\n| JVM required | ❌ No | Pure Rust |\n| OCR models required | ❌ No | Born-digital PDFs only |\n| Parallel processing | ✅ | Rayon per-page parallelism |\n| Deterministic / reproducible | ✅ | Same input → same output, always |\n\n---\n\n## Installation\n\n### Python\n\n```bash\npip install edgeparse\n```\n\nPython 3.9+. Pre-built wheels for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).\n\n### CLI\n\n```bash\ncargo install edgeparse-cli\n```\n\nRequires [Rust 1.85+](https://rustup.rs/).\n\n### Rust library\n\n```toml\n[dependencies]\nedgeparse-core = \"0.2.1\"\n```\n\nDocs: [docs.rs/edgeparse-core](https://docs.rs/edgeparse-core) · [docs.rs/edgeparse-cli](https://docs.rs/edgeparse-cli)\n\n### Node.js\n\n```bash\nnpm install edgeparse\n```\n\nNode 18+. Pre-built native addons for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).\n\n### Build from source\n\n```bash\ngit clone https://github.com/raphaelmansuy/edgeparse.git\ncd edgeparse\ncargo build --release\n# Binary: target/release/edgeparse\n```\n\n### System requirements\n\n- macOS 12+, Linux (glibc 2.17+), or Windows 10+\n- ~15 MB binary (stripped release build)\n- No Java, no Python (for the CLI), no GPU\n\n---\n\n## Python SDK\n\n**Package:** `edgeparse` · **Requires:** Python 3.9+ · **Source:** [`sdks/python/`](sdks/python/)\n\n### `edgeparse.convert()`\n\n```python\ndef convert(\n    input_path: str | Path,\n    *,\n    format: str = \"markdown\",       # \"markdown\", \"json\", \"html\", \"text\"\n    pages: str | None = None,        # e.g. \"1,3,5-7\"\n    password: str | None = None,\n    reading_order: str = \"xycut\",   # \"xycut\" or \"off\"\n    table_method: str = \"default\",  # \"default\" or \"cluster\"\n    image_output: str = \"off\",      # \"off\", \"embedded\", \"external\"\n) -\u003e str: ...\n```\n\n### `edgeparse.convert_file()`\n\n```python\ndef convert_file(\n    input_path: str | Path,\n    output_dir: str | Path = \"output\",\n    *,\n    format: str = \"markdown\",\n    pages: str | None = None,\n    password: str | None = None,\n) -\u003e str: ...  # returns output file path\n```\n\n### Examples\n\n```python\nimport edgeparse\n\n# Basic Markdown extraction\nmd = edgeparse.convert(\"report.pdf\", format=\"markdown\")\n\n# JSON with bounding boxes\njson_str = edgeparse.convert(\"report.pdf\", format=\"json\")\n\n# Password-protected, specific pages, borderless tables\nmd = edgeparse.convert(\n    \"secure.pdf\",\n    format=\"markdown\",\n    pages=\"1,3,5-7\",\n    password=\"secret\",\n    reading_order=\"xycut\",\n    table_method=\"cluster\",\n)\n\n# Write output file  \nout_path = edgeparse.convert_file(\"report.pdf\", output_dir=\"output/\", format=\"markdown\")\n```\n\n### CLI entry point (Python package)\n\n```bash\nedgeparse report.pdf -f markdown -o output/\nedgeparse *.pdf --format json --output-dir out/ --pages \"1-3\"\n```\n\n---\n\n## Node.js SDK\n\n**Package:** `edgeparse` · **Requires:** Node.js 18+ · **Source:** [`sdks/node/`](sdks/node/)\n\n### `convert()`\n\n```ts\nimport { convert } from 'edgeparse';\n\n// Markdown\nconst md = convert('report.pdf', { format: 'markdown' });\n\n// JSON with bounding boxes\nconst json = convert('report.pdf', { format: 'json' });\n\n// With options\nconst result = convert('report.pdf', {\n  format: 'markdown',\n  pages: '1-5',\n  readingOrder: 'xycut',\n  tableMethod: 'cluster',\n});\n```\n\n### `ConvertOptions`\n\n```ts\ninterface ConvertOptions {\n  format?: 'markdown' | 'json' | 'html' | 'text';  // default: \"markdown\"\n  pages?: string;         // e.g. \"1,3,5-7\"\n  password?: string;\n  readingOrder?: 'xycut' | 'off';        // default: \"xycut\"\n  tableMethod?: 'default' | 'cluster';   // default: \"default\"\n  imageOutput?: 'off' | 'embedded' | 'external';  // default: \"off\"\n}\n```\n\n### CLI (npm)\n\n```bash\nnpx edgeparse report.pdf -f markdown -o output.md\nnpx edgeparse report.pdf --format json --pages \"1-5\"\n```\n\n---\n\n## WebAssembly SDK\n\nEdgeParse compiles to WebAssembly — **client-side PDF extraction in any modern browser with no server, no uploads, and no backend infrastructure.**\n\n- Same Rust engine, identical output to CLI/Python/Node\n- PDF data never leaves the user's device (privacy by design)\n- Works offline after initial WASM load (~4 MB cached)\n- Zero infrastructure cost — static hosting only\n\n### Quick start\n\n```typescript\nimport init, { convert_to_string } from 'edgeparse-wasm';\n\nawait init();  // load WASM binary once\n\nconst bytes = new Uint8Array(await file.arrayBuffer());\n\nconst markdown = convert_to_string(bytes, 'markdown');\nconst json     = convert_to_string(bytes, 'json');\nconst html     = convert_to_string(bytes, 'html');\n```\n\n### API\n\n| Function | Returns | Description |\n|----------|---------|-------------|\n| `convert(bytes, format?, pages?, readingOrder?, tableMethod?)` | JS object | Structured `PdfDocument` with pages, elements, bounding boxes |\n| `convert_to_string(bytes, format?, pages?, readingOrder?, tableMethod?)` | `string` | Formatted output (Markdown, JSON, HTML, or text) |\n| `version()` | `string` | EdgeParse version |\n\n### Live demo\n\n**[edgeparse.com/demo/](https://edgeparse.com/demo/)** — drag-and-drop any PDF, all processing runs locally in your browser.\n\n### Build from source\n\n```bash\ncurl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh\ncd crates/edgeparse-wasm\nwasm-pack build --target web --release\n# Output: crates/edgeparse-wasm/pkg/\n```\n\nFull documentation: [docs/09-wasm-sdk.md](docs/09-wasm-sdk.md)\n\n---\n\n## CLI Reference\n\n```\nedgeparse [OPTIONS] \u003cPDF_FILE\u003e...\n```\n\n### Core options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `-o, --output-dir \u003cDIR\u003e` | — | Write output files to this directory |\n| `-f, --format \u003cFMT\u003e` | `json` | Output format(s), comma-separated |\n| `-p, --password \u003cPW\u003e` | — | Password for encrypted PDFs |\n| `--pages \u003cRANGE\u003e` | — | Page range e.g. `\"1,3,5-7\"` |\n| `-q, --quiet` | false | Suppress log output |\n\n### Output format values\n\n| Value | Description |\n|-------|-------------|\n| `json` | Structured JSON with bounding boxes and element types (default) |\n| `markdown` | Standard Markdown with GFM tables |\n| `markdown-with-html` | Markdown with HTML table fallback for complex tables |\n| `markdown-with-images` | Markdown with embedded or linked images |\n| `html` | Full HTML5 document with semantic elements |\n| `text` | Plain UTF-8 text, reading order preserved |\n\nMultiple formats: `--format markdown,json`\n\n### Layout \u0026 extraction options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--reading-order \u003cALGO\u003e` | `xycut` | Reading order: `xycut` or `off` |\n| `--table-method \u003cMETHOD\u003e` | `default` | Table detection: `default` (ruling lines) or `cluster` (borderless) |\n| `--keep-line-breaks` | false | Preserve original line breaks within paragraphs |\n| `--use-struct-tree` | false | Use tagged PDF structure tree when available |\n| `--include-header-footer` | false | Include headers and footers in output |\n| `--sanitize` | false | Enable PII sanitization |\n| `--replace-invalid-chars \u003cCH\u003e` | `\" \"` | Replacement for invalid Unicode characters |\n\n### Image options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--image-output \u003cMODE\u003e` | `external` | `off`, `embedded` (base64), or `external` (files) |\n| `--image-format \u003cFMT\u003e` | `png` | `png` or `jpeg` |\n| `--image-dir \u003cDIR\u003e` | — | Directory for extracted image files |\n\n### Separator options\n\n| Flag | Description |\n|------|-------------|\n| `--markdown-page-separator \u003cSTR\u003e` | String inserted between Markdown pages |\n| `--text-page-separator \u003cSTR\u003e` | String inserted between plain-text pages |\n| `--html-page-separator \u003cSTR\u003e` | String inserted between HTML pages |\n\n### Content safety options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--content-safety-off \u003cFLAGS\u003e` | — | Disable safety filters: `all`, `hidden-text`, `off-page`, `tiny`, `hidden-ocg` |\n\n### Hybrid backend options\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--hybrid \u003cBACKEND\u003e` | `off` | Hybrid backend: `off` or `docling-fast` |\n| `--hybrid-mode \u003cMODE\u003e` | `auto` | Triage mode: `auto` or `full` |\n| `--hybrid-url \u003cURL\u003e` | — | Hybrid backend service URL |\n| `--hybrid-timeout \u003cMS\u003e` | `30000` | Timeout in milliseconds |\n| `--hybrid-fallback` | false | Fall back to local extraction on hybrid error |\n\n---\n\n## Output Formats\n\n| Format | Flag value | Best for |\n|--------|-----------|----------|\n| JSON with bounding boxes | `json` | RAG citations, element-level processing, source highlighting |\n| Markdown (GFM) | `markdown` | LLM context windows, chunking pipelines, readable output |\n| Markdown + HTML tables | `markdown-with-html` | Complex tables that don't render well in pure Markdown |\n| Markdown + images | `markdown-with-images` | Documents where figures matter |\n| HTML5 | `html` | Web display, accessibility pipelines |\n| Plain text | `text` | Keyword search, simple NLP, legacy pipelines |\n\nCombine formats: `--format markdown,json`\n\n### JSON output example\n\n```json\n{\n  \"type\": \"heading\",\n  \"id\": 42,\n  \"level\": \"Title\",\n  \"page number\": 1,\n  \"bounding box\": [72.0, 700.0, 540.0, 730.0],\n  \"heading level\": 1,\n  \"font\": \"Helvetica-Bold\",\n  \"font size\": 24.0,\n  \"content\": \"Introduction\"\n}\n```\n\n| Field | Description |\n|-------|-------------|\n| `type` | Element type: `heading`, `paragraph`, `table`, `list`, `image`, `caption` |\n| `id` | Unique identifier for cross-referencing |\n| `page number` | 1-indexed page reference |\n| `bounding box` | `[left, bottom, right, top]` in PDF points (72 pt = 1 inch) |\n| `heading level` | Heading depth (1+) |\n| `content` | Extracted text |\n\n---\n\n## RAG / LLM Integration\n\nEdgeParse is designed for AI pipelines. Every element has a `bounding box` and `page number`, so you can cite exact sources in answers.\n\n### Extract Markdown for chunking\n\n```python\nimport edgeparse\n\nmd = edgeparse.convert(\"report.pdf\", format=\"markdown\")\n\n# Feed directly into an LLM\nresponse = llm.invoke(f\"Summarize this document:\\n\\n{md}\")\n```\n\n### Extract JSON for source citations\n\n```python\nimport json, edgeparse\n\ndata = json.loads(edgeparse.convert(\"report.pdf\", format=\"json\"))\n\nfor element in data[\"kids\"]:\n    if element[\"type\"] == \"paragraph\":\n        # element[\"bounding box\"] → highlight location in original PDF\n        # element[\"page number\"] → link back to source page\n        print(f\"p.{element['page number']}: {element['content'][:80]}\")\n```\n\n### LangChain integration\n\nEdgeParse has no official LangChain loader yet, but integrates trivially:\n\n```python\nfrom langchain.schema import Document\nimport edgeparse, json\n\ndef load_pdf(path: str) -\u003e list[Document]:\n    data = json.loads(edgeparse.convert(path, format=\"json\"))\n    docs = []\n    for el in data[\"kids\"]:\n        if el[\"type\"] in (\"paragraph\", \"heading\"):\n            docs.append(Document(\n                page_content=el[\"content\"],\n                metadata={\n                    \"source\": path,\n                    \"page\": el[\"page number\"],\n                    \"bbox\": el[\"bounding box\"],\n                    \"type\": el[\"type\"],\n                }\n            ))\n    return docs\n```\n\n### LlamaIndex integration\n\n```python\nfrom llama_index.core import Document\nimport edgeparse, json\n\ndef edgeparse_reader(path: str) -\u003e list[Document]:\n    data = json.loads(edgeparse.convert(path, format=\"json\"))\n    return [\n        Document(\n            text=el[\"content\"],\n            metadata={\"page\": el[\"page number\"], \"source\": path}\n        )\n        for el in data[\"kids\"]\n        if el.get(\"content\")\n    ]\n```\n\n### Which output format for which use case?\n\n| Use case | Recommended format | Why |\n|----------|--------------------|-----|\n| Feed PDF to LLM | `markdown` | Clean structure, fits in context window |\n| RAG with source citations | `json` | Bounding boxes enable \"click-to-source\" UX |\n| Semantic chunking by section | `markdown` | Headings make natural chunk boundaries |\n| Element-level filtering | `json` | Filter by `type`, `page number`, `heading level` |\n| Web display | `html` | Styled output with semantic elements |\n\n---\n\n## Agent Skill\n\nEdgeParse ships as a **Claude agent skill** — a structured description that teaches Claude (and any compatible AI agent) how to extract PDF content on behalf of users.\n\n```bash\n# Add the EdgeParse skill to your agent environment\nnpx skills add raphaelmansuy/edgeparse --skill edgeparse\n\n# Install the Python package\npip install edgeparse\n```\n\nThe `npx skills add` command registers the skill in `skills-lock.json`:\n\n```json\n{\n  \"version\": 1,\n  \"skills\": {\n    \"edgeparse\": {\n      \"source\": \"raphaelmansuy/edgeparse\",\n      \"sourceType\": \"github\"\n    }\n  }\n}\n```\n\nOnce installed, the agent reads `skills/edgeparse/SKILL.md` and knows when to call `edgeparse.convert()`, which format to use for different tasks, and how to handle edge cases like encrypted PDFs, borderless tables, and multi-column layouts.\n\nSee [docs/08-agent-skill.md](docs/08-agent-skill.md) for the full integration guide (LangChain, LlamaIndex, MCP, CrewAI patterns).\n\n---\n\n## Architecture\n\n### Crate structure\n\n```\nedgeparse/\n├── crates/\n│   ├── pdf-cos/            # Low-level PDF object model (fork of lopdf 0.39)\n│   ├── edgeparse-core/     # Core extraction engine (~90 source files)\n│   │   └── src/\n│   │       ├── api/        # ProcessingConfig, FilterConfig, BatchResult\n│   │       ├── pdf/        # Loader, content stream parser, font/CMap decoding\n│   │       ├── models/     # ContentElement, BoundingBox, TextChunk, PdfDocument\n│   │       ├── pipeline/   # 20-stage orchestrator + Rayon parallel helpers\n│   │       ├── output/     # Renderers: JSON, Markdown, HTML, text, CSV\n│   │       ├── tagged/     # Tagged-PDF structure tree → McidMap\n│   │       └── utils/      # XY-Cut++ algorithm, sanitizer, layout analysis\n│   ├── edgeparse-cli/      # CLI binary (clap derive, 25+ flags)\n│   ├── edgeparse-python/   # PyO3 native extension\n│   └── edgeparse-node/     # NAPI-RS native addon\n└── sdks/\n    ├── python/             # Python packaging (maturin, pyproject.toml)\n    └── node/               # npm packaging (TypeScript wrapper, tsup)\n```\n\n### Processing pipeline (20 stages)\n\n```\nPDF file\n    │\n    ▼\npdf-cos                   ← xref parsing, object graph, encrypted streams\n    │\n    ▼\nedgeparse-core::pdf       ← page tree, content stream operators, font decoding,\n    │                       CMap/ToUnicode, image extraction, tagged PDF\n    ▼\nedgeparse-core::pipeline  ← 20 sequential/parallel stages:\n    │  [page range] → [watermark] → [filter] → [table borders] →\n    │  [cell assignment] → [boxed headings] → [column detection] →\n    │  [TextLine assembly] → [TextBlock grouping] → [table clustering] →\n    │  [header/footer] → [list detection] → [paragraph] → [figure] →\n    │  [heading classification] → [XY-Cut++ reading order] →\n    │  [list pass 2] → [caption/footnote/TOC] → [cross-page tables] →\n    │  [element nesting] → [final reading order] → [sanitize]\n    ▼\nedgeparse-core::output    ← render to JSON / Markdown / HTML / text\n```\n\nStages marked `par_map_pages` run in parallel via Rayon; cross-page stages run sequentially.\n\n---\n\n## FAQ\n\n### What is the best PDF parser for RAG?\n\nFor RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. EdgeParse outputs structured JSON with bounding boxes for every element, handles multi-column layouts with XY-Cut++, and runs locally on CPU without a GPU or JVM. On the current 200-document benchmark it leads the comparison set in both overall score (`0.787`) and latency (`0.064 s/doc`). [See RAG integration examples](#rag--llm-integration).\n\n### How do I cite PDF sources in RAG answers?\n\nEvery element in JSON output includes a `bounding box` (`[left, bottom, right, top]` in PDF points, 72 pt = 1 inch) and `page number`. Map the source chunk back to its bounding box to highlight the exact location in the original PDF — enabling \"click-to-source\" UX. No other non-OCR open-source parser provides bounding boxes for every element by default.\n\n### How do I extract tables from PDF?\n\nEdgeParse detects tables using border (ruling-line) analysis by default. For complex or borderless tables, add `--table-method cluster` (CLI) or `table_method=\"cluster\"` (Python). This uses a text-clustering algorithm to detect table structure without visible borders. On the current benchmark it reaches `0.596` TEDS and `0.901` table-detection F1, both best in the published comparison set.\n\n### Does it work without sending data to the cloud?\n\nYes. EdgeParse runs 100% locally. No API calls, no data transmission — your documents never leave your environment. The WebAssembly SDK also runs entirely in the browser: the PDF is processed client-side and never uploaded to any server. Ideal for legal, healthcare, and financial documents.\n\n### Does it handle multi-column layouts?\n\nYes. XY-Cut++ reading order analysis correctly sequences text across multi-column pages, sidebars, and mixed layouts. This works without any configuration change; it is the default reading order algorithm.\n\n### Does it need a GPU or Java?\n\nNo. EdgeParse is a pure Rust implementation. It requires no JVM, no GPU, no OCR models, and no Python runtime for the CLI binary. The CLI binary is ~15 MB stripped. On Apple M4 Max, it processes the 200-document benchmark corpus in about `12.7` seconds total.\n\n### How does it compare to Docling, MinerU, and Marker?\n\n| vs. | EdgeParse advantage | Tradeoff |\n|-----|---------------------|----------|\n| OpenDataLoader | Faster (`0.064` vs `0.094 s/doc`) with stronger table structure and heading recovery | OpenDataLoader remains close on text quality and paragraph boundaries |\n| IBM Docling | Faster (`0.064` vs `0.768 s/doc`) with better TEDS and overall score in the current benchmark snapshot | Docling remains a viable OCR-heavy fallback for scanned documents |\n| Marker | Faster and materially better on every published metric in this benchmark family | Marker supports scanned PDFs via Surya OCR |\n| PyMuPDF4LLM | Faster (`0.064` vs `0.439 s/doc`) with stronger tables, headings, and reading order | PyMuPDF4LLM is simpler if you only need lightweight text extraction |\n\n### Does it support scanned PDFs?\n\nNot directly — EdgeParse is built for born-digital PDFs with embedded fonts. It does not include an OCR engine. For scanned documents, use EdgeParse's hybrid backend option (`--hybrid docling-fast`) which routes complex pages to a Docling-Fast backend running locally.\n\n### What does \"deterministic\" mean?\n\nSame input PDF → same output, every time. No stochastic models, no floating-point non-determinism from ML inference. This makes EdgeParse safe to use in CI pipelines, regression tests, and compliance workflows where reproducibility is required.\n\n### How do I chunk PDFs for semantic search?\n\nUse `format=\"markdown\"`. EdgeParse preserves heading hierarchy and table structure in Markdown output — headings make natural chunk boundaries for `RecursiveCharacterTextSplitter` (LangChain) or heading-based splitters. For element-level control, use `format=\"json\"` and split on `heading level` boundaries or `page number` changes.\n\n### Does the Python SDK run on Windows?\n\nYes. Pre-built wheels are available for Windows (x64). The Python package installs from PyPI with no compilation needed:\n```bash\npip install edgeparse\n```\n\n---\n\n## Tutorials\n\nStep-by-step guides with working examples live in [`tutorials/`](tutorials/):\n\n| Tutorial | Description |\n|----------|-------------|\n| [tutorials/01-cli.md](tutorials/01-cli.md) | All CLI flags with working examples and output samples |\n| [tutorials/02-python-sdk.md](tutorials/02-python-sdk.md) | `pip install edgeparse` — full API, batch processing, JSON parsing |\n| [tutorials/03-nodejs-sdk.md](tutorials/03-nodejs-sdk.md) | `npm install edgeparse` — TypeScript, CJS, and worker threads |\n| [tutorials/04-rust-library.md](tutorials/04-rust-library.md) | `edgeparse-core` in your Rust project — config, models, Rayon |\n| [tutorials/05-output-formats.md](tutorials/05-output-formats.md) | JSON schema, bounding boxes, Markdown variants, HTML, plain text |\n\n---\n\n## Documentation\n\nTechnical documentation lives in [`docs/`](docs/):\n\n| Document | Description |\n|----------|-------------|\n| [docs/00-overview.md](docs/00-overview.md) | Project overview, goals, and design philosophy |\n| [docs/01-architecture.md](docs/01-architecture.md) | Crate structure, module map, data-flow diagram |\n| [docs/02-pipeline.md](docs/02-pipeline.md) | All 20 pipeline stages with ASCII diagrams |\n| [docs/03-data-model.md](docs/03-data-model.md) | Type hierarchy: `ContentElement`, `BoundingBox`, `PdfDocument` |\n| [docs/04-pdf-extraction.md](docs/04-pdf-extraction.md) | PDF loader, chunk parser, font/CMap decoding |\n| [docs/05-output-formats.md](docs/05-output-formats.md) | JSON schema, Markdown renderer, HTML/text/CSV output |\n| [docs/06-sdk-integration.md](docs/06-sdk-integration.md) | CLI flag reference, Python SDK API, Node.js SDK API, Batch API |\n| [docs/07-cicd-publishing.md](docs/07-cicd-publishing.md) | CI/CD publishing pipeline — how it works and how to configure it |\n| [docs/08-agent-skill.md](docs/08-agent-skill.md) | EdgeParse agent skill — `npx skills add`, SKILL.md structure, SDK patterns |\n| [docs/09-wasm-sdk.md](docs/09-wasm-sdk.md) | WebAssembly SDK — objectives, API, use cases, build instructions |\n\n---\n\n## Project Layout\n\n```\nedgeparse/\n├── LICENSE\n├── CONTRIBUTING.md\n├── README.md\n├── Cargo.toml               # Rust workspace (5 members)\n├── Cargo.lock\n│\n├── crates/\n│   ├── pdf-cos/             # lopdf 0.39 fork — low-level PDF object model\n│   ├── edgeparse-core/      # Core extraction engine (~90 source files)\n│   ├── edgeparse-cli/       # CLI binary (clap, 25+ flags)\n│   ├── edgeparse-wasm/      # WebAssembly build for browsers (wasm-bindgen)\n│   ├── edgeparse-python/    # PyO3 native Python extension\n│   └── edgeparse-node/      # NAPI-RS native Node.js addon\n│\n├── sdks/\n│   ├── python/              # Python wheel packaging (maturin + pyproject.toml)\n│   │   └── edgeparse/       # Pure-Python wrapper + CLI entry point\n│   └── node/                # npm packaging (TypeScript + tsup + vitest)\n│       └── src/             # index.ts, types.ts, cli.ts\n│\n├── benchmark/               # Evaluation suite\n│   ├── run.py               # Benchmark runner (EdgeParse)\n│   ├── compare_all.py       # Multi-engine comparison (9 engines)\n│   ├── pyproject.toml\n│   ├── thresholds.json      # Regression thresholds\n│   ├── pdfs/                # Benchmark PDFs (200 docs)\n│   ├── ground-truth/        # Reference Markdown and JSON annotations\n│   ├── prediction/          # Per-engine output directories\n│   ├── reports/             # HTML benchmark reports\n│   └── src/                 # Python evaluators and engine adapters\n│\n├── docs/                    # Technical documentation (Markdown)\n│\n├── demo/                    # Interactive WASM demo (Vite + TypeScript)\n│   └── src/                 # Demo application source\n│\n├── examples/\n│   └── pdf/                 # Sample PDFs for quick testing\n│       ├── lorem.pdf\n│       ├── 1901.03003.pdf   # Academic paper (multi-column)\n│       ├── 2408.02509v1.pdf # Academic paper\n│       └── chinese_scan.pdf # CJK + scan example\n│\n├── benches/                 # Rust micro-benchmarks (criterion)\n├── docker/                  # Dockerfile and Dockerfile.dev\n├── scripts/                 # bench.sh, publish-crates.sh\n└── tests/\n    └── fixtures/            # Rust integration test fixtures\n```\n\n---\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md). In short:\n\n1. Fork, branch from `main`\n2. `cargo fmt \u0026\u0026 cargo clippy -- -D warnings`\n3. Run the benchmark to check for regressions: `cd benchmark \u0026\u0026 uv run python run.py --engine edgeparse`\n4. Open a PR\n\n---\n## Star History\n\n\u003ca href=\"https://www.star-history.com/?repos=raphaelmansuy%2Fedgeparse\u0026type=date\u0026legend=top-left\"\u003e\n \u003cpicture\u003e\n   \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/image?repos=raphaelmansuy/edgeparse\u0026type=date\u0026theme=dark\u0026legend=top-left\" /\u003e\n   \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/image?repos=raphaelmansuy/edgeparse\u0026type=date\u0026legend=top-left\" /\u003e\n   \u003cimg alt=\"Star History Chart\" src=\"https://api.star-history.com/image?repos=raphaelmansuy/edgeparse\u0026type=date\u0026legend=top-left\" /\u003e\n \u003c/picture\u003e\n\u003c/a\u003e\n\n## License\n\nEdgeParse is licensed under the **Apache License 2.0**. See [LICENSE](LICENSE) for the full text.\n\nThe `crates/pdf-cos/` directory is a fork of [lopdf](https://github.com/J-F-Liu/lopdf) (MIT/Apache-2.0 dual-licensed).  \nBenchmark PDF documents (`benchmark/pdfs/`) are sourced from publicly available documents and are used solely for evaluation purposes.\n","funding_links":[],"categories":["\u003ca name=\"Rust\"\u003e\u003c/a\u003eRust"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelmansuy%2Fedgeparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fraphaelmansuy%2Fedgeparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fraphaelmansuy%2Fedgeparse/lists"}