{"id":35777781,"url":"https://github.com/yfedoseev/pdf_oxide","last_synced_at":"2026-04-10T07:18:48.606Z","repository":{"id":322714579,"uuid":"1090617520","full_name":"yfedoseev/pdf_oxide","owner":"yfedoseev","description":"The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation \u0026 editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.","archived":false,"fork":false,"pushed_at":"2026-03-03T03:07:41.000Z","size":15973,"stargazers_count":154,"open_issues_count":15,"forks_count":18,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-03-03T07:58:37.545Z","etag":null,"topics":["data-extraction","document-processing","fast","image-extraction","llm","markdown","pdf","pdf-editor","pdf-generation","pdf-library","pdf-parser","pdf-to-markdown","pdf-to-text","pyo3","python","rag","rust","text-extraction"],"latest_commit_sha":null,"homepage":"https://oxide.fyi","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/yfedoseev.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE-APACHE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null},"funding":{"github":["yfedoseev"]}},"created_at":"2025-11-05T22:56:26.000Z","updated_at":"2026-03-03T07:49:06.000Z","dependencies_parsed_at":null,"dependency_job_id":"5256f3ec-ab92-4f3a-99c2-d35b1a79f74b","html_url":"https://github.com/yfedoseev/pdf_oxide","commit_stats":null,"previous_names":["yfedoseev/pdf_oxide"],"tags_count":25,"template":false,"template_full_name":null,"purl":"pkg:github/yfedoseev/pdf_oxide","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yfedoseev%2Fpdf_oxide","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yfedoseev%2Fpdf_oxide/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yfedoseev%2Fpdf_oxide/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yfedoseev%2Fpdf_oxide/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/yfedoseev","download_url":"https://codeload.github.com/yfedoseev/pdf_oxide/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/yfedoseev%2Fpdf_oxide/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30086135,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-04T15:40:14.053Z","status":"ssl_error","status_checked_at":"2026-03-04T15:40:13.655Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-extraction","document-processing","fast","image-extraction","llm","markdown","pdf","pdf-editor","pdf-generation","pdf-library","pdf-parser","pdf-to-markdown","pdf-to-text","pyo3","python","rag","rust","text-extraction"],"created_at":"2026-01-07T05:27:24.779Z","updated_at":"2026-04-10T07:18:48.598Z","avatar_url":"https://github.com/yfedoseev.png","language":"Rust","readme":"# PDF Oxide - The Fastest PDF Toolkit for Python, Rust, WASM, CLI \u0026 AI\n\nThe fastest PDF library for text extraction, image extraction, and markdown conversion. Rust core with Python bindings, WASM support, CLI tool, and MCP server for AI assistants. 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf. 100% pass rate on 3,830 real-world PDFs. MIT licensed.\n\n[![Crates.io](https://img.shields.io/crates/v/pdf_oxide.svg)](https://crates.io/crates/pdf_oxide)\n[![PyPI](https://img.shields.io/pypi/v/pdf_oxide.svg)](https://pypi.org/project/pdf_oxide/)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/pdf-oxide)](https://pypi.org/project/pdf-oxide/)\n[![npm](https://img.shields.io/npm/v/pdf-oxide-wasm)](https://www.npmjs.com/package/pdf-oxide-wasm)\n[![Documentation](https://docs.rs/pdf_oxide/badge.svg)](https://docs.rs/pdf_oxide)\n[![Build Status](https://github.com/yfedoseev/pdf_oxide/workflows/CI/badge.svg)](https://github.com/yfedoseev/pdf_oxide/actions)\n[![License: MIT OR Apache-2.0](https://img.shields.io/badge/License-MIT%20OR%20Apache--2.0-blue.svg)](https://opensource.org/licenses)\n\n## Quick Start\n\n### Python\n```python\nfrom pdf_oxide import PdfDocument\n\n# path can be str or pathlib.Path; use with for scoped access\ndoc = PdfDocument(\"paper.pdf\")\n# or: with PdfDocument(\"paper.pdf\") as doc: ...\ntext = doc.extract_text(0)\nchars = doc.extract_chars(0)\nmarkdown = doc.to_markdown(0, detect_headings=True)\n```\n\n```bash\npip install pdf_oxide\n```\n\n### Rust\n```rust\nuse pdf_oxide::PdfDocument;\n\nlet mut doc = PdfDocument::open(\"paper.pdf\")?;\nlet text = doc.extract_text(0)?;\nlet images = doc.extract_images(0)?;\nlet markdown = doc.to_markdown(0, Default::default())?;\n```\n\n```toml\n[dependencies]\npdf_oxide = \"0.3\"\n```\n\n### CLI\n```bash\npdf-oxide text document.pdf\npdf-oxide markdown document.pdf -o output.md\npdf-oxide search document.pdf \"pattern\"\npdf-oxide merge a.pdf b.pdf -o combined.pdf\n```\n\n```bash\nbrew install yfedoseev/tap/pdf-oxide\n```\n\n### MCP Server (for AI assistants)\n```bash\n# Install\nbrew install yfedoseev/tap/pdf-oxide   # includes pdf-oxide-mcp\n\n# Configure in Claude Desktop / Claude Code / Cursor\n{\n  \"mcpServers\": {\n    \"pdf-oxide\": { \"command\": \"crgx\", \"args\": [\"pdf_oxide_mcp@latest\"] }\n  }\n}\n```\n\n## Why pdf_oxide?\n\n- **Fast** — 0.8ms mean per document, 5× faster than PyMuPDF, 15× faster than pypdf, 29× faster than pdfplumber\n- **Reliable** — 100% pass rate on 3,830 test PDFs, zero panics, zero timeouts\n- **Complete** — Text extraction, image extraction, PDF creation, and editing in one library\n- **Multi-platform** — Rust, Python, JavaScript/WASM, CLI, and MCP server for AI assistants\n- **Permissive license** — MIT / Apache-2.0 — use freely in commercial and open-source projects\n\n## Performance\n\nBenchmarked on 3,830 PDFs from three independent public test suites (veraPDF, Mozilla pdf.js, DARPA SafeDocs). Text extraction libraries only (no OCR). Single-thread, 60s timeout, no warm-up.\n\n### Python Libraries\n\n| Library | Mean | p99 | Pass Rate | License |\n|---------|------|-----|-----------|---------|\n| **PDF Oxide** | **0.8ms** | **9ms** | **100%** | **MIT** |\n| PyMuPDF | 4.6ms | 28ms | 99.3% | AGPL-3.0 |\n| pypdfium2 | 4.1ms | 42ms | 99.2% | Apache-2.0 |\n| pymupdf4llm | 55.5ms | 280ms | 99.1% | AGPL-3.0 |\n| pdftext | 7.3ms | 82ms | 99.0% | GPL-3.0 |\n| pdfminer | 16.8ms | 124ms | 98.8% | MIT |\n| pdfplumber | 23.2ms | 189ms | 98.8% | MIT |\n| markitdown | 108.8ms | 378ms | 98.6% | MIT |\n| pypdf | 12.1ms | 97ms | 98.4% | BSD-3 |\n\n### Rust Libraries\n\n| Library | Mean | p99 | Pass Rate | Text Extraction |\n|---------|------|-----|-----------|-----------------|\n| **PDF Oxide** | **0.8ms** | **9ms** | **100%** | **Built-in** |\n| oxidize_pdf | 13.5ms | 11ms | 99.1% | Basic |\n| unpdf | 2.8ms | 10ms | 95.1% | Basic |\n| pdf_extract | 4.08ms | 37ms | 91.5% | Basic |\n| lopdf | 0.3ms | 2ms | 80.2% | No built-in extraction |\n\n### Text Quality\n\n99.5% text parity vs PyMuPDF and pypdfium2 across the full corpus. PDF Oxide extracts text from 7–10× more \"hard\" files than it misses vs any competitor.\n\n### Corpus\n\n| Suite | PDFs | Pass Rate |\n|-------|-----:|----------:|\n| [veraPDF](https://github.com/veraPDF/veraPDF-corpus) (PDF/A compliance) | 2,907 | 100% |\n| [Mozilla pdf.js](https://github.com/mozilla/pdf.js/tree/master/test/pdfs) | 897 | 99.2% |\n| [SafeDocs](https://github.com/pdf-association/safedocs) (targeted edge cases) | 26 | 100% |\n| **Total** | **3,830** | **100%** |\n\n100% pass rate on all valid PDFs — the 7 non-passing files across the corpus are intentionally broken test fixtures (missing PDF header, fuzz-corrupted catalogs, invalid xref streams).\n\n## Features\n\n| Extract | Create | Edit |\n|---------|--------|------|\n| Text \u0026 Layout | Documents | Annotations |\n| Images | Tables | Form Fields |\n| Forms | Graphics | Bookmarks |\n| Annotations | Templates | Links |\n| Bookmarks | Images | Content |\n\n## Python API\n\n```python\nfrom pdf_oxide import PdfDocument\n\n# Path can be str or pathlib.Path; use \"with PdfDocument(...) as doc\" for context manager\ndoc = PdfDocument(\"report.pdf\")\nprint(f\"Pages: {doc.page_count()}\")\nprint(f\"Version: {doc.version()}\")\n\n# 1. Scoped extraction (v0.3.14)\n# Extract only from a specific area: (x, y, width, height)\nheader = doc.within(0, (0, 700, 612, 92)).extract_text()\n\n# 2. Word-level extraction (v0.3.14)\nwords = doc.extract_words(0)\nfor w in words:\n    print(f\"{w.text} at {w.bbox}\")\n    # Access individual characters in the word\n    # print(w.chars[0].font_name)\n\n# Optional: override the adaptive word gap threshold (in PDF points)\nwords = doc.extract_words(0, word_gap_threshold=2.5)\n\n# 3. Line-level extraction (v0.3.14)\nlines = doc.extract_text_lines(0)\nfor line in lines:\n    print(f\"Line: {line.text}\")\n\n# Optional: override word and/or line gap thresholds (in PDF points)\nlines = doc.extract_text_lines(0, word_gap_threshold=2.5, line_gap_threshold=4.0)\n\n# Inspect the adaptive thresholds before overriding\nparams = doc.page_layout_params(0)\nprint(f\"word gap: {params.word_gap_threshold:.1f}, line gap: {params.line_gap_threshold:.1f}\")\n\n# Use a pre-tuned extraction profile for specific document types\nfrom pdf_oxide import ExtractionProfile\nwords = doc.extract_words(0, profile=ExtractionProfile.form())\nlines = doc.extract_text_lines(0, profile=ExtractionProfile.academic())\n\n# 4. Table extraction (v0.3.14)\ntables = doc.extract_tables(0)\nfor table in tables:\n    print(f\"Table with {table.row_count} rows\")\n\n# 5. Traditional extraction\ntext = doc.extract_text(0)\nchars = doc.extract_chars(0)\n```\n\n### Form Fields\n\n```python\n# Extract form fields\nfields = doc.get_form_fields()\nfor f in fields:\n    print(f\"{f.name} ({f.field_type}) = {f.value}\")\n\n# Fill and save\ndoc.set_form_field_value(\"employee_name\", \"Jane Doe\")\ndoc.set_form_field_value(\"wages\", \"85000.00\")\ndoc.save(\"filled.pdf\")\n```\n\n## Rust API\n\n```rust\nuse pdf_oxide::PdfDocument;\n\nfn main() -\u003e Result\u003c(), Box\u003cdyn std::error::Error\u003e\u003e {\n    let mut doc = PdfDocument::open(\"paper.pdf\")?;\n\n    // Extract text\n    let text = doc.extract_text(0)?;\n\n    // Character-level extraction\n    let chars = doc.extract_chars(0)?;\n\n    // Extract images\n    let images = doc.extract_images(0)?;\n\n    // Vector graphics\n    let paths = doc.extract_paths(0)?;\n\n    Ok(())\n}\n```\n\n### Form Fields (Rust)\n\n```rust\nuse pdf_oxide::editor::{DocumentEditor, EditableDocument, SaveOptions};\nuse pdf_oxide::editor::form_fields::FormFieldValue;\n\nlet mut editor = DocumentEditor::open(\"w2.pdf\")?;\neditor.set_form_field_value(\"employee_name\", FormFieldValue::Text(\"Jane Doe\".into()))?;\neditor.save_with_options(\"filled.pdf\", SaveOptions::incremental())?;\n```\n\n## Installation\n\n### Python\n\n```bash\npip install pdf_oxide\n```\n\nWheels available for Linux, macOS, and Windows. Python 3.8–3.14.\n\n### Rust\n\n```toml\n[dependencies]\npdf_oxide = \"0.3\"\n```\n\n### JavaScript/WASM\n\n```bash\nnpm install pdf-oxide-wasm\n```\n\n```javascript\nconst { WasmPdfDocument } = require(\"pdf-oxide-wasm\");\n```\n\n### CLI\n\n```bash\nbrew install yfedoseev/tap/pdf-oxide    # Homebrew (macOS/Linux)\ncargo install pdf_oxide_cli             # Cargo\ncargo binstall pdf_oxide_cli            # Pre-built binary via cargo-binstall\n```\n\n### MCP Server\n\n```bash\nbrew install yfedoseev/tap/pdf-oxide    # Included with CLI in Homebrew\ncargo install pdf_oxide_mcp             # Cargo\n```\n\n## CLI\n\n22 commands for PDF processing directly from your terminal:\n\n```bash\npdf-oxide text report.pdf                      # Extract text\npdf-oxide markdown report.pdf -o report.md     # Convert to Markdown\npdf-oxide html report.pdf -o report.html       # Convert to HTML\npdf-oxide info report.pdf                      # Show metadata\npdf-oxide search report.pdf \"neural.?network\"  # Search (regex)\npdf-oxide images report.pdf -o ./images/       # Extract images\npdf-oxide merge a.pdf b.pdf -o combined.pdf    # Merge PDFs\npdf-oxide split report.pdf -o ./pages/         # Split into pages\npdf-oxide watermark doc.pdf \"DRAFT\"            # Add watermark\npdf-oxide forms w2.pdf --fill \"name=Jane\"      # Fill form fields\n```\n\nRun `pdf-oxide` with no arguments for interactive REPL mode. Use `--pages 1-5` to process specific pages, `--json` for machine-readable output.\n\n## MCP Server\n\n`pdf-oxide-mcp` lets AI assistants (Claude, Cursor, etc.) extract content from PDFs locally via the [Model Context Protocol](https://modelcontextprotocol.io/).\n\nAdd to your MCP client configuration:\n\n```json\n{\n  \"mcpServers\": {\n    \"pdf-oxide\": { \"command\": \"crgx\", \"args\": [\"pdf_oxide_mcp@latest\"] }\n  }\n}\n```\n\nThe server exposes an `extract` tool that supports text, markdown, and HTML output formats with optional page ranges and image extraction. All processing runs locally — no files leave your machine.\n\n## Building from Source\n\n```bash\n# Clone and build\ngit clone https://github.com/yfedoseev/pdf_oxide\ncd pdf_oxide\ncargo build --release\n\n# Run tests\ncargo test\n\n# Build Python bindings\nmaturin develop\n```\n\n## Documentation\n\n- **[Full Documentation](https://pdf.oxide.fyi)** - Complete documentation site\n- **[Getting Started (Rust)](https://pdf.oxide.fyi/docs/getting-started/rust)** - Rust guide\n- **[Getting Started (Python)](https://pdf.oxide.fyi/docs/getting-started/python)** - Python guide\n- **[Getting Started (WASM)](https://pdf.oxide.fyi/docs/getting-started/javascript)** - Browser and Node.js guide\n- **[Getting Started (CLI)](https://pdf.oxide.fyi/docs/getting-started/cli)** - CLI guide\n- **[Getting Started (MCP)](https://pdf.oxide.fyi/docs/getting-started/mcp)** - MCP server for AI assistants\n- **[API Docs](https://docs.rs/pdf_oxide)** - Full Rust API reference\n- **[Performance Benchmarks](https://pdf.oxide.fyi/docs/performance)** - Full benchmark methodology and results\n\n## Use Cases\n\n- **RAG / LLM pipelines** — Convert PDFs to clean Markdown for retrieval-augmented generation with LangChain, LlamaIndex, or any framework\n- **AI assistants** — Give Claude, Cursor, or any MCP-compatible tool direct PDF access via the MCP server\n- **Document processing at scale** — Extract text, images, and metadata from thousands of PDFs in seconds\n- **Data extraction** — Pull structured data from forms, tables, and layouts\n- **Academic research** — Parse papers, extract citations, and process large corpora\n- **PDF generation** — Create invoices, reports, certificates, and templated documents programmatically\n- **PyMuPDF alternative** — MIT licensed, 5× faster, no AGPL restrictions\n\n## License\n\nDual-licensed under [MIT](LICENSE-MIT) or [Apache-2.0](LICENSE-APACHE) at your option. Unlike AGPL-licensed alternatives, pdf_oxide can be used freely in any project — commercial or open-source — with no copyleft restrictions.\n\n## Contributing\n\nWe welcome contributions! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n```bash\ncargo build \u0026\u0026 cargo test \u0026\u0026 cargo fmt \u0026\u0026 cargo clippy -- -D warnings\n```\n\n## Citation\n\n```bibtex\n@software{pdf_oxide,\n  title = {PDF Oxide: Fast PDF Toolkit for Rust and Python},\n  author = {Yury Fedoseev},\n  year = {2025},\n  url = {https://github.com/yfedoseev/pdf_oxide}\n}\n```\n\n---\n\n**Rust** + **Python** + **WASM** + **CLI** + **MCP** | MIT/Apache-2.0 | 100% pass rate on 3,830 PDFs | 0.8ms mean | 5× faster than the industry leaders\n","funding_links":["https://github.com/sponsors/yfedoseev"],"categories":["Specific Formats Processing","File Format Processing","Libraries","\u003ca name=\"Rust\"\u003e\u003c/a\u003eRust","Rust"],"sub_categories":["Graphics"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyfedoseev%2Fpdf_oxide","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fyfedoseev%2Fpdf_oxide","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fyfedoseev%2Fpdf_oxide/lists"}