{"id":47473909,"url":"https://github.com/run-llama/liteparse","last_synced_at":"2026-06-12T01:01:42.953Z","repository":{"id":345447155,"uuid":"1153982569","full_name":"run-llama/liteparse","owner":"run-llama","description":"A fast, helpful, and open-source document parser","archived":false,"fork":false,"pushed_at":"2026-05-17T05:40:22.000Z","size":9670,"stargazers_count":5141,"open_issues_count":37,"forks_count":341,"subscribers_count":13,"default_branch":"main","last_synced_at":"2026-05-17T07:38:35.970Z","etag":null,"topics":["document-ocr","document-processing","ocr","ocr-recognition","pdf","pdf-parser","text-extraction"],"latest_commit_sha":null,"homepage":"https://developers.llamaindex.ai/liteparse/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/run-llama.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-09T22:16:30.000Z","updated_at":"2026-05-17T07:35:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/run-llama/liteparse","commit_stats":null,"previous_names":["run-llama/liteparse"],"tags_count":18,"template":false,"template_full_name":null,"purl":"pkg:github/run-llama/liteparse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/run-llama","download_url":"https://codeload.github.com/run-llama/liteparse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33368584,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-21T12:23:38.849Z","status":"online","status_checked_at":"2026-05-22T02:00:06.671Z","response_time":265,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-ocr","document-processing","ocr","ocr-recognition","pdf","pdf-parser","text-extraction"],"created_at":"2026-03-25T11:00:24.927Z","updated_at":"2026-06-12T01:01:42.864Z","avatar_url":"https://github.com/run-llama.png","language":"TypeScript","funding_links":[],"categories":["Rust","Libraries","Parsers, OCR and extraction","Tools","Developer Tools","\u003ca name=\"TypeScript\"\u003e\u003c/a\u003eTypeScript"],"sub_categories":["Parsing","Document Processing"],"readme":"# LiteParse\n\n[![CI](https://github.com/run-llama/liteparse/actions/workflows/ci.yml/badge.svg)](https://github.com/run-llama/liteparse/actions/workflows/ci.yml)\n|\n[![Crates.io version](https://img.shields.io/crates/v/liteparse.svg)](https://crates.io/crates/liteparse)\n|\n[![npm version](https://img.shields.io/npm/v/@llamaindex/liteparse.svg)](https://www.npmjs.com/package/@llamaindex/liteparse)\n|\n[![wasm version](https://img.shields.io/npm/v/@llamaindex/liteparse-wasm.svg)](https://www.npmjs.com/package/@llamaindex/liteparse-wasm)\n|\n[![PyPI version](https://img.shields.io/pypi/v/liteparse.svg)](https://pypi.org/project/liteparse/)\n|\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n|\n[Docs](https://developers.llamaindex.ai/liteparse/)\n\nEnglish | [简体中文](README.zh-CN.md)\n\n\u003cimg src=\"https://github.com/user-attachments/assets/07ba6a82-6bb1-4dea-b0ef-cad7df7d1622\" alt=\"out\" width=\"600\"\u003e\n\n\u003e Looking for LiteParse V1? Follow this link to [the old code](https://github.com/run-llama/liteparse/tree/logan/liteparse-v1)\n\nLiteParse is a standalone OSS PDF parsing tool focused exclusively on **fast and light** parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.\n\n**Hitting the limits of local parsing?**\nFor complex documents (dense tables, multi-column layouts, charts, handwritten text, or\nscanned PDFs), you'll get significantly better results with [LlamaParse](https://developers.llamaindex.ai/python/cloud/llamaparse/?utm_source=github\u0026utm_medium=liteparse),\nour cloud-based document parser built for production document pipelines. LlamaParse handles the\nhard stuff so your models see clean, structured data and markdown.\n\n\u003e  [Sign up for LlamaParse free](https://cloud.llamaindex.ai?utm_source=github\u0026utm_medium=liteparse)\n\n## Overview\n\n- **Fast Text Parsing**: Spatial text parsing using PDFium\n- **Flexible OCR System**:\n  - **Built-in**: Tesseract (zero setup, bundled with the library)\n  - **HTTP Servers**: Plug in any OCR server (EasyOCR, PaddleOCR, custom)\n  - **Standard API**: Simple, well-defined OCR API specification\n- **Screenshot Generation**: Generate high-quality page screenshots for LLM agents\n- **Multiple Output Formats**: JSON and Text\n- **Bounding Boxes**: Precise text positioning information\n- **Multi-language**: Use from Rust, Node.js/TypeScript, Python, or the browser (WASM)\n- **Multi-platform**: Linux, macOS (Intel/ARM), Windows\n\n```mermaid\nflowchart LR\n      subgraph Input[\"Input Formats\"]\n          direction TB\n          PDF[\"PDF\"]\n          DOCX[\"DOCX\"]\n          XLSX[\"XLSX\"]\n          PPTX[\"PPTX\"]\n          IMG[\"Images\"]\n      end\n\n      subgraph Core[\"Rust Core\"]\n          direction TB\n          CONV[\"Format Conversion\\nLibreOffice / ImageMagick\"]\n          EXTRACT[\"Text Extraction\\nPDFium C library\"]\n          OCR[\"Selective OCR\\nTesseract / HTTP / Custom\"]\n          MERGE[\"OCR Merge\\nNative text + OCR results\"]\n          PROJ[\"Grid Projection\\nSpatial layout reconstruction\"]\n          CONV --\u003e EXTRACT\n          EXTRACT --\u003e OCR --\u003e MERGE --\u003e PROJ\n          EXTRACT --\u003e MERGE\n      end\n\n      subgraph Output[\" Output \"]\n          direction TB\n          JSON[\"Structured JSON\\ntext + bounding boxes\"]\n          TEXT[\"Plain Text\\nlayout-preserved\"]\n          SCREEN[\"Screenshots\\nPNG rendering\"]\n      end\n\n      subgraph Bindings[\"Language Bindings\"]\n          direction TB\n          NAPI[\"Node.js / TypeScript\\nnapi-rs\"]\n          PYO3[\"Python\\nPyO3\"]\n          WASM[\"Browser / WASM\\nwasm-bindgen\"]\n          CLI[\"CLI\\ncargo / npm / pip\"]\n          NAPI ~~~ PYO3 ~~~ WASM ~~~ CLI\n      end\n\n      PDF --\u003e EXTRACT\n      DOCX \u0026 XLSX \u0026 PPTX \u0026 IMG --\u003e CONV\n      PROJ --\u003e JSON \u0026 TEXT \u0026 SCREEN\n      JSON \u0026 TEXT \u0026 SCREEN --\u003e Bindings\n\n      style Input fill:#F5F5F5,color:#000000,stroke:#37D7FA,stroke-width:2px\n      style Core fill:#F5F5F5,color:#000000,stroke:#3E18F9,stroke-width:2px\n      style Output fill:#F5F5F5,color:#000000,stroke:#FF8705,stroke-width:2px\n      style Bindings fill:#F5F5F5,color:#000000,stroke:#FF8DF2,stroke-width:2px\n\n      style PDF fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px\n      style DOCX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px\n      style XLSX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px\n      style PPTX fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px\n      style IMG fill:#96E7F9,color:#000000,stroke:#37D7FA,stroke-width:1px\n\n      style CONV fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px\n      style EXTRACT fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px\n      style OCR fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px\n      style MERGE fill:#92AEFF,color:#000000,stroke:#4B72FE,stroke-width:1px\n      style PROJ fill:#4B72FE,color:#FFFFFF,stroke:#3E18F9,stroke-width:2px\n\n      style JSON fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px\n      style TEXT fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px\n      style SCREEN fill:#FFBD74,color:#000000,stroke:#FF8705,stroke-width:1px\n\n      style NAPI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px\n      style PYO3 fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px\n      style WASM fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px\n      style CLI fill:#FFBFF8,color:#000000,stroke:#FF8DF2,stroke-width:1px\n```\n\n## Installation\n\nInstall via your preferred package manager. All versions (except WASM) ship with the same `lit` CLI.\n\n| Language | Install | Library Docs |\n|----------|---------|--------------|\n| **Node.js / TypeScript** | `npm i @llamaindex/liteparse` | [Node.js README](packages/node/README.md) |\n| **Python** | `pip install liteparse` | [Python README](packages/python/README.md) |\n| **Rust** | `cargo install liteparse` (CLI) / `cargo add liteparse` (lib) | [Rust README (crates.io)](crates/liteparse/README.md) |\n| **Browser (WASM)** | `npm i @llamaindex/liteparse-wasm` | [WASM README](packages/wasm/README.md) |\n\n### Agent Skill\n\nYou can use `liteparse` as an agent skill, downloading it with the `skills` CLI tool:\n\n```bash\nnpx skills add run-llama/llamaparse-agent-skills --skill liteparse\n```\n\nOr copy-pasting the [`SKILL.md`](https://github.com/run-llama/llamaparse-agent-skills/blob/main/skills/liteparse/SKILL.md) file to your own skills setup.\n\n## CLI Usage\n\nThe CLI is the same across all installations (`npm`, `pip`, `cargo install`).\n\n### Parse Files\n\n```bash\n# Basic parsing\nlit parse document.pdf\n\n# Parse with specific format\nlit parse document.pdf --format json -o output.json\n\n# Parse specific pages\nlit parse document.pdf --target-pages \"1-5,10,15-20\"\n\n# Parse without OCR\nlit parse document.pdf --no-ocr\n\n# Parse a remote PDF\ncurl -sL https://example.com/report.pdf | lit parse -\n```\n\n### Batch Parsing\n\nParse an entire directory of documents:\n\n```bash\nlit batch-parse ./input-directory ./output-directory\n```\n\n### Generate Screenshots\n\nScreenshots are essential for LLM agents to extract visual information that text alone cannot capture.\n\n```bash\n# Screenshot all pages\nlit screenshot document.pdf -o ./screenshots\n\n# Screenshot specific pages\nlit screenshot document.pdf --target-pages \"1,3,5\" -o ./screenshots\n\n# Custom DPI\nlit screenshot document.pdf --dpi 300 -o ./screenshots\n```\n\n### CLI Reference\n\n#### Parse Command\n\n```\nlit parse [OPTIONS] \u003cfile\u003e\n\nOptions:\n  -o, --output \u003cfile\u003e          Output file path\n      --format \u003cformat\u003e        Output format: json|text [default: text]\n      --no-ocr                 Disable OCR\n      --ocr-language \u003clang\u003e    OCR language, Tesseract format [default: eng]\n      --ocr-server-url \u003curl\u003e   HTTP OCR server URL (uses Tesseract if not provided)\n      --tessdata-path \u003cpath\u003e   Path to tessdata directory\n      --max-pages \u003cn\u003e          Max pages to parse [default: 1000]\n      --target-pages \u003cpages\u003e   Pages to parse (e.g., \"1-5,10,15-20\")\n      --dpi \u003cdpi\u003e              Rendering DPI [default: 150]\n      --preserve-small-text    Keep very small text\n      --password \u003cpassword\u003e    Password for encrypted documents\n      --num-workers \u003cn\u003e        Concurrent OCR workers [default: CPU cores - 1]\n  -q, --quiet                  Suppress progress output\n  -h, --help                   Print help\n```\n\n#### Batch Parse Command\n\n```\nlit batch-parse [OPTIONS] \u003cinput-dir\u003e \u003coutput-dir\u003e\n\nOptions:\n      --format \u003cformat\u003e        Output format: json|text [default: text]\n      --no-ocr                 Disable OCR\n      --ocr-language \u003clang\u003e    OCR language [default: eng]\n      --ocr-server-url \u003curl\u003e   HTTP OCR server URL\n      --tessdata-path \u003cpath\u003e   Path to tessdata directory\n      --max-pages \u003cn\u003e          Max pages per file [default: 1000]\n      --dpi \u003cdpi\u003e              Rendering DPI [default: 150]\n      --recursive              Recursively search input directory\n      --extension \u003cext\u003e        Only process files with this extension (e.g., \".pdf\")\n      --password \u003cpassword\u003e    Password for encrypted documents\n      --num-workers \u003cn\u003e        Concurrent OCR workers\n  -q, --quiet                  Suppress progress output\n  -h, --help                   Print help\n```\n\n#### Screenshot Command\n\n```\nlit screenshot [OPTIONS] \u003cfile\u003e\n\nOptions:\n  -o, --output-dir \u003cdir\u003e       Output directory [default: ./screenshots]\n      --target-pages \u003cpages\u003e   Pages to screenshot (e.g., \"1,3,5\" or \"1-5\")\n      --dpi \u003cdpi\u003e              Rendering DPI [default: 150]\n      --password \u003cpassword\u003e    Password for encrypted documents\n  -q, --quiet                  Suppress progress output\n  -h, --help                   Print help\n```\n\n## OCR Setup\n\n### Default: Tesseract\n\nTesseract is bundled and works out of the box:\n\n```bash\nlit parse document.pdf                    # OCR enabled by default\nlit parse document.pdf --ocr-language fra # Specify language\nlit parse document.pdf --no-ocr           # Disable OCR\n```\n\nFor offline or air-gapped environments, set `TESSDATA_PREFIX` to a directory containing pre-downloaded `.traineddata` files:\n\n```bash\nexport TESSDATA_PREFIX=/path/to/tessdata\nlit parse document.pdf --ocr-language eng\n```\n\nOr pass the path directly:\n\n```bash\nlit parse document.pdf --tessdata-path /path/to/tessdata\n```\n\n### Optional: HTTP OCR Servers\n\nFor higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines:\n\n- [EasyOCR](ocr/easyocr/README.md)\n- [PaddleOCR](ocr/paddleocr/README.md)\n\nYou can integrate any OCR service by implementing the simple LiteParse OCR API specification (see [`OCR_API_SPEC.md`](OCR_API_SPEC.md)).\n\nThe API requires:\n- POST `/ocr` endpoint\n- Accepts `file` and `language` parameters\n- Returns JSON: `{ results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }`\n\n## Multi-Format Input Support\n\nLiteParse supports **automatic conversion** of various document formats to PDF before parsing.\n\n### Supported Input Formats\n\n#### Office Documents (via LibreOffice)\n- **Word**: `.doc`, `.docx`, `.docm`, `.odt`, `.rtf`, `.pages`\n- **PowerPoint**: `.ppt`, `.pptx`, `.pptm`, `.odp`, `.key`\n- **Spreadsheets**: `.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv`, `.numbers`\n\nInstall LibreOffice for automatic conversion:\n\n```bash\n# macOS\nbrew install --cask libreoffice\n\n# Ubuntu/Debian\napt-get install libreoffice\n\n# Windows\nchoco install libreoffice-fresh\n```\n\n\u003e _On Windows, you may need to add LibreOffice's program directory (usually `C:\\Program Files\\LibreOffice\\program`) to your PATH._\n\n#### Images (via ImageMagick)\n- **Formats**: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg`\n\nInstall ImageMagick for image-to-PDF conversion:\n\n```bash\n# macOS\nbrew install imagemagick\n\n# Ubuntu/Debian\napt-get install imagemagick\n\n# Windows\nchoco install imagemagick.app\n```\n\n## Environment Variables\n\n| Variable | Description |\n|----------|-------------|\n| `TESSDATA_PREFIX` | Path to a directory containing Tesseract `.traineddata` files. Used for offline/air-gapped environments. |\n\n## Development\n\nThe project is a Rust workspace with the core library and language-specific binding crates.\n\n```\ncrates/\n├── liteparse/          # Core library + CLI binary\n├── liteparse-napi/     # Node.js bindings (napi-rs)\n├── liteparse-python/   # Python bindings (PyO3)\n├── liteparse-wasm/     # WASM bindings (wasm-bindgen)\n├── pdfium/             # PDFium Rust wrapper\n└── pdfium-sys/         # PDFium FFI bindings\npackages/\n├── node/               # npm package (TS wrapper + native binary)\n├── python/             # PyPI package (Python wrapper + native binary)\n└── wasm/               # WASM npm package\n```\n\n### Building\n\n```bash\n# Build the CLI\ncargo build --release -p liteparse\n\n# Build Node.js bindings\ncd packages/node \u0026\u0026 npm run build\n\n# Build Python bindings\ncd packages/python \u0026\u0026 maturin develop --release\n\n# Build WASM\ncd packages/wasm \u0026\u0026 npm run build\n```\n\nWe provide a fairly rich `AGENTS.md`/`CLAUDE.md` that we recommend using to help with development + coding agents.\n\n## License\n\nApache 2.0\n\n## Credits\n\nBuilt on top of:\n\n- [PDFium](https://pdfium.googlesource.com/pdfium/) - PDF rendering and text extraction\n- [Tesseract](https://github.com/tesseract-ocr/tesseract) - OCR engine (via tesseract-rs)\n- [EasyOCR](https://github.com/JaidedAI/EasyOCR) - HTTP OCR server (optional)\n- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - HTTP OCR server (optional)\n- [napi-rs](https://napi.rs/) - Node.js native bindings\n- [PyO3](https://pyo3.rs/) - Python native bindings\n- [wasm-bindgen](https://github.com/wasm-bindgen/wasm-bindgen) - WebAssembly bindings\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frun-llama%2Fliteparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frun-llama%2Fliteparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frun-llama%2Fliteparse/lists"}