{"id":47473909,"url":"https://github.com/run-llama/liteparse","last_synced_at":"2026-04-06T17:02:21.057Z","repository":{"id":345447155,"uuid":"1153982569","full_name":"run-llama/liteparse","owner":"run-llama","description":"A fast, helpful, and open-source document parser","archived":false,"fork":false,"pushed_at":"2026-03-26T16:39:40.000Z","size":4858,"stargazers_count":2392,"open_issues_count":11,"forks_count":147,"subscribers_count":8,"default_branch":"main","last_synced_at":"2026-03-26T21:08:36.214Z","etag":null,"topics":["document-ocr","document-processing","ocr","ocr-recognition","pdf","pdf-parser","text-extraction"],"latest_commit_sha":null,"homepage":"https://developers.llamaindex.ai/liteparse/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/run-llama.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-02-09T22:16:30.000Z","updated_at":"2026-03-26T21:03:13.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/run-llama/liteparse","commit_stats":null,"previous_names":["run-llama/liteparse"],"tags_count":8,"template":false,"template_full_name":null,"purl":"pkg:github/run-llama/liteparse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/tags","releases_url":"https://repos.ecosyst
e.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/run-llama","download_url":"https://codeload.github.com/run-llama/liteparse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/run-llama%2Fliteparse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31105537,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-28T13:41:34.766Z","status":"ssl_error","status_checked_at":"2026-03-28T13:41:05.465Z","response_time":79,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-ocr","document-processing","ocr","ocr-recognition","pdf","pdf-parser","text-extraction"],"created_at":"2026-03-25T11:00:24.927Z","updated_at":"2026-04-01T18:17:28.414Z","avatar_url":"https://github.com/run-llama.png","language":"TypeScript","readme":"# LiteParse\n\n[![CI](https://github.com/run-llama/liteparse/actions/workflows/ci.yml/badge.svg)](https://github.com/run-llama/liteparse/actions/workflows/ci.yml)\n|\n[![npm 
version](https://img.shields.io/npm/v/@llamaindex/liteparse.svg)](https://www.npmjs.com/package/@llamaindex/liteparse)\n|\n[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n|\n[Docs](https://developers.llamaindex.ai/liteparse/)\n\n\u003cimg src=\"https://github.com/user-attachments/assets/07ba6a82-6bb1-4dea-b0ef-cad7df7d1622\" alt=\"out\" width=\"600\"\u003e\n\nLiteParse is a standalone OSS PDF parsing tool focused exclusively on **fast and light** parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine. \n\n**Hitting the limits of local parsing?**\nFor complex documents (dense tables, multi-column layouts, charts, handwritten text, or \nscanned PDFs), you'll get significantly better results with [LlamaParse](https://developers.llamaindex.ai/python/cloud/llamaparse/?utm_source=github\u0026utm_medium=liteparse), \nour cloud-based document parser built for production document pipelines. 
LlamaParse handles the \nhard stuff so your models see clean, structured data and markdown.\n\n\u003e  👉 [Sign up for LlamaParse free](https://cloud.llamaindex.ai?utm_source=github\u0026utm_medium=liteparse)\n\n## Overview\n\n- **Fast Text Parsing**: Spatial text parsing using PDF.js\n- **Flexible OCR System**:\n  - **Built-in**: Tesseract.js (zero setup, works out of the box!)\n  - **HTTP Servers**: Plug in any OCR server (EasyOCR, PaddleOCR, custom)\n  - **Standard API**: Simple, well-defined OCR API specification\n- **Screenshot Generation**: Generate high-quality page screenshots for LLM agents\n- **Multiple Output Formats**: JSON and Text\n- **Bounding Boxes**: Precise text positioning information\n- **Standalone Binary**: No cloud dependencies, runs entirely locally\n- **Multi-platform**: Linux, macOS (Intel/ARM), Windows\n\n## Installation\n\n### CLI Tool\n\n#### Option 1: Global Install (Recommended)\n\nInstall globally via npm to use the `lit` command anywhere:\n\n```bash\nnpm i -g @llamaindex/liteparse\n```\n\nThen use it:\n\n```bash\nlit parse document.pdf\nlit screenshot document.pdf\n```\n\nFor macOS and Linux users, `liteparse` can also be installed via `brew`:\n\n```bash\nbrew tap run-llama/liteparse\nbrew install llamaindex-liteparse\n```\n\n#### Option 2: Install from Source\n\nYou can clone the repo and install the CLI globally from source:\n\n```\ngit clone https://github.com/run-llama/liteparse.git\ncd liteparse\nnpm install\nnpm run build\nnpm pack\nnpm install -g ./liteparse-*.tgz\n```\n\n### Agent Skill\n\nYou can use `liteparse` as an agent skill, downloading it with the `skills` CLI tool:\n\n```bash\nnpx skills add run-llama/llamaparse-agent-skills --skill liteparse\n```\n\nOr copy the [`SKILL.md`](https://github.com/run-llama/llamaparse-agent-skills/blob/main/skills/liteparse/SKILL.md) file into your own skills setup.\n\n## Usage\n\n### Parse Files\n\n```bash\n# Basic parsing\nlit parse document.pdf\n\n# Parse with specific format\nlit parse 
document.pdf --format json -o output.json\n\n# Parse specific pages\nlit parse document.pdf --target-pages \"1-5,10,15-20\"\n\n# Parse without OCR\nlit parse document.pdf --no-ocr\n\n# Parse a remote PDF\ncurl -sL https://example.com/report.pdf | lit parse -\n```\n\n### Batch Parsing\n\nYou can also parse an entire directory of documents:\n\n```bash\nlit batch-parse ./input-directory ./output-directory\n```\n\n### Generate Screenshots\n\nScreenshots are essential for LLM agents to extract visual information that text alone cannot capture.\n\n```bash\n# Screenshot all pages\nlit screenshot document.pdf -o ./screenshots\n\n# Screenshot specific pages\nlit screenshot document.pdf --target-pages \"1,3,5\" -o ./screenshots\n\n# Custom DPI\nlit screenshot document.pdf --dpi 300 -o ./screenshots\n\n# Screenshot page range\nlit screenshot document.pdf --target-pages \"1-10\" -o ./screenshots\n```\n\n### Library Usage\n\nInstall as a dependency in your project:\n\n```bash\nnpm install @llamaindex/liteparse\n# or\npnpm add @llamaindex/liteparse\n```\n\n```typescript\nimport { LiteParse } from '@llamaindex/liteparse';\n\nconst parser = new LiteParse({ ocrEnabled: true });\nconst result = await parser.parse('document.pdf');\nconsole.log(result.text);\n```\n\n#### Buffer / Uint8Array Input\n\nYou can pass raw bytes directly instead of a file path, which is useful for remote files:\n\n```typescript\nimport { LiteParse } from '@llamaindex/liteparse';\nimport { readFile } from 'fs/promises';\n\nconst parser = new LiteParse();\n\n// From a file read\nconst pdfBytes = await readFile('document.pdf');\nconst result = await parser.parse(pdfBytes);\n\n// From an HTTP response\nconst response = await fetch('https://example.com/document.pdf');\nconst buffer = Buffer.from(await response.arrayBuffer());\nconst result2 = await parser.parse(buffer);\n```\n\nNon-PDF buffers (images, Office documents) are written to a temp directory for format conversion. 
Screenshots also work with buffer input:\n\n```typescript\nconst screenshots = await parser.screenshot(pdfBytes, [1, 2, 3]);\n```\n\n### CLI Options\n\n#### Parse Command\n\n```\n$ lit parse --help\nUsage: lit parse [options] \u003cfile\u003e\n\nParse a document file (PDF, DOCX, XLSX, PPTX, images, etc.)\n\nOptions:\n  -o, --output \u003cfile\u003e     Output file path\n  --format \u003cformat\u003e       Output format: json|text (default: \"text\")\n  --ocr-server-url \u003curl\u003e  HTTP OCR server URL (uses Tesseract if not provided)\n  --no-ocr                Disable OCR\n  --ocr-language \u003clang\u003e   OCR language(s) (default: \"en\")\n  --num-workers \u003cn\u003e       Number of pages to OCR in parallel (default: CPU cores - 1)\n  --max-pages \u003cn\u003e         Max pages to parse (default: \"10000\")\n  --target-pages \u003cpages\u003e  Target pages (e.g., \"1-5,10,15-20\")\n  --dpi \u003cdpi\u003e             DPI for rendering (default: \"150\")\n  --no-precise-bbox       Disable precise bounding boxes\n  --preserve-small-text   Preserve very small text\n  --password \u003cpassword\u003e   Password for encrypted/protected documents\n  --config \u003cfile\u003e         Config file (JSON)\n  -q, --quiet             Suppress progress output\n  -h, --help              display help for command\n```\n\n#### Batch Parse Command\n\n```\n$ lit batch-parse --help\nUsage: lit batch-parse [options] \u003cinput-dir\u003e \u003coutput-dir\u003e\n\nParse multiple documents in batch mode (reuses PDF engine for efficiency)\n\nOptions:\n  --format \u003cformat\u003e       Output format: json|text (default: \"text\")\n  --ocr-server-url \u003curl\u003e  HTTP OCR server URL (uses Tesseract if not provided)\n  --no-ocr                Disable OCR\n  --ocr-language \u003clang\u003e   OCR language(s) (default: \"en\")\n  --num-workers \u003cn\u003e       Number of pages to OCR in parallel (default: CPU cores - 1)\n  --max-pages \u003cn\u003e         Max pages to parse per 
file (default: \"10000\")\n  --dpi \u003cdpi\u003e             DPI for rendering (default: \"150\")\n  --no-precise-bbox       Disable precise bounding boxes\n  --recursive             Recursively search input directory\n  --extension \u003cext\u003e       Only process files with this extension (e.g., \".pdf\")\n  --password \u003cpassword\u003e   Password for encrypted/protected documents (applied to all files)\n  --config \u003cfile\u003e         Config file (JSON)\n  -q, --quiet             Suppress progress output\n  -h, --help              display help for command\n```\n\n#### Screenshot Command\n\n```\n$ lit screenshot --help\nUsage: lit screenshot [options] \u003cfile\u003e\n\nGenerate screenshots of PDF pages\n\nOptions:\n  -o, --output-dir \u003cdir\u003e  Output directory for screenshots (default: \"./screenshots\")\n  --target-pages \u003cpages\u003e  Page numbers to screenshot (e.g., \"1,3,5\" or \"1-5\")\n  --dpi \u003cdpi\u003e             DPI for rendering (default: \"150\")\n  --format \u003cformat\u003e       Image format: png|jpg (default: \"png\")\n  --password \u003cpassword\u003e   Password for encrypted/protected documents\n  --config \u003cfile\u003e         Config file (JSON)\n  -q, --quiet             Suppress progress output\n  -h, --help              display help for command\n```\n\n## OCR Setup\n\n### Default: Tesseract.js\n\n```bash\n# Tesseract is enabled by default\nlit parse document.pdf\n\n# Specify language\nlit parse document.pdf --ocr-language fra\n\n# Disable OCR\nlit parse document.pdf --no-ocr\n```\n\nBy default, Tesseract.js downloads language data from the internet on first use. 
For offline or air-gapped environments, set the `TESSDATA_PREFIX` environment variable to a directory containing pre-downloaded `.traineddata` files:\n\n```bash\nexport TESSDATA_PREFIX=/path/to/tessdata\nlit parse document.pdf --ocr-language eng\n```\n\nYou can also pass `tessdataPath` in the library config:\n\n```typescript\nconst parser = new LiteParse({ tessdataPath: '/path/to/tessdata' });\n```\n\n### Optional: HTTP OCR Servers\n\nFor higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for popular OCR engines:\n\n- [EasyOCR](ocr/easyocr/README.md)\n- [PaddleOCR](ocr/paddleocr/README.md)\n\nYou can integrate any OCR service by implementing the simple LiteParse OCR API specification (see [`OCR_API_SPEC.md`](OCR_API_SPEC.md)).\n\nThe API requires:\n- POST `/ocr` endpoint\n- Accepts `file` and `language` parameters\n- Returns JSON: `{ results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }`\n\nSee the example servers in `ocr/easyocr/` and `ocr/paddleocr/` as templates.\n\nFor the complete OCR API specification, see [`OCR_API_SPEC.md`](OCR_API_SPEC.md).\n\n## Multi-Format Input Support\n\nLiteParse supports **automatic conversion** of various document formats to PDF before parsing. 
This makes it unique compared to other PDF-only parsing tools!\n\n### Supported Input Formats\n\n#### Office Documents (via LibreOffice)\n- **Word**: `.doc`, `.docx`, `.docm`, `.odt`, `.rtf`\n- **PowerPoint**: `.ppt`, `.pptx`, `.pptm`, `.odp`\n- **Spreadsheets**: `.xls`, `.xlsx`, `.xlsm`, `.ods`, `.csv`, `.tsv`\n\nJust install the dependency and LiteParse will automatically convert these formats to PDF for parsing:\n\n```bash\n# macOS\nbrew install --cask libreoffice\n\n# Ubuntu/Debian\napt-get install libreoffice\n\n# Windows\nchoco install libreoffice-fresh # might require admin permissions\n```\n\n\u003e _For Windows, you might need to add the path to the directory containing LibreOffice CLI executable (generally `C:\\Program Files\\LibreOffice\\program`) to the environment variables and re-start the machine._\n\n#### Images (via ImageMagick)\n- **Formats**: `.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.tiff`, `.webp`, `.svg`\n\nJust install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):\n\n```bash\n# macOS\nbrew install imagemagick\n\n# Ubuntu/Debian\napt-get install imagemagick\n\n# Windows\nchoco install imagemagick.app # might require admin permissions\n```\n\n## Environment Variables\n\n| Variable | Description |\n|----------|-------------|\n| `TESSDATA_PREFIX` | Path to a directory containing Tesseract `.traineddata` files. Used for offline/air-gapped environments where Tesseract.js cannot download language data from the internet. |\n| `LITEPARSE_TMPDIR` | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory (`os.tmpdir()`). Useful in containerized or read-only filesystem environments. |\n\n## Configuration\n\nYou can configure parsing options via CLI flags or a JSON config file. 
The config file allows you to set sensible defaults and override as needed.\n\n### Config File Example\n\nCreate a `liteparse.config.json` file:\n\n```json\n{\n  \"ocrLanguage\": \"en\",\n  \"ocrEnabled\": true,\n  \"maxPages\": 1000,\n  \"dpi\": 150,\n  \"outputFormat\": \"json\",\n  \"preciseBoundingBox\": true,\n  \"preserveVerySmallText\": false,\n  \"password\": \"optional_password\"\n}\n```\n\nFor HTTP OCR servers, just add `ocrServerUrl`:\n\n```json\n{\n  \"ocrServerUrl\": \"http://localhost:8828/ocr\",\n  \"ocrLanguage\": \"en\",\n  \"outputFormat\": \"json\"\n}\n```\n\nUse with:\n\n```bash\nlit parse document.pdf --config liteparse.config.json\n```\n\n## Development\n\nWe provide a fairly rich `AGENTS.md`/`CLAUDE.md` that we recommend using when developing with coding agents.\n\n```bash\n# Install dependencies\nnpm install\n\n# Build TypeScript (Linux/macOS)\nnpm run build\n\n# Build TypeScript (Windows)\nnpm run build:windows\n\n# Watch mode\nnpm run dev\n\n# Test parsing\nnpm test\n```\n\n## License\n\nApache 2.0\n\n## Credits\n\nBuilt on top of:\n\n- [PDF.js](https://github.com/mozilla/pdf.js) - PDF parsing engine\n- [Tesseract.js](https://github.com/naptha/tesseract.js) - In-process OCR engine\n- [EasyOCR](https://github.com/JaidedAI/EasyOCR) - HTTP OCR server (optional)\n- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR) - HTTP OCR server (optional)\n- [Sharp](https://github.com/lovell/sharp) - Image processing\n","funding_links":[],"categories":["TypeScript","\u003ca name=\"TypeScript\"\u003e\u003c/a\u003eTypeScript"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frun-llama%2Fliteparse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frun-llama%2Fliteparse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frun-llama%2Fliteparse/lists"}