{"id":34733547,"url":"https://github.com/ahnafnafee/local-llm-pdf-ocr","last_synced_at":"2026-04-24T17:05:48.941Z","repository":{"id":328703321,"uuid":"1116376250","full_name":"ahnafnafee/local-llm-pdf-ocr","owner":"ahnafnafee","description":"Convert scanned PDFs into searchable text locally using Vision LLMs (olmOCR). 100% private, offline, and free. Features a modern Web UI \u0026 CLI.","archived":false,"fork":false,"pushed_at":"2026-04-19T07:09:39.000Z","size":1991,"stargazers_count":33,"open_issues_count":1,"forks_count":7,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-19T08:33:23.555Z","etag":null,"topics":["document-processing","fastapi","local-llm","no-api-key","ocr","offline-ai","olmocr","pdf-ocr","privacy-focused","python","searchable-pdf","surya-ocr","vision-llm","web-ui"],"latest_commit_sha":null,"homepage":"https://www.ahnafnafee.dev/blog/local-llm-pdf-ocr","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ahnafnafee.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-14T18:30:37.000Z","updated_at":"2026-04-19T07:09:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ahnafnafee/local-llm-pdf-ocr","commit_stats":null,"previous_names":["ahnafnafee/local-llm-pdf-ocr"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ahnafnafee/local-llm-pdf-ocr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/
repositories/ahnafnafee%2Flocal-llm-pdf-ocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ahnafnafee%2Flocal-llm-pdf-ocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ahnafnafee%2Flocal-llm-pdf-ocr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ahnafnafee%2Flocal-llm-pdf-ocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ahnafnafee","download_url":"https://codeload.github.com/ahnafnafee/local-llm-pdf-ocr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ahnafnafee%2Flocal-llm-pdf-ocr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32232701,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-24T13:21:15.438Z","status":"ssl_error","status_checked_at":"2026-04-24T13:21:15.005Z","response_time":64,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-processing","fastapi","local-llm","no-api-key","ocr","offline-ai","olmocr","pdf-ocr","privacy-focused","python","searchable-pdf","surya-ocr","vision-llm","web-ui"],"created_at":"2025-12-25T03:19:08.307Z","updated_at":"2026-04-24T17:05:48.934Z","avatar_url":"https://github.com/ahnafnafee.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# 📄 Local LLM PDF 
OCR\n\n[![Python](https://img.shields.io/badge/Python-3.10%2B-blue?style=for-the-badge\u0026logo=python\u0026logoColor=white)](https://python.org)\n[![FastAPI](https://img.shields.io/badge/FastAPI-Modern-009688?style=for-the-badge\u0026logo=fastapi\u0026logoColor=white)](https://fastapi.tiangolo.com)\n[![License](https://img.shields.io/badge/License-MIT-purple?style=for-the-badge)](LICENSE)\n[![Local AI](https://img.shields.io/badge/AI-100%25_Local-orange?style=for-the-badge)](https://lmstudio.ai)\n\n\u003e **Transform scanned and written documents into fully searchable, selectable PDFs using the power of Local LLM Vision.**\n\n**PDF LLM OCR** is a next-generation OCR tool that moves beyond traditional Tesseract-based scanning. By leveraging OCR Vision Language Models (VLMs) like `olmOCR` running locally on your machine, it \"reads\" documents with human-like understanding while keeping 100% of your data private.\n\n---\n\n## ✨ Features\n\n-   **🧠 AI-Powered Vision**: Uses advanced VLMs to transcribe text with high accuracy, even on complex layouts or noisy scans.\n-   **🤝 DP-Based Text↔Box Alignment**: **Surya OCR** detects layout boxes; a **Local LLM** transcribes the whole page; a Needleman-Wunsch dynamic-programming aligner binds LLM lines to the correct boxes in reading order, with a per-box crop re-OCR fallback for boxes the DP cannot confidently populate.\n-   **🛰️ Grounded Path (opt-in)**: Point the tool at a bbox-native VLM (Qwen2.5-VL, Qwen3-VL, MinerU, Florence-2, …) with `--grounded` and it skips Surya/DP/refine entirely — the model returns text + coordinates in a single call.\n-   **🖼️ PDF or Raw Image Input**: Accepts **`.pdf`, `.jpg`, `.jpeg`, `.png`, `.bmp`, `.webp`, `.tif`/`.tiff`**. Multi-frame TIFFs become multi-page output PDFs — no manual PDF-wrap step.\n-   **⚡ Fast Detection**: Surya runs in detection-only mode (no recognition) and batches across pages.\n-   **🔒 100% Local \u0026 Private**: No cloud APIs, no subscription fees. 
Run it entirely offline using [LM Studio](https://lmstudio.ai) or [Ollama](https://ollama.com).\n-   **🔍 Searchable Outputs**: Embeds an invisible text layer into a sandwich PDF. Glyph bboxes are horizontally scaled so selection in a PDF viewer covers the full width of each text region.\n-   **🖥️ Dual Interfaces**:\n    -   **Web UI**: Drag \u0026 drop, Dark Mode, real-time per-page progress.\n    -   **CLI**: Documented flags for power users and batch automation, Rich progress bars.\n-   **🧪 Tested**: 145-test suite covering DP invariants, embedding geometry, grounded JSON parsing, and end-to-end runs against the example PDFs.\n\n---\n\n## 🏗️ Architecture\n\nThe tool has two execution paths behind a single `OCRPipeline` seam (`src/pdf_ocr/pipeline.py`). The default **hybrid path** works with any OCR-capable VLM; the opt-in **grounded path** collapses the whole flow into one call for VLMs that emit text+bbox natively.\n\n```mermaid\ngraph TD\n    A[Input: PDF / JPEG / PNG / TIFF] --\u003e B[Rasterize to images]\n    B --\u003e|--grounded| Z[Grounded VLM: text+bbox in one call]\n    Z --\u003e EMB\n\n    B --\u003e|default| C[Surya DetectionPredictor\u003cbr/\u003ebatch, detection-only]\n    B --\u003e D[LLM full-page OCR\u003cbr/\u003eOlmOCR / GLM-OCR / etc.]\n    C --\u003e E[Layout boxes in reading order]\n    D --\u003e F[Plain text with line breaks]\n    E --\u003e G[Needleman-Wunsch DP aligner\u003cbr/\u003eline ↔ box monotonic match]\n    F --\u003e G\n    G --\u003e H{Boxes the DP\u003cbr/\u003eleft empty?}\n    H --\u003e|yes| R[Per-box crop re-OCR\u003cbr/\u003erefine stage]\n    H --\u003e|no| EMB[Sandwich PDF writer]\n    R --\u003e EMB\n    EMB --\u003e L[Searchable PDF output]\n```\n\n### How It Works\n\n1. **Input**: PDFs *or* raw images. Multi-frame TIFFs expand to one page per frame. Images skip the PDF round-trip and feed straight into the pipeline.\n\n2. 
**Batch Layout Detection** *(hybrid path)*: Surya's `DetectionPredictor` processes all pages in one call, ~10-21× faster than running full recognition.\n\n3. **LLM Text Extraction** *(hybrid path)*: A local vision model (OlmOCR by default via LM Studio) transcribes each page's full content with human-like understanding.\n\n4. **Needleman-Wunsch Alignment** *(hybrid path)*: The DP aligner binds each LLM line to its Surya box using character-count fit + reading-order monotonicity. Cheap `skip_box` ops (many detected boxes are rules/decorations), expensive `skip_line` ops — but unmatched lines are attached to the nearest matched box so no LLM text is lost.\n\n5. **Refine Fallback** *(hybrid path, optional)*: Any sizeable box the DP couldn't populate gets its image crop re-OCR'd individually. Catches tables/multi-column/figure captions without paying N× latency on clean prose. Disable with `--no-refine`.\n\n6. **Grounded Path** *(opt-in alternative)*: With `--grounded` pointed at a bbox-native VLM (Qwen2.5-VL, Qwen3-VL, MinerU, …), the model returns `{bbox, text}` tuples in a single call — Surya, DP, and refine are all skipped.\n\n7. **Sandwich PDF**: The page is rasterized as a background image and invisible text is overlaid with horizontal-scale matrices so glyph bboxes span the full width of each source box — selection in a PDF viewer correctly covers the whole region.\n\n---\n\n## 🚀 Getting Started\n\n### Prerequisites\n\n1.  **Python 3.10+**\n2.  **A local OpenAI-compatible LLM server**. Any of:\n    -   **[LM Studio](https://lmstudio.ai)** — recommended default. Load `allenai/olmocr-2-7b` (hybrid path) or `qwen/qwen3-vl-8b` / `qwen/qwen2.5-vl-7b` (grounded path). Start the local server (default port `1234`).\n    -   **[Ollama](https://ollama.com)** — pull `glm-ocr:latest` (requires `--max-image-dim 640`) or any vision model. 
Served at `http://localhost:11434/v1`.\n    -   **vLLM / SGLang / any OpenAI-compatible endpoint**.\n\n### Configuration\n\nCreate a `.env` file in the root directory to configure your Local LLM:\n\n```env\nLLM_API_BASE=http://localhost:1234/v1\nLLM_MODEL=allenai/olmocr-2-7b\n```\n\n### Installation\n\nThis project is managed with [`uv`](https://github.com/astral-sh/uv) for lightning-fast dependency management.\n\n1.  **Install `uv`** (if not installed):\n\n    ```bash\n    pip install uv\n    ```\n\n2.  **Clone the repository**:\n\n    ```bash\n    git clone https://github.com/ahnafnafee/local-llm-pdf-ocr.git\n    cd local-llm-pdf-ocr\n    ```\n\n3.  **Sync Dependencies**:\n    ```bash\n    uv sync\n    ```\n\n---\n\n## Usage\n\n### 1. 🌐 Web Interface (Recommended)\n\nThe easiest way to use the tool. Features a modern dashboard with Dark Mode and Text Preview.\n\n1.  **Start the Server**:\n    ```bash\n    uv run uvicorn server:app --reload --port 8000\n    ```\n2.  Open your browser to `http://localhost:8000`.\n3.  **Drag \u0026 Drop** your PDF.\n4.  Watch the magic happen! ✨\n    -   **Real-time Progress**: Track per-page OCR status.\n    -   **Preview**: Click \"View Text\" to inspect the raw AI extraction.\n    -   **Dark Mode**: Toggle the moon icon for a sleek dark theme.\n\n### 2. 💻 Command Line Interface (CLI)\n\nPerfect for developers or integrating into scripts.\n\nRun the OCR tool on any PDF:\n\n```bash\nuv run main.py input.pdf output_ocr.pdf\n```\n\n**Options**:\n\n| Option                    | Description                                                           |\n| ------------------------- | --------------------------------------------------------------------- |\n| `input`                   | Path to a PDF **or** image file (`.jpg`/`.jpeg`/`.png`/`.bmp`/`.webp`/`.tif`/`.tiff`). Required. Multi-frame TIFFs expand to multiple output pages. 
|\n| `output`                  | Path to output PDF (optional, defaults to `\u003cinput_stem\u003e_ocr.pdf`; always a PDF, even for image inputs). |\n| `-v`, `--verbose`         | Enable debug logging (alignment details, box counts)                  |\n| `-q`, `--quiet`           | Suppress all output except errors                                     |\n| `--dpi \u003cint\u003e`             | DPI for image rendering (default: 200)                                |\n| `--pages \u003crange\u003e`         | Page range to process, e.g., `1-3,5` (default: all)                   |\n| `--concurrency \u003cint\u003e`     | Parallel LLM requests (default: 1)                                    |\n| `--no-refine`             | Skip per-box crop re-OCR (faster, less robust on tables/multi-column) |\n| `--max-image-dim \u003cint\u003e`   | Longest-edge px cap for page images (default: 1024; see note below)   |\n| `--api-base \u003curl\u003e`        | Override LLM API base URL                                             |\n| `--model \u003cname\u003e`          | Override LLM model name                                               |\n\n**Examples**:\n\n```bash\n# Basic usage (auto-generates input_ocr.pdf, uses LM Studio + OlmOCR)\nuv run main.py scan.pdf\n\n# Specific pages with higher rendering DPI\nuv run main.py document.pdf output.pdf --pages 1-5 --dpi 300\n\n# Parallel LLM calls on a multi-page doc\nuv run main.py long.pdf --concurrency 3\n\n# Use Ollama + GLM-OCR instead of LM Studio\nuv run main.py scan.pdf \\\n    --api-base http://localhost:11434/v1 \\\n    --model glm-ocr:latest \\\n    --max-image-dim 640\n\n# Grounded path: bbox-native VLM (Qwen2.5-VL / Qwen3-VL) — skips Surya, DP, refine\nuv run main.py scan.pdf --grounded \\\n    --api-base http://localhost:1234/v1 \\\n    --model qwen/qwen3-vl-8b\n\n# Raw image input — no PDF required. 
Accepts JPEG/PNG/BMP/WebP, and\n# multi-page TIFFs (each frame becomes one page in the output PDF).\nuv run main.py scan.png scan_ocr.pdf\nuv run main.py archive.tiff archive_ocr.pdf\n```\n\n### Two pipeline paths\n\n| Path | Flag | Detection | Text | Alignment | Refine | When to use |\n|------|------|-----------|------|-----------|--------|-------------|\n| **Hybrid** (default) | _none_ | Surya | LLM full-page | DP | Per-box crop | Text-only VLMs (OlmOCR, GLM-OCR); max coverage |\n| **Grounded** | `--grounded` | — | Bbox-native VLM returns both | — | — | Qwen2.5/3-VL, MinerU, etc.; simpler, fewer moving parts |\n\nThe hybrid path is the safe default: it works with *any* OCR-capable VLM, including models that can only return plain text. The grounded path is faster and eliminates the DP-alignment class of bugs entirely, but requires a VLM that emits `{\"bbox_2d\": [...], \"content\": \"...\"}` JSON when asked (Qwen2.5-VL / Qwen3-VL confirmed working; others untested).\n\n\u003e **Note on `--max-image-dim`**: small local VLMs have tight context windows.\n\u003e OlmOCR-2-7B (Qwen2.5-VL base) is happy with the 1024 default.\n\u003e **GLM-OCR:1.1B via Ollama crashes its runner above ~640 px**, so drop the\n\u003e cap when you use it. 
If Ollama dies mid-run, restart it with `ollama serve`\n\u003e and lower `--max-image-dim`.\n\n_You'll see animated progress bars showing detection, LLM OCR, refinement, and embedding._\n\n---\n\n## 📁 Project Structure\n\n```\nlocal-llm-pdf-ocr/\n├── src/pdf_ocr/\n│   ├── pipeline.py            # OCRPipeline orchestration seam (hybrid + grounded)\n│   ├── core/\n│   │   ├── aligner.py         # HybridAligner: Surya detect + Needleman-Wunsch DP\n│   │   ├── ocr.py             # OCRProcessor: OpenAI-compat LLM client + crop OCR\n│   │   ├── pdf.py             # PDFHandler: PDF/image I/O + sandwich-PDF embedding\n│   │   └── grounded.py        # Grounded backends (PromptedGroundedOCR, ZAIHostedOCR) + parsers\n│   ├── evaluation.py          # Confidence comparator (IoU + text similarity)\n│   └── utils/\n│       ├── image.py           # Crop utility for the refine stage\n│       └── tqdm_patch.py      # Silences Surya's internal progress bars\n├── tests/                     # 145-test suite (fast tier + Surya-integration tier)\n│   └── fixtures/              # Ground-truth JSON for confidence evaluation\n├── scripts/\n│   ├── confidence_eval.py     # Score either path against ground-truth fixtures\n│   ├── debug_alignment.py     # Visualize alignment for a single PDF\n│   ├── visualize_bboxes.py    # Render Surya's detected boxes\n│   └── ...                    
# Other debug tools\n├── static/                    # Web UI assets\n├── examples/                  # Sample PDFs (digital, hybrid, handwritten)\n├── main.py                    # CLI entry point\n└── server.py                  # FastAPI web server\n```\n\n---\n\n## 🛠️ Tech Stack\n\n-   **Backend**: FastAPI (Async Web Framework)\n-   **Frontend**: Vanilla JS + CSS Variables\n-   **PDF Processing**: PyMuPDF (Fitz)\n-   **Layout Detection**: Surya OCR (Detection-only mode)\n-   **AI Integration**: OpenAI Client (compatible with Local LLM servers)\n-   **CLI UI**: Rich (Terminal formatting)\n\n---\n\n## ⚡ Performance\n\nDetection is no longer the bottleneck — full-page LLM OCR is. Rough per-page timings on a warm run (Surya loaded, LM Studio serving OlmOCR-2-7B on a single GPU):\n\n| Phase | Time / page | Notes |\n|---|---|---|\n| Rasterize PDF → image | ~0.3 s | Linear in pages |\n| Surya batch detection | ~0.5 s | Amortized across all pages in one call |\n| **LLM full-page OCR** | **~2–4 s** | **Dominant cost.** Set `--concurrency 3` to parallelize on multi-page docs |\n| Per-box refine (if needed) | ~0.5–1 s × empty boxes | Typically 0–2 s; `--no-refine` to skip |\n| PDF assembly | ~0.2 s | Linear in pages |\n| Cold-start Surya load | +5–10 s (once) | Paid even on `--grounded` runs |\n\nOn our three example PDFs (hybrid path, `allenai/olmocr-2-7b`, warm): digital ≈ 14 s, hybrid ≈ 5 s, handwritten ≈ 4 s.\n\n---\n\n## 🧪 Testing\n\n```bash\nuv run pytest                      # full suite (~75s, loads Surya once)\nuv run pytest -m \"not slow\"        # fast tier (~15s, no model loads)\nuv run pytest tests/test_aligner.py -v\n```\n\nConfidence evaluation (needs a live LLM endpoint):\n\n```bash\nuv run scripts/confidence_eval.py --path both \\\n    --grounded-model qwen/qwen3-vl-8b \\\n    --hybrid-model allenai/olmocr-2-7b\n```\n\nScores either path against the fixtures in `tests/fixtures/ground_truth_*.json` — block recall at IoU≥0.3, average IoU of matched pairs, 
average text similarity.\n\n---\n\n## 🤝 Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n**License**: MIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fahnafnafee%2Flocal-llm-pdf-ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fahnafnafee%2Flocal-llm-pdf-ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fahnafnafee%2Flocal-llm-pdf-ocr/lists"}