https://github.com/aptakhin/unifex

A Python library for document text extraction with local and cloud OCR solutions
https://github.com/aptakhin/unifex
azure google ocr paddleocr python tesseract-ocr
Last synced: 3 months ago
JSON representation
A Python library for document text extraction with local and cloud OCR solutions
Host: GitHub
URL: https://github.com/aptakhin/unifex
Owner: aptakhin
License: bsd-3-clause
Created: 2026-01-13T15:07:50.000Z (6 months ago)
Default Branch: main
Last Pushed: 2026-02-11T18:14:24.000Z (5 months ago)
Last Synced: 2026-03-10T23:15:11.031Z (4 months ago)
Topics: azure, google, ocr, paddleocr, python, tesseract-ocr
Language: Python
Homepage: https://aptakhin.name/unifex/
Size: 1.15 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

          # unifex

A Python library for document text extraction with local and cloud OCR solutions.

**Focus:** Built for tasks like fraud detection where precision matters. We needed a universal tool for both PDF and image processing with best-in-class OCR support through local engines (EasyOCR, Tesseract, PaddleOCR) and cloud services (Azure Document Intelligence, Google Document AI).

📖 **[Documentation](https://aptakhin.name/unifex/)**

## Features

- **Multiple OCR Backends**: Local (EasyOCR, Tesseract, PaddleOCR) and cloud (Azure Document Intelligence, Google Document AI) OCR support

- **PDF Text Extraction**: Native PDF text extraction using pypdfium2

- **LLM Extraction**: Extract structured data using GPT-4o, Claude, Gemini, or OpenAI-compatible APIs

- **Unified Coordinates**: Seamless conversion between POINTS, PIXELS, INCHES, and NORMALIZED coordinate systems

- **Table Extraction**: PDF (tabula), PaddleOCR (PPStructure), and cloud OCR (Azure DI, Google DocAI)

- **Parallel Extraction**: Process multiple pages concurrently with thread or process executors

- **Async Support**: Native async/await API for integration with async applications

- **Unified Extractors**: Each OCR extractor auto-detects file type (PDF vs image) and handles conversion internally

- **Schema Adapters**: Clean separation of external API schemas from internal models

- **Pydantic Models**: Type-safe document representation with pydantic v1/v2 compatibility

## Alternatives

For broader document processing, check out [Docling](https://docling-project.github.io) and [Kreuzberg](https://kreuzberg.dev/).

## Installation

```bash

pip install unifex

```

Or with optional dependencies:

```bash

pip install unifex[pdf]       # PDF text extraction

pip install unifex[easyocr]   # EasyOCR support

pip install unifex[tesseract] # Tesseract OCR support

pip install unifex[azure]     # Azure Document Intelligence

pip install unifex[google]    # Google Document AI

pip install unifex[llm-openai]     # OpenAI/GPT-4 extraction

pip install unifex[llm-anthropic]  # Anthropic/Claude extraction

pip install unifex[all]       # All dependencies

```

## Quick Start

### Factory Interface (Recommended)

The simplest way to use unifex is via the factory interface. Both string paths and `Path` objects are accepted:

```python

from unifex import create_extractor, ExtractorType

# PDF extraction (native text) - string path

with create_extractor("document.pdf", ExtractorType.PDF) as extractor:

    result = extractor.extract()

    doc = result.document  # Access the Document

# EasyOCR for images

with create_extractor("image.png", ExtractorType.EASYOCR, languages=["en"]) as extractor:

    result = extractor.extract()

# EasyOCR for PDFs (auto-converts to images internally)

with create_extractor("scanned.pdf", ExtractorType.EASYOCR, dpi=200) as extractor:

    result = extractor.extract()

# Azure Document Intelligence (credentials from env vars)

with create_extractor("document.pdf", ExtractorType.AZURE_DI) as extractor:

    result = extractor.extract()

# Path objects also work

from pathlib import Path

with create_extractor(Path("document.pdf"), ExtractorType.PDF) as extractor:

    result = extractor.extract()

```

### Example Output

The `extract()` method returns an `ExtractionResult` containing the `Document` and per-page results:

```python

from unifex import create_extractor, ExtractorType

with create_extractor("document.pdf", ExtractorType.PDF) as extractor:

    result = extractor.extract()

# Check extraction status

print(f"Success: {result.success}")  # True if all pages extracted

# Access extracted document

doc = result.document

print(f"Pages: {len(doc.pages)}")  # Pages: 2

for page in doc.pages:

    print(f"Page {page.page + 1} ({page.width:.0f}x{page.height:.0f}):")

    for text in page.texts:

        print(f"  - \"{text.text}\"")

        print(f"    bbox: ({text.bbox.x0:.1f}, {text.bbox.y0:.1f}, {text.bbox.x1:.1f}, {text.bbox.y1:.1f})")

# Handle errors if any

if not result.success:

    for page_num, error in result.errors:

        print(f"Page {page_num} failed: {error}")

```

Output:

```

Pages: 2

Page 1 (595x842):

  - "First page. First text"

    bbox: (48.3, 57.8, 205.4, 74.6)

  - "First page. Second text"

    bbox: (48.0, 81.4, 231.2, 98.6)

  - "First page. Fourth text"

    bbox: (47.8, 120.5, 221.9, 137.4)

Page 2 (595x842):

  - "Second page. Third text"

    bbox: (47.4, 81.1, 236.9, 98.3)

```

For more detailed examples, see the [documentation](https://aptakhin.name/unifex/).

### PDF Text Extraction

```python

from unifex import PdfExtractor

# String paths work directly

with PdfExtractor("document.pdf") as extractor:

    result = extractor.extract()

    for page in result.document.pages:

        for text in page.texts:

            print(text.text)

```

### Language Codes

All OCR extractors use **2-letter ISO 639-1 language codes** (e.g., `"en"`, `"fr"`, `"de"`, `"it"`).

Extractors that require different formats (like Tesseract) convert internally.

### Parallel Extraction

Extract multiple pages concurrently for faster processing:

```python

from unifex import create_extractor, ExtractorType, ExecutorType

# Thread-based parallelism (recommended for most cases)

with create_extractor("large_document.pdf", ExtractorType.EASYOCR) as extractor:

    result = extractor.extract(max_workers=4)  # 4 parallel workers

# Process-based parallelism (for CPU-bound pure Python workloads)

with create_extractor("large_document.pdf", ExtractorType.EASYOCR) as extractor:

    result = extractor.extract(max_workers=4, executor=ExecutorType.PROCESS)

# Extract specific pages in parallel

with create_extractor("document.pdf", ExtractorType.PDF) as extractor:

    result = extractor.extract(pages=[0, 2, 5, 8], max_workers=4)

```

**Executor Types:**

| Executor | Best For | Notes |

|----------|----------|-------|

| `THREAD` (default) | Most OCR use cases | Shared model cache, low overhead, C libraries release GIL |

| `PROCESS` | CPU-bound pure Python | Models duplicated per worker, higher memory usage |

### Async Extraction

For async applications, use the async API:

```python

import asyncio

from unifex import create_extractor, ExtractorType

async def extract_document():

    with create_extractor("document.pdf", ExtractorType.EASYOCR) as extractor:

        result = await extractor.extract_async(max_workers=4)

        return result.document

doc = asyncio.run(extract_document())

```

### OCR Extraction

Local OCR engines (EasyOCR, Tesseract, PaddleOCR) and cloud services (Azure Document Intelligence, Google Document AI). All extractors auto-detect file type (PDF vs image) and handle conversion internally.

```python

from unifex import (

    EasyOcrExtractor, TesseractOcrExtractor, PaddleOcrExtractor,

    AzureDocumentIntelligenceExtractor, GoogleDocumentAIExtractor,

)

# Local OCR (works for both images and PDFs)

with EasyOcrExtractor("scanned.pdf", languages=["en"], dpi=200) as extractor:

    result = extractor.extract()

# Tesseract (requires system install: brew install tesseract)

with TesseractOcrExtractor("image.png", languages=["en"]) as extractor:

    result = extractor.extract()

# PaddleOCR (excellent for Chinese)

with PaddleOcrExtractor("chinese_doc.png", lang="ch") as extractor:

    result = extractor.extract()

# Cloud: Azure Document Intelligence

with AzureDocumentIntelligenceExtractor(

    "document.pdf",

    endpoint="https://your-resource.cognitiveservices.azure.com",

    key="your-api-key",

) as extractor:

    result = extractor.extract()

# Cloud: Google Document AI

with GoogleDocumentAIExtractor(

    "document.pdf",

    processor_name="projects/your-project/locations/us/processors/id",

    credentials_path="/path/to/credentials.json",

) as extractor:

    result = extractor.extract()

```

### LLM Extraction

Extract structured data using vision-capable LLMs (OpenAI, Anthropic, Google, Azure OpenAI). Supports custom prompts, Pydantic schemas, parallel extraction, async API, and OpenAI-compatible endpoints (vLLM, Ollama).

```python

from pydantic import BaseModel

from unifex.llm import extract_structured, extract_structured_async

class Invoice(BaseModel):

    invoice_number: str

    date: str

    total: float

# Basic extraction with Pydantic schema

result = extract_structured("invoice.pdf", model="openai/gpt-4o", schema=Invoice)

invoice: Invoice = result.data

# With custom prompt and parallel workers

result = extract_structured(

    "large_doc.pdf",

    model="anthropic/claude-sonnet-4-20250514",

    prompt="Extract invoice details",

    max_workers=4,  # Process pages in parallel

)

# OpenAI-compatible APIs (vLLM, Ollama) with custom headers

result = extract_structured(

    "document.pdf",

    model="openai/llava",

    base_url="http://localhost:11434/v1",

    headers={"X-Custom-Auth": "token"},

)

# Async API

result = await extract_structured_async("doc.pdf", model="openai/gpt-4o", max_workers=4)

```

## CLI Usage

```bash

# OCR extractors: pdf, easyocr, tesseract, paddle, azure-di, google-docai

uv run python -m unifex.cli document.pdf --extractor easyocr --lang en

# Parallel extraction with process executor

uv run python -m unifex.cli large_doc.pdf --extractor easyocr --max-workers 4 --executor process

# Cloud OCR (credentials via CLI or env vars)

uv run python -m unifex.cli document.pdf --extractor azure-di \

    --azure-endpoint https://your-resource.cognitiveservices.azure.com --azure-key your-key

# LLM extraction with parallel workers and custom endpoint

uv run python -m unifex.cli document.pdf --llm openai/gpt-4o --max-workers 4 \

    --llm-base-url https://your-proxy.com/v1 --llm-header "X-Auth=token"

# JSON output, specific pages

uv run python -m unifex.cli document.pdf --extractor pdf --pages 0,1,2 --json

```

## Environment Variables

Cloud extractors and LLM providers support configuration via environment variables:

**OCR Extractors:**

| Variable | Description |

|----------|-------------|

| `UNIFEX_AZURE_DI_ENDPOINT` | Azure Document Intelligence endpoint URL |

| `UNIFEX_AZURE_DI_KEY` | Azure Document Intelligence API key |

| `UNIFEX_AZURE_DI_MODEL` | Azure model ID (default: `prebuilt-read`) |

| `UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME` | Google Document AI processor name |

| `UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH` | Path to Google service account JSON |

**LLM Providers:**

| Variable | Description |

|----------|-------------|

| `OPENAI_API_KEY` | OpenAI API key |

| `ANTHROPIC_API_KEY` | Anthropic API key |

| `GOOGLE_API_KEY` | Google AI API key |

| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key |

| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL |

| `AZURE_OPENAI_API_VERSION` | Azure OpenAI API version (default: `2024-02-15-preview`) |

## Development

### Setup

```bash

# Install dependencies

uv sync

# Install pre-commit hooks

uv run pre-commit install

```

### Running Tests

```bash

# Run all tests

uv run pytest

# Run fast tests only (unit tests, <0.5s per test)

uv run pytest tests/base tests/ocr

# Run integration tests only (slow, load ML models)

uv run pytest tests/integration

# Run with coverage

uv run pytest --cov=unifex --cov-report=term-missing

```

### Test Structure

```

tests/

├── base/           # Fast unit tests (<0.5s each) - run in pre-commit

├── ocr/            # OCR adapter unit tests (mocked) - run in pre-commit

├── llm/            # LLM unit tests (mocked) - run in pre-commit

└── integration/    # Slow tests - NOT in pre-commit

    ├── ocr/        # OCR integration tests (load real ML models)

    └── llm/        # LLM integration tests (call real APIs)

```

**Pre-commit runs:** `tests/base`, `tests/ocr`, and `tests/llm` with 0.5s timeout per test.

**CI runs:** All tests including integration tests.

### Integration Tests

Integration tests load real ML models and call real services. They are in `tests/integration/`.

**Local extractors** (no credentials required):

- `PdfExtractor` - Tests PDF text extraction

- `EasyOcrExtractor` - Tests image and PDF OCR with EasyOCR

- `TesseractOcrExtractor` - Tests image and PDF OCR with Tesseract (requires Tesseract installed)

- `PaddleOcrExtractor` - Tests image and PDF OCR with PaddleOCR

**Cloud extractors** (require credentials):

- `AzureDocumentIntelligenceExtractor` - Tests Azure Document Intelligence

- `GoogleDocumentAIExtractor` - Tests Google Document AI

#### Azure Credentials Setup

1. Copy the example environment file:

   ```bash

   cp .env.example .env

   ```

2. Edit `.env` with your Azure Document Intelligence credentials:

   ```

   UNIFEX_AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com

   UNIFEX_AZURE_DI_KEY=your-api-key

   ```

3. Load environment variables and run tests:

   ```bash

   set -a; source .env; set +a;

   uv run pytest tests/test_integration.py -v

   ```

Azure integration tests are automatically skipped if credentials are not configured.

#### Google Document AI Credentials Setup

1. Create a Google Cloud project and enable the Document AI API

2. Create a Document AI processor in the Google Cloud Console

3. Create a service account with Document AI permissions

4. Download the service account JSON key file

5. Edit `.env` with your Google Document AI credentials:

   ```

   UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME=projects/your-project/locations/us/processors/your-processor-id

   UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH=/path/to/your/service-account.json

   ```

Google Document AI integration tests are automatically skipped if credentials are not configured.

### Documentation

Build and serve the documentation locally:

```bash

# Serve docs with live reload

uv run mkdocs serve

# Build static site

uv run mkdocs build

```

Open http://localhost:8000 to view the documentation.

### Pre-commit Checks

The pre-commit hook runs automatically on `git commit`. To run manually:

```bash

uv run pre-commit run --all-files

```

This runs:

- `ruff format` - Code formatting

- `ruff check --fix` - Linting with auto-fix

- `ty check` - Type checking

- `pytest` - Test suite

## License

BSD 3-Clause License. See [LICENSE](LICENSE) for details.

## Future plans

- Detecting language helper

- Performance measurement
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/aptakhin/unifex

Awesome Lists containing this project

README