{"id":47632183,"url":"https://github.com/aptakhin/unifex","last_synced_at":"2026-04-01T23:48:38.542Z","repository":{"id":332376602,"uuid":"1133597089","full_name":"aptakhin/unifex","owner":"aptakhin","description":" A Python library for document text extraction with local and cloud OCR solutions","archived":false,"fork":false,"pushed_at":"2026-02-11T18:14:24.000Z","size":1206,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-10T23:15:11.031Z","etag":null,"topics":["azure","google","ocr","paddleocr","python","tesseract-ocr"],"latest_commit_sha":null,"homepage":"https://aptakhin.name/unifex/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"bsd-3-clause","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aptakhin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-13T15:07:50.000Z","updated_at":"2026-02-02T15:37:48.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/aptakhin/unifex","commit_stats":null,"previous_names":["aptakhin/xtra","aptakhin/unifex"],"tags_count":2,"template":false,"template_full_name":null,"purl":"pkg:github/aptakhin/unifex","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aptakhin%2Funifex","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aptakhin%2Funifex/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aptakhin%2Funifex/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aptakhin%2Funifex/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aptakhin","download_url":"https://codeload.github.com/aptakhin/unifex/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aptakhin%2Funifex/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31293122,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T21:15:39.731Z","status":"ssl_error","status_checked_at":"2026-04-01T21:15:34.046Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["azure","google","ocr","paddleocr","python","tesseract-ocr"],"created_at":"2026-04-01T23:48:38.073Z","updated_at":"2026-04-01T23:48:38.530Z","avatar_url":"https://github.com/aptakhin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# unifex\n\nA Python library for document text extraction with local and cloud OCR solutions.\n\n**Focus:** Built for tasks like fraud detection where precision matters. We needed a universal tool for both PDF and image processing with best-in-class OCR support through local engines (EasyOCR, Tesseract, PaddleOCR) and cloud services (Azure Document Intelligence, Google Document AI).\n\n📖 **[Documentation](https://aptakhin.name/unifex/)**\n\n## Features\n\n- **Multiple OCR Backends**: Local (EasyOCR, Tesseract, PaddleOCR) and cloud (Azure Document Intelligence, Google Document AI) OCR support\n- **PDF Text Extraction**: Native PDF text extraction using pypdfium2\n- **LLM Extraction**: Extract structured data using GPT-4o, Claude, Gemini, or OpenAI-compatible APIs\n- **Unified Coordinates**: Seamless conversion between POINTS, PIXELS, INCHES, and NORMALIZED coordinate systems\n- **Table Extraction**: PDF (tabula), PaddleOCR (PPStructure), and cloud OCR (Azure DI, Google DocAI)\n- **Parallel Extraction**: Process multiple pages concurrently with thread or process executors\n- **Async Support**: Native async/await API for integration with async applications\n- **Unified Extractors**: Each OCR extractor auto-detects file type (PDF vs image) and handles conversion internally\n- **Schema Adapters**: Clean separation of external API schemas from internal models\n- **Pydantic Models**: Type-safe document representation with pydantic v1/v2 compatibility\n\n## Alternatives\n\nFor broader document processing, check out [Docling](https://docling-project.github.io) and [Kreuzberg](https://kreuzberg.dev/).\n\n## Installation\n\n```bash\npip install unifex\n```\n\nOr with optional dependencies:\n\n```bash\npip install unifex[pdf]       # PDF text extraction\npip install unifex[easyocr]   # EasyOCR support\npip install unifex[tesseract] # Tesseract OCR support\npip install unifex[azure]     # Azure Document Intelligence\npip install unifex[google]    # Google Document AI\npip install unifex[llm-openai]     # OpenAI/GPT-4 extraction\npip install unifex[llm-anthropic]  # Anthropic/Claude extraction\npip install unifex[all]       # All dependencies\n```\n\n## Quick Start\n\n### Factory Interface (Recommended)\n\nThe simplest way to use unifex is via the factory interface. Both string paths and `Path` objects are accepted:\n\n```python\nfrom unifex import create_extractor, ExtractorType\n\n# PDF extraction (native text) - string path\nwith create_extractor(\"document.pdf\", ExtractorType.PDF) as extractor:\n    result = extractor.extract()\n    doc = result.document  # Access the Document\n\n# EasyOCR for images\nwith create_extractor(\"image.png\", ExtractorType.EASYOCR, languages=[\"en\"]) as extractor:\n    result = extractor.extract()\n\n# EasyOCR for PDFs (auto-converts to images internally)\nwith create_extractor(\"scanned.pdf\", ExtractorType.EASYOCR, dpi=200) as extractor:\n    result = extractor.extract()\n\n# Azure Document Intelligence (credentials from env vars)\nwith create_extractor(\"document.pdf\", ExtractorType.AZURE_DI) as extractor:\n    result = extractor.extract()\n\n# Path objects also work\nfrom pathlib import Path\nwith create_extractor(Path(\"document.pdf\"), ExtractorType.PDF) as extractor:\n    result = extractor.extract()\n```\n\n### Example Output\n\nThe `extract()` method returns an `ExtractionResult` containing the `Document` and per-page results:\n\n```python\nfrom unifex import create_extractor, ExtractorType\n\nwith create_extractor(\"document.pdf\", ExtractorType.PDF) as extractor:\n    result = extractor.extract()\n\n# Check extraction status\nprint(f\"Success: {result.success}\")  # True if all pages extracted\n\n# Access extracted document\ndoc = result.document\nprint(f\"Pages: {len(doc.pages)}\")  # Pages: 2\n\nfor page in doc.pages:\n    print(f\"Page {page.page + 1} ({page.width:.0f}x{page.height:.0f}):\")\n    for text in page.texts:\n        print(f\"  - \\\"{text.text}\\\"\")\n        print(f\"    bbox: ({text.bbox.x0:.1f}, {text.bbox.y0:.1f}, {text.bbox.x1:.1f}, {text.bbox.y1:.1f})\")\n\n# Handle errors if any\nif not result.success:\n    for page_num, error in result.errors:\n        print(f\"Page {page_num} failed: {error}\")\n```\n\nOutput:\n```\nPages: 2\nPage 1 (595x842):\n  - \"First page. First text\"\n    bbox: (48.3, 57.8, 205.4, 74.6)\n  - \"First page. Second text\"\n    bbox: (48.0, 81.4, 231.2, 98.6)\n  - \"First page. Fourth text\"\n    bbox: (47.8, 120.5, 221.9, 137.4)\nPage 2 (595x842):\n  - \"Second page. Third text\"\n    bbox: (47.4, 81.1, 236.9, 98.3)\n```\n\nFor more detailed examples, see the [documentation](https://aptakhin.name/unifex/).\n\n### PDF Text Extraction\n\n```python\nfrom unifex import PdfExtractor\n\n# String paths work directly\nwith PdfExtractor(\"document.pdf\") as extractor:\n    result = extractor.extract()\n    for page in result.document.pages:\n        for text in page.texts:\n            print(text.text)\n```\n\n### Language Codes\n\nAll OCR extractors use **2-letter ISO 639-1 language codes** (e.g., `\"en\"`, `\"fr\"`, `\"de\"`, `\"it\"`).\nExtractors that require different formats (like Tesseract) convert internally.\n\n### Parallel Extraction\n\nExtract multiple pages concurrently for faster processing:\n\n```python\nfrom unifex import create_extractor, ExtractorType, ExecutorType\n\n# Thread-based parallelism (recommended for most cases)\nwith create_extractor(\"large_document.pdf\", ExtractorType.EASYOCR) as extractor:\n    result = extractor.extract(max_workers=4)  # 4 parallel workers\n\n# Process-based parallelism (for CPU-bound pure Python workloads)\nwith create_extractor(\"large_document.pdf\", ExtractorType.EASYOCR) as extractor:\n    result = extractor.extract(max_workers=4, executor=ExecutorType.PROCESS)\n\n# Extract specific pages in parallel\nwith create_extractor(\"document.pdf\", ExtractorType.PDF) as extractor:\n    result = extractor.extract(pages=[0, 2, 5, 8], max_workers=4)\n```\n\n**Executor Types:**\n\n| Executor | Best For | Notes |\n|----------|----------|-------|\n| `THREAD` (default) | Most OCR use cases | Shared model cache, low overhead, C libraries release GIL |\n| `PROCESS` | CPU-bound pure Python | Models duplicated per worker, higher memory usage |\n\n### Async Extraction\n\nFor async applications, use the async API:\n\n```python\nimport asyncio\nfrom unifex import create_extractor, ExtractorType\n\nasync def extract_document():\n    with create_extractor(\"document.pdf\", ExtractorType.EASYOCR) as extractor:\n        result = await extractor.extract_async(max_workers=4)\n        return result.document\n\ndoc = asyncio.run(extract_document())\n```\n\n### OCR Extraction\n\nLocal OCR engines (EasyOCR, Tesseract, PaddleOCR) and cloud services (Azure Document Intelligence, Google Document AI). All extractors auto-detect file type (PDF vs image) and handle conversion internally.\n\n```python\nfrom unifex import (\n    EasyOcrExtractor, TesseractOcrExtractor, PaddleOcrExtractor,\n    AzureDocumentIntelligenceExtractor, GoogleDocumentAIExtractor,\n)\n\n# Local OCR (works for both images and PDFs)\nwith EasyOcrExtractor(\"scanned.pdf\", languages=[\"en\"], dpi=200) as extractor:\n    result = extractor.extract()\n\n# Tesseract (requires system install: brew install tesseract)\nwith TesseractOcrExtractor(\"image.png\", languages=[\"en\"]) as extractor:\n    result = extractor.extract()\n\n# PaddleOCR (excellent for Chinese)\nwith PaddleOcrExtractor(\"chinese_doc.png\", lang=\"ch\") as extractor:\n    result = extractor.extract()\n\n# Cloud: Azure Document Intelligence\nwith AzureDocumentIntelligenceExtractor(\n    \"document.pdf\",\n    endpoint=\"https://your-resource.cognitiveservices.azure.com\",\n    key=\"your-api-key\",\n) as extractor:\n    result = extractor.extract()\n\n# Cloud: Google Document AI\nwith GoogleDocumentAIExtractor(\n    \"document.pdf\",\n    processor_name=\"projects/your-project/locations/us/processors/id\",\n    credentials_path=\"/path/to/credentials.json\",\n) as extractor:\n    result = extractor.extract()\n```\n\n### LLM Extraction\n\nExtract structured data using vision-capable LLMs (OpenAI, Anthropic, Google, Azure OpenAI). Supports custom prompts, Pydantic schemas, parallel extraction, async API, and OpenAI-compatible endpoints (vLLM, Ollama).\n\n```python\nfrom pydantic import BaseModel\nfrom unifex.llm import extract_structured, extract_structured_async\n\nclass Invoice(BaseModel):\n    invoice_number: str\n    date: str\n    total: float\n\n# Basic extraction with Pydantic schema\nresult = extract_structured(\"invoice.pdf\", model=\"openai/gpt-4o\", schema=Invoice)\ninvoice: Invoice = result.data\n\n# With custom prompt and parallel workers\nresult = extract_structured(\n    \"large_doc.pdf\",\n    model=\"anthropic/claude-sonnet-4-20250514\",\n    prompt=\"Extract invoice details\",\n    max_workers=4,  # Process pages in parallel\n)\n\n# OpenAI-compatible APIs (vLLM, Ollama) with custom headers\nresult = extract_structured(\n    \"document.pdf\",\n    model=\"openai/llava\",\n    base_url=\"http://localhost:11434/v1\",\n    headers={\"X-Custom-Auth\": \"token\"},\n)\n\n# Async API\nresult = await extract_structured_async(\"doc.pdf\", model=\"openai/gpt-4o\", max_workers=4)\n```\n\n## CLI Usage\n\n```bash\n# OCR extractors: pdf, easyocr, tesseract, paddle, azure-di, google-docai\nuv run python -m unifex.cli document.pdf --extractor easyocr --lang en\n\n# Parallel extraction with process executor\nuv run python -m unifex.cli large_doc.pdf --extractor easyocr --max-workers 4 --executor process\n\n# Cloud OCR (credentials via CLI or env vars)\nuv run python -m unifex.cli document.pdf --extractor azure-di \\\n    --azure-endpoint https://your-resource.cognitiveservices.azure.com --azure-key your-key\n\n# LLM extraction with parallel workers and custom endpoint\nuv run python -m unifex.cli document.pdf --llm openai/gpt-4o --max-workers 4 \\\n    --llm-base-url https://your-proxy.com/v1 --llm-header \"X-Auth=token\"\n\n# JSON output, specific pages\nuv run python -m unifex.cli document.pdf --extractor pdf --pages 0,1,2 --json\n```\n\n## Environment Variables\n\nCloud extractors and LLM providers support configuration via environment variables:\n\n**OCR Extractors:**\n\n| Variable | Description |\n|----------|-------------|\n| `UNIFEX_AZURE_DI_ENDPOINT` | Azure Document Intelligence endpoint URL |\n| `UNIFEX_AZURE_DI_KEY` | Azure Document Intelligence API key |\n| `UNIFEX_AZURE_DI_MODEL` | Azure model ID (default: `prebuilt-read`) |\n| `UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME` | Google Document AI processor name |\n| `UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH` | Path to Google service account JSON |\n\n**LLM Providers:**\n\n| Variable | Description |\n|----------|-------------|\n| `OPENAI_API_KEY` | OpenAI API key |\n| `ANTHROPIC_API_KEY` | Anthropic API key |\n| `GOOGLE_API_KEY` | Google AI API key |\n| `AZURE_OPENAI_API_KEY` | Azure OpenAI API key |\n| `AZURE_OPENAI_ENDPOINT` | Azure OpenAI endpoint URL |\n| `AZURE_OPENAI_API_VERSION` | Azure OpenAI API version (default: `2024-02-15-preview`) |\n\n## Development\n\n### Setup\n\n```bash\n# Install dependencies\nuv sync\n\n# Install pre-commit hooks\nuv run pre-commit install\n```\n\n### Running Tests\n\n```bash\n# Run all tests\nuv run pytest\n\n# Run fast tests only (unit tests, \u003c0.5s per test)\nuv run pytest tests/base tests/ocr\n\n# Run integration tests only (slow, load ML models)\nuv run pytest tests/integration\n\n# Run with coverage\nuv run pytest --cov=unifex --cov-report=term-missing\n```\n\n### Test Structure\n\n```\ntests/\n├── base/           # Fast unit tests (\u003c0.5s each) - run in pre-commit\n├── ocr/            # OCR adapter unit tests (mocked) - run in pre-commit\n├── llm/            # LLM unit tests (mocked) - run in pre-commit\n└── integration/    # Slow tests - NOT in pre-commit\n    ├── ocr/        # OCR integration tests (load real ML models)\n    └── llm/        # LLM integration tests (call real APIs)\n```\n\n**Pre-commit runs:** `tests/base`, `tests/ocr`, and `tests/llm` with 0.5s timeout per test.\n\n**CI runs:** All tests including integration tests.\n\n### Integration Tests\n\nIntegration tests load real ML models and call real services. They are in `tests/integration/`.\n\n**Local extractors** (no credentials required):\n- `PdfExtractor` - Tests PDF text extraction\n- `EasyOcrExtractor` - Tests image and PDF OCR with EasyOCR\n- `TesseractOcrExtractor` - Tests image and PDF OCR with Tesseract (requires Tesseract installed)\n- `PaddleOcrExtractor` - Tests image and PDF OCR with PaddleOCR\n\n**Cloud extractors** (require credentials):\n- `AzureDocumentIntelligenceExtractor` - Tests Azure Document Intelligence\n- `GoogleDocumentAIExtractor` - Tests Google Document AI\n\n#### Azure Credentials Setup\n\n1. Copy the example environment file:\n   ```bash\n   cp .env.example .env\n   ```\n\n2. Edit `.env` with your Azure Document Intelligence credentials:\n   ```\n   UNIFEX_AZURE_DI_ENDPOINT=https://your-resource.cognitiveservices.azure.com\n   UNIFEX_AZURE_DI_KEY=your-api-key\n   ```\n\n3. Load environment variables and run tests:\n   ```bash\n   set -a; source .env; set +a;\n   uv run pytest tests/test_integration.py -v\n   ```\n\nAzure integration tests are automatically skipped if credentials are not configured.\n\n#### Google Document AI Credentials Setup\n\n1. Create a Google Cloud project and enable the Document AI API\n2. Create a Document AI processor in the Google Cloud Console\n3. Create a service account with Document AI permissions\n4. Download the service account JSON key file\n\n5. Edit `.env` with your Google Document AI credentials:\n   ```\n   UNIFEX_GOOGLE_DOCAI_PROCESSOR_NAME=projects/your-project/locations/us/processors/your-processor-id\n   UNIFEX_GOOGLE_DOCAI_CREDENTIALS_PATH=/path/to/your/service-account.json\n   ```\n\nGoogle Document AI integration tests are automatically skipped if credentials are not configured.\n\n### Documentation\n\nBuild and serve the documentation locally:\n\n```bash\n# Serve docs with live reload\nuv run mkdocs serve\n\n# Build static site\nuv run mkdocs build\n```\n\nOpen http://localhost:8000 to view the documentation.\n\n### Pre-commit Checks\n\nThe pre-commit hook runs automatically on `git commit`. To run manually:\n\n```bash\nuv run pre-commit run --all-files\n```\n\nThis runs:\n- `ruff format` - Code formatting\n- `ruff check --fix` - Linting with auto-fix\n- `ty check` - Type checking\n- `pytest` - Test suite\n\n## License\n\nBSD 3-Clause License. See [LICENSE](LICENSE) for details.\n\n## Future plans\n\n- Detecting language helper\n- Performance measurement\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faptakhin%2Funifex","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faptakhin%2Funifex","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faptakhin%2Funifex/lists"}