https://github.com/frederico23/docling_ocr

A Python package for extracting text from images and documents using the LLM-based SmolDocling-256M-preview model.

Host: GitHub
URL: https://github.com/frederico23/docling_ocr
Owner: FREDERICO23
License: mit
Created: 2025-03-18T07:18:42.000Z (10 months ago)
Default Branch: main
Last Pushed: 2025-03-21T04:10:56.000Z (10 months ago)
Last Synced: 2025-09-23T13:52:45.957Z (4 months ago)
Topics: llm, ocr, ocr-python, ocr-text-reader
Language: Python
Homepage:
Size: 26.4 KB
Stars: 11
Watchers: 1
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE


# docling_ocr

[![PyPI version](https://badge.fury.io/py/docling_ocr.svg)](https://badge.fury.io/py/docling_ocr)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)

A powerful Python package for extracting text from images and documents using advanced LLM-based models.

## Overview

`docling_ocr` leverages language models designed specifically for document understanding. Unlike traditional OCR engines, which rely solely on character recognition, these models take document context and layout into account and can handle a variety of document formats.

Built on top of models like SmolDocling, this package provides a simple, intuitive interface for document text extraction tasks.

## Features

- **LLM-powered extraction**: Uses advanced language models trained specifically for document understanding

- **Context-aware recognition**: Understands document layouts and context for improved accuracy

- **Multi-format support**: Works with scanned documents, forms, receipts, and other text-heavy images

- **Simple API**: Easy-to-use interface with both file and image object inputs

- **Batch processing**: Process entire directories of documents efficiently

- **Flexible output options**: Return text or save directly to files

- **Extensible architecture**: Abstract base class makes it easy to add new models
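The extensible architecture mentioned in the last item can be pictured with a minimal sketch. The class names below (`BaseExtractor`, `FilenameEchoExtractor`) are hypothetical and chosen only for illustration; the actual base class in `docling_ocr` may be named and shaped differently.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class BaseExtractor(ABC):
    """Hypothetical base class; not the actual docling_ocr API."""

    @abstractmethod
    def extract_text(self, image_path: str) -> str:
        """Return the text found in the image at image_path."""

    def __call__(self, image_path: str) -> str:
        # Shorthand so that extractor(path) behaves like extractor.extract_text(path)
        return self.extract_text(image_path)


class FilenameEchoExtractor(BaseExtractor):
    """Toy subclass standing in for a real model-backed extractor."""

    def extract_text(self, image_path: str) -> str:
        return f"[text from {Path(image_path).name}]"


extractor = FilenameEchoExtractor()
print(extractor("path/to/document.jpg"))  # [text from document.jpg]
```

Under this assumed design, a new model only needs to implement `extract_text`; the callable shorthand comes from the base class.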

## Installation

```bash
pip install docling_ocr
```

### Requirements

- Python 3.7+

- PyTorch 1.10.0+

- Transformers 4.15.0+

- Pillow 8.0.0+
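To check an existing environment against these minimums, a small script along the following lines may help. It is only a sketch: the helper names `version_tuple` and `check_requirements` are made up for this example, the distribution names (`torch`, `transformers`, `Pillow`) are assumed to match those used on PyPI, and `importlib.metadata` requires Python 3.8 or later.

```python
import re
from importlib import metadata  # standard library, Python 3.8+

# Minimum versions taken from the requirements list above
MINIMUMS = {"torch": "1.10.0", "transformers": "4.15.0", "Pillow": "8.0.0"}


def version_tuple(version: str) -> tuple:
    """Turn '2.1.0+cu118' into (2, 1, 0); non-numeric suffixes are ignored."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad short versions such as '4.15'
        numbers.append(0)
    return tuple(numbers)


def check_requirements(minimums=MINIMUMS) -> dict:
    """Map each package name to 'ok', 'missing', or a 'too old' message."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        if version_tuple(installed) >= version_tuple(minimum):
            report[name] = "ok"
        else:
            report[name] = f"too old ({installed} < {minimum})"
    return report


for package, status in check_requirements().items():
    print(f"{package}: {status}")
```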

## Quick Start

### Basic Usage

```python
from docling_ocr import SmolDoclingExtractor

# Initialize the extractor
extractor = SmolDoclingExtractor()

# Extract text from an image file
text = extractor.extract_text("path/to/document.jpg")
print(text)

# Or use the shorthand callable interface
text = extractor("path/to/document.jpg")
```

### Using with PIL Images

```python
from docling_ocr import SmolDoclingExtractor
from PIL import Image

# Initialize the extractor
extractor = SmolDoclingExtractor()

# Open image with PIL
image = Image.open("path/to/document.jpg")

# Extract text
text = extractor.extract_text_from_image(image)
print(text)
```

### Batch Processing

```python
from docling_ocr import SmolDoclingExtractor
from docling_ocr.utils import batch_process

# Initialize extractor
extractor = SmolDoclingExtractor()

# Process all images in a directory
results = batch_process(
    extractor,
    image_dir="path/to/documents/",
    output_dir="path/to/output/",
    extensions=['.jpg', '.png', '.pdf']  # Optional: specify file extensions
)

# Results contains a dictionary mapping filenames to extracted text
for filename, text in results.items():
    print(f"File: {filename}")
    print(f"Text: {text[:100]}...")  # Print first 100 chars
    print("-" * 50)
```
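For readers curious what a helper like `batch_process` might do internally, here is a rough, self-contained sketch. It is not the package's implementation: `batch_process_sketch` and the stand-in `fake_extractor` are hypothetical, and the choices to write one `.txt` file per input and to key the result by file name are assumptions.

```python
import tempfile
from pathlib import Path


def batch_process_sketch(extractor, image_dir, output_dir=None,
                         extensions=(".jpg", ".png")):
    """Run extractor on every matching file in image_dir (non-recursive)."""
    allowed = {ext.lower() for ext in extensions}
    results = {}
    for path in sorted(Path(image_dir).iterdir()):
        if not path.is_file() or path.suffix.lower() not in allowed:
            continue
        text = extractor(str(path))
        results[path.name] = text
        if output_dir is not None:
            out_dir = Path(output_dir)
            out_dir.mkdir(parents=True, exist_ok=True)
            # Assumed naming scheme: document.jpg -> document.txt
            (out_dir / f"{path.stem}.txt").write_text(text, encoding="utf-8")
    return results


def fake_extractor(path):
    """Stand-in for a real extractor so the sketch runs without a model."""
    return f"text from {Path(path).name}"


with tempfile.TemporaryDirectory() as tmp:
    docs = Path(tmp) / "docs"
    docs.mkdir()
    (docs / "a.jpg").touch()
    (docs / "notes.md").touch()  # skipped: extension not in the allowed set
    results = batch_process_sketch(fake_extractor, docs, Path(tmp) / "out")
    print(results)  # {'a.jpg': 'text from a.jpg'}
```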

## Advanced Usage

### GPU Acceleration

By default, the extractor will use CUDA if available. You can explicitly specify the device:

```python
from docling_ocr import SmolDoclingExtractor

# Use CPU explicitly
extractor = SmolDoclingExtractor(device="cpu")

# Use specific GPU
extractor = SmolDoclingExtractor(device="cuda:0")
```
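The default described above presumably amounts to a check like the one below. This is a sketch, not the package's code: `pick_device` is a hypothetical helper, and only `torch.cuda.is_available()` is an actual PyTorch call. The helper falls back to the CPU when PyTorch is not installed.

```python
def pick_device(preferred=None):
    """Return preferred if given, else 'cuda' when available, else 'cpu'."""
    if preferred is not None:
        return preferred
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch, so no CUDA either
    return "cuda" if torch.cuda.is_available() else "cpu"


print(pick_device())       # 'cuda' or 'cpu', depending on the machine
print(pick_device("cpu"))  # cpu
```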

### Custom Model Configuration

You can specify a different model from the same family:

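```python
from docling_ocr import SmolDoclingExtractor

# Use a different model variant
# (illustrative model ID; substitute one that is available on the Hugging Face Hub)
extractor = SmolDoclingExtractor(model_name="ds4sd/SmolDocling-512M")
```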

### Adjusting Generated Text Length

For longer documents, you may want to increase the maximum generated text length:

```python
# Reuses the `extractor` created in the earlier examples
# Extract with a longer maximum length for complex documents
text = extractor.extract_text("complex_document.pdf", max_length=1024)
```

## Performance Considerations

- Processing time depends on the image size, complexity, and hardware

- GPU acceleration is recommended for batch processing

- The first initialization loads the model, which may take some time

- Subsequent calls are much faster as the model remains in memory
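Because model loading dominates the first call, long-running programs may want to construct the extractor once and reuse it. One way to do that is to cache a factory function, sketched below with a stand-in class so that it runs without downloading a model. `StandInExtractor` and `get_extractor` are hypothetical names; in real use the factory would return a `SmolDoclingExtractor`.

```python
from functools import lru_cache


class StandInExtractor:
    """Placeholder for SmolDoclingExtractor; counts how often it is built."""

    instances = 0

    def __init__(self, device="cpu"):
        StandInExtractor.instances += 1  # a real extractor would load weights here
        self.device = device

    def __call__(self, path):
        return f"text from {path}"


@lru_cache(maxsize=None)
def get_extractor(device="cpu"):
    """Build the extractor on first use; later calls return the same object."""
    return StandInExtractor(device=device)


first = get_extractor("cpu")
second = get_extractor("cpu")
print(first is second)             # True
print(StandInExtractor.instances)  # 1
```

Note that `lru_cache` keys on the arguments as passed, so calling `get_extractor()` and `get_extractor("cpu")` may create two separate instances; passing the device the same way each time avoids that.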

## Comparison with Traditional OCR

`docling_ocr` differs from traditional OCR engines in several key ways:

| Feature | Traditional OCR | docling_ocr |
|---------|----------------|-------------|
| Text Recognition | Character/word based | Context-aware language understanding |
| Layout Understanding | Limited/separate process | Integrated understanding |
| Language Understanding | Limited | Leverages LLM language capabilities |
| Format Flexibility | Engine-specific | Adaptable to various formats |
| Context Retention | Limited | Maintains document context |

## Examples

### Forms and Structured Documents

```python
from docling_ocr import SmolDoclingExtractor

extractor = SmolDoclingExtractor()

form_text = extractor("tax_form.jpg")
print(form_text)
```

### Tables and Spreadsheets

```python
# Reuses the `extractor` created in the previous example
spreadsheet_text = extractor("financial_data.jpg")
print(spreadsheet_text)
```

### Receipts and Invoices

```python
# Reuses the `extractor` created in the first example of this section
receipt_text = extractor("receipt.jpg")
print(receipt_text)
```

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository

2. Create your feature branch (`git checkout -b feature/amazing-feature`)

3. Commit your changes (`git commit -m 'Add some amazing feature'`)

4. Push to the branch (`git push origin feature/amazing-feature`)

5. Open a Pull Request

## Future Roadmap

- Support for PDF documents with multi-page handling

- Additional LLM-based extraction models

- Fine-tuning options for specific document types

- Structured data extraction (JSON output)

- Layout-preserving extraction options

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Acknowledgments

- Built on the amazing work of the SmolDocling team for the [SmolDocling-256M-preview model](https://huggingface.co/ds4sd/SmolDocling-256M-preview).

- Inspired by the growing field of document AI

- Thanks to the HuggingFace team for making transformers accessible

## Citation

If you use this package in your research, please cite:

```bibtex
@software{docling_ocr,
  author = {Adhing'a Fredrick},
  title = {docling_ocr: LLM-based Document Text Extraction},
  year = {2025},
  url = {https://github.com/FREDERICO23/docling_ocr}
}
```

## Contact

For questions and support, please open an issue on the GitHub repository or contact [adhingafredrick@gmail.com](mailto:adhingafredrick@gmail.com).
