https://github.com/frederico23/docling_ocr
A Python package for extracting text from images and documents using SmolDocling-256M-preview, an LLM-based document understanding model.
- Host: GitHub
- URL: https://github.com/frederico23/docling_ocr
- Owner: FREDERICO23
- License: mit
- Created: 2025-03-18T07:18:42.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-03-21T04:10:56.000Z (10 months ago)
- Last Synced: 2025-09-23T13:52:45.957Z (4 months ago)
- Topics: llm, ocr, ocr-python, ocr-text-reader
- Language: Python
- Homepage:
- Size: 26.4 KB
- Stars: 11
- Watchers: 1
- Forks: 1
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# docling_ocr
[PyPI version](https://badge.fury.io/py/docling_ocr)
[License: MIT](https://opensource.org/licenses/MIT)
[Python 3.7+](https://www.python.org/downloads/release/python-370/)
A powerful Python package for extracting text from images and documents using advanced LLM-based models.
## Overview
`docling_ocr` builds on language models designed specifically for document understanding tasks. Unlike traditional OCR engines that rely solely on character recognition, `docling_ocr` uses models that take document context and layout into account and can handle a variety of document formats.
Built on top of models like SmolDocling, this package provides a simple, intuitive interface for document text extraction tasks.
## Features
- **LLM-powered extraction**: Uses advanced language models trained specifically for document understanding
- **Context-aware recognition**: Understands document layouts and context for improved accuracy
- **Multi-format support**: Works with scanned documents, forms, receipts, and other text-heavy images
- **Simple API**: Easy-to-use interface with both file and image object inputs
- **Batch processing**: Process entire directories of documents efficiently
- **Flexible output options**: Return text or save directly to files
- **Extensible architecture**: Abstract base class makes it easy to add new models
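The extensible architecture can be pictured with a minimal sketch. The names `BaseExtractor` and `ByteCountExtractor` are hypothetical and chosen for illustration only; the package's actual base class may have a different name and different abstract methods. The method names `extract_text` and `extract_text_from_image` mirror those used in the Quick Start examples, and the byte-reading loader is a stand-in so the sketch runs without extra dependencies.

```python
from abc import ABC, abstractmethod


class BaseExtractor(ABC):
    """Hypothetical base class; not the package's actual definition."""

    @abstractmethod
    def extract_text_from_image(self, image):
        """Return the text found in an already-loaded image object."""

    def load_image(self, image_path):
        # Stand-in loader that reads raw bytes; a real one would likely use PIL.
        with open(image_path, "rb") as handle:
            return handle.read()

    def extract_text(self, image_path):
        # Assumed default: load the file, then delegate to the image method.
        return self.extract_text_from_image(self.load_image(image_path))

    def __call__(self, image_path):
        # Shorthand so an instance can be called like a function.
        return self.extract_text(image_path)


class ByteCountExtractor(BaseExtractor):
    """Toy subclass showing where a new model would plug in."""

    def extract_text_from_image(self, image):
        return f"{len(image)} bytes received"
```

Under this design, a new model only needs to implement `extract_text_from_image`; file handling and the callable shorthand are inherited.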
## Installation
```bash
pip install docling_ocr
```
### Requirements
- Python 3.7+
- PyTorch 1.10.0+
- Transformers 4.15.0+
- Pillow 8.0.0+
## Quick Start
### Basic Usage
```python
from docling_ocr import SmolDoclingExtractor
# Initialize the extractor
extractor = SmolDoclingExtractor()
# Extract text from an image file
text = extractor.extract_text("path/to/document.jpg")
print(text)
# Or use the shorthand callable interface
text = extractor("path/to/document.jpg")
```
### Using with PIL Images
```python
from docling_ocr import SmolDoclingExtractor
from PIL import Image
# Initialize the extractor
extractor = SmolDoclingExtractor()
# Open image with PIL
image = Image.open("path/to/document.jpg")
# Extract text
text = extractor.extract_text_from_image(image)
print(text)
```
### Batch Processing
```python
from docling_ocr import SmolDoclingExtractor
from docling_ocr.utils import batch_process
# Initialize extractor
extractor = SmolDoclingExtractor()
# Process all images in a directory
results = batch_process(
    extractor,
    image_dir="path/to/documents/",
    output_dir="path/to/output/",
    extensions=['.jpg', '.png', '.pdf']  # Optional: specify file extensions
)
# Results contains a dictionary mapping filenames to extracted text
for filename, text in results.items():
    print(f"File: {filename}")
    print(f"Text: {text[:100]}...")  # Print first 100 chars
    print("-" * 50)
```
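One plausible reading of the `extensions` argument is a case-insensitive filter on file suffixes. The sketch below shows that idea using only the standard library. `find_documents` is a hypothetical helper, not part of `docling_ocr`, and the real `batch_process` may select files differently.

```python
from pathlib import Path


def find_documents(image_dir, extensions=(".jpg", ".png")):
    """Return sorted paths in image_dir whose suffix matches one of extensions."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(image_dir).iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Each returned path could then be passed to `extractor.extract_text(str(path))`, as in the Basic Usage example.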
## Advanced Usage
### GPU Acceleration
By default, the extractor will use CUDA if available. You can explicitly specify the device:
```python
# Use CPU explicitly
extractor = SmolDoclingExtractor(device="cpu")
# Use specific GPU
extractor = SmolDoclingExtractor(device="cuda:0")
```
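Falling back from CUDA to CPU is commonly implemented with `torch.cuda.is_available()`. The sketch below shows that pattern. `resolve_device` is a hypothetical helper, and whether the package resolves its default in exactly this way is an assumption.

```python
def resolve_device(device=None):
    """Return an explicit device string, or pick CUDA when PyTorch reports it."""
    if device is not None:
        return device  # Respect an explicit choice such as "cpu" or "cuda:0".
    try:
        import torch
    except ImportError:
        return "cpu"  # Without PyTorch installed there is nothing to query.
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The result could be passed straight through, for example `SmolDoclingExtractor(device=resolve_device())`.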
### Custom Model Configuration
You can specify a different model from the same family:
```python
# Use a different model variant
extractor = SmolDoclingExtractor(model_name="ds4sd/SmolDocling-512M")
```
### Adjusting Generated Text Length
For longer documents, you may want to increase the maximum generated text length:
```python
# Extract with a longer maximum length for complex documents
text = extractor.extract_text("complex_document.pdf", max_length=1024)
```
## Performance Considerations
- Processing time depends on the image size, complexity, and hardware
- GPU acceleration is recommended for batch processing
- The first initialization loads the model, which may take some time
- Subsequent calls are much faster because the model remains in memory
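Because the model stays in memory after the first load, it usually pays to create one extractor and reuse it. The sketch below illustrates that reuse pattern with `functools.lru_cache`. `SlowLoader` is a stand-in class that simulates load time with `time.sleep`, and `get_extractor` is a hypothetical helper; in real code the factory would return `SmolDoclingExtractor()` instead.

```python
import time
from functools import lru_cache


class SlowLoader:
    """Stand-in for an extractor whose constructor loads model weights."""

    def __init__(self):
        time.sleep(0.2)  # Simulated model-loading cost.


@lru_cache(maxsize=1)
def get_extractor():
    # The body runs once; later calls return the cached instance.
    return SlowLoader()


start = time.perf_counter()
first = get_extractor()
first_call = time.perf_counter() - start

start = time.perf_counter()
second = get_extractor()
second_call = time.perf_counter() - start

print(f"first call: {first_call:.3f}s, second call: {second_call:.3f}s")
```

Both calls return the same object, so only the first one pays the loading cost.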
## Comparison with Traditional OCR
`docling_ocr` differs from traditional OCR engines in several key ways:
| Feature | Traditional OCR | docling_ocr |
|---------|----------------|-------------|
| Text Recognition | Character/word based | Context-aware language understanding |
| Layout Understanding | Limited/separate process | Integrated understanding |
| Language Understanding | Limited | Leverages LLM language capabilities |
| Format Flexibility | Engine-specific | Adaptable to various formats |
| Context Retention | Limited | Maintains document context |
## Examples
### Forms and Structured Documents
```python
from docling_ocr import SmolDoclingExtractor
extractor = SmolDoclingExtractor()
form_text = extractor("tax_form.jpg")
print(form_text)
```
### Tables and Spreadsheets
```python
spreadsheet_text = extractor("financial_data.jpg")
print(spreadsheet_text)
```
### Receipts and Invoices
```python
receipt_text = extractor("receipt.jpg")
print(receipt_text)
```
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## Future Roadmap
- Support for PDF documents with multi-page handling
- Additional LLM-based extraction models
- Fine-tuning options for specific document types
- Structured data extraction (JSON output)
- Layout-preserving extraction options
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Acknowledgments
- Built on the amazing work of the SmolDocling team for the [SmolDocling-256M-preview model](https://huggingface.co/ds4sd/SmolDocling-256M-preview).
- Inspired by the growing field of document AI
- Thanks to the HuggingFace team for making transformers accessible
## Citation
If you use this package in your research, please cite:
```bibtex
@software{docling_ocr,
  author = {Adhing'a Fredrick},
  title = {docling_ocr: LLM-based Document Text Extraction},
  year = {2025},
  url = {https://github.com/FREDERICO23/docling_ocr}
}
```
## Contact
For questions and support, please open an issue on the GitHub repository or contact [adhingafredrick@gmail.com](mailto:adhingafredrick@gmail.com).