{"id":37082888,"url":"https://github.com/frederico23/docling_ocr","last_synced_at":"2026-01-14T10:01:20.477Z","repository":{"id":283029484,"uuid":"950450132","full_name":"FREDERICO23/docling_ocr","owner":"FREDERICO23","description":"A powerful Python package for extracting text from images and documents using the SmolDocling-256M-preview advanced LLM-based models.","archived":false,"fork":false,"pushed_at":"2025-03-21T04:10:56.000Z","size":27,"stargazers_count":11,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-23T13:52:45.957Z","etag":null,"topics":["llm","ocr","ocr-python","ocr-text-reader"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/FREDERICO23.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-18T07:18:42.000Z","updated_at":"2025-07-31T07:04:50.000Z","dependencies_parsed_at":"2025-03-18T08:30:30.466Z","dependency_job_id":"b5eea232-d272-42de-8acd-6e2c6637ac0a","html_url":"https://github.com/FREDERICO23/docling_ocr","commit_stats":null,"previous_names":["frederico23/docling-ocr"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/FREDERICO23/docling_ocr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FREDERICO23%2Fdocling_ocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FREDERICO23%2Fdocling_ocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FREDERICO23%2Fdocling_ocr/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FREDERICO23%2Fdocling_ocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/FREDERICO23","download_url":"https://codeload.github.com/FREDERICO23/docling_ocr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/FREDERICO23%2Fdocling_ocr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28416495,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T08:38:59.149Z","status":"ssl_error","status_checked_at":"2026-01-14T08:38:43.588Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["llm","ocr","ocr-python","ocr-text-reader"],"created_at":"2026-01-14T10:01:19.626Z","updated_at":"2026-01-14T10:01:20.469Z","avatar_url":"https://github.com/FREDERICO23.png","language":"Python","readme":"# docling_ocr\n\n[![PyPI version](https://badge.fury.io/py/docling_ocr.svg)](https://badge.fury.io/py/docling_ocr)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Python 3.7+](https://img.shields.io/badge/python-3.7+-blue.svg)](https://www.python.org/downloads/release/python-370/)\n\nA powerful Python package for extracting text from images and documents using advanced LLM-based models.\n\n## 
Overview\n\n`docling_ocr` leverages state-of-the-art language models specifically designed for document understanding tasks. Unlike traditional OCR engines that rely solely on character recognition, `docling_ocr` uses language models that understand document context, layouts, and can handle various document formats with high accuracy.\n\nBuilt on top of models like SmolDocling, this package provides a simple, intuitive interface for document text extraction tasks.\n\n## Features\n\n- **LLM-powered extraction**: Uses advanced language models trained specifically for document understanding\n- **Context-aware recognition**: Understands document layouts and context for improved accuracy\n- **Multi-format support**: Works with scanned documents, forms, receipts, and other text-heavy images\n- **Simple API**: Easy-to-use interface with both file and image object inputs\n- **Batch processing**: Process entire directories of documents efficiently\n- **Flexible output options**: Return text or save directly to files\n- **Extensible architecture**: Abstract base class makes it easy to add new models\n\n## Installation\n\n```bash\npip install docling_ocr\n```\n\n### Requirements\n\n- Python 3.7+\n- PyTorch 1.10.0+\n- Transformers 4.15.0+\n- Pillow 8.0.0+\n\n## Quick Start\n\n### Basic Usage\n\n```python\nfrom docling_ocr import SmolDoclingExtractor\n\n# Initialize the extractor\nextractor = SmolDoclingExtractor()\n\n# Extract text from an image file\ntext = extractor.extract_text(\"path/to/document.jpg\")\nprint(text)\n\n# Or use the shorthand callable interface\ntext = extractor(\"path/to/document.jpg\")\n```\n\n### Using with PIL Images\n\n```python\nfrom docling_ocr import SmolDoclingExtractor\nfrom PIL import Image\n\n# Initialize the extractor\nextractor = SmolDoclingExtractor()\n\n# Open image with PIL\nimage = Image.open(\"path/to/document.jpg\")\n\n# Extract text\ntext = extractor.extract_text_from_image(image)\nprint(text)\n```\n\n### Batch 
Processing\n\n```python\nfrom docling_ocr import SmolDoclingExtractor\nfrom docling_ocr.utils import batch_process\n\n# Initialize extractor\nextractor = SmolDoclingExtractor()\n\n# Process all images in a directory\nresults = batch_process(\n    extractor, \n    image_dir=\"path/to/documents/\", \n    output_dir=\"path/to/output/\",\n    extensions=['.jpg', '.png', '.pdf']  # Optional: specify file extensions\n)\n\n# Results contains a dictionary mapping filenames to extracted text\nfor filename, text in results.items():\n    print(f\"File: {filename}\")\n    print(f\"Text: {text[:100]}...\")  # Print first 100 chars\n    print(\"-\" * 50)\n```\n\n## Advanced Usage\n\n### GPU Acceleration\n\nBy default, the extractor will use CUDA if available. You can explicitly specify the device:\n\n```python\n# Use CPU explicitly\nextractor = SmolDoclingExtractor(device=\"cpu\")\n\n# Use specific GPU\nextractor = SmolDoclingExtractor(device=\"cuda:0\")\n```\n\n### Custom Model Configuration\n\nYou can specify a different model from the same family:\n\n```python\n# Use a different model variant\nextractor = SmolDoclingExtractor(model_name=\"ds4sd/SmolDocling-512M\")\n```\n\n### Adjusting Generated Text Length\n\nFor longer documents, you may want to increase the maximum generated text length:\n\n```python\n# Extract with a longer maximum length for complex documents\ntext = extractor.extract_text(\"complex_document.pdf\", max_length=1024)\n```\n\n## Performance Considerations\n\n- Processing time depends on the image size, complexity, and hardware\n- GPU acceleration is recommended for batch processing\n- First initialization loads the model which may take some time\n- Subsequent calls are much faster as the model remains in memory\n\n## Comparison with Traditional OCR\n\n`docling_ocr` differs from traditional OCR engines in several key ways:\n\n| Feature | Traditional OCR | docling_ocr |\n|---------|----------------|-------------|\n| Text Recognition | Character/word based | 
Context-aware language understanding |\n| Layout Understanding | Limited/separate process | Integrated understanding |\n| Language Understanding | Limited | Leverages LLM language capabilities |\n| Format Flexibility | Engine-specific | Adaptable to various formats |\n| Context Retention | Limited | Maintains document context |\n\n## Examples\n\n### Forms and Structured Documents\n\n```python\nfrom docling_ocr import SmolDoclingExtractor\n\nextractor = SmolDoclingExtractor()\nform_text = extractor(\"tax_form.jpg\")\nprint(form_text)\n```\n\n### Tables and Spreadsheets\n\n```python\nspreadsheet_text = extractor(\"financial_data.jpg\")\nprint(spreadsheet_text)\n```\n\n### Receipts and Invoices\n\n```python\nreceipt_text = extractor(\"receipt.jpg\")\nprint(receipt_text)\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n1. Fork the repository\n2. Create your feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. 
Open a Pull Request\n\n## Future Roadmap\n\n- Support for PDF documents with multi-page handling\n- Additional LLM-based extraction models\n- Fine-tuning options for specific document types\n- Structured data extraction (JSON output)\n- Layout-preserving extraction options\n\n## License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## Acknowledgments\n\n- Built on the amazing work of the SmolDocling team for the [SmolDocling-256M-preview model](https://huggingface.co/ds4sd/SmolDocling-256M-preview).\n- Inspired by the growing field of document AI\n- Thanks to the Hugging Face team for making transformers accessible\n\n## Citation\n\nIf you use this package in your research, please cite:\n\n```\n@software{docling_ocr,\n  author = {Adhing'a Fredrick},\n  title = {docling_ocr: LLM-based Document Text Extraction},\n  year = {2025},\n  url = {https://github.com/FREDERICO23/docling_ocr}\n}\n```\n\n## Contact\n\nFor questions and support, please open an issue on the GitHub repository or contact [adhingafredrick@gmail.com](mailto:adhingafredrick@gmail.com).\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrederico23%2Fdocling_ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffrederico23%2Fdocling_ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffrederico23%2Fdocling_ocr/lists"}