{"id":25059491,"url":"https://github.com/goldziher/kreuzberg","last_synced_at":"2025-05-14T00:11:03.079Z","repository":{"id":275266270,"uuid":"925434317","full_name":"Goldziher/kreuzberg","owner":"Goldziher","description":"A text extraction library supporting PDFs, images, office documents and more","archived":false,"fork":false,"pushed_at":"2025-04-02T08:50:55.000Z","size":12387,"stargazers_count":1736,"open_issues_count":1,"forks_count":57,"subscribers_count":10,"default_branch":"main","last_synced_at":"2025-04-07T11:01:30.167Z","etag":null,"topics":["asyncio","docx","ocr","pdf","text-extraction"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Goldziher.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-31T21:50:02.000Z","updated_at":"2025-04-07T07:57:50.000Z","dependencies_parsed_at":"2025-02-15T14:18:25.463Z","dependency_job_id":"93cfc0b9-3a30-45d0-8222-4ad5ef52bfe3","html_url":"https://github.com/Goldziher/kreuzberg","commit_stats":null,"previous_names":["goldziher/kreuzberg"],"tags_count":17,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Goldziher%2Fkreuzberg","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Goldziher%2Fkreuzberg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Goldziher%2Fkreuzberg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Goldziher%2Fkreuzberg/manifests","ow
ner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Goldziher","download_url":"https://codeload.github.com/Goldziher/kreuzberg/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248889666,"owners_count":21178271,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","docx","ocr","pdf","text-extraction"],"created_at":"2025-02-06T15:06:47.401Z","updated_at":"2025-04-14T13:39:14.865Z","avatar_url":"https://github.com/Goldziher.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Kreuzberg\n\n[![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)\n[![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nKreuzberg is a Python library for text extraction from documents. 
It provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs.\n\n## Why Kreuzberg?\n\n- **Simple and Hassle-Free**: Clean API that just works, without complex configuration\n- **Local Processing**: No external API calls or cloud dependencies required\n- **Resource Efficient**: Lightweight processing without GPU requirements\n- **Format Support**: Comprehensive support for documents, images, and text formats\n- **Multiple OCR Engines**: Support for Tesseract, EasyOCR, and PaddleOCR\n- **Metadata Extraction**: Get document metadata alongside text content\n- **Table Extraction**: Extract tables from documents using the excellent GMFT library\n- **Modern Python**: Built with async/await, type hints, and a functional-first approach\n- **Permissive OSS**: MIT licensed with permissively licensed dependencies\n\n## Quick Start\n\n```bash\npip install kreuzberg\n```\n\nInstall the system dependencies (Tesseract and Pandoc):\n\n```bash\n# Ubuntu/Debian\nsudo apt-get install tesseract-ocr pandoc\n\n# macOS\nbrew install tesseract pandoc\n\n# Windows\nchoco install -y tesseract pandoc\n```\n\nTesseract is the default OCR engine. 
You can opt out of it and either use one of the two alternative OCR engines or disable OCR entirely.\n\n### Alternative OCR engines\n\n```bash\n# Install with EasyOCR support\npip install \"kreuzberg[easyocr]\"\n\n# Install with PaddleOCR support\npip install \"kreuzberg[paddleocr]\"\n```\n\n## Quick Example\n\n```python\nimport asyncio\nfrom kreuzberg import extract_file\n\nasync def main():\n    # Extract text from a PDF\n    result = await extract_file(\"document.pdf\")\n    print(result.content)\n\n    # Extract text from an image\n    result = await extract_file(\"scan.jpg\")\n    print(result.content)\n\n    # Extract text from a Word document\n    result = await extract_file(\"report.docx\")\n    print(result.content)\n\nasyncio.run(main())\n```\n\n## Documentation\n\nFor comprehensive documentation, visit our [GitHub Pages](https://goldziher.github.io/kreuzberg/):\n\n- [Getting Started](https://goldziher.github.io/kreuzberg/getting-started/) - Installation and basic usage\n- [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - In-depth usage information\n- [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Detailed API documentation\n- [Examples](https://goldziher.github.io/kreuzberg/examples/) - Code examples for common use cases\n- [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - Configure OCR engines\n- [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) - Choose the right OCR engine\n\n## Supported Formats\n\nKreuzberg supports a wide range of document formats:\n\n- **Documents**: PDF, DOCX, RTF, TXT, EPUB, etc.\n- **Images**: JPG, PNG, TIFF, BMP, GIF, etc.\n- **Spreadsheets**: XLSX, XLS, CSV, etc.\n- **Presentations**: PPTX, PPT, etc.\n- **Web Content**: HTML, XML, etc.\n\n## OCR Engines\n\nKreuzberg supports multiple OCR engines:\n\n- **Tesseract** (Default): Lightweight, fast startup, requires system installation\n- **EasyOCR**: Good 
for many languages, pure Python, but downloads models on first use\n- **PaddleOCR**: Excellent for Asian languages, pure Python, but downloads models on first use\n\nFor comparison and selection guidance, see the [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) documentation.\n\n## Contribution\n\nThis library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.\n\n### Local Development\n\n- Clone the repo\n- Install the system dependencies\n- Install the full dependencies with `uv sync`\n- Install the pre-commit hooks with: `pre-commit install \u0026\u0026 pre-commit install --hook-type commit-msg`\n- Make your changes and submit a PR\n\n## License\n\nThis library is released under the MIT license.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoldziher%2Fkreuzberg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoldziher%2Fkreuzberg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoldziher%2Fkreuzberg/lists"}