https://github.com/goldziher/kreuzberg
A text extraction library supporting PDFs, images, office documents and more
- Host: GitHub
- URL: https://github.com/goldziher/kreuzberg
- Owner: Goldziher
- License: mit
- Created: 2025-01-31T21:50:02.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-04-02T08:50:55.000Z (about 1 year ago)
- Last Synced: 2025-04-07T11:01:30.167Z (about 1 year ago)
- Topics: asyncio, docx, ocr, pdf, text-extraction
- Language: Python
- Homepage:
- Size: 11.8 MB
- Stars: 1,736
- Watchers: 10
- Forks: 57
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.md
- License: LICENSE
README
# Kreuzberg
Kreuzberg is a Python library for text extraction from documents. It provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs.
## Why Kreuzberg?
- **Simple and Hassle-Free**: Clean API that just works, without complex configuration
- **Local Processing**: No external API calls or cloud dependencies required
- **Resource Efficient**: Lightweight processing without GPU requirements
- **Format Support**: Comprehensive support for documents, images, and text formats
- **Multiple OCR Engines**: Support for Tesseract, EasyOCR, and PaddleOCR
- **Metadata Extraction**: Get document metadata alongside text content
- **Table Extraction**: Extract tables from documents using the excellent GMFT library
- **Modern Python**: Built with async/await, type hints, and a functional-first approach
- **Permissive OSS**: MIT licensed with permissively licensed dependencies
## Quick Start
```bash
pip install kreuzberg
```
Install the system dependencies (Tesseract OCR and Pandoc):
```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc
# macOS
brew install tesseract pandoc
# Windows
choco install -y tesseract pandoc
```
Tesseract is the default OCR engine. If you prefer not to use it, you can either use one of the two alternative OCR engines or run without OCR at all.
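To confirm that the system dependencies installed above are visible to Python, a small standard-library check can help. This is a minimal sketch, not part of Kreuzberg: the helper name `missing_binaries` is hypothetical, and it assumes the executables are named `tesseract` and `pandoc` and are expected on `PATH`.

```python
import shutil


def missing_binaries(names=("tesseract", "pandoc")):
    """Return the executables from `names` that cannot be found on PATH."""
    # shutil.which returns None when no matching executable is found.
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found on PATH")
```

If you choose not to use Tesseract, pass only the names you actually need, for example `missing_binaries(("pandoc",))`.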
### Alternative OCR engines
```bash
# Install with EasyOCR support
pip install "kreuzberg[easyocr]"
# Install with PaddleOCR support
pip install "kreuzberg[paddleocr]"
```
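After installing an extra, you may want to verify that the optional OCR package can be located in the current environment. The sketch below uses only the standard library; the helper name `installed_ocr_extras` is hypothetical, and the import names `easyocr` and `paddleocr` are assumed to match the extras shown above.

```python
import importlib.util

# Assumed mapping from extra name to importable module name.
OPTIONAL_OCR_MODULES = {"easyocr": "easyocr", "paddleocr": "paddleocr"}


def installed_ocr_extras(modules=OPTIONAL_OCR_MODULES):
    """Map each extra name to True if its module can be located.

    find_spec only locates the module; it does not import it, so no
    models are loaded or downloaded by this check.
    """
    return {
        extra: importlib.util.find_spec(module) is not None
        for extra, module in modules.items()
    }


if __name__ == "__main__":
    for extra, available in installed_ocr_extras().items():
        print(f"{extra}: {'installed' if available else 'not installed'}")
```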
## Quick Example
```python
import asyncio

from kreuzberg import extract_file


async def main():
    # Extract text from a PDF
    result = await extract_file("document.pdf")
    print(result.content)

    # Extract text from an image
    result = await extract_file("scan.jpg")
    print(result.content)

    # Extract text from a Word document
    result = await extract_file("report.docx")
    print(result.content)


asyncio.run(main())
```
```
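Because `extract_file` is a coroutine, several files can be processed concurrently rather than one after another. The following is a minimal sketch under stated assumptions: `extract_all` is a hypothetical helper, not part of Kreuzberg, and it takes the extraction coroutine as a parameter so that it relies only on the `extract_file(path)` call and the `.content` attribute shown above. The library may offer its own batch helpers; check the API reference before relying on this pattern.

```python
import asyncio


async def extract_all(paths, extract):
    """Run `extract` on every path concurrently and return {path: text}.

    `extract` is any coroutine function that accepts a path and returns
    an object with a `.content` attribute.
    """
    paths = list(paths)  # allow any iterable, since we iterate twice
    # gather preserves input order, so results line up with paths.
    results = await asyncio.gather(*(extract(path) for path in paths))
    return {path: result.content for path, result in zip(paths, results)}


# Usage with Kreuzberg installed (file names are placeholders):
#     from kreuzberg import extract_file
#     texts = asyncio.run(extract_all(["document.pdf", "scan.jpg"], extract_file))
```

Passing the extraction function in as an argument also makes the helper easy to test with a stub, without needing real documents or an OCR engine.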
## Documentation
For comprehensive documentation, visit our [GitHub Pages](https://goldziher.github.io/kreuzberg/):
- [Getting Started](https://goldziher.github.io/kreuzberg/getting-started/) - Installation and basic usage
- [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - In-depth usage information
- [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Detailed API documentation
- [Examples](https://goldziher.github.io/kreuzberg/examples/) - Code examples for common use cases
- [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - Configure OCR engines
- [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) - Choose the right OCR engine
## Supported Formats
Kreuzberg supports a wide range of document formats:
- **Documents**: PDF, DOCX, RTF, TXT, EPUB, etc.
- **Images**: JPG, PNG, TIFF, BMP, GIF, etc.
- **Spreadsheets**: XLSX, XLS, CSV, etc.
- **Presentations**: PPTX, PPT, etc.
- **Web Content**: HTML, XML, etc.
## OCR Engines
Kreuzberg supports multiple OCR engines:
- **Tesseract** (Default): Lightweight, fast startup, requires system installation
- **EasyOCR**: Good for many languages, pure Python, but downloads models on first use
- **PaddleOCR**: Excellent for Asian languages, pure Python, but downloads models on first use
For comparison and selection guidance, see the [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) documentation.
## Contribution
Contributions are welcome. Feel free to open issues or submit PRs. To avoid disappointment, please discuss a change in an issue before submitting a PR for it.
### Local Development
- Clone the repo
- Install the system dependencies
- Install the full dependencies with `uv sync`
- Install the pre-commit hooks with: `pre-commit install && pre-commit install --hook-type commit-msg`
- Make your changes and submit a PR
## License
This library is released under the MIT license.