https://github.com/goldziher/kreuzberg

A text extraction library supporting PDFs, images, office documents and more
https://github.com/goldziher/kreuzberg

asyncio docx ocr pdf text-extraction

Last synced: about 1 year ago
JSON representation

A text extraction library supporting PDFs, images, office documents and more

Host: GitHub
URL: https://github.com/goldziher/kreuzberg
Owner: Goldziher
License: mit
Created: 2025-01-31T21:50:02.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-02T08:50:55.000Z (about 1 year ago)
Last Synced: 2025-04-07T11:01:30.167Z (about 1 year ago)
Topics: asyncio, docx, ocr, pdf, text-extraction
Language: Python
Homepage:
Size: 11.8 MB
Stars: 1,736
Watchers: 10
Forks: 57
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing.md
- License: LICENSE

Awesome Lists containing this project

README

# Kreuzberg

[![PyPI version](https://badge.fury.io/py/kreuzberg.svg)](https://badge.fury.io/py/kreuzberg)
[![Documentation](https://img.shields.io/badge/docs-GitHub_Pages-blue)](https://goldziher.github.io/kreuzberg/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

Kreuzberg is a Python library for text extraction from documents. It provides a unified interface for extracting text from PDFs, images, office documents, and more, with both async and sync APIs.

## Why Kreuzberg?

- **Simple and Hassle-Free**: Clean API that just works, without complex configuration
- **Local Processing**: No external API calls or cloud dependencies required
- **Resource Efficient**: Lightweight processing without GPU requirements
- **Format Support**: Comprehensive support for documents, images, and text formats
- **Multiple OCR Engines**: Support for Tesseract, EasyOCR, and PaddleOCR
- **Metadata Extraction**: Get document metadata alongside text content
- **Table Extraction**: Extract tables from documents using the excellent GMFT library
- **Modern Python**: Built with async/await, type hints, and a functional-first approach
- **Permissive OSS**: MIT licensed with permissively licensed dependencies

## Quick Start

```bash
pip install kreuzberg
```

Install pandoc:

```bash
# Ubuntu/Debian
sudo apt-get install tesseract-ocr pandoc

# macOS
brew install tesseract pandoc

# Windows
choco install -y tesseract pandoc
```

The tesseract OCR engine is the default OCR engine. You can decide not to use it - and then either use one of the two alternative OCR engines, or have no OCR at all.

### Alternative OCR engines

```bash
# Install with EasyOCR support
pip install "kreuzberg[easyocr]"

# Install with PaddleOCR support
pip install "kreuzberg[paddleocr]"
```

## Quick Example

```python
import asyncio
from kreuzberg import extract_file

async def main():
# Extract text from a PDF
result = await extract_file("document.pdf")
print(result.content)

# Extract text from an image
result = await extract_file("scan.jpg")
print(result.content)

# Extract text from a Word document
result = await extract_file("report.docx")
print(result.content)

asyncio.run(main())
```

## Documentation

For comprehensive documentation, visit our [GitHub Pages](https://goldziher.github.io/kreuzberg/):

- [Getting Started](https://goldziher.github.io/kreuzberg/getting-started/) - Installation and basic usage
- [User Guide](https://goldziher.github.io/kreuzberg/user-guide/) - In-depth usage information
- [API Reference](https://goldziher.github.io/kreuzberg/api-reference/) - Detailed API documentation
- [Examples](https://goldziher.github.io/kreuzberg/examples/) - Code examples for common use cases
- [OCR Configuration](https://goldziher.github.io/kreuzberg/user-guide/ocr-configuration/) - Configure OCR engines
- [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) - Choose the right OCR engine

## Supported Formats

Kreuzberg supports a wide range of document formats:

- **Documents**: PDF, DOCX, RTF, TXT, EPUB, etc.
- **Images**: JPG, PNG, TIFF, BMP, GIF, etc.
- **Spreadsheets**: XLSX, XLS, CSV, etc.
- **Presentations**: PPTX, PPT, etc.
- **Web Content**: HTML, XML, etc.

## OCR Engines

Kreuzberg supports multiple OCR engines:

- **Tesseract** (Default): Lightweight, fast startup, requires system installation
- **EasyOCR**: Good for many languages, pure Python, but downloads models on first use
- **PaddleOCR**: Excellent for Asian languages, pure Python, but downloads models on first use

For comparison and selection guidance, see the [OCR Backends](https://goldziher.github.io/kreuzberg/user-guide/ocr-backends/) documentation.

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.

### Local Development

- Clone the repo
- Install the system dependencies
- Install the full dependencies with `uv sync`
- Install the pre-commit hooks with: `pre-commit install && pre-commit install --hook-type commit-msg`
- Make your changes and submit a PR

## License

This library is released under the MIT license.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/goldziher/kreuzberg

Awesome Lists containing this project

README