Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
https://github.com/Goldziher/kreuzberg

Host: GitHub
URL: https://github.com/Goldziher/kreuzberg
Owner: Goldziher
Created: 2025-01-31T21:50:02.000Z (11 days ago)
Default Branch: main
Last Pushed: 2025-02-01T08:38:48.000Z (10 days ago)
Last Synced: 2025-02-01T09:27:02.907Z (10 days ago)
Language: Python
Size: 209 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

my-awesome-github-stars - Goldziher/kreuzberg - A text extraction library supporting PDFs, images, office documents and more (Python)
README

# Kreuzberg

Kreuzberg is a library for simplified text extraction from PDFs, images, office documents and more. It's meant to offer simple, hassle-free text extraction.

Why?

I am building, like many do now, a RAG-focused service (check out https://grantflow.ai). I have text extraction needs.

There are quite a lot of commercial options out there, and several open-source + paid options.

But I wanted something simple that does not require expensive round-trips to an external API.

Furthermore, I wanted something that is easy to run locally, is not very heavy, and does not require a GPU.

Hence, this library.

## Features

- Extract text from PDFs, images, office documents and more (see supported formats below)

- Use modern Python with async (via `anyio`) and proper type hints

- Extensive error handling for easy debugging

## Installation

1. Begin by installing the Python package:

   ```shell
   pip install kreuzberg
   ```

2. Install the system dependencies:

- [pandoc](https://pandoc.org/installing.html) (non-pdf text extraction, GPL v2.0 licensed but used via CLI only)

- [tesseract-ocr](https://tesseract-ocr.github.io/) (for image/PDF OCR, Apache License)
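
Because both system dependencies are invoked as external programs, it can help to confirm they are on `PATH` before the first extraction. The following is a minimal sketch using only the standard library; the helper name `missing_tools` is hypothetical and not part of Kreuzberg, and it assumes the executables are named `pandoc` and `tesseract`:

```python
import shutil

# Executable names assumed for the two system dependencies.
REQUIRED_TOOLS = ("pandoc", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print(f"Missing system dependencies: {', '.join(missing)}")
    else:
        print("All system dependencies found.")
```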

## Dependencies and Philosophy

This library is built to be minimalist and simple. It also aims to utilize OSS tools for the job. It's fundamentally a high-order async abstraction on top of other tools; think of it as the library you would bake into your code base, but polished and well maintained.

### Dependencies

- PDFs are processed using pypdfium2 for searchable PDFs + Tesseract OCR for scanned documents

- Images are processed using Tesseract OCR

- Office documents and other formats are processed using Pandoc

- PPTX files are converted using python-pptx

- HTML files are converted using html-to-markdown

- Plain text files are read directly with appropriate encoding detection

### Roadmap

V1:

- [x] - html file text extraction

- [ ] - better PDF table extraction

- [ ] - TBD

V2:

- [ ] - extra install groups (to make dependencies optional)

- [ ] - metadata extraction (possible breaking change)

- [ ] - TBD

### Feature Requests

Feel free to open a GitHub discussion or an issue if you have any feature requests.

### Contribution

Contributions are welcome! Read the guidelines below.

## Supported File Types

Kreuzberg supports a wide range of file formats:

### Document Formats

- PDF (`.pdf`) - both searchable and scanned documents

- Word Documents (`.docx`, `.doc`)

- PowerPoint Presentations (`.pptx`)

- OpenDocument Text (`.odt`)

- Rich Text Format (`.rtf`)

### Image Formats

- JPEG, JPG (`.jpg`, `.jpeg`, `.pjpeg`)

- PNG (`.png`)

- TIFF (`.tiff`, `.tif`)

- BMP (`.bmp`)

- GIF (`.gif`)

- WebP (`.webp`)

- JPEG 2000 (`.jp2`, `.jpx`, `.jpm`, `.mj2`)

- Portable Anymap (`.pnm`)

- Portable Bitmap (`.pbm`)

- Portable Graymap (`.pgm`)

- Portable Pixmap (`.ppm`)

### Text and Markup Formats

- HTML (`.html`, `.htm`)

- Plain Text (`.txt`)

- Markdown (`.md`)

- reStructuredText (`.rst`)

- LaTeX (`.tex`)

### Data Formats

- Comma-Separated Values (`.csv`)

- Tab-Separated Values (`.tsv`)

## Usage

Kreuzberg exports two async functions:

- Extract text from a file (string path or `pathlib.Path`) using `extract_file()`

- Extract text from a byte-string using `extract_bytes()`

### Extract from File

```python
from pathlib import Path

from kreuzberg import extract_file


# Extract text from a PDF file
async def extract_pdf():
    result = await extract_file("document.pdf")
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")


# Extract text from an image
async def extract_image():
    result = await extract_file("scan.png")
    print(f"Extracted text: {result.content}")


# Or pass a pathlib.Path instead of a string
async def extract_pdf_from_path():
    result = await extract_file(Path("document.pdf"))
    print(f"Extracted text: {result.content}")
    print(f"Output mime type: {result.mime_type}")
```
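
Since the extraction functions are coroutines, several files can be processed concurrently. The sketch below is one possible way to do that with a cap on how many extractions run at once, which may matter because OCR is CPU-heavy. The helper `extract_many` and its `limit` parameter are hypothetical and not part of Kreuzberg; the extraction function is passed in as an argument, so `extract_file` would be supplied by the caller:

```python
import asyncio
from typing import Any, Awaitable, Callable, Sequence


async def extract_many(
    paths: Sequence[str],
    extract: Callable[[str], Awaitable[Any]],
    limit: int = 4,
) -> list[Any]:
    """Run `extract` on every path, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> Any:
        async with semaphore:
            return await extract(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(worker(p) for p in paths))
```

Assuming an asyncio event loop, this could be driven from a script with `asyncio.run(extract_many(["document.pdf", "scan.png"], extract_file))`.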

### Extract from Bytes

```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf")
    return result.content


# Extract text from image bytes
async def process_uploaded_image(image_content: bytes):
    result = await extract_bytes(image_content, mime_type="image/jpeg")
    return result.content
```
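
`extract_bytes()` needs a `mime_type`, which an upload handler may not have at hand. If the original file name is known, the standard library can often guess it from the extension. This is a minimal sketch; `guess_mime_type` is a hypothetical helper, and there is no guarantee that every guessed type is one Kreuzberg supports:

```python
import mimetypes
from typing import Optional


def guess_mime_type(filename: str) -> Optional[str]:
    """Guess a MIME type from a file name's extension, or return None."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type
```

A `None` result means the extension was not recognized, in which case the caller has to determine the type some other way before calling `extract_bytes()`.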

### Forcing OCR

When extracting a PDF file or bytes, you might want to force OCR, for example when the PDF contains images with text that should also be extracted.

You can do this by passing `force_ocr=True`:

```python
from kreuzberg import extract_bytes


# Extract text from PDF bytes and force OCR
async def process_uploaded_pdf(pdf_content: bytes):
    result = await extract_bytes(pdf_content, mime_type="application/pdf", force_ocr=True)
    return result.content
```
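
Forcing OCR on every PDF is likely slower than reading an existing text layer, so one option is to try the normal path first and rerun with `force_ocr=True` only when very little text comes back. The sketch below illustrates that idea; `extract_with_ocr_fallback` and the `min_chars` threshold are hypothetical, the threshold is a crude heuristic, and the extraction function is passed in (the caller would supply `extract_bytes`):

```python
from typing import Any, Awaitable, Callable


async def extract_with_ocr_fallback(
    pdf_content: bytes,
    extract: Callable[..., Awaitable[Any]],
    min_chars: int = 1,
) -> Any:
    """Try normal extraction first; rerun with OCR if too little text comes back."""
    result = await extract(pdf_content, mime_type="application/pdf")
    if len(result.content.strip()) >= min_chars:
        return result
    # The first pass produced (almost) nothing, so retry with OCR forced on
    return await extract(pdf_content, mime_type="application/pdf", force_ocr=True)
```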

### Error Handling

Kreuzberg raises two exception types:

#### ValidationError

Raised when there are issues with input validation:

- Unsupported mime types

- Undetectable mime types

- Path doesn't point to an existing file

#### ParsingError

Raised when there are issues during the text extraction process:

- PDF parsing failures

- OCR errors

- Pandoc conversion errors

```python
from kreuzberg import extract_file
from kreuzberg.exceptions import ValidationError, ParsingError


async def safe_extract():
    try:
        result = await extract_file("document.doc")
        return result.content
    except ValidationError as e:
        print(f"Validation error: {e.message}")
        print(f"Context: {e.context}")
    except ParsingError as e:
        print(f"Parsing error: {e.message}")
        print(f"Context: {e.context}")  # Contains detailed error information
```

Both error types include helpful context information for debugging:

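```python
from kreuzberg import extract_file
from kreuzberg.exceptions import ParsingError


async def extract_scanned():
    try:
        result = await extract_file("scanned.pdf")
        return result.content
    except ParsingError as e:
        # e.context might contain:
        # {
        #    "file_path": "scanned.pdf",
        #    "error": "Tesseract OCR failed: Unable to process image"
        # }
        print(f"Context: {e.context}")
```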

### ExtractionResult

All extraction functions return an `ExtractionResult` named tuple containing:

- `content`: The extracted text as a string

- `mime_type`: The mime type of the output (either "text/plain" or, if Pandoc is used, "text/markdown")

```python
from kreuzberg import ExtractionResult, extract_file


async def process_document(path: str) -> str:
    result: ExtractionResult = await extract_file(path)
    return result.content


# Or unpack the result as a tuple
async def process_document_unpacked(path: str) -> str:
    content, mime_type = await extract_file(path)
    # do something with mime_type
    return content
```
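
One practical use of `mime_type` is choosing a file extension when saving the extracted text. The sketch below uses a stand-in named tuple with the two documented fields so that it runs without Kreuzberg installed; the stand-in class and the `output_suffix` helper are illustrative assumptions, not the library's own code:

```python
from typing import NamedTuple


class ExtractionResult(NamedTuple):
    """Stand-in with the two documented fields; not the library's own class."""

    content: str
    mime_type: str


def output_suffix(result: ExtractionResult) -> str:
    """Pick a file extension matching the result's output mime type."""
    return ".md" if result.mime_type == "text/markdown" else ".txt"


result = ExtractionResult(content="# Title", mime_type="text/markdown")
content, mime_type = result  # tuple unpacking works on any named tuple
```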

## Contribution

This library is open to contribution. Feel free to open issues or submit PRs. It's better to discuss issues before submitting PRs to avoid disappointment.

### Local Development

1. Clone the repo

2. Install the system dependencies

3. Install the full dependencies with `uv sync`

4. Install the pre-commit hooks with:

   ```shell
   pre-commit install && pre-commit install --hook-type commit-msg
   ```

5. Make your changes and submit a PR

## License

This library uses the MIT license.