https://github.com/enoch3712/ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

ai document-image-analysis document-intelligence document-parsing document-processing langchain llm machine-learning nlp ocr openai pdf pdf-to-text python

Host: GitHub
URL: https://github.com/enoch3712/ExtractThinker
Owner: enoch3712
License: apache-2.0
Created: 2024-02-01T17:23:31.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-04-02T01:06:35.000Z (about 2 months ago)
Last Synced: 2025-04-02T09:43:29.193Z (about 2 months ago)
Topics: ai, document-image-analysis, document-intelligence, document-parsing, document-processing, langchain, llm, machine-learning, nlp, ocr, openai, pdf, pdf-to-text, python
Language: Python
Homepage: https://enoch3712.github.io/ExtractThinker
Size: 19.7 MB
Stars: 1,174
Watchers: 20
Forks: 113
Open Issues: 20
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

Awesome-LLM-RAG-Application - ExtractThinker
awesome_ai_agents - Extractthinker - ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows. (Building / Workflows)
awesome - enoch3712/ExtractThinker - ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows. (Python)
my-awesome-github-stars - enoch3712/ExtractThinker - ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows. (Python)

README

# ExtractThinker

Library to extract data from files and documents agnostically using LLMs. `extract_thinker` provides ORM-style interaction between files and LLMs, allowing for flexible and powerful document extraction workflows.

## Features

- Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.

- Customizable extraction using contract definitions.

- Asynchronous processing for efficient document handling.

- Built-in support for various document formats.

- ORM-style interaction between files and LLMs.
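
As a rough illustration of the asynchronous processing mentioned above, the sketch below runs several extractions concurrently with a cap on how many are in flight. It uses only the standard library; `extract_document` is a hypothetical placeholder that simulates latency, not part of the `extract_thinker` API, and the file names are made up.

```python
import asyncio


async def extract_document(path: str) -> dict:
    """Hypothetical stand-in for a real extraction call (OCR + LLM)."""
    await asyncio.sleep(0.01)  # simulate I/O-bound latency
    return {"path": path, "status": "done"}


async def extract_many(paths: list[str], max_concurrency: int = 4) -> list[dict]:
    """Extract all documents concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(path: str) -> dict:
        async with semaphore:
            return await extract_document(path)

    # gather() returns results in the same order as the input paths
    return await asyncio.gather(*(bounded(p) for p in paths))


results = asyncio.run(extract_many(["a.png", "b.png", "c.pdf"]))
for item in results:
    print(item["path"], item["status"])
```

Bounding concurrency with a semaphore is one common way to avoid overwhelming an OCR service or hitting LLM rate limits; the right limit depends on the backend in use.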



  



## Installation

To install `extract_thinker`, you can use `pip`:

```bash

pip install extract_thinker

```
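
To confirm that the package is importable in the current environment after installation, a small standard-library check along these lines may help. The helper name `is_installed` is illustrative and not part of the library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("extract_thinker"):
    print("extract_thinker is available")
else:
    print("extract_thinker not found; run: pip install extract_thinker")
```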

## Usage

Here's a quick example to get you started with extract_thinker. This example demonstrates how to load a document using Tesseract OCR and extract specific fields defined in a contract.

```python
import os

from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

# Load environment variables such as TESSERACT_PATH from a .env file
load_dotenv()
cwd = os.getcwd()


# The contract declares the fields to extract from the document
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str


tesseract_path = os.getenv("TESSERACT_PATH")
test_file_path = os.path.join(cwd, "test_images", "invoice.png")

# Configure the extractor with an OCR loader and an LLM
extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))
extractor.load_llm("claude-3-haiku-20240307")

# Run the extraction; the result exposes the contract's fields
result = extractor.extract(test_file_path, InvoiceContract)

print("Invoice Number: ", result.invoice_number)
print("Invoice Date: ", result.invoice_date)
```
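
To make the idea of a contract more concrete, here is a minimal standalone sketch of mapping an LLM's JSON reply onto a typed schema. It uses only the standard library and is not how `extract_thinker` implements `Contract`; the `InvoiceFields` dataclass, the `parse_response` helper, and the sample JSON are all hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class InvoiceFields:
    invoice_number: str
    invoice_date: str


def parse_response(raw_json: str, schema):
    """Build a schema instance from JSON, checking presence and type of each field."""
    data = json.loads(raw_json)
    kwargs = {}
    for field in fields(schema):
        if field.name not in data:
            raise ValueError(f"missing field: {field.name}")
        value = data[field.name]
        if not isinstance(value, field.type):
            raise TypeError(f"{field.name} should be {field.type.__name__}")
        kwargs[field.name] = value
    # Keys not declared in the schema are ignored
    return schema(**kwargs)


reply = '{"invoice_number": "INV-001", "invoice_date": "2024-02-01", "total": 10}'
invoice = parse_response(reply, InvoiceFields)
print(invoice.invoice_number, invoice.invoice_date)
```

Declaring the expected fields up front means a malformed or incomplete model reply fails loudly at the boundary instead of propagating into downstream code.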

## Splitting Files Example

You can also split and process documents using extract_thinker. Here's how you can do it:

```python
import os

from dotenv import load_dotenv
from extract_thinker import (
    Classification,
    Contract,
    DocumentLoaderTesseract,
    Extractor,
    ImageSplitter,
    Process,
)

load_dotenv()


class DriverLicense(Contract):
    # Define your DriverLicense contract fields here
    pass


class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str


extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
extractor.load_llm("gpt-3.5-turbo")

# Each classification pairs a document type with the contract used to extract it
classifications = [
    Classification(name="Driver License", description="This is a driver license", contract=DriverLicense, extractor=extractor),
    Classification(name="Invoice", description="This is an invoice", contract=InvoiceContract, extractor=extractor),
]

process = Process()
process.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
process.load_splitter(ImageSplitter())

path = "..."

# Load the file, split it into classified documents, then extract each one
split_content = (
    process.load_file(path)
    .split(classifications)
    .extract()
)

# Process the split_content as needed
```
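
The general idea behind splitting, classifying each page and grouping consecutive pages of the same type into one document, can be sketched without the library. The example below is a simplified illustration only: a keyword match stands in for the LLM-based classification, and `Label`, `classify_page`, and `split_pages` are hypothetical names that say nothing about how `ImageSplitter` works internally.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass
class Label:
    name: str
    keywords: tuple[str, ...]


def classify_page(text: str, labels: list[Label]) -> str:
    """Return the first label whose keyword appears in the page text."""
    lowered = text.lower()
    for label in labels:
        if any(keyword in lowered for keyword in label.keywords):
            return label.name
    return "Unknown"


def split_pages(pages: list[str], labels: list[Label]) -> list[tuple[str, list[str]]]:
    """Group consecutive pages that share a label into one document."""
    tagged = [(classify_page(page, labels), page) for page in pages]
    return [
        (name, [page for _, page in group])
        for name, group in groupby(tagged, key=lambda pair: pair[0])
    ]


labels = [
    Label("Driver License", ("driver license",)),
    Label("Invoice", ("invoice",)),
]
pages = ["INVOICE #123", "Invoice total: 10", "DRIVER LICENSE class B"]
documents = split_pages(pages, labels)
for name, doc_pages in documents:
    print(name, len(doc_pages))
```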

## Infrastructure

The `extract_thinker` project is inspired by the LangChain ecosystem, featuring a modular infrastructure with templates, components, and core functions to facilitate robust document extraction and processing. 



  



## 📖 Examples

| Notebook | Description |
|----------|-------------|
| [Basic Usage](examples/notebooks/basic_example.ipynb) | Basic usage of ExtractThinker with PyPDF loader and GPT-4o-mini for invoice data extraction |

## Why Not Just LangChain?

While LangChain is a generalized framework designed for a wide array of use cases, extract_thinker is specifically focused on Intelligent Document Processing (IDP). Although achieving 100% accuracy in IDP remains a challenge, leveraging LLMs brings us significantly closer to this goal.

## Additional Examples

You can find more examples in the repository. They cover various use cases and demonstrate the flexibility of extract_thinker. Also check the author's Medium, which contains several examples of the library in use.

## Contributing

We welcome contributions from the community! If you would like to contribute, please follow these steps:

1. Fork the repository.
2. Create a new branch for your feature or bugfix.
3. Write tests for your changes.
4. Run tests to ensure everything is working correctly.
5. Submit a pull request with a description of your changes.

## Community

Júlio Almeida

https://pub.towardsai.net/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef

## License

This project is licensed under the Apache License 2.0. See the LICENSE file for more details.

## Contact

For any questions or issues, please open an issue on the GitHub repository.
