Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/enoch3712/ExtractThinker
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
ai llm nlp ocr openai python
- Host: GitHub
- URL: https://github.com/enoch3712/ExtractThinker
- Owner: enoch3712
- License: apache-2.0
- Created: 2024-02-01T17:23:31.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-10-30T16:35:23.000Z (11 days ago)
- Last Synced: 2024-10-30T17:31:39.576Z (11 days ago)
- Topics: ai, llm, nlp, ocr, openai, python
- Language: Python
- Homepage:
- Size: 4.79 MB
- Stars: 326
- Watchers: 9
- Forks: 56
- Open Issues: 11
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- Awesome-LLM-RAG-Application - ExtractThinker
README
# ExtractThinker
Library to extract data from files and documents agnostically using LLMs. `extract_thinker` provides ORM-style interaction between files and LLMs, allowing for flexible and powerful document extraction workflows.
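To make "ORM-style" concrete: a contract is a typed class that names the fields you want back, and the extraction result is an instance of that class. The sketch below imitates that idea with the standard library only. `InvoiceSketch`, `from_llm_json`, and the sample JSON string are illustrative assumptions, not part of `extract_thinker`, and no LLM is called.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class InvoiceSketch:
    """Hypothetical stand-in for a contract: the fields we want extracted."""
    invoice_number: str
    invoice_date: str


def from_llm_json(contract_cls, raw: str):
    """Map a JSON string (as a model might return) onto a contract-like dataclass.

    Raises ValueError if a declared field is missing; extra keys are ignored.
    """
    data = json.loads(raw)
    names = [f.name for f in fields(contract_cls)]
    missing = [n for n in names if n not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return contract_cls(**{n: data[n] for n in names})


# Assumed sample response; a real one would come from the model.
sample = '{"invoice_number": "INV-001", "invoice_date": "2024-01-31", "note": "ignored"}'
invoice = from_llm_json(InvoiceSketch, sample)
print(invoice.invoice_number, invoice.invoice_date)
```

The caller works with typed attributes (`invoice.invoice_number`) instead of raw text, which is the convenience the ORM analogy points at.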
## Features
- Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.
- Customizable extraction using contract definitions.
- Asynchronous processing for efficient document handling.
- Built-in support for various document formats.
- ORM-style interaction between files and LLMs.
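As a rough picture of why asynchronous processing helps with document handling, the sketch below runs several slow, blocking extraction calls concurrently with `asyncio`. It does not use or describe the library's own async API: `fake_extract` and `extract_many` are hypothetical names, and the sleep stands in for OCR and LLM latency. It assumes Python 3.9 or later for `asyncio.to_thread`.

```python
import asyncio
import time


def fake_extract(path: str) -> dict:
    """Hypothetical blocking stand-in for an OCR + LLM extraction call."""
    time.sleep(0.05)  # simulate I/O latency
    return {"path": path, "status": "done"}


async def extract_many(paths: list[str]) -> list[dict]:
    """Run the blocking calls in worker threads and await them together."""
    tasks = [asyncio.to_thread(fake_extract, p) for p in paths]
    # gather returns results in the same order as the inputs
    return await asyncio.gather(*tasks)


results = asyncio.run(extract_many(["a.png", "b.png", "c.png"]))
print([r["path"] for r in results])
```

Because the waits overlap, total time stays close to one call's latency instead of growing with the number of documents.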
## Installation
To install `extract_thinker`, you can use `pip`:
```bash
pip install extract_thinker
```
## Usage
Here's a quick example to get you started with extract_thinker. This example demonstrates how to load a document using Tesseract OCR and extract specific fields defined in a contract.
```python
import os
from dotenv import load_dotenv
from extract_thinker import DocumentLoaderTesseract, Extractor, Contract

# Load environment variables (e.g. TESSERACT_PATH, API keys) from a .env file
load_dotenv()
cwd = os.getcwd()

# The contract declares the fields to extract from the document
class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

tesseract_path = os.getenv("TESSERACT_PATH")
test_file_path = os.path.join(cwd, "test_images", "invoice.png")

# Configure the extractor with an OCR loader and an LLM
extractor = Extractor()
extractor.load_document_loader(
    DocumentLoaderTesseract(tesseract_path)
)
extractor.load_llm("claude-3-haiku-20240307")

# Run the extraction; the result is an InvoiceContract instance
result = extractor.extract(test_file_path, InvoiceContract)
print("Invoice Number: ", result.invoice_number)
print("Invoice Date: ", result.invoice_date)
```
## Splitting Files Example
You can also split and process documents using extract_thinker. Here's how you can do it:
```python
import os
from dotenv import load_dotenv
from extract_thinker import (
    Classification,
    Contract,  # needed for the contract classes below
    DocumentLoaderTesseract,
    Extractor,
    ImageSplitter,
    Process,
)

load_dotenv()

class DriverLicense(Contract):
    # Define your DriverLicense contract fields here
    pass

class InvoiceContract(Contract):
    invoice_number: str
    invoice_date: str

extractor = Extractor()
extractor.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
extractor.load_llm("gpt-3.5-turbo")

# Each classification pairs a document type with its contract and extractor
classifications = [
    Classification(name="Driver License", description="This is a driver license", contract=DriverLicense, extractor=extractor),
    Classification(name="Invoice", description="This is an invoice", contract=InvoiceContract, extractor=extractor)
]

process = Process()
process.load_document_loader(DocumentLoaderTesseract(os.getenv("TESSERACT_PATH")))
process.load_splitter(ImageSplitter())

path = "..."
split_content = process.load_file(path)\
    .split(classifications)\
    .extract()

# Process the split_content as needed
```
## Infrastructure
The `extract_thinker` project is inspired by the LangChain ecosystem, featuring a modular infrastructure with templates, components, and core functions to facilitate robust document extraction and processing.
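One way to picture a modular design of this kind is a pipeline whose stages are interchangeable callables. The sketch below is not the library's internal structure: `PipelineSketch`, `toy_loader`, and `toy_classifier` are hypothetical names, the loader fakes OCR by splitting on form feeds, and the classifier uses keyword matching where a real system would call an LLM.

```python
from dataclasses import dataclass
from typing import Callable

# Each stage is a plain callable, so any one can be swapped independently.
Loader = Callable[[str], list[str]]   # source -> page texts
Classifier = Callable[[str], str]     # page text -> label


@dataclass
class PipelineSketch:
    loader: Loader
    classifier: Classifier

    def run(self, source: str) -> list[tuple[str, list[str]]]:
        """Load pages, label each, and group consecutive pages sharing a label."""
        groups: list[tuple[str, list[str]]] = []
        for page in self.loader(source):
            label = self.classifier(page)
            if groups and groups[-1][0] == label:
                groups[-1][1].append(page)
            else:
                groups.append((label, [page]))
        return groups


def toy_loader(source: str) -> list[str]:
    # Stand-in for OCR: pages are separated by form feeds.
    return source.split("\f")


def toy_classifier(page: str) -> str:
    # Stand-in for an LLM classification call: keyword matching.
    return "Invoice" if "invoice" in page.lower() else "Driver License"


pipeline = PipelineSketch(loader=toy_loader, classifier=toy_classifier)
doc = "Invoice 42\fInvoice 42, page 2\fDriver license: J. Doe"
for label, pages in pipeline.run(doc):
    print(label, len(pages))
```

Swapping `toy_loader` for a different loader, or `toy_classifier` for a model-backed one, leaves `PipelineSketch.run` unchanged, which is the practical benefit of keeping components separate.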
## Examples
| Notebook | Description |
|----------|-------------|
| [Basic Usage](examples/notebooks/basic_example.ipynb) | Basic usage of ExtractThinker with PyPDF loader and GPT-4o-mini for invoice data extraction |
## Why Not Just LangChain?
While LangChain is a generalized framework designed for a wide array of use cases, extract_thinker is specifically focused on Intelligent Document Processing (IDP). Although achieving 100% accuracy in IDP remains a challenge, leveraging LLMs brings us significantly closer to this goal.
## Additional Examples
You can find more examples in the repository. These examples cover various use cases and demonstrate the flexibility of extract_thinker. Also check the author's Medium, which contains several examples using the library.
## Contributing
We welcome contributions from the community! If you would like to contribute, please follow these steps:
1. Fork the repository.
2. Create a new branch for your feature or bugfix.
3. Write tests for your changes.
4. Run tests to ensure everything is working correctly.
5. Submit a pull request with a description of your changes.
## Community
Júlio Almeida
https://pub.towardsai.net/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef
## License
This project is licensed under the Apache License 2.0. See the LICENSE file for more details.
## Contact
For any questions or issues, please open an issue on the GitHub repository.