{"id":23967339,"url":"https://github.com/enoch3712/extractthinker","last_synced_at":"2025-05-14T21:10:20.838Z","repository":{"id":220509563,"uuid":"751474418","full_name":"enoch3712/ExtractThinker","owner":"enoch3712","description":"ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.","archived":false,"fork":false,"pushed_at":"2025-04-02T01:06:35.000Z","size":20680,"stargazers_count":1174,"open_issues_count":20,"forks_count":113,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-04-02T09:43:29.193Z","etag":null,"topics":["ai","document-image-analysis","document-intelligence","document-parsing","document-processing","langchain","llm","machine-learning","nlp","ocr","openai","pdf","pdf-to-text","python"],"latest_commit_sha":null,"homepage":"https://enoch3712.github.io/ExtractThinker","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/enoch3712.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-01T17:23:31.000Z","updated_at":"2025-04-02T03:26:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"71343edc-d0c5-42ad-bcd1-ffc33be1b060","html_url":"https://github.com/enoch3712/ExtractThinker","commit_stats":null,"previous_names":["enoch3712/open-docllm"],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtractThinker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtrac
tThinker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtractThinker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtractThinker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/enoch3712","download_url":"https://codeload.github.com/enoch3712/ExtractThinker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248027406,"owners_count":21035594,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","document-image-analysis","document-intelligence","document-parsing","document-processing","langchain","llm","machine-learning","nlp","ocr","openai","pdf","pdf-to-text","python"],"created_at":"2025-01-06T23:00:33.981Z","updated_at":"2025-04-09T11:04:18.500Z","avatar_url":"https://github.com/enoch3712.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/enoch3712/Open-DocLLM/assets/9283394/41d9d151-acb5-44da-9c10-0058f76c2512\" alt=\"Extract Thinker Logo\" width=\"200\"/\u003e \n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg alt=\"Python Version\" src=\"https://img.shields.io/badge/Python-3.9%2B-blue.svg\" /\u003e\n\u003ca href=\"https://medium.com/@enoch3712\"\u003e\n    \u003cimg alt=\"Medium\" src=\"https://img.shields.io/badge/Medium-12100E?style=flat\u0026logo=medium\u0026logoColor=white\" /\u003e\n\u003c/a\u003e\n\u003cimg alt=\"GitHub Last Commit\" src=\"https://img.shields.io/github/last-commit/enoch3712/Open-DocLLM\" 
/\u003e\n\u003cimg alt=\"Github License\" src=\"https://img.shields.io/badge/License-Apache%202.0-blue.svg\" /\u003e\n\u003c/p\u003e\n\n# ExtractThinker\n\nExtractThinker is a flexible document intelligence tool that leverages Large Language Models (LLMs) to extract and classify structured data from documents, functioning like an ORM for seamless document processing workflows.\n\n**TL;DR Document Intelligence for LLMs**\n\n## 🚀 Key Features\n\n- **Flexible Document Loaders**: Support for multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, Google Document AI, and more.\n- **Customizable Contracts**: Define custom extraction contracts using Pydantic models for precise data extraction.\n- **Advanced Classification**: Classify documents or document sections using custom classifications and strategies.\n- **Asynchronous Processing**: Utilize asynchronous processing for efficient handling of large documents.\n- **Multi-format Support**: Seamlessly work with various document formats like PDFs, images, spreadsheets, and more.\n- **ORM-style Interaction**: Interact with documents and LLMs in an ORM-like fashion for intuitive development.\n- **Splitting Strategies**: Implement lazy or eager splitting strategies to process documents page by page or as a whole.\n- **Integration with LLMs**: Easily integrate with different LLM providers like OpenAI, Anthropic, Cohere, and more.\n- **Community-driven Development**: Inspired by the LangChain ecosystem with a focus on intelligent document processing.\n![image](https://github.com/user-attachments/assets/844b425c-0bb7-4abc-9d08-96e4a736d096)\n\n## 📦 Installation\n\nInstall ExtractThinker using pip:\n\n```bash\npip install extract_thinker\n```\n\n## 🛠️ Usage\n\n### Basic Extraction Example\n\nHere's a quick example to get you started with ExtractThinker. 
This example demonstrates how to load a document using PyPdf and extract specific fields defined in a contract.\n\n```python\nimport os\nfrom dotenv import load_dotenv\nfrom extract_thinker import Extractor, DocumentLoaderPyPdf, Contract\n\nload_dotenv()\n\nclass InvoiceContract(Contract):\n    invoice_number: str\n    invoice_date: str\n\n# Set the path to the document you want to extract from\ntest_file_path = os.path.join(\"path_to_your_files\", \"invoice.pdf\")\n\n# Initialize the extractor\nextractor = Extractor()\nextractor.load_document_loader(DocumentLoaderPyPdf())\nextractor.load_llm(\"gpt-4o-mini\")  # or any other supported model\n\n# Extract data from the document\nresult = extractor.extract(test_file_path, InvoiceContract)\n\nprint(\"Invoice Number:\", result.invoice_number)\nprint(\"Invoice Date:\", result.invoice_date)\n```\n\n### Classification Example\n\nExtractThinker allows you to classify documents or parts of documents using custom classifications:\n\n```python\nimport os\nfrom dotenv import load_dotenv\nfrom extract_thinker import (\n    Extractor, Classification, Process, ClassificationStrategy,\n    DocumentLoaderPyPdf, Contract\n)\n\nload_dotenv()\n\nclass InvoiceContract(Contract):\n    invoice_number: str\n    invoice_date: str\n\nclass DriverLicenseContract(Contract):\n    name: str\n    license_number: str\n\n# Initialize the extractor and load the document loader\nextractor = Extractor()\nextractor.load_document_loader(DocumentLoaderPyPdf())\nextractor.load_llm(\"gpt-4o-mini\")\n\n# Define classifications\nclassifications = [\n    Classification(\n        name=\"Invoice\",\n        description=\"An invoice document\",\n        contract=InvoiceContract,\n        extractor=extractor,\n    ),\n    Classification(\n        name=\"Driver License\",\n        description=\"A driver's license document\",\n        contract=DriverLicenseContract,\n        extractor=extractor,\n    ),\n]\n\n# Classify the document directly using the extractor\nresult = 
extractor.classify(\n    \"path_to_your_document.pdf\",  # Can be a file path or IO stream\n    classifications,\n    image=True  # Set to True for image-based classification\n)\n\n# The result will be a ClassificationResponse object with 'name' and 'confidence' fields\nprint(f\"Document classified as: {result.name}\")\nprint(f\"Confidence level: {result.confidence}\")\n```\n\n### Splitting Files Example\n\nExtractThinker allows you to split and process documents using different strategies. Here's how you can split a document and extract data based on classifications.\n\n```python\nimport os\nfrom dotenv import load_dotenv\nfrom extract_thinker import (\n    Extractor,\n    Process,\n    Classification,\n    ImageSplitter,\n    DocumentLoaderPyPdf,\n    Contract,\n    SplittingStrategy,\n)\n\nload_dotenv()\n\nclass DriverLicenseContract(Contract):\n    name: str\n    license_number: str\n\nclass InvoiceContract(Contract):\n    invoice_number: str\n    invoice_date: str\n\n# Initialize the extractor and load the document loader\nextractor = Extractor()\nextractor.load_document_loader(DocumentLoaderPyPdf())\nextractor.load_llm(\"gpt-4o-mini\")\n\n# Define classifications\nclassifications = [\n    Classification(\n        name=\"Driver License\",\n        description=\"A driver's license document\",\n        contract=DriverLicenseContract,\n        extractor=extractor,\n    ),\n    Classification(\n        name=\"Invoice\",\n        description=\"An invoice document\",\n        contract=InvoiceContract,\n        extractor=extractor,\n    ),\n]\n\n# Initialize the process and load the splitter\nprocess = Process()\nprocess.load_document_loader(DocumentLoaderPyPdf())\nprocess.load_splitter(ImageSplitter(model=\"gpt-4o-mini\"))\n\n# Load and process the document\npath_to_document = \"path_to_your_multipage_document.pdf\"\nsplit_content = (\n    process.load_file(path_to_document)\n    .split(classifications, strategy=SplittingStrategy.LAZY)\n    .extract()\n)\n\n# 
Process the extracted content as needed\nfor item in split_content:\n    if isinstance(item, InvoiceContract):\n        print(\"Extracted Invoice:\")\n        print(\"Invoice Number:\", item.invoice_number)\n        print(\"Invoice Date:\", item.invoice_date)\n    elif isinstance(item, DriverLicenseContract):\n        print(\"Extracted Driver License:\")\n        print(\"Name:\", item.name)\n        print(\"License Number:\", item.license_number)\n\n```\n\n### Batch Processing Example\n\nYou can also perform batch processing of documents:\n\n```python\nfrom extract_thinker import Extractor, Contract\n\nclass ReceiptContract(Contract):\n    store_name: str\n    total_amount: float\n\nextractor = Extractor()\nextractor.load_llm(\"gpt-4o-mini\")\n\n# Path to a single document (a list of file paths or streams also works)\ndocument = \"receipt1.jpg\"\n\nbatch_job = extractor.extract_batch(\n    source=document,\n    response_model=ReceiptContract,\n    vision=True,\n)\n\n# Monitor the batch job status (run this inside an async function, since it uses await)\nprint(\"Batch Job Status:\", await batch_job.get_status())\n\n# Retrieve results once processing is complete\nresults = await batch_job.get_result()\nfor result in results.parsed_results:\n    print(\"Store Name:\", result.store_name)\n    print(\"Total Amount:\", result.total_amount)\n```\n\n### Local LLM Integration Example\n\nExtractThinker supports custom LLM integrations. 
Here's how you can use a custom LLM:\n\n```python\nimport os\n\nfrom extract_thinker import Extractor, LLM, DocumentLoaderTesseract, Contract\n\nclass InvoiceContract(Contract):\n    invoice_number: str\n    invoice_date: str\n\n# Initialize the extractor\nextractor = Extractor()\nextractor.load_document_loader(DocumentLoaderTesseract(os.getenv(\"TESSERACT_PATH\")))\n\n# Load a custom LLM (e.g., Ollama)\nllm = LLM('ollama/phi3', api_base='http://localhost:11434')\nextractor.load_llm(llm)\n\n# Extract data\nresult = extractor.extract(\"invoice.png\", InvoiceContract)\nprint(\"Invoice Number:\", result.invoice_number)\nprint(\"Invoice Date:\", result.invoice_date)\n```\n\n## 📚 Documentation and Resources\n\n- **Examples**: Check out the examples directory for Jupyter notebooks and scripts demonstrating various use cases.\n- **Medium Articles**: Read articles about ExtractThinker on the author's Medium page.\n- **Test Suite**: Explore the test suite in the tests/ directory for more advanced usage examples and test cases.\n\n## 🧩 Integration with LLM Providers\n\nExtractThinker supports integration with multiple LLM providers:\n\n- **OpenAI**: Use models like gpt-3.5-turbo, gpt-4, etc.\n- **Anthropic**: Integrate with Claude models.\n- **Cohere**: Utilize Cohere's language models.\n- **Azure OpenAI**: Connect with Azure's OpenAI services.\n- **Local Models**: Ollama compatible models.\n\n## ⚙️ How It Works\n\nExtractThinker uses a modular architecture inspired by the LangChain ecosystem:\n\n- **Document Loaders**: Responsible for loading and preprocessing documents from various sources and formats.\n- **Extractors**: Orchestrate the interaction between the document loaders and LLMs to extract structured data.\n- **Splitters**: Implement strategies to split documents into manageable chunks for processing.\n- **Contracts**: Define the expected structure of the extracted data using Pydantic models.\n- **Classifications**: Classify documents or document sections to apply appropriate 
extraction contracts.\n- **Processes**: Manage the workflow of loading, classifying, splitting, and extracting data from documents.\n\n![image](https://github.com/user-attachments/assets/b12ba937-20a8-47da-a778-c126bc1748b3)\n\n## 📝 Why Use ExtractThinker?\n\nWhile general frameworks like LangChain offer a broad range of functionalities, ExtractThinker is specialized for Intelligent Document Processing (IDP). It simplifies the complexities associated with IDP by providing:\n\n- **Specialized Components**: Tailored tools for document loading, splitting, and extraction.\n- **High Accuracy with LLMs**: Leverages the power of LLMs to improve the accuracy of data extraction and classification.\n- **Ease of Use**: Intuitive APIs and ORM-style interactions reduce the learning curve.\n- **Community Support**: Active development and support from the community.\n\n## 🤝 Contributing\n\nWe welcome contributions from the community! To contribute:\n\n1. Fork the repository\n2. Create a new branch for your feature or bugfix\n3. Write tests for your changes\n4. Run tests to ensure everything is working correctly\n5. 
Submit a pull request with a description of your changes\n\n## 🌟 Community and Support\n\nStay updated and connect with the community:\n- [Scaling Document Extraction with o1, GPT-4o \u0026 Mini](https://medium.com/towards-artificial-intelligence/scaling-document-extraction-with-o1-gpt4o-and-mini-extractthinker-8f3340b4e69c)\n- [Claude 3.5 — The King of Document Intelligence](https://medium.com/gitconnected/claude-3-5-the-king-of-document-intelligence-f57bea1d209d?sk=124c5abb30c0e7f04313c5e20e79c2d1)\n- [Classification Tree for LLMs](https://medium.com/gitconnected/classification-tree-for-llms-32b69015c5e0?sk=8a258cf74fe3483e68ab164e6b3aaf4c)\n- [Advanced Document Classification with LLMs](https://medium.com/gitconnected/advanced-document-classification-with-llms-8801eaee3c58?sk=f5a22ee72022eb70e112e3e2d1608e79)\n- [Phi-3 and Azure: PDF Data Extraction | ExtractThinker](https://medium.com/towards-artificial-intelligence/phi-3-and-azure-pdf-data-extraction-extractthinker-cb490a095adb?sk=7be7e625b8f9932768442f87dd0ebcec)\n- [ExtractThinker: Document Intelligence for LLMs](https://medium.com/towards-artificial-intelligence/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef)\n\n## 📄 License\n\nThis project is licensed under the Apache License 2.0. See the LICENSE file for more details.\n\n## Contact\n\nFor any questions or issues, please open an issue on the GitHub repository or reach out via email.\n","funding_links":[],"categories":["Langchain"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fenoch3712%2Fextractthinker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fenoch3712%2Fextractthinker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fenoch3712%2Fextractthinker/lists"}