{"id":13574218,"url":"https://github.com/enoch3712/ExtractThinker","last_synced_at":"2025-04-04T14:32:04.498Z","repository":{"id":220509563,"uuid":"751474418","full_name":"enoch3712/ExtractThinker","owner":"enoch3712","description":"ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.","archived":false,"fork":false,"pushed_at":"2025-04-02T01:06:35.000Z","size":20680,"stargazers_count":1174,"open_issues_count":20,"forks_count":113,"subscribers_count":20,"default_branch":"main","last_synced_at":"2025-04-02T09:43:29.193Z","etag":null,"topics":["ai","document-image-analysis","document-intelligence","document-parsing","document-processing","langchain","llm","machine-learning","nlp","ocr","openai","pdf","pdf-to-text","python"],"latest_commit_sha":null,"homepage":"https://enoch3712.github.io/ExtractThinker","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/enoch3712.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-01T17:23:31.000Z","updated_at":"2025-04-02T03:26:22.000Z","dependencies_parsed_at":null,"dependency_job_id":"71343edc-d0c5-42ad-bcd1-ffc33be1b060","html_url":"https://github.com/enoch3712/ExtractThinker","commit_stats":null,"previous_names":["enoch3712/open-docllm"],"tags_count":40,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtractThinker","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtrac
tThinker/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtractThinker/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/enoch3712%2FExtractThinker/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/enoch3712","download_url":"https://codeload.github.com/enoch3712/ExtractThinker/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247194206,"owners_count":20899443,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","document-image-analysis","document-intelligence","document-parsing","document-processing","langchain","llm","machine-learning","nlp","ocr","openai","pdf","pdf-to-text","python"],"created_at":"2024-08-01T15:00:48.277Z","updated_at":"2025-04-04T14:31:59.487Z","avatar_url":"https://github.com/enoch3712.png","language":"Python","funding_links":[],"categories":["Python","开源工具","Building","Document Processing"],"sub_categories":["预处理","Workflows"],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/enoch3712/Open-DocLLM/assets/9283394/41d9d151-acb5-44da-9c10-0058f76c2512\" alt=\"Extract Thinker Logo\" width=\"200\"/\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://medium.com/@enoch3712\"\u003e\n    \u003cimg alt=\"Medium\" src=\"https://img.shields.io/badge/Medium-12100E?style=flat\u0026logo=medium\u0026logoColor=white\" /\u003e\n\u003c/a\u003e\n\u003cimg alt=\"GitHub Last Commit\" 
src=\"https://img.shields.io/github/last-commit/enoch3712/Open-DocLLM\" /\u003e\n\u003cimg alt=\"GitHub License\" src=\"https://img.shields.io/badge/License-Apache%202.0-blue.svg\" /\u003e\n\u003c/p\u003e\n\n# ExtractThinker\n\nLibrary to extract data from files and documents agnostically using LLMs. `extract_thinker` provides ORM-style interaction between files and LLMs, allowing for flexible and powerful document extraction workflows.\n\n## Features\n\n- Supports multiple document loaders, including Tesseract OCR, Azure Form Recognizer, AWS Textract, and Google Document AI.\n- Customizable extraction using contract definitions.\n- Asynchronous processing for efficient document handling.\n- Built-in support for various document formats.\n- ORM-style interaction between files and LLMs.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/enoch3712/Open-DocLLM/assets/9283394/b1b8800c-3c55-4ee5-92fe-b8b663c7a81f\" alt=\"Extract Thinker Features Diagram\" width=\"300\"/\u003e\n\u003c/p\u003e\n\n## Installation\n\nTo install `extract_thinker`, you can use `pip`:\n\n```bash\npip install extract_thinker\n```\n\n## Usage\nHere's a quick example to get you started with extract_thinker. 
This example demonstrates how to load a document using Tesseract OCR and extract specific fields defined in a contract.\n\n```python\nimport os\nfrom dotenv import load_dotenv\nfrom extract_thinker import DocumentLoaderTesseract, Extractor, Contract\n\nload_dotenv()\ncwd = os.getcwd()\n\nclass InvoiceContract(Contract):\n    invoice_number: str\n    invoice_date: str\n\ntesseract_path = os.getenv(\"TESSERACT_PATH\")\ntest_file_path = os.path.join(cwd, \"test_images\", \"invoice.png\")\n\nextractor = Extractor()\nextractor.load_document_loader(\n    DocumentLoaderTesseract(tesseract_path)\n)\nextractor.load_llm(\"claude-3-haiku-20240307\")\n\nresult = extractor.extract(test_file_path, InvoiceContract)\n\nprint(\"Invoice Number: \", result.invoice_number)\nprint(\"Invoice Date: \", result.invoice_date)\n```\n\n## Splitting Files Example\nYou can also split and process documents using extract_thinker. Here's how you can do it:\n\n```python\nimport os\nfrom dotenv import load_dotenv\nfrom extract_thinker import DocumentLoaderTesseract, Extractor, Contract, Process, Classification, ImageSplitter\n\nload_dotenv()\n\nclass DriverLicense(Contract):\n    # Define your DriverLicense contract fields here\n    pass\n\nclass InvoiceContract(Contract):\n    invoice_number: str\n    invoice_date: str\n\nextractor = Extractor()\nextractor.load_document_loader(DocumentLoaderTesseract(os.getenv(\"TESSERACT_PATH\")))\nextractor.load_llm(\"gpt-3.5-turbo\")\n\nclassifications = [\n    Classification(name=\"Driver License\", description=\"This is a driver license\", contract=DriverLicense, extractor=extractor),\n    Classification(name=\"Invoice\", description=\"This is an invoice\", contract=InvoiceContract, extractor=extractor)\n]\n\nprocess = Process()\nprocess.load_document_loader(DocumentLoaderTesseract(os.getenv(\"TESSERACT_PATH\")))\nprocess.load_splitter(ImageSplitter())\n\npath = \"...\"\n\nsplit_content = process.load_file(path)\\\n    .split(classifications)\\\n    .extract()\n\n# 
Process the split_content as needed\n```\n\n## Infrastructure\n\nThe `extract_thinker` project is inspired by the LangChain ecosystem, featuring a modular infrastructure with templates, components, and core functions to facilitate robust document extraction and processing.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://github.com/enoch3712/Open-DocLLM/assets/9283394/996fb2de-0558-4f13-ab3d-7ea56a593951\" alt=\"Extract Thinker Logo\" width=\"400\"/\u003e\n\u003c/p\u003e\n\n## 📖 Examples\n\n| Notebook | Description |\n|----------|-------------|\n| [Basic Usage](examples/notebooks/basic_example.ipynb) | Basic usage of ExtractThinker with PyPDF loader and GPT-4o-mini for invoice data extraction |\n\n## Why Not Just LangChain?\nWhile LangChain is a generalized framework designed for a wide array of use cases, extract_thinker is specifically focused on Intelligent Document Processing (IDP). Although achieving 100% accuracy in IDP remains a challenge, leveraging LLMs brings us significantly closer to this goal.\n\n## Additional Examples\nYou can find more examples in the repository. These examples cover various use cases and demonstrate the flexibility of extract_thinker. Also check the author's Medium, which contains several examples of the library in use.\n\n## Contributing\nWe welcome contributions from the community! If you would like to contribute, please follow these steps:\n\n1. Fork the repository.\n2. Create a new branch for your feature or bugfix.\n3. Write tests for your changes.\n4. Run tests to ensure everything is working correctly.\n5. Submit a pull request with a description of your changes.\n\n## Community\nJúlio Almeida\n    https://pub.towardsai.net/extractthinker-ai-document-intelligence-with-llms-72cbce1890ef\n\n## License\nThis project is licensed under the Apache License 2.0. 
See the LICENSE file for more details.\n\n## Contact\nFor any questions or issues, please open an issue on the GitHub repository.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fenoch3712%2FExtractThinker","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fenoch3712%2FExtractThinker","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fenoch3712%2FExtractThinker/lists"}