{"id":32748378,"url":"https://github.com/telsho/extrai","last_synced_at":"2025-11-10T03:02:03.735Z","repository":{"id":322310110,"uuid":"1088935633","full_name":"Telsho/Extrai","owner":"Telsho","description":"Structured data extraction with LLM majority vote","archived":false,"fork":false,"pushed_at":"2025-11-03T18:48:52.000Z","size":853,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-03T20:28:04.531Z","etag":null,"topics":["agents","consensus","data-extraction","hallucination-mitigation","llm","majority-vote","sql","sqlalchemy","sqlmodel","structured-data","workflow"],"latest_commit_sha":null,"homepage":"https://www.extrai.xyz","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Telsho.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.rst","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-03T16:54:54.000Z","updated_at":"2025-11-03T20:01:07.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/Telsho/Extrai","commit_stats":null,"previous_names":["telsho/extrai"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Telsho/Extrai","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Telsho%2FExtrai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Telsho%2FExtrai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Telsho%2FExtrai/rel
eases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Telsho%2FExtrai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Telsho","download_url":"https://codeload.github.com/Telsho/Extrai/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Telsho%2FExtrai/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":282529256,"owners_count":26684508,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-03T02:00:05.676Z","response_time":108,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","consensus","data-extraction","hallucination-mitigation","llm","majority-vote","sql","sqlalchemy","sqlmodel","structured-data","workflow"],"created_at":"2025-11-03T21:00:51.282Z","updated_at":"2025-11-03T21:02:01.574Z","avatar_url":"https://github.com/Telsho.png","language":"Python","readme":"# Extrai\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/_static/logo.jpg\" alt=\"Extrai Logo\" width=\"80%\"/\u003e\n\u003c/p\u003e\n\n[![Python CI/CD](https://github.com/Telsho/extrai/actions/workflows/main.yml/badge.svg)](https://github.com/Telsho/Extrai/actions/workflows/main.yml)\n[![codecov](https://codecov.io/gh/Telsho/Extrai/graph/badge.svg?token=4ZITUAFCB4)](https://codecov.io/gh/Telsho/Extrai)\n[![Python 
3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3120/)\n[![MIT License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n\n[![Documentation](https://img.shields.io/badge/Documentation-blue)](https://docs.extrai.xyz)\n\n## 📖 Description\n\nWith `extrai`, you can extract data from text documents with LLMs, which will be formatted into a given `SQLModel` and registered in your database.\n\nThe core of the library is its [Consensus Mechanism](https://docs.extrai.xyz/concepts/consensus_mechanism.html). We make the same request multiple times, using the same or different providers, and then select the values that meet a certain threshold.\n\n`extrai` also has other features, like [generating `SQLModel`s](https://docs.extrai.xyz/how_to/generate_sql_model.html) from a prompt and documents, and [generating few-shot examples](https://docs.extrai.xyz/how_to/generate_example_json.html). For complex, nested data, the library offers [Hierarchical Extraction](https://docs.extrai.xyz/how_to/handle_complex_data_with_hierarchical_extraction.html), breaking down the extraction into manageable, hierarchical steps. 
It also includes [built-in analytics](https://docs.extrai.xyz/analytics_collector.html) to monitor performance and output quality.\n\n## ✨ Key Features\n\n- **[Consensus Mechanism](https://docs.extrai.xyz/concepts/consensus_mechanism.html)**: Improves extraction accuracy by consolidating multiple LLM outputs.\n- **[Dynamic SQLModel Generation](https://docs.extrai.xyz/sqlmodel_generator.html)**: Generate `SQLModel` schemas from natural language descriptions.\n- **[Hierarchical Extraction](https://docs.extrai.xyz/how_to/handle_complex_data_with_hierarchical_extraction.html)**: Handles complex, nested data by breaking down the extraction into manageable, hierarchical steps.\n- **[Extensible LLM Support](https://docs.extrai.xyz/llm_providers.html)**: Integrates with various LLM providers through a client interface.\n- **[Built-in Analytics](https://docs.extrai.xyz/analytics_collector.html)**: Collects metrics on LLM performance and output quality to refine prompts and monitor errors.\n- **[Workflow Orchestration](https://docs.extrai.xyz/workflow_orchestrator.html)**: A central orchestrator to manage the extraction pipeline.\n- **[Example JSON Generation](https://docs.extrai.xyz/example_json_generator.html)**: Automatically generate few-shot examples to improve extraction quality.\n- **[Customizable Prompts](https://docs.extrai.xyz/how_to/customize_extraction_prompts.html)**: Customize prompts at runtime to tailor the extraction process to specific needs.\n- **[Rotating LLM providers](https://docs.extrai.xyz/how_to/using_multiple_llm_providers.html)**: Create the JSON revisions from multiple LLM providers.\n\n## 📚 Documentation\n\nFor a complete guide, please see the full documentation. 
Here are the key sections:\n\n- **Getting Started**\n  - [Introduction](https://docs.extrai.xyz/introduction.html)\n  - [Installation](https://docs.extrai.xyz/installation.html)\n  - [Step-by-Step Tutorial](https://docs.extrai.xyz/getting_started.html)\n- **How-to Guides**\n  - [Generate SQLModel Dynamically](https://docs.extrai.xyz/how_to/generate_sql_model.html)\n  - [Generate Few-shot Examples](https://docs.extrai.xyz/how_to/generate_example_json.html)\n  - [Customize Prompts](https://docs.extrai.xyz/how_to/customize_extraction_prompts.html)\n  - [Handle Complex Data with Hierarchical Extraction](https://docs.extrai.xyz/how_to/handle_complex_data_with_hierarchical_extraction.html)\n  - [Using Multiple LLM Providers](https://docs.extrai.xyz/how_to/using_multiple_llm_providers.html)\n- **Core Concepts**\n  - [Architecture Overview](https://docs.extrai.xyz/concepts/architecture_overview.html)\n  - [Consensus Mechanism](https://docs.extrai.xyz/concepts/consensus_mechanism.html)\n- **Reference**\n  - [Workflow Orchestrator](https://docs.extrai.xyz/workflow_orchestrator.html)\n  - [SQLModel Generator](https://docs.extrai.xyz/sqlmodel_generator.html)\n  - [Example JSON Generator](https://docs.extrai.xyz/example_json_generator.html)\n  - [Analytics Collector](https://docs.extrai.xyz/analytics_collector.html)\n  - [LLM Providers](https://docs.extrai.xyz/llm_providers.html)\n- **API Reference**\n  - [API Documentation](https://docs.extrai.xyz/api/modules.html)\n- **Community**\n  - [Contributing Guide](https://docs.extrai.xyz/contributing.html)\n\n## ⚙️ Workflow Overview\n\nThe library is built around a few key components that work together to manage the extraction workflow. 
The following diagram illustrates the high-level workflow (see [Architecture Overview](https://docs.extrai.xyz/concepts/architecture_overview.html)):\n\n```mermaid\ngraph TD\n    %% Define styles for different stages for better colors\n    classDef inputStyle fill:#f0f9ff,stroke:#0ea5e9,stroke-width:2px,color:#0c4a6e\n    classDef processStyle fill:#eef2ff,stroke:#6366f1,stroke-width:2px,color:#3730a3\n    classDef consensusStyle fill:#fffbeb,stroke:#f59e0b,stroke-width:2px,color:#78350f\n    classDef outputStyle fill:#f0fdf4,stroke:#22c55e,stroke-width:2px,color:#14532d\n    classDef modelGenStyle fill:#fdf4ff,stroke:#a855f7,stroke-width:2px,color:#581c87\n\n    subgraph \"Inputs (Static Mode)\"\n        A[\"📄\u003cbr/\u003eDocuments\"]\n        B[\"🏛️\u003cbr/\u003eSQLAlchemy Models\"]\n        L1[\"🤖\u003cbr/\u003eLLM\"]\n    end\n\n    subgraph \"Inputs (Dynamic Mode)\"\n        C[\"📋\u003cbr/\u003eTask Description\u003cbr/\u003e(User Prompt)\"]\n        D[\"📚\u003cbr/\u003eExample Documents\"]\n        L2[\"🤖\u003cbr/\u003eLLM\"]\n    end\n\n    subgraph \"Model Generation\u003cbr/\u003e(Optional)\"\n        MG(\"🔧\u003cbr/\u003eGenerate SQLModels\u003cbr/\u003evia LLM\")\n    end\n\n    subgraph \"Data Extraction\"\n        EG(\"📝\u003cbr/\u003eExample Generation\u003cbr/\u003e(Optional)\")\n        P(\"✍️\u003cbr/\u003ePrompt Generation\")\n        \n        subgraph \"LLM Extraction Revisions\"\n            direction LR\n            E1(\"🤖\u003cbr/\u003eRevision 1\")\n            H1(\"💧\u003cbr/\u003eSQLAlchemy Hydration 1\")\n            E2(\"🤖\u003cbr/\u003eRevision 2\")\n            H2(\"💧\u003cbr/\u003eSQLAlchemy Hydration 2\")\n            E3(\"🤖\u003cbr/\u003e...\")\n            H3(\"💧\u003cbr/\u003e...\")\n        end\n        \n        F(\"🤝\u003cbr/\u003eJSON Consensus\")\n        H(\"💧\u003cbr/\u003eSQLAlchemy Hydration\")\n    end\n\n    subgraph Outputs\n        SM[\"🏛️\u003cbr/\u003eGenerated SQLModels\u003cbr/\u003e(Optional)\"]\n        
O[\"✅\u003cbr/\u003eHydrated Objects\"]\n        DB(\"💾\u003cbr/\u003eDatabase Persistence\u003cbr/\u003e(Optional)\")\n    end\n\n    %% Connections for Static Mode\n    L1 --\u003e P\n    A --\u003e P\n    B --\u003e EG\n    EG --\u003e P\n    P --\u003e E1\n    P --\u003e E2\n    P --\u003e E3\n    E1 --\u003e H1\n    E2 --\u003e H2\n    E3 --\u003e H3\n    H1 --\u003e F\n    H2 --\u003e F\n    H3 --\u003e F\n    F --\u003e H\n    H --\u003e O\n    H --\u003e DB\n\n    %% Connections for Dynamic Mode\n    L2 --\u003e MG\n    C --\u003e MG\n    D --\u003e MG\n    MG --\u003e EG\n    EG --\u003e P\n\n    MG --\u003e SM\n\n    %% Apply styles\n    class A,B,C,D,L1,L2 inputStyle;\n    class P,E1,E2,E3,H,EG processStyle;\n    class F consensusStyle;\n    class O,DB,SM outputStyle;\n    class MG modelGenStyle;\n```\n\n## ▶️ Getting Started\n\n### 📦 Installation\n\nInstall the library from PyPI:\n\n```bash\npip install extrai-workflow\n```\n\n### ✨ Usage Example\n\nFor a more detailed guide, please see the **[Getting Started Tutorial](https://docs.extrai.xyz/getting_started.html)**.\n\nHere is a minimal example:\n\n```python\nimport asyncio\nfrom typing import Optional\nfrom sqlmodel import Field, SQLModel, create_engine, Session\nfrom extrai.core import WorkflowOrchestrator\nfrom extrai.llm_providers.huggingface_client import HuggingFaceClient\n\n# 1. Define your data model\nclass Product(SQLModel, table=True):\n    id: Optional[int] = Field(default=None, primary_key=True)\n    name: str\n    price: float\n\n# 2. Set up the orchestrator\nllm_client = HuggingFaceClient(api_key=\"YOUR_HF_API_KEY\")\nengine = create_engine(\"sqlite:///:memory:\")\norchestrator = WorkflowOrchestrator(\n    llm_client=llm_client,\n    db_engine=engine,\n    root_model=Product,\n)\n\n# 3. 
Run the extraction and verify\ntext = \"The new SuperWidget costs $99.99.\"\nwith Session(engine) as session:\n    asyncio.run(orchestrator.synthesize_and_save([text], db_session=session))\n    product = session.query(Product).first()\n    print(product)\n    # Expected: name='SuperWidget' price=99.99 id=1\n```\n\n### 🚀 More Examples\n\nFor more in-depth examples, see the [`/examples`](https://github.com/Telsho/Extrai/tree/main/examples) directory in the repository.\n\n## 🙌 Contributing\n\nWe welcome contributions! Please see the **[Contributing Guide](https://docs.extrai.xyz/contributing.html)** for details on how to set up your development environment, run tests, and submit a pull request.\n\n## 📜 License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftelsho%2Fextrai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftelsho%2Fextrai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftelsho%2Fextrai/lists"}