{"id":26145191,"url":"https://github.com/blacksuan19/structx","last_synced_at":"2025-08-10T08:06:42.124Z","repository":{"id":280716047,"uuid":"931905301","full_name":"Blacksuan19/structx","owner":"Blacksuan19","description":"Type-safe structured data extraction from text using LLMs.","archived":false,"fork":false,"pushed_at":"2025-08-01T01:25:57.000Z","size":1611,"stargazers_count":5,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-01T01:37:14.539Z","etag":null,"topics":["instructor","litellm","llm","rag"],"latest_commit_sha":null,"homepage":"https://structx.blacksuan19.dev","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Blacksuan19.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"docs/contributing.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-13T03:34:15.000Z","updated_at":"2025-08-01T01:26:01.000Z","dependencies_parsed_at":"2025-03-04T23:31:34.444Z","dependency_job_id":"6552c2f1-fc20-4cc4-b90e-7de3df3fa1cc","html_url":"https://github.com/Blacksuan19/structx","commit_stats":null,"previous_names":["blacksuan19/structx"],"tags_count":44,"template":false,"template_full_name":null,"purl":"pkg:github/Blacksuan19/structx","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blacksuan19%2Fstructx","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blacksuan19%2Fstructx/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blacksuan19%2Fstructx/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/G
itHub/repositories/Blacksuan19%2Fstructx/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Blacksuan19","download_url":"https://codeload.github.com/Blacksuan19/structx/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Blacksuan19%2Fstructx/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":269693593,"owners_count":24460248,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-10T02:00:08.965Z","response_time":71,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["instructor","litellm","llm","rag"],"created_at":"2025-03-11T04:38:27.330Z","updated_at":"2025-08-10T08:06:42.093Z","avatar_url":"https://github.com/Blacksuan19.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# structx\n\nAdvanced structured data extraction from any document using LLMs with multimodal\nsupport.\n\n[![Documentation](https://img.shields.io/badge/docs-mkdocs-blue.svg?style=for-the-badge)](https://structx.blacksuan19.dev \"Documentation\")\n[![PyPI](https://img.shields.io/badge/PyPi-0.4.6-blue?style=for-the-badge)](https://pypi.org/project/structx-llm \"Package\")\n[![GitHub Actions](https://img.shields.io/badge/github%20actions-%232671E5.svg?style=for-the-badge\u0026logo=githubactions\u0026logoColor=white)](# \"Build with GitHub Actions\")\n\n`structx` is a powerful Python 
library for extracting structured data from any\ndocument or text using Large Language Models (LLMs). It features an innovative\nmultimodal PDF processing pipeline that converts any document to PDF and uses\ninstructor's vision capabilities for superior extraction quality.\n\n## ✨ Key Features\n\n### 🎯 **Advanced Document Processing**\n\n- **Multimodal PDF Pipeline**: Converts any document (TXT, DOCX, etc.) to PDF\n  for optimal extraction\n- **🖼️ Vision-Enabled Extraction**: Native instructor multimodal support for\n  PDFs and images\n- **🔄 Smart Format Detection**: Automatic processing mode selection for best\n  results\n- **📊 Universal File Support**: CSV, Excel, JSON, Parquet, PDF, DOCX, TXT,\n  Markdown, and more\n\n### 🚀 **Intelligent Data Extraction**\n\n- **🔄 Dynamic Model Generation**: Create type-safe Pydantic models from natural\n  language queries\n- **🎯 Automatic Schema Inference**: Intelligent schema generation and\n  refinement\n- **📊 Complex Data Structures**: Support for nested and hierarchical data\n- **🔄 Natural Language Refinement**: Improve models with conversational\n  instructions\n\n### ⚡ **Performance \u0026 Reliability**\n\n- **🚀 High-Performance Processing**: Multi-threaded and async operations\n- **🔄 Robust Error Handling**: Automatic retry mechanism with exponential\n  backoff\n- **📈 Token Usage Tracking**: Detailed step-by-step metrics for cost monitoring\n- **Flexible Configuration**: Configurable extraction using OmegaConf\n- **🔌 Multiple LLM Providers**: Support through litellm integration\n\n## Installation\n\n```bash\n# Core package with basic extraction capabilities\npip install structx-llm\n```\n\n### 📄 Enhanced Document Processing (Recommended)\n\nFor the best experience with all document types including advanced multimodal\nPDF processing:\n\n```bash\n# Complete document processing support\npip install structx-llm[docs]\n\n# Individual components\npip install structx-llm[pdf]   # PDF processing with multimodal support\npip 
install structx-llm[docx]  # Advanced DOCX conversion via docling\n```\n\n### 🔧 What Each Extra Provides\n\n- **`[docs]`**: Complete multimodal document processing pipeline\n  - PDF conversion from any document type\n  - Instructor multimodal vision support\n  - Advanced DOCX processing via docling\n  - Enhanced extraction quality\n- **`[pdf]`**: PDF-specific processing\n\n  - Multimodal PDF support via instructor\n  - PDF generation capabilities\n  - Basic PDF text extraction fallback\n\n- **`[docx]`**: Advanced DOCX support\n  - Document conversion via docling\n  - Structure preservation\n  - Markdown-based processing pipeline\n\n## Quick Start\n\n### Basic Text Extraction\n\n```python\nfrom structx import Extractor\n\n# Initialize extractor\nextractor = Extractor.from_litellm(\n    model=\"gpt-4o\",\n    api_key=\"your-api-key\",\n    max_retries=3,      # Automatically retry on transient errors\n    min_wait=1,         # Start with 1 second wait\n    max_wait=10         # Maximum 10 seconds between retries\n)\n\n# Extract from text\nresult = extractor.extract(\n    data=\"System check on 2024-01-15 detected high CPU usage (92%) on server-01.\",\n    query=\"extract incident date and details\"\n)\n\n# Access results\nprint(f\"Extracted {result.success_count} items\")\nprint(result.data[0].model_dump_json(indent=2))\n```\n\n### 📄 Document Processing with Multimodal Support\n\n```python\n# Process a PDF invoice directly with vision capabilities\nresult = extractor.extract(\n    data=\"scripts/example_input/S0305SampleInvoice.pdf\",      # Direct multimodal processing\n    query=\"extract the invoice number, total amount, and line items\"\n)\n\n# Convert a DOCX contract and process with multimodal support\nresult = extractor.extract(\n    data=\"scripts/example_input/free-consultancy-agreement.docx\", # Auto-converted to PDF -\u003e multimodal\n    query=\"extract parties, effective date, and payment terms\"\n)\n```\n\n### 📊 Token Usage Monitoring\n\n```python\n# 
Check token usage for cost monitoring\nusage = result.get_token_usage()\nif usage:\n    print(f\"Total tokens: {usage.total_tokens}\")\n    print(f\"By step: {[(s.name, s.tokens) for s in usage.steps]}\")\n```\n\n## 🚀 Why Multimodal PDF Processing?\n\nThe innovative multimodal approach provides significant advantages over\ntraditional text-based extraction:\n\n- **📄 Context Preservation**: Full document layout and structure are maintained\n- **🎯 Higher Accuracy**: Vision models can interpret tables, charts, and\n  complex layouts\n- **🔄 No Chunking Issues**: Eliminates problems with information split across\n  chunks\n- **📊 Universal Format**: Any document type becomes processable through PDF\n  conversion\n- **🖼️ Visual Understanding**: Handles documents with visual elements,\n  formatting, and structure\n\n## 📚 Documentation\n\nFor comprehensive documentation, examples, and guides, visit our\n[documentation site](https://structx.blacksuan19.dev).\n\n- [Getting Started](https://structx.blacksuan19.dev/getting-started)\n- [Basic Extraction](https://structx.blacksuan19.dev/guides/basic-extraction)\n- [Unstructured Text Processing](https://structx.blacksuan19.dev/guides/unstructured-text)\n- [Async Operations](https://structx.blacksuan19.dev/guides/async-operations)\n- [Multiple Queries](https://structx.blacksuan19.dev/guides/multiple-queries)\n- [Custom Models](https://structx.blacksuan19.dev/guides/custom-models)\n- [Token Usage Tracking](https://structx.blacksuan19.dev/guides/token-tracking)\n- [API Reference](https://structx.blacksuan19.dev/api/extractor)\n\n## Examples\n\nCheck out our [example gallery](https://structx.blacksuan19.dev/examples) for\nreal-world use cases.\n\n## 📁 Supported File Formats\n\n### 📊 Structured Data (Direct Processing)\n\n- **CSV**: Comma-separated values with custom delimiters\n- **Excel**: .xlsx/.xls with sheet selection and custom options\n- **JSON**: JavaScript Object Notation with nested support\n- **Parquet**: Columnar storage 
format for large datasets\n- **Feather**: Fast binary format for data frames\n\n### 📄 Unstructured Documents (Multimodal Pipeline)\n\n| Format   | Extensions                                    | Processing Method                     | Quality    |\n| -------- | --------------------------------------------- | ------------------------------------- | ---------- |\n| **PDF**  | `.pdf`                                        | Direct multimodal processing          | ⭐⭐⭐⭐⭐ |\n| **Word** | `.docx`, `.doc`                               | Docling → Markdown → PDF → Multimodal | ⭐⭐⭐⭐⭐ |\n| **Text** | `.txt`, `.md`, `.py`, `.log`, `.xml`, `.html` | Styled PDF → Multimodal               | ⭐⭐⭐⭐   |\n\n### 🔄 Processing Modes\n\n- **Multimodal PDF** (default): Best quality, preserves layout and context\n- **Simple Text**: Fallback mode with chunking for memory-constrained\n  environments\n- **Simple PDF**: Basic PDF text extraction without vision capabilities\n\n## Contributing\n\nContributions are welcome! Please read our\n[Contributing Guidelines](https://structx.blacksuan19.dev/contributing) for\ndetails.\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file\nfor details.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblacksuan19%2Fstructx","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblacksuan19%2Fstructx","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblacksuan19%2Fstructx/lists"}