{"id":49635416,"url":"https://github.com/renswickd/document-parser-collection","last_synced_at":"2026-05-05T14:34:43.969Z","repository":{"id":297923644,"uuid":"991650761","full_name":"renswickd/document-parser-collection","owner":"renswickd","description":"This is a collection of various document parsers and hands-on to construct structured data for your RAG applications.","archived":false,"fork":false,"pushed_at":"2025-08-17T20:23:34.000Z","size":100,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-17T22:16:16.951Z","etag":null,"topics":["amazon-textract","azure-document-intelligence","document-parsing","llama-parse","mistral-ocr","unstructured-io"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/renswickd.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-05-28T00:36:01.000Z","updated_at":"2025-08-17T20:23:37.000Z","dependencies_parsed_at":"2025-08-17T22:09:35.965Z","dependency_job_id":"1a21db1d-5951-400d-9cae-375963aa17a6","html_url":"https://github.com/renswickd/document-parser-collection","commit_stats":null,"previous_names":["renswickd/document-parser-collection"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/renswickd/document-parser-collection","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fdocument-parser-collection","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fdocument-parser-collection/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fdocument-parser-collection/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fdocument-parser-collection/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/renswickd","download_url":"https://codeload.github.com/renswickd/document-parser-collection/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/renswickd%2Fdocument-parser-collection/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32653658,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-05T11:29:49.557Z","status":"ssl_error","status_checked_at":"2026-05-05T11:29:48.587Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["amazon-textract","azure-document-intelligence","document-parsing","llama-parse","mistral-ocr","unstructured-io"],"created_at":"2026-05-05T14:34:43.251Z","updated_at":"2026-05-05T14:34:43.960Z","avatar_url":"https://github.com/renswickd.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Document Parsing Solutions 📄\n\nA comprehensive toolkit for extracting and processing content from PDF documents using various parsing technologies.\n\n## Overview\n\nThis project implements five different document parsing approaches:\n- Unstructured.io API\n- Llama Parse\n- Mistral OCR\n- Azure Document Intelligence\n- Amazon Textract\n\n## Parser Comparison\n\n### 1. Unstructured.io API\n**Strengths:**\n- Excellent at handling complex document layouts\n- Advanced table extraction capabilities\n- Maintains document structure and formatting\n- Supports multiple document formats\n\n**Limitations:**\n- API rate limits on free tier\n- Higher latency due to cloud processing\n- Cost increases with document volume\n\n**Best For:**\n- Complex documents with mixed content\n- Documents with tables and structured data\n- Batch processing requirements\n\n### 2. Llama Parse\n**Strengths:**\n- Strong text extraction capabilities\n- Good handling of simple layouts\n- Local processing option available\n- Efficient for text-heavy documents\n\n**Limitations:**\n- Limited table extraction capabilities\n- May struggle with complex layouts\n- Requires more computational resources locally\n\n**Best For:**\n- Text-heavy documents\n- Simple document layouts\n- Local processing requirements\n\n### 3. Mistral OCR\n**Strengths:**\n- Excellent OCR accuracy\n- Good language support\n- Handles handwritten text well\n- Real-time processing capabilities\n\n**Limitations:**\n- Limited formatting preservation\n- May struggle with complex tables\n- Higher cost for high-volume processing\n\n**Best For:**\n- Documents with handwritten content\n- Multi-language documents\n- Real-time OCR requirements\n\n### 4. Azure Document Intelligence\n**Strengths:**\n- Advanced AI-powered extraction\n- Excellent form field recognition\n- Strong table extraction\n- Built-in pretraining for common documents\n\n**Limitations:**\n- Azure platform lock-in\n- Higher cost for large-scale processing\n- Requires Azure subscription\n\n**Best For:**\n- Forms and structured documents\n- Enterprise-scale deployments\n- Integration with Azure services\n\n### 5. Amazon Textract\n**Strengths:**\n- Excellent table extraction\n- Good form field recognition\n- Scales well for large volumes\n- Strong integration with AWS\n\n**Limitations:**\n- AWS platform lock-in\n- Cost can be high for large volumes\n- Limited customization options\n\n**Best For:**\n- AWS ecosystem integration\n- Large-scale document processing\n- Forms and table extraction\n\n## 🔧 Setup\n\n### 1. API Keys and Credentials\n\n#### Unstructured.io\n- Sign up at [Unstructured.io](https://unstructured.io)\n- Obtain API key from dashboard\n\n#### Llama Parse\n- Sign up at [Llama Cloud](https://cloud.llamaindex.ai/login)\n- Generate API key from dashboard\n\n#### Mistral API\n- Visit [Mistral AI](https://mistral.ai/)\n- Create account and generate API key\n\n#### Azure Document Intelligence\n- Create resource in [Azure Portal](https://portal.azure.com/)\n- Get endpoint URL and API key from resource settings\n\n#### Amazon Textract\n- Set up AWS account\n- Create IAM user with Textract permissions\n- Get AWS access key and secret\n\n### 2. Environment Setup\n\n1. **Install Dependencies**\n```bash\npython -m venv venv\nsource venv/bin/activate  # For Mac/Linux\npip install -r requirements.txt\n```\n\n2. **Configure Environment Variables**\nCreate a `.env` file:\n```ini\n# Unstructured.io\nUNSTRUCTURED_API_KEY=your_key\n\n# Llama Parse\nLLAMA_API_KEY=your_key\n\n# Mistral\nMISTRAL_API_KEY=your_key\n\n# Azure\nAZURE_ENDPOINT=your_endpoint\nAZURE_API_KEY=your_key\n\n# AWS\nAWS_ACCESS_KEY_ID=your_key\nAWS_SECRET_ACCESS_KEY=your_secret\nAWS_REGION=your_region\n```\n\n## Usage Guidelines\n\n### Document Type Selection\n\n1. **Simple Text Documents**\n   - Recommended: Llama Parse or Unstructured.io\n   - Alternative: Mistral OCR\n\n2. **Forms and Structured Documents**\n   - Recommended: Azure Document Intelligence or Amazon Textract\n   - Alternative: Unstructured.io\n\n3. **Complex Tables**\n   - Recommended: Amazon Textract or Azure Document Intelligence\n   - Alternative: Unstructured.io\n\n4. **Handwritten Content**\n   - Recommended: Mistral OCR\n   - Alternative: Azure Document Intelligence\n\n5. **Multi-Language Documents**\n   - Recommended: Mistral OCR or Azure Document Intelligence\n   - Alternative: Amazon Textract\n\n## Contributing\n\nContributions are welcome! Please read our contributing guidelines and submit pull requests to our repository.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frenswickd%2Fdocument-parser-collection","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frenswickd%2Fdocument-parser-collection","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frenswickd%2Fdocument-parser-collection/lists"}