{"id":34743937,"url":"https://github.com/vectorize-io/vectorize-iris","last_synced_at":"2025-12-25T04:28:54.784Z","repository":{"id":324886721,"uuid":"1098265370","full_name":"vectorize-io/vectorize-iris","owner":"vectorize-io","description":"Vectorize Iris CLI and SDKs","archived":false,"fork":false,"pushed_at":"2025-11-26T15:12:39.000Z","size":1470,"stargazers_count":4,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-27T21:46:00.440Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vectorize-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-17T13:23:09.000Z","updated_at":"2025-11-26T15:12:43.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/vectorize-io/vectorize-iris","commit_stats":null,"previous_names":["vectorize-io/vectorize-iris"],"tags_count":16,"template":false,"template_full_name":null,"purl":"pkg:github/vectorize-io/vectorize-iris","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fvectorize-iris","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fvectorize-iris/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fvectorize-iris/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectoriz
e-io%2Fvectorize-iris/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vectorize-io","download_url":"https://codeload.github.com/vectorize-io/vectorize-iris/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vectorize-io%2Fvectorize-iris/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28019500,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-12-25T02:00:05.988Z","response_time":58,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-25T04:28:54.653Z","updated_at":"2025-12-25T04:28:54.772Z","avatar_url":"https://github.com/vectorize-io.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"iris.svg\" alt=\"Vectorize Iris\" width=\"200\" /\u003e\n\u003c/p\u003e\n\n# Vectorize Iris\n\nVectorize Iris is a model-based extraction solution that transforms how RAG systems handle PDFs. 
It combines extraction and chunking into one streamlined process, making it easier than ever to get clean, usable text from complex documents.\n\nDocumentation: [docs.vectorize.io](https://docs.vectorize.io/build-deploy/extract-information/extraction-tester/#vectorize-iris)\n\n## Table of Contents\n\n- [Why Iris?](#why-iris)\n- [Quick Start](#quick-start)\n- [Installation](#installation)\n- [Features](#features)\n  - [Basic Text Extraction](#basic-text-extraction)\n  - [Smart Chunking](#smart-chunking)\n  - [Metadata Extraction](#metadata-extraction)\n  - [Parsing Instructions](#parsing-instructions)\n- [CLI Examples](#cli-examples)\n  - [Basic Extraction](#basic-extraction)\n  - [Extract from URL](#extract-from-url)\n  - [JSON Output](#json-output-for-piping)\n  - [Plain Text Output](#plain-text-output)\n  - [Save to File](#save-to-file)\n  - [Process Directory](#process-directory)\n  - [Chunking for RAG](#chunking-for-rag)\n  - [Custom Parsing Instructions](#custom-parsing-instructions)\n  - [Document Classification](#document-classification)\n  - [Advanced Options](#advanced-options)\n- [Configuration](#configuration)\n  - [CLI Configuration](#cli-configuration)\n  - [Python \u0026 Node.js Configuration](#python--nodejs-configuration)\n- [Documentation](#documentation)\n- [License](#license)\n- [Support](#support)\n\n## Why Iris?\n\nTraditional OCR tools struggle with complex layouts, poor scans, and structured data. 
**Iris uses advanced AI** to understand document structure and context, delivering:\n\n- 📄 **Universal format support** - Works with all unstructured document types (PDFs, images, scans, and more)\n- ✨ **High accuracy** - Handles poor quality scans and complex layouts\n- 📊 **Structure preservation** - Maintains tables, lists, and formatting\n- 🎯 **Smart chunking** - Semantic splitting for RAG pipelines\n- 🔍 **Metadata extraction** - Extract specific fields using natural language\n- 🚀 **Simple API** - One function call to extract text\n- ⚡ **Parallel processing** - Process multiple documents simultaneously\n- 🌐 **URL support** - Extract directly from HTTP/HTTPS URLs\n- 📂 **Batch processing** - Process entire directories automatically\n- 🔧 **Multiple formats** - Output as JSON, YAML, or plain text\n- 🪶 **Lightweight** - Single binary CLI with no dependencies\n- ☁️ **Cloud-native** - Serverless-ready APIs\n- 🌍 **Multi-lingual** - 100+ languages including Hindi, Arabic, Chinese\n- 🔌 **Multi-platform** - Python, Node.js, and CLI support\n\n## Quick Start\n\nChoose your preferred tool:\n\n### 🐍 Python API\n```python\nfrom vectorize_iris import extract_text_from_file\n\nresult = extract_text_from_file('document.pdf')\nprint(result.text)\n```\n\n[→ See Python examples](python-api/)\n\n### 📦 Node.js/TypeScript API\n```typescript\nimport { extractTextFromFile } from '@vectorize-io/iris';\n\nconst result = await extractTextFromFile('document.pdf');\nconsole.log(result.text);\n```\n\n[→ See Node.js examples](nodejs-api/)\n\n### ⚡ CLI\n\n```bash\nvectorize-iris document.pdf\n```\n\n## Installation\n\n**CLI:**\n```bash\ncurl -fsSL https://get-iris.vectorize.io | sh\n```\n\n\n**Python:**\n```bash\npip install vectorize-iris\n```\n\n**Node.js:**\n```bash\nnpm install @vectorize-io/iris\n```\n\n\n\n## Features\n\n### Basic Text Extraction\nExtract clean, structured text from any document format.\n\n### Smart Chunking\nSplit documents into semantic chunks perfect for RAG 
pipelines:\n- Markdown-aware chunking\n- Configurable chunk sizes\n- Preserves context across chunks\n\n### Metadata Extraction\nExtract structured data using JSON schemas (OpenAPI spec format recommended):\n```python\nresult = extract_text_from_file('invoice.pdf', options=ExtractionOptions(\n    metadata_schemas=[{\n        'id': 'invoice-data',\n        'schema': {\n            'invoice_number': 'string',\n            'date': 'string',\n            'total_amount': 'number',\n            'vendor_name': 'string'\n        }\n    }]\n))\n# Returns structured JSON metadata\n```\n\n### Parsing Instructions\nGuide the extraction with custom instructions:\n```python\nresult = extract_text_from_file('document.pdf', options=ExtractionOptions(\n    parsing_instructions='Focus on extracting tables and ignore headers/footers'\n))\n```\n\n## CLI Examples\n\n### Basic Extraction\n\nBeautiful terminal output with progress indicators:\n\n```bash\nvectorize-iris document.pdf\n```\n\n**Output:**\n```\n✨ Vectorize Iris Extraction\n──────────────────────────────────────────────────\n\n✓ Upload prepared\n✓ File uploaded successfully\n✓ Extraction started\n✓ Extraction completed in 7s\n\n─────────────────────────────────────────────────────────\n📄 Extracted Text\n─────────────────────────────────────────────────────────\n\nStats: 5536 chars • 1245 words • 89 lines\n\nThis is the extracted text from your PDF document.\nAll formatting and structure is preserved.\n\nTables, lists, and other elements are properly extracted.\n```\n\n### Extract from URL\n\nDownload and extract files directly from HTTP/HTTPS URLs:\n\n```bash\nvectorize-iris https://arxiv.org/pdf/2206.01062\n```\n\n### JSON Output (for piping)\n\n```bash\nvectorize-iris document.pdf -o json\n```\n\n**Output:**\n```json\n{\n  \"success\": true,\n  \"text\": \"This is the extracted text from your PDF document...\",\n  \"chunks\": null,\n  \"metadata\": null\n}\n```\n\n**Pipe to jq:**\n```bash\nvectorize-iris document.pdf -o 
json | jq -r '.text' \u003e output.txt\n```\n\n### Plain Text Output\n\nGet only the extracted text:\n\n```bash\nvectorize-iris document.pdf -o text\n```\n\n**Pipe directly:**\n```bash\nvectorize-iris document.pdf -o text \u003e output.txt\n```\n\n### Save to File\n\nUse `-f` to save output directly:\n\n```bash\nvectorize-iris document.pdf -o json -f output.json\n```\n\n**Output:**\n```\n✨ Vectorize Iris Extraction\n──────────────────────────────────────────────────\n\n✓ Upload prepared\n✓ File uploaded successfully\n✓ Extraction started\n✓ Extraction completed in 7s\n✓ Output written to output.json\n```\n\n### Process Directory\n\nProcess all files in a directory automatically:\n\n```bash\nvectorize-iris ./documents -f ./output\n```\n\n**Output:**\n```\n📦 Processing Directory\n──────────────────────────────────────────────────\n\n💡 Found 5 files to process\n\n⚙️  Processing 1/5 - report-q1.pdf\n✨ Vectorize Iris Extraction\n──────────────────────────────────────────────────\n✓ Upload prepared\n✓ File uploaded successfully\n✓ Extraction started\n✓ Extraction completed in 8s\n✓ Output written to output/report-q1.txt\n\n⚙️  Processing 2/5 - report-q2.pdf\n...\n\n──────────────────────────────────────────────────\n✨ Batch Processing Complete\n\n  ✓ Successful: 5\n```\n\n**With custom output format:**\n```bash\n# Extract all PDFs to JSON\nvectorize-iris ./documents -o json -f ./output\n\n# Extract all files to plain text\nvectorize-iris ./scans -o text -f ./extracted\n```\n\n### Chunking for RAG\n\n```bash\nvectorize-iris long-document.pdf --chunk-size 512\n```\n\nSplits documents at semantic boundaries, perfect for RAG pipelines.\n\n### Custom Parsing Instructions\n\n```bash\nvectorize-iris report.pdf --parsing-instructions \"Extract only tables and numerical data, ignore narrative text\"\n```\n\n### Document Classification\n\nPass multiple metadata schemas and Iris will automatically classify which schema matches best:\n\n```bash\nvectorize-iris invoice.pdf \\\n  
--metadata-schema 'invoice:{\"invoice_number\":\"string\",\"date\":\"string\",\"total_amount\":\"number\",\"vendor\":\"string\"}' \\\n  --metadata-schema 'receipt:{\"store_name\":\"string\",\"date\":\"string\",\"items\":\"array\",\"total\":\"number\"}' \\\n  --metadata-schema 'contract:{\"parties\":\"array\",\"effective_date\":\"string\",\"terms\":\"string\"}' \\\n  --metadata-schema 'cv:{\"name\":\"string\",\"contact_info\":\"object\",\"skills\":\"array\",\"experience\":\"array\"}' \\\n  -o json\n```\n\n**Output:**\n```json\n{\n  \"success\": true,\n  \"text\": \"...\",\n  \"metadata\": \"{\\\"invoice_number\\\":\\\"INV-2024-001\\\",\\\"date\\\":\\\"2024-01-15\\\",\\\"total_amount\\\":1250.00,\\\"vendor\\\":\\\"Acme Corp\\\"}\",\n  \"metadataSchema\": \"invoice\"\n}\n```\n\nIris automatically detected this was an invoice and extracted the relevant fields using the matching schema.\n\n### Advanced Options\n\n```bash\n# Custom chunk size with metadata extraction\nvectorize-iris document.pdf \\\n  --chunk-size 256 \\\n  --infer-metadata-schema \\\n  --parsing-instructions \"Focus on extracting structured data\" \\\n  -o yaml -f output.yaml\n\n# Longer timeout for large documents\nvectorize-iris large-document.pdf \\\n  --timeout 600 \\\n  --poll-interval 5\n```\n\n## Configuration\n\n### CLI Configuration\n\nThe CLI offers multiple ways to configure your credentials:\n\n#### Interactive Configuration (Recommended)\n\nThe easiest way to get started - opens your browser for authentication:\n\n```bash\nvectorize-iris configure\n```\n\n**What happens:**\n1. Opens your browser to the Vectorize platform\n2. Click \"Authorize\" to grant access\n3. Credentials are automatically saved to `~/.vectorize-iris/credentials`\n4. Done! 
You're ready to extract\n\n#### Manual Configuration\n\nIf you prefer not to use the browser, prompt for credentials manually:\n\n```bash\nvectorize-iris configure --manual\n```\n\nYou'll be asked to enter:\n- Access Token\n- Organization ID\n\nGet these from [platform.vectorize.io](https://platform.vectorize.io) → Account → Org Settings → Access Tokens\n\n#### Non-Interactive Configuration\n\nFor scripts and automation, pass credentials directly:\n\n```bash\nvectorize-iris configure --api-token \"your-token\" --org-id \"your-org-id\"\n```\n\n#### Environment Variables\n\nAlternatively, set credentials via environment variables (works for all clients):\n\n```bash\nexport VECTORIZE_TOKEN=\"your-token\"\nexport VECTORIZE_ORG_ID=\"your-org-id\"\n```\n\n### Python \u0026 Node.js Configuration\n\nFor Python and Node.js clients, use environment variables or pass credentials programmatically:\n\n**Environment variables:**\n```bash\nexport VECTORIZE_TOKEN=\"your-token\"\nexport VECTORIZE_ORG_ID=\"your-org-id\"\n```\n\n**Python:**\n```python\nfrom vectorize_iris import VectorizeIrisClient\n\nclient = VectorizeIrisClient(\n    api_token=\"your-token\",\n    org_id=\"your-org-id\"\n)\n```\n\n**Node.js:**\n```typescript\nimport { extractTextFromFile } from '@vectorize-io/iris';\n\nconst result = await extractTextFromFile('document.pdf', {\n    apiToken: 'your-token',\n    orgId: 'your-org-id'\n});\n```\n\n## Documentation\n\nFor detailed documentation, API reference, and advanced features:\n\n📚 **[docs.vectorize.io](https://docs.vectorize.io)**\n\n## License\n\nMIT\n\n## Support\n\n- 📖 [Documentation](https://docs.vectorize.io)\n- 💬 [Community](https://vectorize.io/community)\n- 🐛 
[Issues](https://github.com/vectorize-io/vectorize-iris/issues)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectorize-io%2Fvectorize-iris","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvectorize-io%2Fvectorize-iris","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvectorize-io%2Fvectorize-iris/lists"}