{"id":33914936,"url":"https://github.com/dpaluy/reducto_ai","last_synced_at":"2026-04-07T15:32:10.071Z","repository":{"id":323611794,"uuid":"1087990711","full_name":"dpaluy/reducto_ai","owner":"dpaluy","description":"ReductoAI Ruby wrapper","archived":false,"fork":false,"pushed_at":"2026-03-30T19:58:30.000Z","size":59,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-03-30T21:24:59.603Z","etag":null,"topics":["ai","ocr","ruby","ruby-gem"],"latest_commit_sha":null,"homepage":"","language":"Ruby","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dpaluy.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-02T04:09:57.000Z","updated_at":"2026-03-30T19:57:56.000Z","dependencies_parsed_at":"2025-11-11T06:57:28.865Z","dependency_job_id":null,"html_url":"https://github.com/dpaluy/reducto_ai","commit_stats":null,"previous_names":["dpaluy/reducto_ai"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/dpaluy/reducto_ai","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpaluy%2Freducto_ai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpaluy%2Freducto_ai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpaluy%2Freducto_ai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpaluy%2Freducto_ai/manifests","owne
r_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dpaluy","download_url":"https://codeload.github.com/dpaluy/reducto_ai/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dpaluy%2Freducto_ai/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31243871,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-31T15:15:33.278Z","status":"ssl_error","status_checked_at":"2026-03-31T15:15:28.327Z","response_time":111,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","ocr","ruby","ruby-gem"],"created_at":"2025-12-12T06:44:39.212Z","updated_at":"2026-04-07T15:32:10.066Z","avatar_url":"https://github.com/dpaluy.png","language":"Ruby","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ReductoAi\n\nRuby wrapper on [ReductoAI API](https://docs.reducto.ai/api-reference)\n\n[![Gem Version](https://badge.fury.io/rb/reducto_ai.svg)](https://badge.fury.io/rb/reducto_ai)\n[![ci](https://github.com/dpaluy/reducto_ai/actions/workflows/ci.yml/badge.svg)](https://github.com/dpaluy/reducto_ai/actions/workflows/ci.yml)\n\n## Installation\n\n```\nbundle add reducto_ai\n```\n\n## Usage\n\nConfigure once:\n\n```ruby\nReductoAI.configure do |config|\n  config.api_key = ENV.fetch(\"REDUCTO_API_KEY\")\nend\n```\n\n### Choosing an action\n\n- 
**Parse**: Start here for any document. Converts uploads or URLs into structured chunks and OCR text so later steps can reuse the returned `job_id`.\n- **Split**: Use after parsing when you need logical sections. Provide `split_description` names/rules to segment the parsed document into labeled ranges.\n- **Extract**: Run when you need structured answers (fields, JSON). Supply instructions or schema to pull values from raw input or an existing parse `job_id`.\n- **Edit**: Generate marked-up PDFs using `document_url` plus `edit_instructions` (PDF forms supported via `form_schema`).\n- **Pipeline**: The current gem surface remains `steps:`-based for multi-step workflows.\n\n### Async Operations\n\nAsync variants return immediately with a `job_id`. Use `client.jobs.wait(...)` for polling or configure Svix-backed webhooks in your app.\n\nNotes:\n- Reducto prioritizes sync jobs over async jobs.\n- Async results may be deleted on Reducto's normal 12-hour cleanup cadence unless you persist them yourself or opt into `persist_results`.\n- `client.jobs.configure_webhook` returns a Svix portal URL string.\n- `client.jobs.wait` requires either `timeout:` or `max_attempts:` so it cannot poll forever by accident.\n- `client.edit.async` is the exception to the generic async shape: Reducto's `/edit_async` endpoint only accepts top-level `priority` and `webhook`, not async `metadata`.\n\n```ruby\nclient = ReductoAI::Client.new\n\njob = client.parse.async(\n  input: \"https://example.com/large-doc.pdf\",\n  output_formats: { markdown: true },\n  async: {\n    priority: false,\n    webhook: { mode: \"svix\", channels: [\"production\"] },\n    metadata: { document_id: \"doc-123\" }\n  },\n  settings: { persist_results: true }\n)\n\n# =\u003e { \"job_id\" =\u003e \"async-123\", \"status\" =\u003e \"Pending\" }\n\nresult = client.jobs.wait(job_id: job[\"job_id\"], interval: 2, timeout: 300)\n\n# =\u003e { \"job_id\" =\u003e \"async-123\", \"status\" =\u003e \"Completed\", \"result\" 
=\u003e {...} }\n\nportal_url = client.jobs.configure_webhook\n# =\u003e \"https://dashboard.svix.com/...\"\n```\n\nAvailable async helpers:\n- `client.parse.async(input:, async:, **options)`\n- `client.extract.async(input:, instructions:, async:, **options)`\n- `client.split.async(input:, async:, **options)`\n- `client.edit.async(input:, instructions:, async:, **options)` where `async:` may only include `priority` and `webhook`\n- `client.pipeline.async(input:, steps:, async:, **options)`\n- `client.jobs.wait(job_id:, interval: 2, timeout: nil, max_attempts: nil, raise_on_failure: true)`\n- `client.jobs.pending?/in_progress?/completing?/completed?/failed?/terminal?`\n\n### Rails\n\nCreate `config/initializers/reducto_ai.rb`:\n\n```ruby\nReductoAI.configure do |c|\n  c.api_key = Rails.application.credentials.dig(:reducto, :api_key)\n  c.webhook_secret = Rails.application.credentials.dig(:reducto, :webhook_secret)\n  # c.base_url = \"https://platform.reducto.ai\"\n  # c.open_timeout = 5; c.read_timeout = 30\nend\n```\n\nIn your host app, own the route/controller/job:\n\n```ruby\n# config/routes.rb\npost \"/webhooks/reducto\", to: \"reducto_webhooks#create\"\n```\n\n```ruby\nclass ReductoWebhooksController \u003c ActionController::API\n  def create\n    event = ReductoAI::Rails::RequestVerifier.verify!(request)\n\n    return head :ok if WebhookDelivery.exists?(provider: \"reducto\", delivery_id: event.svix_id)\n\n    WebhookDelivery.create!(provider: \"reducto\", delivery_id: event.svix_id, job_id: event.job_id)\n    ReductoWebhookJob.perform_later(event.job_id, event.svix_id)\n\n    head :ok\n  rescue ReductoAI::WebhookVerificationError\n    head :unauthorized\n  end\nend\n```\n\nReturn 2xx quickly, dedupe on `svix-id`, and fetch/store final results in the background job.\n\n### Quick Start\n\n```ruby\nclient = ReductoAI::Client.new\n\n# Parse a document\n# API Reference: https://docs.reducto.ai/api-reference/parse\nparse = client.parse.sync(input: 
\"https://example.com/invoice.pdf\")\njob_id = parse[\"job_id\"]\n\n# Response:\n# {\n#   \"job_id\" =\u003e \"abc-123\",\n#   \"status\" =\u003e \"Completed\",\n#   \"result\" =\u003e {...}\n# }\n\n# Extract structured data\n# API Reference: https://docs.reducto.ai/api-reference/extract\nextraction = client.extract.sync(\n  input: job_id,\n  instructions: {\n    schema: {\n      type: \"object\",\n      properties: {\n        invoice_number: { type: \"string\" },\n        total_due: { type: \"string\" }\n      },\n      required: [\"invoice_number\", \"total_due\"]\n    }\n  }\n)\n\n# Response:\n# {\n#   \"job_id\" =\u003e \"820dca1b-3215-4d24-be09-6494d4c3cd88\",\n#   \"usage\" =\u003e {\"num_pages\" =\u003e 1, \"num_fields\" =\u003e 2, \"credits\" =\u003e 2.0},\n#   \"studio_link\" =\u003e \"https://studio.reducto.ai/job/820dca1b-3115-4d24-be09-6494d4c3cd88\",\n#   \"result\" =\u003e [{\"invoice_number\" =\u003e \"INV-2024-001\", \"total_due\" =\u003e \"$1,234.56\"}],\n#   \"citations\" =\u003e nil\n# }\n```\n\n### Complete Example: Multi-invoice Processing\n\n```ruby\nclient = ReductoAI::Client.new\n\n# 1. Parse the document\n# API Reference: https://docs.reducto.ai/api-reference/parse\nparse = client.parse.sync(input: \"https://example.com/invoices.pdf\")\n\n# Response:\n# {\n#   \"job_id\" =\u003e \"parse-123\",\n#   \"status\" =\u003e \"Completed\",\n#   \"result\" =\u003e {...}\n# }\n\n# 2. Split into individual invoices\n# API Reference: https://docs.reducto.ai/api-reference/split\nsplit = client.split.sync(\n  input: parse[\"job_id\"],\n  split_description: [\n    {\n      name: \"Invoice\",\n      description: \"All pages that belong to a single invoice\",\n      partition_key: \"invoice_number\"\n    }\n  ],\n  split_rules: \u003c\u003c~PROMPT\n    The document contains multiple invoices one after another. 
Each invoice has a unique invoice number formatted like \"Invoice #12345\" near the top of the first page.\n    Segment the document into one partition per invoice. Keep pages contiguous per invoice and include any following appendices until the next invoice number.\n    Name each partition using the exact invoice number you detect (e.g., \"Invoice #12345\").\n  PROMPT\n)\n\n# Response:\n# {\n#   \"job_id\" =\u003e \"split-456\",\n#   \"result\" =\u003e {\n#     \"splits\" =\u003e [{\n#       \"name\" =\u003e \"Invoice\",\n#       \"partitions\" =\u003e [\n#         {\"name\" =\u003e \"Invoice #12345\", \"pages\" =\u003e [0, 1, 2]},\n#         {\"name\" =\u003e \"Invoice #12346\", \"pages\" =\u003e [3, 4]}\n#       ]\n#     }]\n#   }\n# }\n\n# 3. Extract data from each invoice\n# API Reference: https://docs.reducto.ai/api-reference/extract\ninvoice_partitions = split.dig(\"result\", \"splits\").first.fetch(\"partitions\")\ninvoice_details = invoice_partitions.map do |partition|\n  client.extract.sync(\n    input: parse[\"job_id\"],\n    instructions: {\n      schema: {\n        type: \"object\",\n        properties: {\n          invoice_number: { type: \"string\" },\n          total_due: { type: \"string\" }\n        },\n        required: [\"invoice_number\", \"total_due\"]\n      }\n    },\n    settings: { page_range: partition[\"pages\"] }\n  )\nend\n\n# Response per invoice:\n# {\n#   \"job_id\" =\u003e \"extract-789\",\n#   \"result\" =\u003e [{\"invoice_number\" =\u003e \"INV-12345\", \"total_due\" =\u003e \"$2,500.00\"}],\n#   \"usage\" =\u003e {\"credits\" =\u003e 2.0}\n# }\n```\n\n### Direct Split Example\n\nSplit a multi-invoice PDF directly without pre-parsing:\n\n```ruby\nclient = ReductoAI::Client.new\n\n# Split document directly from URL\n# API Reference: https://docs.reducto.ai/api-reference/split\nresponse = client.split.sync(\n  input: { url: \"https://example.com/invoices.pdf\" },\n  split_description: [\n    {\n      name: \"Invoice\",\n      
description: \"Individual invoices within the document\",\n      partition_key: \"invoice_number\"\n    }\n  ]\n)\n\n# Response:\n# {\n#   \"usage\" =\u003e {\"num_pages\" =\u003e 2, \"credits\" =\u003e nil},\n#   \"result\" =\u003e {\n#     \"section_mapping\" =\u003e nil,\n#     \"splits\" =\u003e [{\n#       \"name\" =\u003e \"Invoice\",\n#       \"pages\" =\u003e [1, 2],\n#       \"conf\" =\u003e \"high\",\n#       \"partitions\" =\u003e [\n#         {\"name\" =\u003e \"0000569050-001\", \"pages\" =\u003e [1], \"conf\" =\u003e \"high\"},\n#         {\"name\" =\u003e \"0000569050-002\", \"pages\" =\u003e [2], \"conf\" =\u003e \"high\"}\n#       ]\n#     }]\n#   }\n# }\n\n# Access partitions\npartitions = response.dig(\"result\", \"splits\").first[\"partitions\"]\n# =\u003e [{\"name\"=\u003e\"0000569050-001\", \"pages\"=\u003e[1], \"conf\"=\u003e\"high\"}, ...]\n```\n\n### Document Classification Example\n\n```ruby\nclient = ReductoAI::Client.new\n\n# Parse document\n# API Reference: https://docs.reducto.ai/api-reference/parse\nparse = client.parse.sync(input: \"https://example.com/document.pdf\")\n\n# Extract with classification\n# API Reference: https://docs.reducto.ai/api-reference/extract\nextraction = client.extract.sync(\n  input: parse[\"job_id\"],\n  instructions: {\n    schema: {\n      type: \"object\",\n      properties: {\n        document_type: {\n          type: \"string\",\n          enum: [\"invoice\", \"credit\", \"debit\"],\n          description: \"Document category\"\n        },\n        document_number: {\n          type: \"string\",\n          description: \"Invoice number or equivalent identifier\"\n        }\n      },\n      required: [\"document_type\", \"document_number\"]\n    }\n  },\n  settings: { citations: { enabled: false } }\n)\n\n# Response:\n# {\n#   \"job_id\" =\u003e \"class-123\",\n#   \"result\" =\u003e [{\"document_type\" =\u003e \"invoice\", \"document_number\" =\u003e \"INV-2024-001\"}],\n#   \"usage\" =\u003e 
{\"credits\" =\u003e 2.0}\n# }\n\ndocument_type = extraction.dig(\"result\", 0, \"document_type\")\ndocument_number = extraction.dig(\"result\", 0, \"document_number\")\n```\n\n### API Reference\n\nFull endpoint details live in the [Reducto API documentation](https://docs.reducto.ai/).\n\n### Best Practices: Cost-Efficient Document Processing\n\nFollow these patterns to minimize credit usage when processing documents:\n\n#### 1. Parse Once, Reuse Everywhere\n\n**❌ Expensive:** Calling extract/split with URLs directly\n\n```ruby\n# DON'T: Each operation parses the document again\nextract1 = client.extract.sync(input: url, instructions: schema_a)  # Parse + Extract = 2 credits\nextract2 = client.extract.sync(input: url, instructions: schema_b)  # Parse + Extract = 2 credits\nsplit = client.split.sync(input: url, split_description: [...])     # Parse + Split = 3 credits\n# Total: 7 credits for a 1-page document\n```\n\n**✅ Cost-efficient:** Parse once, reuse `job_id`\n\n```ruby\n# DO: Parse once, reuse the job_id\nparse = client.parse.sync(input: url)                              # 1 credit\njob_id = parse[\"job_id\"]\n\nextract1 = client.extract.sync(input: job_id, instructions: schema_a)  # 1 credit\nextract2 = client.extract.sync(input: job_id, instructions: schema_b)  # 1 credit\nsplit = client.split.sync(input: job_id, split_description: [...])     # 2 credits\n# Total: 5 credits for a 1-page document (saved 2 credits)\n```\n\n#### 2. Split Before Extract for Multi-Document Files\n\n**✅ Best practice:** Split first, then extract per partition\n\n```ruby\n# 1. Parse the document once\nparse = client.parse.sync(input: \"multi-invoice.pdf\")  # 1 credit × 10 pages = 10 credits\njob_id = parse[\"job_id\"]\n\n# 2. Split into partitions\nsplit = client.split.sync(\n  input: job_id,\n  split_description: [{ name: \"Invoice\", description: \"...\" }]\n)  # 2 credits × 10 pages = 20 credits\n\n# 3. 
Extract only from specific partitions\npartitions = split.dig(\"result\", \"splits\").first[\"partitions\"]\ninvoices = partitions.map do |partition|\n  client.extract.sync(\n    input: job_id,\n    instructions: { schema: invoice_schema },\n    settings: { page_range: partition[\"pages\"] }  # Extract only relevant pages\n  )\nend  # 1 credit × 10 pages = 10 credits\n\n# Total: 40 credits for 10-page document with 5 invoices\n```\n\n#### 3. Use Async for Large Documents\n\n**✅ For documents \u003e 10 pages:** Use async to avoid timeouts\n\n```ruby\n# Parse async for large files\njob = client.parse.async(input: large_pdf_url)\njob_id = job[\"job_id\"]\n\n# Poll or use webhooks\nresult = client.jobs.wait(job_id: job_id, interval: 2, timeout: 300)\n\n# Then reuse the job_id for split/extract\nsplit = client.split.sync(input: job_id, split_description: [...])\n```\n\n#### 4. Store and Reuse Parse Results\n\n**✅ For repeated processing:** Store `job_id` to avoid re-parsing\n\n```ruby\n# Store the job_id with your document record\ndocument.update(reducto_job_id: parse[\"job_id\"])\n\n# Later: Extract different schemas without re-parsing\n# (results get their own names so the schema variables are not overwritten)\nresult_v1 = client.extract.sync(input: document.reducto_job_id, instructions: schema_v1)\nresult_v2 = client.extract.sync(input: document.reducto_job_id, instructions: schema_v2)\n# Only 2 credits instead of 4\n```\n\n#### Credit Math Summary\n\n| Operation | Direct URL | With job_id | Savings |\n|-----------|-----------|-------------|---------|\n| Parse | 1 credit/page | N/A | - |\n| Extract | 2 credits/page | 1 credit/page | 50% |\n| Split | 3 credits/page | 2 credits/page | 33% |\n| Multiple extracts (3×) | 6 credits/page | 3 credits/page | 50% |\n\n**Golden rule:** Always parse once and reuse `job_id` for all subsequent operations.\n\n### Credits \u0026 pricing overview\n\nReducto bills every API call in credits. 
Current public rates are:\n\n- **Parse**: 1 credit per standard page (2 for complex VLM-enhanced pages).\n- **Extract**: 2 credits per page (4 if agent-in-loop mode is enabled). Parsing credits are also charged if you **don't** reuse a previous `job_id`.\n- **Split**: 2 credits per page when run standalone; free if you supply a prior parse job.\n- **Edit**: 4 credits per page (beta pricing).\n\nYou can process ~15k credits/month before overages; additional credits are billed at **$0.015 USD** each according to [Reducto's pricing page](https://reducto.ai/pricing).\n\n#### Why Extract costs 2 credits for 1 page\n\nWhen you call `extract.sync(input: url, instructions: schema)` with a URL instead of a `job_id`, Reducto automatically performs two operations:\n\n1. **Parse** (1 credit): Converts PDF → structured text\n2. **Extract** (1 credit): Applies schema → structured JSON\n\n**Total: 2 credits**\n\n**Cost optimization:** Parse once, extract multiple times:\n\n```ruby\n# Parse once (1 credit)\nparse = client.parse.sync(input: \"https://example.com/doc.pdf\")\njob_id = parse[\"job_id\"]\n\n# Extract multiple schemas (1 credit each)\nresult_a = client.extract.sync(input: job_id, instructions: schema_a)\nresult_b = client.extract.sync(input: job_id, instructions: schema_b)\n# Total: 3 credits instead of 4\n```\n\n#### Credit math for the examples above\n\n- **Parse → Split → Extract**: when you start with `ReductoAI.parse` and pass the resulting `job_id` to `split` and `extract`, you pay **1 + 2** = **3 credits per page** (parse + extract). Split reuses the parsed content so it doesn't add extra parse credits.\n- **Document type + number extraction**: the JSON-schema `extract` call uses an existing parse job, so it consumes **parse (1) + extract (2) = 3 credits per page**. 
Enabling agent-in-loop mode or citations may raise the per-page cost; see the [credit usage guide](https://docs.reducto.ai/faq/credit-usage-overview).\n\n## Development\n\n```\nbundle exec rake test\nbundle exec rubocop\n```\n\n## TODO\n\n- [ ] Document webhook workflow and retry semantics\n\n## Contributing\n\nBug reports and pull requests are welcome on GitHub at https://github.com/dpaluy/reducto_ai.\n\n## License\n\nThe gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpaluy%2Freducto_ai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdpaluy%2Freducto_ai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdpaluy%2Freducto_ai/lists"}