{"id":50492836,"url":"https://github.com/mj-deving/invoice-parse-agent","last_synced_at":"2026-06-02T04:30:50.792Z","repository":{"id":357829003,"uuid":"1237952726","full_name":"mj-deving/invoice-parse-agent","owner":"mj-deving","description":"PDF/image invoices into structured JSON: OCR, LLM extraction, schema validation, low-confidence review queue, ground-truth eval. TypeScript, Tesseract, Claude.","archived":false,"fork":false,"pushed_at":"2026-05-25T17:39:20.000Z","size":673,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-25T19:28:56.579Z","etag":null,"topics":["document-ai","llm","ocr","rag","typescript"],"latest_commit_sha":null,"homepage":"https://mj-deving.github.io/invoice-parse-agent/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mj-deving.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-13T17:05:51.000Z","updated_at":"2026-05-25T17:39:24.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mj-deving/invoice-parse-agent","commit_stats":null,"previous_names":["mj-deving/invoice-parse-agent"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mj-deving/invoice-parse-agent","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mj-deving%2Finvoice-parse-agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/m
j-deving%2Finvoice-parse-agent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mj-deving%2Finvoice-parse-agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mj-deving%2Finvoice-parse-agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mj-deving","download_url":"https://codeload.github.com/mj-deving/invoice-parse-agent/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mj-deving%2Finvoice-parse-agent/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33806987,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-02T02:00:07.132Z","response_time":109,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["document-ai","llm","ocr","rag","typescript"],"created_at":"2026-06-02T04:30:48.842Z","updated_at":"2026-06-02T04:30:50.782Z","avatar_url":"https://github.com/mj-deving.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Invoice Parse Agent\n\nOCR and document-processing hero repo for turning invoice PDFs into structured JSON:\n\n```bash\ncurl -s -X POST http://localhost:8787/parse \\\n  -H 'content-type: application/json' \\\n  -d '{\"url\":\"https://example.com/invoice.pdf\"}'\n```\n\nThe API extracts document text, asks Claude Haiku 
4.5 for schema-constrained invoice JSON, and reports evaluation accuracy against a small ground-truth corpus.\n\nLive proof dashboard: https://mj-deving.github.io/invoice-parse-agent/\n\nLive backend dashboard: https://missioncontrol.mjdeving.com/invoice-parse/dashboard\n\nRendered dashboard proof: `docs/proof/dashboard-local.png`\n\n## Why this exists\n\nInvoice processing is still full of manual handoffs: PDFs arrive by email or webhook, OCR output is noisy, vendor layouts vary, and low-confidence fields need human review before they can enter accounting or logistics workflows.\n\nThis project shows a practical intake pipeline for that workflow. It turns invoice PDFs and scans into structured JSON, scores extraction quality against ground truth, keeps a review queue for uncertain results, and stores reviewed corrections as vendor memory so recurring documents become easier to process over time.\n\nThe scope is intentionally narrow: semi-structured B2B invoices for logistics, orders, and supplier operations. 
The goal is not universal document understanding; it is a reliable automation loop for a common back-office process.\n\n## Architecture\n\n![Invoice Parse Agent pipeline — PDF/image through OCR or text extraction, Qdrant memory retrieval, Claude Haiku extraction, Zod validation, SQLite job ledger and review queue](docs/diagrams/pipeline.png)\n\n```text\nPOST /parse\n  URL or multipart PDF/image\n  -\u003e document text layer extraction for embedded-text PDFs\n  -\u003e Tesseract.js OCR for image/scanned inputs\n  -\u003e optional managed Vision/Document AI adapter boundary\n  -\u003e optional Qdrant retrieval of similar prior invoices\n  -\u003e Claude structured extraction\n  -\u003e optional Qdrant storage of parsed invoice memory\n  -\u003e SQLite job ledger and review queue for low-confidence parses\n  -\u003e zod-validated invoice JSON\n\nGET /eval\n  5 invoice fixtures\n  -\u003e extraction\n  -\u003e field hit-rate + confidence report\n```\n\n## Tradeoffs\n\n### Tesseract.js vs managed Vision APIs\n\nTesseract.js is the primary OCR path because it is self-hosted, cheap, inspectable, and works in Docker without sending invoice images to a third party. That matters for supplier invoices, logistics documents, and regulated customer data.\n\nManaged OCR such as Google Vision API, AWS Textract, or Azure Document Intelligence is the better production choice when handwriting, tables, rotated scans, multi-page invoices, or SLA-backed accuracy matter more than cost and data locality. This repo exposes `src/ocr/vision.ts` as the managed fallback boundary, but keeps it disabled by default.\n\n### Hono on Node/Docker vs Cloudflare Workers\n\nThe runtime is Hono on Node via Bun. Cloudflare Workers are useful for routing and orchestration, but self-hosted Tesseract and PDF rasterization are a poor fit for Worker bundle size, CPU, filesystem, and native utility constraints. 
Docker is the deployable unit here; Workers can still call this service as an internal API.\n\n### Claude structured extraction vs regex\n\nRegex is reliable for the synthetic fixtures and remains as an offline fallback for tests. Claude Haiku 4.5 is used for the real extraction path because OCR output often shifts labels, table order, and address formatting. The schema boundary keeps the LLM output operational: invalid JSON fails fast instead of silently entering an accounting workflow.\n\n### Qdrant vs no document memory\n\nQdrant is used as the optional vector memory layer, not as a replacement for OCR or structured extraction. When `QDRANT_URL` is configured, `/parse` retrieves similar prior invoices before extraction and stores the parsed result afterward. That gives vendor-specific examples to Claude and creates a reusable memory for recurring suppliers, purchase orders, and logistics documents.\n\nThe demo uses deterministic local hash vectors so the repo works without another model provider. In production, replace `src/memory/embedding.ts` with OpenAI, Voyage, Cohere, or local embedding vectors and keep the Qdrant storage/search contract unchanged.\n\nSet `EMBEDDING_PROVIDER=openai` with `OPENAI_API_KEY` to use production OpenAI embeddings for Qdrant memory. The default `hash` provider stays deterministic for CI and local demos.\n\n### Intake desk vs one-shot parsing\n\nThe live use case is an invoice intake triage desk. Every parse creates a persisted job in SQLite. Results below `REVIEW_CONFIDENCE_THRESHOLD` enter `needs_review`; the dashboard lets an operator edit the extracted invoice JSON and save it as `reviewed`. 
Reviewed corrections are stored back into Qdrant, so recurring vendor invoices improve over time.\n\n## Run\n\n```bash\nbun install\nbun run fixtures\nbun run dev\n```\n\nOpen:\n\n```bash\nopen http://localhost:8787/dashboard\ncurl http://localhost:8787/eval\ncurl -X POST http://localhost:8787/parse \\\n  -F \"file=@corpus/mustard-logistics-001.pdf\"\n```\n\nSet `.dev.vars` or environment variables:\n\n```bash\nANTHROPIC_API_KEY=sk-ant-api03-...\nANTHROPIC_MODEL=claude-haiku-4-5\nVISION_API_ENABLED=false\nQDRANT_URL=http://localhost:6333\nQDRANT_COLLECTION=invoice_parse_agent\nEMBEDDING_PROVIDER=hash\nOPENAI_API_KEY=\nINVOICE_DB_PATH=data/invoices.sqlite\nREVIEW_CONFIDENCE_THRESHOLD=0.8\n```\n\nWithout `ANTHROPIC_API_KEY`, the app uses a deterministic extractor so tests and demos stay reproducible.\n\n## Docker\n\n```bash\ndocker build -t invoice-parse-agent .\ndocker run --rm -p 8787:8787 \\\n  -e ANTHROPIC_API_KEY=\"$ANTHROPIC_API_KEY\" \\\n  -e ANTHROPIC_MODEL=claude-haiku-4-5 \\\n  invoice-parse-agent\n```\n\nWith Qdrant:\n\n```bash\ndocker compose up --build\n```\n\nThe compose stack starts Qdrant on `localhost:6333` and the app on `localhost:8787`.\n\n## API\n\n### `POST /parse`\n\nAccepted inputs:\n\n- JSON URL: `{ \"url\": \"https://...\" }`\n- JSON text for controlled tests: `{ \"text\": \"Vendor: ...\" }`\n- multipart upload: field name `file`\n- raw PDF/image/text body\n\nResponse shape:\n\n```json\n{\n  \"source\": { \"mode\": \"pdf-text\", \"pages\": 1, \"bytes\": 12345 },\n  \"memory\": { \"provider\": \"qdrant\", \"collection\": \"invoice_parse_agent\", \"hits\": 1, \"stored\": true },\n  \"invoice\": {\n    \"vendor\": { \"name\": \"Mustard Yellow Logistics GmbH\" },\n    \"invoiceNumber\": \"MYL-2026-001\",\n    \"invoiceDate\": \"2026-04-30\",\n    \"lineItems\": [],\n    \"tax\": { \"amount\": 59.09, \"currency\": \"EUR\" },\n    \"total\": { \"amount\": 370.09, \"currency\": \"EUR\" },\n    \"confidence\": 0.92,\n    \"warnings\": []\n  },\n  
\"rawText\": \"...\"\n}\n```\n\n### `GET /eval`\n\nRuns the ground-truth corpus and returns per-case misses plus aggregate field hit rate.\n\n### `GET /dashboard`\n\nServes a browser dashboard for upload, sample parsing, eval metrics, JSON output, review queue, editable corrections, and operational fit.\n\n### `GET /jobs`\n\nLists recent invoice parse jobs for the review queue.\n\n### `GET /jobs/:id`\n\nReturns one persisted parse job with invoice JSON and raw OCR text.\n\n### `PATCH /jobs/:id`\n\nAccepts reviewed invoice JSON, marks the job `reviewed`, and stores the corrected invoice back into Qdrant when `QDRANT_URL` is configured.\n\nTo print the current deterministic eval output:\n\n```bash\nbun run eval\n```\n\n## n8n integration\n\n`n8n-template.json` wires:\n\n```text\nWebhook -\u003e /parse -\u003e confidence gate -\u003e email accounting / JSON response\n```\n\nThis is the intended process-automation pattern: invoices enter via webhook, parsing is centralized, and a confidence gate decides whether an invoice is processed straight through or sent for review.\n\n## Quality gates\n\n```bash\nbun run typecheck\nbun test\nbun run eval\n```\n\n## Corpus\n\nThe corpus uses synthetic invoices to avoid licensing ambiguity and to keep ground truth exact. The fixture names and fields are logistics-oriented: freight, cold chain, parts, terminal handling, and customs preparation.\n\n`corpus/mustard-logistics-001-scan.png` is a rendered scanned-image fixture. The OCR smoke test runs the actual Tesseract.js wrapper against it and checks confidence plus recovered invoice identifiers.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmj-deving%2Finvoice-parse-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmj-deving%2Finvoice-parse-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmj-deving%2Finvoice-parse-agent/lists"}