{"id":47172624,"url":"https://github.com/agxp/docpulse","last_synced_at":"2026-03-13T06:05:23.564Z","repository":{"id":343070638,"uuid":"1175647004","full_name":"agxp/docpulse","owner":"agxp","description":"Async document intelligence API — upload any PDF/DOCX/image + a JSON Schema, get back structured JSON with per-field confidence scores. Go, PostgreSQL, GPT","archived":false,"fork":false,"pushed_at":"2026-03-08T18:22:38.000Z","size":41,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-08T22:09:27.754Z","etag":null,"topics":["async","document-extraction","document-processing","go","gpt-4o","json-schema","llm","multi-tenant","ocr","openai","pdf","postgresql","rest-api","structured-data","tesseract","worker"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/agxp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-08T01:25:00.000Z","updated_at":"2026-03-08T18:31:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/agxp/docpulse","commit_stats":null,"previous_names":["agxp/docpulse"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/agxp/docpulse","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Fdocpulse","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Fdocpulse/tags","re
leases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Fdocpulse/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Fdocpulse/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/agxp","download_url":"https://codeload.github.com/agxp/docpulse/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/agxp%2Fdocpulse/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30459817,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-13T03:55:51.346Z","status":"ssl_error","status_checked_at":"2026-03-13T03:55:33.055Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["async","document-extraction","document-processing","go","gpt-4o","json-schema","llm","multi-tenant","ocr","openai","pdf","postgresql","rest-api","structured-data","tesseract","worker"],"created_at":"2026-03-13T06:05:13.147Z","updated_at":"2026-03-13T06:05:23.558Z","avatar_url":"https://github.com/agxp.png","language":"Go","readme":"# DocPulse — Document Intelligence API\n\nMulti-tenant document extraction platform. Submit any document + a JSON schema describing what to extract → get back structured JSON with per-field confidence scores.\n\n## Quickstart\n\n### Single command (Docker)\n\n```bash\nOPENAI_API_KEY=sk-... 
docker compose up --build\n```\n\nThen open **http://localhost:8081** — the web UI loads with a dev API key pre-filled. Upload a document, pick a schema preset, and extract.\n\nThe stack (`api`, `worker`, `postgres`, `redis`) starts automatically. Migrations run on boot. A dev tenant is seeded with the key `di_devkey_changeme_in_production` (override via `DEV_API_KEY` env var).\n\n### Local development (without Docker for the Go services)\n\n```bash\n# 1. Start infrastructure\ndocker compose up -d postgres redis\n\n# 2. Set environment\ncp .env.example .env\n# Edit .env — add your OPENAI_API_KEY\n\nset -a \u0026\u0026 source .env \u0026\u0026 set +a\n\n# 3. Run migrations + create dev tenant\nmake migrate    # requires psql installed locally\nmake seed       # prints your API key — save it\n\n# 4. Start API and worker (separate terminals)\nmake run-api\nmake run-worker\n```\n\n## Usage\n\n### Web UI\n\nThe API server serves a frontend at `/`. In dev mode the API key is auto-filled. Steps:\n\n1. Upload a PDF, DOCX, or image (max 50 MB)\n2. Define a JSON Schema — or pick a preset (Invoice, Resume, Contract, Receipt, ID)\n3. 
Click **Extract** — the UI polls for the result and displays each field with a confidence score\n\nA sample document is included at `testdata/sample-invoice.docx`.\n\n### API\n\n#### Submit an extraction job\n\n```bash\ncurl -X POST http://localhost:8081/v1/extract \\\n  -H \"Authorization: Bearer di_your_key_here\" \\\n  -F \"document=@invoice.pdf\" \\\n  -F 'schema={\n    \"type\": \"object\",\n    \"properties\": {\n      \"vendor\": {\"type\": \"string\"},\n      \"invoice_number\": {\"type\": \"string\"},\n      \"total\": {\"type\": \"number\"},\n      \"line_items\": {\n        \"type\": \"array\",\n        \"items\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"description\": {\"type\": \"string\"},\n            \"amount\": {\"type\": \"number\"}\n          }\n        }\n      }\n    },\n    \"required\": [\"vendor\", \"total\"]\n  }'\n\n# Response:\n# {\"job_id\": \"abc-123\", \"status\": \"pending\", \"poll_url\": \"/v1/jobs/abc-123\"}\n```\n\n#### Poll for results\n\n```bash\ncurl http://localhost:8081/v1/jobs/abc-123 \\\n  -H \"Authorization: Bearer di_your_key_here\"\n```\n\n#### List jobs\n\n```bash\ncurl \"http://localhost:8081/v1/jobs?limit=20\u0026offset=0\" \\\n  -H \"Authorization: Bearer di_your_key_here\"\n```\n\nDefault limit is 20, max is 100.\n\n#### Webhooks\n\nRegister a URL to receive a POST when a job completes. 
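Since DocPulse itself is written in Go, a receiver may want the same HMAC check in Go rather than Python. A minimal sketch using the standard library's constant-time comparison — `signBody` and `verifySignature` are illustrative helper names, not part of DocPulse, and the `sha256=<hex>` header format is the `X-DocPulse-Signature` scheme documented here:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signBody computes "sha256=" + hex(HMAC-SHA256(body)) using the webhook secret,
// matching the X-DocPulse-Signature header format.
func signBody(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return "sha256=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature compares the received header against the expected signature
// in constant time, so signature checks don't leak timing information.
func verifySignature(secret string, body []byte, header string) bool {
	return hmac.Equal([]byte(signBody(secret, body)), []byte(header))
}

func main() {
	secret := "abc123" // the one-time secret returned when the webhook was registered
	body := []byte(`{"job_id":"abc-123","status":"completed"}`)
	header := signBody(secret, body) // stand-in for the X-DocPulse-Signature header
	fmt.Println(verifySignature(secret, body, header)) // prints "true"
}
```

In a real handler, read the raw request body before any JSON decoding and compare it against the header exactly as received; re-serializing the payload can change byte order or whitespace and break the HMAC.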
The secret is generated server-side and shown **once** — store it to verify signatures.\n\n```bash\n# Register\ncurl -X POST http://localhost:8081/v1/webhooks \\\n  -H \"Authorization: Bearer di_your_key_here\" \\\n  -H \"Content-Type: application/json\" \\\n  -d '{\"url\": \"https://example.com/webhook\"}'\n\n# Response includes the secret — save it:\n# {\"id\": \"...\", \"url\": \"...\", \"secret\": \"abc123...\", \"active\": true}\n\n# Delete\ncurl -X DELETE http://localhost:8081/v1/webhooks/{id} \\\n  -H \"Authorization: Bearer di_your_key_here\"\n```\n\nEach delivery is a `POST` with:\n- `Content-Type: application/json` — body is the full job object\n- `X-DocPulse-Signature: sha256=\u003chmac\u003e` — HMAC-SHA256 of the body using your secret\n\nVerify the signature on your server:\n\n```python\nimport hmac, hashlib\n\ndef verify(secret: str, body: bytes, header: str) -\u003e bool:\n    expected = \"sha256=\" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()\n    return hmac.compare_digest(expected, header)\n```\n\nFailed deliveries are retried up to 5 times with exponential backoff.\n\n## Architecture\n\n```\nClient → API (Go/chi) → PostgreSQL (job queue)\n                              ↓\n                         Worker Pool\n                    ┌────────┼────────┐\n                    │        │        │\n                 Ingest   Chunk    Extract\n                    │        │        │\n               PDF/OCR   Semantic  LLM Router\n               DOCX      Boundary  (fast/strong)\n                    │        │        │\n                    └────────┼────────┘\n                              ↓\n                     Result Assembly\n                     + Confidence Scoring\n                              ↓\n                     Job Complete / Webhook\n```\n\n**Key decisions:**\n- Async-first: jobs never block HTTP connections\n- FOR UPDATE SKIP LOCKED: safe concurrent job claiming without a separate queue\n- Two-tier LLM routing: cheap model for 
simple schemas, strong model for complex ones + automatic escalation on validation failure\n- Content-hash cache: SHA-256(document + schema) catches exact duplicates at zero cost\n- Magic-byte format detection: more robust than trusting file extensions\n- HMAC-signed webhooks: recipients can verify payload integrity\n\n## Project Structure\n\n```\ncmd/api/          — HTTP server entry point\ncmd/worker/       — Job processor entry point\ninternal/\n  api/            — HTTP handlers, routing, embedded frontend\n  api/middleware/  — Auth, logging, rate limiting\n  auth/           — API key generation and hashing\n  config/         — Environment-based configuration\n  database/       — PostgreSQL stores (jobs, tenants, webhooks)\n  domain/         — Core types shared across packages\n  extraction/     — Chunking engine\n  ingestion/      — Format detection, text extraction (PDF/OCR/DOCX)\n  jobs/           — Worker loop and job processing pipeline\n  llm/            — Model routing and structured extraction\n  storage/        — Object storage interface (local filesystem only)\n  webhook/        — Webhook delivery with HMAC signing + retries\nmigrations/       — SQL schema (auto-applied on API startup)\ntestdata/         — Sample documents for testing\nscripts/          — Dev utilities (seed tenant)\nDockerfile        — Multi-stage build: api and worker targets\n```\n\n## Stack\n\nGo 1.24 · PostgreSQL 16 · Redis 7 · OpenAI API · Docker · Fly.io\n\n**System dependencies** (for text extraction):\n- `poppler-utils` — pdftotext for native PDFs\n- `tesseract-ocr` — OCR for scanned PDFs and images\n- `pandoc` — DOCX to text conversion\n\n## Known limitations\n\n- **Storage**: only local filesystem (`LocalStore`) is implemented. 
S3 support is stubbed but not built.\n- **Schema validation**: validates structure (type=object, properties present, each property has a type), but does not implement the full JSON Schema specification.\n- **Job list pagination**: `limit`/`offset` work and response includes a `total` count, but there is no cursor-based pagination.\n- **Worker cache**: Redis-backed with a configurable TTL (`WORKER_CACHE_TTL`, default 24h), but no LRU eviction beyond TTL.\n- **`make migrate`**: runs `psql` directly — requires `psql` installed on your machine. When using Docker (`docker compose up`), migrations run automatically on API startup instead.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagxp%2Fdocpulse","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagxp%2Fdocpulse","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagxp%2Fdocpulse/lists"}