https://github.com/agxp/docpulse
Async document intelligence API — upload any PDF/DOCX/image + a JSON Schema, get back structured JSON with per-field confidence scores. Go, PostgreSQL, GPT
- Host: GitHub
- URL: https://github.com/agxp/docpulse
- Owner: agxp
- Created: 2026-03-08T01:25:00.000Z (about 1 month ago)
- Default Branch: main
- Last Pushed: 2026-03-08T18:22:38.000Z (about 1 month ago)
- Last Synced: 2026-03-08T22:09:27.754Z (about 1 month ago)
- Topics: async, document-extraction, document-processing, go, gpt-4o, json-schema, llm, multi-tenant, ocr, openai, pdf, postgresql, rest-api, structured-data, tesseract, worker
- Language: Go
- Homepage:
- Size: 40 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# DocPulse — Document Intelligence API
Multi-tenant document extraction platform. Submit any document + a JSON schema describing what to extract → get back structured JSON with per-field confidence scores.
## Quickstart
### Single command (Docker)
```bash
OPENAI_API_KEY=sk-... docker compose up --build
```
Then open **http://localhost:8081** — the web UI loads with a dev API key pre-filled. Upload a document, pick a schema preset, and extract.
The stack (`api`, `worker`, `postgres`, `redis`) starts automatically. Migrations run on boot. A dev tenant is seeded with the key `di_devkey_changeme_in_production` (override via `DEV_API_KEY` env var).
### Local development (without Docker for the Go services)
```bash
# 1. Start infrastructure
docker compose up -d postgres redis
# 2. Set environment
cp .env.example .env
# Edit .env — add your OPENAI_API_KEY
set -a && source .env && set +a
# 3. Run migrations + create dev tenant
make migrate # requires psql installed locally
make seed # prints your API key — save it
# 4. Start API and worker (separate terminals)
make run-api
make run-worker
```
## Usage
### Web UI
The API server serves a frontend at `/`. In dev mode the API key is auto-filled. Steps:
1. Upload a PDF, DOCX, or image (max 50 MB)
2. Define a JSON Schema — or pick a preset (Invoice, Resume, Contract, Receipt, ID)
3. Click **Extract** — the UI polls for the result and displays each field with a confidence score
A sample document is included at `testdata/sample-invoice.docx`.
### API
#### Submit an extraction job
```bash
curl -X POST http://localhost:8081/v1/extract \
  -H "Authorization: Bearer di_your_key_here" \
  -F "document=@invoice.pdf" \
  -F 'schema={
    "type": "object",
    "properties": {
      "vendor": {"type": "string"},
      "invoice_number": {"type": "string"},
      "total": {"type": "number"},
      "line_items": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            "description": {"type": "string"},
            "amount": {"type": "number"}
          }
        }
      }
    },
    "required": ["vendor", "total"]
  }'

# Response:
# {"job_id": "abc-123", "status": "pending", "poll_url": "/v1/jobs/abc-123"}
```
#### Poll for results
```bash
curl http://localhost:8081/v1/jobs/abc-123 \
  -H "Authorization: Bearer di_your_key_here"
```
#### List jobs
```bash
curl "http://localhost:8081/v1/jobs?limit=20&offset=0" \
  -H "Authorization: Bearer di_your_key_here"
```
Default limit is 20, max is 100.
#### Webhooks
Register a URL to receive a POST when a job completes. The secret is generated server-side and shown **once** — store it to verify signatures.
```bash
# Register
curl -X POST http://localhost:8081/v1/webhooks \
  -H "Authorization: Bearer di_your_key_here" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/webhook"}'

# Response includes the secret — save it:
# {"id": "...", "url": "...", "secret": "abc123...", "active": true}

# Delete
curl -X DELETE http://localhost:8081/v1/webhooks/{id} \
  -H "Authorization: Bearer di_your_key_here"
```
Each delivery is a `POST` with:
- `Content-Type: application/json` — body is the full job object
- `X-DocPulse-Signature: sha256=<hex digest>` — HMAC-SHA256 of the body using your secret
Verify the signature on your server:
```python
import hmac, hashlib

def verify(secret: str, body: bytes, header: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, header)
```
Failed deliveries are retried up to 5 times with exponential backoff.
## Architecture
```
Client → API (Go/chi) → PostgreSQL (job queue)
                             ↓
                        Worker Pool
                  ┌──────────┼──────────┐
                  │          │          │
               Ingest      Chunk     Extract
                  │          │          │
              PDF/OCR    Semantic   LLM Router
               DOCX      Boundary  (fast/strong)
                  │          │          │
                  └──────────┼──────────┘
                             ↓
                      Result Assembly
                  + Confidence Scoring
                             ↓
                   Job Complete / Webhook
```
**Key decisions:**
- Async-first: jobs never block HTTP connections
- `FOR UPDATE SKIP LOCKED`: safe concurrent job claiming without a separate queue
- Two-tier LLM routing: cheap model for simple schemas, strong model for complex ones + automatic escalation on validation failure
- Content-hash cache: SHA-256(document + schema) catches exact duplicates at zero cost
- Magic-byte format detection: more robust than trusting file extensions
- HMAC-signed webhooks: recipients can verify payload integrity
## Project Structure
```
cmd/api/            — HTTP server entry point
cmd/worker/         — Job processor entry point
internal/
  api/              — HTTP handlers, routing, embedded frontend
  api/middleware/   — Auth, logging, rate limiting
  auth/             — API key generation and hashing
  config/           — Environment-based configuration
  database/         — PostgreSQL stores (jobs, tenants, webhooks)
  domain/           — Core types shared across packages
  extraction/       — Chunking engine
  ingestion/        — Format detection, text extraction (PDF/OCR/DOCX)
  jobs/             — Worker loop and job processing pipeline
  llm/              — Model routing and structured extraction
  storage/          — Object storage interface (local filesystem only)
  webhook/          — Webhook delivery with HMAC signing + retries
migrations/         — SQL schema (auto-applied on API startup)
testdata/           — Sample documents for testing
scripts/            — Dev utilities (seed tenant)
Dockerfile          — Multi-stage build: api and worker targets
```
## Stack
Go 1.24 · PostgreSQL 16 · Redis 7 · OpenAI API · Docker · Fly.io
**System dependencies** (for text extraction):
- `poppler-utils` — pdftotext for native PDFs
- `tesseract-ocr` — OCR for scanned PDFs and images
- `pandoc` — DOCX to text conversion
## Known limitations
- **Storage**: only local filesystem (`LocalStore`) is implemented. S3 support is stubbed but not built.
- **Schema validation**: validates structure (type=object, properties present, each property has a type), but does not implement the full JSON Schema specification.
- **Job list pagination**: `limit`/`offset` work and response includes a `total` count, but there is no cursor-based pagination.
- **Worker cache**: Redis-backed with a configurable TTL (`WORKER_CACHE_TTL`, default 24h), but no LRU eviction beyond TTL.
- **`make migrate`**: runs `psql` directly — requires `psql` installed on your machine. When using Docker (`docker compose up`), migrations run automatically on API startup instead.