https://github.com/mulkatz/mulder
Config-driven Document Intelligence Platform on GCP. PDFs → Knowledge Graph, defined by one YAML, deployed by one command.
cli document-intelligence gcp gemini knowledge-graph pgvector postgresql rag terraform typescript
- Host: GitHub
- URL: https://github.com/mulkatz/mulder
- Owner: mulkatz
- License: apache-2.0
- Created: 2026-03-26T14:59:17.000Z (19 days ago)
- Default Branch: main
- Last Pushed: 2026-04-13T20:34:20.000Z (1 day ago)
- Last Synced: 2026-04-13T22:29:38.883Z (about 24 hours ago)
- Topics: cli, document-intelligence, gcp, gemini, knowledge-graph, pgvector, postgresql, rag, terraform, typescript
- Language: TypeScript
- Homepage: https://mulder.mulkatz.dev
- Size: 3.4 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Roadmap: docs/roadmap.md
README
Mulder
Config-driven Document Intelligence Platform on GCP
Turn document collections into searchable knowledge graphs — defined by one config file, deployed by one command.
The truth is in the documents.
Live Demo · Functional Spec · Roadmap · Example Config
---
**Development Progress** `66 / 125 steps`
```
M1 Foundation ██████████████████████████████ 11/11 ✓
M2 Ingest+Extract ██████████████████████████████ 9/9 ✓
M3 Segment+Enrich ██████████████████████████████ 10/10 ✓
QA Gate: Pre-Search ██████████████████████████████ 6/6 ✓
M4 Search (v1.0) ██████████████████████████████ 11/11 ✓
QA Gate: Post-MVP ██████████████████████████████ 7/7 ✓
M5 Curation ██████████████████████████████ 5/5 ✓
M6 Intelligence ██████████████████████████████ 7/7 ✓
M7 API+Workers ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/9
M8 Operations ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/6
M9 Multi-Format ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/13
M10 Provenance ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/9
M11 Trust Layer ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/5
M12 Discovery ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/4
M13 Observability ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/5
M14 Research Agent ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/8
```
## What it does
Mulder transforms unstructured document collections — PDFs with complex layouts like magazines, newspapers, government correspondence — into structured, searchable knowledge.
You define your domain ontology in a single `mulder.config.yaml`. The pipeline adapts: extraction, entity resolution, retrieval, and analysis all derive from that one config file. No custom code per domain.
```
mulder.config.yaml → terraform apply → mulder pipeline run ./pdfs/ → mulder query "..."
```
## Capabilities
| # | Capability | What it does |
|---|-----------|-------------|
| 1 | **Layout Extraction** | Document AI + Gemini Vision fallback for magazines, newspapers, multi-column layouts |
| 2 | **Domain Ontology** | One YAML defines entities, relationships, extraction rules. Gemini structured output with auto-generated JSON Schema. |
| 3 | **Taxonomy** | Auto-bootstrapped after ~25 docs, incremental growth, human-in-the-loop curation, cross-lingual |
| 4 | **Hybrid Retrieval** | Vector (pgvector) + BM25 (tsvector) + graph traversal (recursive CTEs), fused via RRF + LLM re-ranking |
| 5 | **Web Grounding** | Gemini verifies entities against live web data — coordinates, bios, org descriptions |
| 6 | **Spatio-Temporal** | PostGIS proximity queries, temporal clustering, pattern detection across time and space |
| 7 | **Evidence Scoring** | Corroboration scores, two-phase contradiction detection, source reliability (PageRank), evidence chains |
| 8 | **Cross-Lingual Resolution** | 3-tier entity resolution (attribute match, embedding similarity, LLM-assisted) across 100+ languages |
| 9 | **Deduplication** | MinHash/SimHash near-duplicate detection, dedup-aware corroboration scoring |
| 10 | **Schema Evolution** | Config-hash tracking per document per step, selective reprocessing after config changes |
| 11 | **Visual Intelligence** _(v3.0 / Phase 2)_ | Image extraction, Gemini analysis, image embeddings, map/diagram data extraction |
| 12 | **Pattern Discovery** _(v3.0 / Phase 2)_ | Cluster anomalies, temporal spikes, subgraph similarity, proactive insights |
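To make the fusion step in capability 4 concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF). It is not taken from the Mulder codebase: the function name, the document IDs, and the constant `k = 60` (a commonly used default) are assumptions for illustration. Each retriever contributes `1 / (k + rank)` per document, so documents that rank well in several lists rise to the top.

```typescript
// Illustrative sketch of Reciprocal Rank Fusion; names and IDs are hypothetical.
type RankedList = string[]; // document IDs, best match first

function reciprocalRankFusion(
  lists: RankedList[],
  k = 60, // assumed smoothing constant, not a documented Mulder setting
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties are broken by ID to keep output stable.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Hypothetical result lists from the three retrievers.
const vectorHits: RankedList = ["doc-a", "doc-b", "doc-c"];
const keywordHits: RankedList = ["doc-b", "doc-d", "doc-a"];
const graphHits: RankedList = ["doc-b", "doc-c"];

const fused = reciprocalRankFusion([vectorHits, keywordHits, graphHits]);
console.log(fused.map((r) => r.id)); // [ 'doc-b', 'doc-a', 'doc-c', 'doc-d' ]
```

RRF only needs ranks, not comparable scores, which is why it suits mixing vector distances, full-text ranks, and graph hits. An LLM re-ranker would then reorder the top of the fused list.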
## Pipeline
```
      PDF
       │
 ┌─────▼─────┐
 │  Ingest   │  Upload to Cloud Storage, pre-flight validation
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Extract  │  Document AI + Gemini Vision fallback → layout JSON + page images → GCS
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Segment  │  Gemini identifies stories from page images → Markdown + metadata → GCS
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Enrich   │  Entity extraction, taxonomy normalization, cross-lingual resolution
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Ground   │  Web enrichment via Gemini Search: coordinates, bios, verification
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │   Embed   │  Semantic chunking + text-embedding-004 (768-dim) → pgvector + BM25
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │   Graph   │  Deduplication, corroboration scoring, contradiction flagging
 └─────┬─────┘
       │
 ┌─────▼─────┐
 │  Analyze  │  Contradiction resolution, PageRank reliability, evidence chains
 └─────┬─────┘
       │
   Knowledge
     Graph
```
Every step is idempotent, independently runnable, and CLI-accessible. Content artifacts live in GCS; the search index lives in PostgreSQL.
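One way to combine idempotent steps with the config-hash tracking from capability 10 is to hash only the config sections a step depends on, and to rerun the step when that hash differs from the stored one. The sketch below illustrates the idea under stated assumptions. It is not Mulder's implementation: the step-to-section mapping, the `embedding` section, and all function names are hypothetical.

```typescript
import { createHash } from "node:crypto";

type Config = Record<string, unknown>;

// Hypothetical mapping: which config sections each pipeline step reads.
const STEP_CONFIG_KEYS: Record<string, string[]> = {
  enrich: ["ontology"],
  embed: ["embedding"],
};

// Serialize with sorted object keys so key order never changes the hash.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(stableStringify).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${stableStringify(obj[key])}`);
    return `{${parts.join(",")}}`;
  }
  return JSON.stringify(value) ?? "null";
}

// SHA-256 over only the sections relevant to the given step.
function stepConfigHash(step: string, config: Config): string {
  const relevant: Record<string, unknown> = {};
  for (const key of STEP_CONFIG_KEYS[step] ?? []) {
    relevant[key] = config[key] ?? null;
  }
  return createHash("sha256").update(stableStringify(relevant)).digest("hex");
}

// A step needs to rerun for a document when its stored hash is missing or stale.
function needsReprocessing(step: string, config: Config, storedHash?: string): boolean {
  return storedHash !== stepConfigHash(step, config);
}

const before: Config = { ontology: { entity_types: ["person"] }, embedding: { dimensions: 768 } };
const after: Config = { ontology: { entity_types: ["person", "event"] }, embedding: { dimensions: 768 } };

const storedEnrich = stepConfigHash("enrich", before);
const storedEmbed = stepConfigHash("embed", before);

console.log(needsReprocessing("enrich", after, storedEnrich)); // true: ontology changed
console.log(needsReprocessing("embed", after, storedEmbed)); // false: embedding section untouched
```

Storing one hash per document per step keeps reprocessing selective: adding an entity type invalidates the enrichment output without forcing a re-embed.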
## Configuration
All domain logic lives in `mulder.config.yaml`. Define your domain and the pipeline adapts:
```yaml
project:
  name: investigative-journalism
ontology:
  entity_types:
    - name: person
      description: Individual mentioned in documents
      attributes:
        - { name: role, type: string }
        - { name: affiliation, type: string }
    - name: event
      description: A specific incident or occurrence
      attributes:
        - { name: date, type: date }
        - { name: location, type: string }
    - name: location
      description: Geographic place
      attributes:
        - { name: coordinates, type: geo_point, optional: true }
  relationships:
    - { name: involved_in, from: person, to: event }
    - { name: occurred_at, from: event, to: location }
```
Everything beyond `project` and `ontology` has sensible defaults. See [`mulder.config.example.yaml`](./mulder.config.example.yaml) for the full reference.
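The capabilities table mentions auto-generated JSON Schema for structured output. As a rough illustration of how an ontology entry like the one above could be turned into a schema, here is a hedged sketch. The TypeScript types, the function names, and the mapping of `date` and `geo_point` to JSON Schema are assumptions, not the schema Mulder actually generates.

```typescript
// Hypothetical types mirroring the YAML ontology shown above.
type AttributeType = "string" | "date" | "geo_point";

interface Attribute {
  name: string;
  type: AttributeType;
  optional?: boolean;
}

interface EntityType {
  name: string;
  description: string;
  attributes: Attribute[];
}

type JsonSchema = Record<string, unknown>;

// Assumed mapping from ontology attribute types to JSON Schema fragments.
function attributeSchema(type: AttributeType): JsonSchema {
  switch (type) {
    case "string":
      return { type: "string" };
    case "date":
      return { type: "string", format: "date" };
    case "geo_point":
      return {
        type: "object",
        properties: { lat: { type: "number" }, lon: { type: "number" } },
        required: ["lat", "lon"],
      };
  }
}

// Build an object schema for one entity type; attributes are required unless marked optional.
function entitySchema(entity: EntityType): JsonSchema {
  const properties: Record<string, JsonSchema> = {};
  const required: string[] = [];
  for (const attr of entity.attributes) {
    properties[attr.name] = attributeSchema(attr.type);
    if (!attr.optional) required.push(attr.name);
  }
  return { type: "object", description: entity.description, properties, required };
}

const person: EntityType = {
  name: "person",
  description: "Individual mentioned in documents",
  attributes: [
    { name: "role", type: "string" },
    { name: "affiliation", type: "string" },
  ],
};

const location: EntityType = {
  name: "location",
  description: "Geographic place",
  attributes: [{ name: "coordinates", type: "geo_point", optional: true }],
};

console.log(JSON.stringify(entitySchema(person), null, 2));
console.log(JSON.stringify(entitySchema(location), null, 2));
```

Deriving the schema from the config rather than writing it by hand means a new entity type or attribute changes the extraction contract without any per-domain code.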
## Architecture
| | |
|---|---|
| **Single PostgreSQL** | pgvector + tsvector + PostGIS + recursive CTEs + job queue: one instance, no graph DB, no Redis, no Pub/Sub |
| **Content in GCS** | PDFs, layout JSON, page images, story Markdown in Cloud Storage. PostgreSQL holds references + search index only. |
| **Service Abstraction** | All GCP services behind interfaces. Dev mode uses fixtures: zero API calls, zero cost. |
| **CLI-first** | Every capability is a CLI command. The API is a job producer, not a direct executor. |
| **PostgreSQL is truth** | Pipeline state, job queue, config tracking. Firestore is observability-only (UI monitoring). |
**Baseline cost:** ~30-40 EUR/mo for a small Cloud SQL instance. Scales with Gemini API usage.
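The Service Abstraction principle above can be pictured as an interface with a fixture-backed implementation selected in dev mode. The sketch below is illustrative only: `EmbeddingService`, `FixtureEmbeddingService`, `fixtureVector`, and the `"dev" | "gcp"` mode switch are hypothetical names, and the fixture vectors are deterministic placeholders rather than real embeddings.

```typescript
// Hypothetical interface that pipeline code would depend on instead of a GCP client.
interface EmbeddingService {
  embed(texts: string[]): Promise<number[][]>;
}

// Deterministic pseudo-embedding: same text always gives the same vector. Not a real model.
function fixtureVector(text: string, dimensions: number): number[] {
  const vector = new Array<number>(dimensions).fill(0);
  for (let i = 0; i < text.length; i++) {
    const slot = i % dimensions;
    vector[slot] = (vector[slot] ?? 0) + text.charCodeAt(i) / 1000;
  }
  return vector;
}

// Dev-mode implementation: no network calls, no cost.
class FixtureEmbeddingService implements EmbeddingService {
  private readonly dimensions: number;

  constructor(dimensions = 768) {
    this.dimensions = dimensions;
  }

  async embed(texts: string[]): Promise<number[][]> {
    return texts.map((text) => fixtureVector(text, this.dimensions));
  }
}

// Factory: callers never learn which implementation they received.
function createEmbeddingService(mode: "dev" | "gcp"): EmbeddingService {
  if (mode === "dev") return new FixtureEmbeddingService();
  // A Vertex AI-backed implementation would be returned here; omitted from this sketch.
  throw new Error("GCP-backed implementation is outside the scope of this sketch");
}

const service = createEmbeddingService("dev");
service.embed(["The truth is in the documents."]).then((vectors) => {
  console.log(vectors.length, vectors[0]?.length);
});
```

Because pipeline steps only see the interface, tests and local runs can exercise the full flow against fixtures, and the cloud-backed client is swapped in by configuration.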
## Tech Stack
| | |
|---|---|
| **Language** | TypeScript (ESM, strict mode) |
| **Monorepo** | pnpm + Turborepo |
| **Infrastructure** | Terraform (modular) |
| **OCR** | Document AI Layout Parser |
| **LLM** | Gemini 2.5 Flash (Vertex AI) |
| **Embeddings** | text-embedding-004 (768-dim Matryoshka) |
| **Database** | Cloud SQL PostgreSQL |
| **Search** | pgvector (HNSW) + tsvector (BM25) + recursive CTEs |
| **Geospatial** | PostGIS |
| **CLI** | Commander.js |
| **Testing** | Vitest |
## Status
Mulder's **v1.0 MVP (M4)** is complete: the full pipeline from ingest through hybrid retrieval is operational. PDFs go in, a knowledge graph comes out, and natural-language queries return ranked passages with LLM re-ranking. The progress chart above also marks M5 (Curation) and M6 (Intelligence) as done. The [functional spec](./docs/functional-spec.md), [implementation roadmap](./docs/roadmap.md), and [config schema](./mulder.config.example.yaml) are finalized.
See the [roadmap](./docs/roadmap.md) for all 14 milestones from foundation to autonomous research agent.
## Contributing
Contributions, feedback, and ideas are welcome. Open an issue or start a discussion.
## License
[Apache 2.0](LICENSE)