https://github.com/mulkatz/mulder

Config-driven Document Intelligence Platform on GCP. PDFs → Knowledge Graph, defined by one YAML, deployed by one command.
cli document-intelligence gcp gemini knowledge-graph pgvector postgresql rag terraform typescript

# Mulder

Config-driven Document Intelligence Platform on GCP

Turn document collections into searchable knowledge graphs — defined by one config file, deployed by one command.


The truth is in the documents.



Live Demo · [Functional Spec](./docs/functional-spec.md) · [Roadmap](./docs/roadmap.md) · [Example Config](./mulder.config.example.yaml)

---

**Development Progress**   `66 / 125 steps`

```
M1 Foundation ██████████████████████████████ 11/11 ✓
M2 Ingest+Extract ██████████████████████████████ 9/9 ✓
M3 Segment+Enrich ██████████████████████████████ 10/10 ✓
QA Gate: Pre-Search ██████████████████████████████ 6/6 ✓
M4 Search (v1.0) ██████████████████████████████ 11/11 ✓
QA Gate: Post-MVP ██████████████████████████████ 7/7 ✓
M5 Curation ██████████████████████████████ 5/5 ✓
M6 Intelligence ██████████████████████████████ 7/7 ✓
M7 API+Workers ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/9
M8 Operations ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/6
M9 Multi-Format ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/13
M10 Provenance ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/9
M11 Trust Layer ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/5
M12 Discovery ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/4
M13 Observability ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/5
M14 Research Agent ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0/8
```



## What it does

Mulder transforms unstructured document collections — PDFs with complex layouts like magazines, newspapers, government correspondence — into structured, searchable knowledge.

You define your domain ontology in a single `mulder.config.yaml`. The pipeline adapts: extraction, entity resolution, retrieval, and analysis all derive from that one config file. No custom code per domain.

```
mulder.config.yaml → terraform apply → mulder pipeline run ./pdfs/ → mulder query "..."
```

## Capabilities

| # | Capability | What it does |
|---|-----------|-------------|
| 1 | **Layout Extraction** | Document AI + Gemini Vision fallback for magazines, newspapers, multi-column layouts |
| 2 | **Domain Ontology** | One YAML defines entities, relationships, extraction rules. Gemini structured output with auto-generated JSON Schema. |
| 3 | **Taxonomy** | Auto-bootstrapped after ~25 docs, incremental growth, human-in-the-loop curation, cross-lingual |
| 4 | **Hybrid Retrieval** | Vector (pgvector) + BM25 (tsvector) + graph traversal (recursive CTEs), fused via RRF + LLM re-ranking |
| 5 | **Web Grounding** | Gemini verifies entities against live web data — coordinates, bios, org descriptions |
| 6 | **Spatio-Temporal** | PostGIS proximity queries, temporal clustering, pattern detection across time and space |
| 7 | **Evidence Scoring** | Corroboration scores, two-phase contradiction detection, source reliability (PageRank), evidence chains |
| 8 | **Cross-Lingual Resolution** | 3-tier entity resolution (attribute match, embedding similarity, LLM-assisted) across 100+ languages |
| 9 | **Deduplication** | MinHash/SimHash near-duplicate detection, dedup-aware corroboration scoring |
| 10 | **Schema Evolution** | Config-hash tracking per document per step, selective reprocessing after config changes |
| 11 | **Visual Intelligence** _(v3.0 / Phase 2)_ | Image extraction, Gemini analysis, image embeddings, map/diagram data extraction |
| 12 | **Pattern Discovery** _(v3.0 / Phase 2)_ | Cluster anomalies, temporal spikes, subgraph similarity, proactive insights |
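
The fusion step in Hybrid Retrieval (row 4) can be illustrated with a minimal Reciprocal Rank Fusion sketch. This is not Mulder's implementation: the function name, the story IDs, and the constant `k = 60` (a commonly used default for RRF) are assumptions chosen for illustration. Each retriever contributes `1 / (k + rank)` per document, so a story that ranks well in several lists rises to the top.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank),
// where rank starts at 1 for the best hit in each list.
type RankedList = string[]; // document IDs, best first

function reciprocalRankFusion(
  lists: RankedList[],
  k = 60, // assumed default; not taken from the Mulder source
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}

// Hypothetical result lists from the vector, keyword, and graph retrievers.
const vectorHits = ["s1", "s2", "s3"];
const keywordHits = ["s2", "s4", "s1"];
const graphHits = ["s2", "s3"];

const fused = reciprocalRankFusion([vectorHits, keywordHits, graphHits]);
console.log(fused.map((r) => r.id)); // [ 's2', 's1', 's3', 's4' ]
```

In a pipeline like the one described here, the fused list would then be passed to the LLM re-ranking stage.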

## Pipeline

```
     PDF
      │
┌─────▼─────┐
│  Ingest   │  Upload to Cloud Storage, pre-flight validation
└─────┬─────┘
      │
┌─────▼─────┐
│  Extract  │  Document AI + Gemini Vision fallback → layout JSON + page images → GCS
└─────┬─────┘
      │
┌─────▼─────┐
│  Segment  │  Gemini identifies stories from page images → Markdown + metadata → GCS
└─────┬─────┘
      │
┌─────▼─────┐
│  Enrich   │  Entity extraction, taxonomy normalization, cross-lingual resolution
└─────┬─────┘
      │
┌─────▼─────┐
│  Ground   │  Web enrichment via Gemini Search: coordinates, bios, verification
└─────┬─────┘
      │
┌─────▼─────┐
│   Embed   │  Semantic chunking + text-embedding-004 (768-dim) → pgvector + BM25
└─────┬─────┘
      │
┌─────▼─────┐
│   Graph   │  Deduplication, corroboration scoring, contradiction flagging
└─────┬─────┘
      │
┌─────▼─────┐
│  Analyze  │  Contradiction resolution, PageRank reliability, evidence chains
└─────┬─────┘
      ▼
  Knowledge
    Graph
```

Every step is idempotent, independently runnable, and CLI-accessible. Content artifacts live in GCS, search index in PostgreSQL.
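
One way to picture idempotent steps combined with config-hash tracking (capability 10) is the sketch below. It is an illustration under stated assumptions, not Mulder's code: `needsRun`, `markDone`, the in-memory map, and the FNV-1a hash are all hypothetical stand-ins. A real system would more likely use a cryptographic hash and a PostgreSQL table.

```typescript
// Stable serialization: object keys are sorted so equal configs serialize identically.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalize(obj[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// 32-bit FNV-1a, used here only to keep the sketch dependency-free.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Hypothetical stand-in for a per-document, per-step tracking table.
const processedWith = new Map<string, string>();

// A step needs to run if it never ran, or ran under a different config.
function needsRun(documentId: string, step: string, stepConfig: unknown): boolean {
  return processedWith.get(`${documentId}:${step}`) !== fnv1a(canonicalize(stepConfig));
}

function markDone(documentId: string, step: string, stepConfig: unknown): void {
  processedWith.set(`${documentId}:${step}`, fnv1a(canonicalize(stepConfig)));
}

markDone("doc-1", "enrich", { entityTypes: ["person", "event"] });
// Same config: the step can be skipped on a re-run.
console.log(needsRun("doc-1", "enrich", { entityTypes: ["person", "event"] }));
// Changed config: only this step (and whatever depends on it) is reprocessed.
console.log(needsRun("doc-1", "enrich", { entityTypes: ["person", "event", "location"] }));
```

Re-running a step with an unchanged config becomes a no-op, and a config edit only invalidates the steps whose slice of the config actually changed.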

## Configuration

All domain logic lives in `mulder.config.yaml`. Define your domain, the pipeline adapts:

```yaml
project:
  name: investigative-journalism

ontology:
  entity_types:
    - name: person
      description: Individual mentioned in documents
      attributes:
        - { name: role, type: string }
        - { name: affiliation, type: string }
    - name: event
      description: A specific incident or occurrence
      attributes:
        - { name: date, type: date }
        - { name: location, type: string }
    - name: location
      description: Geographic place
      attributes:
        - { name: coordinates, type: geo_point, optional: true }

  relationships:
    - { name: involved_in, from: person, to: event }
    - { name: occurred_at, from: event, to: location }
```

Everything beyond `project` and `ontology` has sensible defaults. See [`mulder.config.example.yaml`](./mulder.config.example.yaml) for the full reference.
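
To make the "auto-generated JSON Schema" idea from the capabilities table concrete, here is a minimal sketch of how an entity type from the config above could be turned into a JSON Schema for structured output. The TypeScript types, the `entitySchema` function, and the mapping of `date` and `geo_point` to schema fragments are assumptions for illustration. The actual generator in Mulder may work differently.

```typescript
type AttributeType = "string" | "date" | "geo_point";

interface Attribute {
  name: string;
  type: AttributeType;
  optional?: boolean;
}

interface EntityType {
  name: string;
  description: string;
  attributes: Attribute[];
}

type JsonSchema = Record<string, unknown>;

// Assumed mapping from config attribute types to JSON Schema fragments.
const ATTRIBUTE_SCHEMAS: Record<AttributeType, JsonSchema> = {
  string: { type: "string" },
  date: { type: "string", format: "date" },
  geo_point: {
    type: "object",
    properties: { lat: { type: "number" }, lon: { type: "number" } },
    required: ["lat", "lon"],
  },
};

// Build an object schema: every attribute becomes a property,
// and attributes not marked optional are listed as required.
function entitySchema(entity: EntityType): JsonSchema {
  const properties: Record<string, JsonSchema> = {};
  const required: string[] = [];
  for (const attr of entity.attributes) {
    properties[attr.name] = ATTRIBUTE_SCHEMAS[attr.type];
    if (!attr.optional) required.push(attr.name);
  }
  return {
    title: entity.name,
    description: entity.description,
    type: "object",
    properties,
    required,
  };
}

// The `event` entity type from the example config.
const eventType: EntityType = {
  name: "event",
  description: "A specific incident or occurrence",
  attributes: [
    { name: "date", type: "date" },
    { name: "location", type: "string" },
  ],
};

console.log(JSON.stringify(entitySchema(eventType), null, 2));
```

Because the schema is derived from the config, adding an attribute in YAML changes what the model is asked to return without touching any code.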

## Architecture

| | |
|---|---|
| **Single PostgreSQL** | pgvector + tsvector + PostGIS + recursive CTEs + job queue in one instance. No graph DB, no Redis, no Pub/Sub. |
| **Content in GCS** | PDFs, layout JSON, page images, and story Markdown live in Cloud Storage. PostgreSQL holds references and the search index only. |
| **Service Abstraction** | All GCP services sit behind interfaces. Dev mode uses fixtures: zero API calls, zero cost. |
| **CLI-first** | Every capability is a CLI command. The API is a job producer, not a direct executor. |
| **PostgreSQL is truth** | Pipeline state, job queue, config tracking. Firestore is observability-only (UI monitoring). |
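
The Service Abstraction principle can be sketched as an interface with a fixture-backed implementation for dev mode. Everything named here (`DocumentExtractor`, `FixtureExtractor`, `CloudExtractor`, `createExtractor`, the bucket URI) is hypothetical and only shows the shape of the pattern. The real service boundaries in Mulder may be drawn differently.

```typescript
interface LayoutBlock {
  page: number;
  text: string;
}

// Hypothetical service boundary: pipeline code depends only on this interface.
interface DocumentExtractor {
  extractLayout(gcsUri: string): Promise<LayoutBlock[]>;
}

// Dev mode: answers from recorded fixtures, so no API is called.
class FixtureExtractor implements DocumentExtractor {
  private readonly fixtures: Record<string, LayoutBlock[]>;

  constructor(fixtures: Record<string, LayoutBlock[]>) {
    this.fixtures = fixtures;
  }

  async extractLayout(gcsUri: string): Promise<LayoutBlock[]> {
    const blocks = this.fixtures[gcsUri];
    if (!blocks) throw new Error(`No fixture recorded for ${gcsUri}`);
    return blocks;
  }
}

// Cloud mode: placeholder only; a real implementation would call Document AI.
class CloudExtractor implements DocumentExtractor {
  async extractLayout(_gcsUri: string): Promise<LayoutBlock[]> {
    throw new Error("Not implemented in this sketch");
  }
}

function createExtractor(
  mode: "dev" | "cloud",
  fixtures: Record<string, LayoutBlock[]> = {},
): DocumentExtractor {
  return mode === "dev" ? new FixtureExtractor(fixtures) : new CloudExtractor();
}

const extractor = createExtractor("dev", {
  "gs://example-bucket/report.pdf": [{ page: 1, text: "Headline" }],
});
extractor
  .extractLayout("gs://example-bucket/report.pdf")
  .then((blocks) => console.log(`${blocks.length} block(s) from fixture`));
```

Swapping the implementation at one factory call is what would let the same pipeline code run against fixtures locally and against GCP in production.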

**Baseline cost:** ~30-40 EUR/mo for a small Cloud SQL instance. Scales with Gemini API usage.

## Tech Stack

| | |
|---|---|
| **Language** | TypeScript (ESM, strict mode) |
| **Monorepo** | pnpm + Turborepo |
| **Infrastructure** | Terraform (modular) |
| **OCR** | Document AI Layout Parser |
| **LLM** | Gemini 2.5 Flash (Vertex AI) |
| **Embeddings** | text-embedding-004 (768-dim Matryoshka) |
| **Database** | Cloud SQL PostgreSQL |
| **Search** | pgvector (HNSW) + tsvector (BM25) + recursive CTEs |
| **Geospatial** | PostGIS |
| **CLI** | Commander.js |
| **Testing** | Vitest |

## Status

Mulder's **v1.0 MVP (M4)** is complete — the full pipeline from ingest through hybrid retrieval is operational. PDFs go in, a knowledge graph comes out, and natural-language queries return ranked passages with LLM re-ranking. The [functional spec](./docs/functional-spec.md), [implementation roadmap](./docs/roadmap.md), and [config schema](./mulder.config.example.yaml) are finalized.

See the [roadmap](./docs/roadmap.md) for all 14 milestones from foundation to autonomous research agent.

## Contributing

Contributions, feedback, and ideas are welcome. Open an issue or start a discussion.

## License

[Apache 2.0](LICENSE)