{"id":48810363,"url":"https://github.com/mulkatz/mulder","last_synced_at":"2026-04-14T07:01:59.602Z","repository":{"id":347785443,"uuid":"1192783584","full_name":"mulkatz/mulder","owner":"mulkatz","description":"Config-driven Document Intelligence Platform on GCP. PDFs → Knowledge Graph, defined by one YAML, deployed by one command.","archived":false,"fork":false,"pushed_at":"2026-04-13T20:34:20.000Z","size":3565,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-13T22:29:38.883Z","etag":null,"topics":["cli","document-intelligence","gcp","gemini","knowledge-graph","pgvector","postgresql","rag","terraform","typescript"],"latest_commit_sha":null,"homepage":"https://mulder.mulkatz.dev","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mulkatz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":"docs/roadmap.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-26T14:59:17.000Z","updated_at":"2026-04-13T20:34:23.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mulkatz/mulder","commit_stats":null,"previous_names":["mulkatz/mulder"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mulkatz/mulder","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mulkatz%2Fmulder","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mulkatz%2Fmulder/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mulkatz%2Fmulder/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mulkatz%2Fmulder/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mulkatz","download_url":"https://codeload.github.com/mulkatz/mulder/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mulkatz%2Fmulder/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31785681,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-14T02:24:21.117Z","status":"ssl_error","status_checked_at":"2026-04-14T02:24:20.627Z","response_time":153,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","document-intelligence","gcp","gemini","knowledge-graph","pgvector","postgresql","rag","terraform","typescript"],"created_at":"2026-04-14T07:01:55.237Z","updated_at":"2026-04-14T07:01:59.597Z","avatar_url":"https://github.com/mulkatz.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./public/mulder-icon.svg\" width=\"100\" alt=\"Mulder\" /\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eMulder\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eConfig-driven Document Intelligence Platform on GCP\u003c/strong\u003e\u003cbr /\u003e\n  Turn document collections into searchable knowledge graphs — defined by one config file, deployed by one command.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cem\u003eThe truth is in the documents.\u003c/em\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://mulder.mulkatz.dev\"\u003e\u003cimg src=\"https://img.shields.io/badge/demo-live-blue?style=flat-square\" alt=\"Live Demo\" /\u003e\u003c/a\u003e\n  \u003ca href=\"./LICENSE\"\u003e\u003cimg src=\"https://img.shields.io/badge/license-Apache_2.0-green?style=flat-square\" alt=\"License\" /\u003e\u003c/a\u003e\n  \u003cimg src=\"https://img.shields.io/badge/TypeScript-strict-3178C6?style=flat-square\u0026logo=typescript\u0026logoColor=white\" alt=\"TypeScript\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/GCP-native-4285F4?style=flat-square\u0026logo=googlecloud\u0026logoColor=white\" alt=\"GCP\" /\u003e\n  \u003cimg src=\"https://img.shields.io/badge/status-v1.0_complete-green?style=flat-square\" alt=\"Status\" /\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://mulder.mulkatz.dev\"\u003eLive Demo\u003c/a\u003e \u0026middot;\n  \u003ca href=\"./docs/functional-spec.md\"\u003eFunctional Spec\u003c/a\u003e \u0026middot;\n  \u003ca href=\"./docs/roadmap.md\"\u003eRoadmap\u003c/a\u003e \u0026middot;\n  \u003ca href=\"./mulder.config.example.yaml\"\u003eExample Config\u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n\u003c!-- PROGRESS:START — auto-updated by auto-pilot --\u003e\n\u003ctable align=\"center\"\u003e\n\u003ctr\u003e\u003ctd\u003e\n\n**Development Progress** \u0026ensp; `59 / 125 steps`\n\n```\nM1  Foundation       ██████████████████████████████ 11/11 ✓\nM2  Ingest+Extract   ██████████████████████████████  9/9  ✓\nM3  Segment+Enrich   ██████████████████████████████ 10/10 ✓\nQA Gate: Pre-Search  ██████████████████████████████  6/6  ✓\nM4  Search (v1.0)    ██████████████████████████████ 11/11 ✓\nQA Gate: Post-MVP    ██████████████████████████████  7/7  ✓\nM5  Curation         ██████████████████████████████  5/5  ✓\nM6  Intelligence     ██████████████████████████████  7/7  ✓\nM7  API+Workers      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/9\nM8  Operations       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/6\nM9  Multi-Format     ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/13\nM10 Provenance       ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/9\nM11 Trust Layer      ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/5\nM12 Discovery        ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/4\nM13 Observability    ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/5\nM14 Research Agent   ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░  0/8\n```\n\n\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\u003c!-- PROGRESS:END --\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"./public/01-dashboard-light.png\" width=\"720\" alt=\"Mulder Dashboard\" /\u003e\n\u003c/p\u003e\n\n## What it does\n\nMulder transforms unstructured document collections — PDFs with complex layouts like magazines, newspapers, government correspondence — into structured, searchable knowledge.\n\nYou define your domain ontology in a single `mulder.config.yaml`. The pipeline adapts: extraction, entity resolution, retrieval, and analysis all derive from that one config file. No custom code per domain.\n\n```\nmulder.config.yaml  →  terraform apply  →  mulder pipeline run ./pdfs/  →  mulder query \"...\"\n```\n\n## Capabilities\n\n| # | Capability | What it does |\n|---|-----------|-------------|\n| 1 | **Layout Extraction** | Document AI + Gemini Vision fallback for magazines, newspapers, multi-column layouts |\n| 2 | **Domain Ontology** | One YAML defines entities, relationships, extraction rules. Gemini structured output with auto-generated JSON Schema. |\n| 3 | **Taxonomy** | Auto-bootstrapped after ~25 docs, incremental growth, human-in-the-loop curation, cross-lingual |\n| 4 | **Hybrid Retrieval** | Vector (pgvector) + BM25 (tsvector) + graph traversal (recursive CTEs), fused via RRF + LLM re-ranking |\n| 5 | **Web Grounding** | Gemini verifies entities against live web data — coordinates, bios, org descriptions |\n| 6 | **Spatio-Temporal** | PostGIS proximity queries, temporal clustering, pattern detection across time and space |\n| 7 | **Evidence Scoring** | Corroboration scores, two-phase contradiction detection, source reliability (PageRank), evidence chains |\n| 8 | **Cross-Lingual Resolution** | 3-tier entity resolution (attribute match, embedding similarity, LLM-assisted) across 100+ languages |\n| 9 | **Deduplication** | MinHash/SimHash near-duplicate detection, dedup-aware corroboration scoring |\n| 10 | **Schema Evolution** | Config-hash tracking per document per step, selective reprocessing after config changes |\n| 11 | **Visual Intelligence** _(v3.0 / Phase 2)_ | Image extraction, Gemini analysis, image embeddings, map/diagram data extraction |\n| 12 | **Pattern Discovery** _(v3.0 / Phase 2)_ | Cluster anomalies, temporal spikes, subgraph similarity, proactive insights |\n\n## Pipeline\n\n```\n          PDF\n           │\n     ┌─────▼─────┐\n     │   Ingest  │  Upload to Cloud Storage, pre-flight validation\n     └─────┬─────┘\n           │\n     ┌─────▼─────┐\n     │  Extract  │  Document AI + Gemini Vision fallback → layout JSON + page images → GCS\n     └─────┬─────┘\n           │\n     ┌─────▼─────┐\n     │  Segment  │  Gemini identifies stories from page images → Markdown + metadata → GCS\n     └─────┬─────┘\n           │\n     ┌─────▼─────┐\n     │   Enrich  │  Entity extraction, taxonomy normalization, cross-lingual resolution\n     └─────┬─────┘\n           │\n     ┌─────▼─────┐\n     │   Ground  │  Web enrichment via Gemini Search — coordinates, bios, verification\n     └─────┬─────┘\n           │\n     ┌─────▼─────┐\n     │   Embed   │  Semantic chunking + text-embedding-004 (768-dim) → pgvector + BM25\n     └─────┬─────┘\n           │\n     ┌─────▼─────┐\n     │   Graph   │  Deduplication, corroboration scoring, contradiction flagging\n     └─────┬─────┘\n           │\n     ┌─────▼─────┐\n     │  Analyze  │  Contradiction resolution, PageRank reliability, evidence chains\n     └─────┬─────┘\n           │\n       Knowledge\n         Graph\n```\n\nEvery step is idempotent, independently runnable, and CLI-accessible. Content artifacts live in GCS, search index in PostgreSQL.\n\n## Configuration\n\nAll domain logic lives in `mulder.config.yaml`. Define your domain, the pipeline adapts:\n\n```yaml\nproject:\n  name: investigative-journalism\n\nontology:\n  entity_types:\n    - name: person\n      description: Individual mentioned in documents\n      attributes:\n        - { name: role, type: string }\n        - { name: affiliation, type: string }\n    - name: event\n      description: A specific incident or occurrence\n      attributes:\n        - { name: date, type: date }\n        - { name: location, type: string }\n    - name: location\n      description: Geographic place\n      attributes:\n        - { name: coordinates, type: geo_point, optional: true }\n\n  relationships:\n    - { name: involved_in, from: person, to: event }\n    - { name: occurred_at, from: event, to: location }\n```\n\nEverything beyond `project` and `ontology` has sensible defaults. See [`mulder.config.example.yaml`](./mulder.config.example.yaml) for the full reference.\n\n## Architecture\n\n\u003ctable\u003e\n\u003ctr\u003e\u003ctd\u003e\u003cstrong\u003eSingle PostgreSQL\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003epgvector + tsvector + PostGIS + recursive CTEs + job queue — one instance, no graph DB, no Redis, no Pub/Sub\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003cstrong\u003eContent in GCS\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003ePDFs, layout JSON, page images, story Markdown in Cloud Storage. PostgreSQL holds references + search index only.\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003cstrong\u003eService Abstraction\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003eAll GCP services behind interfaces. Dev mode uses fixtures — zero API calls, zero cost.\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003cstrong\u003eCLI-first\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003eEvery capability is a CLI command. The API is a job producer, not a direct executor.\u003c/td\u003e\u003c/tr\u003e\n\u003ctr\u003e\u003ctd\u003e\u003cstrong\u003ePostgreSQL is truth\u003c/strong\u003e\u003c/td\u003e\u003ctd\u003ePipeline state, job queue, config tracking. Firestore is observability-only (UI monitoring).\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n**Baseline cost:** ~30-40 EUR/mo for a small Cloud SQL instance. Scales with Gemini API usage.\n\n## Tech Stack\n\n| | |\n|---|---|\n| **Language** | TypeScript (ESM, strict mode) |\n| **Monorepo** | pnpm + Turborepo |\n| **Infrastructure** | Terraform (modular) |\n| **OCR** | Document AI Layout Parser |\n| **LLM** | Gemini 2.5 Flash (Vertex AI) |\n| **Embeddings** | text-embedding-004 (768-dim Matryoshka) |\n| **Database** | Cloud SQL PostgreSQL |\n| **Search** | pgvector (HNSW) + tsvector (BM25) + recursive CTEs |\n| **Geospatial** | PostGIS |\n| **CLI** | Commander.js |\n| **Testing** | Vitest |\n\n## Status\n\nMulder's **v1.0 MVP (M4)** is complete — the full pipeline from ingest through hybrid retrieval is operational. PDFs go in, a knowledge graph comes out, and natural-language queries return ranked passages with LLM re-ranking. The [functional spec](./docs/functional-spec.md), [implementation roadmap](./docs/roadmap.md), and [config schema](./mulder.config.example.yaml) are finalized.\n\nSee the [roadmap](./docs/roadmap.md) for all 14 milestones from foundation to autonomous research agent.\n\n## Contributing\n\nContributions, feedback, and ideas are welcome. Open an issue or start a discussion.\n\n## License\n\n[Apache 2.0](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmulkatz%2Fmulder","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmulkatz%2Fmulder","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmulkatz%2Fmulder/lists"}