{"id":50360430,"url":"https://github.com/senolisci/mykg","last_synced_at":"2026-06-03T05:00:37.562Z","repository":{"id":360801960,"uuid":"1251771970","full_name":"SenolIsci/mykg","owner":"SenolIsci","description":"Knowledge graph extractor: Markdown (or any format) → knowledge graph with RDFS/OWL ontology","archived":false,"fork":false,"pushed_at":"2026-06-01T23:59:41.000Z","size":2589,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-02T04:23:35.996Z","etag":null,"topics":["ai-agent","ai-workflow","ai-workflow-automation","claude-code","claude-skills","knowledge-graph","neo4j-graph","network-x","obsidian","ontology","ontology-alignment","ontology-engineering","ontology-matching","open-source","owl","protege","rdf","rdfs","second-brain","skos"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SenolIsci.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-27T22:40:59.000Z","updated_at":"2026-06-01T23:59:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"1e716248-9c21-4783-b4f9-d5749c8031fe","html_url":"https://github.com/SenolIsci/mykg","commit_stats":null,"previous_names":["senolisci/mykg"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/SenolIsci/mykg","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SenolIsci%2Fmykg","tags_url"
:"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SenolIsci%2Fmykg/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SenolIsci%2Fmykg/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SenolIsci%2Fmykg/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SenolIsci","download_url":"https://codeload.github.com/SenolIsci/mykg/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SenolIsci%2Fmykg/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33848862,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-03T02:00:06.370Z","response_time":59,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-agent","ai-workflow","ai-workflow-automation","claude-code","claude-skills","knowledge-graph","neo4j-graph","network-x","obsidian","ontology","ontology-alignment","ontology-engineering","ontology-matching","open-source","owl","protege","rdf","rdfs","second-brain","skos"],"created_at":"2026-05-30T01:05:42.431Z","updated_at":"2026-06-03T05:00:37.508Z","avatar_url":"https://github.com/SenolIsci.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# myKG — Knowledge Graph Extractor\n\n[![Python 
3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)](https://www.python.org/downloads/)\n[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)\n[![Tests](https://img.shields.io/badge/tests-687%20passing-brightgreen.svg)](tests/)\n[![Coverage](https://img.shields.io/badge/coverage-87%25-brightgreen.svg)](htmlcov/index.html)\n[![Providers](https://img.shields.io/badge/LLM-Anthropic%20%7C%20OpenAI%20%7C%20Ollama%20%7C%20OpenRouter-orange.svg)](#configuration)\n[![PyPI version](https://img.shields.io/pypi/v/mykg.svg)](https://pypi.org/project/mykg/)\n[![PyPI Downloads](https://img.shields.io/pypi/dm/mykg.svg)](https://pypi.org/project/mykg/)\n[![GitHub Stars](https://img.shields.io/github/stars/SenolIsci/mykg?style=flat-square\u0026logo=github)](https://github.com/SenolIsci/mykg/stargazers)\n[![GitHub Issues](https://img.shields.io/github/issues/SenolIsci/mykg.svg)](https://github.com/SenolIsci/mykg/issues)\n[![Visitors](https://visitor-badge.laobi.icu/badge?page_id=SenolIsci.mykg)](https://github.com/SenolIsci/mykg)\n[![LinkedIn](https://img.shields.io/badge/LinkedIn-senolisci-0077B5?logo=linkedin)](https://www.linkedin.com/in/senolisci/)\n\n**myKG** automatically generates a confidence-scored knowledge graph from a directory of Markdown files and convertible source documents, grounded in an induced RDFS/OWL ontology schema.\n\nIt uses a **two-pass LLM pipeline**: Pass 1 induces a global RDFS/OWL schema from your document corpus; Pass 2 extracts typed entity and relationship instances per file against that schema. 
The result is exported to multiple formats: JSONL for property-graph consumers such as Neo4j, Turtle RDF for OWL toolchains, seven NetworkX formats for graph analysis and visualization, and an **Obsidian vault** — a second brain of wikilinked Markdown notes your AI coding assistant (Claude Code, Cursor, Copilot) can read and reason over directly.\n\n## Command line\n\n```\nmykg extract-graph my_notes/\n```\n\n## Output\n\n```\nsessions/2026-05-17T18-31-07/\n  output/\n    nodes.jsonl                    ← typed entities with confidence scores\n    edges.jsonl                    ← typed relationships with provenance\n    knowledge_graph.ttl            ← RDFS/OWL TBox + RDF ABox (Protégé, SPARQL)\n    networkx_output/               ← GML, GraphML, GEXF, Pajek, JSON node-link,\n                                      knowledge_graph.html (interactive vis)\n    obsidian_vault/                ← Obsidian-ready linked Markdown notes\n      index.md                     ←   overview table with links to every entity\n      Person/                      ←   one .md note per entity, grouped by type\n      Organization/\n      ...\n  walkthrough.md                   ← per-run report: schema, stats, timing\n```\n\n---\n\n## Contents\n\n- [Features](#features)\n- [Quick Start](#quick-start)\n- [Using with Claude Code](#using-with-claude-code)\n- [Configuration](#configuration)\n- [Extract Pipeline](#extract-pipeline)\n  - [Running](#running)\n  - [Sessions](#sessions)\n  - [Pipeline Steps](#pipeline-steps)\n  - [Outputs](#outputs)\n  - [Re-running from a Specific Step](#re-running-from-a-specific-step)\n  - [Orphan-Connection Pass](#orphan-connection-pass)\n- [Advanced Options](#advanced-options)\n  - [Human Review Gate](#human-review-gate---review)\n  - [Locked Base Schema](#locked-base-schema---base-schema)\n  - [SKOS Thesaurus](#skos-thesaurus---thesaurus)\n  - [Append Mode](#append-mode)\n  - [Merging Sessions](#merging-sessions)\n  - [Walkthrough Report](#walkthrough-report)\n  
- [Obsidian Vault Export](#obsidian-vault-export)\n- [Development](#development)\n- [Roadmap](#roadmap)\n- [Design](#design)\n\n## Features\n\n### Ontology-Guided Extraction\n\n- **Schema-guided knowledge graph generation** — the extracted graph is always grounded in a formal RDFS/OWL schema: concept types, property names, domain/range constraints, and the is-a hierarchy are explicit and inspectable before any entity is extracted\n- **Bring your own ontology** — supply a `--base-schema` TTL file to lock in classes and properties from an existing formal ontology; the LLM expands it with domain-specific concepts but cannot rename, remove, or contradict your authoritative vocabulary\n- **SKOS thesaurus support** — pass `--thesaurus` to load a SKOS vocabulary; `skos:exactMatch` terms are collapsed silently, `skos:closeMatch` terms trigger a warning — giving the schema merger richer synonym awareness than string matching alone\n- **Verifiable TTL ontology** — after Pass 1, the induced schema is exported as a valid RDFS/OWL Turtle file (`intermediate/schema.ttl`) that can be opened directly in ontology editors such as [Protégé](https://protege.stanford.edu/). 
The TTL is validated by rdflib (syntax + semantic checks: domain/range refer to declared classes, no conflicting ranges) before any extraction begins\n- **Human-in-the-loop ontology design** — run with `--review` to pause after schema induction; inspect and edit `schema.json` (or load `schema.ttl` in Protégé, modify, and save back) before a single entity is extracted; resume with `mykg approve-schema`\n- **Incremental updates** — run with `--append` on an existing session to add new or modified Markdown files without re-running Pass 1; the schema is reused and only the new files go through Pass 2\n- **AI coding assistant friendly** — designed for smooth use alongside AI coding assistants such as [Claude Code](https://claude.ai/code); run extractions, inspect outputs, and iterate on your knowledge graph without leaving your coding environment; see [Using with Claude Code](#using-with-claude-code)\n- **Second brain for AI coding assistants** — the Obsidian vault output turns your extracted knowledge graph into a directory of wikilinked Markdown notes that any AI coding assistant can read as project context; point Claude Code, Cursor, or Copilot at `output/obsidian_vault/` and ask questions, trace relationships, and get answers grounded in your own documents\n\n### Input\n\n- **Markdown files** — any directory of `.md` files; subdirectory structure is preserved; YAML/TOML frontmatter, headings, lists, and code blocks are all treated as structural signals\n- **Other formats (built-in)** — when `pipeline.preprocess.enabled` is on, myKG converts non-Markdown sources into Markdown before extraction. Supported preprocess formats include PDF, DOCX, PPTX, XLSX, PNG, JPG, JPEG, and HTML. HTML files are converted in-process via `markdownify`; binary documents and images use MinerU when installed via the optional `mykg[mineru]` extras. 
The base package works on Python 3.10 and newer; the `mineru` extra is only supported on Python 3.10–3.13.\n\n### Graph \u0026 Output\n\n- **Provider-agnostic** — works with Anthropic (Claude), OpenAI (GPT-4o), Ollama (local), OpenRouter, or the `claude` CLI with no API key\n- **Four output families** — JSONL for Neo4j/NetworkX/RAG, Turtle RDF for OWL toolchains, NetworkX multi-format for graph analysis, and Obsidian vault for linked personal knowledge management\n- **Obsidian vault — second brain for AI coding assistants** — every extracted entity becomes a wikilinked Markdown note in `output/obsidian_vault/`; open it in [Obsidian](https://obsidian.md) to navigate the graph with backlinks and Graph View, or point your AI coding assistant (Claude Code, Cursor, Copilot) at the vault folder so it can answer questions, trace relationships, and reason over your knowledge base in natural language\n- **Interactive HTML graph** — node/edge filtering, search, hover popups; opens directly in a browser\n- **Confidence scoring** — every extracted attribute, node, and edge carries a `0.0–1.0` confidence score\n- **Name normalization** — surface-form variants (\"Acme Corp\", \"ACME\", \"Acme Corporation\") resolved to a single canonical node with aliases\n- **Orphan-connection pass** — reconnects isolated nodes via co-occurrence heuristic + LLM confirmation\n- **Cross-session merge** — combine two independently-produced graphs into one unified knowledge graph\n- **Resumable pipeline** — every stage persists intermediate state; re-enter at any step after a crash or edit\n- **Session isolation** — each run is fully self-contained; inputs, intermediate state, outputs, and logs co-located\n- **Query knowledge graph** — natural-language and structured queries directly against the extracted graph via AI coding assistants such as [Claude Code](https://claude.ai/code), SPARQL endpoints, or graph traversal APIs\n\n## Quick Start\n\nRequires Python 3.11+ and one of: an 
Anthropic/OpenAI/OpenRouter API key, Ollama running locally, or the `claude` CLI.\n\n### Install from PyPI\n\nInstall mykg, then run the interactive setup wizard — it asks for your provider, model, and API key and writes `mykg_config.yaml` and `.env.mykg` in one step.\n\n```bash\npip install mykg\nmykg init\nmykg extract-graph my_notes/\n```\n\nTo also ingest PDF / DOCX / PPTX / XLSX / image files, install the optional `mineru` extras (pulls in PyTorch + OpenCV — skip if your corpus is Markdown/HTML only). The core `mykg` package supports Python 3.10 and newer, but `mykg[mineru]` is only supported on Python 3.10 through 3.13.\n\n```bash\npip install 'mykg[mineru]'\n```\n\nOpen `sessions/\u003ctimestamp\u003e/output/knowledge_graph.html` in your browser to explore the result.\n\n### Install from source\n\nInstall [uv](https://docs.astral.sh/uv/getting-started/installation/), clone the repo, sync dependencies, run the setup wizard, then extract.\n\n```bash\ngit clone https://github.com/SenolIsci/mykg \u0026\u0026 cd mykg\nuv sync \u0026\u0026 mykg init\nuv run mykg extract-graph my_notes/\n```\n\nFor non-Markdown inputs (PDF, DOCX, etc.) 
from a source install, sync with the `mineru` extra:\n\n```bash\nuv sync --extra mineru\n```\n\nFor Ollama (local inference, no API key needed), pull a model and select the `ollama-local` profile when `mykg init` prompts you.\n\n```bash\nollama pull llama3.3\nmykg init\nmykg extract-graph my_notes/\n```\n\n## Using with Claude Code\n\nmyKG ships with a `claude-cli` profile that runs extractions through the locally-installed `claude` CLI — no API key or billing setup needed beyond your existing Claude Pro/Max plan.\n\n### Setup\n\nInstall the `claude` CLI, then install mykg and run the setup wizard — select **[5] Claude CLI** when prompted (no API key needed).\n\n```bash\nnpm install -g @anthropic-ai/claude-code\npip install mykg \u0026\u0026 mykg init\nmykg extract-graph my_notes/\n```\n\n### How it works\n\nThe `claude-cli` provider calls `claude -p` as a subprocess for every LLM step (Pass 1 schema induction, Pass 2 extraction, orphan connection, name normalization). All pipeline features — session isolation, resumability, orphan recovery, cross-session merge — work identically to API-based providers.\n\n**Key constraints of the `claude-cli` profile:**\n- `max_workers` must be `1` — the `claude` CLI is serial by design; parallel workers will queue\n- No API key required — billing goes through your Claude Pro/Max subscription\n- The `effort` and `model` fields in `mykg_config.yaml` map directly to `--effort` and `--model` flags passed to `claude -p`\n\n### Using myKG from inside Claude Code\n\nYou can run myKG extractions as a tool call from within a Claude Code session. 
This is useful for building knowledge graphs from notes or documentation while you work:\n\n```bash\n# From any Claude Code session terminal:\nmykg extract-graph ./docs/ --session my-docs-kg\n\n# Then reference the output in your session:\n# sessions/my-docs-kg/output/nodes.jsonl\n# sessions/my-docs-kg/output/knowledge_graph.ttl\n```\n\nClaude Code can then read `nodes.jsonl` or `edges.jsonl` directly to answer questions about the extracted graph, or load `knowledge_graph.ttl` into a SPARQL tool for structured queries.\n\n### Recommended `mykg_config.yaml` settings for Claude Code\n\n```yaml\nprofile: claude-cli\n\nprofiles:\n  claude-cli:\n    llm:\n      model: sonnet       # or opus for higher quality\n      effort: medium      # low | medium | high\n    pipeline:\n      pass1:\n        max_workers: 1    # required — claude CLI is serial\n      pass2:\n        max_workers: 1\n```\n\n---\n\n## Configuration\n\nAll configuration lives in a single `mykg_config.yaml` file discovered automatically from the working directory (or any parent). There are no hardcoded defaults in the code — the YAML is the sole source of truth.\n\n```bash\nmykg init           # interactive: choose provider, model, paste API key\n                    # writes mykg_config.yaml and .env.mykg in one step\nmykg init --force   # overwrite an existing config\nmykg init --profile openrouter-free --model meta-llama/llama-4-maverick --api-key sk-or-...  # non-interactive\n```\n\nThe wizard walks you through three prompts:\n\n1. **Profile** — choose your LLM provider (OpenRouter, Anthropic, OpenAI, Ollama, Claude CLI)\n2. **Model** — accept the default or type any model slug for that provider\n3. **API key** — paste your key (skipped for Ollama and Claude CLI)\n\n\n### API Keys\n\nmyKG reads API keys from environment variables. 
Set them by exporting directly or by creating a `.env.mykg` file in your project directory (loaded automatically on startup).\n\n**Option A — export in your shell:**\n\n```bash\nexport ANTHROPIC_API_KEY=sk-ant-...\n```\n\n**Option B — create a `.env.mykg` file:**\n\n```bash\n# .env.mykg\nANTHROPIC_API_KEY=sk-ant-...\n```\n\n| Variable | Profile | Notes |\n|---|---|---|\n| `ANTHROPIC_API_KEY` | `anthropic-claude` | Claude API key |\n| `OPENAI_API_KEY` | `openai` | OpenAI API key |\n| `OPENROUTER_API_KEY` | `openrouter-free` | OpenRouter API key |\n| *(none required)* | `claude-cli` | Billing via Claude Pro/Max subscription |\n| *(none required)* | `ollama-local` | Local inference, no account needed |\n\nFor source installs you can also copy [`sample.env.mykg`](sample.env.mykg) to `.env.mykg` as a starting template.\n\n### LLM Providers\n\n| Provider | Profile name | API key env var | Notes |\n|---|---|---|---|\n| Anthropic (Claude) | custom (see Quick Start) | `ANTHROPIC_API_KEY` | Recommended for quality |\n| OpenAI (GPT-4o) | `openai` | `OPENAI_API_KEY` | |\n| Ollama | `ollama-local` | — | Local inference, no key needed |\n| OpenRouter | `openrouter-free` | `OPENROUTER_API_KEY` | Access many models via one key |\n| Claude CLI | `claude-cli` | — | Uses `claude -p` subprocess; billing via Claude Pro/Max; serial only |\n\nSwitch provider by setting `profile:` at the top of [`mykg_config.yaml`](mykg_config.yaml).\n\n### Key Pipeline Parameters\n\n| Key | Default | Description |\n|---|---|---|\n| `pipeline.chunking.window_tokens` | `2000` | Chunk size in tokens |\n| `pipeline.chunking.overlap_tokens` | `200` | Overlap between adjacent chunks |\n| `pipeline.pass1.batch_token_target` | `8000` | Max tokens per Pass 1 LLM batch |\n| `pipeline.pass1.max_workers` | `4` | Parallel LLM workers for Pass 1 |\n| `pipeline.pass2.max_workers` | `1` | Parallel workers for Pass 2 |\n| `pipeline.pass2.stateful_chunks` | `false` | Pass prior-chunk node IDs to subsequent chunks for 
stable IDs |\n| `pipeline.pass2.prep_mode` | `per_file` | `per_file` \\| `concat` \\| `batch_chunks` |\n| `pipeline.normalize_names.enabled` | `true` | Run LLM name normalization step |\n| `pipeline.orphan_pass.enabled` | `true` | Run the orphan-connection pass |\n| `pipeline.orphan_pass.schema_max_restarts` | `1` | Max automated Pass 2 restarts from schema-gap recovery |\n| `pipeline.export.networkx_enabled` | `true` | Write NetworkX formats to `output/networkx_output/` |\n| `pipeline.export.obsidian_enabled` | `true` | Write Obsidian vault to `output/obsidian_vault/` |\n| `pipeline.export.obsidian_vault_dir` | `obsidian_vault` | Subdirectory name for the Obsidian vault inside `output/` |\n| `pipeline.error_gate.enabled` | `true` | Pause all workers on repeated API errors |\n\nRun `context-calculator --context \u003cN\u003e --max-output \u003cM\u003e` to compute correct `window_tokens` and `batch_token_target` for a different model's context window.\n\n## Extract Pipeline\n\nReads a directory of `.md` files and produces a typed knowledge graph in four output families (JSONL, Turtle RDF, NetworkX, and an Obsidian vault). The pipeline runs 12 sequential steps; all intermediate state is persisted so any step can be re-entered without repeating upstream work.\n\n### Running\n\n```bash\nmykg extract-graph \u003cinput_dir\u003e [OPTIONS]\n# source installs: uv run mykg extract-graph \u003cinput_dir\u003e [OPTIONS]\n```\n\n`\u003cinput_dir\u003e` is any directory of `.md` files. 
Subdirectories are included recursively.\n\n### Options\n\n| Option | Description |\n|---|---|\n| `--session NAME` | Resume an existing session by folder name |\n| `--from-step NAME` | Delete a step's outputs and re-run from that point |\n| `--review` | Pause after Pass 1 for manual schema review |\n| `--append` | Skip Pass 1; re-run only on new/modified files |\n| `--workers N` | Parallel workers for Pass 2 |\n| `--confidence-agg mean\\|max` | Confidence aggregation when deduplicating |\n| `--base-schema PATH` | Locked TBox TTL file (locked classes/properties cannot be changed by the LLM) |\n| `--thesaurus PATH` | SKOS TTL thesaurus for synonym resolution in schema merge |\n| `--obsidian-vault` | Force Obsidian vault export for this run (overrides config) |\n| `--log-file PATH` | Write logs here (relative paths placed inside the session folder) |\n| `--verbose / -v` | Enable DEBUG-level logging |\n\n### Examples\n\n```bash\n# New run — auto-creates a timestamped session\nmykg extract-graph my_notes/\n\n# Resume a session with 4 parallel Pass 2 workers\nmykg extract-graph my_notes/ --session 2026-05-17T18-31-07 --workers 4\n\n# Pause for schema review after Pass 1\nmykg extract-graph my_notes/ --review\n# → edit sessions/\u003cname\u003e/intermediate/schema.json\nmykg approve-schema --session 2026-05-17T18-31-07\nmykg extract-graph my_notes/ --session 2026-05-17T18-31-07 --review\n\n# Re-run from assembly onward (reuses existing extractions)\nmykg extract-graph my_notes/ --session 2026-05-17T18-31-07 --from-step assemble\n\n# Lock a base ontology so the LLM won't rename its classes\nmykg extract-graph my_notes/ --base-schema ontology/core.ttl\n```\n\n### Sessions\n\nEvery run automatically creates an isolated session folder:\n\n```\nsessions/\n  2026-05-17T18-31-07/\n    input/           ← archived copy of all input Markdown files\n    intermediate/    ← all intermediate pipeline state\n    output/          ← final outputs (JSONL, TTL, HTML, NetworkX)\n    run.log  
        ← log file\n    walkthrough.md   ← post-run report\n```\n\nSessions are the primary unit of resumability. Pass `--session \u003cname\u003e` to resume from the last completed step. Pass `--from-step \u003cstep\u003e` to force-restart from a specific point.\n\nThe sessions root is configurable via `pipeline.paths.sessions_dir` (default: `sessions/` in the current directory).\n\n### Pipeline Steps\n\nThe pipeline runs 12 steps in sequence. All intermediate state is written to disk so any step can be re-entered without repeating upstream work.\n\n| # | Step | LLM | Key outputs |\n|---|---|---|---|\n| 1 | `preprocess` | — | `preprocess.done`, `preprocess_manifest.json` *(only converts non-Markdown sources; no-op for pure-MD corpora)* |\n| 2 | `ingest` | — | `file_manifest.json` |\n| 3 | `pass1` | ✓ (3 calls) | `schema.json`, `schema.ttl`, `schema_history/` |\n| 4 | `schema_validate` | — | `schema_validate.done` |\n| 5 | `human_review` | — | `schema_approved.flag` *(only with `--review`)* |\n| 6 | `schema_flatten` | — | `flattened_schema.json` |\n| 7 | `pass2` | ✓ | `raw_extractions.json`, `chunk_node_index.json` |\n| 8 | `normalize_names` | ✓ | `name_normalization.json` |\n| 9 | `assemble` | — | `edge_metadata.json`, `nodes.json`, `merge_log.json` |\n| 10 | `orphan_score` | — | `orphan_candidates.json` |\n| 11 | `orphan_connect` | ✓ | `orphan_connections.json`, `orphan_log.json` |\n| 12 | `validate_graph` | — | `nodes.jsonl`, `edges.jsonl`, `knowledge_graph.ttl`, `knowledge_graph.html`, `networkx_output/`, `obsidian_vault/` |\n\nPass 1 internally runs four sequential stages: parallel batch induction → algorithmic merge → harmonization LLM call → quality review LLM call.\n\n### Other input formats (PDF, DOCX, PPTX, XLSX, images, HTML)\n\nmyKG can convert non-Markdown sources to Markdown automatically before extraction. Two converters share one routing layer (D39–D48):\n\n- **HTML / HTM** — handled in-process via `markdownify`. 
Works out of the box, no extra install.\n- **PDF / DOCX / DOC / PPTX / XLSX / PNG / JPG / JPEG** — handled by [MinerU](https://github.com/opendatalab/mineru) invoked as a CLI subprocess. Install the optional extras to enable:\n\n```bash\npip install 'mykg[mineru]'\n```\n\nTwo entry points use the same converter:\n\n1. **In-pipeline (`preprocess` step)** — drop mixed `.md` / `.pdf` / `.docx` files into your input directory and run `mykg extract-graph` normally; converted markdown files land next to the source files inside `sessions/\u003crun\u003e/input/` with a `\u003cstem\u003e.mineru.json` provenance sidecar, and `ingest` picks them up alongside hand-written notes.\n\n   ```bash\n   mykg extract-graph ./my_mixed_corpus/\n   ```\n\n2. **Standalone (`mykg convert`)** — convert a directory once, inspect or edit the output, then feed it to `extract-graph` separately:\n\n   ```bash\n   mykg convert -i ./pdfs/ -o ./converted/\n   mykg extract-graph ./converted/\n   ```\n\nAvailable options for `mykg convert`:\n\n```bash\nmykg convert -i \u003cinput-dir\u003e -o \u003coutput-dir\u003e \\\n  [--workers N] \\\n  [--backend pipeline | hybrid-auto-engine | vlm-*] \\\n  [--language en] \\\n  [--include pdf,docx,html] \\\n  [--fail-fast]\n```\n\nAll defaults flow from the `preprocess:` block in `mykg_config.yaml`:\n\n```yaml\npreprocess:\n  enabled: true\n  max_workers: 4\n  mineru_path: mineru        # absolute path or PATH binary\n  backend: pipeline\n  language: en\n  timeout_seconds: 900\n  extensions: [pdf, docx, doc, pptx, xlsx, png, jpg, jpeg]\n  html_extensions: [html, htm]\n  fail_fast: false\n```\n\nIf MinerU is not installed but non-HTML files are present, the step logs an actionable error (`MinerU not found … install with: pip install mykg[mineru]`); pure-HTML corpora work without MinerU. 
Set `preprocess.enabled: false` to disable the step entirely.\n\n## Outputs\n\n### Property Graph (JSONL)\n\n**`nodes.jsonl`** — one JSON line per entity:\n\n```json\n{\n  \"id\": \"person-alice\",\n  \"type\": \"Person\",\n  \"confidence\": 0.94,\n  \"source_files\": [\"team.md\"],\n  \"attributes\": {\n    \"name\":  {\"value\": \"Alice\",          \"confidence\": 1.0},\n    \"email\": {\"value\": \"alice@acme.com\", \"confidence\": 0.88}\n  },\n  \"aliases\": [\"Alice Smith\", \"A. Smith\"]\n}\n```\n\n**`edges.jsonl`** — one JSON line per relationship:\n\n```json\n{\n  \"id\": \"works_at-abc123\",\n  \"type\": \"works_at\",\n  \"from\": \"person-alice\",\n  \"to\": \"org-acme-corp\",\n  \"confidence\": 0.96,\n  \"method\": \"llm_extraction\",\n  \"attributes\": {\n    \"role\":       {\"value\": \"Engineer\", \"confidence\": 0.91},\n    \"start_date\": {\"value\": null,       \"confidence\": 0.0}\n  }\n}\n```\n\nMissing attributes are never dropped — they are represented as `{\"value\": null, \"confidence\": 0.0}`.\n\nThe `method` field distinguishes edges extracted by Pass 2 (`llm_extraction`) from edges inferred by the orphan pass (`orphan_inferred`).\n\n### RDF / OWL (Turtle)\n\n**`knowledge_graph.ttl`** — pure RDFS/OWL triples, no edge metadata:\n\n```turtle\n@prefix rdfs: \u003chttp://www.w3.org/2000/01/rdf-schema#\u003e .\n@prefix ex: \u003chttp://mykg.local/schema/\u003e .\n@prefix :   \u003chttp://mykg.local/data/\u003e .\n\nex:Person  a rdfs:Class .\nex:works_at  rdfs:domain ex:Person ;  rdfs:range ex:Organization .\n\n:person-alice  a ex:Person ;  rdfs:label \"Alice\" .\n:person-alice  ex:works_at  :org-acme-corp .\n```\n\nLoad in Protégé, query with SPARQL (Fuseki, GraphDB), or reason with HermiT/Pellet.\n\n### Interactive HTML\n\n**`knowledge_graph.html`** — self-contained D3.js force-directed graph. Open in any browser, no server required. 
Supports:\n- Filter nodes and edges by type\n- Filter by confidence threshold\n- Search by name\n- Hover popups with full attribute values\n- Resizable sidebar\n\n### NetworkX Formats (`networkx_output/`)\n\n| File | Format | Best for |\n|---|---|---|\n| `knowledge_graph.graphml` | GraphML | yEd, Gephi, Cytoscape |\n| `knowledge_graph.gexf` | GEXF | Gephi native (rich metadata) |\n| `knowledge_graph.json` | JSON node-link | D3.js, Sigma.js, web apps |\n| `knowledge_graph.gml` | GML | Human-readable inspection |\n| `knowledge_graph.net` | Pajek | Network analysis |\n| `edges_nx.txt` | Edge list | Text pipelines |\n| `adjacency.txt` | Adjacency list | Topology consumers |\n\nNode/edge attributes are exported as `attr_\u003cname\u003e_value` / `attr_\u003cname\u003e_confidence` scalar pairs for GML compatibility.\n\n### Obsidian Vault (`obsidian_vault/`)\n\nOne `.md` note per extracted entity, grouped into subdirectories by concept type. Each note has YAML frontmatter (id, type, confidence, sources), an attributes section, outgoing and incoming wikilink relationship sections, and a source files list. 
An `index.md` at the vault root summarizes node counts per type with links to every entity.\n\nOpen `output/obsidian_vault/` as a vault in [Obsidian](https://obsidian.md) to get Graph View, backlink navigation, and full-text search across the extracted entities.\n\n### Re-running from a Specific Step\n\nUse `--from-step` to delete a step's outputs and all downstream outputs, then re-run from that point.\n\n```bash\nSESSION=2026-05-17T18-31-07\n\n# Re-run from Pass 2 (reuse the existing schema)\nmykg extract-graph my_notes/ --session $SESSION --from-step pass2\n\n# Re-run only assembly + export (reuse raw extractions)\nmykg extract-graph my_notes/ --session $SESSION --from-step assemble\n\n# Re-run both orphan stages\nmykg extract-graph my_notes/ --session $SESSION --from-step orphan_score\n\n# Orphan LLM pass only — full clean sweep\nmykg extract-graph my_notes/ --session $SESSION --from-step orphan_connect_fullsweep\n\n# Orphan LLM pass only — additive (preserves prior confirmed edges)\nmykg extract-graph my_notes/ --session $SESSION --from-step orphan_connect_incremental\n```\n\n**Four re-entry patterns:**\n\n| Pattern | When to use | Command |\n|---|---|---|\n| **A — Schema changed** | Wrong concept types, missing properties | Edit `schema.json` → `approve-schema` → `--from-step pass2` |\n| **B — Extraction errors** | LLM missed entities or invented edge types | Edit shard in `raw_extractions_shards/` → `--from-step pass2` |\n| **C — Assembly errors** | Bad dedup decisions in `merge_log.json` | Edit `raw_extractions.json` → `--from-step assemble` |\n| **D — Orphan pass** | Wrong candidates or confirmations | `--from-step orphan_score` or `orphan_connect_fullsweep` |\n\n### Orphan-Connection Pass\n\nAfter assembly, nodes with zero edges are \"orphans\" — present in the graph but unreachable by traversal. 
The orphan pass reconnects them in two stages:\n\n**Stage 1 — `orphan_score` (no LLM):** Uses `chunk_node_index.json` to find nodes that co-occur in the same source chunk as each orphan. Candidates are scored by co-occurrence frequency and filtered by schema type compatibility. Written to `orphan_candidates.json`.\n\n**Stage 2 — `orphan_connect` (LLM):** One LLM call per source chunk. The prompt includes the full chunk text, all orphan IDs from that chunk, co-occurring connected nodes, and all schema properties. Confirmed edges carry `\"method\": \"orphan_inferred\"` and are merged directly into `edge_metadata.json`.\n\nUnconnectable orphans (no resolvable source chunk) are logged as `orphan_unconnectable` advisory events in `orphan_log.json`.\n\nConfigure via `pipeline.orphan_pass.*` in `mykg_config.yaml`. Disable entirely with `pipeline.orphan_pass.enabled: false`.\n\n## Advanced Options\n\n### Human Review Gate (`--review`)\n\nPause after Pass 1 to inspect and edit the induced schema before Pass 2 runs:\n\n```bash\nmykg extract-graph my_notes/ --review\n# → pipeline halts; edit sessions/\u003cname\u003e/intermediate/schema.json\nmykg approve-schema --session \u003cname\u003e\nmykg extract-graph my_notes/ --session \u003cname\u003e --review   # resumes from Pass 2\n```\n\n### Locked Base Schema (`--base-schema`)\n\nLock certain classes and properties so the LLM cannot rename, remove, or restructure them:\n\n```bash\nmykg extract-graph my_notes/ --base-schema ontology/base.ttl\n```\n\nLocked entries can still receive additional attributes proposed by the LLM. 
Near-duplicate LLM proposals are collapsed into the locked entry with a warning.\n\n### SKOS Thesaurus (`--thesaurus`)\n\nResolve near-duplicate concept names during schema merge using a SKOS vocabulary:\n\n```bash\nmykg extract-graph my_notes/ --thesaurus ontology/terms.skos.ttl\n```\n\n- `skos:exactMatch` → silent collapse\n- `skos:closeMatch` → collapse with warning in `merge_log.json`\n- `skos:broader` / `skos:narrower` → advisory hints only\n\n### Append Mode\n\nRe-run the pipeline on new or modified files without re-running Pass 1:\n\n```bash\nmykg extract-graph my_notes/ --session \u003cname\u003e --append\n```\n\n### Merging Sessions\n\nCombine two independently-produced sessions into a unified knowledge graph:\n\n```bash\nmykg merge-graphs \u003csession-A\u003e \u003csession-B\u003e [OPTIONS]\n\n# Example\nmykg merge-graphs 2026-05-01T10-00-00 2026-05-15T14-30-00\n\n# Resume a merge (last incomplete step auto-detected)\nmykg merge-graphs A B --output-session \u003cmerged-name\u003e\n```\n\n**Options:**\n\n| Option | Description |\n|---|---|\n| `--output-session TEXT` | Name for the merged session (default: auto-timestamped) |\n| `--no-review` | Skip the human review gate after schema merge |\n| `--thesaurus PATH` | SKOS thesaurus for schema synonym matching |\n| `--base-schema PATH` | Locked TBox TTL base schema |\n| `--from-step NAME` | Force re-run from a specific merge step |\n\n**What happens:**\n\n1. Both schemas are merged via the same three-stage chain as Pass 1 (algorithmic union → LLM harmonization → LLM quality review)\n2. All file-keyed structures are namespaced (`session_a/\u003cfilename\u003e`, `session_b/\u003cfilename\u003e`) before merging\n3. Nodes are deduplicated across sessions: same type + canonical name → single node, regardless of source session\n4. Re-extraction strategy (`none` / `surgical` / `full`) handles properties absent from one session's schema\n5. 
`source_map.json` records full file provenance; `merge_manifest.json` records schema deltas and strategy used\n6. `walkthrough.md` includes a Merge Provenance section with before/after counts and node/edge breakdowns\n\nConfigure the re-extraction strategy:\n\n```yaml\nmerge_graphs:\n  reextraction_strategy: surgical   # none | surgical | full\n```\n\n### Obsidian Vault Export\n\nEvery run writes a linked Markdown vault to `output/obsidian_vault/` by default. Open that folder in [Obsidian](https://obsidian.md) to explore the extracted knowledge graph with Graph View and backlinks.\n\n**Vault structure:**\n\n```\noutput/obsidian_vault/\n  index.md                  ← overview: node count per type, links to every entity\n  Person/\n    person-alice-smith.md   ← one note per entity\n    person-bob-jones.md\n  Organization/\n    organization-acme-corp.md\n  ...\n```\n\n**Each entity note contains:**\n\n```markdown\n---\nid: person-alice-smith\ntype: Person\nconfidence: 0.94\nsources:\n  - team.md\n---\n\n# Alice Smith\n\n## Attributes\n- **role**: Engineer (0.91)\n- **email**: alice@acme.com (1.0)\n\n## Relationships\n\n### Outgoing\n- [[Acme Corp]] — works_at (0.96)\n\n### Incoming\n- [[Bob Jones]] — manages (0.88)\n\n## Source Files\n- team.md\n```\n\nWikilinks (`[[...]]`) are Obsidian-native — clicking them in the app navigates to the linked entity note, and the Graph View shows the full relationship network automatically.\n\n**Config:**\n\n```yaml\npipeline:\n  export:\n    obsidian_enabled: true          # default — set false to skip vault export\n    obsidian_vault_dir: obsidian_vault   # subfolder name inside output/\n```\n\nOr use `--obsidian-vault` on the command line for a one-off run without editing config.\n\n### Walkthrough Report\n\nA human-readable summary is written to `sessions/\u003cname\u003e/walkthrough.md` after every run:\n\n```bash\n# Regenerate the walkthrough for an existing session\nmykg walkthrough --session 2026-05-17T18-31-07\n```\n\nDisable 
with `pipeline.report.enabled: false`.\n\n---\n\n## Development\n\n### Installation\n\n```bash\ngit clone https://github.com/SenolIsci/mykg \u0026\u0026 cd mykg\nuv sync\n```\n\n### Testing\n\n```bash\n# All non-live tests (fast, no API key needed)\nuv run pytest -m \"not live\" -v\n\n# All tests including live API integration tests\n# Requires OPENROUTER_API_KEY in environment or .env.mykg (see sample.env.mykg)\nuv run pytest -m live -v\n\n# Single file\nuv run pytest tests/test_assembler.py -v\n\n# With coverage (HTML report at htmlcov/index.html)\nuv run pytest -m \"not live\"\nopen htmlcov/index.html\n```\n\n### Linting and Formatting\n\n```bash\nuv run ruff check src/ tests/          # lint\nuv run ruff check --fix src/ tests/    # auto-fix\nuv run ruff format src/ tests/         # format\n```\n\n### Token Budget Calculator\n\nWhen switching to a model with a different context window:\n\n```bash\ncontext-calculator --context 128000 --max-output 16384\n```\n\nOutputs a ready-to-paste YAML snippet for the `pipeline:` block.\n\n### Profiling\n\n```bash\npython -m cProfile -o profile.out -m mykg.cli extract input_files/\nuv run snakeviz profile.out\n```\n\n---\n\n## Roadmap\n\n- **Query knowledge graph** — natural-language and structured queries directly against the extracted graph; planned support for SPARQL, graph traversal, and LLM-assisted Q\u0026A over nodes and edges\n\n---\n\n## Design\n\nFor a thorough description of the architecture, algorithm, data models, and design decisions, see [architecture.md](architecture.md).\n\n---\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsenolisci%2Fmykg","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsenolisci%2Fmykg","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsenolisci%2Fmykg/lists"}