{"id":50583675,"url":"https://github.com/rasinmuhammed/node-canon","last_synced_at":"2026-06-05T04:30:40.426Z","repository":{"id":355246248,"uuid":"1225838221","full_name":"rasinmuhammed/node-canon","owner":"rasinmuhammed","description":"Entity resolution for LLM knowledge graphs. Merges duplicate nodes created by chunked extraction; \"IBM\", \"I.B.M.\", \"International Business Machines\" → one canonical node. No API key required.","archived":false,"fork":false,"pushed_at":"2026-05-14T18:21:36.000Z","size":157,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-14T20:29:14.294Z","etag":null,"topics":["deduplication","entity-resolution","graphrag","knowledge-graph","lightrag","llamaindex","nlp","python","rag"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rasinmuhammed.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-30T17:37:56.000Z","updated_at":"2026-05-14T18:21:57.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/rasinmuhammed/node-canon","commit_stats":null,"previous_names":["rasinmuhammed/node-canon"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/rasinmuhammed/node-canon","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasinmuhammed%2Fnode-canon","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasinmuhammed%2Fnode-canon/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasinmuhammed%2Fnode-canon/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasinmuhammed%2Fnode-canon/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rasinmuhammed","download_url":"https://codeload.github.com/rasinmuhammed/node-canon/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rasinmuhammed%2Fnode-canon/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33930307,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-05T02:00:06.157Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deduplication","entity-resolution","graphrag","knowledge-graph","lightrag","llamaindex","nlp","python","rag"],"created_at":"2026-06-05T04:30:37.494Z","updated_at":"2026-06-05T04:30:40.395Z","avatar_url":"https://github.com/rasinmuhammed.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# nodecanon\n\n[![PyPI](https://img.shields.io/badge/pypi-v0.1.0-blue)](https://pypi.org/project/nodecanon/)\n[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/)\n[![License: MIT](https://img.shields.io/badge/license-MIT-green)](LICENSE)\n[![Tests](https://img.shields.io/badge/tests-282%20passing-brightgreen)](tests/)\n[![Typed](https://img.shields.io/badge/typed-py.typed-informational)](nodecanon/py.typed)\n\n**Entity resolution and deduplication for LLM-extracted knowledge graphs.**\n\nYour knowledge graph extracted 847 entities. You should have 312.\n\nThe other 535 are the same real-world things written differently: \"IBM\", \"I.B.M.\", \"International Business Machines\", \"IBM Corp\". The LLM that extracted them had no memory of what it called the same company three chunks ago.\n\nnodecanon fixes that.\n\n```bash\npip install nodecanon\n```\n\n```python\nfrom nodecanon import Resolver, GraphBuilder\n\ngraph = (\n    GraphBuilder()\n    .add_node(\"IBM\", type=\"ORGANIZATION\")\n    .add_node(\"I.B.M.\", type=\"ORGANIZATION\")\n    .add_node(\"International Business Machines\", type=\"ORGANIZATION\")\n    .add_node(\"Watson AI\", type=\"PRODUCT\")\n    .add_edge(\"IBM\", \"Watson AI\", \"MAKES\")\n    .add_edge(\"I.B.M.\", \"Watson AI\", \"MAKES\")\n    .add_edge(\"International Business Machines\", \"Watson AI\", \"MAKES\")\n    .build()\n)\n\nresult = Resolver().resolve(graph)\nprint(result.merge_report())\n```\n\n```\nMerged 4 nodes into 2 canonical nodes\nAbsorbed 2 alias nodes\nRemoved 2 redundant edges\nFlagged 0 conflicts for human review\n```\n\nNo LLM calls. No API keys. Runs locally in under two minutes on 10,000 nodes.\n\n---\n\n## Why this exists\n\nMulti-hop reasoning over a knowledge graph only works if the graph is actually connected. When \"IBM\" and \"I.B.M.\" are two separate nodes with no edge between them, your retrieval pipeline cannot traverse that gap. It treats them as strangers. Every query that crosses this invisible seam comes back wrong or incomplete.\n\nThis is not a bug in GraphRAG or LlamaIndex. It is a fundamental consequence of how LLMs process text in chunks: each chunk names entities independently, with no awareness of how the same entity was named 3,000 tokens earlier. The same problem has been independently reported across every major GraphRAG framework.\n\nnodecanon is the post-processing step that reconnects the graph.\n\n### What makes this problem specific\n\nLLM-extracted knowledge graphs have three properties that make entity resolution harder than the standard case:\n\n- **No fixed schema**: one node has a description, another has none; one has a type label, another has five different ones extracted across chunks\n- **Graph-structured identity**: two nodes may be the same entity not because their attributes match, but because they connect to the same neighbors in the graph\n- **Schema-free types**: \"COMPANY\", \"ORGANIZATION\", \"FIRM\", \"CORP\" all mean the same thing but look different to any string or embedding comparison\n\nnodecanon is built specifically for this combination.\n\n---\n\n## How it works\n\nFour layers run in sequence.\n\n### 1. Block: O(n), not O(n²)\n\nAt 10,000 nodes, all-pairs scoring requires 50 million comparisons. Blocking cuts this to roughly 1-5% of pairs by generating only plausible candidates.\n\nFour strategies combine via union:\n\n- **TokenOverlapBlocker**: pairs nodes that share at least one non-stopword token. Catches \"IBM Corp\" / \"IBM Inc\". Misses pure abbreviations.\n- **NGramFingerprintBlocker**: pairs nodes with overlapping character trigrams. \"IBM\" and \"I.B.M.\" both normalize to `ibm`, sharing the trigram fingerprint. Catches abbreviation variants that token overlap misses.\n- **AbbreviationBlocker**: pairs a short name with a longer name when the short one looks like an abbreviation. Three tests: initialism (`ML` from `Machine Learning`), consonant contraction (`NVDA` from `NVIDIA`), subsequence (`MSFT` from `Microsoft`).\n- **TypeCompatibilityBlocker**: a filter, not a generator. Removes type-incompatible pairs from the union before scoring. `PERSON` + `ORGANIZATION` never reach the scorer.\n\n### 2. Score: five-component ScoreVector\n\nFor each candidate pair, a `ScoreVector` is computed rather than a single number. The vector preserves *why* two nodes are similar, which drives both the merge decision and the audit trail.\n\n```python\nScoreVector(\n    name_similarity        = 0.94,   # rapidfuzz WRatio + Jaro-Winkler on metaphone forms\n    semantic_similarity    = 0.91,   # cosine similarity of all-MiniLM-L6-v2 embeddings\n    type_agreement         = 1.00,   # 1.0 if compatible, 0.5 if unknown, 0.0 if incompatible\n    neighbor_overlap       = 0.87,   # soft Jaccard of 1-hop neighbor name sets\n    description_similarity = 0.83,   # cosine similarity of description embeddings\n)\n```\n\nThe `neighbor_overlap` component is the key differentiator from classical ER. If \"IBM\" and \"I.B.M.\" both connect to \"Watson\", \"Ginni Rometty\", and \"Armonk NY\", their structural position in the graph is identical even when their name similarity is moderate. Two nodes that occupy the same position in a graph are almost certainly the same entity.\n\nWhen both nodes have zero neighbors, `neighbor_overlap` is 0.0, not 1.0. Absence of evidence is not evidence of match.\n\n### 3. Match: weighted threshold\n\nThe weighted sum is compared against a configurable threshold (default 0.75):\n\n```\nscore = 0.30 * name + 0.25 * semantic + 0.20 * type + 0.20 * neighbor + 0.05 * description\n```\n\nPairs above the threshold merge. An optional ambiguous zone (default 0.65-0.80) can route uncertain pairs to an LLM for a binary yes/no call. Off by default, affects roughly 5-10% of candidates when enabled.\n\n### 4. Merge: union-find, full provenance\n\nUnion-find ensures transitivity: if A matches B and B matches C, all three collapse into one canonical node without re-scoring.\n\nThe most-connected node becomes canonical. Every merge is logged on the resulting node:\n\n```python\nnode._merged_from    = [\"ibm_001\", \"ibm_047\", \"ibm_203\"]\nnode._merge_evidence = {\"name_similarity\": 0.94, \"neighbor_overlap\": 0.87, ...}\nnode._merge_strategy = \"rule_based\"\nnode._resolved_types = [\"ORGANIZATION\", \"COMPANY\"]\n```\n\nNothing is silently dropped.\n\n---\n\n## Installation\n\n```bash\npip install nodecanon\n```\n\nFor Microsoft GraphRAG integration (adds pandas and pyarrow):\n```bash\npip install nodecanon[graphrag]\n```\n\nFor LLM-assisted matching on ambiguous pairs:\n```bash\npip install nodecanon[llm]   # installs openai + anthropic\n```\n\nFor Neo4j full roundtrip (load from live instance, write back resolved):\n```bash\npip install nodecanon[neo4j]\n```\n\nAll adapters at once:\n```bash\npip install nodecanon[graphrag,llamaindex,lightrag,neo4j,llm]\n```\n\n---\n\n## Building a graph\n\n### From plain dicts\n\nThe most common path when loading from a database or JSON file:\n\n```python\nfrom nodecanon import KGGraph\n\ngraph = KGGraph.from_dicts(\n    nodes=[\n        {\"name\": \"IBM\",                             \"type\": \"ORGANIZATION\"},\n        {\"name\": \"I.B.M.\",                          \"type\": \"ORGANIZATION\"},\n        {\"name\": \"International Business Machines\", \"type\": \"ORGANIZATION\"},\n        {\"name\": \"Watson AI\",                       \"type\": \"PRODUCT\"},\n    ],\n    edges=[\n        {\"source\": \"IBM\",    \"target\": \"Watson AI\", \"relation\": \"MAKES\"},\n        {\"source\": \"I.B.M.\", \"target\": \"Watson AI\", \"relation\": \"MAKES\"},\n    ],\n)\n```\n\n- `id` is optional: auto-generated from the name when omitted (`\"IBM Corp\"` becomes id `\"ibm_corp\"`)\n- Extra fields land in `node.attributes` (`{\"founded\": 1911}` becomes `node.attributes[\"founded\"]`)\n- Edge keys accept `source` / `source_id` and `target` / `target_id` interchangeably\n\n### Fluent builder\n\n```python\nfrom nodecanon import GraphBuilder\n\ngraph = (\n    GraphBuilder()\n    .add_node(\"IBM\",      type=\"ORGANIZATION\", founded=1911)\n    .add_node(\"I.B.M.\",   type=\"ORGANIZATION\")\n    .add_node(\"Watson AI\", type=\"PRODUCT\")\n    .add_edge(\"IBM\",    \"Watson AI\", \"MAKES\")\n    .add_edge(\"I.B.M.\", \"Watson AI\", \"MAKES\")\n    .build()\n)\n```\n\n- `add_node` is idempotent: calling it twice with the same name is a no-op\n- `add_edge` accepts node names or node ids; referenced nodes that do not exist yet are auto-created\n- Keyword arguments on `add_node` go into `attributes`\n\n### Direct construction\n\n```python\nfrom nodecanon import KGGraph, KGNode, KGEdge\n\ngraph = KGGraph(\n    nodes=[\n        KGNode(id=\"n1\", name=\"IBM\",    type=\"ORGANIZATION\"),\n        KGNode(id=\"n2\", name=\"I.B.M.\", type=\"ORGANIZATION\"),\n    ],\n    edges=[\n        KGEdge(source_id=\"n1\", target_id=\"n2\", relation=\"SAME_AS\"),\n    ],\n)\n```\n\n---\n\n## Resolving\n\n```python\nfrom nodecanon import Resolver\n\nresult = Resolver().resolve(graph)\n```\n\nThe first call downloads `all-MiniLM-L6-v2` (~90 MB) and caches it locally. Subsequent calls use the cached model.\n\n### Persist embeddings across runs\n\nOn large graphs, re-embedding the same nodes on every run is wasteful. Pass `cache_dir` to reuse embeddings:\n\n```python\nfrom nodecanon import Resolver\nfrom nodecanon.core.scoring import NodeScorer\n\nresolver = Resolver(\n    scorer=NodeScorer(cache_dir=\".nodecanon/embeddings\")\n)\nresult = resolver.resolve(graph)\n```\n\nThe cache is keyed by node content hash. If a node changes, its embedding is automatically recomputed.\n\n### Custom weights and threshold\n\n```python\nfrom nodecanon import Resolver\nfrom nodecanon.core.scoring import NodeScorer\nfrom nodecanon.core.matching import RuleBasedMatcher\n\nscorer = NodeScorer(\n    weights={\n        \"name_similarity\":        0.35,\n        \"semantic_similarity\":    0.30,\n        \"type_agreement\":         0.20,\n        \"neighbor_overlap\":       0.10,\n        \"description_similarity\": 0.05,\n    }\n)\n\n# Stricter threshold for high-precision requirements\nmatcher = RuleBasedMatcher(threshold=0.85)\n\nresolver = Resolver(scorer=scorer, matcher=matcher)\nresult = resolver.resolve(graph)\n```\n\n### LLM-assisted matching for ambiguous pairs\n\n```python\nfrom nodecanon.core.matching import LLMAssistedMatcher, RuleBasedMatcher\n\nllm_matcher = LLMAssistedMatcher(\n    rule_matcher=RuleBasedMatcher(threshold=0.75),\n    ambiguous_low=0.65,\n    ambiguous_high=0.80,\n    provider=\"anthropic\",\n    model=\"claude-haiku-4-5-20251001\",\n)\n\nresolver = Resolver(matcher=llm_matcher)\nresult = resolver.resolve(graph)\n```\n\nThe LLM is called only for pairs that fall in the ambiguous zone. Clear matches and clear non-matches are decided locally.\n\n### Fast mode: no embeddings\n\nFor graphs where topology signal is strong and speed matters:\n\n```python\nfast_weights = {\n    \"name_similarity\":        0.43,\n    \"semantic_similarity\":    0.00,\n    \"type_agreement\":         0.29,\n    \"neighbor_overlap\":       0.29,\n    \"description_similarity\": 0.00,\n}\nresolver = Resolver(\n    scorer=NodeScorer(weights=fast_weights, cache_dir=None),\n    matcher=RuleBasedMatcher(threshold=0.72, weights=fast_weights),\n)\n```\n\nFast mode runs in under 0.1 seconds on 64 nodes. F1 on the synthetic benchmark: 0.974.\n\n---\n\n## Reading results\n\n### Summary report\n\n```python\nprint(result.merge_report())\n# Merged 847 nodes into 312 canonical nodes\n# Absorbed 535 alias nodes\n# Removed 1,203 redundant edges\n# Flagged 14 conflicts for human review\n```\n\n### Iterate canonical nodes\n\n```python\nfor node in result.graph.nodes:\n    if node._merged_from:\n        print(f\"{node.name!r} absorbed: {node._merged_from}\")\n```\n\n### Explain a specific merge decision\n\n```python\nprint(result.explain(\"ibm_canonical_id\"))\n```\n\n```\nCanonical node: 'IBM' (id: n1)\n\nMerged from 3 nodes:\n  . \"IBM\" (id: n1)\n  . \"I.B.M.\" (id: n2)\n  . \"IBM Corporation\" (id: n3)\n\nMerge evidence:\n  name_similarity:        0.890  (weight 0.3)\n  semantic_similarity:    0.940  (weight 0.25)\n  type_agreement:         1.000  (weight 0.2)\n  neighbor_overlap:       1.000  (weight 0.2)\n  description_similarity: 0.000  (weight 0.05)\n  weighted score:         0.921\n\nMerge strategy: rule_based\n```\n\n### Review conflicts\n\nType-incompatible pairs are flagged as `MergeConflict` rather than silently merged:\n\n```python\nfor i, conflict in enumerate(result.conflicts):\n    print(f\"[{i}] {conflict.node_id_a} vs {conflict.node_id_b}\")\n    print(f\"     Reason: {conflict.conflict_reason}\")\n    print(f\"     Score:  {conflict.score.weighted_sum():.3f}\")\n```\n\n---\n\n## Editing results after resolution\n\nAll editing methods return a new `ResolveResult`. The original is never mutated. Corrections can be chained.\n\n### Reject a merge\n\n```python\n# The resolver merged \"Python\" (language) with \"Python\" (snake) -- undo it\ncorrected = result.reject_merge(\"python_canonical_id\")\n\n# Restore only specific aliases, not all of them\ncorrected = result.reject_merge(\"python_canonical_id\", restore=[\"python_snake_id\"])\n```\n\nAfter rejecting, the canonical node reverts to its pre-merge form and the restored aliases are re-added as independent nodes. Edges stay on the canonical and cannot be automatically split back.\n\n### Force a merge\n\n```python\n# The resolver did not merge \"Alphabet Inc\" and \"Google\" -- do it manually\ncorrected = result.force_merge(\"alphabet_id\", \"google_id\")\n\n# Three-way force merge\ncorrected = result.force_merge(\"id_a\", \"id_b\", \"id_c\")\n```\n\n### Accept a flagged conflict\n\n```python\n# See all conflicts\nfor i, c in enumerate(result.conflicts):\n    print(f\"[{i}] {c.node_id_a} + {c.node_id_b}: {c.conflict_reason}\")\n\n# Accept conflict at index 0 and merge the pair\ncorrected = result.accept_conflict(0)\n```\n\n### Chain corrections\n\n```python\nfinal = (\n    result\n    .reject_merge(\"wrong_merge_id\")\n    .force_merge(\"alphabet_id\", \"google_id\")\n    .accept_conflict(0)\n)\n```\n\n---\n\n## Adapters\n\n### Microsoft GraphRAG\n\n```bash\npip install nodecanon[graphrag]\n```\n\n```python\nfrom nodecanon.adapters.graphrag import GraphRAGAdapter\nfrom nodecanon import Resolver\n\ngraph = GraphRAGAdapter.from_directory(\"./graphrag_output/\")\nresult = Resolver().resolve(graph)\n```\n\nReads `entities.parquet` and `relationships.parquet`. Supports both v1 and v2 GraphRAG output layouts.\n\n### LlamaIndex PropertyGraphIndex\n\n```bash\npip install nodecanon[llamaindex]\n```\n\n```python\nfrom nodecanon.adapters.llamaindex import LlamaIndexAdapter\nfrom nodecanon import Resolver\n\nadapter = LlamaIndexAdapter()\ngraph = adapter.load(my_property_graph_index)\nresult = Resolver().resolve(graph)\n\n# Write back to the index\nadapter.save(result.graph, my_property_graph_index)\n```\n\n### LightRAG\n\n```bash\npip install nodecanon[lightrag]\n```\n\n```python\nfrom nodecanon.adapters.lightrag import LightRAGAdapter\nfrom nodecanon import Resolver\n\ngraph = LightRAGAdapter.from_working_dir(\"./lightrag_data/\")\nresult = Resolver().resolve(graph)\nLightRAGAdapter.save(result.graph, \"./lightrag_data/\")\n```\n\nReads `graph_chunk_entity_relation.graphml` from the LightRAG working directory.\n\n### nano-graphrag\n\nnano-graphrag stores its entity-relation graph in the same GraphML format as LightRAG. No extra install is needed beyond networkx (already a core dependency).\n\n```python\nfrom nodecanon.adapters.nanographrag import NanoGraphRAGAdapter\nfrom nodecanon import Resolver\n\n# From a working directory (after nano-graphrag has finished indexing)\ngraph = NanoGraphRAGAdapter.from_working_dir(\"./nano_output/\")\nresult = Resolver().resolve(graph)\nNanoGraphRAGAdapter.save(result.graph, \"./nano_output/\")\n\n# From a live GraphRAG instance (in-memory, no disk I/O)\ngraph = NanoGraphRAGAdapter.from_instance(rag)\nresult = Resolver().resolve(graph)\n```\n\n### NetworkX\n\n```python\nfrom nodecanon.adapters.networkx import NetworkXAdapter\nfrom nodecanon import Resolver\nimport networkx as nx\n\nG = nx.read_graphml(\"my_graph.graphml\")\ngraph = NetworkXAdapter.from_networkx(G)\n\nresult = Resolver().resolve(graph)\n\nG_resolved = NetworkXAdapter.to_networkx(result.graph)\n```\n\n### Neo4j (full roundtrip)\n\n```bash\npip install nodecanon[neo4j]\n```\n\nLoad from a live Neo4j instance, resolve, and write back. The write-back is non-destructive: canonical nodes are updated in place, alias nodes gain `_is_alias: true` and an `IS_ALIAS_OF` relationship. Nothing is deleted.\n\n```python\nfrom neo4j import GraphDatabase\nfrom nodecanon.adapters.neo4j import Neo4jAdapter\nfrom nodecanon import Resolver\n\ndriver = GraphDatabase.driver(\"bolt://localhost:7687\", auth=(\"neo4j\", \"password\"))\n\n# Load\ngraph = Neo4jAdapter.from_neo4j(driver, node_label=\"Entity\")\n\n# Resolve\nresult = Resolver().resolve(graph)\n\n# Write back\nstats = Neo4jAdapter.to_neo4j(driver, result)\nprint(stats)\n# {\"nodes_upserted\": 312, \"aliases_annotated\": 535, \"edges_merged\": 1203}\n\ndriver.close()\n```\n\nExport to a Cypher file instead (no live connection required):\n\n```python\nfrom pathlib import Path\nfrom nodecanon.adapters.neo4j import Neo4jAdapter\n\nNeo4jAdapter().dump(result.graph, Path(\"resolved.cypher\"))\n```\n\n```bash\ncypher-shell -u neo4j -p password \u003c resolved.cypher\n```\n\n---\n\n## CLI\n\n```bash\n# Resolve a GraphRAG output directory\nnodecanon resolve ./graphrag_output/ --output ./resolved/\n\n# Inspect the resolved graph\nnodecanon inspect ./resolved/\n\n# Explain a specific merge decision\nnodecanon explain \u003cnode_id\u003e ./resolved/\n```\n\n---\n\n## Benchmark\n\n### Real-world: DBpedia entity aliases\n\nGround truth from DBpedia `wikiPageRedirects`. When Wikipedia redirects \"I.B.M.\" to \"IBM\", that redirect is an entity alias. We download 287 company and person pairs filtered to genuine name variants (similarity \u003e= 50%), build a graph from real DBpedia properties (founders, parent companies, employer relations) as topology anchors, and measure against that ground truth.\n\n| Condition | Pairs | Precision | Recall | F1 |\n|-----------|-------|-----------|--------|-----|\n| With topology (shared DBpedia anchors) | 71 | **1.000** | **0.986** | **0.993** |\n| Name-only, fast mode | 216 | 0.771 | 0.282 | 0.413 |\n| Name-only, full mode | 216 | 0.930 | 0.230 | 0.369 |\n\nWhen your GraphRAG output has shared neighbors between duplicate nodes (the typical case when the same entity is mentioned across multiple text chunks), nodecanon achieves near-perfect precision and recall with no API calls.\n\nThe name-only rows cover structurally hard cases: subsidiary names (\"Egmont Imagination\" vs \"Egmont Group\"), different-language translations (\"Royal Dutch\" vs \"Royal Netherlands\"), and short forms without shared graph context. These are candidates for `LLMAssistedMatcher`.\n\n```bash\npython benchmarks/dbpedia_benchmark.py --fast     # downloads from DBpedia, fast mode\npython benchmarks/dbpedia_benchmark.py            # full mode with sentence-transformers\npython benchmarks/dbpedia_benchmark.py --offline  # reuse cached data\n```\n\n### Synthetic benchmark\n\n64 nodes across 12 canonical entity clusters with realistic name variants, 93 edges. Covers easy (IBM / IBM Corp), medium (Samuel Altman / S. Altman), hard (LLM / large language model), and abbreviation cases (NVDA / NVIDIA).\n\n| Mode | Precision | Recall | F1 | Time |\n|------|-----------|--------|-----|------|\n| Fast (no embeddings) | **1.000** | **0.949** | **0.974** | \u003c 0.1s |\n| Full (sentence-transformers) | 1.000 | 0.949+ | 0.974+ | ~5s |\n\nCurated real-world alias test (28 entity clusters, actual organization / person / concept aliases, topology-equipped):\n\n| Precision | Recall | F1 |\n|-----------|--------|-----|\n| **0.990** | **0.783** | **0.874** |\n\n```bash\npython benchmarks/run_benchmark.py --fast\npython benchmarks/run_benchmark.py\npython benchmarks/battle_test.py --aliases --no-wikidata\npython benchmarks/battle_test.py --fb15k --sample 2000\n```\n\n---\n\n## FAQ\n\n**Does nodecanon work if my graph has no edges?**\n\nYes. Name similarity and semantic similarity still fire. You will not get the topology signal (`neighbor_overlap` stays at 0.0), so pairs with similar names but no shared context are harder to merge confidently. Populate edges before resolving when possible.\n\n**Why did it miss an obvious duplicate?**\n\nThree common reasons. First, the pair may not have been blocked: check `AbbreviationBlocker` for acronym-to-full-name pairs without shared tokens. Second, the score may be below threshold: run `result.explain(node_id)` to see the component breakdown and decide whether to lower the threshold or use `force_merge`. Third, the types may be incompatible: the `TypeCompatibilityBlocker` removes them before scoring.\n\n**What happens to edges when nodes merge?**\n\nAll edges from alias nodes redirect to the canonical node. If merging creates parallel edges (same source, target, and relation), they are deduplicated and their weights are summed.\n\n**How do I run nodecanon on the same graph multiple times without re-embedding?**\n\nPass `cache_dir` to `NodeScorer`. Embeddings are cached by content hash and reused automatically on subsequent runs.\n\n**Can I use a different embedding model?**\n\nYes. Subclass `NodeScorer` and override `_embed`. The default is `all-MiniLM-L6-v2` from sentence-transformers because it runs on CPU, downloads once, and is fast enough for production-scale graphs.\n\n**Does it run offline?**\n\nAfter the first run (which downloads the embedding model), yes. The model is cached by sentence-transformers in `~/.cache/torch/sentence_transformers/`. Set `cache_dir` on `NodeScorer` to also persist embeddings across graph runs.\n\n**What is the recommended threshold for high-precision production use?**\n\n0.85 with the default weights. This virtually eliminates false merges at the cost of lower recall on borderline pairs. Use `LLMAssistedMatcher` with `ambiguous_low=0.75, ambiguous_high=0.85` to recover ambiguous pairs via LLM at low cost.\n\n---\n\n## Data model reference\n\n### KGNode\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `id` | `str` | Unique identifier within the graph |\n| `name` | `str` | Surface form of the entity name |\n| `type` | `str or None` | Entity type label (e.g. `\"ORGANIZATION\"`) |\n| `description` | `str or None` | Free-text description |\n| `attributes` | `dict` | Any additional key-value metadata |\n| `source_chunks` | `list[str]` | Source chunk IDs from the extraction pipeline |\n| `_merged_from` | `list[str] or None` | IDs of all nodes merged into this one (set on merge) |\n| `_merge_evidence` | `dict or None` | ScoreVector components that triggered the merge |\n| `_merge_strategy` | `str or None` | `\"rule_based\"`, `\"llm_assisted\"`, or `\"manual\"` |\n| `_resolved_types` | `list[str] or None` | All type labels from merged nodes (union) |\n\n### KGEdge\n\n| Field | Type | Description |\n|-------|------|-------------|\n| `source_id` | `str` | ID of the source node |\n| `target_id` | `str` | ID of the target node |\n| `relation` | `str` | Relationship label |\n| `weight` | `float` | Default 1.0; parallel edges sum their weights on merge |\n| `attributes` | `dict` | Any additional metadata |\n\n### ScoreVector\n\n| Field | Type | Default weight |\n|-------|------|----------------|\n| `name_similarity` | `float` | 0.30 |\n| `semantic_similarity` | `float` | 0.25 |\n| `type_agreement` | `float` | 0.20 |\n| `neighbor_overlap` | `float` | 0.20 |\n| `description_similarity` | `float` | 0.05 |\n\nCall `score.weighted_sum()` for the combined decision score. Pass a `weights` dict to override defaults.\n\n### ResolveResult\n\n| Attribute | Type | Description |\n|-----------|------|-------------|\n| `graph` | `KGGraph` | The resolved graph with canonical nodes |\n| `merge_records` | `list[MergeRecord]` | One record per merged group |\n| `conflicts` | `list[MergeConflict]` | Pairs flagged for human review |\n| `original_node_count` | `int` | Node count before resolution |\n| `original_edge_count` | `int` | Edge count before resolution |\n\n**Methods:**\n\n| Method | Returns | Description |\n|--------|---------|-------------|\n| `merge_report()` | `str` | Human-readable summary of what changed |\n| `explain(node_id)` | `str` | Detailed breakdown of a merge decision |\n| `reject_merge(canonical_id, restore=None)` | `ResolveResult` | Undo a merge |\n| `force_merge(*node_ids)` | `ResolveResult` | Manually merge nodes |\n| `accept_conflict(index)` | `ResolveResult` | Accept a flagged conflict and merge it |\n\n---\n\n## TypeCompatibilityBlocker: built-in type clusters\n\nUnknown types (not in any cluster) default to compatible with everything. The scoring layer handles disambiguation. You can extend the compatibility map:\n\n```python\nfrom nodecanon.core.blocking import TypeCompatibilityBlocker, UnionBlocker\nfrom nodecanon.core.blocking import TokenOverlapBlocker, NGramFingerprintBlocker, AbbreviationBlocker\n\ncustom_compat = {\n    **TypeCompatibilityBlocker.DEFAULT_COMPATIBILITY,\n    \"DRUG\":     {\"DRUG\", \"MEDICATION\", \"PHARMACEUTICAL\", \"COMPOUND\"},\n    \"GENE\":     {\"GENE\", \"PROTEIN\", \"BIOMARKER\"},\n}\n\nresolver = Resolver(\n    blocker=UnionBlocker([\n        TokenOverlapBlocker(),\n        NGramFingerprintBlocker(),\n        AbbreviationBlocker(),\n        TypeCompatibilityBlocker(compatibility_map=custom_compat),\n    ])\n)\n```\n\nBuilt-in clusters:\n\n| Canonical | Compatible labels |\n|-----------|-----------------|\n| ORGANIZATION | COMPANY, CORP, CORPORATION, FIRM, INSTITUTION, STARTUP, AGENCY, ASSOCIATION, FOUNDATION, UNIVERSITY |\n| PERSON | INDIVIDUAL, HUMAN, RESEARCHER, AUTHOR, SCIENTIST |\n| LOCATION | PLACE, CITY, COUNTRY, REGION, GPE, AREA |\n| PRODUCT | SOFTWARE, SERVICE, TOOL, SYSTEM, PLATFORM |\n| EVENT | INCIDENT, OCCURRENCE |\n| CONCEPT | IDEA, TOPIC, THEORY, METHOD, TECHNIQUE |\n\n---\n\n## Known limitations\n\n**Acronym to full name pairs** (e.g. `\"IBM\"` vs `\"International Business Machines\"`) require either strong graph topology overlap or `LLMAssistedMatcher`. At the default threshold with no shared neighbors, the weighted score peaks at roughly 0.72, just below the 0.75 merge threshold. If your graph has many such pairs, lower the threshold, ensure edges are populated before resolving, or enable LLM-assisted matching for the ambiguous zone.\n\n**Rebranding and informal names** (e.g. `\"Google\"` vs `\"Alphabet\"`, `\"Britain\"` vs `\"United Kingdom\"`) score low on name similarity and need semantic or topological evidence. These are the primary driver of missed recall in the real-world alias benchmark.\n\n**Short ambiguous acronyms** (`\"WHO\"`, `\"UN\"`, `\"ML\"`) can false-match unrelated entities if different domains share the same graph. The `TypeCompatibilityBlocker` and high type_agreement weight mitigate this, but verify results when your graph spans multiple domains.\n\n**Very large graphs (\u003e50k nodes)** may hit memory pressure on the embedding matrix. Use `cache_dir` to persist embeddings between runs, and `batch_size` on the scorer to control peak memory.\n\n---\n\n## What it does not do\n\n- **Extract** knowledge graphs from text: that is GraphRAG's job\n- **Require an API key** in default mode: sentence-transformers runs locally on CPU\n- **Silently drop data**: every merge is logged with provenance; type conflicts surface as `MergeConflict`\n- **Modify your original graph**: `resolve()` always returns a new graph\n- **Require a GPU**: all-MiniLM-L6-v2 runs on CPU in roughly 50ms per sentence\n\n---\n\n## Performance targets\n\n| Scale | Blocking | Scoring | Total |\n|-------|----------|---------|-------|\n| 1,000 nodes, 5,000 edges | \u003c 0.5s | \u003c 10s | \u003c 15s |\n| 10,000 nodes, 50,000 edges | \u003c 5s | \u003c 60s | \u003c 2 min |\n\nMemory: peak \u003c 4 GB for 10,000 nodes on an 8 GB laptop.\n\n---\n\n## Contributing\n\nBug reports, feature requests, and pull requests are welcome at [github.com/rasinmuhammed/node-canon](https://github.com/rasinmuhammed/node-canon).\n\nWhen filing a bug, include the output of `result.explain(node_id)` for any merge that behaved unexpectedly. The score breakdown makes root causes much easier to identify.\n\n---\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasinmuhammed%2Fnode-canon","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frasinmuhammed%2Fnode-canon","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frasinmuhammed%2Fnode-canon/lists"}