{"id":27396921,"url":"https://github.com/aimaster-dev/knowledge-graph-construction","last_synced_at":"2026-02-12T05:33:10.688Z","repository":{"id":287563631,"uuid":"964651431","full_name":"aimaster-dev/knowledge-graph-construction","owner":"aimaster-dev","description":"This AI-powered tool extracts Subject-Predicate-Object (SPO) triplets from unstructured text using LLMs, performs entity standardization and relationship inference, and generates interactive, community-detected knowledge graphs with rich visualizations.","archived":false,"fork":false,"pushed_at":"2025-04-12T14:40:43.000Z","size":1782,"stargazers_count":8,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-09-02T13:48:40.941Z","etag":null,"topics":["ai-graphs","deep-learning","entity-linking","graph-analysis","graph-community","knowledge-graphs","langchain","llms","nlp","ollama","openai-api","prompt-engineering","python","pyvis","relationship-inference","semantic-graph","spo-extraction","text-mining","text-to-graph","visualization"],"latest_commit_sha":null,"homepage":"https://aimaster-dev.github.io/knowledge-website/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aimaster-dev.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-11T15:00:43.000Z","updated_at":"2025-07-30T20:11:37.000Z","dependencies_parsed_at":"2025-04-14T00:48:55.473Z","dependency_job_id":null,"html_url":"https://github.com/aimaster-dev/knowledge-graph-construction","commit_stats":null,"previous_names":
["aimaster-dev/knowledge-graph-construction"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/aimaster-dev/knowledge-graph-construction","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimaster-dev%2Fknowledge-graph-construction","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimaster-dev%2Fknowledge-graph-construction/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimaster-dev%2Fknowledge-graph-construction/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimaster-dev%2Fknowledge-graph-construction/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aimaster-dev","download_url":"https://codeload.github.com/aimaster-dev/knowledge-graph-construction/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aimaster-dev%2Fknowledge-graph-construction/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29359510,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-12T01:03:07.613Z","status":"online","status_checked_at":"2026-02-12T02:00:06.911Z","response_time":55,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-graphs","deep-learning","entity-linking","graph-analysis","graph-community","knowledge-graphs","langchain","llms","nlp","ollama","openai-api","prompt-engineering"
,"python","pyvis","relationship-inference","semantic-graph","spo-extraction","text-mining","text-to-graph","visualization"],"created_at":"2025-04-14T00:48:52.641Z","updated_at":"2026-02-12T05:33:10.682Z","avatar_url":"https://github.com/aimaster-dev.png","language":"HTML","readme":"![ai-knowledge-graph-example](https://github.com/aimaster-dev/knowledge-graph-construction/blob/main/data/ai-knowledge-graph-example.png)\n\n# AI-Powered Knowledge Graph Generator\n\nThis system takes an unstructured text document, uses an LLM of your choice to extract knowledge in the form of Subject-Predicate-Object (SPO) triplets, and visualizes the relationships as an interactive knowledge graph.\nA demo of a knowledge graph created with this project can be found [here](https://aimaster-dev.github.io/knowledge-website/).\n\n## Features\n\n- **Text Chunking**: Automatically splits large documents into manageable chunks for processing\n- **Knowledge Extraction**: Uses AI to identify entities and their relationships\n- **Entity Standardization**: Ensures consistent entity naming across document chunks\n- **Relationship Inference**: Discovers additional relationships between disconnected parts of the graph\n- **Interactive Visualization**: Renders the graph as an interactive HTML page\n- **Works with Any OpenAI-Compatible API Endpoint**: Ollama, LM Studio, OpenAI, vLLM, LiteLLM (provides access to AWS Bedrock, Azure OpenAI, Anthropic and many other LLM services)\n\n## Requirements\n\n- Python 3.11+\n- Required packages (install using `pip install -r requirements.txt` or `uv sync`)\n\n## Quick Start\n\n1. Clone this repository\n2. Install dependencies: `pip install -r requirements.txt`\n3. Configure your settings in `config.toml`\n4. 
Run the system:\n\n```bash\npython generate-graph.py --input your_text_file.txt --output knowledge_graph.html\n```\n\nOr with UV:\n\n```bash\nuv run generate-graph.py --input your_text_file.txt --output knowledge_graph.html\n```\n\nOr install and use it as a module:\n\n```bash\npip install --upgrade -e .\ngenerate-graph --input your_text_file.txt --output knowledge_graph.html\n```\n\n## Configuration\n\nThe system can be configured using the `config.toml` file:\n\n```toml\n[llm]\nmodel = \"gemma3\"  # Google open-weight model\napi_key = \"sk-1234\"\nbase_url = \"http://localhost:11434/v1/chat/completions\" # Local Ollama instance (can be any OpenAI-compatible endpoint)\nmax_tokens = 8192\ntemperature = 0.2\n\n[chunking]\nchunk_size = 200  # Number of words per chunk\noverlap = 20      # Number of words to overlap between chunks\n\n[standardization]\nenabled = true            # Enable entity standardization\nuse_llm_for_entities = true  # Use LLM for additional entity resolution\n\n[inference]\nenabled = true             # Enable relationship inference\nuse_llm_for_inference = true  # Use LLM for relationship inference\napply_transitive = true    # Apply transitive inference rules\n```\n\n## Command Line Options\n\n- `--input FILE`: Input text file to process\n- `--output FILE`: Output HTML file for visualization (default: knowledge_graph.html)\n- `--config FILE`: Path to config file (default: config.toml)\n- `--debug`: Enable debug output with raw LLM responses\n- `--no-standardize`: Disable entity standardization\n- `--no-inference`: Disable relationship inference\n- `--test`: Generate sample visualization using test data\n\n### Usage message (--help)\n\n```bash\ngenerate-graph --help\nusage: generate-graph [-h] [--test] [--config CONFIG] [--output OUTPUT] [--input INPUT] [--debug] [--no-standardize] [--no-inference]\n\nKnowledge Graph Generator and Visualizer\n\noptions:\n  -h, --help        show this help message and exit\n  --test            
Generate a test visualization with sample data\n  --config CONFIG   Path to configuration file\n  --output OUTPUT   Output HTML file path\n  --input INPUT     Path to input text file (required unless --test is used)\n  --debug           Enable debug output (raw LLM responses and extracted JSON)\n  --no-standardize  Disable entity standardization\n  --no-inference    Disable relationship inference\n```\n\n### Example Run\n\n**Command:**\n\n```bash\ngenerate-graph --input data/test.txt --output test-kg.html\n```\n**Console Output:**\n\n```markdown\nUsing input text from file: data/test.txt\n==================================================\nPHASE 1: INITIAL TRIPLE EXTRACTION\n==================================================\nProcessing text in 13 chunks (size: 100 words, overlap: 20 words)\nProcessing chunk 1/13 (100 words)\nProcessing chunk 2/13 (100 words)\nProcessing chunk 3/13 (100 words)\nProcessing chunk 4/13 (100 words)\nProcessing chunk 5/13 (100 words)\nProcessing chunk 6/13 (100 words)\nProcessing chunk 7/13 (100 words)\nProcessing chunk 8/13 (100 words)\nProcessing chunk 9/13 (100 words)\nProcessing chunk 10/13 (100 words)\nProcessing chunk 11/13 (100 words)\nProcessing chunk 12/13 (86 words)\nProcessing chunk 13/13 (20 words)\n\nExtracted a total of 216 triples from all chunks\n\n==================================================\nPHASE 2: ENTITY STANDARDIZATION\n==================================================\nStarting with 216 triples and 201 unique entities\nStandardizing entity names across all triples...\nApplied LLM-based entity standardization for 15 entity groups\nStandardized 201 entities into 181 standard forms\nAfter standardization: 216 triples and 160 unique entities\n\n==================================================\nPHASE 3: RELATIONSHIP INFERENCE\n==================================================\nStarting with 216 triples\nTop 5 relationship types before inference:\n  - enables: 20 occurrences\n  - impacts: 15 occurrences\n  - 
enabled: 12 occurrences\n  - pioneered: 10 occurrences\n  - invented: 9 occurrences\nInferring additional relationships between entities...\nIdentified 9 disconnected communities in the graph\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 3 new relationships between communities\nInferred 9 new relationships within communities\nInferred 2 new relationships within communities\nInferred 88 relationships based on lexical similarity\nAdded -22 inferred relationships\n\nTop 5 relationship types after inference:\n  - related to: 65 occurrences\n  - advances via Artificial Intelligence: 36 occurrences\n  - pioneered via computing: 26 occurrences\n  - enables via computing: 24 occurrences\n  - enables: 21 occurrences\n\nAdded 370 inferred relationships\nFinal knowledge graph: 564 triples\nSaved raw knowledge graph data to /mnt/c/Users/rmcdermo/Documents/test.json\nProcessing 564 triples for visualization\nFound 161 unique nodes\nFound 355 inferred relationships\nDetected 9 communities using Louvain method\nNodes in NetworkX graph: 161\nEdges in NetworkX graph: 537\nKnowledge graph visualization saved to /mnt/c/Users/rmcdermo/Documents/test.html\nGraph Statistics: {\n  \"nodes\": 161,\n  \"edges\": 564,\n  \"original_edges\": 209,\n  \"inferred_edges\": 355,\n  \"communities\": 9\n}\n\nKnowledge Graph Statistics:\nNodes: 161\nEdges: 564\nCommunities: 9\n\nTo view the visualization, open the following file in your browser:\nfile:///mnt/c/Users/rmcdermo/Documents/industrial-revolution-kg.html\n```\n\n## How It Works\n\n1. 
**Chunking**: The document is split into overlapping chunks to fit within the LLM's context window\n2. **First Pass - SPO Extraction**: \n   - Each chunk is processed by the LLM to extract Subject-Predicate-Object triplets\n   - Implemented in the `process_with_llm` function\n   - The LLM identifies entities and their relationships within each text segment\n   - Results are collected across all chunks to form the initial knowledge graph\n3. **Second Pass - Entity Standardization**:\n   - Basic standardization through text normalization\n   - Optional LLM-assisted entity alignment (controlled by `standardization.use_llm_for_entities` config)\n   - When enabled, the LLM reviews all unique entities from the graph and identifies groups that refer to the same concept\n   - This resolves cases where the same entity appears differently across chunks (e.g., \"AI\", \"artificial intelligence\", \"AI system\")\n   - Standardization helps create a more coherent and navigable knowledge graph\n4. **Third Pass - Relationship Inference**:\n   - Automatic inference of transitive relationships\n   - Optional LLM-assisted inference between disconnected graph components (controlled by `inference.use_llm_for_inference` config)\n   - When enabled, the LLM analyzes representative entities from disconnected communities and infers plausible relationships\n   - This reduces graph fragmentation by adding logical connections not explicitly stated in the text\n   - Both rule-based and LLM-based inference methods work together to create a more comprehensive graph\n5. 
**Visualization**: An interactive HTML visualization is generated using the PyVis library\n\nBoth the second and third passes are optional and can be disabled in the configuration to minimize LLM usage or control these processes manually.\n\n## Visualization Features\n\n- **Color-coded Communities**: Node colors represent different communities\n- **Node Size**: Nodes sized by importance (degree, betweenness, eigenvector centrality)\n- **Relationship Types**: Original relationships shown as solid lines, inferred relationships as dashed lines\n- **Interactive Controls**: Zoom, pan, hover for details, filtering and physics controls\n- **Light (default) and Dark mode themes**.\n\n## Project Layout\n\n```\n.\n├── config.toml                     # Main configuration file for the system\n├── generate-graph.py               # Entry point when run directly as a script\n├── pyproject.toml                  # Python project metadata and build configuration\n├── requirements.txt                # Python dependencies for 'pip' users\n├── uv.lock                         # Python dependencies for 'uv' users\n└── src/                            # Source code\n    ├── generate_graph.py           # Main entry point script when run as a module\n    └── knowledge_graph/            # Core package\n        ├── __init__.py             # Package initialization\n        ├── config.py               # Configuration loading and validation\n        ├── entity_standardization.py # Entity standardization algorithms\n        ├── llm.py                  # LLM interaction and response processing\n        ├── main.py                 # Main program flow and orchestration\n        ├── prompts.py              # Centralized collection of LLM prompts\n        ├── text_utils.py           # Text processing and chunking utilities\n        ├── visualization.py        # Knowledge graph visualization generator\n        └── templates/              # HTML templates for visualization\n            └── 
graph_template.html # Base template for interactive graph\n```\n\n## Program Flow\n\nThis diagram illustrates the program flow.\n\n```mermaid\nflowchart TD\n    %% Main entry points\n    A[main.py - Entry Point] --\u003e B{Parse Arguments}\n    \n    %% Test mode branch\n    B --\u003e|--test flag| C[sample_data_visualization]\n    C --\u003e D[visualize_knowledge_graph]\n    \n    %% Normal processing branch\n    B --\u003e|normal processing| E[load_config]\n    E --\u003e F[process_text_in_chunks]\n    \n    %% Text processing\n    F --\u003e G[chunk_text]\n    G --\u003e H[process_with_llm]\n    \n    %% LLM processing\n    H --\u003e I[call_llm]\n    I --\u003e J[extract_json_from_text]\n    \n    %% Entity standardization phase\n    F --\u003e K{standardization enabled?}\n    K --\u003e|yes| L[standardize_entities]\n    K --\u003e|no| M{inference enabled?}\n    L --\u003e M\n    \n    %% Relationship inference phase\n    M --\u003e|yes| N[infer_relationships]\n    M --\u003e|no| O[visualize_knowledge_graph]\n    N --\u003e O\n    \n    %% Visualization components\n    O --\u003e P[_calculate_centrality_metrics]\n    O --\u003e Q[_detect_communities]\n    O --\u003e R[_calculate_node_sizes]\n    O --\u003e S[_add_nodes_and_edges_to_network]\n    O --\u003e T[_get_visualization_options]\n    O --\u003e U[_save_and_modify_html]\n    \n    %% Subprocesses\n    L --\u003e L1[_resolve_entities_with_llm]\n    N --\u003e N1[_identify_communities]\n    N --\u003e N2[_infer_relationships_with_llm]\n    N --\u003e N3[_infer_within_community_relationships]\n    N --\u003e N4[_apply_transitive_inference]\n    N --\u003e N5[_infer_relationships_by_lexical_similarity]\n    N --\u003e N6[_deduplicate_triples]\n    \n    %% File outputs\n    U --\u003e V[HTML Visualization]\n    F --\u003e W[JSON Data Export]\n    \n    %% Prompts usage\n    Y[prompts.py] --\u003e H\n    Y --\u003e L1\n    Y --\u003e N2\n    Y --\u003e N3\n    \n    %% Module dependencies\n    subgraph 
Modules\n        main.py\n        config.py\n        text_utils.py\n        llm.py\n        entity_standardization.py\n        visualization.py\n        prompts.py\n    end\n    \n    %% Phases\n    subgraph P1[\"Phase 1: Triple Extraction\"]\n        G\n        H\n        I\n        J\n    end\n    \n    subgraph P2[\"Phase 2: Entity Standardization\"]\n        L\n        L1\n    end\n    \n    subgraph P3[\"Phase 3: Relationship Inference\"]\n        N\n        N1\n        N2\n        N3\n        N4\n        N5\n        N6\n    end\n    \n    subgraph P4[\"Phase 4: Visualization\"]\n        O\n        P\n        Q\n        R\n        S\n        T\n        U\n    end\n```\n\n## Program Flow Description\n\n1. **Entry Point**: The program starts in `main.py`, which parses command-line arguments.\n\n2. **Mode Selection**:\n   - If the `--test` flag is provided, it generates a sample visualization\n   - Otherwise, it processes the input text file\n\n3. **Configuration**: Loads settings from `config.toml` using `config.py`\n\n4. **Text Processing**:\n   - Breaks text into chunks with overlap using `text_utils.py`\n   - Processes each chunk with the LLM to extract triples\n   - Uses prompts from `prompts.py` to guide the LLM's extraction process\n\n5. **Entity Standardization** (optional):\n   - Standardizes entity names across all triples\n   - May use LLM for entity resolution in ambiguous cases\n   - Uses specialized prompts from `prompts.py` for entity resolution\n\n6. **Relationship Inference** (optional):\n   - Identifies communities in the graph\n   - Infers relationships between disconnected communities\n   - Applies transitive inference and lexical similarity rules\n   - Uses specialized prompts from `prompts.py` for relationship inference\n   - Deduplicates triples\n\n7. 
**Visualization**:\n   - Calculates centrality metrics and detects communities\n   - Determines node sizes and colors based on importance\n   - Creates an interactive HTML visualization using PyVis\n   - Customizes the HTML with templates\n\n8. **Output**:\n   - Saves the knowledge graph as both HTML and JSON\n   - Displays statistics about nodes, edges, and communities\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimaster-dev%2Fknowledge-graph-construction","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faimaster-dev%2Fknowledge-graph-construction","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faimaster-dev%2Fknowledge-graph-construction/lists"}
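The word-based chunking the README describes (a fixed `chunk_size` in words, with `overlap` words repeated at each boundary so that triplets straddling a boundary are fully visible in at least one chunk) can be sketched as follows. `chunk_text` here is an illustrative stand-in under assumed semantics, not the repository's actual implementation in `text_utils.py`:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into word-based chunks, repeating `overlap` words at each boundary."""
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)] if words else []
    # Each chunk starts `chunk_size - overlap` words after the previous one,
    # so the first `overlap` words of a chunk duplicate the tail of its predecessor.
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks
```

Note that in the README's example run (100-word chunks, 20-word overlap) the last chunks are shorter than `chunk_size`; the sketch above reproduces that behavior by simply truncating the final slice.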
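The rule-based transitive inference gated by `apply_transitive = true` could, in its simplest form, look like the sketch below. The triple schema (`subject`/`predicate`/`object` dicts), the `inferred` flag, and the `related to` label are assumptions made for illustration; the project's own rules (applied in its `_apply_transitive_inference` step) may differ:

```python
def apply_transitive_inference(triples):
    """If (A, r1, B) and (B, r2, C) are known, add an inferred (A, 'related to', C).

    `triples` are dicts with 'subject'/'predicate'/'object' keys -- an assumed
    schema for this sketch. Inferred triples are flagged so a visualizer can
    render them differently (e.g., as dashed edges)."""
    known = {(t["subject"], t["object"]) for t in triples}
    by_subject = {}
    for t in triples:
        by_subject.setdefault(t["subject"], []).append(t)
    inferred = []
    for t in triples:
        # Follow one hop: any triple whose subject is this triple's object.
        for nxt in by_subject.get(t["object"], []):
            a, c = t["subject"], nxt["object"]
            if a != c and (a, c) not in known:
                known.add((a, c))
                inferred.append({"subject": a, "predicate": "related to",
                                 "object": c, "inferred": True})
    return triples + inferred
```

A single pass like this only closes one hop; iterating to a fixed point would yield the full transitive closure but can flood the graph with edges, which is a plausible reason the tool keeps this behind a config flag and deduplicates triples afterwards.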