{"id":40790173,"url":"https://github.com/ibm/docling-graph","last_synced_at":"2026-01-26T07:09:51.808Z","repository":{"id":325063510,"uuid":"1098317259","full_name":"IBM/docling-graph","owner":"IBM","description":"Transform unstructured documents into validated, rich and queryable knowledge graphs.","archived":false,"fork":false,"pushed_at":"2026-01-20T23:13:28.000Z","size":32792,"stargazers_count":16,"open_issues_count":4,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-21T07:11:20.622Z","etag":null,"topics":["ai","convert","docling","document-processing","knowledge-graph"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/IBM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":"GOVERNANCE.md","roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-11-17T14:35:47.000Z","updated_at":"2026-01-20T23:13:33.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/IBM/docling-graph","commit_stats":null,"previous_names":["ibm/docling-graph"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/IBM/docling-graph","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Fdocling-graph","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Fdocling-graph/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Fdocling-graph/releases","manifests_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Fdocling-graph/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/IBM","download_url":"https://codeload.github.com/IBM/docling-graph/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/IBM%2Fdocling-graph/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28641293,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T18:04:35.752Z","status":"ssl_error","status_checked_at":"2026-01-21T18:03:55.054Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","convert","docling","document-processing","knowledge-graph"],"created_at":"2026-01-21T20:01:46.676Z","updated_at":"2026-01-21T20:01:47.507Z","avatar_url":"https://github.com/IBM.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\u003cbr\u003e\n  \u003ca href=\"https://github.com/IBM/docling-graph\"\u003e\n    \u003cimg loading=\"lazy\" alt=\"Docling Graph\" src=\"docs/assets/logo.png\" width=\"280\"/\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n# Docling 
Graph\n\n[![Docs](https://img.shields.io/badge/docs-live-brightgreen)](https://ibm.github.io/docling-graph)\n[![Docling](https://img.shields.io/badge/Docling-VLM-red)](https://github.com/docling-project/docling)\n[![PyPI version](https://img.shields.io/pypi/v/docling-graph)](https://pypi.org/project/docling-graph/)\n[![Python 3.10 | 3.11 | 3.12](https://img.shields.io/badge/Python-3.10%20%7C%203.11%20%7C%203.12-blue)](https://www.python.org/downloads/)\n[![uv](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/uv/main/assets/badge/v0.json)](https://github.com/astral-sh/uv)\n[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)\n[![NetworkX](https://img.shields.io/badge/NetworkX-3.0+-red)](https://networkx.org/)\n[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://pydantic.dev)\n[![Typer](https://img.shields.io/badge/Typer-CLI-purple)](https://typer.tiangolo.com/)\n[![Rich](https://img.shields.io/badge/Rich-terminal-purple)](https://github.com/Textualize/rich)\n[![vLLM](https://img.shields.io/badge/vLLM-compatible-brightgreen)](https://vllm.ai/)\n[![Ollama](https://img.shields.io/badge/Ollama-compatible-brightgreen)](https://ollama.ai/)\n[![LF AI \u0026 Data](https://img.shields.io/badge/LF%20AI%20%26%20Data-003778?logo=linuxfoundation\u0026logoColor=fff\u0026color=0094ff\u0026labelColor=003778)](https://lfaidata.foundation/projects/)\n[![License MIT](https://img.shields.io/github/license/IBM/docling-graph)](https://opensource.org/licenses/MIT)\n[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/11598/badge)](https://www.bestpractices.dev/projects/11598)\n\nDocling-Graph converts documents into validated **Pydantic** objects and then into a **directed knowledge graph**, with exports to CSV or Cypher and both static and interactive 
visualizations.\n\nThis transformation of unstructured documents into validated knowledge graphs with precise semantic relationships is essential for complex domains such as **chemistry, finance, and physics**, where AI systems must understand exact entity connections (e.g., chemical compounds and their reactions, financial instruments and their dependencies, physical properties and their measurements) rather than approximate text vectors, **enabling explainable reasoning over technical document collections**.\n\nThe toolkit supports two extraction families: **local VLM** via Docling and **LLM-based extraction** via local (vLLM, Ollama) or API providers (Mistral, OpenAI, Gemini, IBM WatsonX), all orchestrated by a flexible, config-driven pipeline.\n\n\n\n## Key Capabilities\n\n- **🧠 Extraction**:\n  - Local `VLM` (Docling's information extraction pipeline - ideal for small documents with key-value focus)  \n  - `LLM` (local via vLLM/Ollama or remote via Mistral/OpenAI/Gemini/IBM WatsonX API)\n  - `Hybrid Chunking`: Leverages Docling's segmentation with semantic LLM chunking for more context-aware extraction\n  - `Page-wise` or `whole-document` conversion strategies for flexible processing\n- **🔨 Graph Construction**:\n  - Markdown to Graph: Convert validated Pydantic instances to a `NetworkX DiGraph` with rich edge metadata and stable node IDs\n  - Smart Merge: Combine multi-page documents into a single Pydantic instance for unified processing\n  - Modular graph module with enhanced type safety and configuration\n- **📦 Export**:\n  - `Docling Document` exports (JSON format with full document structure)\n  - `Markdown` exports (full document and per-page options)\n  - `CSV` compatible with `Neo4j` admin import  \n  - `Cypher` script generation for bulk ingestion\n  - `JSON` export for general-purpose graph data\n- **📊 Visualization**:\n  - Interactive `HTML` visualization in full-page browser view with enhanced node/edge exploration\n  - Detailed `MARKDOWN` report with 
graph node contents and edges\n\n### Coming Soon\n\n* 🪜 **Multi-Stage Extraction:** Define `extraction_stage` in templates to control multi-pass extraction.\n* 🧩 **Interactive Template Builder:** Guided workflows for building Pydantic templates.\n* 🧬 **Ontology-Based Templates:** Match content to the best Pydantic template using semantic similarity.\n* ✍🏻 **Flexible Inputs:** Accepts `text`, `markdown`, and `DoclingDocument` directly.\n* ⚡ **Batch Optimization:** Faster GPU inference with better memory handling.\n* 💾 **Graph Database Integration:** Export data straight into `Neo4j`, `ArangoDB`, and similar databases.\n\n\n\n## Initial Setup\n\n### Requirements\n\n- Python 3.10 or higher\n- The `uv` package manager\n\n### Installation\n\n#### 1. Clone the Repository\n\n```bash\ngit clone https://github.com/IBM/docling-graph\ncd docling-graph\n```\n\n#### 2. Install Dependencies\n\nChoose the installation option that matches your use case:\n\n| Option          | Command                   | Description                                                                |\n| :---            | :---                      | :---                                                                       |\n| **Minimal**     | `uv sync`                 | Includes core VLM features (Docling), **no** LLM inference                 |\n| **Full**        | `uv sync --extra all`     | Includes **all** features, VLM, and all local/remote LLM providers         |\n| **Local LLM**   | `uv sync --extra local`   | Adds support for vLLM and Ollama (requires GPU for vLLM)                   |\n| **Remote API**  | `uv sync --extra remote`  | Adds support for Mistral, OpenAI, Gemini, and IBM WatsonX APIs             |\n| **WatsonX**     | `uv sync --extra watsonx` | Adds support for IBM WatsonX foundation models (Granite, Llama, Mixtral)   |\n\n\n#### 3. 
OPTIONAL - GPU Support (PyTorch)\n\nFollow the steps in [this guide](docs/guides/setup_with_gpu_support.md) to install PyTorch with NVIDIA GPU (CUDA) support.\n\n\n\n### API Key Setup (for Remote Inference)\n\nIf you're using remote/cloud inference, set your API keys for the providers you plan to use:\n\n```bash\nexport OPENAI_API_KEY=\"...\"        # OpenAI\nexport MISTRAL_API_KEY=\"...\"       # Mistral\nexport GEMINI_API_KEY=\"...\"        # Google Gemini\nexport WATSONX_API_KEY=\"...\"       # IBM WatsonX\nexport WATSONX_PROJECT_ID=\"...\"    # IBM WatsonX Project ID\nexport WATSONX_URL=\"...\"           # IBM WatsonX URL (optional, defaults to US South)\n```\n\nOn Windows, replace `export` with `set` in Command Prompt or `$env:` in PowerShell.\n\nAlternatively, add them to your `.env` file.\n\n**Note:** For IBM WatsonX setup and available models, see the [WatsonX Integration Guide](docs/guides/watsonx_integration.md).\n\n\n\n## Getting Started\n\nDocling Graph is primarily driven by its **CLI**, but you can easily integrate the core pipeline into Python scripts.\n\n### 1. Python Example\n\nTo run a conversion programmatically, you define a configuration dictionary and pass it to the `run_pipeline` function. 
This example uses a **remote LLM API** in `many-to-one` mode for a single multi-page document:\n\n```python\nfrom docling_graph import run_pipeline, PipelineConfig\nfrom docs.examples.templates.rheology_research import Research  # Pydantic model to use as an extraction template\n\n# Create typed config\nconfig = PipelineConfig(\n    source=\"docs/examples/data/research_paper/rheology.pdf\",\n    template=Research,\n    backend=\"llm\",\n    inference=\"remote\",\n    processing_mode=\"many-to-one\",\n    provider_override=\"mistral\",              # Specify your preferred provider and ensure its API key is set\n    model_override=\"mistral-medium-latest\",   # Specify your preferred LLM model\n    use_chunking=True,                        # Enable Docling's hybrid chunker\n    llm_consolidation=False,                  # If False, programmatically merge batch-extracted dictionaries\n    output_dir=\"outputs/battery_research\"\n)\n\ntry:\n    run_pipeline(config)\n    print(f\"\\nExtraction complete! Graph data saved to: {config.output_dir}\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n```\n\n\n### 2. CLI Example\n\nUse the command-line interface for quick conversions and inspections. The following commands run the conversion using the local VLM backend and output a graph ready for Neo4j import:\n\n#### 2.1. Initialize Configuration\n\nA wizard will walk you through setting up the right config for your use case.\n\n```bash\nuv run docling-graph init\n```\n\nNote: This command may take a little longer to start on the first run, as it checks for installed dependencies.\n\n\n#### 2.2. 
Run Conversion\n\nRun `docling-graph convert --help` to see the full list of available options and usage details.\n\n```bash\n# uv run docling-graph convert \u003cSOURCE_FILE_PATH\u003e --template \"\u003cTEMPLATE_DOTTED_PATH\u003e\" [OPTIONS]\n\nuv run docling-graph convert \"docs/examples/data/research_paper/rheology.pdf\" \\\n    --template \"docs.examples.templates.rheology_research.Research\" \\\n    --output-dir \"outputs/battery_research\" \\\n    --processing-mode \"many-to-one\" \\\n    --use-chunking \\\n    --no-llm-consolidation\n```\n\n#### 2.3. Inspect the Output\n\n```bash\n# uv run docling-graph inspect \u003cCONVERT_OUTPUT_PATH\u003e [OPTIONS]\n\nuv run docling-graph inspect outputs/battery_research\n```\n\n\n\n## Pydantic Templates\n\nTemplates are the foundation of Docling Graph, defining both the **extraction schema** and the resulting **graph structure**.\n\n  * Use `is_entity=True` in `model_config` to explicitly mark a class as a graph node.\n  * Leverage `model_config.graph_id_fields` to create stable, readable node IDs (natural keys).\n  * Use the `Edge()` helper to define explicit relationships between entities.\n\n**Example:**\n\n```python\nfrom pydantic import BaseModel, Field\n\nclass Person(BaseModel):\n    \"\"\"Person entity with stable ID based on name and DOB.\"\"\"\n    model_config = {\n        'is_entity': True,\n        'graph_id_fields': ['last_name', 'date_of_birth']\n    }\n    \n    first_name: str = Field(description=\"Person's first name\")\n    last_name: str = Field(description=\"Person's last name\")\n    date_of_birth: str = Field(description=\"Date of birth (YYYY-MM-DD)\")\n```\n\nReference Pydantic [templates](docs/examples/templates) are available to help you get started quickly.\n\nFor complete guidance, see: [Pydantic Templates for Knowledge Graph Extraction](docs/guides/create_pydantic_templates_for_kg_extraction.md)\n\n\n\n## 
Examples\n\nGet hands-on with Docling Graph [examples](docs/examples/scripts) to convert documents into knowledge graphs through `VLM` or `LLM`-based processing.\n\n## License\n\nMIT License - see [LICENSE](LICENSE) for details.\n\n\n\n## Acknowledgments\n\n- Powered by [Docling](https://github.com/docling-project/docling) for advanced document processing.\n- Uses [Pydantic](https://pydantic.dev) for data validation.\n- Graph generation powered by [NetworkX](https://networkx.org/).\n- Visualizations powered by [Cytoscape.js](https://js.cytoscape.org/).\n- CLI powered by [Typer](https://typer.tiangolo.com/) and [Rich](https://github.com/Textualize/rich).\n\n\n\n## IBM ❤️ Open Source AI\n\nDocling Graph has been brought to you by IBM.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fibm%2Fdocling-graph","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fibm%2Fdocling-graph","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fibm%2Fdocling-graph/lists"}