https://github.com/ibm/docling-graph
Transform unstructured documents into validated, rich and queryable knowledge graphs.
- Host: GitHub
- URL: https://github.com/ibm/docling-graph
- Owner: IBM
- License: mit
- Created: 2025-11-17T14:35:47.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-01-20T23:13:28.000Z (about 1 month ago)
- Last Synced: 2026-01-21T07:11:20.622Z (about 1 month ago)
- Topics: ai, convert, docling, document-processing, knowledge-graph
- Language: Python
- Size: 31.3 MB
- Stars: 16
- Watchers: 0
- Forks: 4
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Governance: GOVERNANCE.md
# Docling Graph
[Documentation](https://ibm.github.io/docling-graph)
[Docling](https://github.com/docling-project/docling)
[PyPI](https://pypi.org/project/docling-graph/)
[Python](https://www.python.org/downloads/)
[uv](https://github.com/astral-sh/uv)
[Ruff](https://github.com/astral-sh/ruff)
[NetworkX](https://networkx.org/)
[Pydantic](https://pydantic.dev)
[Typer](https://typer.tiangolo.com/)
[Rich](https://github.com/Textualize/rich)
[vLLM](https://vllm.ai/)
[Ollama](https://ollama.ai/)
[LF AI & Data](https://lfaidata.foundation/projects/)
[License: MIT](https://opensource.org/licenses/MIT)
[OpenSSF Best Practices](https://www.bestpractices.dev/projects/11598)
Docling-Graph converts documents into validated **Pydantic** objects and then into a **directed knowledge graph**, with exports to CSV or Cypher and both static and interactive visualizations.
This matters in complex domains such as **chemistry, finance, and physics**, where AI systems must understand exact entity connections (e.g., chemical compounds and their reactions, financial instruments and their dependencies, physical properties and their measurements) rather than rely on approximate text vectors, **enabling explainable reasoning over technical document collections**.
The toolkit supports two extraction families: **local VLM** via Docling and **LLM-based extraction** via local (vLLM, Ollama) or API providers (Mistral, OpenAI, Gemini, IBM WatsonX), all orchestrated by a flexible, config-driven pipeline.
## Key Capabilities
- **🧠 Extraction**:
- Local `VLM` (Docling's information extraction pipeline - ideal for small documents with key-value focus)
- `LLM` (local via vLLM/Ollama or remote via Mistral/OpenAI/Gemini/IBM WatsonX API)
- `Hybrid Chunking`: leverages Docling's segmentation together with semantic LLM chunking for more context-aware extraction
- `Page-wise` or `whole-document` conversion strategies for flexible processing
- **🔨 Graph Construction**:
- Markdown to Graph: Convert validated Pydantic instances to a `NetworkX DiGraph` with rich edge metadata and stable node IDs
- Smart Merge: Combine multi-page documents into a single Pydantic instance for unified processing
- Modular graph module with enhanced type safety and configuration
- **📦 Export**:
- `Docling Document` exports (JSON format with full document structure)
- `Markdown` exports (full document and per-page options)
- `CSV` compatible with `Neo4j` admin import
- `Cypher` script generation for bulk ingestion
- `JSON` export for general-purpose graph data
- **📊 Visualization**:
- Interactive `HTML` visualization in full-page browser view with enhanced node/edge exploration
- Detailed `MARKDOWN` report with graph nodes content and edges
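To illustrate the target structure (this is a plain NetworkX sketch, not Docling-Graph's actual internal code; the node IDs and attribute names are hypothetical), a directed graph with stable node IDs and rich edge metadata looks like:

```python
import networkx as nx

# Illustrative only: node IDs are built from natural-key fields,
# and edges carry relationship metadata as attributes.
g = nx.DiGraph()
g.add_node("Person:lovelace:1815-12-10", label="Person", first_name="Ada")
g.add_node("Paper:rheology-of-slurries", label="Paper", title="Rheology of Slurries")
g.add_edge(
    "Paper:rheology-of-slurries",
    "Person:lovelace:1815-12-10",
    relation="AUTHORED_BY",
)

print(g.number_of_nodes(), g.number_of_edges())  # 2 1
```

Because node IDs are derived from stable fields rather than random UUIDs, re-running an extraction produces the same IDs, which is what makes CSV/Cypher re-imports and multi-document merges deterministic.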
### Coming Soon
* 🪜 **Multi-Stage Extraction:** Define `extraction_stage` in templates to control multi-pass extraction.
* 🧩 **Interactive Template Builder:** Guided workflows for building Pydantic templates.
* 🧬 **Ontology-Based Templates:** Match content to the best Pydantic template using semantic similarity.
* ✍🏻 **Flexible Inputs:** Accepts `text`, `markdown`, and `DoclingDocument` directly.
* ⚡ **Batch Optimization:** Faster GPU inference with better memory handling.
* 💾 **Graph Database Integration:** Export data straight into `Neo4j`, `ArangoDB`, and similar databases.
## Initial Setup
### Requirements
- Python 3.10 or higher
- UV package manager
### Installation
#### 1. Clone the Repository
```bash
git clone https://github.com/IBM/docling-graph
cd docling-graph
```
#### 2. Install Dependencies
Choose the installation option that matches your use case:
| Option | Command | Description |
| :--- | :--- | :--- |
| **Minimal** | `uv sync` | Includes core VLM features (Docling), **no** LLM inference |
| **Full** | `uv sync --extra all` | Includes **all** features, VLM, and all local/remote LLM providers |
| **Local LLM** | `uv sync --extra local` | Adds support for vLLM and Ollama (requires GPU for vLLM) |
| **Remote API** | `uv sync --extra remote` | Adds support for Mistral, OpenAI, Gemini, and IBM WatsonX APIs |
| **WatsonX** | `uv sync --extra watsonx` | Adds support for IBM WatsonX foundation models (Granite, Llama, Mixtral) |
#### 3. OPTIONAL - GPU Support (PyTorch)
Follow the steps in [this guide](docs/guides/setup_with_gpu_support.md) to install PyTorch with NVIDIA GPU (CUDA) support.
### API Key Setup (for Remote Inference)
If you're using remote/cloud inference, set your API keys for the providers you plan to use:
```bash
export OPENAI_API_KEY="..." # OpenAI
export MISTRAL_API_KEY="..." # Mistral
export GEMINI_API_KEY="..." # Google Gemini
export WATSONX_API_KEY="..." # IBM WatsonX
export WATSONX_PROJECT_ID="..." # IBM WatsonX Project ID
export WATSONX_URL="..." # IBM WatsonX URL (optional, defaults to US South)
```
On Windows, replace `export` with `set` in Command Prompt or `$env:` in PowerShell.
Alternatively, add them to your `.env` file.
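A `.env` file simply lists the same variables, one per line (the values below are placeholders; include only the providers you actually use):

```bash
# .env — loaded from the project root; keep this file out of version control
MISTRAL_API_KEY=your-mistral-key
OPENAI_API_KEY=your-openai-key
WATSONX_API_KEY=your-watsonx-key
WATSONX_PROJECT_ID=your-project-id
```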
**Note:** For IBM WatsonX setup and available models, see the [WatsonX Integration Guide](docs/guides/watsonx_integration.md).
## Getting Started
Docling Graph is primarily driven by its **CLI**, but you can easily integrate the core pipeline into Python scripts.
### 1. Python Example
To run a conversion programmatically, you define a configuration dictionary and pass it to the `run_pipeline` function. This example uses a **remote LLM API** in a `many-to-one` mode for a single multi-page document:
```python
from docling_graph import run_pipeline, PipelineConfig
from docs.examples.templates.rheology_research import Research # Pydantic model to use as an extraction template
# Create typed config
config = PipelineConfig(
source="docs/examples/data/research_paper/rheology.pdf",
template=Research,
backend="llm",
inference="remote",
processing_mode="many-to-one",
provider_override="mistral", # Specify your preferred provider and ensure its API key is set
model_override="mistral-medium-latest", # Specify your preferred LLM model
use_chunking=True, # Enable docling's hybrid chunker
llm_consolidation=False, # If False, programmatically merge batch-extracted dictionaries
output_dir="outputs/battery_research"
)
try:
run_pipeline(config)
print(f"\nExtraction complete! Graph data saved to: {config.output_dir}")
except Exception as e:
print(f"An error occurred: {e}")
```
### 2. CLI Example
Use the command-line interface for quick conversions and inspections. The steps below initialize a configuration, run a conversion, and inspect the results:
#### 2.1. Initialize Configuration
A wizard will walk you through setting up the right config for your use case.
```bash
uv run docling-graph init
```
Note: This command may take a little longer to start on the first run, as it checks for installed dependencies.
#### 2.2. Run Conversion
Run `docling-graph convert --help` to see the full list of available options and usage details.
```bash
# uv run docling-graph convert <SOURCE> --template <TEMPLATE> [OPTIONS]
uv run docling-graph convert "docs/examples/data/research_paper/rheology.pdf" \
--template "docs.examples.templates.rheology_research.Research" \
--output-dir "outputs/battery_research" \
--processing-mode "many-to-one" \
--use-chunking \
--no-llm-consolidation
```
#### 2.3. Inspect Results
```bash
# uv run docling-graph inspect [OPTIONS]
uv run docling-graph inspect outputs/battery_research
```
## Pydantic Templates
Templates are the foundation of Docling Graph, defining both the **extraction schema** and the resulting **graph structure**.
* Use `is_entity=True` in `model_config` to explicitly mark a class as a graph node.
* Leverage `model_config.graph_id_fields` to create stable, readable node IDs (natural keys).
* Use the `Edge()` helper to define explicit relationships between entities.
**Example:**
```python
from pydantic import BaseModel, Field
from typing import Optional
class Person(BaseModel):
"""Person entity with stable ID based on name and DOB."""
model_config = {
'is_entity': True,
'graph_id_fields': ['last_name', 'date_of_birth']
}
first_name: str = Field(description="Person's first name")
last_name: str = Field(description="Person's last name")
date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")
```
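Nested entity fields are what typically become graph edges. As a rough sketch using only standard Pydantic (the `Paper` model and its field names are hypothetical, added here purely for illustration), a root template that references `Person` yields one node per entity and an edge for the containment relationship:

```python
from typing import List
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Person entity with stable ID based on name and DOB."""
    model_config = {
        'is_entity': True,
        'graph_id_fields': ['last_name', 'date_of_birth'],
    }
    first_name: str = Field(description="Person's first name")
    last_name: str = Field(description="Person's last name")
    date_of_birth: str = Field(description="Date of birth (YYYY-MM-DD)")

class Paper(BaseModel):
    """Hypothetical root template: nested entity lists become edges."""
    model_config = {'is_entity': True, 'graph_id_fields': ['title']}
    title: str = Field(description="Paper title")
    authors: List[Person] = Field(default_factory=list, description="Paper authors")

# Validation happens at construction time, before any graph is built.
paper = Paper(
    title="Rheology of Battery Slurries",
    authors=[Person(first_name="Ada", last_name="Lovelace",
                    date_of_birth="1815-12-10")],
)
print(paper.authors[0].last_name)  # Lovelace
```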
Reference Pydantic [templates](docs/examples/templates) are available to help you get started quickly.
For complete guidance, see: [Pydantic Templates for Knowledge Graph Extraction](docs/guides/create_pydantic_templates_for_kg_extraction.md)
## Documentation
* *Work In Progress...*
## Examples
Get hands-on with Docling Graph [examples](docs/examples/scripts) to convert documents into knowledge graphs through `VLM` or `LLM`-based processing.
## License
MIT License - see [LICENSE](LICENSE) for details.
## Acknowledgments
- Powered by [Docling](https://github.com/docling-project/docling) for advanced document processing.
- Uses [Pydantic](https://pydantic.dev) for data validation.
- Graph generation powered by [NetworkX](https://networkx.org/).
- Visualizations powered by [Cytoscape.js](https://js.cytoscape.org/).
- CLI powered by [Typer](https://typer.tiangolo.com/) and [Rich](https://github.com/Textualize/rich).
## IBM ❤️ Open Source AI
Docling Graph has been brought to you by IBM.