# Knowledge Builder

## Description

This project is an automated ETL pipeline: knowledge is extracted from documents OCR‑ed by Paperless‑ngx using the
Ollama LLM, then loaded into a Neo4j graph via the official Neo4j Memory MCP server. The loading is performed by a
LangChain ReAct agent, with tool selection delegated to the LLM itself. Optionally, the raw text can also be exported
to an Obsidian vault. The importer includes a scheduler for periodic execution, graceful shutdown with no overlapping
runs, and verbose logging via Loguru.
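
As a rough illustration of the agent‑driven loading step, the sketch below shows one common way to wire a ReAct agent
over Ollama. It is a minimal, hypothetical example, not the project's actual code: it assumes the `langchain-ollama`
and `langgraph` packages, and `call_memory_mcp` is a stand‑in for the real STDIO call to `mcp-neo4j-memory`.

```python
# Minimal, hypothetical sketch of the ReAct-agent loading step (not the project's code).
# Assumes langchain-ollama and langgraph are installed and an Ollama server is running;
# call_memory_mcp is a placeholder for the real STDIO call to mcp-neo4j-memory.
from langchain_core.tools import tool
from langchain_ollama import ChatOllama
from langgraph.prebuilt import create_react_agent


def call_memory_mcp(tool_name: str, arguments: dict) -> str:
    """Placeholder: the real importer forwards this to mcp-neo4j-memory over STDIO."""
    return f"{tool_name} called with {list(arguments)}"


@tool
def create_entities(entities: list[dict]) -> str:
    """Create entities (name, type, observations) in the Neo4j knowledge graph."""
    return call_memory_mcp("create_entities", {"entities": entities})


llm = ChatOllama(model="gpt-oss:20b", temperature=0)
agent = create_react_agent(llm, tools=[create_entities])
result = agent.invoke({"messages": [("user", "Extract entities from this text: ...")]})
print(result["messages"][-1].content)
```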

## ✨ Components

- Paperless‑ngx – OCR and storage for screenshots/documents
- paperless‑token (automated) – builds a tiny image from Paperless and automatically creates/gets the DRF API token,
saving it to a shared file
- Ollama – local LLM (model: `gpt-oss:20b`, temperature=0) built from a custom Dockerfile; the model is baked at image
build time
- **Importer (Modular Architecture)** – A refactored, maintainable Python application with clean separation of concerns
  (a minimal model sketch follows this list):
  - **Configuration Layer** – Centralized environment variable management
  - **Data Models** – Pydantic models for type safety and validation
  - **Utility Modules** – Text processing, JSON parsing, tool call handling
  - **Connectors** – Clean interfaces to Paperless API and Neo4j Memory MCP
  - **Processing Engine** – Document pipeline, AI agent orchestration, state management
  - **Services** – Bootstrap coordination and scheduled execution
- Neo4j – graph database + web UI (Browser)
- Memory MCP server – `mcp-neo4j-memory` invoked via STDIO by the importer (no separate service required)
- Scheduler – Executes the importer periodically (default: every 5 minutes), prevents overlapping runs
- Loguru Logging – Thread‑safe, rotating logs for better diagnostics (10 MB rotation)
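
To give a flavor of the Data Models layer listed above, a Paperless document might be modeled roughly as below. The
field names are illustrative assumptions, not the importer's actual schema in `models.py`.

```python
# Hypothetical sketch of a document model in the spirit of models.py; the field names
# are illustrative assumptions, not the importer's actual schema.
from pydantic import BaseModel


class PaperlessDocument(BaseModel):
    """A document fetched from the Paperless-ngx API, ready for chunking."""

    id: int
    title: str
    content: str = ""             # OCR text extracted by Paperless-ngx
    checksum: str | None = None   # used to skip unchanged documents


doc = PaperlessDocument(id=42, title="invoice.pdf", content="ACME Corp invoice ...")
print(doc.model_dump_json())
```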

## 📂 Directory Structure

```
neo4j-stack/
  docker-compose.yml          # Neo4j separate Compose
paperless/                    # Paperless-ngx data & media
  data/
  media/
importer/
  src/                        # Modular Python application structure
    config.py                 # Centralized configuration management
    models.py                 # Data structures and Pydantic models
    main.py                   # Application entry point
    utils/                    # Utility modules
      __init__.py
      text_utils.py           # Text processing utilities
      json_parser.py          # JSON parsing and extraction
      tool_call_extractor.py  # Tool call extraction from LLM responses
      tool_call_normalizer.py # Tool call parameter normalization
    connectors/               # External service integrations
      __init__.py
      neo4j_connector.py      # Neo4j Memory MCP client management
      paperless_connector.py  # Paperless API integration
    processing/               # Core business logic
      __init__.py
      document_processor.py   # Main ETL pipeline orchestration
      agent_orchestrator.py   # AI agent processing with Neo4j
      state_manager.py        # Processing state and idempotency
    services/                 # Supporting services
      __init__.py
      bootstrap.py            # Service availability and bootstrapping
      scheduler.py            # Scheduled execution coordination
  Dockerfile
ollama/
  Dockerfile                  # builds Ollama image and pre-creates `gpt-oss:20b`
  Modelfile                   # FROM gpt-oss:20b; PARAMETER temperature 0.1
paperless-token/
  Dockerfile                  # minimal image derived from Paperless to create/get token
  entrypoint.sh               # waits for DB; runs manage.py drf_create_token; fallback script
bootstrap/
  get_token.py                # Django fallback (create/get DRF token)
  paperless_token.txt         # shared file: token written here
  token_init.sh               # legacy: not used with paperless-token image
data/
  state.json, importer.log, obsidian/
scripts/
  test_extract_entities.py
  test_tool_parsing.py
inbox/                        # mounted into Paperless consume directory
docker-compose.yml            # Paperless, paperless-token, Ollama, Importer
```

## ✅ Prerequisites

- Docker + Docker Compose
- Free ports: 7474 (Neo4j UI), 7687 (Neo4j Bolt), 8900 (Paperless UI), 11435 (host → Ollama:11434)
- GPU (for Ollama): enable GPU support in Docker (e.g., WSL2 + NVIDIA on Windows, nvidia‑container‑toolkit on Linux)
- Linux host mapping: Compose includes `extra_hosts: host.docker.internal:host-gateway`

## 🚀 Quickstart

1) Start Neo4j (separate compose)

```bash
# from project root
docker compose --env-file ./.env -f neo4j-stack/docker-compose.yml up -d
```

Neo4j UI: http://localhost:7474
Login: user=`NEO4J_USERNAME`, pass=`NEO4J_PASSWORD`

2) Start the KB stack (Paperless, token bootstrap, Ollama, Importer)

```bash
docker compose up -d --build
```

3) Token bootstrap (automated)

The `paperless-token-init` job:
- waits for the Paperless DB file,
- runs `python3 manage.py drf_create_token $PAPERLESS_ADMIN_USER`,
- robustly extracts the token (fallback to Django helper if needed),
- writes it to `./bootstrap/paperless_token.txt`.

Check logs and the file:

```bash
docker compose logs -f paperless-token-init
cat ./bootstrap/paperless_token.txt
```

The importer reads this file automatically.
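
A minimal sketch of how such a token file could be read, including a check for the `PENDING` placeholder mentioned
below; the function name and retry behavior are assumptions, not the importer's exact implementation.

```python
# Hypothetical helper for reading the shared token file; the name and retry behavior
# are assumptions, not the importer's actual code.
import os
import time
from pathlib import Path


def read_paperless_token(path: str = os.getenv("PAPERLESS_TOKEN_FILE", "/bootstrap/paperless_token.txt"),
                         retries: int = 30, delay: float = 10.0) -> str:
    """Wait until the token file exists and no longer contains the PENDING placeholder."""
    for _ in range(retries):
        token_file = Path(path)
        if token_file.exists():
            token = token_file.read_text().strip()
            if token and token != "PENDING":
                return token
        time.sleep(delay)
    raise RuntimeError(f"No usable Paperless token found at {path}")
```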

If the file contains `PENDING`, wait for migrations to finish and restart the token job:

```bash
docker compose restart paperless-token-init
```

## 🔧 Configuration (key envs)

- Root `.env` (shared)

```
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=
PAPERLESS_INBOX=./inbox
PAPERLESS_ADMIN_PASSWORD=
```

- Paperless (ensure admin creds align with your first run)

```
PAPERLESS_ADMIN_USER=admin
PAPERLESS_ADMIN_PASSWORD=
PAPERLESS_REDIS=redis://redis:6379
```

- Importer → Paperless & token file

```
PAPERLESS_URL=http://paperless:8000
PAPERLESS_TOKEN_FILE=/bootstrap/paperless_token.txt
```

- Neo4j (importer connection)

```
NEO4J_URL=bolt://host.docker.internal:7687
# Compatibility also supported by importer: NEO4J_URI, NEO4J_USER, NEO4J_PASS
```

- Ollama

```
# Model name created during the Ollama image build
OLLAMA_MODEL=gpt-oss:20b
# Host access: http://127.0.0.1:11435 (container listens on http://ollama:11434)
```

- Importer runtime

```
MEMORY_MCP_CMD=/app/.venv/bin/mcp-neo4j-memory
STATE_PATH=/data/state.json
VAULT_DIR=/data/obsidian
SCHEDULE_TIME=5 # minutes
CHUNK_SIZE=5000
LOG_FILE=/data/importer.log
FORCE_REPROCESS=0 # set to 1 only for an initial full run/test
OBSIDIAN_EXPORT=0 # set to 1 to export chunk markdown files to VAULT_DIR
# Optional verbose logging controls (defaults shown in code)
LOG_CHUNK_FULL=0
LOG_CHUNK_PREVIEW_MAX=2000
LOG_TOOL_PREVIEW_MAX=1500
LOG_LLM_OUTPUT_MAX=4000
```

Notes:

- In the provided `docker-compose.yml`, `FORCE_REPROCESS` is set to `1` to exercise the pipeline on first startup;
change it to `0` for incremental runs.
- The Memory MCP server (`mcp-neo4j-memory`) is launched by the importer via STDIO; no separate service is required.
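
A rough sketch of how a centralized `config.py` might read these variables (including the `NEO4J_URL`/`NEO4J_URI`
compatibility fallback noted above). The dataclass layout is an assumption; the variable names and defaults mirror the
environment shown in this section.

```python
# Hypothetical sketch of centralized configuration in the spirit of config.py; the
# dataclass layout is an assumption, but variable names and defaults mirror the
# environment described above.
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ImporterConfig:
    paperless_url: str = os.getenv("PAPERLESS_URL", "http://paperless:8000")
    paperless_token_file: str = os.getenv("PAPERLESS_TOKEN_FILE", "/bootstrap/paperless_token.txt")
    neo4j_url: str = os.getenv("NEO4J_URL") or os.getenv("NEO4J_URI", "bolt://host.docker.internal:7687")
    ollama_model: str = os.getenv("OLLAMA_MODEL", "gpt-oss:20b")
    schedule_minutes: int = int(os.getenv("SCHEDULE_TIME", "5"))
    chunk_size: int = int(os.getenv("CHUNK_SIZE", "5000"))
    force_reprocess: bool = os.getenv("FORCE_REPROCESS", "0") == "1"
    obsidian_export: bool = os.getenv("OBSIDIAN_EXPORT", "0") == "1"


CONFIG = ImporterConfig()
```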

## 📊 Program Operation

The Knowledge Builder follows a clear ETL (Extract, Transform, Load) pipeline with scheduled execution and graceful
shutdown. The diagram below shows startup, document processing, and shutdown, including idempotency checks and
chunk‑level processing.

```mermaid
flowchart TD
A[Start Importer] --> B[Bootstrap Services]
B --> C{Check Services}
C -->|Paperless| D[Wait for HTTP Service]
C -->|Neo4j| E[Wait for Neo4j Connection]
C -->|Ollama| F[Wait for Ollama API]
D --> G[Get Paperless Token]
E --> G
F --> G
G --> H[Initialize Scheduler]
H --> I[Run First Import]
I --> J[Start Periodic Scheduler]
J --> K[Schedule Check Every 5 Minutes]
K --> L{Time to Run?}
L -->|No| K
L -->|Yes| M{Previous Run Active?}
M -->|Yes| N[Skip - Log Warning]
N --> K
M -->|No| O[Acquire Run Lock]
O --> P[Fetch Documents from Paperless]
P --> Q{For Each Document}
Q --> R[Prepare Document Work]
R --> S{Should Process?}
S -->|Skip - ID ≤ last_id| T[Next Document]
S -->|Skip - No text| U[Update State]
S -->|Skip - Hash unchanged| U
S -->|Process| V[Extract & Chunk Text]
V --> W{For Each Chunk}
W --> X[Initialize MCP Client]
X --> Y[Create LangChain Tools]
Y --> Z[Initialize Ollama LLM]
Z --> AA[Create ReAct Agent]
AA --> BB[Generate Prompt]
BB --> CC[Agent Processes Chunk]
CC --> DD{Agent Actions}
DD -->|Search| EE[search_memories/find_by_name]
DD -->|Create| FF[create_entities]
DD -->|Relate| GG[create_relations]
DD -->|Observe| HH[add_observations]
DD -->|Delete| II[delete_*]
EE --> JJ[MCP Tool Call to Neo4j]
FF --> JJ
GG --> JJ
HH --> JJ
II --> JJ
JJ --> KK{More Actions?}
KK -->|Yes| DD
KK -->|No| LL[Close MCP Client]
LL --> MM{Obsidian Export Enabled?}
MM -->|Yes| NN[Write Markdown File]
MM -->|No| OO[Next Chunk]
NN --> OO
OO --> PP{More Chunks?}
PP -->|Yes| W
PP -->|No| QQ[Finalize Document]
QQ --> RR[Update State & Hash]
RR --> T
T --> SS{More Documents?}
SS -->|Yes| Q
SS -->|No| TT[Release Run Lock]
U --> T
TT --> UU[Log Completion]
UU --> K
VV[Signal Handler] -->|SIGINT/SIGTERM| WW[Set Stop Event]
WW --> XX[Graceful Shutdown]
XX --> YY[Finish Current Run]
YY --> ZZ[Exit]
```
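
The "Extract & Chunk Text" step in the diagram splits each document's OCR text into pieces of at most `CHUNK_SIZE`
characters before handing them to the agent. A minimal chunking sketch, assuming a simple fixed-size split (the
importer's actual splitting strategy may differ):

```python
# Minimal fixed-size chunking sketch; the importer's actual splitting strategy
# (e.g. respecting word or sentence boundaries) may differ.
def chunk_text(text: str, chunk_size: int = 5000) -> list[str]:
    """Split OCR text into chunks of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


chunks = chunk_text("long OCR text ... " * 1000)
print(f"{len(chunks)} chunk(s) of up to 5000 characters")
```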

### Key Operational Features

- Service Bootstrap: waits for Paperless, Neo4j, and Ollama to be available
- Token Management: automated by `paperless-token-init` (token at `bootstrap/paperless_token.txt`)
- Scheduled Execution: runs every 5 minutes (configurable) with overlap prevention
- Document Processing: state tracking avoids reprocessing unchanged documents
- Chunk Processing: splits large documents into manageable chunks (default 5000 chars)
- Agent‑Driven: LangChain ReAct agent makes autonomous decisions about Neo4j operations
- Evidence Linking: every touched/created entity is linked to a per‑chunk Evidence node
- Graceful Shutdown: responds to signals and finishes current work before exiting
- Thread Safety: uses locks to prevent concurrent runs and thread‑safe logging
- Error Resilience: continues processing even if individual documents fail
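
Overlap prevention and graceful shutdown boil down to a non-blocking run lock plus a stop event checked between
documents. The sketch below illustrates the idea with hypothetical names; it is not the importer's actual scheduler
code.

```python
# Hypothetical sketch of overlap prevention and graceful shutdown; names are illustrative.
import signal
import threading

run_lock = threading.Lock()     # prevents overlapping scheduled runs
stop_event = threading.Event()  # set by the SIGINT/SIGTERM handlers

signal.signal(signal.SIGINT, lambda *_: stop_event.set())
signal.signal(signal.SIGTERM, lambda *_: stop_event.set())


def run_import(documents: list[str]) -> None:
    if not run_lock.acquire(blocking=False):
        print("previous run still active, skipping this tick")
        return
    try:
        for doc in documents:
            if stop_event.is_set():  # finish gracefully, do not start new work
                break
            print(f"processing {doc}")
    finally:
        run_lock.release()
```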

## 🏗️ Architecture Benefits

The importer has been refactored from a monolithic script into a well-structured, modular application:

### **Separation of Concerns**

- **Configuration**: All environment variables centralized in `config.py`
- **Data Models**: Type-safe Pydantic models in `models.py`
- **Utilities**: Reusable text, JSON, and tool processing functions
- **Connectors**: Clean abstractions for external service integration
- **Processing**: Core business logic separated by responsibility
- **Services**: Supporting functionality like bootstrapping and scheduling

### **Maintainability & Testing**

- **Smaller Modules**: Each file has a single, well-defined responsibility
- **Reduced Coupling**: Dependencies are explicit through imports
- **Enhanced Testability**: Individual components can be tested in isolation
- **Easier Extension**: New features can be added without modifying existing modules

### **Code Organization**

- **Logical Grouping**: Related functionality is packaged together
- **Clear Dependencies**: Import relationships are explicit and hierarchical
- **Consistent Structure**: Follows Python packaging best practices

## 🔍 Testing & Dev Aids

- Tail importer logs:

```bash
docker compose logs -f importer
```

## 🧯 Troubleshooting

- `paperless_token.txt` shows PENDING

```bash
docker compose restart paperless-token-init
```

- No token logs: check the token service logs

```bash
docker compose logs -f paperless-token-init
```

- Ollama model not created: the model is baked into the custom image during build

```bash
docker compose build ollama && docker compose up -d ollama
```

- Neo4j not available: ensure `neo4j-stack` is running and `NEO4J_*` creds are correct

- Importer reprocesses endlessly: set `FORCE_REPROCESS=0` (it’s `1` in the sample compose for the first run)

## 📜 License

MIT