https://github.com/karmaniverous/jeeves-watcher

Filesystem watcher that keeps a Qdrant vector store in sync with document changes. Config-driven rules engine, semantic search API, and CLI.

cli document-indexing embeddings filesystem-watcher gemini langchain qdrant rag semantic-search typescript vector-store

Host: GitHub
URL: https://github.com/karmaniverous/jeeves-watcher
Owner: karmaniverous
License: bsd-3-clause
Created: 2026-02-20T09:35:53.000Z (2 months ago)
Default Branch: main
Last Pushed: 2026-02-26T16:21:37.000Z (about 2 months ago)
Last Synced: 2026-02-26T19:44:24.588Z (about 2 months ago)
Topics: cli, document-indexing, embeddings, filesystem-watcher, gemini, langchain, qdrant, rag, semantic-search, typescript, vector-store
Language: HTML
Homepage: https://docs.karmanivero.us/jeeves-watcher/
Size: 3.34 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 5
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

# Jeeves Watcher 🎩

Filesystem watcher that keeps a Qdrant vector store in sync with document changes.

## Overview

`jeeves-watcher` monitors a configured set of directories for file changes, extracts text content, generates embeddings, and maintains a synchronized Qdrant vector store for semantic search. It automatically:

- **Watches** directories for file additions, modifications, and deletions

- **Extracts** text from various formats (Markdown, PDF, DOCX, HTML, JSON, plain text)

- **Chunks** large documents for optimal embedding

- **Embeds** content using configurable providers (Google Gemini, mock for testing)

- **Syncs** to Qdrant for fast semantic search

- **Enriches** metadata via rules and API endpoints

### Architecture

![System Architecture](packages/service/assets/system-architecture.png)

For detailed architecture documentation, see [packages/service/guides/architecture.md](packages/service/guides/architecture.md).

## Quick Start

### Installation

```bash
npm install -g @karmaniverous/jeeves-watcher
```

### Initialize Configuration

Create a new configuration file in your project:

```bash
jeeves-watcher init
```

This generates a `jeeves-watcher.config.json` file with sensible defaults.

### Configure

Edit `jeeves-watcher.config.json` to specify:

- **Watch paths**: Directories to monitor

- **Embedding provider**: Google Gemini or mock (for testing)

- **Qdrant connection**: URL and collection name

- **Inference rules**: Automatic metadata enrichment based on file patterns

Example minimal configuration:

```json
{
  "watch": {
    "paths": ["./docs"],
    "ignored": ["**/node_modules/**", "**/.git/**"]
  },
  "embedding": {
    "provider": "gemini",
    "model": "gemini-embedding-001",
    "apiKey": "${GOOGLE_API_KEY}"
  },
  "vectorStore": {
    "url": "http://localhost:6333",
    "collectionName": "my_docs"
  }
}
```

### Start Watching

```bash
jeeves-watcher start
```

The watcher will:

1. Index all existing files in watched directories

2. Monitor for changes

3. Update Qdrant automatically

## CLI Commands

| Command | Description |
| --- | --- |
| `jeeves-watcher start` | Start the filesystem watcher (foreground) |
| `jeeves-watcher init` | Initialize a new configuration file |
| `jeeves-watcher status` | Show watcher status |
| `jeeves-watcher reindex` | Reindex all watched files |
| `jeeves-watcher rebuild-metadata` | Rebuild metadata files from Qdrant payloads |
| `jeeves-watcher search ` | Search the vector store |
| `jeeves-watcher enrich ` | Enrich document metadata with key-value pairs |
| `jeeves-watcher validate` | Validate the configuration |
| `jeeves-watcher service` | Manage the watcher as a system service |
| `jeeves-watcher scan` | Scan the vector store with filter-only queries |
| `jeeves-watcher config` | Query effective config via JSONPath |
| `jeeves-watcher issues` | Show indexing issues and errors |
| `jeeves-watcher helpers` | Show loaded map and template helpers |
| `jeeves-watcher config-apply` | Validate, write, and reload configuration from file |

## Configuration

### Environment Variable Substitution

Config strings support `${VAR_NAME}` syntax for environment variable injection:

```json
{
  "embedding": {
    "apiKey": "${GOOGLE_API_KEY}"
  }
}
```

If `GOOGLE_API_KEY` is set in the environment, its value is substituted at config load time.

Note that `set` templates in inference rules use Handlebars `{{...}}` syntax (e.g. `{{frontmatter.title}}`). This is distinct from the `${...}` environment variable syntax used in config values such as `embedding.apiKey`.
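
The substitution described above can be illustrated with a minimal sketch. This is not the project's actual loader: the function names `substituteEnv` and `substituteDeep` are hypothetical, and leaving an unset variable's placeholder untouched is an assumption, since the README does not say what happens in that case.

```typescript
// Sketch only: names and unset-variable behaviour are assumptions,
// not the jeeves-watcher implementation.
type Env = Record<string, string | undefined>;

// Replace every ${VAR_NAME} in a string with the matching value from `env`.
// Assumption: an unset variable leaves the placeholder as-is.
function substituteEnv(value: string, env: Env): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (placeholder: string, name: string) => {
      const found = env[name];
      return found === undefined ? placeholder : found;
    },
  );
}

// Walk a parsed config object and substitute inside every string leaf.
function substituteDeep(node: unknown, env: Env): unknown {
  if (typeof node === "string") return substituteEnv(node, env);
  if (Array.isArray(node)) return node.map((item) => substituteDeep(item, env));
  if (node !== null && typeof node === "object") {
    const out: Record<string, unknown> = {};
    for (const [key, child] of Object.entries(node as Record<string, unknown>)) {
      out[key] = substituteDeep(child, env);
    }
    return out;
  }
  return node; // numbers, booleans and null pass through unchanged
}

const exampleConfig = { embedding: { apiKey: "${GOOGLE_API_KEY}" } };
console.log(JSON.stringify(substituteDeep(exampleConfig, { GOOGLE_API_KEY: "abc123" })));
// prints {"embedding":{"apiKey":"abc123"}}
```

Because the pattern only matches `${...}`, a Handlebars expression such as `{{frontmatter.title}}` passes through this sketch unchanged.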

### Watch Paths

```json
{
  "watch": {
    "paths": ["./docs", "./notes"],
    "ignored": ["**/node_modules/**", "**/*.tmp"]
  }
}
```

- **`paths`**: Array of glob patterns or directories to watch

- **`ignored`**: Array of patterns to exclude

- **`respectGitignore`**: (default: `true`) Skip processing files ignored by `.gitignore` in git repositories. Nested `.gitignore` files are respected within their subtree.

- **`moveDetection`**: (optional) Correlate unlink+add events as file moves to avoid re-embedding. Options: `enabled` (default: `true`) and `bufferMs` (default: `2000`), which sets how long to buffer unlink events before treating them as deletes.
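
To make the `moveDetection` idea concrete, here is a rough sketch of buffering unlink events and pairing them with later add events. The README does not say how events are correlated, so matching on a content hash is an assumption, and the `MoveCorrelator` class and its methods are hypothetical names. Timestamps are passed in explicitly to keep the example deterministic.

```typescript
// Sketch only. Assumption: an unlink and an add are the same file
// when their content hashes match; the real correlation key may differ.
interface PendingUnlink {
  path: string;
  at: number; // timestamp in milliseconds
}

type Outcome =
  | { kind: "move"; from: string; to: string }
  | { kind: "add"; path: string }
  | { kind: "delete"; path: string };

class MoveCorrelator {
  private pending = new Map<string, PendingUnlink>();
  private readonly bufferMs: number;

  constructor(bufferMs = 2000) {
    this.bufferMs = bufferMs;
  }

  // Hold the unlink back instead of reporting a delete immediately.
  unlink(path: string, hash: string, now: number): void {
    this.pending.set(hash, { path, at: now });
  }

  // An add matching a buffered unlink inside the window is a move.
  add(path: string, hash: string, now: number): Outcome {
    const prior = this.pending.get(hash);
    if (prior !== undefined && now - prior.at <= this.bufferMs) {
      this.pending.delete(hash);
      return { kind: "move", from: prior.path, to: path };
    }
    return { kind: "add", path };
  }

  // Buffered unlinks older than the window become real deletes.
  flush(now: number): Outcome[] {
    const expired: Outcome[] = [];
    this.pending.forEach((entry, hash) => {
      if (now - entry.at > this.bufferMs) {
        this.pending.delete(hash);
        expired.push({ kind: "delete", path: entry.path });
      }
    });
    return expired;
  }
}

const correlator = new MoveCorrelator(2000);
correlator.unlink("docs/a.md", "hash-1", 0);
console.log(JSON.stringify(correlator.add("notes/a.md", "hash-1", 500)));
// prints {"kind":"move","from":"docs/a.md","to":"notes/a.md"}
```

A move detected this way only needs the stored path updated, which is what lets the watcher skip re-embedding.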

### Embedding Provider

#### Google Gemini

```json
{
  "embedding": {
    "provider": "gemini",
    "model": "gemini-embedding-001",
    "apiKey": "${GOOGLE_API_KEY}"
  }
}
```

### Vector Store

```json
{
  "vectorStore": {
    "url": "http://localhost:6333",
    "collectionName": "my_collection"
  }
}
```

### Inference Rules

Automatically enrich metadata based on file patterns using declarative JSON Schemas:

```json
{
  "schemas": {
    "base": {
      "type": "object",
      "properties": {
        "domain": {
          "type": "string",
          "description": "Content domain"
        }
      }
    }
  },
  "inferenceRules": [
    {
      "name": "meeting-classifier",
      "description": "Classify files under meetings directory",
      "match": {
        "properties": {
          "file": {
            "type": "object",
            "properties": {
              "path": { "type": "string", "glob": "**/meetings/**" }
            }
          }
        }
      },
      "schema": [
        "base",
        {
          "properties": {
            "domain": { "set": "meetings" },
            "category": { "type": "string", "set": "notes" }
          }
        }
      ]
    }
  ]
}
```

**New in v0.5.0:** Inference rules now use `schema` arrays that reference global named schemas. Type coercion automatically converts string interpolation results to declared types (integer, number, boolean, array, object). See [Inference Rules Guide](packages/service/guides/inference-rules.md) for details.
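
As an illustration of the type coercion mentioned above, the following sketch converts a string (such as the output of a template) to a declared type. It is not the project's implementation: the `coerce` function is a hypothetical name, and parsing arrays and objects as JSON and rejecting unparseable input with an error are assumptions.

```typescript
// Sketch only: parsing rules here are assumptions, not documented behaviour.
type DeclaredType = "string" | "integer" | "number" | "boolean" | "array" | "object";

function coerce(raw: string, type: DeclaredType): unknown {
  switch (type) {
    case "string":
      return raw;
    case "integer": {
      const n = Number(raw);
      if (raw.trim() === "" || !Number.isInteger(n)) throw new Error(`not an integer: ${raw}`);
      return n;
    }
    case "number": {
      const n = Number(raw);
      if (raw.trim() === "" || Number.isNaN(n)) throw new Error(`not a number: ${raw}`);
      return n;
    }
    case "boolean":
      if (raw === "true") return true;
      if (raw === "false") return false;
      throw new Error(`not a boolean: ${raw}`);
    case "array": {
      // Assumption: arrays arrive as JSON text.
      const parsed: unknown = JSON.parse(raw);
      if (!Array.isArray(parsed)) throw new Error(`not an array: ${raw}`);
      return parsed;
    }
    case "object": {
      // Assumption: objects arrive as JSON text.
      const parsed: unknown = JSON.parse(raw);
      if (parsed === null || typeof parsed !== "object" || Array.isArray(parsed)) {
        throw new Error(`not an object: ${raw}`);
      }
      return parsed;
    }
  }
}

console.log(coerce("42", "integer")); // prints 42
console.log(coerce("true", "boolean")); // prints true
```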

### Chunking

Chunking settings are configured under `embedding`:

```json
{
  "embedding": {
    "chunkSize": 1000,
    "chunkOverlap": 200
  }
}
```
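
To show how `chunkSize` and `chunkOverlap` interact, here is a minimal sliding-window sketch. It assumes both settings are measured in characters and that chunks are cut at fixed offsets; the actual splitter may count differently or prefer natural boundaries such as paragraphs. `chunkText` is a hypothetical helper name.

```typescript
// Sketch only. Assumption: sizes are character counts and cuts ignore
// sentence or paragraph boundaries.
function chunkText(text: string, chunkSize = 1000, chunkOverlap = 200): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error("require chunkSize > 0 and 0 <= chunkOverlap < chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}

console.log(JSON.stringify(chunkText("abcdefghij", 4, 1)));
// prints ["abcd","defg","ghij"]
```

With the values from the config above, each window advances by 800 characters, so consecutive chunks share 200 characters of context.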

### Enrichment Store

Enrichment metadata (from `POST /metadata` or `watcher_enrich`) is stored in a SQLite database at `/enrichments.sqlite` and survives full reindexes. Enrichments merge composably with inference rule output: scalar fields overwrite, while array fields are unioned and deduplicated.

```json
{
  "stateDir": ".jeeves-metadata"
}
```
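
The merge behaviour described in this section can be sketched as follows. This is an illustration under assumptions rather than the project's code: `mergeMetadata` is a hypothetical name, the enrichment value is assumed to win for scalar fields, and array items are deduplicated by strict equality.

```typescript
// Sketch only. Assumptions: enrichment overwrites inferred scalars, and
// array items are primitives that can be deduplicated with a Set.
type Metadata = Record<string, unknown>;

function mergeMetadata(inferred: Metadata, enrichment: Metadata): Metadata {
  const merged: Metadata = { ...inferred };
  for (const [key, next] of Object.entries(enrichment)) {
    const prev = merged[key];
    if (Array.isArray(prev) && Array.isArray(next)) {
      // Union the two arrays, keeping the first occurrence of each item.
      merged[key] = Array.from(new Set([...prev, ...next]));
    } else {
      // Scalars (and mismatched types) are overwritten.
      merged[key] = next;
    }
  }
  return merged;
}

const fromRules = { domain: "meetings", tags: ["a", "b"] };
const fromEnrichment = { domain: "research", tags: ["b", "c"], priority: "high" };
console.log(JSON.stringify(mergeMetadata(fromRules, fromEnrichment)));
// prints {"domain":"research","tags":["a","b","c"],"priority":"high"}
```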

## API Endpoints

The watcher provides a REST API (default port: 1936):

| Endpoint | Method | Description |
| --- | --- | --- |
| `/status` | GET | Health check, uptime, and collection stats |
| `/search` | POST | Semantic search (`{ query: string, limit?: number, filter?: object }`) |
| `/render` | POST | Render a file through inference rules (`{ path: string }`) (v0.8.0+) |
| `/search/facets` | GET | Schema-derived search facet definitions with live values (v0.8.0+) |
| `/metadata` | POST | Update document metadata with schema validation (`{ path: string, metadata: object }`) |
| `/reindex` | POST | Scoped reindex with blast area plan (`issues`, `rules`, `full`, `path`, `prune` + `dryRun`). `path` accepts `string \| string[]`. |
| `/rebuild-metadata` | POST | Rebuild metadata files from Qdrant |
| `/config` | GET | Full resolved effective config; optional `?path=` filter. Rules include `source` attribution. |
| `/config/schema` | GET | JSON Schema of merged virtual document (v0.5.0+) |
| `/walk` | POST | Filesystem walk with glob intersection (`{ globs: string[] }`). Returns `{ paths, matchedCount, scannedRoots }`. |
| `/config/match` | POST | Test paths against inference rules (`{ paths: string[] }`) (v0.5.0+) |
| `/issues` | GET | Current embedding failures and processing errors (v0.5.0+) |
| `/rules/register` | POST | Register virtual inference rules from an external source |
| `/rules/unregister` | DELETE | Remove all virtual rules from a source (`{ source }`) |
| `/rules/unregister/:source` | DELETE | Remove all virtual rules from a named source |
| `/scan` | POST | Filter-only point query with cursor pagination (`{ filter, limit?, cursor?, fields?, countOnly? }`) |
| `/config/validate` | POST | Validate a configuration without applying (`{ config?, testPaths? }`) |
| `/config/apply` | POST | Validate, write, and reload configuration (`{ config }`) |
| `/rules/reapply` | POST | Re-apply inference rules to files matching globs (`{ globs }`) |
| `/points/delete` | POST | Delete points matching a Qdrant filter (`{ filter }`) |

### Example: Search

```bash
curl -X POST http://localhost:1936/search \
  -H "Content-Type: application/json" \
  -d '{"query": "machine learning algorithms", "limit": 5}'
```

### Example: Search With Filter

```bash
curl -X POST http://localhost:1936/search \
  -H "Content-Type: application/json" \
  -d '{
    "query": "error handling",
    "limit": 10,
    "filter": {
      "must": [{ "key": "domain", "match": { "value": "backend" } }]
    }
  }'
```
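
For callers building the same request from code, a small typed helper might look like the sketch below. The `query`, `limit`, and `filter` fields and the `must`/`key`/`match` filter shape come from the examples above; the `buildSearchRequest` helper and its `equals` parameter are hypothetical conveniences, not part of the project's API.

```typescript
// Sketch only: a hypothetical helper that assembles a /search request body.
interface MatchCondition {
  key: string;
  match: { value: string };
}

interface SearchRequest {
  query: string;
  limit?: number;
  filter?: { must: MatchCondition[] };
}

// `equals` maps payload keys to the exact values they must match.
function buildSearchRequest(
  query: string,
  limit?: number,
  equals: Record<string, string> = {},
): SearchRequest {
  const must = Object.entries(equals).map(([key, value]) => ({ key, match: { value } }));
  const body: SearchRequest = { query };
  if (limit !== undefined) body.limit = limit;
  if (must.length > 0) body.filter = { must }; // omit the filter when there is nothing to match
  return body;
}

console.log(JSON.stringify(buildSearchRequest("error handling", 10, { domain: "backend" })));
// prints {"query":"error handling","limit":10,"filter":{"must":[{"key":"domain","match":{"value":"backend"}}]}}
```

The resulting object can be serialized with `JSON.stringify` and sent as the body of a `POST` to `/search` with any HTTP client.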

### Example: Update Metadata

```bash
curl -X POST http://localhost:1936/metadata \
  -H "Content-Type: application/json" \
  -d '{
    "path": "/path/to/document.md",
    "metadata": {
      "priority": "high",
      "category": "research"
    }
  }'
```

## OpenClaw Plugin

This repo includes an OpenClaw plugin (`packages/openclaw`) that exposes the jeeves-watcher API as native agent tools:

| Tool                   | Description                                          |
| ---------------------- | ---------------------------------------------------- |
| `watcher_status`       | Service health, uptime, and collection stats         |
| `watcher_search`       | Semantic search across indexed documents             |
| `watcher_enrich`       | Set or update document metadata                      |
| `watcher_config`       | Query the effective runtime config via JSONPath      |
| `watcher_walk`         | Walk watched filesystem paths with glob intersection |
| `watcher_validate`     | Validate a watcher configuration                     |
| `watcher_config_apply` | Apply a new configuration                            |
| `watcher_reindex`      | Trigger a scoped reindex with blast area plan        |
| `watcher_scan`         | Filter-only point query with cursor pagination       |
| `watcher_issues`       | List indexing issues and errors                      |

The plugin integrates with [`@karmaniverous/jeeves`](https://www.npmjs.com/package/@karmaniverous/jeeves) core to manage workspace content (TOOLS.md, SOUL.md, AGENTS.md) via a `ComponentWriter` that refreshes every 71 seconds. See the [OpenClaw Integration Guide](packages/openclaw/guides/openclaw-integration.md) for details.

Plugin configuration supports `apiUrl` (defaults to `http://127.0.0.1:1936`) and `configRoot` (defaults to `j:/config`).

## Supported File Formats

- **Markdown** (`.md`, `.markdown`) — with YAML frontmatter support

- **PDF** (`.pdf`) — text extraction

- **DOCX** (`.docx`) — Microsoft Word documents

- **HTML** (`.html`, `.htm`) — content extraction (scripts/styles removed)

- **JSON** (`.json`) — with smart text field detection

- **Plain Text** (`.txt`, `.text`)

## License

BSD-3-Clause

---

Built for you with ❤️ on Bali by [Jason Williscroft](https://github.com/karmaniverous) & [Jeeves](https://github.com/jgs-jeeves).
