# IntentWeave

**Semantic knowledge extraction platform** — build queryable knowledge graphs from documents and code,
with a zero-cost code-aware retrieval index for everyday use.

IntentWeave provides two complementary systems:

1. **CARI (Code-Aware Retrieval Index)** — Builds a lightweight SQLite index from your code's AST,
document keywords, and git history. No LLM calls, no external services, no cost. Produces ranked
file retrieval, cross-layer connection discovery, CI drift detection, and **interactive architecture
visualization** with automatically inferred layers, communities, and dependency analysis.

2. **Knowledge Graph (KG)** — Uses LLMs to extract entities, decisions, and relationships from
natural-language documents. Persists to Neo4j for rich semantic queries, impact analysis, and
documentation health checks.

Both are available through CLI, MCP tools (GitHub Copilot), REST API, and a React UI.

[![License](https://img.shields.io/badge/license-Apache--2.0-blue.svg)](LICENSE)

---

## Quick Start

### CARI — Zero-Cost Index (no LLM, no Neo4j)

```bash
npm install -g @intentweave/cli
cd /path/to/your/project

iw init
iw index build # < 3 seconds for most projects
iw index retrieve "authentication" # ranked file retrieval
iw index connections "AuthService" # cross-layer connection discovery
iw index check --changed src/auth.ts # CI drift detection
iw index report # coverage, staleness, hidden couplings
```

### Architecture Analysis & Visualization

```bash
# Auto-infer architectural layers from your import graph
iw index layers-infer

# Validate imports against inferred layer boundaries
iw index layers-check

# Generate a standalone interactive HTML architecture report
iw index export --html

# With LLM-generated layer and directory names (optional)
iw index export --html --provider openai --model gpt-4o-mini
```

The HTML report renders a **layered, spatial architecture view**:

- Files positioned in their inferred architectural tier (foundation at bottom, entry points at top)
- Node size proportional to transitive dependents — bigger = higher impact
- Colour-coded community clusters via label-propagation detection, with **three switchable modes**:
structural (imports + co-changes), semantic (full co-occurrence), temporal (git co-changes only)
- Import edges with layer violations drawn as red reverse-arrows
- Three views: **Layers** (tiered layout), **Communities** (force-directed), **Dependencies** (root-focused)
- Vertical slice detection — click a community to highlight its cross-layer feature slice
- Hierarchical sub-layering within architectural tiers
- Optional LLM pass names layers ("HTTP Layer", "Data Access") and directories ("CLI Subcommands",
"Pipeline Stages") with architectural descriptions
- Zero server dependency — shareable as a single self-contained HTML file
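
The "node size proportional to transitive dependents" idea can be illustrated with a small sketch. This is not IntentWeave's implementation; the function names and file names below are hypothetical, and the import graph is a hand-written stand-in for what the index would hold. It counts, for each file, how many other files import it directly or indirectly.

```typescript
// Import graph: file -> files it imports. File names are made up for illustration.
type ImportGraph = Record<string, string[]>;

// Invert the graph so we can walk from a file to the files that import it.
function invertImports(graph: ImportGraph): Map<string, string[]> {
  const importers = new Map<string, string[]>();
  for (const [file, imports] of Object.entries(graph)) {
    if (!importers.has(file)) importers.set(file, []);
    for (const dep of imports) {
      const list = importers.get(dep) ?? [];
      list.push(file);
      importers.set(dep, list);
    }
  }
  return importers;
}

// Count every file that depends on `target`, directly or indirectly (breadth-first).
function transitiveDependents(graph: ImportGraph, target: string): number {
  const importers = invertImports(graph);
  const seen = new Set<string>();
  const queue: string[] = [target];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const importer of importers.get(current) ?? []) {
      if (importer !== target && !seen.has(importer)) {
        seen.add(importer);
        queue.push(importer);
      }
    }
  }
  return seen.size;
}

const dependentsExample: ImportGraph = {
  "cli.ts": ["service.ts"],
  "server.ts": ["service.ts"],
  "service.ts": ["db.ts", "util.ts"],
  "db.ts": ["util.ts"],
  "util.ts": [],
};

for (const file of Object.keys(dependentsExample)) {
  console.log(file, transitiveDependents(dependentsExample, file));
}
```

In this toy graph `util.ts` has the most transitive dependents, so it would be drawn as the largest node.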

#### Layers View — Auto-Inferred Architectural Tiers

![Layers View](docs/screenshots/layers-view.png)

Files arranged into automatically inferred layers with LLM-generated names and descriptions.
Node size reflects transitive dependents; colours indicate community clusters.
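
One plausible way to place files into tiers is by longest import path: a file that imports nothing sits in tier 0, and every other file sits one tier above its highest dependency. The sketch below assumes that rule and an acyclic graph. IntentWeave's actual inference may work differently, and `inferTiers` and the file names are hypothetical.

```typescript
type TierImports = Record<string, string[]>;

// Tier 0 = foundation (imports nothing); higher tiers sit closer to entry points.
// Assumes an acyclic import graph; if a cycle is hit, the back-edge target is
// treated as tier 0 so the recursion still terminates.
function inferTiers(imports: TierImports): Map<string, number> {
  const tiers = new Map<string, number>();
  const inProgress = new Set<string>();

  const visit = (file: string): number => {
    const known = tiers.get(file);
    if (known !== undefined) return known;
    if (inProgress.has(file)) return 0; // cycle guard
    inProgress.add(file);
    let tier = 0;
    for (const dep of imports[file] ?? []) {
      tier = Math.max(tier, visit(dep) + 1);
    }
    inProgress.delete(file);
    tiers.set(file, tier);
    return tier;
  };

  for (const file of Object.keys(imports)) visit(file);
  return tiers;
}

const tierExample: TierImports = {
  "cli.ts": ["service.ts"],
  "server.ts": ["service.ts"],
  "service.ts": ["db.ts", "util.ts"],
  "db.ts": ["util.ts"],
  "util.ts": [],
};

const inferredTiers = inferTiers(tierExample);
console.log(inferredTiers);
```

Here `util.ts` lands in tier 0, `db.ts` in tier 1, `service.ts` in tier 2, and both entry points in tier 3.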

#### Communities View — Force-Directed Graph

![Communities View](docs/screenshots/communities-view.png)

Force-directed layout revealing community clusters, doc-code links, and import relationships.
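
For readers unfamiliar with label propagation, here is a minimal sketch of the general technique. It is a deterministic, synchronous variant with ties broken by the smallest label, chosen so the example is reproducible. It is not the project's code: `propagateLabels` and the node names are hypothetical, and the real tool may weight edges or break ties differently.

```typescript
type GraphEdge = readonly [string, string];

// Each node repeatedly adopts the most common label among its neighbours.
// All nodes update at once from the previous round's labels.
function propagateLabels(edges: GraphEdge[], maxRounds = 20): Map<string, string> {
  const neighbours = new Map<string, Set<string>>();
  const link = (from: string, to: string): void => {
    const set = neighbours.get(from) ?? new Set<string>();
    set.add(to);
    neighbours.set(from, set);
  };
  for (const [a, b] of edges) {
    link(a, b);
    link(b, a);
  }

  // Every node starts in its own community.
  let labels = new Map<string, string>();
  for (const node of neighbours.keys()) labels.set(node, node);

  for (let round = 0; round < maxRounds; round++) {
    const next = new Map<string, string>();
    let changed = false;
    for (const [node, adjacent] of neighbours) {
      const counts = new Map<string, number>();
      for (const other of adjacent) {
        const label = labels.get(other) as string;
        counts.set(label, (counts.get(label) ?? 0) + 1);
      }
      // Most frequent neighbour label wins; ties go to the smallest label.
      let best = labels.get(node) as string;
      let bestCount = 0;
      for (const [label, count] of counts) {
        if (count > bestCount || (count === bestCount && label < best)) {
          best = label;
          bestCount = count;
        }
      }
      if (best !== labels.get(node)) changed = true;
      next.set(node, best);
    }
    labels = next;
    if (!changed) break; // converged
  }
  return labels;
}

// Two tightly connected groups of four nodes, joined by a single edge d-e.
const communityEdges: GraphEdge[] = [
  ["a", "b"], ["a", "c"], ["a", "d"], ["b", "c"], ["b", "d"], ["c", "d"],
  ["e", "f"], ["e", "g"], ["e", "h"], ["f", "g"], ["f", "h"], ["g", "h"],
  ["d", "e"],
];

const communityLabels = propagateLabels(communityEdges);
console.log(communityLabels);
```

On this input the two groups settle into separate communities. Synchronous updates can oscillate on some graphs, which is why the sketch caps the number of rounds.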

#### Dependencies View — Root-Focused Dependency Tree

![Dependencies View](docs/screenshots/dependencies-view.png)

Explore the full dependency tree from any root file, colour-coded by risk level.

`iw index build` supports two depth modes:

- `--depth structured` (default) — headings, bold text, code spans only. Fast and precise.
- `--depth full` — adds body text scanning with IDF noise filtering. Yields 72% more annotations.
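
To make "IDF noise filtering" concrete, the sketch below scores each term by inverse document frequency, `ln(N / df)`, and drops terms that appear in most documents. The threshold, the tokenised documents, and the function names are illustrative assumptions, not values or APIs taken from IntentWeave.

```typescript
// idf(term) = ln(N / df), where df is the number of documents containing the term.
// A term that appears in every document scores 0.
function inverseDocumentFrequency(docs: string[][]): Map<string, number> {
  const documentFrequency = new Map<string, number>();
  for (const tokens of docs) {
    for (const term of new Set(tokens)) {
      documentFrequency.set(term, (documentFrequency.get(term) ?? 0) + 1);
    }
  }
  const idf = new Map<string, number>();
  for (const [term, df] of documentFrequency) {
    idf.set(term, Math.log(docs.length / df));
  }
  return idf;
}

// Keep only terms rare enough to be informative.
function keepInformativeTerms(
  tokens: string[],
  idf: Map<string, number>,
  minIdf: number,
): string[] {
  return tokens.filter((term) => (idf.get(term) ?? 0) >= minIdf);
}

const idfDocs: string[][] = [
  ["the", "auth", "service"],
  ["the", "index", "service"],
  ["the", "graph", "query"],
];

const idfScores = inverseDocumentFrequency(idfDocs);
const informative = keepInformativeTerms(idfDocs[0] ?? [], idfScores, 0.5);
console.log(informative);
```

With a threshold of 0.5, "the" (in all three documents) and "service" (in two) are filtered out of the first document, leaving "auth".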

### Knowledge Graph — LLM Semantic Extraction

### Install from npm

```bash
npm install -g @intentweave/cli
iw --help
```

Or use `npx` without installing:

```bash
npx @intentweave/cli run docs/*.md --track open -i -v
```

### First project setup

```bash
cd /path/to/your/project

# Initialize workspace
iw init

# Start Neo4j (requires Docker)
docker run -d --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/intentweave \
  neo4j:5

# Run the extraction pipeline on your docs
export NEO4J_PASSWORD=intentweave
export OPENAI_API_KEY=sk-...
iw run docs/*.md --track open --provider openai -i --persist -v

# Query the knowledge graph
iw query "What are the main components?"
```

> **Full CLI documentation:** [docs/CLI-USAGE.md](docs/CLI-USAGE.md)

### From source (development)

```bash
git clone https://github.com/intentweave/intentweave.git
cd intentweave
pnpm install && pnpm build

# Use the dev wrapper (no build needed for changes)
./iw.sh run docs/*.md --track open -i -v
```

### Start the Server

```bash
cd apps/server
cp .env.example .env # edit NEO4J_PASSWORD and OPENAI_API_KEY

pnpm dev
# → 🧠 IntentWeave server listening on http://0.0.0.0:3000
# → 📖 API docs: http://localhost:3000/docs
# → ❤️ Health: http://localhost:3000/health
```

---

## REST API

All endpoints live under `/api/`. The server runs on port 3000 by default.

### Query the Knowledge Graph

**Natural language** (requires `OPENAI_API_KEY`):

```bash
curl -X POST http://localhost:3000/api/query \
  -H 'Content-Type: application/json' \
  -H 'x-session-id: my-project' \
  -d '{"question": "What decisions were made about the database?"}'
```

```json
{
  "results": [
    {
      "decision": "Neo4j",
      "type": "decision",
      "predicate": "DECIDED_FOR",
      "target": "graph database"
    }
  ],
  "cypher": "MATCH (a:Canon)-[r:CANON_REL {predicate: \"DECIDED_FOR\"}]->(b:Canon) WHERE ...",
  "summary": "- **Neo4j** was decided for as the graph database\n- ...",
  "count": 3
}
```
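
From TypeScript, the same call could be wrapped roughly as follows. This is a sketch based only on the request and response shown above: the helper names are hypothetical, the response fields are taken from the sample rather than from a published schema, and treating `x-session-id` as optional is an assumption.

```typescript
type QueryInput = { question: string } | { cypher: string };

interface HttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Field names mirror the sample response above; the real schema may include more.
interface QueryResult {
  results: unknown[];
  cypher: string;
  summary: string;
  count: number;
}

// Build the request without sending it, so it can be inspected or tested offline.
function buildQueryRequest(baseUrl: string, input: QueryInput, sessionId?: string): HttpRequest {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (sessionId !== undefined) headers["x-session-id"] = sessionId;
  return { url: `${baseUrl}/api/query`, method: "POST", headers, body: JSON.stringify(input) };
}

// Validate the parts of the response this sketch relies on.
function parseQueryResult(json: string): QueryResult {
  const data: unknown = JSON.parse(json);
  if (typeof data !== "object" || data === null) throw new Error("expected a JSON object");
  const record = data as Record<string, unknown>;
  const results = record["results"];
  const count = record["count"];
  if (!Array.isArray(results) || typeof count !== "number") {
    throw new Error("unexpected response shape");
  }
  return {
    results,
    cypher: String(record["cypher"] ?? ""),
    summary: String(record["summary"] ?? ""),
    count,
  };
}

const queryRequest = buildQueryRequest(
  "http://localhost:3000",
  { question: "What decisions were made about the database?" },
  "my-project",
);
console.log(queryRequest.url, queryRequest.headers);
```

The returned object maps directly onto `fetch(url, { method, headers, body })`, and the response text can then be passed to `parseQueryResult`.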

**Raw Cypher** (no LLM needed):

```bash
curl -X POST http://localhost:3000/api/query \
  -H 'Content-Type: application/json' \
  -d '{"cypher": "MATCH (n:Canon:Entity) RETURN n.name, n.type LIMIT 10"}'
```

### Build RAG Context

**Topic-based** (requires `OPENAI_API_KEY`):

```bash
curl -X POST http://localhost:3000/api/context \
  -H 'Content-Type: application/json' \
  -H 'x-session-id: my-project' \
  -d '{"topic": "authentication architecture"}'
```

**Entity-seeded** (no LLM needed):

```bash
curl -X POST http://localhost:3000/api/context \
  -H 'Content-Type: application/json' \
  -H 'x-session-id: my-project' \
  -d '{"entity": "React", "hops": 3}'
```

**Dump all** entities:

```bash
curl -X POST http://localhost:3000/api/context \
  -H 'x-session-id: my-project' \
  -H 'Content-Type: application/json' \
  -d '{"all": true}'
```

### List Entities

```bash
# All entities in a session
curl 'http://localhost:3000/api/entities?session=my-project'

# Filter by type
curl 'http://localhost:3000/api/entities?session=my-project&type=decision&limit=20'

# Search by name
curl 'http://localhost:3000/api/entities?session=my-project&search=auth'
```

### Run Extraction Pipeline

```bash
curl -X POST http://localhost:3000/api/run \
  -H 'Content-Type: application/json' \
  -d '{
    "files": ["docs/*.md"],
    "track": "open",
    "provider": "openai",
    "incremental": true,
    "persist": true,
    "verbose": true
  }'
```

Returns HTTP 202 with a run summary that includes the `runId`, artifact count, entity and relationship totals, and duration.

### Persist to Neo4j

```bash
# Persist latest run
curl -X POST http://localhost:3000/api/persist \
  -H 'Content-Type: application/json' \
  -d '{"latest": true}'

# Persist specific run
curl -X POST http://localhost:3000/api/persist \
  -H 'Content-Type: application/json' \
  -d '{"runId": "run-2026-03-08-abc12345"}'
```

### Impact Analysis

```bash
curl -X POST http://localhost:3000/api/impact \
  -H 'Content-Type: application/json' \
  -H 'x-session-id: my-project' \
  -d '{"files": ["src/auth.ts"], "hops": 2}'
```

### Documentation Health

```bash
curl -X POST http://localhost:3000/api/doc-health \
  -H 'Content-Type: application/json' \
  -H 'x-session-id: my-project' \
  -d '{"files": ["docs/ARCHITECTURE.md"]}'
```

### Graph Schema

```bash
curl http://localhost:3000/api/schema
```

Returns canonical predicates, entity types, and relationship documentation.

---

## CLI

```bash
# Run extraction pipeline
iw run docs/*.md --track open --provider openai -i -v

# Query the knowledge graph (natural language)
iw query "What are the main components?"

# Query with raw Cypher
iw query --cypher "MATCH (n:Canon:Entity) RETURN n.name, n.type LIMIT 20"

# Build RAG context
iw context "authentication architecture" -s my-project

# Entity-seeded context
iw context -e "React" --hops 3 -s my-project

# Impact analysis
iw impact src/auth.ts -s my-project

# Documentation health check (CARI default — no Neo4j needed)
iw doc-health
iw doc-health --neo4j -s my-project # full KG mode

# Cross-layer code linking
iw xlink . --session my-project --persist

# Persist to Neo4j
iw persist --latest -v

# --- CARI (no LLM, no Neo4j) ---

# Build the lightweight index
iw index build
iw index build --depth full # include body text with IDF filtering

# Query the index
iw index retrieve "authentication" # ranked file retrieval
iw index connections "AuthService" # cross-layer connections + gaps
iw index check --changed src/auth.ts # CI drift detection
iw index report # corpus-wide health dashboard

# Incremental update (only changed files)
iw index update
```

Additional CARI queries are available as CLI subcommands, MCP tools, and via the programmatic API:

| CLI Command | MCP Tool | What It Does |
| ------------------------------------------ | -------------------------- | -------------------------------------------------------------- |
| `iw index clones` | `cari_clones` | Exact code clone detection (identical body hash) |
| `iw index structural-clones` | `cari_structural_clones` | Type 2 clones (same control flow, different identifiers) |
| `iw index circular-imports` | `cari_circular_imports` | Detect import cycles (A → B → C → A) |
| `iw index unused-exports` | `cari_unused_exports` | Exported symbols never imported anywhere |
| `iw index hotspot-priority` | `cari_hotspot_priority` | High-churn + low-doc files ranked by documentation urgency |
| `iw index todos` | `cari_todos` | TODO/FIXME/HACK/XXX inventory with file, line, and kind |
| `iw index module-coverage` | `cari_module_coverage` | Documentation coverage % per directory |
| `iw index orphaned-sections` | `cari_orphaned_sections` | Doc sections where all mentions are unresolved |
| `iw index doc-completeness` | `cari_doc_completeness` | Per-doc score: covered vs. total exports from referenced files |
| `iw index cross-group-drift` | `cari_cross_group_drift` | Entity coverage conflicts between doc groups |
| `iw index mentions-of ` | `cari_mentions_of` | Find doc mentions of a code or external entity |
| `iw index annotations-for ` | `cari_annotations_for` | List all annotations for a documentation file |
| `iw index test-coverage` | `cari_test_coverage` | Map test files to source files, find untested exports |
| `iw index hubs` | `cari_hubs` | God-node / hub analysis (degree centrality) |
| `iw index communities` | `cari_communities` | Community detection (structural / semantic / temporal modes) |
| `iw index surprises` | `cari_surprises` | Surprising connection ranking (composite score) |
| `iw index rationale` | `cari_rationale` | WHY/NOTE/IMPORTANT/DESIGN rationale inventory |
| `iw index terminology` | `cari_terminology` | Terminology inconsistency detection |
| `iw index dep-depth` | `cari_dep_depth` | Transitive import depth + fan-in/fan-out risk |
| `iw index boundary-violations` | `cari_boundary_violations` | Cross-package internal import detection |
| `iw index layers-infer` | `cari_layers_infer` | Auto-infer architectural layers from import graph |
| `iw index layers-check` | `cari_layers_check` | Validate imports against layer configuration |
| `iw index export --html` | — | Generate standalone interactive architecture report |
| `iw index export --html --provider openai` | `cari_layers_name` | LLM-generated layer & directory names for the report |

> See [docs/CLI-USAGE.md](docs/CLI-USAGE.md) for the full command reference, workflows, and troubleshooting.
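
As an illustration of what a query such as `circular-imports` has to compute, the sketch below finds import cycles with a depth-first search that records a cycle whenever it meets a file still on the current path. It is a generic technique, not the project's implementation: `findImportCycles` and the file names are hypothetical, and this approach reports one cycle per back-edge rather than every elementary cycle.

```typescript
type ImportMap = Record<string, string[]>;

// Depth-first search; an edge to a file that is still "visiting" closes a cycle.
function findImportCycles(imports: ImportMap): string[][] {
  const cycles: string[][] = [];
  const state = new Map<string, "visiting" | "done">();
  const path: string[] = [];

  const visit = (file: string): void => {
    state.set(file, "visiting");
    path.push(file);
    for (const dep of imports[file] ?? []) {
      const depState = state.get(dep);
      if (depState === "visiting") {
        // The cycle is the current path from `dep` onwards, closed back on `dep`.
        cycles.push([...path.slice(path.indexOf(dep)), dep]);
      } else if (depState === undefined) {
        visit(dep);
      }
    }
    path.pop();
    state.set(file, "done");
  };

  for (const file of Object.keys(imports)) {
    if (!state.has(file)) visit(file);
  }
  return cycles;
}

const cycleExample: ImportMap = {
  "a.ts": ["b.ts"],
  "b.ts": ["c.ts"],
  "c.ts": ["a.ts"],
  "d.ts": ["a.ts"],
};

const importCycles = findImportCycles(cycleExample);
for (const cycle of importCycles) console.log(cycle.join(" -> "));
```

The toy graph contains the single cycle `a.ts -> b.ts -> c.ts -> a.ts`; `d.ts` imports into the cycle but is not part of it.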

### MCP (GitHub Copilot Integration)

IntentWeave exposes MCP tools for use with GitHub Copilot in VS Code:

| Tool | Purpose | Key Parameters |
| --------------- | -------------------------------- | ------------------------------- |
| `kg_query` | Natural language or Cypher query | `question`, `cypher?`, `limit?` |
| `kg_context` | Build RAG context from graph | `topic?`, `entity?`, `hops?` |
| `kg_entities` | List/search entities | `type?`, `search?`, `limit?` |
| `kg_impact` | Semantic impact analysis | `files`, `hops?` |
| `kg_doc_health` | Documentation freshness | `files?` |
| `kg_schema` | Graph schema description | _(none)_ |

**CARI tools** (no Neo4j or LLM needed):

| Tool | Purpose | Key Parameters |
| -------------------------- | ------------------------------------------- | -------------------------------- |
| `cari_retrieve` | Ranked file retrieval by topic or symbol | `query`, `scope?`, `limit?` |
| `cari_connections` | Cross-layer connection discovery + gaps | `entity`, `include?`, `limit?` |
| `cari_check` | CI drift detection for changed files | `changed`, `severity?` |
| `cari_clones` | Exact code clone detection | _(none)_ |
| `cari_structural_clones` | Type 2 clone detection | _(none)_ |
| `cari_circular_imports` | Import cycle detection | _(none)_ |
| `cari_unused_exports` | Unused exported symbols | `limit?` |
| `cari_hotspot_priority` | High-churn low-doc file ranking | `limit?` |
| `cari_todos` | TODO/FIXME/HACK/XXX inventory | `kind?`, `limit?` |
| `cari_module_coverage` | Documentation coverage % per directory | _(none)_ |
| `cari_orphaned_sections` | Doc sections with all-ungrounded mentions | _(none)_ |
| `cari_doc_completeness` | Per-doc completeness vs. referenced exports | _(none)_ |
| `cari_cross_group_drift` | Cross-group entity coverage conflicts | _(none)_ |
| `cari_mentions_of` | Entity → doc mentions | `entityId`, `minConfidence?` |
| `cari_annotations_for` | File → all annotations | `filePath`, `minConfidence?` |
| `cari_test_coverage` | Test→source mapping + gaps | `limit?` |
| `cari_hubs` | God-node / hub analysis | `limit?` |
| `cari_communities` | Community detection (3 modes) | `mode?`, `resolution?`, `limit?` |
| `cari_surprises` | Surprising connection ranking | `limit?` |
| `cari_rationale` | WHY/NOTE/IMPORTANT/DESIGN inventory | `kind?`, `limit?` |
| `cari_terminology` | Terminology inconsistency detection | `limit?` |
| `cari_dep_depth` | Transitive import depth analysis | `limit?` |
| `cari_boundary_violations` | Package boundary violation detection | _(none)_ |
| `cari_layers_infer` | Auto-infer architectural layers | _(none)_ |
| `cari_layers_check` | Validate imports against layer config | `allowSkipLayer?` |
| `cari_layers_name` | LLM-generated layer & directory names | `provider`, `model?`, `api_key?` |

Start the MCP server:

```bash
iw mcp --session my-project -v
```

VS Code auto-discovers the server via `.vscode/mcp.json`:

```json
{
  "servers": {
    "intentweave-kg": {
      "command": "npx",
      "args": ["@intentweave/cli", "mcp", "--session", "my-project", "-v"]
    }
  }
}
```

---

## Architecture

```
apps/
  server/        → Runnable server (composes core + open)

packages/
  core/          → @intentweave/core — types, predicates, interfaces
  analyzer/      → @intentweave/analyzer — pipeline engine (IN→FX→KX→GX)
  index/         → @intentweave/index — CARI SQLite index (annotator, IDF, queries)
  cli/           → @intentweave/cli — `iw` commands + MCP server
  server-core/   → @intentweave/server-core — Fastify + Neo4j + middleware
  server-open/   → @intentweave/server-open — open track API routes
  profiles/      → @intentweave/profiles — extraction profile packs
  ast-extractor/ → @intentweave/ast-extractor — tree-sitter TS/JS extraction
  swift-parser/  → @intentweave/swift-parser — tree-sitter Swift extraction
  python-parser/ → @intentweave/python-parser — tree-sitter Python extraction

### Server Plugin Architecture

The server is built on a layered plugin model:

```
┌─────────────────────────────────────────────┐
│ @intentweave/server-core                    │
│ Fastify 5 + Neo4j pool + context MW         │
│ Health + SSE + OpenAPI (Swagger)            │
└──────────┬──────────────────────────────────┘
           │
┌──────────▼──────────────────────────────────┐
│ @intentweave/server-open                    │
│ POST /api/query      — KG query (NL+Cypher) │
│ POST /api/context    — RAG context          │
│ POST /api/run        — pipeline execution   │
│ POST /api/persist    — Neo4j persistence    │
│ POST /api/impact     — impact analysis      │
│ POST /api/doc-health — doc freshness        │
│ GET  /api/entities   — entity listing       │
│ GET  /api/schema     — graph schema         │
│ POST /api/xlink      — code linking         │
└─────────────────────────────────────────────┘
```

---

## Pipeline

### Open Track (IN → FX → KX → GX)

Schema-free knowledge extraction:

1. **IN** — Chunk documents (semantic markdown splitting, ~16k chars/chunk)
2. **FX** — Free extraction (LLM extracts raw triples per chunk, parallel)
3. **KX** — Canonicalization (normalize entities + predicates, batch of 40)
4. **GX** — Global merge (cross-document entity deduplication)
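
The chunking in step 1 and the batching in step 3 can be pictured with the sketch below. It is a simplification under stated assumptions: sections are split at Markdown heading lines and packed greedily up to a character budget, and items are grouped into fixed-size batches. The function names are hypothetical, a `#` line inside a code fence would be mistaken for a heading, and a single section larger than the budget is kept whole.

```typescript
// Split before each heading line, then pack sections greedily into chunks
// that stay within `maxChars` where possible.
function chunkMarkdown(markdown: string, maxChars = 16000): string[] {
  const sections = markdown.split(/\n(?=#{1,6} )/);
  const chunks: string[] = [];
  let current = "";
  for (const section of sections) {
    const candidate = current === "" ? section : `${current}\n${section}`;
    if (candidate.length > maxChars && current !== "") {
      chunks.push(current);
      current = section;
    } else {
      current = candidate;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Group items into batches, e.g. 40 entities per canonicalization call.
function inBatches<T>(items: T[], size = 40): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// A tiny budget makes the splitting visible on a short document.
const sampleChunks = chunkMarkdown("# A\naaaa\n# B\nbbbb\n# C\ncccc", 12);
const sampleBatches = inBatches(Array.from({ length: 85 }, (_, i) => `entity-${i}`));
console.log(sampleChunks.length, sampleBatches.map((batch) => batch.length));
```

With a 12-character budget each heading ends up in its own chunk, and 85 items split into batches of 40, 40, and 5.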

### Features

- **Incremental caching** — SHA-256 content-addressed, skip unchanged files
- **Fast keyword scanning** — parallel file I/O (64 concurrent reads), combined regex pre-filter, single-pass `indexOf` matching, early termination. Scans 3500+ files in seconds, not minutes
- **Batch failure detection** — aborts the run after 3 consecutive failures
- **Network resilience** — two-phase retry, batch cooldown
- **Token/cost estimation** — before committing to LLM calls
- **Delta persistence** — only write changes to Neo4j
- **Profile packs** — domain-specific extraction rules
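
The incremental caching idea, keying results by a SHA-256 hash of file content so unchanged files are skipped, can be sketched as follows. This is an in-memory illustration only; the class and function names are hypothetical, and how IntentWeave stores or invalidates its cache is not shown here.

```typescript
import { createHash } from "node:crypto";

// The cache key is derived from the content itself, so renaming a file or
// touching its timestamp does not invalidate the entry.
function contentKey(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

class ExtractionCache<T> {
  private readonly entries = new Map<string, T>();
  hits = 0;
  misses = 0;

  // Return the cached result for identical content, otherwise compute and store it.
  getOrCompute(content: string, compute: (content: string) => T): T {
    const key = contentKey(content);
    if (this.entries.has(key)) {
      this.hits++;
      return this.entries.get(key) as T;
    }
    this.misses++;
    const value = compute(content);
    this.entries.set(key, value);
    return value;
  }
}

// Stand-in for an expensive LLM extraction call.
let extractionCalls = 0;
const fakeExtract = (text: string): number => {
  extractionCalls++;
  return text.length;
};

const extractionCache = new ExtractionCache<number>();
extractionCache.getOrCompute("# Design\nWe chose Neo4j.", fakeExtract);
extractionCache.getOrCompute("# Design\nWe chose Neo4j.", fakeExtract); // unchanged: skipped
extractionCache.getOrCompute("# Design\nWe chose SQLite.", fakeExtract); // changed: recomputed
console.log({ extractionCalls, hits: extractionCache.hits, misses: extractionCache.misses });
```

Only two of the three calls reach the extraction function, because the second document is byte-identical to the first.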

---

## Configuration

### Environment Variables

| Variable | Default | Description |
| ------------------- | ----------------------- | ------------------------------------- |
| `NEO4J_URI` | `bolt://localhost:7687` | Neo4j bolt URI |
| `NEO4J_USERNAME` | `neo4j` | Neo4j username |
| `NEO4J_PASSWORD` | _(required)_ | Neo4j password |
| `NEO4J_DATABASE` | `neo4j` | Neo4j database name |
| `IW_SESSION` | `default` | Default session ID |
| `IW_WORKSPACE_ROOT` | _(optional)_ | Workspace root (enables run/persist) |
| `OPENAI_API_KEY` | _(optional)_ | OpenAI key (enables NL query + topic) |
| `IW_LLM_MODEL` | `gpt-4o-mini` | LLM model for NL queries |
| `PORT` | `3000` | Server port |
| `HOST` | `0.0.0.0` | Server host |
| `LOG_LEVEL` | `info` | Log level |
| `CORS_ORIGIN` | `*` | CORS origin(s), comma-separated |
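
For code that embeds or tests against the server, the table above could be mirrored by a small loader like the one below. It is a sketch, not the server's actual configuration code: the `ServerConfig` shape and `loadConfig` name are hypothetical, and only the variable names and defaults come from the table.

```typescript
interface ServerConfig {
  neo4jUri: string;
  neo4jUsername: string;
  neo4jPassword: string;
  neo4jDatabase: string;
  session: string;
  workspaceRoot?: string;
  openAiApiKey?: string;
  llmModel: string;
  port: number;
  host: string;
  logLevel: string;
  corsOrigins: string[];
}

// Takes the environment as a parameter so it can be exercised without touching process.env.
function loadConfig(env: Record<string, string | undefined>): ServerConfig {
  const password = env["NEO4J_PASSWORD"];
  if (!password) throw new Error("NEO4J_PASSWORD is required");

  const port = Number(env["PORT"] ?? "3000");
  if (!Number.isInteger(port) || port <= 0) throw new Error("PORT must be a positive integer");

  return {
    neo4jUri: env["NEO4J_URI"] ?? "bolt://localhost:7687",
    neo4jUsername: env["NEO4J_USERNAME"] ?? "neo4j",
    neo4jPassword: password,
    neo4jDatabase: env["NEO4J_DATABASE"] ?? "neo4j",
    session: env["IW_SESSION"] ?? "default",
    workspaceRoot: env["IW_WORKSPACE_ROOT"],
    openAiApiKey: env["OPENAI_API_KEY"],
    llmModel: env["IW_LLM_MODEL"] ?? "gpt-4o-mini",
    port,
    host: env["HOST"] ?? "0.0.0.0",
    logLevel: env["LOG_LEVEL"] ?? "info",
    // CORS_ORIGIN accepts a comma-separated list.
    corsOrigins: (env["CORS_ORIGIN"] ?? "*").split(",").map((origin) => origin.trim()),
  };
}

const exampleConfig = loadConfig({
  NEO4J_PASSWORD: "intentweave",
  CORS_ORIGIN: "http://localhost:5173, https://example.com",
});
console.log(exampleConfig.port, exampleConfig.corsOrigins);
```

In a Node process this would typically be called as `loadConfig(process.env)`.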

---

## Development

```bash
pnpm install # Install all packages
pnpm build # Build all (uses Turbo)
pnpm test # Run all tests (1200+ tests)
pnpm dev # Dev mode with hot reload
pnpm typecheck # Type check all packages
pnpm format # Format with Prettier
pnpm format:check # Verify formatting
```

### Publishing

All `@intentweave/*` packages are publishable to npm:

```bash
# Build everything first
pnpm build

# Publish all packages (pnpm resolves workspace:* → real versions)
pnpm -r publish --access public

# Or publish individual packages
pnpm --filter @intentweave/cli publish --access public
```

### Project Stats

- **11 packages** + 1 app
- **1200+ tests**, all passing
- **TypeScript 5.6**, ESM, strict mode
- **Fastify 5**, Neo4j 5, SQLite (better-sqlite3), Turbo, pnpm workspaces
- **27 CARI query modes** + interactive HTML architecture report with multi-view community modes
- **33 MCP tools** for GitHub Copilot integration

---

## License

Apache-2.0 — see [LICENSE](LICENSE)

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md). All contributions require signing the [CLA](CLA.md).