https://github.com/jaytoone/ctx

Trigger-Driven Dynamic Context Loading for Code-Aware LLM Agents
https://github.com/jaytoone/ctx
bm25 claude-code claude-code-plugin context-retrieval hooks llm-tools memory
Last synced: about 2 months ago
JSON representation
Trigger-Driven Dynamic Context Loading for Code-Aware LLM Agents
Host: GitHub
URL: https://github.com/jaytoone/ctx
Owner: jaytoone
License: mit
Created: 2026-03-24T08:39:21.000Z (3 months ago)
Default Branch: master
Last Pushed: 2026-04-27T06:30:18.000Z (about 2 months ago)
Last Synced: 2026-04-27T06:33:11.219Z (about 2 months ago)
Topics: bm25, claude-code, claude-code-plugin, context-retrieval, hooks, llm-tools, memory
Language: Python
Size: 2.21 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project

README

          # CTX: Trigger-Driven Dynamic Context Loading for Code-Aware LLM Agents

[![PyPI version](https://img.shields.io/pypi/v/ctx-retriever)](https://pypi.org/project/ctx-retriever/)

[![PyPI downloads](https://img.shields.io/pypi/dm/ctx-retriever)](https://pypi.org/project/ctx-retriever/)

[![Python](https://img.shields.io/badge/python-3.9%2B-blue)](https://pypi.org/project/ctx-retriever/)

[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)

[![HuggingFace Demo](https://img.shields.io/badge/HuggingFace-Demo-orange)](https://huggingface.co/spaces/jaytoone/ctx-demo)

[![Publish to PyPI](https://github.com/jaytoone/CTX/actions/workflows/publish.yml/badge.svg)](https://github.com/jaytoone/CTX/actions/workflows/publish.yml)

CTX classifies developer queries into four trigger types and routes each to a specialized retrieval pipeline. For dependency-sensitive queries, CTX traverses the codebase import graph to resolve transitive relationships that keyword and embedding methods miss. It achieves **1.9x higher Token-Efficiency Score** than BM25 while using only **5.2% of tokens**, and **outperforms BM25 on held-out external codebases** (Flask, FastAPI, Requests — mean R@5 +0.163).

> **Key insight**: code import graphs encode structural dependency information that text-based RAG cannot capture. CTX achieves Recall@5 = 1.0 on implicit dependency queries vs 0.4 for BM25.

## Install

```bash

pip install ctx-retriever

```

Or from source:

```bash

git clone https://github.com/jaytoone/CTX

cd CTX

pip install -e .

```

## Quick Start

```python

from ctx_retriever.retrieval.adaptive_trigger import AdaptiveTriggerRetriever

# Point at any codebase directory

retriever = AdaptiveTriggerRetriever("/path/to/your/project")

# Retrieve relevant files for any natural-language query

result = retriever.retrieve(

    query_id="my_query",

    query_text="how does authentication work?",

    k=5

)

for filepath in result.retrieved_files:

    print(filepath, result.scores[filepath])

```

## Claude Code Hook (Recommended)

CTX runs as a set of Claude Code hooks that inject relevant past decisions, docs, and code into every prompt. Install is one command:

```bash

pip install ctx-retriever

ctx-install                     # register CTX hooks in ~/.claude/settings.json

```

**That's it.** Restart Claude Code and hooks fire on every prompt.

### What ctx-install does (atomic, backup-first)

1. Verifies the 4 CTX hook files exist at `~/.claude/hooks/` (chat-memory, bm25-memory, memory-keyword-trigger, g2-fallback)

2. Reads `~/.claude/settings.json`, takes a timestamped backup (`settings.json.bak.`)

3. Merges the CTX hook registrations into the existing `hooks` dict **without overwriting your other hooks** (dedupes by command string — safe to re-run)

4. Atomically writes the new settings.json (temp-file-then-rename — never leaves partial state on disk)

5. Smoke-tests by firing `bm25-memory.py` once with a dummy prompt and confirming `last-injection.json` gets written

### Other subcommands

```bash

ctx-install --dry-run           # show what would change, touch nothing

ctx-install status              # verify hook file presence + settings.json registration + last fire

ctx-install --uninstall         # remove CTX hook registrations (hook files left in place)

```

### Manual install (legacy — only needed if `ctx-install` fails)

```bash

# 1. Copy hook files to ~/.claude/hooks/

# 2. Register each in ~/.claude/settings.json under the appropriate event key

```

Example settings block (what ctx-install writes for you):

```json

{

  "hooks": {

    "UserPromptSubmit": [

      { "hooks": [{ "type": "command", "command": "python3 $HOME/.claude/hooks/chat-memory.py" }] },

      { "hooks": [{ "type": "command", "command": "python3 $HOME/.claude/hooks/bm25-memory.py --rich" }] },

      { "hooks": [{ "type": "command", "command": "python3 $HOME/.claude/hooks/memory-keyword-trigger.py" }] }

    ],

    "PostToolUse": [

      { "matcher": "Grep",

        "hooks": [{ "type": "command", "command": "python3 $HOME/.claude/hooks/g2-fallback.py" }] }

    ]

  }

}

```

**What you get in each prompt:**

```

[CTX] Trigger: EXPLICIT_SYMBOL | Query: AuthService | Confidence: 0.70 | Intent: judge from prompt

Code files (3/847 total):

• src/auth/service.py [score=1.000]

• src/auth/middleware.py [score=0.823]

• tests/test_auth.py [score=0.741]

(Use the prompt intent to decide how to treat this context.)

```

## Validate on your own transcripts

Before installing, you can measure what CTX *would* give you on your own Claude Code transcripts — no install, no signup, no upload:

```bash

python3 benchmarks/ctx_validate.py --days 7

```

stdlib-only; reads `~/.claude/projects/*/.jsonl` locally and emits a Wilson-95-CI markdown report:

```

- Text match rate:   26.9% [23.2%, 31.1%] ±4.0pp  (n=201)

- Tool-use match:    11.1% [8.6%, 14.2%]  ±2.8pp

- Union (either):    32.8% [28.7%, 37.1%] ±4.2pp

Per response-type:

  prose:       51.2% ±10.3pp  (n=86)

  tool_heavy:  26.2% ±8.2pp   (n=107)

  mixed:       25.0% ±26.0pp  (n=8)

```

**What this measures** — distinctive terms from each user prompt, substring-matched against the assistant's response text AND tool_use parameters (file_path/command/pattern). On turns where CTX's hooks would surface related context, this rate approximates the *ceiling* of plausible utility. It is NOT a direct CTX measurement — install CTX and compare against live `utility_measured` telemetry for the actual delta. Use it to decide "is this signal worth pursuing?" before committing to install.

Live dashboard (after install):

![CTX Telemetry Dashboard](iter5-full.png)

The dashboard visualizes utility in four stacked views — pooled rate with 95% CI, per-block breakdown (g1/g2_docs/g2_prefetch), by response type (prose/mixed/tool_heavy), and by item age (0-7d / 7-30d / 30d+). The knowledge graph below it lights up decisions in coral when Claude actually used them in the last 7 days; dead-weight decisions (no recent references) appear muted — pruning candidates.

## Hook Performance

CTX adds no LLM calls — latency is purely algorithmic (BM25 + BFS indexing):

| Project | Language | Files | Hook Latency |

|---------|----------|-------|-------------|

| Small project | Python | ~88 | ~40ms |

| Medium project | Python | ~215 | ~165ms |

| Large project | TypeScript | ~651 | ~270ms |

| Very large | any | >2000 | skipped (auto-excluded) |

The hook is skipped for prompts <15 chars, slash commands, `[noctx]` tags, and codebases with <3 files.

**Control tags** you can add to any prompt:

| Tag | Effect |

|-----|--------|

| `[noctx]` | Disable CTX for this prompt |

| `[fix]` | Fix/Replace mode — adds anti-anchoring reminder so Claude doesn't copy the existing (potentially wrong) implementation |

`[fix]` is also auto-triggered when the prompt starts with `fix:`, `bug:`, `refactor:`, or `replace:`.

## Trigger Types

| Trigger | When Used | Mechanism |

|---------|-----------|-----------|

| `EXPLICIT_SYMBOL` | Query names a class/function | Symbol index lookup |

| `SEMANTIC_CONCEPT` | Query describes a concept | BM25 keyword scoring |

| `IMPLICIT_CONTEXT` | Dependency queries ("what uses X") | BFS import graph traversal |

| `TEMPORAL_HISTORY` | Recent changes / history | Session file tracker |

## Results

### Synthetic Benchmark (50 files, 166 queries)

| Strategy | Recall@5 | Token Usage | TES |

|----------|----------|-------------|-----|

| Full Context | 0.075 | 100.0% | 0.019 |

| BM25 | 0.982 | 18.7% | 0.410 |

| Dense TF-IDF | 0.973 | 21.0% | 0.406 |

| GraphRAG-lite | 0.523 | 24.0% | 0.218 |

| LlamaIndex | 0.972 | 20.1% | 0.405 |

| Chroma Dense | 0.829 | 19.3% | 0.346 |

| Hybrid Dense+CTX | 0.725 | 23.6% | 0.303 |

| **CTX (Ours)** | **0.874** | **5.2%** | **0.776** |

**TES** = Recall@5 / ln(1 + files_loaded). Higher = better token efficiency.

### External Codebase Benchmark (Flask, FastAPI, Requests)

CTX outperforms BM25 on all three held-out external codebases in code-to-code structural retrieval:

| Codebase | Files | CTX R@5 | BM25 R@5 | Δ |

|----------|-------|---------|----------|---|

| Flask | 79 | **0.545** | 0.347 | **+0.198** |

| FastAPI | 928 | **0.328** | 0.174 | **+0.154** |

| Requests | 35 | **0.626** | 0.489 | **+0.137** |

| **Mean** | — | **0.500** | 0.337 | **+0.163** |

*Bootstrap 95% CI: external mean [0.441, 0.550]*

### COIR External Benchmark (CodeSearchNet Python)

| Strategy | Recall@1 | Recall@5 | MRR |

|----------|----------|----------|-----|

| Dense Embedding (MiniLM) | 0.960 | 1.000 | 0.978 |

| Hybrid Dense+CTX | 0.930 | 0.950 | 0.940 |

| BM25 | 0.920 | 0.980 | 0.946 |

| CTX Adaptive Trigger | 0.720 | 0.740 | 0.728 |

### Downstream LLM Evaluation

CTX context injected into developer prompts improves LLM task quality across two models:

| Scenario | WITH CTX | WITHOUT CTX | Δ |

|----------|----------|-------------|---|

| G1 (session memory recall) | 1.000 | 0.110 | **+0.890** |

| G2 (CTX-specific knowledge) | 0.688 | 0.000 | **+0.688** |

G1: CTX persistent memory enables perfect cross-session recall (vs 11% without). G2: CTX context eliminates hallucination on CTX-specific API queries.

### Key Findings

- CTX achieves **1.9x higher TES** than BM25 with only 5.2% token usage

- CTX achieves **perfect Recall@5 (1.0)** on IMPLICIT_CONTEXT dependency queries

- CTX **outperforms BM25 on all 3 external codebases** in code-to-code retrieval (mean +0.163 R@5)

- CTX context improves downstream LLM task quality: **G1 +0.890**, **G2 +0.688**

- Trigger classifier achieves **100% accuracy** (all 4 types F1=1.00) on synthetic benchmark

- CTX Adaptive Trigger achieves **R@5=0.740 on COIR** (improved from 0.380 via BM25 hybrid + CamelCase fix)

- Hybrid Dense+CTX achieves R@5=0.950 on COIR — best of both worlds

- No single strategy dominates all dimensions — workload determines optimal choice

## When to Use CTX

**CTX excels when:**

- You need dependency-aware retrieval: `IMPLICIT_CONTEXT` queries (e.g., "what uses AuthService?") achieve perfect Recall@5 (1.0) via BFS import graph traversal

- Working with a **known codebase** with established symbol/import structure — code-to-code retrieval outperforms BM25 on real projects (Flask: +0.198, FastAPI: +0.154, Requests: +0.137)

- Token budget is critical — CTX uses only **5.2% of tokens** vs 18.7% for BM25 (TES: 1.9x higher)

- Queries name **explicit symbols** (class names, function names) — EXPLICIT_SYMBOL trigger routes directly to symbol index

**CTX is not designed for:**

- **Text-to-code semantic search** (COIR-style): finding code from natural-language descriptions. CTX R@5=0.740 vs BM25=0.980 on CodeSearchNet Python — still a gap; for best results use Dense Embedding or Hybrid Dense+CTX instead

- **Large unseen codebases** (>500 files, no prior indexing): heuristic symbol extraction degrades at scale; consider AST-based indexers

- **Natural-language concept queries** without code keywords: SEMANTIC_CONCEPT trigger falls back to BM25, losing CTX's structural advantage

## Running Experiments

```bash

# Synthetic benchmark

python run_experiment.py --dataset-size small --strategy all

# Real codebase

python run_experiment.py --dataset-source real --project-path /path/to/project --strategy all

# COIR external benchmark

python run_coir_eval.py --n-queries 100

# Ablation study

python run_experiment.py --dataset-size small --mode ablation

```

Results are written to `benchmarks/results/`.

## Project Structure

```

CTX/

  src/

    retrieval/            # Retrieval strategies (8 total)

      adaptive_trigger.py # CTX core: trigger-driven retrieval

      hybrid_dense_ctx.py # Hybrid: dense seed + graph expansion

      bm25_retriever.py   # BM25 sparse retrieval

      dense_retriever.py  # TF-IDF dense retrieval

      chroma_retriever.py # ChromaDB + sentence-transformers

      graph_rag.py        # GraphRAG-lite baseline

      llamaindex_retriever.py # LlamaIndex AST-aware chunking

      full_context.py     # Full context baseline

    trigger/              # Trigger classifier (4 types)

    evaluator/            # Benchmark runner, metrics, COIR

    data/                 # Dataset generation, real codebase loader

  hooks/

    ctx_real_loader.py    # Claude Code UserPromptSubmit hook

    ctx_session_tracker.py # PostToolUse session tracker

  benchmarks/

    results/              # Experiment results and reports

  docs/

    claude_code_integration.md  # Claude Code setup guide

    paper/                # Paper draft (markdown + LaTeX)

```

## Telemetry (opt-in, local-only)

CTX can log retrieval quality metrics locally to help you understand how well the context injection is working.

**Opt in:**

```bash

export CTX_TELEMETRY=1          # enable for this shell

# or: touch ~/.claude/ctx-telemetry.enabled   # persist across shells

```

**View your data:**

```bash

ctx-telemetry                   # summary + flywheel health verdict (causal r, upgrade hint)

ctx-telemetry last              # last 10 session turns

ctx-telemetry calibrate         # citation bias + causal r-analysis (v1.5)

ctx-telemetry tune              # compute auto-tune params → ctx-auto-tune.json

ctx-telemetry cluster [-p DIR]  # detect tech stack → project_type_hint in ctx-auto-tune.json

ctx-telemetry consent           # Stage 2 upload consent status

ctx-telemetry upload            # Stage 2 dry-run preview

ctx-telemetry clear             # delete all local telemetry logs

```

Sample `ctx-telemetry` output:

```

CTX Retrieval Telemetry — 42 session-turn records (schema v1.6)

...

Flywheel health [n=42]: causal-r=+0.35 | upgrade=✓ HYBRID | kw=43%

```

**Auto-tune (flywheel):** After `ctx-telemetry tune` runs with ≥15 records, CTX automatically adjusts retrieval parameters based on your usage patterns (e.g., top_k reduction for query types with lower citation rates). The active tuning state is shown in CTX's context header: `> **CTX auto-tune** [n=42, hybrid✓]`.

With ≥10 v1.5 records, `tune` also computes a causal signal: Pearson r between BM25 top retrieval score and citation rate. High r (>0.30) means quality-driven citations — HYBRID upgrade is worthwhile. Low r (<0.10) suggests position bias may be dominant — validate before upgrading. This is stored as `hybrid_upgrade_hint` in `ctx-auto-tune.json`.

**Project cluster detection (Stage 3 prerequisite):** `ctx-telemetry cluster` scans your project's source files, matches term frequencies against tech-stack signature profiles (python_ml, python_backend, nextjs_react, rust_systems, go_backend), and writes `project_type_hint` to `ctx-auto-tune.json`. This is a local-first proxy for the Stage 3 `project_type_id` cluster — enabling cold-start pre-warming without requiring cross-user data. Example output:

```

python_ml            ██████████████████████████████    80.0%  (18 keywords matched)

python_backend       ███████                           19.0%  (13 keywords matched)

Project type: python_ml  (confidence: HIGH)

```

### What is collected (schema v1.6)

All data stays on your machine at `~/.claude/ctx-retrieval-events.jsonl`. Nothing is uploaded.

| Field | Type | Description |

|-------|------|-------------|

| `user_id` | string(16) | SHA256(machine-id + install-month)[:16] — anonymous, changes on reinstall |

| `session_id_hash` | string(16) | SHA256(session_id)[:16] — non-reversible |

| `ts_unix_hour` | int | Unix timestamp truncated to hour |

| `hook_source` | enum | G1 / G2_DOCS / G2_CODE / CM |

| `query_type` | enum | KEYWORD / SEMANTIC / TEMPORAL |

| `retrieval_method` | enum | HYBRID / BM25 / UNKNOWN |

| `candidates_returned` | int | Number of candidates before ranking |

| `total_injected` | int | Items injected into context |

| `total_cited` | int | Items referenced by the AI response |

| `utility_rate` | float | cited / injected — retrieval precision proxy |

| `session_turn_index` | int | Turn index within the current session |

| `vec_daemon_up` | bool | Whether semantic layer was active |

| `bge_daemon_up` | bool | Whether cross-encoder reranker was active |

| `duration_ms` | int | Per-block retrieval latency |

| `top_score_bm25` | float\|null | Max BM25 score — causal calibration signal (v1.5) |

| `top_score_dense` | float\|null | Max cosine similarity score (v1.5) |

### What is NOT collected

- ❌ No query text, response text, or code content

- ❌ No file names, commit messages, or project paths

- ❌ No email, device name, or personally identifiable information

- ❌ No network requests — Stage 1 is local-only

### Privacy design

- `user_id` = SHA256(machine-id + month-boundary) — not linkable to email or name; changes on reinstall

- Timestamps truncated to **hour** (not minute)

- All content stripped — only counts, rates, method names, and latency

- Follows [Sourcegraph's numeric-only telemetry](https://sourcegraph.com/docs/admin/telemetry) pattern

**Stage 2 (not yet implemented):** opt-in upload of k-anonymized `session_aggregate` rows via `ctx-telemetry consent`. Rows with fewer than 5 users per (date × project_type) window are suppressed before any upload.

## Paper

- Paper draft: [`docs/paper/CTX_paper_draft.md`](docs/paper/CTX_paper_draft.md)

- arXiv: TBD

- EMNLP 2026 submission: TBD

## License

MIT
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/jaytoone/ctx

Awesome Lists containing this project

README