https://github.com/moonrunnerkc/ruleprobe

Verify whether AI coding agents follow the instruction files they're given

agents-md ai ast benchmark claude-md cli coding-agent cursorrules developer-tools instruction-adherence typescript verification

# RuleProbe



Verify whether AI coding agents actually follow the instruction files they're given.



Runtime: Node.js >= 18. Language: TypeScript.

## Why

Every AI coding agent reads an instruction file. None of them prove they followed it.

You write `CLAUDE.md` or `AGENTS.md` with specific rules: camelCase variables, no `any` types, named exports only, test files for every source file. The agent says "Done." But did it actually follow them? Your code review catches some violations, misses others, and doesn't scale.

RuleProbe reads the same instruction file, extracts the machine-verifiable rules, and checks agent output against each one. Binary pass/fail, with file paths and line numbers as evidence. No LLM evaluation, no judgment calls. Deterministic and reproducible.

## Quick Start

```bash
npm install -g ruleprobe
```

Or run it directly:

```bash
npx ruleprobe --help
```

> **Note:** The examples below reflect the current development HEAD (53 matchers, 9 categories). The published npm v0.1.0 shipped with 15 matchers. A new release will follow.

**Parse an instruction file** to see what rules RuleProbe can extract. This is real output from parsing the repo's included example instruction file:

```bash
ruleprobe parse docs/example-instructions.md
```

```
Extracted 32 rules:

forbidden-no-any-type-2
Category: forbidden-pattern
Verifier: ast
Pattern: no-any (*.ts)
Source: "- No any types anywhere in the codebase"

error-no-empty-catch-6
Category: error-handling
Verifier: ast
Pattern: no-empty-catch (*.ts)
Source: "- No empty catch blocks; always handle or rethrow errors"

naming-kebab-case-files-17
Category: naming
Verifier: filesystem
Pattern: kebab-case (filenames)
Source: "- File names: kebab-case (e.g., user-service.ts, api-handler.ts)"

dependency-pinned-versions-34
Category: dependency
Verifier: filesystem
Pattern: pinned-dependencies (package.json)
Source: "- All dependencies pinned to exact versions, no ^ or ~ ranges"
...
```

**Verify agent output** against those rules. This is ruleprobe verifying its own source code:

```bash
ruleprobe verify docs/example-instructions.md ./src --format text
```

```
RuleProbe Adherence Report
Agent: unknown | Model: unknown | Task: manual

Rules: 32 total | 23 passed | 9 failed | Score: 72%

FAIL error-handling/error-no-empty-catch-6
commands/run.ts:148 - found: empty catch block
utils/safe-path.ts:116 - found: empty catch block
verifier/ast-verifier.ts:248 - found: empty catch block
PASS forbidden-pattern/forbidden-no-any-type-2
PASS structure/structure-strict-mode-1
PASS structure/structure-named-exports-only-3
PASS naming/naming-kebab-case-files-17
FAIL naming/naming-camelcase-variables-18
verifier/treesitter-loader.ts:75 - found: ParserCtor
verifier/treesitter-loader.ts:76 - found: LanguageRef
PASS naming/naming-pascalcase-types-20
PASS test-requirement/test-files-exist-25
FAIL structure/structure-no-barrel-files-24
ast-checks/index.ts:5 - found: barrel file with 24 re-exports
llm/index.ts:7 - found: barrel file with 9 re-exports
PASS import-pattern/import-no-path-aliases-28
PASS forbidden-pattern/forbidden-no-console-log-4
PASS structure/structure-max-file-length-22
PASS structure/structure-jsdoc-required-21
PASS dependency/dependency-pinned-versions-34
...

By Category:
naming: 2/4 (50%)
forbidden-pattern: 4/4 (100%)
structure: 4/5 (80%)
import-pattern: 4/4 (100%)
test-requirement: 2/2 (100%)
error-handling: 1/2 (50%)
type-safety: 2/4 (50%)
code-style: 2/5 (40%)
dependency: 2/2 (100%)
```

Every failure includes the file, line number, and what was found. No ambiguity.
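The summary line of the report can be reproduced from the per-category tallies. The sketch below is only an illustration of that arithmetic: the `CategoryTally` shape and the helper names are hypothetical, and rounding to the nearest integer is an assumption rather than RuleProbe's documented behavior.

```typescript
// Hypothetical shape for per-category tallies; not RuleProbe's actual types.
interface CategoryTally {
  category: string;
  passed: number;
  total: number;
}

// Tallies copied from the "By Category" section of the sample report.
const tallies: CategoryTally[] = [
  { category: 'naming', passed: 2, total: 4 },
  { category: 'forbidden-pattern', passed: 4, total: 4 },
  { category: 'structure', passed: 4, total: 5 },
  { category: 'import-pattern', passed: 4, total: 4 },
  { category: 'test-requirement', passed: 2, total: 2 },
  { category: 'error-handling', passed: 1, total: 2 },
  { category: 'type-safety', passed: 2, total: 4 },
  { category: 'code-style', passed: 2, total: 5 },
  { category: 'dependency', passed: 2, total: 2 },
];

// Percentage rounded to the nearest integer (the rounding mode is an assumption).
function percent(passed: number, total: number): number {
  return total === 0 ? 0 : Math.round((passed / total) * 100);
}

// Sum the per-category counts into the totals shown on the report's summary line.
function summarize(rows: CategoryTally[]): { passed: number; failed: number; total: number; score: number } {
  const passed = rows.reduce((sum, row) => sum + row.passed, 0);
  const total = rows.reduce((sum, row) => sum + row.total, 0);
  return { passed, failed: total - passed, total, score: percent(passed, total) };
}

const summary = summarize(tallies);
console.log(
  `Rules: ${summary.total} total | ${summary.passed} passed | ${summary.failed} failed | Score: ${summary.score}%`,
);
// prints "Rules: 32 total | 23 passed | 9 failed | Score: 72%"
```

The 72% figure is 23/32 = 71.875% rounded up, which matches the sample report.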

## What It Does

**Parse.** Reads 6 instruction file formats (CLAUDE.md, AGENTS.md, .cursorrules, copilot-instructions.md, GEMINI.md, .windsurfrules) and extracts rules that can be checked mechanically. Subjective instructions like "write clean code" are reported as unparseable so you know what was skipped.

**Verify.** Runs each extracted rule against a directory of agent-generated code. Checks use AST parsing via ts-morph, file system inspection, and regex pattern matching. No LLM evaluation at any stage by default; results are deterministic and identical across runs.

**LLM Extract (opt-in).** Pass `--llm-extract` to send unparseable lines through an OpenAI-compatible API for a second extraction pass. LLM-extracted rules are labeled with `extractionMethod: 'llm'` and `confidence: 'medium'`, and default to warning severity. Requires `OPENAI_API_KEY` env var. No LLM dependency is installed by default.

**Compare.** Point RuleProbe at outputs from two or more agents and get a side-by-side comparison table showing which rules each one followed. Useful for evaluating agents on the same task, or tracking adherence over time.

**GitHub Action.** Ships as a composite action you can drop into any repo. Runs `ruleprobe verify` on every PR, posts results as a comment, and optionally outputs reviewdog rdjson format for inline annotations. No API keys needed beyond `GITHUB_TOKEN`.

## Configuration

RuleProbe auto-discovers a config file in the working directory (or any parent). You can also pass a path explicitly with `--config`. Supported file names, in priority order:

- `ruleprobe.config.ts`
- `ruleprobe.config.js`
- `ruleprobe.config.json`
- `.ruleproberc.json`

A config file lets you add custom rules, override extracted rules, or exclude rules entirely:

```typescript
// ruleprobe.config.ts
import { defineConfig } from 'ruleprobe';

export default defineConfig({
  // Add rules that the parser can't extract from your instruction file
  rules: [
    {
      id: 'custom-no-lodash',
      category: 'import-pattern',
      description: 'Ban lodash imports',
      verifier: 'regex',
      pattern: { type: 'banned-import', target: '*.ts', expected: 'lodash', scope: 'file' },
    },
  ],

  // Change severity or expected values on extracted rules
  overrides: [
    { ruleId: 'naming-camelcase', severity: 'warning' },
    { ruleId: 'structure-max-file-length', expected: '500' },
  ],

  // Remove rules you don't want checked
  exclude: ['forbidden-no-console-log'],
});
```

`defineConfig()` is a no-op passthrough that provides type checking in TypeScript configs. JSON configs work without it.

Custom rules use the same verifier types (`ast`, `regex`, `filesystem`) and pattern types as extracted rules. Any pattern type listed in the Supported Rule Types table works as a custom rule pattern.
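To make the three config sections concrete, here is a minimal sketch of how custom rules, overrides, and exclusions could combine into a final rule list. It is not RuleProbe's `applyConfig` implementation: the `SketchRule` and `SketchConfig` types and the `mergeConfig` helper are hypothetical, and matching rule IDs by exact string equality is an assumption (extracted IDs in the sample output carry numeric suffixes, so the real matching may be looser).

```typescript
// Hypothetical, simplified shapes; RuleProbe's real rule and config types may differ.
type Severity = 'error' | 'warning';

interface SketchRule {
  id: string;
  severity: Severity;
  expected?: string;
}

interface SketchConfig {
  rules?: SketchRule[];
  overrides?: Array<{ ruleId: string; severity?: Severity; expected?: string }>;
  exclude?: string[];
}

// Append custom rules, drop excluded IDs, then apply overrides by exact ID (assumed).
function mergeConfig(extracted: SketchRule[], config: SketchConfig): SketchRule[] {
  const excluded = new Set(config.exclude ?? []);
  const overrides = new Map((config.overrides ?? []).map((o) => [o.ruleId, o] as const));

  return [...extracted, ...(config.rules ?? [])]
    .filter((rule) => !excluded.has(rule.id))
    .map((rule) => {
      const override = overrides.get(rule.id);
      if (!override) return rule;
      return {
        ...rule,
        severity: override.severity ?? rule.severity,
        expected: override.expected ?? rule.expected,
      };
    });
}

const merged = mergeConfig(
  [
    { id: 'naming-camelcase', severity: 'error' },
    { id: 'structure-max-file-length', severity: 'error', expected: '300' },
    { id: 'forbidden-no-console-log', severity: 'error' },
  ],
  {
    rules: [{ id: 'custom-no-lodash', severity: 'error' }],
    overrides: [
      { ruleId: 'naming-camelcase', severity: 'warning' },
      { ruleId: 'structure-max-file-length', expected: '500' },
    ],
    exclude: ['forbidden-no-console-log'],
  },
);

console.log(merged.map((rule) => `${rule.id} ${rule.severity} ${rule.expected ?? '-'}`).join('\n'));
```

In this sketch the excluded rule disappears, the two overridden rules keep their IDs but change severity or expected value, and the custom rule is appended unchanged.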

## CLI Reference

### `ruleprobe parse`

Extract rules from an instruction file.

```bash
ruleprobe parse CLAUDE.md --format json
ruleprobe parse AGENTS.md --show-unparseable
ruleprobe parse AGENTS.md --llm-extract --show-unparseable
```

`--format json|text` controls output format. `--show-unparseable` includes lines that couldn't be converted to rules. `--llm-extract` sends unparseable lines to an OpenAI-compatible API for additional extraction (requires `OPENAI_API_KEY`).

### `ruleprobe verify`

Check agent output against extracted rules.

```bash
ruleprobe verify CLAUDE.md ./output --format text
ruleprobe verify AGENTS.md ./output --agent claude --model opus-4 --format json --output report.json
ruleprobe verify AGENTS.md ./output --format markdown --severity error
ruleprobe verify AGENTS.md ./output --format rdjson
ruleprobe verify AGENTS.md ./output --config ruleprobe.config.ts
ruleprobe verify AGENTS.md ./output --llm-extract
ruleprobe verify AGENTS.md ./output --rubric-decompose
ruleprobe verify AGENTS.md ./output --project tsconfig.json
```

Options:

- `--agent` and `--model` tag the report metadata.
- `--severity error|warning|all` filters results.
- `--output` writes to a file instead of stdout.
- `--format rdjson` produces reviewdog-compatible diagnostics.
- `--config` loads a specific config file (otherwise auto-discovered).
- `--llm-extract` runs unparseable lines through an LLM for additional rule extraction.
- `--rubric-decompose` uses an LLM to break subjective instructions into weighted concrete checks (tagged with `extractionMethod: 'rubric'` and `confidence: 'low'`).
- `--project` enables type-aware AST checks (implicit any, unused exports, unresolved imports) using the specified tsconfig.json.

Both `--llm-extract` and `--rubric-decompose` require `OPENAI_API_KEY`.

Exit codes: `0` all rules passed, `1` violations found, `2` execution error.
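A wrapper script can branch on these exit codes. The sketch below only maps the three documented codes to labels; the `VerifyOutcome` names and the `shouldBlockMerge` policy are hypothetical and not part of RuleProbe.

```typescript
// Outcome labels are hypothetical; the numeric codes come from the CLI reference.
type VerifyOutcome = 'passed' | 'violations' | 'execution-error' | 'unknown';

function interpretExitCode(code: number | null): VerifyOutcome {
  switch (code) {
    case 0:
      return 'passed';
    case 1:
      return 'violations';
    case 2:
      return 'execution-error';
    default:
      // null (process killed by a signal) or any code the docs don't list
      return 'unknown';
  }
}

// Example policy (an assumption): anything other than a clean pass blocks the merge.
function shouldBlockMerge(code: number | null): boolean {
  return interpretExitCode(code) !== 'passed';
}

console.log(interpretExitCode(1)); // prints "violations"
console.log(shouldBlockMerge(0)); // prints "false"
```

When the CLI is launched with Node's `child_process.spawnSync`, the `status` field of the result is a `number | null` and can be passed straight to `interpretExitCode`.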

### `ruleprobe compare`

Compare multiple agent outputs against the same rules.

```bash
ruleprobe compare AGENTS.md ./claude-output ./copilot-output --agents claude,copilot --format markdown
```
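As a rough picture of what a side-by-side comparison involves, the sketch below renders a markdown table from per-agent pass sets. The `AgentResults` type, the `comparisonTable` helper, and the table layout are all hypothetical; RuleProbe's actual `compare` output may look different.

```typescript
// Hypothetical input: for each agent, the set of rule IDs that passed.
type AgentResults = Record<string, Set<string>>;

// Build one markdown row per rule, with a PASS/FAIL cell per agent.
function comparisonTable(ruleIds: string[], results: AgentResults): string {
  const agents = Object.keys(results);
  const header = `| Rule | ${agents.join(' | ')} |`;
  const divider = `|---|${agents.map(() => '---').join('|')}|`;
  const rows = ruleIds.map((id) => {
    const cells = agents.map((agent) => (results[agent]?.has(id) ? 'PASS' : 'FAIL'));
    return `| ${id} | ${cells.join(' | ')} |`;
  });
  return [header, divider, ...rows].join('\n');
}

const table = comparisonTable(
  ['naming-kebab-case-files-17', 'error-no-empty-catch-6'],
  {
    claude: new Set(['naming-kebab-case-files-17', 'error-no-empty-catch-6']),
    copilot: new Set(['error-no-empty-catch-6']),
  },
);
console.log(table);
```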

### `ruleprobe tasks` / `ruleprobe task`

List available task templates or output a specific task prompt. Three templates ship with v0.1.0: `rest-endpoint`, `utility-module`, `react-component`.

```bash
ruleprobe tasks
ruleprobe task rest-endpoint
```

### `ruleprobe run`

Invoke an AI agent on a task template, verify the output, and print the report in one step. Requires `@anthropic-ai/claude-agent-sdk` and `ANTHROPIC_API_KEY` for SDK mode. Alternatively, use `--watch` to point at a directory where you (or another agent) will write output manually.

```bash
# SDK mode: invoke Claude, verify, report
ruleprobe run CLAUDE.md --task rest-endpoint --agent claude-code --model sonnet --format text

# Watch mode: wait for output in a directory, then verify
ruleprobe run CLAUDE.md --watch ./agent-output --timeout 300 --format json
```

Options: `--task`, `--agent`, `--model`, `--format`, `--output-dir`, `--watch`, `--timeout`, `--allow-symlinks`, `--config`.

## GitHub Action

Drop this into `.github/workflows/ruleprobe.yml`:

```yaml
name: RuleProbe
on: [pull_request]
jobs:
  check-rules:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write
    steps:
      - uses: actions/checkout@v4
      - uses: moonrunnerkc/ruleprobe@v1
        with:
          instruction-file: AGENTS.md
          output-dir: src
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

That's it. No API keys, no LLM calls, deterministic results, runs in seconds.

> **Note:** `@v1` tracks the latest v1.x release. Pin to a specific tag (e.g., `@v1.0.0`) for reproducible builds.

Full options:

```yaml
- uses: moonrunnerkc/ruleprobe@v1
  with:
    instruction-file: AGENTS.md
    output-dir: src
    agent: ci
    model: unknown
    format: text
    severity: all
    fail-on-violation: "true"
    post-comment: "true"
    reviewdog-format: "false"
```

| Input | Default | Description |
|-------|---------|-------------|
| `instruction-file` | (required) | Path to instruction file |
| `output-dir` | `src` | Directory containing code to verify |
| `agent` | `ci` | Agent identifier for report metadata |
| `model` | `unknown` | Model identifier for report metadata |
| `format` | `text` | Report format: text, json, or markdown |
| `severity` | `all` | Filter: error, warning, or all |
| `fail-on-violation` | `true` | Fail the check on any violation |
| `post-comment` | `true` | Post results as a PR comment |
| `reviewdog-format` | `false` | Also output rdjson for reviewdog |

Outputs: `score`, `passed`, `failed`, `total` (available to downstream steps).

## Programmatic API

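The package exports the functions below. The first five cover the core parse, verify, and report pipeline; the rest handle configuration and opt-in LLM extraction: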

| Function | Purpose |
|----------|---------|
| `parseInstructionFile(path)` | Parse an instruction file into a `RuleSet` |
| `verifyOutput(ruleSet, dir)` | Run rules against a code directory |
| `generateReport(run, ruleSet, results)` | Build an `AdherenceReport` with summary stats |
| `formatReport(report, format)` | Render as text, JSON, markdown, or rdjson |
| `extractRules(markdown, fileType)` | Extract rules from raw markdown content |
| `defineConfig(config)` | Type-safe config helper for ruleprobe.config.ts |
| `loadConfig(path?, searchDir?)` | Load and validate a config file |
| `applyConfig(ruleSet, config)` | Merge custom rules, overrides, and exclusions into a RuleSet |
| `extractWithLlm(ruleSet, options)` | Run LLM extraction on unparseable lines |
| `createOpenAiProvider(config?)` | Create an OpenAI-compatible LLM provider |

```typescript
import { parseInstructionFile, verifyOutput, generateReport, formatReport } from 'ruleprobe';

const ruleSet = parseInstructionFile('CLAUDE.md');
const results = verifyOutput(ruleSet, './agent-output');
const report = generateReport(
  { agent: 'claude-code', model: 'opus-4', taskTemplateId: 'rest-endpoint',
    outputDir: './agent-output', timestamp: new Date().toISOString(), durationSeconds: null },
  ruleSet,
  results,
);
console.log(formatReport(report, 'text'));
```

**LLM-assisted extraction** (opt-in):

```typescript
import { parseInstructionFile, extractWithLlm, createOpenAiProvider } from 'ruleprobe';

const ruleSet = parseInstructionFile('CLAUDE.md');
const provider = createOpenAiProvider({ model: 'gpt-4o-mini' });
const enhanced = await extractWithLlm(ruleSet, { provider });
// enhanced.rules now includes LLM-extracted rules with extractionMethod: 'llm'
```

## How It Works

```mermaid
flowchart LR
A[Instruction File] --> B[Rule Parser]
B --> C[RuleSet]
D[Agent Output] --> E[Verifier]
C --> E
E --> F[Adherence Report]
```

The parser reads your instruction file and identifies lines that map to deterministic checks (naming conventions, forbidden patterns, structural requirements). Each rule gets a category, a verifier type, and a pattern. The verifier walks the agent's output directory, runs AST checks via ts-morph for code structure rules, file system checks for naming and test file requirements, and regex checks for line length and content patterns. The report collects pass/fail results with evidence for every rule.
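To illustrate what a deterministic check with evidence looks like, here is a minimal sketch of a kebab-case filename check. It is not RuleProbe's verifier code: the `Evidence` and `CheckResult` shapes, the regex, and the choice to strip everything after the first dot are assumptions made for the example.

```typescript
// Hypothetical result shapes; RuleProbe's real report types may differ.
interface Evidence {
  file: string;
  found: string;
}

interface CheckResult {
  ruleId: string;
  passed: boolean;
  evidence: Evidence[];
}

// Lowercase alphanumeric segments separated by single hyphens.
const KEBAB_CASE = /^[a-z0-9]+(-[a-z0-9]+)*$/;

// File name without directories or extensions: "src/user-service.test.ts" -> "user-service".
function baseName(filePath: string): string {
  const name = filePath.split('/').pop() ?? filePath;
  const dot = name.indexOf('.');
  return dot > 0 ? name.slice(0, dot) : name;
}

// Binary pass/fail: the rule passes only if no file name violates the pattern.
function checkKebabCaseFiles(ruleId: string, files: string[]): CheckResult {
  const evidence = files
    .filter((file) => !KEBAB_CASE.test(baseName(file)))
    .map((file) => ({ file, found: baseName(file) }));
  return { ruleId, passed: evidence.length === 0, evidence };
}

const result = checkKebabCaseFiles('naming-kebab-case-files-17', [
  'src/user-service.ts',
  'src/api-handler.ts',
  'src/UserProfile.ts',
]);

for (const item of result.evidence) {
  console.log(`${result.passed ? 'PASS' : 'FAIL'} ${item.file} - found: ${item.found}`);
}
// prints "FAIL src/UserProfile.ts - found: UserProfile"
```

Because the check is a pure function of the file list, running it twice on the same input gives the same result, which is the property the report relies on.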

## Supported Rule Types

53 built-in matchers across 9 categories:

| Category | Count | Verifier(s) |
|----------|------:|-------------|
| naming | 7 | AST, Filesystem, Tree-sitter |
| forbidden-pattern | 5 | AST, Regex |
| structure | 9 | AST, Filesystem |
| test-requirement | 5 | AST, Filesystem, Regex |
| import-pattern | 6 | AST, Regex |
| error-handling | 2 | AST |
| type-safety | 5 | AST, Regex |
| code-style | 10 | AST, Regex, Tree-sitter |
| dependency | 1 | Filesystem |

Full table with example instructions and check details: [docs/matchers.md](docs/matchers.md)

## Authentication

Most of RuleProbe works offline with no API keys. Three opt-in features call external APIs, and the GitHub Action needs `GITHUB_TOKEN`:

| Feature | Flag(s) | Required env var | When you need it |
|---------|---------|-----------------|------------------|
| LLM rule extraction | `--llm-extract` | `OPENAI_API_KEY` | Extracting rules from unparseable instruction lines |
| Rubric decomposition | `--rubric-decompose` | `OPENAI_API_KEY` | Breaking subjective rules into concrete checks |
| Agent invocation (SDK mode) | `ruleprobe run --agent claude-code` | `ANTHROPIC_API_KEY` | Invoking Claude to generate code, then verifying |
| GitHub Action | `uses: moonrunnerkc/ruleprobe@v1` | `GITHUB_TOKEN` | CI, PR comments |

`parse`, `verify`, `compare`, `tasks`, and `task` work entirely offline with no key, as long as you don't pass `--llm-extract` or `--rubric-decompose`.

## Tree-sitter Support

Python and Go get naming and function-length checks via tree-sitter WASM grammars. The grammar packages (`tree-sitter-python`, `tree-sitter-go`, `web-tree-sitter`) ship as regular dependencies; no extra install step is required. WASM binaries are loaded at runtime from the installed packages. If loading fails (unsupported platform, missing native build), tree-sitter checks are skipped and other verifiers still run.

## Security

RuleProbe never executes scanned code, never makes network calls (unless you opt in with `--llm-extract`, `--rubric-decompose`, or `ruleprobe run`), and never modifies files in the scanned directory. User-supplied paths are resolved and bounded to the working directory; symlinks outside the project are skipped unless you pass `--allow-symlinks`. All dependencies are pinned to exact versions. See [SECURITY.md](SECURITY.md) for the full model.
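Bounding a user-supplied path to a root directory is commonly done by resolving it and checking the relative path back to the root. The sketch below shows that general technique with Node's `path` module; the `isInsideRoot` helper is hypothetical and is not taken from RuleProbe's source. It does not follow symlinks, which would additionally need something like `fs.realpathSync`.

```typescript
import * as path from 'node:path';

// Returns true when `candidate`, resolved against `root`, stays within `root`.
// Purely lexical: symlinks are not resolved in this sketch.
function isInsideRoot(root: string, candidate: string): boolean {
  const resolvedRoot = path.resolve(root);
  const resolved = path.resolve(resolvedRoot, candidate);
  const relative = path.relative(resolvedRoot, resolved);

  if (relative === '') return true; // the root itself
  if (relative === '..' || relative.startsWith('..' + path.sep)) return false; // escapes upward
  return !path.isAbsolute(relative); // e.g. a different drive on Windows
}

const projectRoot = path.resolve('project');
console.log(isInsideRoot(projectRoot, 'src/index.ts')); // prints "true"
console.log(isInsideRoot(projectRoot, '../secrets.txt')); // prints "false"
```

Comparing against `'..' + path.sep` rather than a bare `'..'` prefix keeps names such as `..cache` from being rejected by mistake.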

## Limitations

What v0.1.0 doesn't do, stated plainly.

- **TypeScript gets the deepest coverage.** ts-morph gives full AST analysis for TypeScript and JavaScript: naming, forbidden patterns, structure, imports, type-safety, and code-style checks. Python and Go get naming and function-length checks via tree-sitter WASM grammars (grammar packages ship as regular dependencies; see the Tree-sitter Support section). Everything else falls back to regex (line length, comments, semicolons). No Rust, Java, or C# AST support yet.
- **Subjective rules stay subjective.** "Write clean code" has no deterministic check. The `--rubric-decompose` flag on the `verify` command uses an LLM to break subjective instructions into weighted concrete checks (max function length, no magic numbers, etc.), tagged with `extractionMethod: 'rubric'` and `confidence: 'low'`. This is a proxy, not a direct evaluation. Lines with no measurable proxy stay in the unparseable array. Requires `OPENAI_API_KEY`.
- **Agent invocation covers Claude SDK and watch mode only.** The `run` command invokes agents via the Claude Agent SDK (requires `ANTHROPIC_API_KEY`) or watches a directory for output. Copilot, Cursor, and other agent SDKs are not integrated; use `--watch` mode for those.
- **Type-aware checks require --project.** Three checks (implicit any, unused exports, unresolved imports) need the TypeChecker, which requires a `tsconfig.json`. Without `--project`, ts-morph parses files in isolation and these checks are skipped.
- **53 matchers, not infinite.** The parser skips lines it can't confidently map to a check. Use `--show-unparseable` to see what was missed, and `--llm-extract` or `--rubric-decompose` to handle the remainder.

## Case Study

See [docs/case-study-v0.1.0.md](docs/case-study-v0.1.0.md) for a comparison of two agents on the rest-endpoint task template against 10 rules.

## Contributing

```bash
git clone https://github.com/moonrunnerkc/ruleprobe.git
cd ruleprobe && npm install
npm test
```

Issues and pull requests welcome at [github.com/moonrunnerkc/ruleprobe](https://github.com/moonrunnerkc/ruleprobe).

## License

[MIT](LICENSE)