https://github.com/matantsach/snapeval

Semantic snapshot testing for AI skills. Zero assertions. AI-driven. Free inference.

agentskills ai-skills ccpi-plugin copilot-cli developer-tools evaluation semantic-testing snapshot-testing

Last synced: 4 months ago

Host: GitHub
URL: https://github.com/matantsach/snapeval
Owner: matantsach
License: mit
Created: 2026-03-14T22:39:52.000Z (4 months ago)
Default Branch: main
Last Pushed: 2026-03-22T10:30:22.000Z (4 months ago)
Last Synced: 2026-03-22T10:56:55.336Z (4 months ago)
Topics: agentskills, ai-skills, ccpi-plugin, copilot-cli, developer-tools, evaluation, semantic-testing, snapshot-testing
Language: TypeScript
Size: 528 KB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md

README

# snapeval

Harness-agnostic eval runner for [agentskills.io](https://agentskills.io) skills.

[![CI](https://github.com/matantsach/snapeval/actions/workflows/ci.yml/badge.svg)](https://github.com/matantsach/snapeval/actions/workflows/ci.yml)

[![npm version](https://img.shields.io/npm/v/snapeval.svg)](https://www.npmjs.com/package/snapeval)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

snapeval runs every eval case **with and without** your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.

```
snapeval — greeter
Baseline = without SKILL.md (raw AI response)
────────────────────────────────────────────────────────────
  #1 formal greeting for Eleanor
    Skill: 100% | Baseline: 33% | 5.2s
  #2 casual greeting for Marcus
    Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s
  #3 pirate greeting for Zoe
    Skill: 100% | Baseline: 67% | 2.5s
────────────────────────────────────────────────────────────
Summary:
  Skill pass rate:    100.0%
  Baseline pass rate: 55.6%
  Improvement:        +44.4%
```

## How it works

1. You write a `SKILL.md` and an `evals.json` with test cases and assertions

2. snapeval runs each eval **twice** — once with your skill loaded, once without (baseline)

3. Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic)

4. A benchmark shows where your skill adds value vs. where the raw AI already handles it
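
The arithmetic behind the benchmark can be sketched as follows. This is a minimal illustration, not snapeval's actual implementation: the `GradedEval` type and both function names are hypothetical, and it assumes the pass rate is pooled over all assertions and that each eval in the greeter example has three assertions (which is consistent with the 33% and 67% figures shown earlier).

```typescript
// Hypothetical shape: graded assertion counts for one eval case in one condition.
interface GradedEval {
  passed: number; // assertions that passed
  total: number; // assertions that were graded
}

// Pooled pass rate: passed assertions divided by graded assertions.
function passRate(results: GradedEval[]): number {
  const passed = results.reduce((sum, r) => sum + r.passed, 0);
  const total = results.reduce((sum, r) => sum + r.total, 0);
  return total === 0 ? 0 : passed / total;
}

// Difference between the two conditions, in percentage points.
function benchmarkDelta(withSkill: GradedEval[], baseline: GradedEval[]): number {
  return (passRate(withSkill) - passRate(baseline)) * 100;
}

// Counts reconstructed from the greeter example (assumed three assertions each).
const withSkill: GradedEval[] = [
  { passed: 3, total: 3 },
  { passed: 3, total: 3 },
  { passed: 3, total: 3 },
];
const baseline: GradedEval[] = [
  { passed: 1, total: 3 },
  { passed: 2, total: 3 },
  { passed: 2, total: 3 },
];

console.log(`Skill pass rate:    ${(passRate(withSkill) * 100).toFixed(1)}%`); // 100.0%
console.log(`Baseline pass rate: ${(passRate(baseline) * 100).toFixed(1)}%`); // 55.6%
console.log(`Improvement:        +${benchmarkDelta(withSkill, baseline).toFixed(1)}%`); // +44.4%
```

Under these assumptions the baseline passes 5 of 9 assertions (55.6%), so the improvement is 44.4 percentage points.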

## Quick start

### As a Copilot plugin

```bash
copilot plugin install matantsach/snapeval
```

Then in Copilot CLI, just say `evaluate my skill` — the snapeval skill handles the rest.

### Standalone CLI

```bash
git clone https://github.com/matantsach/snapeval.git
cd snapeval && npm install
npx tsx bin/snapeval.ts eval
```

## Eval format

```
my-skill/
├── SKILL.md
└── evals/
    ├── evals.json
    └── scripts/         ← optional deterministic checks
        └── validate.sh
```

**evals.json:**

```json
{
  "skill_name": "greeter",
  "evals": [
    {
      "id": 1,
      "label": "formal greeting for Eleanor",
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": [
        "Output contains the name Eleanor",
        "Output uses a formal tone",
        "script:validate.sh"
      ]
    }
  ]
}
```

| Field | Required | Description |
|-------|----------|-------------|
| `id` | yes | Unique numeric identifier |
| `prompt` | yes | The user prompt sent to the harness |
| `expected_output` | yes | Human description of the expected behavior |
| `label` | no | Human-readable name shown in terminal output |
| `slug` | no | Filesystem-safe name for the eval directory |
| `assertions` | no | List of assertions to grade (LLM semantic or `script:` prefixed) |
| `files` | no | Input files to attach to the prompt |
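
For readers who generate or lint `evals.json` programmatically, the field table can be expressed as a type plus a small check. This is only a sketch inferred from the table: the `EvalCase` interface and `validateEvalCase` function are hypothetical names, and treating `files` as a list of path strings is an assumption the README does not confirm.

```typescript
// Types inferred from the field table; snapeval's own types may differ.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  label?: string;
  slug?: string;
  assertions?: string[];
  files?: string[]; // assumed to be file paths
}

// Returns a list of problems; an empty list means the required fields look valid.
function validateEvalCase(value: unknown): string[] {
  if (typeof value !== "object" || value === null) {
    return ["eval case must be an object"];
  }
  const c = value as Record<string, unknown>;
  const problems: string[] = [];
  if (typeof c.id !== "number") problems.push("id must be a number");
  if (typeof c.prompt !== "string") problems.push("prompt must be a string");
  if (typeof c.expected_output !== "string") {
    problems.push("expected_output must be a string");
  }
  const a = c.assertions;
  if (a !== undefined && !(Array.isArray(a) && a.every((x) => typeof x === "string"))) {
    problems.push("assertions must be an array of strings");
  }
  return problems;
}

// A case that omits a required field.
const incomplete = { id: 1, prompt: "Can you give me a formal greeting for Eleanor?" };
console.log(validateEvalCase(incomplete)); // [ 'expected_output must be a string' ]
```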

### Assertions

**Semantic** — graded by an LLM. Write specific, verifiable statements:

```
"Output contains a YAML block with an 'id' field for each issue"
"Response declines because the pipeline already has unclaimed issues"
```

**Script** — prefix with `script:`. Scripts live in `evals/scripts/`, receive the output directory as `$1`, and pass on exit code 0:

```
"script:validate-json-structure.sh"
```
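
The split between the two assertion kinds could be modeled roughly as below. This is a hedged sketch rather than snapeval's code: the `Assertion` type, `classifyAssertion`, and `scriptPassed` are hypothetical names, and the only behavior taken from the README is the `script:` prefix, the `evals/scripts/` location, and the rule that exit code 0 means pass.

```typescript
// Hypothetical model of the two assertion kinds described above.
type Assertion =
  | { kind: "semantic"; text: string } // graded by the LLM judge
  | { kind: "script"; scriptPath: string }; // graded by exit code

const SCRIPT_PREFIX = "script:";

// Strings starting with "script:" name a file under the scripts directory;
// everything else is treated as a semantic assertion.
function classifyAssertion(raw: string, scriptsDir = "evals/scripts"): Assertion {
  if (raw.startsWith(SCRIPT_PREFIX)) {
    const name = raw.slice(SCRIPT_PREFIX.length).trim();
    return { kind: "script", scriptPath: `${scriptsDir}/${name}` };
  }
  return { kind: "semantic", text: raw };
}

// A script assertion passes only when the script exits with code 0.
// A null exit code (for example, a process killed by a signal) counts as a failure here.
function scriptPassed(exitCode: number | null): boolean {
  return exitCode === 0;
}

console.log(classifyAssertion("script:validate-json-structure.sh"));
console.log(classifyAssertion("Output uses a formal tone"));
```

A runner built on this would invoke each script with the output directory as its first argument and feed the resulting exit code to `scriptPassed`.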

## CLI reference

### `eval`

Run evals, grade assertions, compute benchmark.

```bash
npx snapeval eval [skill-dir] [options]
```

| Flag | Description | Default |
|------|-------------|---------|
| `--harness ` | Harness adapter | `copilot-sdk` |
| `--inference ` | Inference adapter for grading | `auto` |
| `--workspace ` | Output directory | `../{skill_name}-workspace` |
| `--runs ` | Harness invocations per eval for statistical averaging | `1` |
| `--concurrency ` | Parallel eval cases (1-10) | `1` |
| `--only ` | Run specific eval IDs (e.g. `--only 1,3,5`) | all |
| `--threshold ` | Minimum pass rate 0-1 for exit code 0 | none |
| `--old-skill ` | Compare against old skill version | none |
| `--feedback` | Write feedback.json template for human review | off |

### Exit codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Threshold not met (eval ran but pass rate below `--threshold`) |
| 2 | Config/input error (bad JSON, missing fields, invalid flags) |
| 3 | File not found (missing skill dir, evals.json, or script) |
| 4 | Runtime error (harness failure, grading failure, timeout) |
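
A wrapper script that calls snapeval may want to tell a failing skill apart from a broken run. The sketch below encodes the table above and the threshold rule; `describeExit` and `exitCodeForPassRate` are hypothetical helpers, and treating a pass rate exactly equal to the threshold as a pass is an assumption based on the "minimum pass rate" wording.

```typescript
// Exit codes as documented in the table above.
const EXIT_MEANINGS: Record<number, string> = {
  0: "Success",
  1: "Threshold not met",
  2: "Config/input error",
  3: "File not found",
  4: "Runtime error",
};

// Human-readable label for an exit code, with a fallback for undocumented codes.
function describeExit(code: number): string {
  return EXIT_MEANINGS[code] ?? `Unknown exit code ${code}`;
}

// Threshold rule: with no threshold the run succeeds; otherwise the pass rate
// (0 to 1) must be at least the threshold (assumed inclusive).
function exitCodeForPassRate(passRate: number, threshold?: number): 0 | 1 {
  if (threshold === undefined) return 0;
  return passRate >= threshold ? 0 : 1;
}

console.log(describeExit(exitCodeForPassRate(0.75, 0.8))); // Threshold not met
console.log(describeExit(exitCodeForPassRate(0.75))); // Success
```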

## Output artifacts

Each run creates an iteration directory:

```
workspace/
└── iteration-1/
    ├── benchmark.json       ← aggregate stats with delta
    ├── SKILL.md.snapshot    ← copy of skill used
    └── eval-{slug}/
        ├── with_skill/
        │   ├── outputs/output.txt
        │   ├── timing.json
        │   ├── grading.json
        │   └── transcript.log
        └── without_skill/
            ├── outputs/output.txt
            ├── timing.json
            └── grading.json
```

**benchmark.json** includes metadata: `eval_count`, `eval_ids`, `skill_name`, `runs_per_eval`, `timestamp`.
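
Scripts that post-process a run need to locate these files. The helper below is a small sketch that mirrors the directory tree shown in this section; `artifactPaths` is a hypothetical name, the slug in the usage line is made up for illustration, and the transcript is only returned for `with_skill` because that is the only place the tree lists it.

```typescript
type Condition = "with_skill" | "without_skill";

interface ArtifactPaths {
  output: string;
  timing: string;
  grading: string;
  transcript?: string; // the tree lists transcript.log under with_skill only
}

// Builds artifact paths following the layout shown in the tree above.
function artifactPaths(
  workspace: string,
  iteration: number,
  slug: string,
  condition: Condition,
): ArtifactPaths {
  const base = `${workspace}/iteration-${iteration}/eval-${slug}/${condition}`;
  const paths: ArtifactPaths = {
    output: `${base}/outputs/output.txt`,
    timing: `${base}/timing.json`,
    grading: `${base}/grading.json`,
  };
  if (condition === "with_skill") {
    paths.transcript = `${base}/transcript.log`;
  }
  return paths;
}

// "formal-greeting" is an illustrative slug, not one taken from a real run.
console.log(artifactPaths("workspace", 1, "formal-greeting", "with_skill").grading);
// workspace/iteration-1/eval-formal-greeting/with_skill/grading.json
```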

## CI integration

```yaml
name: Skill Evaluation
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx snapeval eval skills/my-skill --threshold 0.8 --runs 3
```

The command exits with code 1 when the pass rate falls below the threshold, which fails the job and blocks the PR.

## Configuration

Create `snapeval.config.json` in your skill or project root:

```json
{
  "harness": "copilot-sdk",
  "inference": "auto",
  "workspace": "../{skill_name}-workspace",
  "runs": 1,
  "concurrency": 1
}
```

Resolution order: defaults → project config → skill-dir config → CLI flags.
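
The resolution order amounts to a layered merge in which later layers win. The sketch below shows one way to express it with object spread; `SnapevalConfig` and `resolveConfig` are hypothetical names, the defaults are copied from the sample config, and snapeval's real loader may work differently.

```typescript
// Shape taken from the sample snapeval.config.json above.
interface SnapevalConfig {
  harness: string;
  inference: string;
  workspace: string;
  runs: number;
  concurrency: number;
}

const DEFAULTS: SnapevalConfig = {
  harness: "copilot-sdk",
  inference: "auto",
  workspace: "../{skill_name}-workspace",
  runs: 1,
  concurrency: 1,
};

// Later spreads override earlier ones: defaults, then project config,
// then skill-dir config, then CLI flags.
// Each layer should omit unset keys, because an explicit `undefined`
// would overwrite a value from an earlier layer.
function resolveConfig(
  project: Partial<SnapevalConfig>,
  skillDir: Partial<SnapevalConfig>,
  cli: Partial<SnapevalConfig>,
): SnapevalConfig {
  return { ...DEFAULTS, ...project, ...skillDir, ...cli };
}

const resolved = resolveConfig({ runs: 3 }, { runs: 5, concurrency: 2 }, { concurrency: 4 });
console.log(resolved.runs, resolved.concurrency, resolved.harness); // 5 4 copilot-sdk
```

In this example the skill-dir value for `runs` beats the project value, and the CLI value for `concurrency` beats both config files.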

## Harness adapters

| Adapter | Description | Default |
|---------|-------------|---------|
| `copilot-sdk` | Programmatic via `@github/copilot-sdk` with native skill loading | yes |
| `copilot-cli` | Shells out to `copilot` CLI binary | no |

The SDK harness loads skills natively via `skillDirectories`, captures full transcripts, and extracts real token counts from `assistant.usage` events.

## Inference adapters

| Adapter | Description |
|---------|-------------|
| `auto` | Uses `@github/copilot-sdk` by default, falls back to GitHub Models API |
| `copilot-sdk` | `@github/copilot-sdk` programmatic |
| `github-models` | GitHub Models API (requires `GITHUB_TOKEN`) |

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md).

## License

[MIT](LICENSE)
