https://github.com/matantsach/snapeval
Semantic snapshot testing for AI skills. Zero assertions. AI-driven. Free inference.
agentskills ai-skills ccpi-plugin copilot-cli developer-tools evaluation semantic-testing snapshot-testing
Last synced: about 2 months ago
- Host: GitHub
- URL: https://github.com/matantsach/snapeval
- Owner: matantsach
- License: mit
- Created: 2026-03-14T22:39:52.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-03-22T10:30:22.000Z (about 2 months ago)
- Last Synced: 2026-03-22T10:56:55.336Z (about 2 months ago)
- Topics: agentskills, ai-skills, ccpi-plugin, copilot-cli, developer-tools, evaluation, semantic-testing, snapshot-testing
- Language: TypeScript
- Size: 528 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
README
# snapeval
Harness-agnostic eval runner for [agentskills.io](https://agentskills.io) skills.
[CI](https://github.com/matantsach/snapeval/actions/workflows/ci.yml)
[npm package](https://www.npmjs.com/package/snapeval)
[License: MIT](https://opensource.org/licenses/MIT)
snapeval runs every eval case **with and without** your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.
```
snapeval — greeter
Baseline = without SKILL.md (raw AI response)
────────────────────────────────────────────────────────────
#1 formal greeting for Eleanor
   Skill: 100% | Baseline: 33% | 5.2s
#2 casual greeting for Marcus
   Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s
#3 pirate greeting for Zoe
   Skill: 100% | Baseline: 67% | 2.5s
────────────────────────────────────────────────────────────
Summary:
  Skill pass rate:    100.0%
  Baseline pass rate: 55.6%
  Improvement:        +44.4%
```
## How it works
1. You write a `SKILL.md` and an `evals.json` with test cases and assertions
2. snapeval runs each eval **twice** — once with your skill loaded, once without (baseline)
3. Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic)
4. A benchmark shows where your skill adds value vs. where the raw AI already handles it
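The benchmark delta in step 4 can be sketched as below. This is a minimal illustration, not snapeval's actual code: the `EvalTally` and `BenchmarkSummary` types and the `summarize` function are hypothetical, and the tallies assume three assertions per eval, which is what the 33% and 67% figures in the sample output suggest.

```typescript
// Hypothetical shape: passed-assertion counts per eval in each mode.
interface EvalTally {
  label: string;
  total: number;           // assertions graded for this eval
  passedWithSkill: number; // passes with SKILL.md loaded
  passedBaseline: number;  // passes without the skill
}

interface BenchmarkSummary {
  skillPassRate: number;    // percent, 0-100
  baselinePassRate: number; // percent, 0-100
  improvement: number;      // difference in percentage points
}

// Assumption: pass rates are pooled over all assertions across evals.
function summarize(tallies: EvalTally[]): BenchmarkSummary {
  const total = tallies.reduce((n, t) => n + t.total, 0);
  if (total === 0) {
    return { skillPassRate: 0, baselinePassRate: 0, improvement: 0 };
  }
  const skill = tallies.reduce((n, t) => n + t.passedWithSkill, 0);
  const baseline = tallies.reduce((n, t) => n + t.passedBaseline, 0);
  const skillPassRate = (100 * skill) / total;
  const baselinePassRate = (100 * baseline) / total;
  return {
    skillPassRate,
    baselinePassRate,
    improvement: skillPassRate - baselinePassRate,
  };
}

const summary = summarize([
  { label: "formal greeting for Eleanor", total: 3, passedWithSkill: 3, passedBaseline: 1 },
  { label: "casual greeting for Marcus", total: 3, passedWithSkill: 3, passedBaseline: 2 },
  { label: "pirate greeting for Zoe", total: 3, passedWithSkill: 3, passedBaseline: 2 },
]);

console.log(summary.skillPassRate.toFixed(1));    // "100.0"
console.log(summary.baselinePassRate.toFixed(1)); // "55.6"
console.log(summary.improvement.toFixed(1));      // "44.4"
```

With these tallies the numbers match the summary in the sample output above.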
## Quick start
### As a Copilot plugin
```bash
copilot plugin install matantsach/snapeval
```
Then in Copilot CLI, just say `evaluate my skill` — the snapeval skill handles the rest.
### Standalone CLI
```bash
git clone https://github.com/matantsach/snapeval.git
cd snapeval && npm install
npx tsx bin/snapeval.ts eval
```
## Eval format
```
my-skill/
├── SKILL.md
└── evals/
    ├── evals.json
    └── scripts/          ← optional deterministic checks
        └── validate.sh
```
**evals.json:**
```json
{
  "skill_name": "greeter",
  "evals": [
    {
      "id": 1,
      "label": "formal greeting for Eleanor",
      "prompt": "Can you give me a formal greeting for Eleanor?",
      "expected_output": "Returns the formal greeting addressed to Eleanor.",
      "assertions": [
        "Output contains the name Eleanor",
        "Output uses a formal tone",
        "script:validate.sh"
      ]
    }
  ]
}
```
| Field | Required | Description |
|-------|----------|-------------|
| `id` | yes | Unique numeric identifier |
| `prompt` | yes | The user prompt sent to the harness |
| `expected_output` | yes | Human description of the expected behavior |
| `label` | no | Human-readable name shown in terminal output |
| `slug` | no | Filesystem-safe name for the eval directory |
| `assertions` | no | List of assertions to grade (LLM semantic or `script:` prefixed) |
| `files` | no | Input files to attach to the prompt |
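For a typed view of this format, the field table can be written as TypeScript interfaces with a small pre-flight check. This is a sketch built only from the table and the example above; the interface names, the `validateEvals` helper, and the type of `files` (assumed to be a list of paths) are illustrative and are not taken from snapeval's source.

```typescript
// Types inferred from the field table; not snapeval's own definitions.
interface EvalCase {
  id: number;
  prompt: string;
  expected_output: string;
  label?: string;
  slug?: string;
  assertions?: string[];
  files?: string[]; // assumption: file paths
}

interface EvalsFile {
  skill_name: string;
  evals: EvalCase[];
}

// Returns a list of problems; an empty list means the required fields are present.
function validateEvals(data: unknown): string[] {
  if (typeof data !== "object" || data === null) {
    return ["top level must be an object"];
  }
  const evals = (data as { evals?: unknown }).evals;
  if (!Array.isArray(evals)) {
    return ['"evals" must be an array'];
  }
  const errors: string[] = [];
  const seenIds = new Set<number>();
  evals.forEach((raw: unknown, i: number) => {
    if (typeof raw !== "object" || raw === null) {
      errors.push(`evals[${i}] must be an object`);
      return;
    }
    const e = raw as Partial<EvalCase>;
    if (typeof e.id !== "number") {
      errors.push(`evals[${i}]: "id" must be a number`);
    } else if (seenIds.has(e.id)) {
      errors.push(`evals[${i}]: duplicate id ${e.id}`);
    } else {
      seenIds.add(e.id);
    }
    if (typeof e.prompt !== "string") {
      errors.push(`evals[${i}]: "prompt" is required`);
    }
    if (typeof e.expected_output !== "string") {
      errors.push(`evals[${i}]: "expected_output" is required`);
    }
  });
  return errors;
}

const example: EvalsFile = {
  skill_name: "greeter",
  evals: [
    {
      id: 1,
      prompt: "Can you give me a formal greeting for Eleanor?",
      expected_output: "Returns the formal greeting addressed to Eleanor.",
    },
  ],
};
console.log(validateEvals(example)); // []
```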
### Assertions
**Semantic** — graded by an LLM. Write specific, verifiable statements:
```
"Output contains a YAML block with an 'id' field for each issue"
"Response declines because the pipeline already has unclaimed issues"
```
**Script** — prefix with `script:`. Scripts live in `evals/scripts/`, receive the output directory as `$1`, and pass on exit code 0:
```
"script:validate-json-structure.sh"
```
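The two assertion kinds can be told apart by the `script:` prefix alone. The sketch below shows one way to classify an assertion string and resolve a script to its location under `evals/scripts/`; the `Assertion` type and `parseAssertion` function are hypothetical names used for illustration, not part of snapeval's API.

```typescript
// Hypothetical tagged union for the two assertion kinds described above.
type Assertion =
  | { kind: "semantic"; text: string }
  | { kind: "script"; scriptPath: string };

const SCRIPT_PREFIX = "script:";

// Classifies a raw assertion string from evals.json.
// Script names are resolved relative to <skillDir>/evals/scripts/.
function parseAssertion(raw: string, skillDir: string): Assertion {
  if (raw.startsWith(SCRIPT_PREFIX)) {
    const name = raw.slice(SCRIPT_PREFIX.length).trim();
    return { kind: "script", scriptPath: `${skillDir}/evals/scripts/${name}` };
  }
  return { kind: "semantic", text: raw };
}

console.log(parseAssertion("script:validate.sh", "my-skill"));
// { kind: 'script', scriptPath: 'my-skill/evals/scripts/validate.sh' }
console.log(parseAssertion("Output uses a formal tone", "my-skill"));
// { kind: 'semantic', text: 'Output uses a formal tone' }
```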
## CLI reference
### `eval`
Run evals, grade assertions, compute benchmark.
```bash
npx snapeval eval [skill-dir] [options]
```
| Flag | Description | Default |
|------|-------------|---------|
| `--harness` | Harness adapter | `copilot-sdk` |
| `--inference` | Inference adapter for grading | `auto` |
| `--workspace` | Output directory | `../{skill_name}-workspace` |
| `--runs` | Harness invocations per eval for statistical averaging | `1` |
| `--concurrency` | Parallel eval cases (1-10) | `1` |
| `--only` | Run specific eval IDs (e.g. `--only 1,3,5`) | all |
| `--threshold` | Minimum pass rate 0-1 for exit code 0 | none |
| `--old-skill` | Compare against old skill version | none |
| `--feedback` | Write feedback.json template for human review | off |
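The value constraints in the table (a comma-separated ID list for `--only`, 1-10 for `--concurrency`, 0-1 for `--threshold`) could be checked along these lines. The `parseOnly` and `requireRange` helpers are hypothetical and only illustrate the documented ranges; how snapeval itself parses flags is not shown here.

```typescript
// Parses a value such as "1,3,5" into a list of numeric eval IDs.
function parseOnly(value: string): number[] {
  return value.split(",").map((part) => {
    const text = part.trim();
    const id = Number(text);
    if (text === "" || !Number.isInteger(id)) {
      throw new Error(`--only: "${part}" is not a valid eval ID`);
    }
    return id;
  });
}

// Rejects values outside the inclusive range [min, max].
function requireRange(flag: string, value: number, min: number, max: number): number {
  if (!Number.isFinite(value) || value < min || value > max) {
    throw new Error(`${flag} must be between ${min} and ${max}, got ${value}`);
  }
  return value;
}

console.log(parseOnly("1,3,5"));                    // [ 1, 3, 5 ]
console.log(requireRange("--concurrency", 4, 1, 10)); // 4
console.log(requireRange("--threshold", 0.8, 0, 1));  // 0.8
```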
### Exit codes
| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | Threshold not met (eval ran but pass rate below `--threshold`) |
| 2 | Config/input error (bad JSON, missing fields, invalid flags) |
| 3 | File not found (missing skill dir, evals.json, or script) |
| 4 | Runtime error (harness failure, grading failure, timeout) |
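As an illustration of how `--threshold` relates to exit codes 0 and 1, here is a small sketch. The `ExitCode` constant and `thresholdExitCode` function are hypothetical names, and the strict "below" comparison is an assumption based on the wording of the table.

```typescript
// Exit codes as listed in the table above.
const ExitCode = {
  Success: 0,
  ThresholdNotMet: 1,
  ConfigError: 2,
  FileNotFound: 3,
  RuntimeError: 4,
} as const;

// passRate and threshold are fractions between 0 and 1.
// With no threshold set, a completed run is treated as a success.
function thresholdExitCode(passRate: number, threshold?: number): number {
  if (threshold !== undefined && passRate < threshold) {
    return ExitCode.ThresholdNotMet;
  }
  return ExitCode.Success;
}

console.log(thresholdExitCode(0.75, 0.8)); // 1
console.log(thresholdExitCode(0.8, 0.8));  // 0
console.log(thresholdExitCode(0.1));       // 0
```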
## Output artifacts
Each run creates an iteration directory:
```
workspace/
└── iteration-1/
    ├── benchmark.json          ← aggregate stats with delta
    ├── SKILL.md.snapshot       ← copy of skill used
    └── eval-{slug}/
        ├── with_skill/
        │   ├── outputs/output.txt
        │   ├── timing.json
        │   ├── grading.json
        │   └── transcript.log
        └── without_skill/
            ├── outputs/output.txt
            ├── timing.json
            └── grading.json
```
**benchmark.json** includes metadata: `eval_count`, `eval_ids`, `skill_name`, `runs_per_eval`, `timestamp`.
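When post-processing a run, it can help to build artifact paths from the layout above rather than hard-coding them. The helper below is a sketch that mirrors the directory tree; `artifactPaths` and its types are illustrative names, and `transcript.log` is included only for `with_skill` because that is the only place the tree shows it.

```typescript
type Mode = "with_skill" | "without_skill";

interface ArtifactPaths {
  output: string;
  timing: string;
  grading: string;
  transcript?: string; // listed only under with_skill in the tree above
}

// Builds paths for one eval in one mode, following the documented layout.
function artifactPaths(
  workspace: string,
  iteration: number,
  slug: string,
  mode: Mode,
): ArtifactPaths {
  const dir = `${workspace}/iteration-${iteration}/eval-${slug}/${mode}`;
  const paths: ArtifactPaths = {
    output: `${dir}/outputs/output.txt`,
    timing: `${dir}/timing.json`,
    grading: `${dir}/grading.json`,
  };
  if (mode === "with_skill") {
    paths.transcript = `${dir}/transcript.log`;
  }
  return paths;
}

console.log(artifactPaths("workspace", 1, "formal-greeting", "with_skill").grading);
// workspace/iteration-1/eval-formal-greeting/with_skill/grading.json
```

The slug `formal-greeting` is a made-up example value.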
## CI integration
```yaml
name: Skill Evaluation
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      - run: npm ci
      - run: npx snapeval eval skills/my-skill --threshold 0.8 --runs 3
```
snapeval exits with code 1 when the pass rate falls below the threshold, which fails the job and blocks the PR.
## Configuration
Create `snapeval.config.json` in your skill or project root:
```json
{
  "harness": "copilot-sdk",
  "inference": "auto",
  "workspace": "../{skill_name}-workspace",
  "runs": 1,
  "concurrency": 1
}
```
Resolution order: defaults → project config → skill-dir config → CLI flags.
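One way to read this resolution order is as a layered merge in which each later source overrides the earlier ones. The sketch below assumes exactly that; `SnapevalConfig`, `resolveConfig`, and `expandWorkspace` are hypothetical names, and the default values are copied from the example config above.

```typescript
interface SnapevalConfig {
  harness: string;
  inference: string;
  workspace: string;
  runs: number;
  concurrency: number;
}

const DEFAULTS: SnapevalConfig = {
  harness: "copilot-sdk",
  inference: "auto",
  workspace: "../{skill_name}-workspace",
  runs: 1,
  concurrency: 1,
};

// Layers are applied left to right, so later layers win (assumed semantics).
function resolveConfig(...layers: Partial<SnapevalConfig>[]): SnapevalConfig {
  return layers.reduce<SnapevalConfig>(
    (acc, layer) => ({ ...acc, ...layer }),
    DEFAULTS,
  );
}

// Substitutes the {skill_name} placeholder in the workspace path.
function expandWorkspace(workspace: string, skillName: string): string {
  return workspace.replace("{skill_name}", skillName);
}

const projectConfig = { runs: 3 };
const skillDirConfig = {};
const cliFlags = { runs: 5, concurrency: 4 };

const config = resolveConfig(projectConfig, skillDirConfig, cliFlags);
console.log(config.runs);                                  // 5
console.log(config.harness);                               // copilot-sdk
console.log(expandWorkspace(config.workspace, "greeter")); // ../greeter-workspace
```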
## Harness adapters
| Adapter | Description | Default |
|---------|-------------|---------|
| `copilot-sdk` | Programmatic via `@github/copilot-sdk` with native skill loading | yes |
| `copilot-cli` | Shells out to `copilot` CLI binary | no |
The SDK harness loads skills natively via `skillDirectories`, captures full transcripts, and extracts real token counts from `assistant.usage` events.
## Inference adapters
| Adapter | Description |
|---------|-------------|
| `auto` | Uses `@github/copilot-sdk` by default and falls back to the GitHub Models API |
| `copilot-sdk` | `@github/copilot-sdk` programmatic |
| `github-models` | GitHub Models API (requires `GITHUB_TOKEN`) |
## Contributing
See [CONTRIBUTING.md](CONTRIBUTING.md).
## License
[MIT](LICENSE)