{"id":47264464,"url":"https://github.com/matantsach/snapeval","last_synced_at":"2026-04-01T22:23:25.376Z","repository":{"id":344482794,"uuid":"1181994860","full_name":"matantsach/snapeval","owner":"matantsach","description":"Semantic snapshot testing for AI skills. Zero assertions. AI-driven. Free inference.","archived":false,"fork":false,"pushed_at":"2026-03-22T10:30:22.000Z","size":541,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-22T10:56:55.336Z","etag":null,"topics":["agentskills","ai-skills","ccpi-plugin","copilot-cli","developer-tools","evaluation","semantic-testing","snapshot-testing"],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/matantsach.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-14T22:39:52.000Z","updated_at":"2026-03-22T10:30:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/matantsach/snapeval","commit_stats":null,"previous_names":["matantsach/snapeval"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/matantsach/snapeval","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matantsach%2Fsnapeval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matantsach%2Fsnapeval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matantsach%2Fsnapeval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matantsach%2Fsnapeval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/matantsach","download_url":"https://codeload.github.com/matantsach/snapeval/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/matantsach%2Fsnapeval/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31292639,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-01T21:15:39.731Z","status":"ssl_error","status_checked_at":"2026-04-01T21:15:34.046Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentskills","ai-skills","ccpi-plugin","copilot-cli","developer-tools","evaluation","semantic-testing","snapshot-testing"],"created_at":"2026-03-15T03:01:24.229Z","updated_at":"2026-04-01T22:23:25.366Z","avatar_url":"https://github.com/matantsach.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# snapeval\n\nHarness-agnostic eval runner for [agentskills.io](https://agentskills.io) skills.\n\n[![CI](https://github.com/matantsach/snapeval/actions/workflows/ci.yml/badge.svg)](https://github.com/matantsach/snapeval/actions/workflows/ci.yml)\n[![npm version](https://img.shields.io/npm/v/snapeval.svg)](https://www.npmjs.com/package/snapeval)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\nsnapeval runs every eval case **with and without** your skill, grades assertions, and computes a benchmark delta — so you can see exactly what value your skill adds.\n\n```\nsnapeval — greeter\nBaseline = without SKILL.md (raw AI response)\n────────────────────────────────────────────────────────────\n  #1 formal greeting for Eleanor\n    Skill: 100% | Baseline: 33% | 5.2s\n  #2 casual greeting for Marcus\n    Skill: 100% ↑ was 67% | Baseline: 67% | 2.7s\n  #3 pirate greeting for Zoe\n    Skill: 100% | Baseline: 67% | 2.5s\n────────────────────────────────────────────────────────────\nSummary:\n  Skill pass rate:    100.0%\n  Baseline pass rate: 55.6%\n  Improvement:        +44.4%\n```\n\n## How it works\n\n1. You write a `SKILL.md` and an `evals.json` with test cases and assertions\n2. snapeval runs each eval **twice** — once with your skill loaded, once without (baseline)\n3. Assertions are graded by an LLM judge (semantic) and/or shell scripts (deterministic)\n4. A benchmark shows where your skill adds value vs. where the raw AI already handles it\n\n## Quick start\n\n### As a Copilot plugin\n\n```bash\ncopilot plugin install matantsach/snapeval\n```\n\nThen in Copilot CLI, just say `evaluate my skill` — the snapeval skill handles the rest.\n\n### Standalone CLI\n\n```bash\ngit clone https://github.com/matantsach/snapeval.git\ncd snapeval \u0026\u0026 npm install\nnpx tsx bin/snapeval.ts eval \u003cskill-dir\u003e\n```\n\n## Eval format\n\n```\nmy-skill/\n├── SKILL.md\n└── evals/\n    ├── evals.json\n    └── scripts/         ← optional deterministic checks\n        └── validate.sh\n```\n\n**evals.json:**\n\n```json\n{\n  \"skill_name\": \"greeter\",\n  \"evals\": [\n    {\n      \"id\": 1,\n      \"label\": \"formal greeting for Eleanor\",\n      \"prompt\": \"Can you give me a formal greeting for Eleanor?\",\n      \"expected_output\": \"Returns the formal greeting addressed to Eleanor.\",\n      \"assertions\": [\n        \"Output contains the name Eleanor\",\n        \"Output uses a formal tone\",\n        \"script:validate.sh\"\n      ]\n    }\n  ]\n}\n```\n\n| Field | Required | Description |\n|-------|----------|-------------|\n| `id` | yes | Unique numeric identifier |\n| `prompt` | yes | The user prompt sent to the harness |\n| `expected_output` | yes | Human description of the expected behavior |\n| `label` | no | Human-readable name shown in terminal output |\n| `slug` | no | Filesystem-safe name for the eval directory |\n| `assertions` | no | List of assertions to grade (LLM semantic or `script:` prefixed) |\n| `files` | no | Input files to attach to the prompt |\n\n### Assertions\n\n**Semantic** — graded by an LLM. Write specific, verifiable statements:\n\n```\n\"Output contains a YAML block with an 'id' field for each issue\"\n\"Response declines because the pipeline already has unclaimed issues\"\n```\n\n**Script** — prefix with `script:`. Scripts live in `evals/scripts/`, receive the output directory as `$1`, and pass on exit code 0:\n\n```\n\"script:validate-json-structure.sh\"\n```\n\n## CLI reference\n\n### `eval`\n\nRun evals, grade assertions, compute benchmark.\n\n```bash\nnpx snapeval eval [skill-dir] [options]\n```\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--harness \u003cname\u003e` | Harness adapter | `copilot-sdk` |\n| `--inference \u003cname\u003e` | Inference adapter for grading | `auto` |\n| `--workspace \u003cpath\u003e` | Output directory | `../{skill_name}-workspace` |\n| `--runs \u003cn\u003e` | Harness invocations per eval for statistical averaging | `1` |\n| `--concurrency \u003cn\u003e` | Parallel eval cases (1-10) | `1` |\n| `--only \u003cids\u003e` | Run specific eval IDs (e.g. `--only 1,3,5`) | all |\n| `--threshold \u003crate\u003e` | Minimum pass rate 0-1 for exit code 0 | none |\n| `--old-skill \u003cpath\u003e` | Compare against old skill version | none |\n| `--feedback` | Write feedback.json template for human review | off |\n\n### Exit codes\n\n| Code | Meaning |\n|------|---------|\n| 0 | Success |\n| 1 | Threshold not met (eval ran but pass rate below `--threshold`) |\n| 2 | Config/input error (bad JSON, missing fields, invalid flags) |\n| 3 | File not found (missing skill dir, evals.json, or script) |\n| 4 | Runtime error (harness failure, grading failure, timeout) |\n\n## Output artifacts\n\nEach run creates an iteration directory:\n\n```\nworkspace/\n└── iteration-1/\n    ├── benchmark.json       ← aggregate stats with delta\n    ├── SKILL.md.snapshot    ← copy of skill used\n    └── eval-{slug}/\n        ├── with_skill/\n        │   ├── outputs/output.txt\n        │   ├── timing.json\n        │   ├── grading.json\n        │   └── transcript.log\n        └── without_skill/\n            ├── outputs/output.txt\n            ├── timing.json\n            └── grading.json\n```\n\n**benchmark.json** includes metadata: `eval_count`, `eval_ids`, `skill_name`, `runs_per_eval`, `timestamp`.\n\n## CI integration\n\n```yaml\nname: Skill Evaluation\non: [pull_request]\n\njobs:\n  eval:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - uses: actions/setup-node@v4\n        with:\n          node-version: 22\n      - run: npm ci\n      - run: npx snapeval eval skills/my-skill --threshold 0.8 --runs 3\n```\n\nExit code 1 when pass rate falls below threshold — blocks the PR.\n\n## Configuration\n\nCreate `snapeval.config.json` in your skill or project root:\n\n```json\n{\n  \"harness\": \"copilot-sdk\",\n  \"inference\": \"auto\",\n  \"workspace\": \"../{skill_name}-workspace\",\n  \"runs\": 1,\n  \"concurrency\": 1\n}\n```\n\nResolution order: defaults → project config → skill-dir config → CLI flags.\n\n## Harness adapters\n\n| Adapter | Description | Default |\n|---------|-------------|---------|\n| `copilot-sdk` | Programmatic via `@github/copilot-sdk` with native skill loading | yes |\n| `copilot-cli` | Shells out to `copilot` CLI binary | no |\n\nThe SDK harness loads skills natively via `skillDirectories`, captures full transcripts, and extracts real token counts from `assistant.usage` events.\n\n## Inference adapters\n\n| Adapter | Description |\n|---------|-------------|\n| `auto` | Uses `@github/copilot-sdk` by default, falls back to GitHub Models API |\n| `copilot-sdk` | `@github/copilot-sdk` programmatic |\n| `github-models` | GitHub Models API (requires `GITHUB_TOKEN`) |\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatantsach%2Fsnapeval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmatantsach%2Fsnapeval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmatantsach%2Fsnapeval/lists"}