# Skillgrade

The easiest way to evaluate your [Agent Skills](https://agentskills.io/home).
Tests that AI agents correctly discover and use your skills.

See [examples/](examples/) — [superlint](examples/superlint/) (simple) and [angular-modern](examples/angular-modern/) (TypeScript grader).

![Browser Preview](https://raw.githubusercontent.com/mgechev/skillgrade/main/assets/browser-preview.png)

## Quick Start

**Prerequisites**: Node.js 20+, Docker

```bash
npm i -g skillgrade
```

**1. Initialize** — go to your skill directory (must have `SKILL.md`) and scaffold:

```bash
cd my-skill/
GEMINI_API_KEY=your-key skillgrade init    # or ANTHROPIC_API_KEY / OPENAI_API_KEY
# Use --force to overwrite an existing eval.yaml
```

This generates an `eval.yaml` with AI-generated tasks and graders. Without an API key, it creates a well-commented template instead.

**2. Edit** — customize `eval.yaml` for your skill (see [eval.yaml Reference](#evalyaml-reference)).

**3. Run**:

```bash
GEMINI_API_KEY=your-key skillgrade --smoke
```

The agent is auto-detected from your API key: `GEMINI_API_KEY` → Gemini, `ANTHROPIC_API_KEY` → Claude, `OPENAI_API_KEY` → Codex. Override with `--agent=claude`.

**4. Review**:

```bash
skillgrade preview          # CLI report
skillgrade preview browser  # web UI → http://localhost:3847
```

Reports are saved to `$TMPDIR/skillgrade/<skill-name>/results/`.
Override with `--output=DIR`.

## Presets

| Flag | Trials | Use Case |
|------|--------|----------|
| `--smoke` | 5 | Quick capability check |
| `--reliable` | 15 | Reliable pass rate estimate |
| `--regression` | 30 | High-confidence regression detection |

## Options

| Flag | Description |
|------|-------------|
| `--eval=NAME[,NAME]` | Run specific evals by name (comma-separated) |
| `--grader=TYPE` | Run only graders of a type (`deterministic` or `llm_rubric`) |
| `--trials=N` | Override trial count |
| `--parallel=N` | Run trials concurrently |
| `--agent=gemini\|claude\|codex` | Override agent (default: auto-detect from API key) |
| `--provider=docker\|local` | Override provider |
| `--output=DIR` | Output directory (default: `$TMPDIR/skillgrade`) |
| `--validate` | Verify graders using reference solutions |
| `--ci` | CI mode: exit non-zero if below threshold |
| `--threshold=0.8` | Pass rate threshold for CI mode |
| `--preview` | Show CLI results after running |

## eval.yaml Reference

```yaml
version: "1"

# Optional: explicit path to skill directory (defaults to auto-detecting SKILL.md)
# skill: path/to/my-skill

defaults:
  agent: gemini          # gemini | claude | codex
  provider: docker       # docker | local
  trials: 5
  timeout: 300           # seconds
  threshold: 0.8         # for --ci mode
  grader_model: gemini-3-flash-preview  # default LLM grader model
  docker:
    base: node:20-slim
    setup: |             # extra commands run during image build
      apt-get update && apt-get install -y jq
  environment:           # container resource limits
    cpus: 2
    memory_mb: 2048

tasks:
  - name: fix-linting-errors
    instruction: |
      Use the superlint tool to fix coding standard violations in app.js.

    workspace:                           # files copied into the container
      - src: fixtures/broken-app.js
        dest: app.js
      - src: bin/superlint
        dest: /usr/local/bin/superlint
        chmod: "+x"

    graders:
      - type: deterministic
        setup: npm install typescript    # grader-specific deps (optional)
        run: npx ts-node graders/check.ts
        weight: 0.7
      - type: llm_rubric
        rubric: |
          Did the agent follow the check → fix → verify workflow?
        model: gemini-2.0-flash          # optional model override
        weight: 0.3

    # Per-task overrides (optional)
    agent: claude
    trials: 10
    timeout: 600
```

String values (`instruction`, `rubric`, `run`) support **file references** — if the value is a valid file path, its contents are read automatically:

```yaml
instruction: instructions/fix-linting.md
rubric: rubrics/workflow-quality.md
```

## Graders

### Deterministic

Runs a command and parses JSON from its stdout:

```yaml
- type: deterministic
  run: bash graders/check.sh
  weight: 0.7
```

Output format:

```json
{
  "score": 0.5,
  "details": "1/2 checks passed",
  "checks": [
    {"name": "file-created", "passed": true, "message": "Output file exists"},
    {"name": "content-correct", "passed": false, "message": "Missing expected output"}
  ]
}
```

`score` (0.0–1.0) and `details` are required.
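The `npx ts-node graders/check.ts` grader from the reference config above can emit this payload. A minimal TypeScript sketch (a hypothetical `graders/check.ts`, assuming the task is expected to produce an `output.txt`; not shipped with skillgrade):

```typescript
// Hypothetical graders/check.ts: emits the grader JSON contract on stdout.
// Assumes the task under test should produce an output.txt file.
import { existsSync, readFileSync } from "node:fs";

interface Check {
  name: string;
  passed: boolean;
  message: string;
}

const fileExists = existsSync("output.txt");
const content = fileExists ? readFileSync("output.txt", "utf8") : "";
const hasExpected = content.includes("expected");

const checks: Check[] = [
  {
    name: "file-created",
    passed: fileExists,
    message: fileExists ? "Output file exists" : "output.txt is missing",
  },
  {
    name: "content-correct",
    passed: hasExpected,
    message: hasExpected ? "Content correct" : "Missing expected output",
  },
];

const passed = checks.filter((c) => c.passed).length;
const score = passed / checks.length;

// score and details are required; checks gives per-item detail in reports.
console.log(
  JSON.stringify({ score, details: `${passed}/${checks.length} checks passed`, checks })
);
```

Run it as in the config above; the grader's `setup:` step installs its dependencies inside the container.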
`checks` is optional.

**Bash example:**

```bash
#!/bin/bash
passed=0; total=2
c1_pass=false c1_msg="File missing"
c2_pass=false c2_msg="Content wrong"

if test -f output.txt; then
  passed=$((passed + 1)); c1_pass=true; c1_msg="File exists"
fi
if grep -q "expected" output.txt 2>/dev/null; then
  passed=$((passed + 1)); c2_pass=true; c2_msg="Content correct"
fi

score=$(awk "BEGIN {printf \"%.2f\", $passed/$total}")
echo "{\"score\":$score,\"details\":\"$passed/$total passed\",\"checks\":[{\"name\":\"file\",\"passed\":$c1_pass,\"message\":\"$c1_msg\"},{\"name\":\"content\",\"passed\":$c2_pass,\"message\":\"$c2_msg\"}]}"
```

> Use `awk` for arithmetic — `bc` is not available in `node:20-slim`.

### LLM Rubric

Evaluates the agent's session transcript against qualitative criteria:

```yaml
- type: llm_rubric
  rubric: |
    Workflow Compliance (0-0.5):
    - Did the agent follow the mandatory 3-step workflow?

    Efficiency (0-0.5):
    - Completed in ≤5 commands?
  weight: 0.3
  model: gemini-2.0-flash    # optional, auto-detected from API key
```

The grading model defaults to Gemini or Anthropic, based on which API key is available.
Override with the `model` field.

### Combining Graders

```yaml
graders:
  - type: deterministic
    run: bash graders/check.sh
    weight: 0.7      # 70% — did it work?
  - type: llm_rubric
    rubric: rubrics/quality.md
    weight: 0.3      # 30% — was the approach good?
```

Final reward = `Σ (grader_score × weight) / Σ weight`

For example, a deterministic score of 1.0 and a rubric score of 0.5 yield `(1.0 × 0.7 + 0.5 × 0.3) / (0.7 + 0.3) = 0.85`.

## CI Integration

Use `--provider=local` in CI — the runner is already an ephemeral sandbox, so Docker adds overhead without benefit.

```yaml
# .github/workflows/skillgrade.yml
- run: |
    npm i -g skillgrade
    cd skills/superlint
    GEMINI_API_KEY=${{ secrets.GEMINI_API_KEY }} skillgrade --regression --ci --provider=local
```

Exits with code 1 if the pass rate falls below `--threshold` (default: 0.8).

> **Tip**: Use `docker` (the default) for local development to protect your machine. In CI, `local` is faster and simpler.

## Environment Variables

| Variable | Used by |
|----------|---------|
| `GEMINI_API_KEY` | Agent execution, LLM grading, `skillgrade init` |
| `ANTHROPIC_API_KEY` | Agent execution, LLM grading, `skillgrade init` |
| `OPENAI_API_KEY` | Agent execution (Codex), `skillgrade init` |

Variables are also loaded from `.env` in the skill directory. Shell values override `.env`. All values are **redacted** from persisted session logs.

## Best Practices

- **Grade outcomes, not steps.** Check that the file was fixed, not that the agent ran a specific command.
- **Instructions must name output files.** If the grader checks for `output.html`, the instruction must tell the agent to save as `output.html`.
- **Validate graders first.** Use `--validate` with a reference solution before running real evals.
- **Start small.** 3–5 well-designed tasks beat 50 noisy ones.

For a comprehensive guide to writing high-quality skills, check out [skills-best-practices](https://github.com/mgechev/skills-best-practices/).
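To make "grade outcomes, not steps" concrete, compare a hypothetical outcome check with a brittle step check (illustrative only; the file name, `var ` heuristic, and command string are assumptions, not part of skillgrade):

```typescript
import { readFileSync } from "node:fs";

// Outcome-based: inspects the final artifact, so it is robust to
// however the agent got there.
function gradeOutcome(): boolean {
  // Hypothetical check: the fixed file no longer uses `var ` declarations.
  const source = readFileSync("app.js", "utf8");
  return !source.includes("var ");
}

// Step-based (avoid): greps the transcript for one exact command the agent
// "should" have run, and fails whenever the agent finds another valid path.
function gradeSteps(transcript: string): boolean {
  return transcript.includes("superlint --fix app.js");
}
```

The outcome check passes any valid solution; the step check penalizes agents that reach the same result differently.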
You can also install the skill-creator skill to help you author skills:

```bash
npx skills add mgechev/skills-best-practices
```

## License

MIT

---
*Inspired by [SkillsBench](https://arxiv.org/html/2602.12670v1) and [Demystifying Evals for AI Agents](https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents).*