{"id":50703865,"url":"https://github.com/doramirdor/skillci","last_synced_at":"2026-06-09T10:01:53.880Z","repository":{"id":362960763,"uuid":"1255348374","full_name":"doramirdor/skillci","owner":"doramirdor","description":null,"archived":false,"fork":false,"pushed_at":"2026-06-06T17:46:08.000Z","size":751,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-06T19:17:46.135Z","etag":null,"topics":["agent-config","ci","claude-code","codex","coding-agents","cursor","llm","regression-testing","typescript"],"latest_commit_sha":null,"homepage":"https://doramirdor.github.io/skillci/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/doramirdor.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-31T18:07:50.000Z","updated_at":"2026-06-06T17:46:11.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/doramirdor/skillci","commit_stats":null,"previous_names":["doramirdor/skillci"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/doramirdor/skillci","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doramirdor%2Fskillci","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doramirdor%2Fskillci/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doramirdor%2Fskillci/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doramirdor%2Fskillci/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/doramirdor","download_url":"https://codeload.github.com/doramirdor/skillci/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/doramirdor%2Fskillci/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34101070,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-config","ci","claude-code","codex","coding-agents","cursor","llm","regression-testing","typescript"],"created_at":"2026-06-09T10:01:52.841Z","updated_at":"2026-06-09T10:01:53.874Z","avatar_url":"https://github.com/doramirdor.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"./assets/banner.svg\" alt=\"SkillCI — CI for your AI config\" width=\"820\" /\u003e\n\n\u003cp\u003e\u003cstrong\u003eCI/CD for coding-agent configuration.\u003c/strong\u003e\u003cbr/\u003e\nTest, score, and gate changes to your skills, hooks, rules, and \u003ccode\u003eCLAUDE.md\u003c/code\u003e\u003cbr/\u003e\n\u003cem\u003ebefore\u003c/em\u003e you trust them in Claude Code, Cursor, or Codex.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003e\u003ca href=\"https://doramirdor.github.io/skillci/\"\u003e🌐 Website\u003c/a\u003e\u003c/strong\u003e ·\n\u003ca href=\"./docs/\"\u003eDocs\u003c/a\u003e ·\n\u003ca href=\"./docs/getting-started.md\"\u003eGetting Started\u003c/a\u003e\u003c/p\u003e\n\n\u003cp\u003e\n  \u003cimg alt=\"CI\" src=\"https://github.com/doramirdor/skillci/actions/workflows/ci.yml/badge.svg\"\u003e\n  \u003cimg alt=\"tests\" src=\"https://img.shields.io/badge/tests-225%20passing-22c55e\"\u003e\n  \u003cimg alt=\"typescript\" src=\"https://img.shields.io/badge/TypeScript-strict-3178c6\"\u003e\n  \u003cimg alt=\"node\" src=\"https://img.shields.io/badge/node-%E2%89%A520-339933\"\u003e\n  \u003cimg alt=\"license\" src=\"https://img.shields.io/badge/license-MIT-6366f1\"\u003e\n  \u003cimg alt=\"PRs welcome\" src=\"https://img.shields.io/badge/PRs-welcome-22d3ee\"\u003e\n\u003c/p\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"./assets/demo.gif\" alt=\"SkillCI demo — baseline vs candidate evaluation producing a verdict\" width=\"860\" /\u003e\n\n\u003c/div\u003e\n\n---\n\n## Why\n\nYou wouldn't merge an untested code change. Yet teams ship changes to the\nconfiguration that *steers* their coding agents — `CLAUDE.md`, skills, hooks,\ncursor rules, slash commands — on **vibes**, with zero testing.\n\nA new instruction in a shared `CLAUDE.md` can quietly make the agent slower,\npricier, or worse at the job, and nobody notices for weeks. Agent config is now\na production artifact. **SkillCI is regression testing for it.**\n\n\u003e **Despite the name, SkillCI isn't only about skills.** It evaluates *every*\n\u003e kind of agent config artifact — skills, hooks, rules, instructions\n\u003e (`CLAUDE.md` / `AGENTS.md`), slash commands, and settings — across Claude\n\u003e Code, Cursor, and Codex. \"Skill\" is just the headline artifact.\n\n## What it does\n\n```\n          ┌─ baseline config ─┐                      ┌──────────┐\n tasks ──►│                   ├─► run agent ─► score ─┤ compare  ├─► VERDICT\n          └─ candidate config ┘   (sandboxed)         └──────────┘   improved /\n                                                                     neutral /\n                                                                     regressed\n```\n\n1. A **candidate** is a proposed change to one or more agent config artifacts.\n2. SkillCI runs a suite of sandboxed repo **tasks twice** — once with the\n   **baseline** (trusted) config, once with the **candidate**.\n3. Each run drives a real coding agent **headlessly** in an isolated fixture repo.\n4. Every outcome is scored three ways (below) and aggregated into one composite.\n5. A **comparator** emits a verdict — `improved` / `neutral` / `regressed` — with\n   an explicit list of any regressions.\n6. A **pull request opens ONLY** when the verdict is `improved` with **zero hard\n   regressions**. Otherwise the gate blocks.\n\n### Three scoring dimensions\n\n| Dimension | What it measures |\n| --------- | ---------------- |\n| **Objective** | Deterministic pass/fail: command exit codes, file existence/contents, test suites, expected diffs. |\n| **LLM-as-judge** | Rubric-scored qualitative judgment for what objective checks can't capture. |\n| **Cost / efficiency** | Tokens, tool calls, steps, wall-clock — a cheaper candidate is rewarded; a pricier one penalized. |\n\n## Quickstart (offline, zero cost)\n\n```bash\nnpm install\nnpm run demo      # deterministic end-to-end run via the mock adapter — no network, no keys\nnpm test          # 225 tests, fully offline\nnpm run typecheck\n```\n\nThe demo runs the whole pipeline (discover → sandbox → run → score → compare →\nverdict → PR dry-run) with a deterministic `MockAgentAdapter`. No API key, no\nnetwork.\n\n## Live mode — runs on a Claude subscription, no API key\n\nSkillCI drives **real Claude Code** through the `claude` CLI, so it works on a\nClaude Code **subscription / OAuth session with no `ANTHROPIC_API_KEY`**:\n\n```bash\n# Build the CLI\nnpm run build\n\n# Evaluate a candidate config against the baseline, live\nnode dist/cli/index.js run \\\n  --agent claude-code \\\n  --baseline ./config/baseline \\\n  --candidate ./config/candidate\n```\n\n- **Agent runs** via `claude -p \"\u003cprompt\u003e\" --output-format json --dangerously-skip-permissions`\n  inside a disposable sandbox.\n- **The LLM judge** also runs through `claude -p` when no `ANTHROPIC_API_KEY` is\n  present — so the *entire* pipeline (agent **and** judge) works on a\n  subscription. Set `ANTHROPIC_API_KEY` to route the judge through the Anthropic\n  SDK instead.\n\n\u003e Exit code is a CI gate: **non-zero only when the verdict is `regressed`.**\n\n## Use it as a PR gate (GitHub Action)\n\nDrop this in to evaluate every PR that touches your agent config. A ready-to-copy\ntemplate lives at\n[`examples/skillci-gate.yml`](examples/skillci-gate.yml):\n\n```yaml\nname: skillci-gate\non:\n  pull_request:\n    paths: ['.claude/**', 'CLAUDE.md', '.cursor/**', 'AGENTS.md']\njobs:\n  gate:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v4\n      - uses: actions/setup-node@v4\n        with: { node-version: 20 }\n      - run: npm ci \u0026\u0026 npm run build\n      - run: node dist/cli/index.js run --agent claude-code\n              --baseline \u003ctrusted-config\u003e --candidate \u003cpr-config\u003e\n        env:\n          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}  # optional — enables SDK judge\n```\n\nThe job fails (blocking merge) only when the candidate **regresses**.\n\n## CLI\n\n| Command | What it does |\n| ------- | ------------ |\n| `skillci run` | Baseline-vs-candidate evaluation (or `--demo` for the offline run). |\n| `skillci validate \u003cdir\u003e` | Discover \u0026 validate config artifacts for an agent in a directory. |\n| `skillci tasks` | List available task definitions. |\n\nKey `run` flags: `--agent \u003cclaude-code\\|cursor\\|codex\u003e`, `--baseline \u003cdir\u003e`,\n`--candidate \u003cdir\u003e`, `--tasks \u003cdir\u003e`, `--demo`, `--open-pr`, `--no-color`.\n\n## Supported agents\n\n| Agent | Artifacts | Headless invocation |\n| ----- | --------- | ------------------- |\n| **Claude Code** | skills, hooks, slash commands, `CLAUDE.md`, `.claude/settings.json` | `claude -p \"\u003cprompt\u003e\" --output-format json` *(subscription or API key)* |\n| **Cursor** | `.cursor/rules/*.mdc`, `.cursorrules` | `cursor-agent` CLI *(best-effort)* |\n| **Codex** | `AGENTS.md` / codex config | `codex exec` *(needs `OPENAI_API_KEY`)* |\n\nReal adapters activate only when their CLI/auth is present and **degrade\ngracefully** otherwise; everything falls back to the deterministic mock.\n\n## Architecture\n\nSelf-contained modules, each in its own directory with colocated tests, all\ncompiling against the shared contracts in\n[`src/core/contracts.ts`](src/core/contracts.ts).\n\n| Module | Responsibility |\n| ------ | -------------- |\n| `core` | Canonical type contracts + zod schemas for the whole domain. |\n| `artifacts` | Discover \u0026 normalize agent config into `Artifact` / `ConfigSet`. |\n| `sandbox` | Create isolated fixture workdirs, run commands, compute file diffs. |\n| `agents` | Agent adapters (Claude / Cursor / Codex + deterministic mock). |\n| `tasks` | Load \u0026 validate task suites and fixtures. |\n| `scoring` | Objective checks, LLM judge (SDK **or** `claude -p`), cost telemetry. |\n| `compare` | Aggregate scores, compute deltas, emit the verdict; `shouldPromote()`. |\n| `report` | Render JSON / markdown / colorized terminal reports. |\n| `pr` | Open a GitHub PR (via `gh`) — dry-run by default, gated on verdict. |\n| `cli` / `orchestrator` | Wire it together; the `skillci` commands. |\n\n## The regression gate (fail-closed)\n\nThe promote decision is intentionally conservative. A candidate is **blocked**\nif any of these hold:\n\n- An **objective pass-rate drop** on any task (hard rule).\n- A **dropped task** (present in baseline, missing in candidate).\n- Aggregate composite gain **≤ `minCompositeGain`**.\n- Any `NaN` / non-finite score.\n\nPromotion requires `improved` **and** zero hard regressions.\n\n## Status \u0026 roadmap\n\nSkillCI is an **MVP**, verified end-to-end live (agent + judge) on a Claude\nsubscription. Honest about where it stands:\n\n- ✅ Full pipeline runs live and offline; 225 tests; offline demo deterministic.\n- ✅ Claude Code adapter + dual-backend judge (SDK / CLI) live-verified.\n- 🚧 **Verdict variance** — real agent runs are non-deterministic; multi-run\n  averaging + a cost-tolerance band are needed before unattended auto-promotion.\n- 🚧 Cursor / Codex adapters are implemented but not yet live-validated.\n- 🚧 Larger task/fixture libraries; container sandbox backend; hosted dashboards.\n\n## Documentation\n\nFull guides live in [`docs/`](./docs/):\n\n- [Getting Started](./docs/getting-started.md) — install, offline demo, first live run\n- [Concepts](./docs/concepts.md) — baseline/candidate, verdict model, the gate\n- [CLI Reference](./docs/cli-reference.md) · [Writing Tasks](./docs/writing-tasks.md) · [Scoring](./docs/scoring.md)\n- [Agents \u0026 Auth](./docs/agents-and-auth.md) · [CI Integration](./docs/ci-integration.md)\n- [Architecture](./docs/architecture.md) · [Troubleshooting](./docs/troubleshooting.md)\n\n## Development\n\n```bash\nnpm run dev          # tsx watch\nnpm run test:watch   # vitest watch\nnpm run build        # emit dist/ (compiled CLI bin)\n```\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md).\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdoramirdor%2Fskillci","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdoramirdor%2Fskillci","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdoramirdor%2Fskillci/lists"}