{"id":49762039,"url":"https://github.com/callstackincubator/skillgym","last_synced_at":"2026-05-11T09:54:50.724Z","repository":{"id":351297837,"uuid":"1199335070","full_name":"callstackincubator/skillgym","owner":"callstackincubator","description":"Prove your agent skills work before you ship them.","archived":false,"fork":false,"pushed_at":"2026-05-11T09:52:17.000Z","size":1249,"stargazers_count":47,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-11T09:54:38.294Z","etag":null,"topics":["agents","skills","testing"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/callstackincubator.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-02T08:51:01.000Z","updated_at":"2026-05-11T07:26:59.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/callstackincubator/skillgym","commit_stats":null,"previous_names":["callstackincubator/skillgym"],"tags_count":9,"template":false,"template_full_name":null,"purl":"pkg:github/callstackincubator/skillgym","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/callstackincubator%2Fskillgym","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/callstackincubator%2Fskillgym/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/callstackincubator%2Fskillgym/releases",
"manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/callstackincubator%2Fskillgym/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/callstackincubator","download_url":"https://codeload.github.com/callstackincubator/skillgym/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/callstackincubator%2Fskillgym/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32889971,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-10T13:40:02.631Z","status":"online","status_checked_at":"2026-05-11T02:00:05.975Z","response_time":120,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","skills","testing"],"created_at":"2026-05-11T09:54:49.143Z","updated_at":"2026-05-11T09:54:50.706Z","avatar_url":"https://github.com/callstackincubator.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"![skillgym-banner](https://incubator.callstack.com/skillgym/banner.jpg)\n\n### Benchmark coding-agent skills by running real agent sessions and asserting on normalized execution reports\n\n[![mit licence][license-badge]][license]\n[![npm downloads][npm-downloads-badge]][npm-downloads]\n[![PRs Welcome][prs-welcome-badge]][prs-welcome]\n\n## Why it's useful\n\nWhen you evaluate agent skills manually, it is hard to tell whether the agent actually selected the right skill, used it at 
the right time, and behaved correctly end to end. `skillgym` gives you a repeatable way to run real sessions, preserve session artifacts, verify results with TypeScript assertions, and catch token regressions with snapshots.\n\n## Supported runners\n\n- OpenCode CLI\n- Codex CLI\n- Claude Code\n- Cursor Agent (Cursor CLI `agent`)\n\n## Quick start\n\nInstall `skillgym` in the project where you want to benchmark agent behavior:\n\n```bash\nnpm install --save-dev skillgym\nyarn add --dev skillgym\npnpm add --save-dev skillgym\nbun add --dev skillgym\n```\n\nCreate `skillgym.config.ts` in your project root, or in a parent directory that the suite can discover upward:\n\n```ts\nimport type { SkillGymConfig } from \"skillgym\";\n\nconst config: SkillGymConfig = {\n  run: {\n    cwd: \".\",\n    outputDir: \"./.skillgym-results\",\n    reporter: \"standard\",\n    schedule: \"serial\",\n    maxParallel: 4,\n    repeat: 3,\n    repeatFailure: 1,\n    maxSteps: 4,\n  },\n  defaults: {\n    timeoutMs: 120_000,\n  },\n  runners: {\n    \"open-main\": {\n      agent: {\n        type: \"opencode\",\n        model: \"openai/gpt-5\",\n      },\n    },\n    \"code-main\": {\n      agent: {\n        type: \"codex\",\n        model: \"gpt-5\",\n      },\n    },\n    \"cursor-main\": {\n      agent: {\n        type: \"cursor-agent\",\n        model: \"composer-2-fast\",\n      },\n    },\n  },\n};\n\nexport default config;\n```\n\nCreate a suite file such as `./skillgym/basic-suite.ts`:\n\n```ts\nimport type { Suite } from \"skillgym\";\nimport { assert } from \"skillgym\";\n\nconst suite: Suite = [\n  {\n    id: \"always-passes\",\n    prompt: \"Say only: skillgym ready\",\n    assert(report, ctx) {\n      assert.match(ctx.finalOutput(), /skillgym ready/);\n    },\n  },\n];\n\nexport default suite;\n```\n\nRun the suite with the package manager you use in that project:\n\n```bash\nnpx skillgym run ./skillgym/basic-suite.ts\nyarn skillgym run ./skillgym/basic-suite.ts\npnpm exec skillgym run 
./skillgym/basic-suite.ts\nbunx skillgym run ./skillgym/basic-suite.ts\n```\n\nView CLI help:\n\n```bash\nnpx skillgym help\n```\n\nList bundled agent-facing skills and read the main one:\n\n```bash\nnpx skillgym skills list\nnpx skillgym skills get core\n```\n\nExplain a failed execution by resuming the original runner session from the exact failed artifact directory:\n\n```bash\nnpx skillgym explain ./.skillgym-results/run-1/case-a/open-main/repeat-1\n```\n\nBy default, `skillgym` uses the built-in `standard` reporter.\n\nTypeScript config, suite, and reporter modules work out of the box on Node `\u003e=22.18.0` using Node's built-in TypeScript stripping.\n\nTypeScript runtime limitations:\n\n- `.ts`, `.mts`, and `.cts` modules are supported\n- `.tsx` is not supported\n- runtime `tsconfig` path aliases are not supported\n- use explicit file extensions in relative imports, for example `./helpers.js`\n- use `import type` for type-only imports\n- TypeScript features that need code generation, such as `enum`, are not supported by default\n\n## What you need to run a suite\n\n- a `skillgym.config.*` file with a non-empty `runners` map\n- at least one configured runner with `agent.type` and `agent.model`\n- the corresponding CLI installed and working in your environment\n- a suite file that exports cases\n\nConfig is discovered upward from the suite file directory. CLI flags override config values.\n\nRunner model selection is required per runner in `runners.\u003cname\u003e.agent.model`.\nUse `agent.model` instead of `commandArgs` when you need to select the agent model, especially for Codex where `--model` must be passed to `codex exec` rather than the outer launcher.\n\n## Runners\n\nA runner is one configured agent target. It tells `skillgym` which CLI to launch and which model to use for a run.\n\nEach case runs once per selected runner. 
For example, 3 cases and 2 runners produce 6 executions.\n\n## Configuration\n\nMost important config properties:\n\n- `run.cwd`: working directory used for shared-workspace executions\n- `run.outputDir`: suite-run artifact directory where artifacts, reports, and preserved workspaces are written\n- `run.reporter`: built-in `standard` reporter or a custom reporter module path\n- `run.schedule`: execution scheduling mode for case x runner pairs\n- `run.maxParallel`: maximum concurrent executions for non-serial schedules, defaulting to available CPU parallelism\n- `run.repeat`: require this many successful repetitions per case x runner execution\n- `run.repeatFailure`: retry the current repetition up to this many additional sessions after failure\n- `run.retryFailed`: compatibility alias for `run.repeatFailure`\n- `run.maxSteps`: best-effort limit on streamed agent steps before skillgym terminates the execution\n- `run.workspace`: default workspace mode for the suite\n- `defaults.timeoutMs`: default per-case timeout\n- `runners.\u003cid\u003e.agent.type`: which agent integration to use, currently `opencode`, `codex`, `claude-code`, or `cursor-agent`\n- `runners.\u003cid\u003e.agent.model`: model passed to that runner\n- `snapshots`: token regression baseline configuration\n\nThe execution unit is one case x runner pair. `skillgym` expands the suite into those pairs, runs them according to `run.schedule`, and writes artifacts for each execution.\n\n`run.schedule` controls execution order:\n\n- `serial`: run every case/runner pair in declaration order\n- `parallel`: run selected case/runner pairs concurrently, capped by `run.maxParallel`\n- `isolated-by-runner`: keep each runner on its own serial queue while different runners may overlap, capped by `run.maxParallel`\n\n`serial` is the default. `parallel` maximizes overlap across the full matrix up to the configured cap. 
`isolated-by-runner` is a middle ground when you want each runner to stay ordered internally but still allow different runners to overlap.\n\nFor concurrent schedules, `run.maxParallel` defaults to `os.availableParallelism()`. This limits how many SkillGym executions are active at once; it does not pin or limit CPU cores used by an individual agent process.\n\nConcurrent schedules do not copy or isolate the workspace by themselves. Overlapping executions may still interact through the same filesystem state and live runner output unless you use isolated workspaces. OpenCode, Codex, and Claude Code runtime state are isolated per execution under each artifact directory.\n\n`run.repeat` is useful when you want stability sampling instead of a single lucky pass. Each case x runner execution keeps running until it records the requested number of successful classified results, or stops early when one repetition still fails after exhausting `run.repeatFailure` retries.\n\n`run.repeatFailure` retries only the current repetition when it still counts as failed after result classification. SkillGym keeps all repetition and retry artifacts, averages visible duration and token metrics across the final results of completed successful repetitions, and preserves the full nested detail in `results.json`.\n\nArtifacts for repeated executions are grouped under the stable case x runner directory using `repeat-N` directories, with retry sessions nested as `session-N` inside the repetition that needed recovery.\n\n`run.maxSteps` is enforced on a best-effort basis by monitoring each runner's streamed JSONL output. A step is one observed model round, not one token and not necessarily one tool call, but the exact boundary is still runner-defined, so the same prompt may consume different numbers of steps across agents. 
When the observed step count exceeds the configured limit, skillgym kills the agent process, fails the execution with origin `max-steps`, and preserves raw stdout/stderr artifacts for debugging. No partial session report is produced for that failure.\n\n## Workspaces\n\nA workspace is the directory where an execution runs.\n\n`skillgym` supports two workspace modes:\n\n- `shared`: run directly in one real directory\n- `isolated`: create a fresh temporary workspace per case x runner execution\n\nUse `shared` when you want the agent to work against your real project checkout. Use `isolated` when you want clean filesystem state per execution or need to prepare each execution from a template.\n\nYou can configure workspaces in `skillgym.config.*` with `run.workspace`, or per suite with a named `workspace` export. Suite-level workspace config overrides config-level `run.workspace`.\n\nIsolated workspace example in a suite:\n\n```ts\nexport const workspace = {\n  mode: \"isolated\",\n  templateDir: \"./fixtures/base-project\",\n  bootstrap: {\n    command: \"npm\",\n    args: [\"install\"],\n  },\n};\n```\n\nIn isolated mode, each execution gets its own workspace. `templateDir` copies a starter project into that workspace, and `bootstrap` runs before the agent starts. 
Successful isolated executions are cleaned up; failed ones are preserved under `outputDir/workspaces` for debugging.\n\nSee [Workspaces](docs/workspaces.md) for the full workspace reference.\n\n## Assertions\n\n`assert` extends Node's `node:assert/strict` helpers, so standard methods like `assert.ok`, `assert.equal`, and `assert.match` still work.\n\n`assert.soft.*` mirrors the same sync assertion surface and collects failures until the current case finishes, so one execution can report multiple assertion mismatches together.\n\nBuilt-in grouped assertions cover:\n\n- `assert.soft.*`\n- `assert.skills.*`\n- `assert.commands.*`\n- `assert.fileReads.*`\n- `assert.toolCalls.*`\n- `assert.output.*`\n\nExample:\n\n```ts\nimport { assert, commandMatcher } from \"skillgym\";\n\nassert.soft.match(report.finalOutput, /ready/i);\nassert.soft.commands.includes(report, \"pnpm test\");\nassert.skills.has(report, \"find-skills\");\nassert.skills.notHas(report, \"upgrading-expo\");\nassert.commands.includes(report, \"npx skills find\");\nassert.commands.includes(report, commandMatcher(\"pnpm\").arg(\"test\").flag(\"--watch\"));\nassert.commands.notIncludes(report, \"npm install\");\nassert.fileReads.includes(report, /find-skills\\/SKILL\\.md$/);\nassert.fileReads.notIncludes(report, /upgrading-expo\\/SKILL\\.md$/);\nassert.toolCalls.has(report, {\n  tool: \"skill\",\n  where: (args) =\u003e (args as { name?: string })?.name === \"find-skills\",\n});\nassert.output.notEmpty(report);\n```\n\n`assert.soft.rejects(...)` and `assert.soft.doesNotReject(...)` are still hard assertions in the first implementation.\n\nGrouped assertions can also persist follow-up questions for later debugging:\n\n```ts\nassert.fileReads.includes(report, /SKILL\\.md$/, {\n  explain: {\n    question: \"Why did you continue without reading SKILL.md first?\",\n  },\n});\n```\n\nWhen at least one explainable assertion fails, skillgym writes `explain.json` into the failed artifact directory. 
Running `skillgym explain \u003cartifactDir\u003e` resumes the original runner session, asks each persisted question, and writes `explanations.json` with the answers.\n\nFor historical isolated-workspace executions, deferred explain currently depends on the recorded workspace still being resumable by the underlying runner.\n\nSee the [assertion reference](docs/assertions.md).\nSee [Deferred Explain](docs/explain.md).\n\n## Snapshots\n\nSnapshot checks can fail executions when token usage regresses beyond a configured tolerance.\n\n```bash\nnpx skillgym run ./examples/basic-suite.ts --update-snapshots\n```\n\nSee the [snapshot guide](docs/snapshot.md).\n\n## Example suites\n\nThe [skill selection suite](examples/skill-selection-suite.ts) targets a real installed skill (`find-skills`) and checks that the runner loads it before invoking `npx skills find`.\n\n```bash\nnpx skillgym run ./examples/skill-selection-suite.ts\n```\n\nThe [workspace isolation suite](examples/workspace-isolation-suite.ts) demonstrates an isolated workspace with a template directory and bootstrap command:\n\n```bash\nnpx skillgym run ./examples/workspace-isolation-suite.ts\n```\n\nThe [failure classification suite](examples/failure-classification-suite.ts) demonstrates `assert.classify(...)` and `classifyFailure(...)` so reporters can group related failures under one class:\n\n```bash\nnpx skillgym run ./examples/failure-classification-suite.ts\n```\n\n## Docs\n\nThe documentation site is at [incubator.callstack.com/skillgym](https://incubator.callstack.com/skillgym/). Repository docs:\n\n- [Docs Overview](docs/readme.md)\n- [Test Cases](docs/cases.md)\n- [Assertions](docs/assertions.md)\n- [Workspaces](docs/workspaces.md)\n- [Reporters](docs/reporters.md)\n- [Snapshots](docs/snapshot.md)\n\n## Made with ❤️ at Callstack\n\n`skillgym` is an open source project and will always remain free to use. If you think it's cool, please star it 🌟. 
[Callstack][callstack-readme-with-love] is a group of React and React Native geeks, contact us at [hello@callstack.com](mailto:hello@callstack.com) if you need any help with these or just want to say hi!\n\nLike the project? ⚛️ [Join the team](https://callstack.com/careers/?utm_campaign=Senior_RN\u0026utm_source=github\u0026utm_medium=readme) who does amazing stuff for clients and drives React Native Open Source! 🔥\n\n[callstack-readme-with-love]: https://callstack.com/?utm_source=github.com\u0026utm_medium=referral\u0026utm_campaign=skillgym\u0026utm_term=readme-with-love\n[license-badge]: https://img.shields.io/npm/l/skillgym?style=for-the-badge\n[license]: https://github.com/callstackincubator/skillgym/blob/main/LICENSE\n[npm-downloads-badge]: https://img.shields.io/npm/dm/skillgym?style=for-the-badge\n[npm-downloads]: https://www.npmjs.com/package/skillgym\n[prs-welcome-badge]: https://img.shields.io/badge/PRs-welcome-brightgreen.svg?style=for-the-badge\n[prs-welcome]: https://github.com/callstackincubator/skillgym\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcallstackincubator%2Fskillgym","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcallstackincubator%2Fskillgym","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcallstackincubator%2Fskillgym/lists"}