{"id":51009095,"url":"https://github.com/wan-huiyan/test-effectiveness-auditor","last_synced_at":"2026-06-21T00:30:45.051Z","repository":{"id":360970670,"uuid":"1220144889","full_name":"wan-huiyan/test-effectiveness-auditor","owner":"wan-huiyan","description":"Quantitatively measure how effective a project's automated test suite is at catching real bugs, by replaying historical incidents at pre-fix commits and classifying CI failure history. A Claude Code skill.","archived":false,"fork":false,"pushed_at":"2026-04-24T15:37:23.000Z","size":21,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-28T17:27:09.568Z","etag":null,"topics":["audit","claude-code","claude-code-skill","incident-replay","test-effectiveness","testing"],"latest_commit_sha":null,"homepage":null,"language":"Shell","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/wan-huiyan.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-04-24T15:37:21.000Z","updated_at":"2026-04-24T15:37:27.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/wan-huiyan/test-effectiveness-auditor","commit_stats":null,"previous_names":["wan-huiyan/test-effectiveness-auditor"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/wan-huiyan/test-effectiveness-auditor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wan-huiyan%2Ftest-effectiveness-auditor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wan-huiyan%2Ftest-effectiveness-auditor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wan-huiyan%2Ftest-effectiveness-auditor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wan-huiyan%2Ftest-effectiveness-auditor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/wan-huiyan","download_url":"https://codeload.github.com/wan-huiyan/test-effectiveness-auditor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/wan-huiyan%2Ftest-effectiveness-auditor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34590213,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-20T02:00:06.407Z","response_time":98,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["audit","claude-code","claude-code-skill","incident-replay","test-effectiveness","testing"],"created_at":"2026-06-21T00:30:44.176Z","updated_at":"2026-06-21T00:30:45.045Z","avatar_url":"https://github.com/wan-huiyan.png","language":"Shell","funding_links":[],"categories":[],"sub_categories":[],"readme":"# test-effectiveness-auditor\n\n[![Claude Code](https://img.shields.io/badge/Claude_Code-skill-orange)](https://claude.com/claude-code) [![license](https://img.shields.io/badge/license-MIT-blue)](LICENSE)\n\nQuantitatively answer the question **\"how helpful are our automated tests at catching bugs?\"** — not by proxy metrics like coverage percent or test count, but by replaying real bugs that already happened and checking whether the test suite, as it stood just before the fix, actually failed on the buggy commit.\n\n## The Problem\n\nMost teams measure test health by coverage % (e.g. `pytest --cov`). Coverage tells you which lines executed, not whether any assertion would have *failed* when the behavior was wrong. A line can be 100% covered by a test that would pass under the bug.\n\nThis skill inverts the question: take known bugs, rewind to the pre-fix commit, and observe whether the suite catches them. The honest baseline for \"are tests worth it\" is historical — bugs that made it to production despite the tests are direct evidence of gaps; CI failures that forced a code change before merge are direct evidence of catches. Everything else is speculation.\n\n## What it does\n\nTwo methods, in priority order:\n\n### Method 1 — Historical incident replay (primary)\n1. Mine incident signals from `docs/findings/`, `docs/issues/`, `docs/diagnostics/`, `docs/audits/`, root-level `discovery_*.md` / `incident_*.md` / `postmortem_*.md`, and git log (`fix|bug|revert|hotfix|incident`).\n2. For each incident, resolve `pre_fix_commit = fix_commit^`, create a git worktree at that SHA, run the project's canonical test command, capture exit code + failing test names.\n3. Classify each incident as `caught` / `gap_testable` / `gap_hard` / `ambiguous` / `unrunnable` (see [`references/classification_taxonomy.md`](plugins/test-effectiveness-auditor/references/classification_taxonomy.md)).\n4. Report per-incident table + catch rate + prioritised gap backlog.\n\n### Method 2 — CI history analysis (secondary)\n1. Pull the last 6 months of CI runs via `gh api` (GitHub Actions) or `gcloud builds list` (Cloud Build).\n2. For each PR-blocking failure, classify as `real_catch` / `author_hygiene` / `flake` / `infra`.\n3. Report the \"effective catch rate\" — what fraction of CI failures represented real-bug catches.\n\n### Out of scope (deliberately)\n- Mutation testing\n- Test-layer ablation\n- Fault injection\n- Auto-writing tests\n\nThese are higher-cost methods whose ROI depends on first knowing the Method 1/2 baseline. The skill produces a human-reviewed gap backlog; the human decides which gaps to close.\n\n## Install\n\nStandalone:\n```bash\nclaude plugin install wan-huiyan/test-effectiveness-auditor\n```\n\nOr via the bundle:\n```bash\nclaude plugin install wan-huiyan/claude-ecosystem-hygiene\n```\n\n## Quick Start\n\n```\nYou: audit our test suite — are we actually catching bugs?\n\nClaude: [invokes test-effectiveness-auditor]\n        Phase 1: discovers test command (pytest tests/), incident signals\n                 (8 docs/findings + 195 fix-grep commits)\n        Phase 2: proposes 8 candidate incidents for replay; you confirm\n        Phase 3: replays each in a temp worktree, runs the suite\n        Phase 4: classifies caught / gap_testable / gap_hard\n        Phase 5: pulls CI history (or notes N/A if no CI tests)\n        Phase 6: writes ~/Documents/\u003cproject\u003e_test_effectiveness_audit.md\n```\n\n## Comparison\n\n| | Without the skill | With the skill |\n|--|------------------|----------------|\n| Question answered | \"Coverage is 73%\" | \"0 of 4 documented bugs were caught at pre-fix commit; here are the 3 cheapest tests that would close the cluster.\" |\n| Effort | Run `pytest --cov` | One conversation |\n| Action it produces | A number | A prioritised gap backlog |\n| Honesty about hard-to-catch bugs | None | `gap_hard` classification + recommendation to use contract tests / monitors |\n\n## Scope guarantees\n\n- **Read-only relative to project source.** Creates temp worktrees, runs tests, reads git history. Never edits project code.\n- **Never auto-writes tests.** Recommendations are produced as a human-reviewed backlog; the user decides which gaps to close.\n- **Idempotent.** Running twice with the same project HEAD and same CI run window produces the same report. Keyed by (project head sha, incident id, CI run id).\n- **Conservative on classification.** When the replay is inconclusive (environment error, missing dep, flake), the incident is classified `unrunnable` rather than forced into caught/gap.\n\n## Limitations\n\n- **Replay requires a runnable test environment.** Projects with heavy external-service dependencies or custom Docker test runners may report high `unrunnable` rates and need Method 2 only.\n- **Classification uses heuristic textual matching for `caught`.** False negatives (tests that WOULD have caught but whose names don't textually match the incident) are possible. The skill is conservative: prefer `gap_testable` on ties and let the human re-classify on review.\n- **CI history analysis requires `gh` CLI auth (GitHub) or `gcloud` auth + project id (Cloud Build).** If the user can't provide these, Method 2 is skipped — the report says so honestly.\n- **Sample size matters.** A 5-incident audit is a directional signal, not statistically significant. The report flags this.\n\n## Related\n\n- [memory-hygiene](https://github.com/wan-huiyan/memory-hygiene) — clean up Claude Code's persistent memory + project docs taxonomy\n- [claude-code-ab-harness](https://github.com/wan-huiyan/claude-code-ab-harness) — A/B-test your Claude Code stack\n- [ecosystem-audit](https://github.com/wan-huiyan/claude-ecosystem-hygiene) — audit Claude Code ecosystem health (skills, sessions, ADRs)\n\n## Version History\n\n- v1.0.0 (2026-04-24): Initial release. Method 1 (historical replay) + Method 2 (CI history). Idempotent via per-incident cache.\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwan-huiyan%2Ftest-effectiveness-auditor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwan-huiyan%2Ftest-effectiveness-auditor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwan-huiyan%2Ftest-effectiveness-auditor/lists"}