{"id":51224410,"url":"https://github.com/saagpatel/harness-scorecard","last_synced_at":"2026-06-28T10:00:38.304Z","repository":{"id":367913457,"uuid":"1282704002","full_name":"saagpatel/harness-scorecard","owner":"saagpatel","description":"A read-only A–F maturity grader for coding-agent harnesses (Claude Code + Codex)","archived":false,"fork":false,"pushed_at":"2026-06-28T08:46:02.000Z","size":159,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-28T09:18:13.695Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/saagpatel.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"docs/ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-28T05:22:25.000Z","updated_at":"2026-06-28T08:45:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/saagpatel/harness-scorecard","commit_stats":null,"previous_names":["saagpatel/harness-scorecard"],"tags_count":5,"template":false,"template_full_name":null,"purl":"pkg:github/saagpatel/harness-scorecard","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saagpatel%2Fharness-scorecard","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saagpatel%2Fharness-scorecard/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saagpatel%2Fharness-scorecard/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saagpatel%2Fharness-scorecard/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/saagpatel","download_url":"https://codeload.github.com/saagpatel/harness-scorecard/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/saagpatel%2Fharness-scorecard/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34884278,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-28T02:00:05.809Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-28T10:00:25.288Z","updated_at":"2026-06-28T10:00:38.297Z","avatar_url":"https://github.com/saagpatel.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Harness Scorecard\n\nA read-only linter and **A–F maturity grader for coding-agent harnesses**. Point it at a\nClaude Code or Codex setup — Claude Code's hooks, `permissions`, `rules/*.md`, agents, and\n`CLAUDE.md`, or Codex's `config.toml` (sandbox, approval policy, trust levels), `hooks.json`,\nand `AGENTS.md` — and it returns a graded scorecard: the overall maturity grade, the specific\ngaps, and the guards that are missing, each with rationale. The harness type is auto-detected.\n\n\"Harness engineering\" became a named discipline in 2026 and everyone is assembling harnesses\nwith no way to tell if theirs is any good. The rubric is the product: every check traces to a\n**documented red-team failure mode**, not generic advice.\n\n## What it looks like\n\n```text\n$ harness-scorecard scan examples/sample-harness\n\nHarness Scorecard  v1.0.0\nTarget: examples/sample-harness   (claude-code)\n\n  GRADE:  F        overall 0.28 / 1.00\n  Scored 10 of 10 rubric dimensions (0 specced, pending).\n\n  Capability gates tripped (grade capped):\n    - HS-D5-01 caps at C  (Harness config write/read protected)\n\n  D1  Secret protection \u0026 credential isolation    0.44  [weight 5]\n      [PASS] HS-D1-01  Sensitive credential paths denied for read  [GATE-\u003eD]\n             All core credential paths are denied for read.\n             - covered: ~/.ssh, ~/.aws, ~/.gnupg, 1Password/op, gcloud, .env files\n      [FAIL] HS-D1-02  Sensitive-read Bash backstop\n             No Bash-level backstop for sensitive reads; deny lists cover only the Read tool.\n             fix: Add a PreToolUse Bash hook that re-blocks reads of sensitive files.\n      … (+4 more checks)\n\n  D4  Destructive-action \u0026 git safety    0.63  [weight 5]\n      [PASS] HS-D4-01  Push to protected branch effectively blocked  [GATE-\u003eC]\n             Push to a protected branch is blocked by the effective floor.\n             - hook:git-safety\n             - permissions.deny\n      [PASS] HS-D4-02  Catastrophic deletion blocked\n             Catastrophic deletion is blocked by the effective floor.\n             - hook:block-dangerous-cmds\n             - hook:dangerous\n      [FAIL] HS-D4-03  Destructive DB ops on non-local hosts blocked\n             No effective guard against destructive DB operations on non-local hosts.\n             - defaultMode=bypassPermissions: autoMode.hard_deny is INERT\n             fix: Add a PreToolUse Bash db-guard hook that blocks destructive ops on non-local hosts.\n      … (+2 more checks)\n\n  … (+8 more dimensions)\n```\n\nThat one line — `defaultMode=bypassPermissions: autoMode.hard_deny is INERT` — is the whole\nthesis rendered live: a rich `hard_deny` block earns **nothing** because the mode makes it\ninert. The sample above ([`examples/sample-harness`](examples/sample-harness/settings.json)) is\ndeliberately incomplete to show the findings; run it yourself, or point the tool at your own\n`~/.claude` — a mature harness scores an A.\n\n## What makes the grade real\n\nMost config \"linters\" credit a harness for declaring a rule. This one models the *effective*\nenforcement floor. The headline example:\n\n\u003e `autoMode.hard_deny` is **inert** when `permissions.defaultMode == \"bypassPermissions\"`.\n\nA naive scorer reads a rich `hard_deny` block and awards an A. Harness Scorecard reads the\nmode, discounts the inert block, and grades against what actually fires — `permissions.deny`\nglobs plus the PreToolUse hooks. See [`docs/rubric.md`](docs/rubric.md) for the full model,\nincluding **capability gates** that cap the grade when a critical hole is present (you can't\nscore an A with readable credentials, no matter how many cheap checks pass).\n\nIt's honest about its own limits, too. A harness that funnels every guard through one opaque\ndispatcher script hides its logic from static analysis, so the named-guard checks under-credit\nit. Rather than silently mark it down, the report emits a **caveat** — \"a low score here may be\na static-analysis limit, not a missing guard\" — so the grade is never misread as \"insecure.\"\n\n## Usage\n\n```bash\n# Grade a harness directory (e.g. your ~/.claude)\nharness-scorecard scan ~/.claude\n\n# JSON for tooling, plus a self-contained HTML scorecard\nharness-scorecard scan ~/.claude --format json --html scorecard.html\n\n# SARIF 2.1.0 for CI / GitHub code scanning, failing the run below grade C\nharness-scorecard scan ~/.claude --sarif harness.sarif --min-grade C\n```\n\n`--min-grade {A,B,C,D,F}` sets the bar (default `B`). Exit codes: `0` meets the bar ·\n`1` below the bar · `2` no harness found.\n\n### Grade your whole machine\n\n`fleet` grades several harnesses at once and reports the distribution and the worst offender —\nno fake rolled-up letter (averaging A–F is meaningless). It's the \"every agent harness on this\nbox\" view:\n\n```text\n$ harness-scorecard fleet ~/.claude ~/.codex\n\nHarness Scorecard  fleet  (2 harnesses)\n\n  Grades:  Ax1   Bx0   Cx0   Dx1   Fx0\n  Weakest dimension fleet-wide: D9 Memory / provenance hygiene (avg 0.62)\n  Worst offender: ~/.codex (D, 0.64)\n\n  GRADE  SCORE  TYPE         WEAKEST    HARNESS\n  A      1.00   claude-code  -          ~/.claude\n  D      0.64   codex        D9 0.25    ~/.codex\n```\n\nPass any paths or globs (`fleet ~/.claude ~/Projects/*/.claude`); each harness is graded with\nits own auto-discovered policy. `--min-grade` (default `B`) exits non-zero if **any** harness is\nbelow the bar — drop it in CI to keep a whole team's harnesses above a floor.\n\n### Grade badge\n\nEmit a flat SVG badge (colored A green → F red) for a harness repo's README, then regenerate it\nin CI so it can't drift from reality:\n\n```bash\nharness-scorecard scan ~/.claude --badge harness-grade.svg\n```\n\n```markdown\n![harness grade](harness-grade.svg)\n```\n\n### Track drift over time\n\n`diff` compares two scorecards and reports what changed — which checks flipped, which\ndimension scores moved, and whether a capability gate newly trips. Each argument is either a\nlive harness directory or a saved JSON report (`scan --json`), so the same command covers a CI\nregression gate, a before/after audit, or drift between two snapshots:\n\n```bash\n# Record a baseline, then later fail if the harness grade regresses below it\nharness-scorecard scan ~/.claude --json baseline.json\nharness-scorecard diff baseline.json ~/.claude          # exit 1 if the grade dropped\n\n# Compare two saved snapshots, machine-readable\nharness-scorecard diff old.json new.json --format json\n```\n\nExit codes: `0` no regression (same or better grade) · `1` grade regressed · `2` invalid input.\nGate and dimension moves are reported for context; the **letter grade** is what fails the gate.\n\n### Accept known gaps with a policy file\n\nDrop a `.harness-scorecard.toml` in the harness directory (or pass `--policy`) to record decisions\nthe grader should respect — always surfaced in the report, never silently hidden:\n\n```toml\n[[waiver]]\ncheck = \"HS-D1-03\"\nreason = \"Write-time secret scanning is handled by pre-commit, outside the harness.\"\n\n[dispatcher]\ncredits = [\"HS-D4-03\"]   # checks an opaque dispatcher enforces but static analysis can't see\n```\n\nA **waiver** excludes a finding from the grade (and suppresses its gate cap) but lists it as\n`[WAIV]` with the reason; a stale waiver is flagged, not dropped. The **dispatcher manifest**\nupgrades a declared check from FAIL to PARTIAL — half credit, \"declared, not statically verified.\"\nSee [`examples/harness-scorecard.toml`](examples/harness-scorecard.toml).\n\n## GitHub Action\n\nGrade your harness in CI and upload the findings to code scanning:\n\n```yaml\n- uses: saagpatel/harness-scorecard@v1\n  with:\n    path: .claude\n    min-grade: B\n```\n\nThe action writes SARIF and uploads it (requires `security-events: write`) **even when the grade\nfails the build**, so findings always reach code scanning. Commit a `baseline.json` and pass\n`baseline:` to also fail the job on any grade regression — a PR that weakens the harness can't\nmerge:\n\n```yaml\n- uses: saagpatel/harness-scorecard@v1\n  with:\n    path: .claude\n    baseline: .github/harness-baseline.json   # fail if the grade drops below this\n```\n\nA complete workflow — permissions, weekly scheduling, SARIF upload — is in\n[`examples/github-workflow.yml`](examples/github-workflow.yml).\n\n## Guarantees\n\n- **Read-only.** It never writes to the harness it audits.\n- **Privacy-preserving.** All output redacts secrets, tokens, emails, and absolute home\n  paths. Nothing leaves the machine.\n- **Dependency-free runtime.** The scorer ships stdlib-only — a tool that grades\n  supply-chain hygiene should carry the smallest surface itself.\n\n## Scope (v1)\n\nImplements **all ten rubric dimensions** end-to-end for **both Claude Code and Codex**: secret\nprotection, egress/exfiltration control, tool-surface \u0026 inbound-injection defense,\ndestructive-action \u0026 git safety, harness self-protection \u0026 integrity, verification gates,\nsubagent isolation \u0026 governance, recovery/rollback safety, memory/provenance hygiene, and\nobservability/audit trail (the critical gated trio is D1/D4/D5). Each harness has its own\nadapter and check suite over the shared scoring engine; the bypass-aware effective floor maps\nto Codex's `sandbox_mode = \"danger-full-access\"` + `approval_policy = \"never\"` just as it does\nto Claude Code's `bypassPermissions`. The rubric is versioned and emitted in every report.\n\n## Development\n\n```bash\nuv sync --frozen                                      # install dev tooling from the lockfile\nuv run --no-sync python -m unittest discover -s tests # tests (stdlib runner, zero extra deps)\nuv run --no-sync ruff check src/ tests/               # lint\nuv run --no-sync ty check src/                        # type check\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaagpatel%2Fharness-scorecard","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsaagpatel%2Fharness-scorecard","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsaagpatel%2Fharness-scorecard/lists"}