{"id":42767153,"url":"https://github.com/plaited/agent-eval-harness","last_synced_at":"2026-02-20T01:02:53.263Z","repository":{"id":332845398,"uuid":"1135157745","full_name":"plaited/agent-eval-harness","owner":"plaited","description":"Evaluate AI agents with Unix-style pipeline commands. Schema-driven adapters for any CLI agent, trajectory capture, pass@k metrics, and multi-run comparison.","archived":false,"fork":false,"pushed_at":"2026-01-22T00:21:37.000Z","size":370,"stargazers_count":1,"open_issues_count":1,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-01-22T01:49:59.660Z","etag":null,"topics":["agent-client-protocol","agent-comparison","agent-evaluation","ai-agents","bun","cli","eval-harness","grader","headless-adapter","jsonl","llm-evaluation","pass-at-k","trajectory-capture","typescript","unix-pipeline"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"isc","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/plaited.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-01-15T18:07:11.000Z","updated_at":"2026-01-22T00:21:37.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/plaited/agent-eval-harness","commit_stats":null,"previous_names":["plaited/acp-harness"],"tags_count":19,"template":false,"template_full_name":null,"purl":"pkg:github/plaited/agent-eval-harness","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plaited%2Fagent-eval-harness","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plaited%2Fagent-eval-harness/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plaited%2Fagent-eval-harness/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plaited%2Fagent-eval-harness/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/plaited","download_url":"https://codeload.github.com/plaited/agent-eval-harness/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/plaited%2Fagent-eval-harness/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28885308,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-29T21:06:44.224Z","status":"ssl_error","status_checked_at":"2026-01-29T21:06:42.160Z","response_time":59,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-client-protocol","agent-comparison","agent-evaluation","ai-agents","bun","cli","eval-harness","grader","headless-adapter","jsonl","llm-evaluation","pass-at-k","trajectory-capture","typescript","unix-pipeline"],"created_at":"2026-01-29T21:14:46.483Z","updated_at":"2026-02-20T01:02:53.247Z","avatar_url":"https://github.com/plaited.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# @plaited/agent-eval-harness\n\n[![npm version](https://img.shields.io/npm/v/@plaited/agent-eval-harness.svg)](https://www.npmjs.com/package/@plaited/agent-eval-harness)\n[![CI](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml/badge.svg)](https://github.com/plaited/agent-eval-harness/actions/workflows/ci.yml)\n[![License: ISC](https://img.shields.io/badge/License-ISC-blue.svg)](https://opensource.org/licenses/ISC)\n\nCLI tool for capturing agent trajectories from headless CLI agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available as both a CLI tool and as installable skills for AI coding agents.\n\n## CLI Tool\n\nUse these tools directly via the CLI without installation:\n\n```bash\n# Using built-in headless adapter (recommended - no extra install needed)\nexport ANTHROPIC_API_KEY=sk-...\nbunx @plaited/agent-eval-harness capture prompts.jsonl \\\n  --schema ./schemas/claude-headless.json \\\n  -o results.jsonl\n```\n\n**Prerequisite:** Set your API key. The harness works with any CLI agent that supports JSON output - just provide a schema describing how to interact with it:\n\n```bash\nexport ANTHROPIC_API_KEY=sk-...   # For Claude\nexport GEMINI_API_KEY=...         # For Gemini\n```\n\nCreate adapter schemas for any CLI agent that outputs JSON — see the [Schema Creation Guide](.agents/skills/headless-adapters/references/schema-creation-guide.md).\n\n### Core Commands\n\n| Command | Description |\n|---------|-------------|\n| `capture \u003cprompts\u003e --schema \u003cpath\u003e` | Trajectory capture (full JSONL) |\n| `trials \u003cprompts\u003e --schema \u003cpath\u003e` | Multi-run with pass@k metrics |\n| `summarize \u003cresults\u003e` | Derive compact views from results |\n| `calibrate \u003cresults\u003e` | Sample failures for review |\n| `validate-refs \u003cprompts\u003e` | Check reference solutions |\n| `balance \u003cprompts\u003e` | Analyze test set coverage |\n| `schemas [name]` | Export JSON schemas |\n| `headless --schema \u003cpath\u003e` | Schema-driven adapter for any CLI agent |\n\n### Pipeline Commands (Unix-style)\n\n| Command | Description |\n|---------|-------------|\n| `run \u003cprompts\u003e --schema \u003cpath\u003e` | Execute prompts, output raw results |\n| `extract \u003craw\u003e --schema \u003cpath\u003e` | Parse raw output into trajectories |\n| `grade \u003cresults\u003e --grader \u003cpath\u003e` | Apply grader to extracted results |\n| `format \u003cresults\u003e --style \u003cstyle\u003e` | Convert to markdown, csv, or jsonl |\n| `compare \u003crun1\u003e \u003crun2\u003e...` | Compare runs (aggregate report) |\n\n### Examples\n\n```bash\n# Capture trajectories using headless adapter (recommended)\nbunx @plaited/agent-eval-harness capture prompts.jsonl \\\n  --schema ./schemas/claude-headless.json \\\n  -o results.jsonl\n\n# Parallel capture (4x faster with 4 workers)\nbunx @plaited/agent-eval-harness capture prompts.jsonl \\\n  --schema ./schemas/claude-headless.json \\\n  -j 4 -o results.jsonl\n\n# Run trials for pass@k analysis with debug mode\nbunx @plaited/agent-eval-harness trials prompts.jsonl \\\n  --schema ./schemas/claude-headless.json \\\n  -k 5 --grader ./grader.ts --debug\n\n# Parallel trials (4 prompts running trials concurrently)\nbunx @plaited/agent-eval-harness trials prompts.jsonl \\\n  --schema ./schemas/claude-headless.json \\\n  -k 5 -j 4 --workspace-dir ./workspaces -o trials.jsonl\n\n# Summarize results\nbunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl\n\n# Export schemas\nbunx @plaited/agent-eval-harness schemas CaptureResult --json\n\n# Pipeline workflow (Unix-style composition)\ncat prompts.jsonl | \\\n  bunx @plaited/agent-eval-harness run -s ./schemas/claude-headless.json | \\\n  bunx @plaited/agent-eval-harness extract -s ./schemas/claude-headless.json | \\\n  bunx @plaited/agent-eval-harness grade -g ./grader.ts | \\\n  bunx @plaited/agent-eval-harness format -f markdown \u003e report.md\n\n# Compare runs (built-in strategies: weighted, statistical, custom)\nbunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json\n\n# Compare trials for pass@k reliability analysis (auto-detects format)\nbunx @plaited/agent-eval-harness compare trials1.jsonl trials2.jsonl -o comparison.json\n```\n\n## Skills for AI Agents\n\n**Install skills** for use with AI coding agents:\n\n```bash\nnpx skills add plaited/agent-eval-harness\n# or\nbunx skills add plaited/agent-eval-harness\n```\n\n### Available Skills\n\n#### Agent Eval Harness\n\nCLI tool for capturing agent trajectories, optimized for TypeScript/JavaScript projects using Bun.\n\n**Core Commands:**\n\n| Command | Description |\n|---------|-------------|\n| `capture` | Execute prompts and capture full trajectories |\n| `trials` | Multi-run trials with pass@k/pass^k metrics |\n| `summarize` | Derive compact views from trajectory results |\n| `calibrate` | Sample failures for grader calibration |\n| `validate-refs` | Validate reference solutions against graders |\n| `balance` | Analyze test set coverage distribution |\n| `schemas` | Export Zod schemas as JSON Schema |\n\n**Pipeline Commands (Unix-style):**\n\n| Command | Description |\n|---------|-------------|\n| `run` | Execute prompts, output raw results |\n| `extract` | Parse raw output into trajectories |\n| `grade` | Apply grader to extracted results |\n| `format` | Convert to markdown, csv, or jsonl |\n| `compare` | Compare runs (aggregate report) |\n\n**Use cases:**\n- Capturing trajectories for downstream evaluation (Braintrust, custom scorers)\n- Generating training data (SFT/DPO) with full context\n- Building regression test fixtures for agent behavior\n- Comparing agent responses across configurations\n\n#### Headless Adapters\n\nSchema-driven adapters for headless CLI agent integration.\n\n**Commands:**\n\n| Command | Description |\n|---------|-------------|\n| `headless` | Schema-driven adapter for any CLI agent |\n\n**Use cases:**\n- Wrapping headless CLI agents with schema-driven adapter\n- Finding existing adapters for your agent\n- Creating new schemas for CLI agents\n\n## Input Format\n\n```jsonl\n{\"id\":\"test-001\",\"input\":\"Create a primary button\",\"hint\":\"should contain \u003cbutton\u003e\",\"metadata\":{\"category\":\"ui\"}}\n{\"id\":\"test-002\",\"input\":[\"Create a component\",\"Now add tests\"],\"metadata\":{\"category\":\"multi-turn\"}}\n```\n\n| Field | Required | Description |\n|-------|----------|-------------|\n| `id` | Yes | Unique identifier |\n| `input` | Yes | Single prompt (string) or conversation turns (string[]) |\n| `hint` | No | Grader context - what to look for |\n| `reference` | No | Reference solution (for validate-refs) |\n| `metadata` | No | Tags, category, difficulty for filtering |\n| `timeout` | No | Override default timeout for this prompt (ms) |\n\n## Output Format\n\nThe harness outputs full trajectory JSONL (`CaptureResult` schema):\n\n```jsonl\n{\n  \"id\": \"test-001\",\n  \"input\": \"Create a primary button\",\n  \"output\": \"Here's a button component...\",\n  \"hint\": \"should contain \u003cbutton\u003e\",\n  \"trajectory\": [...],\n  \"metadata\": {\"category\": \"ui\", \"trajectoryRichness\": \"full\", \"turnCount\": 1},\n  \"timing\": {\"start\": 1234567890, \"end\": 1234567900, \"total\": 10},\n  \"toolErrors\": false,\n  \"exitInfo\": {\"exitCode\": 0},\n  \"score\": {\"pass\": true, \"score\": 1.0, \"reasoning\": \"Contains hint\"}\n}\n```\n\nKey fields:\n- `toolErrors`: Boolean indicating if any tool calls failed\n- `score`: Grader result (only if `--grader` provided)\n- `trajectory`: Full execution trace (thoughts, messages, tool calls, plans)\n- `metadata.trajectoryRichness`: `\"full\"` | `\"messages-only\"` | `\"minimal\"`\n- `exitInfo`: Process exit information (`exitCode`, `signal`, `timedOut`)\n- `timing.total`: End-to-end duration (ms)\n\n## Graders\n\nGraders score agent outputs. The harness supports two types and two grading approaches:\n\n### Git-Based Outcome Grading (Recommended for Coding Agents)\n\n**Grade outcomes, not paths.** Use git to detect actual environmental changes:\n\n```typescript\nimport type { Grader } from '@plaited/agent-eval-harness/schemas'\nimport { resolve } from 'node:path'\n\nexport const grade: Grader = async ({ output, hint, cwd }) =\u003e {\n  // Validate cwd to prevent command injection\n  const isValidPath = (path: string): boolean =\u003e {\n    const dangerousChars = /[;\u0026|`$(){}[\\]\u003c\u003e'\"\\\\]/\n    if (dangerousChars.test(path)) return false\n    if (path.includes('..') || path.startsWith('-')) return false\n    return true\n  }\n\n  if (!cwd || !isValidPath(cwd)) {\n    return { \n      pass: false, \n      score: 0, \n      reasoning: 'Invalid working directory path' \n    }\n  }\n  \n  const safeCwd = resolve(cwd)\n  \n  // Detect file changes using git\n  const status = await Bun.$`git -C ${safeCwd} status --porcelain`.text()\n  const filesCreated = status\n    .split('\\n')\n    .filter(line =\u003e line.startsWith('??'))\n    .map(line =\u003e line.slice(3).trim())\n  \n  // Run tests to verify outcome\n  const testResult = await Bun.$`cd ${safeCwd} \u0026\u0026 bun test`.nothrow()\n  const testsPassed = testResult.exitCode === 0\n  \n  return {\n    pass: filesCreated.length \u003e 0 \u0026\u0026 testsPassed,\n    score: testsPassed ? 1.0 : 0.0,\n    reasoning: `Files created: ${filesCreated.join(', ')}. Tests: ${testsPassed ? 'pass' : 'fail'}`,\n    outcome: {  // Optional: structured data for analysis\n      filesCreated,\n      testsPassed,\n      type: 'file_creation_with_tests'\n    }\n  }\n}\n```\n\n**Benefits:**\n- Detects actual file changes, test results, build success\n- Works universally in any git repo, any language\n- Returns structured `outcome` data for downstream analysis\n- Zero configuration required\n\n### Output-Based Grading (General Purpose)\n\nFor non-coding tasks or when git is unavailable:\n\n```typescript\nimport type { Grader } from '@plaited/agent-eval-harness/schemas'\n\nexport const grade: Grader = async ({ input, output, hint, trajectory }) =\u003e {\n  const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')\n  return {\n    pass,\n    score: pass ? 1.0 : 0.0,\n    reasoning: pass ? 'Contains hint content' : 'Missing hint content'\n  }\n}\n```\n\n```bash\nagent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts\n```\n\n### Polyglot Graders (Python, etc.)\n\nAny executable script using stdin/stdout JSON protocol:\n\n```python\n#!/usr/bin/env python3\nimport json\nimport sys\nimport subprocess\nimport re\nimport os\n\ndata = json.load(sys.stdin)\noutput = data[\"output\"].lower()\nhint = (data.get(\"hint\") or \"\").lower()\ncwd = data.get(\"cwd\")\n\n# Validate cwd to prevent command injection\ndef is_valid_path(path):\n    if not path:\n        return False\n    # Reject shell metacharacters\n    if re.search(r'[;\u0026|`$(){}\\[\\]\u003c\u003e\\'\"\\\\]', path):\n        return False\n    # Reject directory traversal and option injection\n    if '..' in path or path.startswith('-'):\n        return False\n    return True\n\n# Git-based grading if cwd is provided\nif cwd:\n    if not is_valid_path(cwd):\n        print(json.dumps({\n            \"pass\": False,\n            \"score\": 0.0,\n            \"reasoning\": \"Invalid working directory path\"\n        }))\n        sys.exit(0)\n    \n    safe_cwd = os.path.abspath(cwd)\n    \n    try:\n        result = subprocess.run(\n            [\"git\", \"-C\", safe_cwd, \"status\", \"--porcelain\"],\n            capture_output=True, text=True, check=True\n        )\n        files_created = [\n            line[3:].strip() \n            for line in result.stdout.split('\\n') \n            if line.startswith('??')\n        ]\n        has_changes = len(files_created) \u003e 0\n        print(json.dumps({\n            \"pass\": has_changes,\n            \"score\": 1.0 if has_changes else 0.0,\n            \"reasoning\": f\"Files created: {', '.join(files_created)}\",\n            \"outcome\": {\"filesCreated\": files_created, \"type\": \"git_check\"}\n        }))\n        sys.exit(0)\n    except subprocess.CalledProcessError:\n        # Fall back to output-based grading\n        pass\n\n# Output-based grading fallback\npass_result = hint in output if hint else True\nprint(json.dumps({\n    \"pass\": pass_result,\n    \"score\": 1.0 if pass_result else 0.0,\n    \"reasoning\": \"Contains hint\" if pass_result else \"Missing hint\"\n}))\n```\n\n```bash\nchmod +x grader.py\nagent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py\n```\n\n**Protocol:**\n- Input (stdin): `{\"input\": \"...\", \"output\": \"...\", \"hint\": \"...\", \"trajectory\": [...], \"cwd\": \"/path/to/dir\"}`\n- Output (stdout): `{\"pass\": true, \"score\": 1.0, \"reasoning\": \"...\", \"outcome\": {...}}`\n- `cwd` and `outcome` are optional fields\n\n## Downstream Integration\n\nThe harness outputs standard JSONL. When graders return the optional `outcome` field, it's merged onto results for powerful downstream analysis:\n\n```bash\n# Filter failures\ncat results.jsonl | jq 'select(.score.pass == false)'\n\n# Extract tool usage patterns\ncat results.jsonl | jq '.trajectory[] | select(.type == \"tool_call\") | .name'\n\n# Analyze outcomes from git-based graders\ncat results.jsonl | jq 'select(.outcome.type == \"test_execution\")'\ncat results.jsonl | jq -s 'map(select(.outcome.testsPassed)) | length'\ncat results.jsonl | jq 'select(.outcome.touchedCriticalFiles == true)'\n\n# Use with your scoring pipeline\ncat results.jsonl | your-scoring-script.ts\n```\n\n### Outcome Field\n\nGit-based graders can return structured `outcome` data:\n\n```jsonl\n{\n  \"id\": \"fix-tests\",\n  \"input\": \"Fix the failing authentication tests\",\n  \"output\": \"I fixed the auth tests by...\",\n  \"score\": {\"pass\": true, \"score\": 1.0, \"reasoning\": \"Tests pass\"},\n  \"outcome\": {\n    \"testsPassed\": true,\n    \"filesModified\": [\"src/auth.ts\", \"src/auth.spec.ts\"],\n    \"exitCode\": 0,\n    \"type\": \"test_execution\"\n  }\n}\n```\n\nThis enables rich analysis across evaluations without re-parsing trajectories.\n\n## Development\n\n```bash\nbun install               # Install dependencies\nbun run check             # Type check + lint + format\nbun test                  # Run unit tests\nbun run test:integration  # Run integration tests (requires API keys)\n\n# Alternative: Run integration tests in Docker\nANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... \\\n  docker compose -f docker-compose.test.yml run --rm test\n```\n\n## Requirements\n\n- **Runtime:** Bun \u003e= 1.2.9\n- **Schema:** JSON schema describing CLI agent interaction (see [Schema Creation Guide](.agents/skills/headless-adapters/references/schema-creation-guide.md))\n- **API Key:** `ANTHROPIC_API_KEY` for Claude, `GEMINI_API_KEY` for Gemini\n\n## License\n\nISC © [Plaited Labs](https://github.com/plaited)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplaited%2Fagent-eval-harness","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fplaited%2Fagent-eval-harness","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fplaited%2Fagent-eval-harness/lists"}