{"id":49080066,"url":"https://github.com/lucharo/eval-claude","last_synced_at":"2026-04-20T12:35:04.062Z","repository":{"id":340980958,"uuid":"1168430691","full_name":"lucharo/eval-claude","owner":"lucharo","description":"Run inspect_ai evals via Claude Code CLI — use your Claude subscription instead of per-token API billing","archived":false,"fork":false,"pushed_at":"2026-04-06T18:41:55.000Z","size":307,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-06T20:27:13.441Z","etag":null,"topics":["ai-safety","benchmarks","claude","claude-code","inspect-ai","llm-evaluation"],"latest_commit_sha":null,"homepage":"https://didtheynerfclaude.luischav.es","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lucharo.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-27T11:33:10.000Z","updated_at":"2026-04-06T18:41:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"0cecd290-b4b2-4763-beb8-2c90de0e0019","html_url":"https://github.com/lucharo/eval-claude","commit_stats":null,"previous_names":["lucharo/eval-claude"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/lucharo/eval-claude","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucharo%2Feval-claude","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucharo%2Feval-claude/tags","releases_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/lucharo%2Feval-claude/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucharo%2Feval-claude/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lucharo","download_url":"https://codeload.github.com/lucharo/eval-claude/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lucharo%2Feval-claude/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32047417,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-20T11:35:06.609Z","status":"ssl_error","status_checked_at":"2026-04-20T11:34:48.899Z","response_time":94,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-safety","benchmarks","claude","claude-code","inspect-ai","llm-evaluation"],"created_at":"2026-04-20T12:35:03.116Z","updated_at":"2026-04-20T12:35:04.035Z","avatar_url":"https://github.com/lucharo.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# eval-claude\n\nRun [inspect_ai](https://inspect.aisi.org.uk/) evals through the **Claude Code CLI** — use your Claude Pro/Max/Team subscription instead of per-token API billing.\n\nBased on [UKGovernmentBEIS/inspect_ai#2986](https://github.com/UKGovernmentBEIS/inspect_ai/pull/2986), extracted as a standalone 
pip-installable package.\n\n**Live dashboard:** [lucharo.github.io/eval-claude](https://lucharo.github.io/eval-claude/) — harmonized GPQA Diamond history across 5 Claude models (50 samples, 1 epoch per run).\n\n\u003e **Disclaimer:** The dashboard shows only completed, harmonized runs (n=50, 1 epoch). 27 historical rows were removed: 6 early runs that used a 200-sample config, plus 21 rows where the Claude subscription weekly usage cap or partial backend failures caused generations not to complete. The remaining trend is informative but not exhaustive, and the benchmark is no longer scheduled — workflows are manual-only.\n\n## Why?\n\nModel providers sometimes ship regressions (e.g. [Anthropic's summer incident](https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues), [GPT-4's 2023 decline](https://futurism.com/the-byte/stanford-chatgpt-getting-dumber)). This package lets you run standard benchmarks at virtually no extra cost using your existing Claude subscription — no API keys or per-token billing needed.\n\n## Install\n\n```bash\nuv add eval-claude\n```\n\nOr from source:\n\n```bash\nuv sync\n```\n\nThis installs `inspect_ai`, `inspect_evals` (standard benchmark suite), and the `claude-code/` provider.\n\n### Prerequisites\n\n- [Claude Code CLI](https://docs.anthropic.com/en/docs/claude-code): `npm install -g @anthropic-ai/claude-code`\n- Authenticated: `claude auth`\n- Active Claude Pro/Max/Team subscription\n\n## Quick start\n\nNo need to clone — run directly with `uvx`:\n\n```bash\nuvx --from eval-claude inspect eval inspect_evals/arc_easy --model claude-code/haiku --limit 5 -M max_connections=5\n```\n\n## Usage\n\n```bash\n# Basic\nuv run inspect eval inspect_evals/arc_easy --model claude-code/sonnet --limit 10\n\n# Parallel (10 concurrent CLI processes)\nuv run inspect eval inspect_evals/gpqa_diamond --model claude-code/opus -M max_connections=10\n\n# Extended thinking\nuv run inspect eval inspect_evals/gpqa_diamond --model claude-code/sonnet -M 
thinking_level=ultrathink\n\n# Let Claude Code pick the default model\nuv run inspect eval task.py --model claude-code/default\n```\n\n## Benchmark results\n\nAll benchmarks run on a ThinkPad with Claude Code CLI v2.0.76, February 2026.\n\n### ARC Easy (5 samples)\n\n```\n╭──────────────────────────────────────────────────────────────────────────────╮\n│arc_easy (5 samples): claude-code/haiku                                       │\n╰──────────────────────────────────────────────────────────────────────────────╯\ntotal time:                    0:00:25\n\nchoice\naccuracy  1.000\nstderr    0.000\n```\n\n### GPQA Diamond (50 samples, all models)\n\n```bash\nfor model in haiku sonnet opus; do\n  uv run inspect eval inspect_evals/gpqa_diamond --model claude-code/$model --limit 50 -M max_connections=10\ndone\n```\n\n**Haiku 4.5** (61.5% +/- 5.9%, 12:42)\n```\n╭──────────────────────────────────────────────────────────────────────────────╮\n│gpqa_diamond (50 x 4 samples): claude-code/haiku                              │\n╰──────────────────────────────────────────────────────────────────────────────╯\ntotal time:                 0:12:42\n\nchoice\naccuracy  0.615\nstderr    0.059\n```\n\n**Sonnet 4.5** (78.5% +/- 4.8%, 13:20)\n```\n╭──────────────────────────────────────────────────────────────────────────────╮\n│gpqa_diamond (50 x 4 samples): claude-code/sonnet                             │\n╰──────────────────────────────────────────────────────────────────────────────╯\ntotal time:                  0:13:20\n\nchoice\naccuracy  0.785\nstderr    0.048\n```\n\n**Opus 4.5** (86.0% +/- 4.5%, 14:13)\n```\n╭──────────────────────────────────────────────────────────────────────────────╮\n│gpqa_diamond (50 x 4 samples): claude-code/opus                               │\n╰──────────────────────────────────────────────────────────────────────────────╯\ntotal time:                0:14:13\n\nchoice\naccuracy  0.860\nstderr    0.045\n```\n\nResults show the expected model ranking: 
Haiku \u003c Sonnet \u003c Opus.\n\n## Provider comparison: `anthropic/` vs `claude-code/`\n\n| Feature | `anthropic/` | `claude-code/` |\n|---------|-------------|----------------|\n| **Billing** | Per-token API | Subscription (Pro/Max/Team) |\n| **Tool/function calling** | Full support | Not supported |\n| **Vision/images** | Yes | No |\n| **Streaming** | Yes | No |\n| **Concurrent requests** | Configurable | Via `max_connections` |\n| **Extended thinking** | Fine-grained (up to 200k tokens) | Coarse-grained via `thinking_level` |\n| **Token usage** | Real counts | Real counts |\n| **Cost tracking** | Via API | From CLI JSON |\n\n## Model args\n\n| Arg | Default | Description |\n|-----|---------|-------------|\n| `skip_permissions` | `True` | Skip permission prompts (`--dangerously-skip-permissions`) |\n| `timeout` | `300` | CLI timeout in seconds |\n| `max_connections` | `1` | Concurrent CLI processes |\n| `thinking_level` | `\"none\"` | `\"none\"`, `\"think\"` (~4k tokens), `\"megathink\"` (~10k), `\"ultrathink\"` (~32k) |\n\n## Extended thinking\n\nThe CLI uses magic words to trigger thinking budgets ([Simon Willison's blog](https://simonwillison.net/2025/Apr/19/claude-code-best-practices/)):\n\n| `thinking_level` | Approx. tokens |\n|------------------|----------------|\n| `none` (default) | 0 |\n| `think` | ~4,000 |\n| `megathink` | ~10,000 |\n| `ultrathink` | ~32,000 |\n\nLess granular than the API's `budget_tokens` parameter (up to 200k).\n\n## Model names\n\nPassed directly to the CLI. Accepts aliases (`sonnet`, `opus`, `haiku`) and full model IDs (`claude-sonnet-4-5-20250929`). 
Use `--model claude-code/default` to let Claude Code choose its default model.\n\n## Environment\n\n- `CLAUDE_CODE_COMMAND` — override the CLI path (default: `claude` from PATH)\n\n## Implementation notes\n\n- **CLI discovery**: Supports `CLAUDE_CODE_COMMAND` env var for custom paths\n- **Model names**: Passed directly to the CLI — it handles aliases and full model IDs natively\n- **Token usage**: Extracted from the CLI's JSON output (`--output-format json`)\n- **Cost \u0026 timing**: Extracted from CLI JSON (`total_cost_usd`, `duration_ms`, `duration_api_ms`, `session_id`)\n- **Tools disabled**: Uses `--tools \"\"` to disable Claude Code's built-in tools for clean eval responses\n\n## Development\n\n```bash\nuv sync --extra dev\nuv run pytest tests/ -v\n```\n\n## License\n\nMIT\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucharo%2Feval-claude","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flucharo%2Feval-claude","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flucharo%2Feval-claude/lists"}