{"id":50443620,"url":"https://github.com/tiger-ai-lab/clawbench","last_synced_at":"2026-05-31T20:01:55.467Z","repository":{"id":350559256,"uuid":"1206519531","full_name":"TIGER-AI-Lab/ClawBench","owner":"TIGER-AI-Lab","description":"Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.","archived":false,"fork":false,"pushed_at":"2026-05-23T20:54:03.000Z","size":3559,"stargazers_count":320,"open_issues_count":41,"forks_count":20,"subscribers_count":7,"default_branch":"main","last_synced_at":"2026-05-23T21:23:48.082Z","etag":null,"topics":["agent-evaluation","agentic-ai","ai-agent-benchmark","ai-agents","benchmark","browser-agent","browser-automation","browser-use","chrome-agent","chrome-extension","computer-use","dataset","evaluation","everyday-tasks","llm","llm-evaluation","online-tasks","real-world-benchmark","web-agent","web-agents"],"latest_commit_sha":null,"homepage":"https://claw-bench.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/TIGER-AI-Lab.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null},"funding":{"github":"reacher-z"}},"created_at":"2026-04-10T01:59:17.000Z","updated_at":"2026-05-23T20:54:06.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/TIGER-AI-Lab/ClawBench","commit_stats":null,"previous_names":
["reacher-z/clawbench","tiger-ai-lab/clawbench"],"tags_count":11,"template":false,"template_full_name":null,"purl":"pkg:github/TIGER-AI-Lab/ClawBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FClawBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FClawBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FClawBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FClawBench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/TIGER-AI-Lab","download_url":"https://codeload.github.com/TIGER-AI-Lab/ClawBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/TIGER-AI-Lab%2FClawBench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33746513,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-31T02:00:06.040Z","response_time":95,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent-evaluation","agentic-ai","ai-agent-benchmark","ai-agents","benchmark","browser-agent","browser-automation","browser-use","chrome-agent","chrome-extension","computer-use","dataset","evaluation","everyday-tasks","llm","llm-evaluation","online-tasks","real-world-benchmark","web-agent","web-a
gents"],"created_at":"2026-05-31T20:01:54.567Z","updated_at":"2026-05-31T20:01:55.458Z","avatar_url":"https://github.com/TIGER-AI-Lab.png","language":"Python","funding_links":["https://github.com/sponsors/reacher-z"],"categories":[],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\n\n\u003ca href=\"https://github.com/reacher-z/ClawBench\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"assets/hero-dark.svg\"\u003e\n    \u003cimg alt=\"ClawBench\" src=\"assets/hero-light.svg\" width=\"820\"\u003e\n  \u003c/picture\u003e\n\u003c/a\u003e\n\n[![Star this repo](https://img.shields.io/badge/%E2%98%85%20Star%20this%20repo-181717?style=flat-square\u0026logo=github\u0026logoColor=white)](https://github.com/reacher-z/ClawBench)\n[![arXiv](https://img.shields.io/badge/arXiv-2604.08523-B31B1B?style=flat-square\u0026logo=arxiv\u0026logoColor=white)](https://arxiv.org/abs/2604.08523)\n[![HF Daily Paper](https://img.shields.io/badge/Daily_Paper-FFD21E?style=flat-square\u0026logo=huggingface\u0026logoColor=000)](https://huggingface.co/papers/2604.08523)\n[![HF Dataset](https://img.shields.io/badge/Dataset-FFD21E?style=flat-square\u0026logo=huggingface\u0026logoColor=000)](https://huggingface.co/datasets/NAIL-Group/ClawBench)\n[![HF Trace Dataset](https://img.shields.io/badge/Trace_Dataset-FFD21E?style=flat-square\u0026logo=huggingface\u0026logoColor=000)](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace)\n[![Project Page](https://img.shields.io/badge/claw--bench.com-4F46E5?style=flat-square\u0026logo=googlechrome\u0026logoColor=white)](https://claw-bench.com)\n[![GitHub 
stars](https://img.shields.io/github/stars/reacher-z/ClawBench?style=flat-square\u0026logo=github\u0026color=181717\u0026cacheSeconds=300)](https://github.com/reacher-z/ClawBench)\n[![Discord](https://img.shields.io/badge/Discord-Join-5865F2?style=flat-square\u0026logo=discord\u0026logoColor=white)](https://discord.gg/clawbench)\n[![Codespaces](https://img.shields.io/badge/Codespaces-Open-181717?style=flat-square\u0026logo=github\u0026logoColor=white)](https://codespaces.new/reacher-z/ClawBench?quickstart=1)\n\n[![PyPI downloads](https://img.shields.io/pypi/dm/clawbench-eval?style=flat-square\u0026logo=pypi\u0026color=3775A9\u0026logoColor=white\u0026label=PyPI%20downloads)](https://pypi.org/project/clawbench-eval/)\n[![PyPI version](https://img.shields.io/pypi/v/clawbench-eval?style=flat-square\u0026logo=pypi\u0026color=3775A9\u0026logoColor=white)](https://pypi.org/project/clawbench-eval/)\n[![Last commit](https://img.shields.io/github/last-commit/reacher-z/ClawBench?style=flat-square\u0026logo=github\u0026logoColor=white)](https://github.com/reacher-z/ClawBench/commits/main)\n[![Contributors](https://img.shields.io/github/contributors/reacher-z/ClawBench?style=flat-square\u0026logo=github\u0026logoColor=white)](https://github.com/reacher-z/ClawBench/graphs/contributors)\n[![Commit activity](https://img.shields.io/github/commit-activity/m/reacher-z/ClawBench?style=flat-square\u0026logo=github\u0026logoColor=white)](https://github.com/reacher-z/ClawBench/graphs/commit-activity)\n[![License](https://img.shields.io/github/license/reacher-z/ClawBench?style=flat-square\u0026color=A42E2B)](https://github.com/reacher-z/ClawBench/blob/main/LICENSE)\n\n\u003cp align=\"center\"\u003e\u003csub\u003e\u003ci\u003eFeatured in\u003c/i\u003e\u003c/sub\u003e\u003c/p\u003e\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://github.com/walkinglabs/awesome-harness-engineering\"\u003e\u003cimg alt=\"awesome-harness-engineering\" 
src=\"https://img.shields.io/badge/Featured-awesome--harness--engineering-7C3AED?style=flat-square\u0026logo=awesomelists\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/Jenqyang/Awesome-AI-Agents\"\u003e\u003cimg alt=\"Awesome-AI-Agents\" src=\"https://img.shields.io/badge/Featured-Awesome--AI--Agents-7C3AED?style=flat-square\u0026logo=awesomelists\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ranpox/awesome-computer-use\"\u003e\u003cimg alt=\"awesome-computer-use\" src=\"https://img.shields.io/badge/Featured-awesome--computer--use-7C3AED?style=flat-square\u0026logo=awesomelists\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/ZJU-REAL/Awesome-GUI-Agents\"\u003e\u003cimg alt=\"Awesome-GUI-Agents\" src=\"https://img.shields.io/badge/Featured-Awesome--GUI--Agents-7C3AED?style=flat-square\u0026logo=awesomelists\u0026logoColor=white\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://github.com/zhangxjohn/LLM-Agent-Benchmark-List\"\u003e\u003cimg alt=\"LLM-Agent-Benchmark-List\" src=\"https://img.shields.io/badge/Featured-LLM--Agent--Benchmark--List-7C3AED?style=flat-square\u0026logo=awesomelists\u0026logoColor=white\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://huggingface.co/papers/2604.08523\"\u003e\u003cimg src=\"https://img.shields.io/badge/%233_Paper_of_the_Day-FFD21E?style=for-the-badge\u0026logo=huggingface\u0026logoColor=000\" alt=\"#3 Paper of the Day\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://deepwiki.com/reacher-z/ClawBench\"\u003e\u003cimg alt=\"Ask DeepWiki\" src=\"https://deepwiki.com/badge.svg\" /\u003e\u003c/a\u003e\n\u003c/p\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cb\u003eNew:\u003c/b\u003e Check out our sister project \u003ca 
href=\"https://github.com/reacher-z/HarnessBench\"\u003e\u003cb\u003eHarnessBench\u003c/b\u003e\u003c/a\u003e \u0026mdash;\n  fixes the base model, varies the harness. Same scoring pipeline, orthogonal axis.\n\u003c/p\u003e\n\n\u003ca href=\"#-human-quick-start\"\u003e\u003cimg src=\"https://img.shields.io/badge/Run%20in%20one%20line%20of%20code-4F46E5?style=for-the-badge\u0026labelColor=4F46E5\u0026logoColor=white\u0026logo=data:image/svg%2Bxml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCA1NzYgNTEyIj48cGF0aCBmaWxsPSIjZmZmZmZmIiBkPSJNMjYzLjQtMjdMMjc4LjIgOS44IDMxNSAyNC42YzMgMS4yIDUgNC4yIDUgNy40cy0yIDYuMi01IDcuNEwyNzguMiA1NC4yIDI2My40IDkxYy0xLjIgMy00LjIgNS03LjQgNXMtNi4yLTItNy40LTVMMjMzLjggNTQuMiAxOTcgMzkuNGMtMy0xLjItNS00LjItNS03LjRzMi02LjIgNS03LjRMMjMzLjggOS44IDI0OC42LTI3YzEuMi0zIDQuMi01IDcuNC01czYuMiAyIDcuNCA1ek0xMTAuNyA0MS43bDIxLjUgNTAuMSA1MC4xIDIxLjVjNS45IDIuNSA5LjcgOC4zIDkuNyAxNC43cy0zLjggMTIuMi05LjcgMTQuN2wtNTAuMSAyMS41LTIxLjUgNTAuMWMtMi41IDUuOS04LjMgOS43LTE0LjcgOS43cy0xMi4yLTMuOC0xNC43LTkuN0w1OS44IDE2NC4yIDkuNyAxNDIuN0MzLjggMTQwLjIgMCAxMzQuNCAwIDEyOHMzLjgtMTIuMiA5LjctMTQuN0w1OS44IDkxLjggODEuMyA0MS43QzgzLjggMzUuOCA4OS42IDMyIDk2IDMyczEyLjIgMy44IDE0LjcgOS43ek00NjQgMzA0YzYuNCAwIDEyLjIgMy44IDE0LjcgOS43bDIxLjUgNTAuMSA1MC4xIDIxLjVjNS45IDIuNSA5LjcgOC4zIDkuNyAxNC43cy0zLjggMTIuMi05LjcgMTQuN2wtNTAuMSAyMS41LTIxLjUgNTAuMWMtMi41IDUuOS04LjMgOS43LTE0LjcgOS43cy0xMi4yLTMuOC0xNC43LTkuN2wtMjEuNS01MC4xLTUwLjEtMjEuNWMtNS45LTIuNS05LjctOC4zLTkuNy0xNC43czMuOC0xMi4yIDkuNy0xNC43bDUwLjEtMjEuNSAyMS41LTUwLjFjMi41LTUuOSA4LjMtOS43IDE0LjctOS43ek00NjAgMGMxMSAwIDIxLjYgNC40IDI5LjUgMTIuMmw0Mi4zIDQyLjNDNTM5LjYgNjIuNCA1NDQgNzMgNTQ0IDg0cy00LjQgMjEuNi0xMi4yIDI5LjVsLTg4LjIgODguMi0xMDEuMy0xMDEuMyA4OC4yLTg4LjJDNDM4LjQgNC40IDQ0OSAwIDQ2MCAwek00NC4yIDM5OC41TDMwOC40IDEzNC4zIDQwOS43IDIzNS42IDE0NS41IDQ5OS44QzEzNy42IDUwNy42IDEyNyA1MTIgMTE2IDUxMnMtMjEuNi00LjQtMjkuNS0xMi4yTDQ0LjIgNDU3LjVDMzYuNCA0NDkuNiAzMiA0MzkgMzIgNDI4czQuNC0yMS42IDEyLjItMjkuNXoiLz48L3N2Zz4=\" alt=\"Run in one 
line of code\"\u003e\u003c/a\u003e\n\n```bash\ngit clone https://github.com/reacher-z/ClawBench.git \u0026\u0026 cd ClawBench \u0026\u0026 ./run.sh\n```\n\n\u003csub\u003e\u003ci\u003eClone → configure → run. \u0026nbsp; Root uv package. \u0026nbsp; Docker-isolated harnesses.\u003c/i\u003e\u003c/sub\u003e\n\n### Can AI Agents Complete Everyday Online Tasks?\n\n**ClawBench is an open-source benchmark that evaluates AI browser agents on everyday online tasks — booking travel, ordering food, applying for jobs, managing email — across live websites. V1 lives in `test-cases/v1/` with 153 tasks across 144 websites; V2 lives in `test-cases/v2/` with 130 tasks. It measures end-to-end task success with a 5-layer recording pipeline and an agentic evaluator that compares each run against human references. Top score to date: 33.3%.**\n\n\u003cimg src=\"assets/clawbench_logo.png\" alt=\"ClawBench logo\" width=\"320\"\u003e\n\nWe asked frontier AI agents to do what people do every day --\u003cbr/\u003e\norder food, book travel, apply for jobs, write reviews, manage projects.\u003cbr/\u003e\n**Even the best agent only completes about 1 in 3.**\n\n\u003csub\u003e\u003ci\u003eBuilt by NAIL Group \u0026nbsp;·\u0026nbsp; Sister project: \u003ca href=\"https://github.com/reacher-z/HarnessBench\"\u003eHarnessBench\u003c/a\u003e \u0026nbsp;·\u0026nbsp; Runs on any Chrome.\u003c/i\u003e\u003c/sub\u003e\n\n---\n\n**V1: 153** everyday tasks \u0026nbsp;\u0026middot;\u0026nbsp; **V2: 130** tasks \u0026nbsp;\u0026middot;\u0026nbsp; **144** live websites \u0026nbsp;\u0026middot;\u0026nbsp; **15** life categories\n\n\u003ca href=\"docs/README.zh-CN.md\"\u003e\u003cimg src=\"assets/icons/language.svg\" width=\"16\" height=\"16\"\u003e 中文\u003c/a\u003e\n\n\u003c/div\u003e\n\n## \u003cimg src=\"assets/icons/circle-question.svg\" width=\"20\" height=\"20\"\u003e What are you looking for?\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd width=\"25%\" align=\"center\" valign=\"top\"\u003e\n\n🏆 **See 
scores**\u003cbr/\u003e\n[Live leaderboard](https://huggingface.co/spaces/TIGER-Lab/ClawBench)\u003cbr/\u003e\n\u003csub\u003ePick a corpus (v1 / v2)\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003ctd width=\"25%\" align=\"center\" valign=\"top\"\u003e\n\n🚀 **Run it on your model**\u003cbr/\u003e\n[Quick start ↓](#-human-quick-start)\u003cbr/\u003e\n\u003csub\u003e\u003ccode\u003epip install clawbench-eval\u003c/code\u003e\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003ctd width=\"25%\" align=\"center\" valign=\"top\"\u003e\n\n📊 **Browse 283 tasks**\u003cbr/\u003e\n[Task explorer](https://claw-bench.com/tasks)\u003cbr/\u003e\n\u003csub\u003eSearch · filter · category\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003ctd width=\"25%\" align=\"center\" valign=\"top\"\u003e\n\n📄 **Read the paper**\u003cbr/\u003e\n[arXiv:2604.08523](https://arxiv.org/abs/2604.08523)\u003cbr/\u003e\n\u003csub\u003eMethodology · evaluator · results\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\" valign=\"top\"\u003e\n\n🎬 **Re-grade old runs**\u003cbr/\u003e\n[V1](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace) · [V2](https://huggingface.co/datasets/TIGER-Lab/ClawBenchV2Trace) raw traces\u003cbr/\u003e\n\u003csub\u003e5 layers per (task × model)\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003ctd align=\"center\" valign=\"top\"\u003e\n\n📦 **Download the data**\u003cbr/\u003e\n[`hf download NAIL-Group/ClawBench`](https://huggingface.co/datasets/NAIL-Group/ClawBench)\u003cbr/\u003e\n\u003csub\u003eTasks · rubrics · metadata\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003ctd align=\"center\" valign=\"top\"\u003e\n\n🌱 **Add a task / model**\u003cbr/\u003e\n[How to contribute](#contributing)\u003cbr/\u003e\n\u003csub\u003eYAML spec + rubric\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003ctd align=\"center\" valign=\"top\"\u003e\n\n❓ **Have a question**\u003cbr/\u003e\n[FAQ](#frequently-asked-questions) · 
[Discord](https://discord.gg/clawbench)\u003cbr/\u003e\n\u003csub\u003eOr open an issue\u003c/sub\u003e\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## News\n\n- **[2026.05.20]** — V2 is now the default corpus + lenient judge + 6 first-class harnesses. [Details →](https://github.com/reacher-z/ClawBench/blob/main/docs/v1-vs-v2.md)\n- **[2026.05.16]** — Added Claw-Eval suite: 19 browser-research tasks with final-answer submission. [Details →](test-cases/claw-eval/)\n- **[2026.05.12]** — Canonical leaderboard moved to TIGER-Lab/ClawBench Gradio Space. [Details →](https://huggingface.co/spaces/TIGER-Lab/ClawBench)\n- **[2026.05.11]** — V2 leaderboard ships: top so far `glm-5.1 / hermes` at 18.5% reward / 48.5% intercepted. [Details →](https://claw-bench.com/leaderboard)\n- **[2026.05.09]** — Inline LLM judge added as second scoring stage; runs now auto-produce pass/fail. [Details →](docs/scoring.md)\n- **[2026.05.09]** — `clawbench-eval` package published to PyPI for one-command install. [Details →](https://pypi.org/project/clawbench-eval/)\n- **[2026.05.09]** — Released ClawBenchV1Trace: full 5-layer execution trace for every V1 run. [Details →](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace)\n- **[2026.04.25]** — Added support for the hermes harness. [Details →](src/clawbench/runtime/harnesses/hermes/)\n- **[2026.04.18]** — Added support for the browser-use harness. [Details →](src/clawbench/runtime/harnesses/browser-use/)\n- **[2026.04.11]** — Paper released on arXiv (2604.08523); #3 HuggingFace Paper of the Day. 
[Details →](https://arxiv.org/abs/2604.08523)\n\n\u003cbr/\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"assets/icons/globe.svg\" width=\"24\" height=\"24\"\u003e\u0026nbsp;\u003cb\u003eLive Websites\u003c/b\u003e\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003cimg src=\"assets/icons/cube.svg\" width=\"24\" height=\"24\"\u003e\u0026nbsp;\u003cb\u003eIsolated Containers\u003c/b\u003e\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003cimg src=\"assets/icons/shield-halved.svg\" width=\"24\" height=\"24\"\u003e\u0026nbsp;\u003cb\u003eRequest Interceptor\u003c/b\u003e\n\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\u0026nbsp;\n\u003cimg src=\"assets/icons/layer-group.svg\" width=\"24\" height=\"24\"\u003e\u0026nbsp;\u003cb\u003eFive-Layer Recording\u003c/b\u003e\n\u003c/p\u003e\n\n\u003cbr/\u003e\n\n## \u003cimg src=\"assets/icons/layer-group.svg\" width=\"20\" height=\"20\"\u003e Datasets\n\nClawBench ships **three** Hugging Face datasets — task definitions plus full execution traces for V1 and V2. All open, downloadable in one command. 
The benchmark itself is also mirrored on **TIGER-Lab** for visibility.\n\n| Dataset                                                                                                                                                                          | What's in it                                                                                                                                                                                          | Get it                                                        |\n| -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------- |\n| **[NAIL-Group/ClawBench](https://huggingface.co/datasets/NAIL-Group/ClawBench)** _(also mirrored at [TIGER-Lab/ClawBench](https://huggingface.co/datasets/TIGER-Lab/ClawBench))_ | Task definitions, rubrics, and metadata for V1 (153 tasks) and V2 (130 tasks) — what to attempt and how it's judged.                                                                                  | `hf download --repo-type dataset NAIL-Group/ClawBench`        |\n| **[NAIL-Group/ClawBenchV1Trace](https://huggingface.co/datasets/NAIL-Group/ClawBenchV1Trace)**                                                                                   | One directory per V1 model run, each with `recording.mp4`, `requests.jsonl`, `actions.jsonl`, `agent-messages.jsonl`, `interception.json`, and `run-meta.json` — everything we used to score the run. 
| `hf download --repo-type dataset NAIL-Group/ClawBenchV1Trace` |\n| **[NAIL-Group/ClawBenchV2Trace](https://huggingface.co/datasets/NAIL-Group/ClawBenchV2Trace)**                                                                                   | Same 5-layer bundle for **V2** model runs. Rolling — new models added as they're evaluated.                                                                                                           | `hf download --repo-type dataset NAIL-Group/ClawBenchV2Trace` |\n\n\u003e The trace datasets are large; use `hf download --include \"\u003cpattern\u003e\"` to pull a single model or a single task.\n\n\u003e **🏆 Live leaderboard:** [`claw-bench.com/leaderboard`](https://claw-bench.com/leaderboard) (V2 default, two-stage scoring — interception + LLM judge). Full scoring formula in [`eval/scoring.md`](eval/scoring.md). Add your run: PR to [`leaderboard/results.csv`](https://huggingface.co/datasets/NAIL-Group/ClawBench/blob/main/leaderboard/results.csv).\n\n## How It Works\n\n```\n   You pick a task            ClawBench spins up           Agent drives the         Interceptor captures\n   from V1 or V2              an isolated Docker           browser: navigates,      every action across\n   everyday scenarios         container + Chromium         fills forms, clicks      all 5 layers of data\n\n   ┌──────────────┐           ┌──────────────┐           ┌──────────────┐           ┌──────────────┐\n   │  \"Book a pet │    ──►    │   Container  │    ──►    │   AI Agent   │    ──►    │   5 layers   │\n   │   sitter on  │           │  + Chromium  │           │  browses the │           │  intercepted │\n   │   Rover\"     │           │  + Agent     │           │   live site  │           │  \u0026 recorded  │\n   └──────────────┘           └──────────────┘           └──────────────┘           └──────────────┘\n```\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/robot.svg\" width=\"28\" height=\"28\"\u003e LLM Quick Start\n\nPoint your 
coding agent (Claude Code, Cursor, Copilot, etc.) at [`AGENTS.md`](AGENTS.md) and prompt away.\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/person.svg\" width=\"28\" height=\"28\"\u003e Human Quick Start\n\nInstall ClawBench from PyPI for normal use:\n\n```bash\nuv tool install clawbench-eval\n```\n\nYou can also use `pipx install clawbench-eval` or `python -m pip install clawbench-eval`.\nThe installed commands are still `clawbench`, `clawbench-run`, and\n`clawbench-batch`.\n\nFor more granular control, or to contribute, clone the repo and run the root `uv` package entrypoint:\n\n```bash\ngit clone https://github.com/reacher-z/ClawBench.git \u0026\u0026 cd ClawBench \u0026\u0026 ./run.sh\n```\n\n**Prerequisites:** [Python 3.11+](https://python.org), [uv](https://docs.astral.sh/uv/), and a container engine — [Docker](https://www.docker.com/) **or** [Podman](https://podman.io/). ClawBench auto-detects whichever is installed; force one with `export CONTAINER_ENGINE=docker` or `export CONTAINER_ENGINE=podman`.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eInstall Docker or Podman\u003c/b\u003e (macOS / Linux / Windows)\u003c/summary\u003e\n\n#### macOS\n\n```bash\n# Option A — Docker Desktop (easiest, includes GUI)\nbrew install --cask docker\nopen -a Docker                 # launch and wait for the whale icon to settle\n\n# Option B — Podman (rootless, no daemon, CLI only)\nbrew install podman\npodman machine init            # one-time: downloads the Linux VM image\npodman machine start           # must be running before any podman command\n```\n\n\u003e **macOS Podman needs a VM.** `brew install podman` alone is not enough — Podman on macOS runs containers inside a small Linux VM, so you must `podman machine init \u0026\u0026 podman machine start` once after install or `podman info` will fail with `Cannot connect to Podman`.\n\n#### Linux (Ubuntu / Debian)\n\n```bash\n# Option A — Podman (rootless by default, recommended)\nsudo apt update 
\u0026\u0026 sudo apt install -y podman\n\n# Option B — Docker\nsudo apt install -y docker.io\nsudo usermod -aG docker $USER  # log out / back in so your shell picks up the group\n```\n\n\u003e **Rootful Docker ownership note:** with classic `sudo`-docker, files extracted from containers land owned by `root` on the host. ClawBench's driver detects this after each run and chowns `test-output/` back to your user automatically — but if you run other container tooling alongside, rootless Podman (or rootless Docker) avoids the issue entirely.\n\n#### Windows\n\n```powershell\n# Option A — Docker Desktop (WSL2 backend)\nwinget install Docker.DockerDesktop\n# then launch Docker Desktop from the Start menu and wait for it to be ready\n\n# Option B — Podman\nwinget install RedHat.Podman\npodman machine init\npodman machine start\n```\n\n\u003e Run the `uv run …` commands below from **PowerShell**, **WSL2**, or **Git Bash**. Like macOS, Windows Podman requires `podman machine init \u0026\u0026 podman machine start` before its first use.\n\n\u003c/details\u003e\n\n**1. Configure models** — one-time setup.\n\nIf you installed from PyPI, run `clawbench` from the directory where you want\nresults and editable config to live. On first launch it creates local templates\nunder `models/`; use the TUI to add a model or edit the file directly:\n\n```bash\nclawbench\n$EDITOR models/models.yaml\n```\n\nIf you are working from a source checkout:\n\n```bash\ncp models/models.example.yaml models/models.yaml\n$EDITOR models/models.yaml\n```\n\nPurelyMail credentials for disposable run emails are provided in the committed `.env`.\nYou only need to edit `.env` if you want to use your own PurelyMail account or enable optional HuggingFace upload.\n\n\u003e [!NOTE]\n\u003e **First run builds a container image** (Chromium + ffmpeg + noVNC + the selected agent harness dependencies). You'll see a live progress spinner with the current build step. 
Subsequent runs reuse the cached layers and finish in seconds.\n\n**2. Run your first task** (pick one):\n\n\u003e [!TIP]\n\u003e **Recommended \u0026rarr; Interactive TUI** \u0026nbsp; guided model + test case selection\n\u003e ```bash\n\u003e clawbench         # PyPI install\n\u003e uv run clawbench  # source checkout\n\u003e ```\n\u003e If installed from PyPI, run `clawbench` directly. Needs an interactive terminal.\n\u003e For pipes / CI / non-TTY, use `clawbench-run` or `clawbench-batch` directly;\n\u003e from a source checkout, prefix commands with `uv run`.\n\n**(b) Run one specific task against a specific model:**\n```bash\nuv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6\n```\nOnce the container starts, the script prints a **noVNC URL** (e.g. `http://localhost:6080/vnc.html`) — open it in your browser to watch the agent operate in real-time. If port 6080 is already in use, an alternative port is chosen automatically.\n\nResults land in `./test-output/\u003cmodel\u003e/\u003charness\u003e-\u003ccase\u003e-\u003cmodel\u003e-\u003ctimestamp\u003e/` with the full five-layer recording. 
The default harness is `openclaw`; pass `--harness opencode` to use [opencode](https://opencode.ai), `--harness claude-code` to use [Claude Code](https://docs.anthropic.com/en/docs/claude-code), `--harness claude-code-chrome-extension` to use Claude Code + the [Claude in Chrome](https://code.claude.com/docs/en/chrome) extension (Microsoft Edge + local bridge, bypass stack so any LiteLLM-routed provider works), `--harness codex` to use [OpenAI Codex CLI](https://github.com/openai/codex), `--harness claw-code` to use [claw-code](https://github.com/ultraworkers/claw-code), `--harness browser-use` to use [browser-use](https://github.com/browser-use/browser-use) (Python framework, routed via LiteLLM), `--harness hermes` to use [Hermes Agent](https://github.com/NousResearch/hermes-agent) with native browser tools attached to ClawBench Chrome via CDP, or `--harness pi` to use [Pi](https://pi.dev/) with pinned [pi-browser-harness](https://pi.dev/packages/pi-browser-harness) browser tools attached to the same ClawBench Chrome CDP endpoint.\n\n**(c) Drive the browser yourself via noVNC** — produces a human reference run:\n```bash\nuv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats --human\n```\nOpen the noVNC URL the script prints, complete the task by hand, then close the tab. 
Port is auto-assigned if 6080 is busy.\n\n**(d) Pair with an external browser agent** — run in Human mode, open the noVNC URL, and let an external browser agent control that browser session while ClawBench records and intercepts it.\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eDevelop from source\u003c/b\u003e \u0026nbsp;— clone + ``./run.sh`` for contributors\u003c/summary\u003e\n\nPrefer the repo checkout if you want to modify the driver, the bundled V1/V2 test cases, or the container build itself.\n\n```bash\ngit clone https://github.com/reacher-z/ClawBench.git \u0026\u0026 cd ClawBench\ncp models/models.example.yaml models/models.yaml   # edit: add your model API keys\n# .env is already provided for PurelyMail; edit only for your own creds or HF upload\n./run.sh                                           # interactive TUI\nuv run clawbench-run \\\n  test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6   # single run\nuv run clawbench-run \\\n  test-cases/v1/001-daily-life-food-uber-eats --human             # human mode\n```\n\nThis path gives you live-reload on ``src/``, ``src/clawbench/runtime/chrome-extension/``, and all suites under ``test-cases/`` — useful when iterating on the harness itself.\n\n\u003c/details\u003e\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/check-double.svg\" width=\"28\" height=\"28\"\u003e Reproduce the leaderboard\n\n\u003e **Our scores are stable**: two independent runs of the same model under the same judge (`deepseek/deepseek-v4-pro`, lenient rubric) reproduce Intercepted and Reward within ±2 pp on the V2 130-task corpus.\n\nThere are **two ways** to verify this on your own machine.\n\n### Path A — Re-run the agent, then score\n\nConfirms the *full pipeline* (your agent + our judge) lines up with our leaderboard row.\n\n```bash\nclawbench-batch --models deepseek/deepseek-v4-flash --cases-suite v2 \\\n  --all-cases --harness hermes --no-judge --output-dir ./my-run\nclawbench-rescore ./my-run --judge-model 
deepseek-v4-pro --rubric both\n```\n\n### Path B — Skip the run, re-judge our published traces\n\nConfirms *just the judge* matches ours (cheap, no agent compute, useful for sanity-checking your judge config).\n\n```bash\nhf download --repo-type dataset TIGER-Lab/ClawBenchV2Trace \\\n  --include \"batch-aligned-*/deepseek-v4-flash-free/**\" --local-dir ./reproduce\nclawbench-rescore ./reproduce --judge-model deepseek-v4-pro --rubric both\n```\n\nOne-shot equivalent of Path B for any model in the leaderboard:\n\n```bash\nclawbench-reproduce --model deepseek-v4-flash --tolerance 2.0\n```\n\n### Pass criterion\n\nFor `deepseek-v4-flash:free × hermes × v2`, the published row is **Intercepted 3.1% / Reward-lenient 2.3% / Reward-strict 0.0% (3 / 129)**. Path A or B counts as **reproduced** when all three metrics land within ±2 pp. Larger gaps usually mean a different judge model, a different rubric prompt, or harness configuration drift. Diff your `eval_results/\u003cbatch\u003e/summary.json` against the published row to localize the cause.\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/chart-bar.svg\" width=\"28\" height=\"28\"\u003e ClawBench-Lite\n\n**New here? Run this first.** [`test-cases/v1-lite/`](test-cases/v1-lite/) is a **20-task curated subset** of the V1 153-task corpus, selected for household-name sites, real-world relevance, difficulty, and category diversity. It matches the 20-tasks-per-source convention of [browser-use/benchmark](https://github.com/browser-use/benchmark) and gives you a credible signal at a fraction of the full-benchmark cost.\n\nTier distribution: **flagship 9 / core 8 / wildcard 3** — spanning daily life (OpenTable, DoorDash, Instacart, TaskRabbit), entertainment (Eventbrite, Goodreads, Fandango), creation (Asana, Mailchimp, Squarespace), travel (Airbnb), education (LeetCode), dev-tech (GitHub), academia (Overleaf), personal management (1Password), and more. 
All Lite tasks are judged by [`eval/agentic_eval.md`](eval/agentic_eval.md) regardless of `url_pattern` shape.\n\nThe Lite suite is a first-class task directory: run it with `--cases-suite v1-lite`, or inspect the link-backed task files in [`test-cases/v1-lite/`](test-cases/v1-lite/).\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/play.svg\" width=\"28\" height=\"28\"\u003e Demos\n\nEach ClawBench run produces a full MP4 session recording. See the [project page](https://claw-bench.com) for V1 task recordings.\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/circle-question.svg\" width=\"28\" height=\"28\"\u003e Example Walkthrough\n\nCurious what one task actually looks like, start to finish? Here's task **001** end to end.\n\n**The task** — from [`test-cases/v1/001-daily-life-food-uber-eats/task.json`](test-cases/v1/001-daily-life-food-uber-eats/task.json):\n\n```json\n{\n  \"instruction\": \"On Uber Eats, order delivery: one Pad Thai, deliver to home address, note \\\"no peanuts\\\"\",\n  \"time_limit\": 30,\n  \"eval_schema\": {\n    \"url_pattern\": \"__PLACEHOLDER_WILL_NOT_MATCH__\",\n    \"method\": \"POST\"\n  }\n}\n```\n\nThe agent gets this `instruction` verbatim, plus read-only access to `/my-info/alex_green_personal_info.json` (the dummy user's name, home address, phone, date of birth) and a disposable email account for any sign-in prompt. It has **30 minutes** to reach a `POST` request — any longer and the container is killed.\n\n**What the agent does** (the happy path):\n\n1. Navigates to `ubereats.com`\n2. Reads the dummy user's home address from `/my-info/alex_green_personal_info.json` and enters it in the delivery-address box\n3. Searches for **\"Pad Thai\"** in the food search\n4. Picks a restaurant that has Pad Thai available for delivery to that address\n5. Opens the item detail page, finds the customization or special-instructions field, enters **\"no peanuts\"**\n6. 
Adds one to cart, opens the cart, and handles any sign-in prompt using the disposable email credentials\n7. Reaches checkout, taps **Place Order**\n\n**What the interceptor catches** — that final *Place Order* tap fires a `POST` request. ClawBench's request interceptor sits in front of the browser and **captures the outbound request before it reaches Uber Eats's servers**, so the dummy user is never actually charged. At the exact moment of interception, all five recording layers (MP4 video, PNG screenshots, HTTP traffic, browser actions, agent messages) are frozen into `/data/`.\n\n**How the judge decides PASS / FAIL** — task 001's `url_pattern` is the intentional sentinel `__PLACEHOLDER_WILL_NOT_MATCH__`, which means **no request path can mechanically match**. The verdict comes from the agentic judge in [`eval/agentic_eval.md`](eval/agentic_eval.md), which replays the five-layer recording against a human reference run and checks four things:\n\n- Did the agent actually reach the final checkout step?\n- Is the cart exactly **one** Pad Thai (not two, not a combo)?\n- Is the delivery address the user's home address from `alex_green_personal_info.json`?\n- Does the order carry the **\"no peanuts\"** note in the instructions field?\n\nAll four must hold for a **PASS**. Miss any one and it's a **FAIL** with evidence from the recording pinned to the failing criterion. 
This per-task rubric is what makes ClawBench judge-sensitive rather than URL-regex-sensitive — see [`eval/README.md`](eval/README.md) for the full rubric format and [`eval/agentic_eval.md`](eval/agentic_eval.md) for the judge prompt.\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/chart-bar.svg\" width=\"28\" height=\"28\"\u003e Results\n\n\u003cdiv align=\"center\"\u003e\n\n**ClawBench leaderboard** \u0026nbsp;\u0026middot;\u0026nbsp; 6 tabs by corpus × harness \u0026nbsp;\u0026middot;\u0026nbsp; live at [claw-bench.com](https://claw-bench.com/)\n\n\u003c/div\u003e\n\n\u003cdetails open\u003e\n\u003csummary\u003e\u003cb\u003eV2 (Hermes)\u003c/b\u003e \u0026nbsp;·\u0026nbsp; 8 models \u0026nbsp;·\u0026nbsp; ds-v4-pro judge, lenient + strict\u003c/summary\u003e\n\n| Rank  | Model                  | Harness | Intercepted | Reward (lenient) | Reward (strict) | Pass / Total |\n| :---: | ---------------------- | ------- | ----------: | ---------------: | --------------: | -----------: |\n|   1   | **claude-opus-4-7**    | hermes  |   **54.6%** |        **44.6%** |           24.6% |     58 / 130 |\n|   2   | gpt-5.5                | hermes  |       45.4% |            35.4% |           18.5% |     46 / 130 |\n|   3   | glm-5.1                | hermes  |       48.5% |            34.6% |           17.7% |     45 / 130 |\n|   4   | deepseek-v4-pro        | hermes  |       43.9% |            33.9% |           12.3% |     44 / 130 |\n|   5   | openrouter-owl-alpha   | hermes  |       14.6% |             0.0% |            0.0% |      0 / 130 |\n|   6   | z-ai/glm-4.5-air:free  | hermes  |        4.6% |             2.3% |            0.8% |      3 / 130 |\n|   7   | deepseek-v4-flash:free | hermes  |        3.1% |             2.3% |            0.0% |      3 / 129 |\n|   8   | minimax-m2.5:free      | hermes  |        2.3% |             1.5% |            0.0% |      2 / 130 |\n\n**Intercepted** = final HTTP request matched the task's URL/method (Stage 1, deterministic). 
**Reward (lenient)** = additionally judged by `deepseek/deepseek-v4-pro` to fulfill the instruction under the \"no contradiction → match\" rubric (Stage 2). **Reward (strict)** = same judge, strict rubric (\"ambiguous → mismatch\"). Ranked by Intercepted; Reward as tiebreak.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eV2 (OpenClaw)\u003c/b\u003e \u0026nbsp;·\u0026nbsp; 1 model\u003c/summary\u003e\n\n| Rank  | Model   | Harness  | Intercepted | Reward (lenient) | Reward (strict) | Pass / Total |\n| :---: | ------- | -------- | ----------: | ---------------: | --------------: | -----------: |\n|   1   | glm-5.1 | openclaw |        0.0% |             0.0% |            0.0% |      0 / 130 |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eV2 (Codex)\u003c/b\u003e \u0026nbsp;·\u0026nbsp; — (in progress)\u003c/summary\u003e\n\nIn-flight: gpt-5.5-oauth, gpt-5.4-oauth, gpt-5.4-mini-oauth, gpt-5.3-codex-oauth, gpt-5.3-codex-spark-oauth, gpt-5.2-oauth. 
Will be filled in after `judge_llm` re-judge completes.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eV2 (Claude Code)\u003c/b\u003e \u0026nbsp;·\u0026nbsp; — (not yet run)\u003c/summary\u003e\n\n—\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eV1 (Hermes)\u003c/b\u003e \u0026nbsp;·\u0026nbsp; 6 frontier models, original paper rubric\u003c/summary\u003e\n\n| Rank  | Model                     | Harness | Pass Rate | Pass / Total |\n| :---: | ------------------------- | ------- | --------: | -----------: |\n|   1   | claude-opus-4-6           | hermes  |     61.4% |     94 / 153 |\n|   2   | claude-sonnet-4-6         | hermes  |     56.9% |     87 / 153 |\n|   3   | claude-haiku-4-5-20251001 | hermes  |     30.1% |     46 / 153 |\n|   4   | gpt-5.4-2026-03-05        | hermes  |     25.5% |     39 / 153 |\n|   5   | gpt-5.4-mini-2026-03-17   | hermes  |     24.8% |     38 / 153 |\n|   6   | kimi-k2.5                 | hermes  |     17.6% |     27 / 153 |\n\nV1 Pass Rate is from the original paper rubric (Claude Code agentic-eval subagent comparing each run against human reference trajectories under `eval/agentic_eval.md`). 
The two-stage Reward (interception + `deepseek/deepseek-v4-pro` lenient judge) for V1 will appear here once V1 trace bundles are re-judged.\n\n\u003cdetails\u003e\n\u003csummary\u003eV1 per-category breakdown (Sonnet 4.6 vs 6-model comparison)\u003c/summary\u003e\n\n| Rank  | Model                 | Overall  |  Daily   | Finance  |   Work   |   Dev    | Academic |  Travel  |  Social  |   Pets   |\n| :---: | --------------------- | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |\n|   1   | **Claude Sonnet 4.6** | **33.3** |   44.2   | **50.0** |   19.0   |   11.1   | **50.0** |   23.1   | **38.9** | **18.2** |\n|   2   | GLM-5                 |   24.2   | **30.8** |   16.7   | **38.1** |   16.7   |   28.6   |   0.0    |   16.7   | **18.2** |\n|   3   | Gemini 3 Flash        |   19.0   |   15.4   |   33.3   |   23.8   | **22.2** |   28.6   | **30.8** |   11.1   |   0.0    |\n|   4   | Claude Haiku 4.5      |   18.3   |   15.4   |   22.2   |   19.0   | **27.8** |   21.4   |   7.7    |   16.7   | **18.2** |\n|   5   | GPT-5.4               |   6.5    |   9.6    |   0.0    |   0.0    |   11.1   |   7.1    |   7.7    |   0.0    |   9.1    |\n|   6   | Gemini 3.1 Flash Lite |   3.3    |   1.9    |   0.0    |   0.0    |   5.6    |   14.3   |   0.0    |   0.0    |   9.1    |\n\n\u003c/details\u003e\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eV1 (OpenClaw)\u003c/b\u003e \u0026nbsp;·\u0026nbsp; — (not yet aggregated)\u003c/summary\u003e\n\n—\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eTask Categories (V1: 15 categories, 153 tasks)\u003c/b\u003e\u003c/summary\u003e\n\n| Category                  | Tasks | Example Platforms                                             |\n| ------------------------- | :---: | ------------------------------------------------------------- |\n| Daily Life                |  21   | Uber Eats, DoorDash, Instacart, Zillow, 
Craigslist            |\n| Entertainment \u0026 Hobbies   |  15   | Ticketmaster, AMC Theatres, Topgolf, Crunchyroll              |\n| Creation \u0026 Initialization |  13   | Squarespace, Wix, Webflow, Ghost, Substack                    |\n| Rating \u0026 Voting           |  10   | Trustpilot, G2, Goodreads, RateMyProfessors                   |\n| Travel                    |   9   | Booking.com, Expedia, Airbnb, TripAdvisor                     |\n| Education \u0026 Learning      |   9   | Coursera, Udemy, Khan Academy, Duolingo                       |\n| Office \u0026 Secretary        |   9   | Google Calendar, Slack, Notion, Trello                        |\n| Beauty \u0026 Personal Care    |   9   | Sephora, Ulta, Glossier                                       |\n| Job Search \u0026 HR           |   8   | LinkedIn, Greenhouse, Lever, Workday                          |\n| Pet \u0026 Animal Care         |   8   | Chewy, Petco, Rover                                           |\n| Personal Management       |   6   | Mint, YNAB, Todoist                                           |\n| Shopping \u0026 Commerce       |   6   | Amazon, eBay, Etsy, Target                                    |\n| Nonprofit \u0026 Charity       |   6   | GoFundMe, DonorsChoose                                        |\n| Academia \u0026 Research       |   5   | Google Scholar, Semantic Scholar, OpenReview                  |\n| Finance \u0026 Investment      |   4   | Robinhood, Fidelity, Coinbase                                 |\n| Others                    |  15   | Automation, Dev \u0026 Tech, Government, Home Services, Automotive |\n\n\u003c/details\u003e\n\n\u003cbr/\u003e\n\n## How ClawBench compares\n\n| Benchmark                                                           | Domain               | Environment               | Task count | ClawBench difference                                                         |\n| ------------------------------------------------------------------- | 
-------------------- | ------------------------- | ---------- | ---------------------------------------------------------------------------- |\n| [WebArena](https://webarena.dev)                                    | Synthetic web apps   | Self-hosted replicas      | 812        | Live consumer sites, not admin UIs on hosted replicas                        |\n| [GAIA](https://huggingface.co/datasets/gaia-benchmark/GAIA)         | General assistants   | Closed-book text + tools  | 466        | Browser-centric; end-to-end task execution                                   |\n| [SWE-bench](https://www.swebench.com)                               | Software engineering | GitHub repos              | 2,294      | Non-code; everyday consumer workflows                                        |\n| [BrowserGym](https://github.com/ServiceNow/BrowserGym)              | Web agents           | Headless sandbox          | —          | Cloud-parity; records real user journeys                                     |\n| [Mind2Web](https://github.com/OSU-NLP-Group/Mind2Web)               | Web navigation       | Static traces             | 2,350      | Dynamic live websites, not replayed traces                                   |\n| [Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web) | Live web navigation  | Real websites             | 300        | Comparable task count (V1+V2: 283 vs 300), with full 5-layer recordings      |\n| [VisualWebArena](https://jykoh.com/vwa)                             | Visual web tasks     | Self-hosted (3 sites)     | 910        | Real websites with full visual layer (vs 3 hosted apps)                      |\n| [WebVoyager](https://github.com/MinorJerry/WebVoyager)              | Real-website nav     | Real websites (15)        | 643        | Interception-graded vs LLM-judge-only, 144 sites covered                     |\n| [TheAgentCompany](https://the-agent-company.com)                    | Office workflows     | Self-hosted (6 platforms) | 175  
      | Consumer everyday tasks instead of enterprise sandbox                        |\n\nClawBench's niche: **live consumer websites, everyday tasks, end-to-end recording**. If you want a controlled sandbox or replayed traces, the projects above are excellent. If you want to know whether your agent can actually order food or book a flight *today*, this is the benchmark for that.\n\n\u003cbr/\u003e\n\n## Architecture\n\n\u003cdetails\u003e\n\u003csummary\u003eContainer internals\u003c/summary\u003e\n\n```\n┌─────────────────────────────────────────────────┐\n│  Container (Docker / Podman)                    │\n│                                                 │\n│  ┌───────────┐   DOM events  ┌──────────────┐   │\n│  │ content.js├──────────────►│ background.js│   │\n│  │ (per tab) │               │  (service    │   │\n│  └───────────┘               │   worker)    │   │\n│                              └──┬──────┬────┘   │\n│                                 │      │        │\n│                         actions │      │ screenshots\n│                                 │      │        │\n│  ┌──────────┐            ┌──────▼──────▼────┐   │\n│  │  Xvfb    │◄──ffmpeg──►│  FastAPI Server  │   │\n│  │ :99      │  x11grab   │  :7878           │   │\n│  └──────────┘            └──────────────────┘   │\n│                                  │              │\n│  ┌──────────┐            ┌───────▼─────────┐    │\n│  │ Chromium │            │     /data       │    │\n│  │ :9222 CDP│            │  actions.jsonl  │    │\n│  └──────────┘            │  requests.jsonl │    │\n│                          │  screenshots/   │    │\n│                          │  recording.mp4  │    │\n│                          └─────────────────┘    │\n└─────────────────────────────────────────────────┘\n```\n\n\u003c/details\u003e\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/terminal.svg\" width=\"28\" height=\"28\"\u003e CLI\n\n```bash\n# Interactive TUI (recommended):\n./run.sh\n\n# Single run:\nuv run 
clawbench-run test-cases/v1/001-daily-life-food-uber-eats claude-sonnet-4-6\n\n# Human mode (you control the browser via noVNC):\nuv run clawbench-run test-cases/v1/001-daily-life-food-uber-eats --human\n\n# Batch (all models x cases 1-50, 3 concurrent):\nuv run clawbench-batch --all-models --case-range 1-50 --max-concurrent 3\n\n# Batch all V1 tasks from test-cases/v1/:\nuv run clawbench-batch --models claude-sonnet-4-6 --all-cases --max-concurrent 3\n\n# Batch all V2 tasks from test-cases/v2/:\nuv run clawbench-batch --models claude-sonnet-4-6 --cases-suite v2 --all-cases --max-concurrent 3\n\n# Batch converted Claw-Eval tasks from test-cases/claw-eval/:\nuv run clawbench-batch --models claude-sonnet-4-6 --cases-suite claw-eval --all-cases\n\n# Batch a custom case directory:\nuv run clawbench-batch --models claude-sonnet-4-6 --cases-dir custom-cases --all-cases\n```\n\nV1 tasks are in [`test-cases/v1/`](test-cases/v1/) (153 tasks). V2 tasks are in `test-cases/v2/` (130 tasks), Lite is in `test-cases/v1-lite/` (20 tasks), and converted Claw-Eval tasks live in `test-cases/claw-eval/` (19 tasks). All suites use [`test-cases/task.schema.json`](test-cases/task.schema.json). For test case authoring details, see [CONTRIBUTING.md](CONTRIBUTING.md). For output structure and evaluation guidance, see [eval/README.md](eval/README.md).\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/chart-bar.svg\" width=\"28\" height=\"28\"\u003e Evaluation\n\nEvaluation is a **post-session** step -- first run agents to collect trajectories, then evaluate them against human reference runs.\n\n```\n 1. Run agents (root uv package)   2. 
Evaluate (eval/)\n ─────────────────────────         ────────────────────────────────\n ./run.sh / clawbench-batch ──►    Claude Code subagents compare\n produces test-output/             agent vs human trajectories\n   with 5-layer recordings         under eval/agentic_eval.md rubric\n```\n\nThe evaluator compares each agent trajectory against a human reference trajectory across all five recording layers (video, screenshots, HTTP traffic, browser actions, agent messages), then outputs PASS/FAIL with evidence-backed justification.\n\nSee [eval/README.md](eval/README.md) for the full evaluation guide and Claude Code prompt template.\n\n\u003cbr/\u003e\n\n# \u003cimg src=\"assets/icons/circle-question.svg\" width=\"28\" height=\"28\"\u003e FAQ\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eWhat data does each run produce?\u003c/b\u003e\u003c/summary\u003e\n\nEach session records five layers of synchronized data under `/data/`:\n\n| Layer              | File                   | Description                                                     |\n| ------------------ | ---------------------- | --------------------------------------------------------------- |\n| Session replay     | `recording.mp4`        | Full session video (H.264, 15fps)                               |\n| Action screenshots | `screenshots/*.png`    | Timestamped PNG per browser action                              |\n| Browser actions    | `actions.jsonl`        | Every DOM event (click, keydown, input, pageLoad, scroll, etc.) |\n| HTTP traffic       | `requests.jsonl`       | Every HTTP request with headers, body, and query params         |\n| Agent messages     | `agent-messages.jsonl` | Full agent conversation transcript (thinking, text, tool calls) |\n\nFor the Pi harness, `agent-messages.jsonl` is filtered Pi JSON mode output, including `message_start`/`message_end` events, `tool_execution_*` events, tool-call content blocks, and `thinking` blocks when the selected model emits reasoning. 
Streaming `message_update` fragments, including `*_delta` rows, are omitted because complete assistant messages are already preserved in `message_end` events.\n\nHarness diagnostic logs such as Pi's `agent.log` and `proxy.log` are not copied into the final `data/` directory.\n\nThe interceptor result is saved to `interception.json`.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eHow does the request interceptor work?\u003c/b\u003e\u003c/summary\u003e\n\nThe interceptor blocks critical, irreversible HTTP requests (checkout, form submit, email send) to prevent real-world side effects. It connects to Chrome via CDP's `Fetch` domain and matches requests against the eval schema (`url_pattern` regex + `method` + optional `body`/`params`). When triggered, it saves the blocked request to `interception.json`, kills the agent, and stops recording.\n\nThe interceptor does **not** validate task completion -- evaluation is handled separately by evaluators post-session.\n\nFor tasks behind payment walls (agent has no valid credit card), the eval schema uses a placeholder pattern that never matches, so the session runs until timeout.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eWhat is the synthetic user profile?\u003c/b\u003e\u003c/summary\u003e\n\nEach container gets a `/my-info/` directory with a dummy user identity (Alex Green): personal info JSON, email credentials, and a resume PDF. The email is a fresh disposable PurelyMail address generated per run. The agent reads these files when it needs to fill forms, register accounts, etc.\n\nSource templates: `src/clawbench/runtime/shared/alex_green_personal_info.json` (profile) and `src/clawbench/runner/run_support/resume_template.json` (resume).\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eCan I use Podman instead of Docker?\u003c/b\u003e\u003c/summary\u003e\n\nYes. Set `export CONTAINER_ENGINE=podman`. 
The framework auto-detects whichever is available. Podman works without root privileges.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eWhat tools can the agent use?\u003c/b\u003e\u003c/summary\u003e\n\nAll supported harnesses run inside the same container recording and interception environment. CLI/MCP harnesses expose the browser tool plus a restricted set of read-only shell commands (`ls`, `cat`, `find`, `grep`, `head`, `tail`, `jq`, `wc`, etc.); commands that could bypass the browser (`curl`, `python`, `node`, `wget`) are blocked. Hermes and Pi use native browser/file tools attached to the same ClawBench Chrome CDP endpoint. The Pi harness intentionally allowlists only read-only file tools and browser interaction tools; `bash`, `write`, `edit`, `browser_http_get`, and `browser_run_script` are not enabled. The agent instruction also explicitly requires browser-only task completion.\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003e\u003cb\u003eHow do I add a new test case?\u003c/b\u003e\u003c/summary\u003e\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md). In short: create a directory under the target corpus (`test-cases/v1/` for V1 or `test-cases/v2/` for V2) with a `task.json` conforming to `test-cases/task.schema.json`, define the eval schema, test with human mode, and submit a PR.\n\n\u003c/details\u003e\n\n\u003cbr/\u003e\n\n## Contributing\n\nWe welcome contributions -- especially new test cases. If you've ever ordered groceries, booked an appointment, or filed a form online, you already know how to write one. 
Most PRs are a single JSON file and land in under a day.\n\n**Quick wins:**\n\n- [Add a new test case](CONTRIBUTING.md#adding-a-new-test-case) (~30 min, no container expertise needed)\n- [Add a new category](CONTRIBUTING.md#what-were-looking-for) of 10+ tasks \u0026rarr; co-author invitation on the next paper revision\n- [Submit a new model](CONTRIBUTING.md#what-were-looking-for) to the public leaderboard\n- Browse [good first issues](https://github.com/reacher-z/ClawBench/labels/good%20first%20issue)\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for the full guide and contributor recognition policy.\n\n## Community\n\nCome hang out with researchers, builders, and contributors working on real-world browser agents.\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\" width=\"33%\"\u003e\n\u003ca href=\"https://discord.gg/clawbench\"\u003e\n\u003cimg src=\"https://img.shields.io/badge/Discord-Join-5865F2?style=for-the-badge\u0026logo=discord\u0026logoColor=white\" alt=\"Discord\"\u003e\n\u003c/a\u003e\n\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eEnglish community\u003c/b\u003e\u003cbr/\u003eAgent builders, researchers, contributors\u003c/sub\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\" width=\"33%\"\u003e\n\u003ca href=\"assets/community/wechat_grp_422.jpg\"\u003e\n\u003cimg src=\"https://img.shields.io/badge/%E5%BE%AE%E4%BF%A1%E7%BE%A4-%E5%8A%A0%E5%85%A5-07C160?style=for-the-badge\u0026logo=wechat\u0026logoColor=white\" alt=\"微信群\"\u003e\n\u003c/a\u003e\n\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003e中文社区\u003c/b\u003e\u003cbr/\u003e研究者、开发者、贡献者交流\u003c/sub\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\" width=\"33%\"\u003e\n\u003ca href=\"https://github.com/reacher-z/ClawBench/discussions\"\u003e\n\u003cimg src=\"https://img.shields.io/badge/GitHub-Discussions-181717?style=for-the-badge\u0026logo=github\u0026logoColor=white\" alt=\"GitHub Discussions\"\u003e\n\u003c/a\u003e\n\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eAsync 
Q\u0026A\u003c/b\u003e\u003cbr/\u003eSearchable, long-form, permanent\u003c/sub\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\nUse the Discord and GitHub Discussions links for ongoing community support. For 微信群, use the QR link above.\n\n## Frequently Asked Questions\n\n**What is ClawBench?**\nClawBench is an open-source benchmark for AI browser agents — the systems (GPT-based, Claude-based, or open) that drive a real web browser to complete a user's task. V1 measures whether the agent actually finishes 153 everyday online tasks across 144 live websites; V2 adds a 130-task corpus in `test-cases/v2/`. It measures completion, not whether the agent produces the right-looking text.\n\n**What kinds of tasks does ClawBench cover?**\nFifteen life categories: food delivery, travel booking, job applications, shopping, housing search, email and calendar management, academic research, software development, learning platforms, and more. Every task is something a normal person might do in a normal week, on a real website.\n\n**Are 153 tasks enough for evaluation?**\nYes for a V1 benchmark signal: the 153 tasks span 144 live websites and 15 life categories, and each full run is expensive because it uses isolated containers, real websites, five-layer recording, and post-session judgment against human references. V2 adds another 130 tasks in `test-cases/v2/`. For cheaper iteration, start with the 20-task [`test-cases/v1-lite/`](test-cases/v1-lite/) subset.\n\n**How is a task judged successful?**\nEach task runs in an isolated browser container with a five-layer recording: video, screenshots, network requests, browser actions, and agent messages. For the original V1 results, an evaluator compares the agent trajectory against human reference runs and assigns PASS/FAIL with evidence from the recording. 
For V2 and newer leaderboard rows, scoring is two-stage: first, the request interceptor checks whether the final blocked HTTP request matches the task's URL/method schema; second, an LLM judge checks whether the captured request payload fulfills the natural-language instruction.\n\n**How do account login, registration, and initial task state work?**\nEach run receives a synthetic user profile plus a fresh disposable PurelyMail address. If a task requires sign-up, the agent normally starts from scratch and registers during the run, using the provided identity and email. If a task needs starting files or workspace context, those files live under the task's `extra_info/` directory and are mounted for the agent at runtime.\n\n**What happens when live websites change?**\nLive-site change is part of the benchmark's target: ClawBench measures whether agents can handle production websites rather than frozen snapshots. That also means some runs can be affected by layout changes, availability, anti-bot systems, or alternate flows. Reproducibility comes from publishing task definitions, eval schemas, run metadata, and five-layer traces; repeated runs over time are still useful for measuring site drift.\n\n**Do CAPTCHA or bot checks dominate failures?**\nIf an agent encounters a CAPTCHA, it must attempt it. We have seen frontier models solve some CAPTCHAs. CAPTCHA failures can reflect model behavior, browser-control stack limits, or site defenses. The trace datasets make these failures inspectable.\n\n**What's the current top score?**\n33.3%, roughly one task in three, from the strongest frontier model we evaluated. The majority of tasks still defeat every model we've tested; the headroom is real, and the benchmark is not saturated.\n\n**Which harness are the published model results based on?**\nThe repo default is `openclaw`, but leaderboard rows include their harness explicitly. 
V1 results used OpenClaw; newer runs may use Hermes or other supported harnesses. Use the `harness` column when comparing models, because model and harness changes are separate experimental axes.\n\n**Is ClawBench tightly coupled to OpenClaw?**\nNo. OpenClaw is the default harness, but ClawBench supports interchangeable harnesses listed in `src/clawbench/runtime/harnesses/harnesses.yaml`.\n\n**Can ClawBench evaluate CLI agents?**\nYes. ClawBench is a browser-task benchmark, but CLI and coding-agent harnesses can drive the same instrumented Chromium session using native tools or MCPs.\n\n**How do I reproduce a published score?**\nFrom a source checkout, configure `models/models.yaml`, then run `uv run clawbench`. The TUI builds the container image and runs local tasks against your model of choice. For batch runs, use `--all-cases` for the default V1 suite, `--cases-suite v2 --all-cases` for V2, or `--cases-suite v1-lite --all-cases` for Lite.\n\n**Will newer models be added?**\nYes. New model runs can be submitted or requested through the contribution flow and issues. Public rows are added as complete or clearly marked partial runs, depending on what has finished.\n\n**Is ClawBench safe to run against live websites?**\nThe runner uses a hardened container with a request interceptor that blocks purchases, account creation, outbound email sends, and similar irreversible actions by default. Tasks that need to *simulate* those actions (e.g., \"add to cart and checkout\") terminate at the last reversible step. You can relax the interceptor per-task if your research requires it.\n\n**Can I contribute new tasks or harnesses?**\nYes. V1 tasks live in `test-cases/v1/`; V2 tasks live in `test-cases/v2/`; Lite tasks live in `test-cases/v1-lite/`. Harness definitions live in `src/clawbench/runtime/harnesses/harnesses.yaml`. See `CONTRIBUTING.md` for the task schema and validation flow.\n\n**How does ClawBench relate to HarnessBench?**\nSame scoring pipeline, orthogonal axis. 
ClawBench fixes the harness and varies the model; HarnessBench fixes the model and varies the harness. They share the V1 153-task corpus, the five-layer recording, and the agentic evaluator — so numbers are directly comparable.\n\n## Citation\n\nIf you use ClawBench in your research, please cite:\n\n```bibtex\n@misc{zhang2026clawbenchaiagentscomplete,\n  title         = {ClawBench: Can AI Agents Complete Everyday Online Tasks?},\n  author        = {Yuxuan Zhang and Yubo Wang and Yipeng Zhu and Penghui Du and Junwen Miao and Xuan Lu and Wendong Xu and Yunzhuo Hao and Songcheng Cai and Xiaochen Wang and Huaisong Zhang and Xian Wu and Yi Lu and Minyi Lei and Kai Zou and Huifeng Yin and Ping Nie and Liang Chen and Dongfu Jiang and Wenhu Chen and Kelsey R. Allen},\n  year          = {2026},\n  eprint        = {2604.08523},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.CL},\n  url           = {https://arxiv.org/abs/2604.08523}\n}\n```\n\n## Contact\n\nQuestions, suggestions, or research collaboration? 
Reach the maintainer:\n\n- **Yuxuan Zhang** \u0026mdash; `reacher` \u0026lbrack;at\u0026rbrack; `cs.ubc.ca` (UBC, NAIL Group) \u0026middot; [Homepage \u0026#8599;](https://reacher-z.github.io)\n- For bug reports or feature requests, please [open a GitHub issue](https://github.com/reacher-z/ClawBench/issues/new/choose) \u0026mdash; it's faster than email and gets seen by all maintainers.\n\n## Core Contributors\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/reacher-z\"\u003e\n\u003cimg src=\"https://github.com/reacher-z.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eYuxuan Zhang\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/Wyyyb\"\u003e\n\u003cimg src=\"https://github.com/Wyyyb.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eYubo Wang\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/Perry2004\"\u003e\n\u003cimg src=\"https://github.com/Perry2004.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003ePerry Zhu\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/eternaldolphin\"\u003e\n\u003cimg src=\"https://github.com/eternaldolphin.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003ePenghui Du\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/MEKSAAA\"\u003e\n\u003cimg src=\"https://github.com/MEKSAAA.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eJunwen 
Miao\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## Advisors\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/k-r-allen\"\u003e\n\u003cimg src=\"https://github.com/k-r-allen.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eKelsey R. Allen\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/wenhuchen\"\u003e\n\u003cimg src=\"https://github.com/wenhuchen.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eWenhu Chen\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/jdf-prog\"\u003e\n\u003cimg src=\"https://github.com/jdf-prog.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eDongfu Jiang\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003ctd align=\"center\"\u003e\n\u003ca href=\"https://github.com/chenllliang\"\u003e\n\u003cimg src=\"https://github.com/chenllliang.png\" width=\"80\" height=\"80\" style=\"border-radius:50%\"\u003e\u003cbr/\u003e\n\u003csub\u003e\u003cb\u003eLiang Chen\u003c/b\u003e\u003c/sub\u003e\n\u003c/a\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n## Support ClawBench\n\nIf ClawBench is useful for your research or product work,\nthe single most helpful thing you can do is **[star the repo](https://github.com/reacher-z/ClawBench)** —\nit surfaces the benchmark to other AI-agent researchers and helps us justify\ncontinued dataset curation.\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/reacher-z/ClawBench\"\u003e\n\u003cimg 
src=\"https://img.shields.io/badge/%E2%98%85%20Star%20this%20repo-181717?style=for-the-badge\u0026logo=github\u0026logoColor=white\" alt=\"Star this repo\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\nOpen to contributions — new test cases, bug fixes, or evaluation submissions for a model we haven't scored yet. See [`CONTRIBUTING.md`](CONTRIBUTING.md).\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://github.com/reacher-z/ClawBench/graphs/contributors\"\u003e\n\u003cimg src=\"https://contrib.rocks/image?repo=reacher-z/ClawBench\" alt=\"Contributors\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\n## Star History\n\n\u003ca href=\"https://star-history.com/#reacher-z/ClawBench\u0026Date\"\u003e\n  \u003cpicture\u003e\n    \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=reacher-z/ClawBench\u0026type=Date\u0026theme=dark\" /\u003e\n    \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=reacher-z/ClawBench\u0026type=Date\" /\u003e\n    \u003cimg alt=\"ClawBench Star History\" src=\"https://api.star-history.com/svg?repos=reacher-z/ClawBench\u0026type=Date\" width=\"600\" /\u003e\n  \u003c/picture\u003e\n\u003c/a\u003e\n\n## License \u0026 Acknowledgments\n\nApache 2.0 -- see [LICENSE](LICENSE).\n\nThe converted Claw-Eval suite in [`test-cases/claw-eval/`](test-cases/claw-eval/) is derived from [claw-eval/claw-eval](https://github.com/claw-eval/claw-eval) and the [claw-eval/Claw-Eval](https://huggingface.co/datasets/claw-eval/Claw-Eval) dataset, which are released under the MIT License. 
Third-party package notices are in [NOTICE](NOTICE).\n\nBuilt with [OpenClaw](https://github.com/openclaw/openclaw), [opencode](https://opencode.ai), [Claude Code](https://docs.anthropic.com/en/docs/claude-code), the [Claude in Chrome](https://code.claude.com/docs/en/chrome) extension, [OpenAI Codex CLI](https://github.com/openai/codex), [browser-use](https://github.com/browser-use/browser-use), [claw-code](https://github.com/ultraworkers/claw-code), [Hermes Agent](https://github.com/NousResearch/hermes-agent), and [Pi](https://pi.dev/) with [pi-browser-harness](https://pi.dev/packages/pi-browser-harness) (selectable harnesses), [Microsoft Playwright MCP](https://github.com/microsoft/playwright-mcp) (browser control bridge for the opencode, claude-code, codex, and claw-code harnesses), [LiteLLM](https://github.com/BerriAI/litellm) (API translation proxy for the claude-code, claude-code-chrome-extension, codex, browser-use, claw-code, and pi harnesses), [noVNC](https://github.com/novnc/noVNC) (MPL 2.0), and [websockify](https://github.com/novnc/websockify) (LGPL 3.0).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fclawbench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftiger-ai-lab%2Fclawbench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftiger-ai-lab%2Fclawbench/lists"}