{"id":49969549,"url":"https://github.com/lexmount/browseruse-agent-bench","last_synced_at":"2026-05-18T07:16:38.247Z","repository":{"id":356658897,"uuid":"1233522763","full_name":"lexmount/browseruse-agent-bench","owner":"lexmount","description":"browseruse agent bench","archived":false,"fork":false,"pushed_at":"2026-05-09T04:02:15.000Z","size":9185,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-09T06:15:47.923Z","etag":null,"topics":["agent","benchmark","browseruse"],"latest_commit_sha":null,"homepage":"https://lexmount.github.io/browseruse-agent-bench/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/lexmount.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-05-09T03:49:23.000Z","updated_at":"2026-05-09T04:12:01.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/lexmount/browseruse-agent-bench","commit_stats":null,"previous_names":["lexmount/browseruse-agent-bench"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/lexmount/browseruse-agent-bench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexmount%2Fbrowseruse-agent-bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexmount%2Fbrowseruse-agent-bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/lexmount%2Fbrowseruse-agent-bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexmount%2Fbrowseruse-agent-bench/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/lexmount","download_url":"https://codeload.github.com/lexmount/browseruse-agent-bench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/lexmount%2Fbrowseruse-agent-bench/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33168919,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T05:43:36.989Z","status":"ssl_error","status_checked_at":"2026-05-18T05:43:19.133Z","response_time":71,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","benchmark","browseruse"],"created_at":"2026-05-18T07:16:36.994Z","updated_at":"2026-05-18T07:16:38.240Z","avatar_url":"https://github.com/lexmount.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/logo/blue.svg\" alt=\"Browseruse-Bench\" width=\"600\"\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://lexmount.github.io/browseruse-agent-bench/\"\u003eLanding Page\u003c/a\u003e •\n  \u003ca href=\"https://github.com/lexmount/browseruse-agent-bench/issues\"\u003eIssues\u003c/a\u003e •\n  \u003ca 
href=\"https://github.com/lexmount/browseruse-agent-bench/discussions\"\u003eDiscussions\u003c/a\u003e •\n  \u003ca href=\"#leaderboard\"\u003eLeaderboard\u003c/a\u003e •\n  \u003ca href=\"https://docs.bubench.lexmount.io/\"\u003eDocumentation\u003c/a\u003e •\n  \u003ca href=\"https://huggingface.co/datasets/Lexmount/LexBench-Browser\"\u003eDataset\u003c/a\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  English | \u003ca href=\"./README_ZH.md\"\u003e简体中文\u003c/a\u003e\n\u003c/p\u003e\n\n## Why browseruse-agent-bench\n\n**browseruse-agent-bench** is a reproducible evaluation framework for browser agents.\n**LexBench-Browser** is the built-in public dataset used by the default benchmark workflow.\nTogether they make external results easy to run, compare, cite, and submit back.\n\n| What you can do | Why it matters |\n|-----------------|----------------|\n| Run **LexBench-Browser: 210 public tasks across 107 real websites** | Test browser agents on long-tail multilingual workflows beyond toy pages |\n| Compare **Agent × Model × Browser × Eval** | Separate agent quality from model choice, browser backend, and judge strategy |\n| Inspect leaderboard, cost, latency, token usage, and trajectories | Debug failures instead of only reporting a final score |\n| Submit agents, dataset tasks, and reproducible results | Turn forks and PRs into visible benchmark contributions |\n\n## Description\n\n\n**browseruse-agent-bench** is an all-in-one evaluation framework for AI browser agents, designed to benchmark *multiple agents across multiple datasets, browser backends, and models* under controlled and reproducible settings. The Python package/CLI is published as **browseruse-bench** and `bubench`. 
It supports both local and cloud browsers, integrates LLM-as-Judge for automated evaluation, and provides a built-in local leaderboard along with efficiency and cost metrics such as agent steps, end-to-end latency, and token usage.\n\n**Supported Datasets**\n\n- [x] **LexBench-Browser** — Browser-agent dataset covering e-commerce, social, academic, financial, and other mainstream Chinese/English websites (v1.0, 2026-04-30)\n  - `All` (210, no login required)\n  - `lexmount` (118, mainland-accessible websites) / `global` (92, international websites)\n  - Hugging Face: [Lexmount/LexBench-Browser](https://huggingface.co/datasets/Lexmount/LexBench-Browser)\n- [x] **Online-Mind2Web** — Real website interaction tasks\n  - `All` (300) / `Hard` (hard subset)\n- [x] **BrowseComp** — Browser operation competition tasks, no login required\n  - `All` (1266)\n- [ ] More benchmarks\n\n\u003e Details: [Benchmarks overview](https://docs.bubench.lexmount.io/en/benchmarks/overview).\n\n**Supported Agents \u0026 Browsers**\n\n| Agent | Supported Browsers |\n|-------|-------------------|\n| [browser-use](https://github.com/browser-use/browser-use) | `Chrome-Local`, `lexmount`, `browser-use-cloud`, `agentbay` |\n| [skyvern](https://github.com/Skyvern-AI/Skyvern/) | `local`, `lexmount`, `skyvern-cloud` |\n| [Agent-TARS](https://github.com/bytedance/UI-TARS-desktop) | Built-in browser |\n| More agents | — |\n\n\u003e Details: [Agents overview](https://docs.bubench.lexmount.io/en/agents/overview).\n\n## News\n\n- **[2026.04.30]** 🎉 **browseruse-agent-bench v1.0** — initial open-source release. The LexBench-Browser dataset v1.0 ships 210 public tasks across 107 distinct websites with a 6-category × 16-tag robustness label system; reference integrations cover browser-use, skyvern, Agent-TARS and deepbrowse.\n\n## Quickstart\n\n\n**1. Clone the repository**\n\n```bash\ngit clone https://github.com/lexmount/browseruse-agent-bench.git\ncd browseruse-agent-bench\n```\n\n**2. 
Install dependencies (Python\u003e=3.11)**\n\nRequires [uv](https://docs.astral.sh/uv/) (recommended). Select the section for your agent.\n\n\u003e **Note**: `browser-use` and `skyvern` have conflicting dependencies and cannot be installed together. If you plan to run multiple agents in parallel, refer to the [Environment Isolation](https://docs.bubench.lexmount.io/en/quickstart#running-multiple-agents-in-parallel) section in the documentation.\n\n**browser-use**\n\n```bash\nuv sync --extra browser-use\nsource .venv/bin/activate          # macOS / Linux\n.venv\\Scripts\\Activate.ps1         # Windows PowerShell\n```\n\n**skyvern**\n\n```bash\nuv sync --extra skyvern\nsource .venv/bin/activate          # macOS / Linux\n.venv\\Scripts\\Activate.ps1         # Windows PowerShell\n```\n\n**Agent-TARS** (requires Node.js 18+)\n\n```bash\nuv sync\nnpm install -g @agent-tars/cli@0.3.0\nsource .venv/bin/activate          # macOS / Linux\n.venv\\Scripts\\Activate.ps1         # Windows PowerShell\n```\n\n\u003e After activation, the `bubench` CLI is available on your PATH. Without activation, prefix every `bubench …` command in the following steps with `uv run` (e.g. `uv run bubench run …`).\n\n**3. Configure**\n\n\u003e **Principle**: `.env` holds sensitive credentials (API keys). `config.example.yaml` → `config.yaml` (git-ignored) holds all agent, model, browser, and eval settings in one place.\n\n**3.1 Shared credentials (`.env`)**\n\n```bash\ncp .env.example .env\nvim .env\n```\n\n| Variable | Description | Sign up | Required |\n|----------|-------------|---------|----------|\n| `OPENAI_API_KEY` | API key for agents and evaluation | [platform.openai.com](https://platform.openai.com/api-keys) | ✅ |\n| `OPENAI_BASE_URL` | Custom API base URL (e.g. 
LiteLLM proxy) | — | Optional |\n| `LEXMOUNT_API_KEY` + `LEXMOUNT_PROJECT_ID` | Lexmount cloud browser | [browser.lexmount.cn](https://browser.lexmount.cn/) | When using lexmount |\n| `BROWSER_USE_API_KEY` | Browser Use cloud browser | [browser-use.com](https://www.browser-use.com/) | When using browser-use-cloud |\n| `AGENTBAY_API_KEY` | AgentBay cloud browser | [agentbay.ai](https://agentbay.ai/) | When using agentbay |\n| `HF_ENDPOINT=https://hf-mirror.com` | HuggingFace mirror (China) | — | Optional |\n\n**3.2 Runtime config (`config.yaml`)**\n\n```bash\ncp config.example.yaml config.yaml\nvim config.yaml\n```\n\nAll agents are configured in one file. Key fields under `agents.\u003cagent\u003e`:\n\n| Field | Description |\n|-------|-------------|\n| `active_model` | Which model entry to use (must match a key under `models`) |\n| `models.\u003cname\u003e.model_type` | Provider: `BROWSER_USE`, `OPENAI`, `AZURE`, `GEMINI`, `ANTHROPIC` |\n| `models.\u003cname\u003e.model_id` | Model ID (e.g. `gpt-4.1`, `qwen3.5-plus`, `kimi-k2.5`) |\n| `models.\u003cname\u003e.api_key` | API key for this model (supports `$ENV_VAR` expansion) |\n| `models.\u003cname\u003e.base_url` | API base URL (optional, supports `$ENV_VAR` expansion) |\n| `browser.browser_id` | Browser backend: `Chrome-Local`, `lexmount`, `browser-use-cloud`, `agentbay`, `cdp` |\n| `defaults.*` | Shared agent params: `max_steps`, `timeout`, `use_vision`, etc. |\n| `eval.model` + `eval.api_key` + `eval.base_url` | Evaluation model settings |\n\nTo switch models, change `active_model` and ensure the matching entry exists under `models`.\n\n**4. Install Skills (Optional)**\n\n```bash\nbubench skills\n```\n\nInstalls the prebuilt developer-friendly skills pack (`browseruse_bench/skills/`) into your agent toolchain.\n\n**5. 
Run \u0026 Evaluate**\n\n**Run**\n```bash\nbubench run --agent {AGENT} --data {BENCHMARK} --mode first_n --count 3\n# Output: experiments/{benchmark}/{split}/{agent}/{model_id}/{timestamp}/\n\n# Example: LexBench-Browser (no login required)\nbubench run --agent browser-use --data LexBench-Browser --mode first_n --count 3\n# Output: experiments/LexBench-Browser/All/browser-use/gpt-4.1/20260101_120000/\n```\n\n**Evaluate**\n```bash\nbubench eval --agent {AGENT} --data {BENCHMARK} --model-id {MODEL_ID}\n\n# Example\nbubench eval --agent browser-use --data LexBench-Browser --model-id gpt-4.1\n```\n\n\u003e `--split` is optional — the benchmark's `default_split` (from `data_info.json`) is used automatically. Pass `--split \u003cname\u003e` only to override the default.\n\u003e For the full parameter reference, see the [Quickstart docs](https://docs.bubench.lexmount.io/en/quickstart).\n\n## Data Loading\n\nUse `--data-source` to control where benchmark data is loaded from:\n\n| Mode | Description | Example |\n|------|-------------|---------|\n| `local` (default) | Uses local files under `benchmarks/{benchmark}/data/`, errors if missing | `--data-source local` |\n| `huggingface` | Downloads to HF cache (`~/.cache/huggingface`), does not write back to repo | `--data-source huggingface` |\n| `huggingface` + `--force-download` | Forces re-download, refreshes HF cache | `--data-source huggingface --force-download` |\n\n\u003e **Speed up in China**: Set `HF_ENDPOINT=https://hf-mirror.com` in `.env`.\n\u003e **Private datasets**: Set `HF_TOKEN=hf_your_token_here` in `.env`.\n\nDetails: [Data Loading](https://docs.bubench.lexmount.io/en/benchmarks/data-loading).\n\n\u003e 📖 For complete guides, API reference, and more examples, see the [full documentation](https://docs.bubench.lexmount.io/).\n\n## Leaderboard\n\nWe provide an interactive local leaderboard to compare agent performance across benchmarks.\n\nGenerate leaderboard HTML:\n```bash\nbubench leaderboard\n```\n\nDeploy 
leaderboard service (temporary process):\n```bash\nbubench server --host 0.0.0.0 --port 8012 \u0026\n```\n\nDeploy leaderboard service (systemd):\n```bash\nsudo bubench service install\nsudo bubench service start\n```\n\nSee [Leaderboard Documentation](https://docs.bubench.lexmount.io/en/leaderboard/overview) for more details.\n\n**Access URLs (default port `8012`):**\n- Local leaderboard: [http://localhost:8012](http://localhost:8012)\n- Local API docs: [http://localhost:8012/docs](http://localhost:8012/docs)\n- Remote leaderboard: `http://\u003cSERVER_IP\u003e:8012/`\n- Remote API docs: `http://\u003cSERVER_IP\u003e:8012/docs`\n\n## Visualization\n\nAn interactive experiment explorer for browsing agent trajectories, evaluation details, and per-task API logs — complements the static leaderboard with task-level drill-down.\n\n```bash\n# Start server (auto-regenerates index when experiment files change)\nbubench viz --watch\n\n# Access at http://localhost:8080\n```\n\n**Options:**\n\n```bash\nbubench viz --port 8090              # custom port (default: 8080)\nbubench viz --generate-only          # regenerate experiments.json and exit\nbubench viz --watch-interval 5       # poll interval in seconds (default: 3)\n```\n\nFor remote sharing with tmux and firewall configuration, see [Visualization Documentation](https://docs.bubench.lexmount.io/en/leaderboard/visualization#remote--intranet-sharing).\n\n## Acknowledgements\n\nSome code in this project is cited and modified from [Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web) and [simple-evals](https://github.com/openai/simple-evals).\n\n## Citation\n\n\n```bibtex\n@misc{lexbench_browser_2026,\n    title        = {LexBench-Browser: A Real-World Browser Agent Benchmark with Long-Tail and Multilingual Tasks},\n    author       = {Lexmount Research and Collaborators},\n    year         = {2026},\n    howpublished = {\\url{https://lexmount.github.io/browseruse-agent-bench/}},\n    note         = {Open 
benchmark; v1.0 reference release},\n}\n```\n\n## Contact\n\n\nQuestions, benchmark proposals, agent integrations, and result reproductions are welcome:\n\n- Report bugs or request features in [GitHub Issues](https://github.com/lexmount/browseruse-agent-bench/issues).\n- Ask questions and discuss results in [GitHub Discussions](https://github.com/lexmount/browseruse-agent-bench/discussions).\n- Email official result, dataset, and collaboration questions to [lexbench@lexmount.com](mailto:lexbench@lexmount.com).\n- Track upcoming releases in [Milestones](https://github.com/lexmount/browseruse-agent-bench/milestones).\n- Use [Contributing](./CONTRIBUTING.md) when opening pull requests or adding a new agent/benchmark.\n- See [Governance](./GOVERNANCE.md) and [Evaluation Protocol](./EVALUATION_PROTOCOL.md) for result review rules.\n\n## Coming Soon\n\n- 🔐 **Login-state preservation** — first-class support for reusing browser login across eval runs, so login-gated tasks can be benchmarked end-to-end without manual re-login. 
Stay tuned.\n\n## Roadmap / Development Plan\n\nRefer to our [Milestones](https://github.com/lexmount/browseruse-agent-bench/milestones) for upcoming versions and deadlines.\n\n\n## Star History\n\n\n\n\u003ca href=\"https://star-history.com/#lexmount/browseruse-agent-bench\u0026Date\"\u003e\n \u003cpicture\u003e\n   \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=lexmount/browseruse-agent-bench\u0026type=Date\u0026theme=dark\" /\u003e\n   \u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=lexmount/browseruse-agent-bench\u0026type=Date\" /\u003e\n   \u003cimg alt=\"Star History Chart\" src=\"https://api.star-history.com/svg?repos=lexmount/browseruse-agent-bench\u0026type=Date\" /\u003e\n \u003c/picture\u003e\n\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flexmount%2Fbrowseruse-agent-bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flexmount%2Fbrowseruse-agent-bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flexmount%2Fbrowseruse-agent-bench/lists"}