https://github.com/lexmount/browseruse-agent-bench

browseruse agent bench
Host: GitHub
URL: https://github.com/lexmount/browseruse-agent-bench
Owner: lexmount
License: apache-2.0
Created: 2026-05-09T03:49:23.000Z (about 2 months ago)
Default Branch: main
Last Pushed: 2026-05-09T04:02:15.000Z (about 2 months ago)
Last Synced: 2026-05-09T06:15:47.923Z (about 2 months ago)
Topics: agent, benchmark, browseruse
Language: Python
Homepage: https://lexmount.github.io/browseruse-agent-bench/
Size: 8.76 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Agents: AGENTS.md

## Why browseruse-agent-bench

**browseruse-agent-bench** is a reproducible evaluation framework for browser agents.

**LexBench-Browser** is the built-in public dataset used by the default benchmark workflow.

Together they make external results easy to run, compare, cite, and submit back.

| What you can do | Why it matters |
|-----------------|----------------|
| Run **LexBench-Browser: 210 public tasks across 107 real websites** | Test browser agents on long-tail multilingual workflows beyond toy pages |
| Compare **Agent × Model × Browser × Eval** | Separate agent quality from model choice, browser backend, and judge strategy |
| Inspect leaderboard, cost, latency, token usage, and trajectories | Debug failures instead of only reporting a final score |
| Submit agents, dataset tasks, and reproducible results | Turn forks and PRs into visible benchmark contributions |

## Description

**browseruse-agent-bench** is an all-in-one evaluation framework for AI browser agents, designed to benchmark *multiple agents across multiple datasets, browser backends, and models* under controlled and reproducible settings. The Python package is published as **browseruse-bench**, and its CLI is `bubench`. The framework supports both local and cloud browsers, integrates LLM-as-Judge for automated evaluation, and provides a built-in local leaderboard with efficiency and cost metrics such as agent steps, end-to-end latency, and token usage.

**Supported Datasets**

- [x] **LexBench-Browser** — Browser-agent dataset covering e-commerce, social, academic, financial, and other mainstream Chinese/English websites (v1.0, 2026-04-30)

  - `All` (210, no login required)

  - `lexmount` (118, mainland-accessible websites) / `global` (92, international websites)

  - Hugging Face: [Lexmount/LexBench-Browser](https://huggingface.co/datasets/Lexmount/LexBench-Browser)

- [x] **Online-Mind2Web** — Real website interaction tasks

  - `All` (300) / `Hard` (hard subset)

- [x] **BrowseComp** — Browser operation competition tasks, no login required

  - `All` (1266)

- [ ] More benchmarks

> Details: [Benchmarks overview](https://docs.bubench.lexmount.io/en/benchmarks/overview).

**Supported Agents & Browsers**

| Agent | Supported Browsers |
|-------|-------------------|
| [browser-use](https://github.com/browser-use/browser-use) | `Chrome-Local`, `lexmount`, `browser-use-cloud`, `agentbay` |
| [skyvern](https://github.com/Skyvern-AI/Skyvern/) | `local`, `lexmount`, `skyvern-cloud` |
| [Agent-TARS](https://github.com/bytedance/UI-TARS-desktop) | Built-in browser |
| More agents | - |

> Details: [Agents overview](https://docs.bubench.lexmount.io/en/agents/overview).

## News

- **[2026.04.30]** 🎉 **browseruse-agent-bench v1.0** — initial open-source release. The LexBench-Browser dataset v1.0 ships 210 public tasks across 107 distinct websites with a 6-category × 16-tag robustness label system; reference integrations cover browser-use, skyvern, Agent-TARS and deepbrowse.

## Quickstart

**1. Clone the repository**

```bash
git clone https://github.com/lexmount/browseruse-agent-bench.git
cd browseruse-agent-bench
```

**2. Install dependencies (Python >= 3.11)**

The commands below use [uv](https://docs.astral.sh/uv/) (recommended). Follow the section for your agent.

> **Note**: `browser-use` and `skyvern` have conflicting dependencies and cannot be installed together. If you plan to run multiple agents in parallel, refer to the [Environment Isolation](https://docs.bubench.lexmount.io/en/quickstart#running-multiple-agents-in-parallel) section in the documentation.

**browser-use**

```bash
uv sync --extra browser-use
source .venv/bin/activate          # macOS / Linux
.venv\Scripts\Activate.ps1         # Windows PowerShell
```

**skyvern**

```bash
uv sync --extra skyvern
source .venv/bin/activate          # macOS / Linux
.venv\Scripts\Activate.ps1         # Windows PowerShell
```

**Agent-TARS** (requires Node.js 18+)

```bash
uv sync
npm install -g @agent-tars/cli@0.3.0
source .venv/bin/activate          # macOS / Linux
.venv\Scripts\Activate.ps1         # Windows PowerShell
```

> After activation, the `bubench` CLI is available on your PATH. Without activation, prefix every `bubench …` command in the following steps with `uv run` (e.g. `uv run bubench run …`).

**3. Configure**

> **Principle**: `.env` holds sensitive credentials (API keys). `config.example.yaml` → `config.yaml` (git-ignored) holds all agent, model, browser, and eval settings in one place.

**3.1 Shared credentials (`.env`)**

```bash
cp .env.example .env
vim .env
```

| Variable | Description | Sign up | Required |
|----------|-------------|---------|----------|
| `OPENAI_API_KEY` | API key for agents and evaluation | [platform.openai.com](https://platform.openai.com/api-keys) | ✅ |
| `OPENAI_BASE_URL` | Custom API base URL (e.g. LiteLLM proxy) | - | Optional |
| `LEXMOUNT_API_KEY` + `LEXMOUNT_PROJECT_ID` | Lexmount cloud browser | [browser.lexmount.cn](https://browser.lexmount.cn/) | When using lexmount |
| `BROWSER_USE_API_KEY` | Browser Use cloud browser | [browser-use.com](https://www.browser-use.com/) | When using browser-use-cloud |
| `AGENTBAY_API_KEY` | AgentBay cloud browser | [agentbay.ai](https://agentbay.ai/) | When using agentbay |
| `HF_ENDPOINT=https://hf-mirror.com` | HuggingFace mirror (China) | - | Optional |
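
The "Required" column above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of `bubench`: the function name `missing_credentials` is hypothetical, and it assumes the values from `.env` have already been loaded into the process environment.

```python
import os

# Variable names are taken from the table above. The mapping keys mirror the
# browser backend names used in this README; the helper itself is hypothetical.
ALWAYS_REQUIRED = ["OPENAI_API_KEY"]
BACKEND_REQUIRED = {
    "lexmount": ["LEXMOUNT_API_KEY", "LEXMOUNT_PROJECT_ID"],
    "browser-use-cloud": ["BROWSER_USE_API_KEY"],
    "agentbay": ["AGENTBAY_API_KEY"],
}


def missing_credentials(browser_id, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    required = ALWAYS_REQUIRED + BACKEND_REQUIRED.get(browser_id, [])
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Report what is still missing before starting a lexmount run.
    print("missing:", missing_credentials("lexmount"))
```

Backends without an entry in the mapping, such as `Chrome-Local`, only need the shared `OPENAI_API_KEY` in this sketch.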

**3.2 Runtime config (`config.yaml`)**

```bash
cp config.example.yaml config.yaml
vim config.yaml
```

All agents are configured in one file. Key fields under `agents.<agent_name>`:

| Field | Description |
|-------|-------------|
| `active_model` | Which model entry to use (must match a key under `models`) |
| `models.<name>.model_type` | Provider: `BROWSER_USE`, `OPENAI`, `AZURE`, `GEMINI`, `ANTHROPIC` |
| `models.<name>.model_id` | Model ID (e.g. `gpt-4.1`, `qwen3.5-plus`, `kimi-k2.5`) |
| `models.<name>.api_key` | API key for this model (supports `$ENV_VAR` expansion) |
| `models.<name>.base_url` | API base URL (optional, supports `$ENV_VAR` expansion) |
| `browser.browser_id` | Browser backend: `Chrome-Local`, `lexmount`, `browser-use-cloud`, `agentbay`, `cdp` |
| `defaults.*` | Shared agent params: `max_steps`, `timeout`, `use_vision`, etc. |
| `eval.model` + `eval.api_key` + `eval.base_url` | Evaluation model settings |

To switch models, change `active_model` and ensure the matching entry exists under `models`.
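
To make the `active_model` lookup and `$ENV_VAR` expansion concrete, here is a minimal sketch of how such a resolution step could work. It is an illustration under assumptions, not the project's implementation: `resolve_model` is a hypothetical name, the config is shown as an already-parsed dictionary, and the expansion uses Python's `string.Template`.

```python
import os
from string import Template


def resolve_model(agent_cfg, env=None):
    """Pick the entry named by active_model and expand $ENV_VAR references.

    agent_cfg is assumed to be one agent's section of config.yaml,
    already parsed into a plain dict.
    """
    env = dict(os.environ) if env is None else env
    name = agent_cfg["active_model"]
    models = agent_cfg.get("models", {})
    if name not in models:
        raise KeyError(f"active_model {name!r} has no entry under models")
    entry = dict(models[name])
    for field in ("api_key", "base_url"):
        value = entry.get(field)
        if isinstance(value, str):
            # safe_substitute leaves unknown variables untouched instead of raising.
            entry[field] = Template(value).safe_substitute(env)
    return entry


if __name__ == "__main__":
    cfg = {
        "active_model": "gpt",
        "models": {
            "gpt": {"model_type": "OPENAI", "model_id": "gpt-4.1", "api_key": "$OPENAI_API_KEY"},
        },
    }
    print(resolve_model(cfg, {"OPENAI_API_KEY": "sk-example"}))
```

Failing early when `active_model` has no matching entry makes a mistyped model name visible before any task runs.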

**4. Install Skills (Optional)**

```bash
bubench skills
```

Installs the prebuilt developer-friendly skills pack (`browseruse_bench/skills/`) into your agent toolchain.

**5. Run & Evaluate**

**Run**

```bash
bubench run --agent {AGENT} --data {BENCHMARK} --mode first_n --count 3
# Output: experiments/{benchmark}/{split}/{agent}/{model_id}/{timestamp}/

# Example: LexBench-Browser (no login required)
bubench run --agent browser-use --data LexBench-Browser --mode first_n --count 3
# Output: experiments/LexBench-Browser/All/browser-use/gpt-4.1/20260101_120000/
```
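
When post-processing results in your own scripts, it can help to build the same output path programmatically. This is a small sketch with a hypothetical helper, `experiment_dir`; the `%Y%m%d_%H%M%S` timestamp format is inferred from the example path above and is an assumption.

```python
from datetime import datetime
from pathlib import Path


def experiment_dir(benchmark, split, agent, model_id, when=None, root="experiments"):
    """Build experiments/{benchmark}/{split}/{agent}/{model_id}/{timestamp}."""
    when = datetime.now() if when is None else when
    # Timestamp layout assumed from the README example (20260101_120000).
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return Path(root) / benchmark / split / agent / model_id / stamp


if __name__ == "__main__":
    path = experiment_dir(
        "LexBench-Browser", "All", "browser-use", "gpt-4.1",
        when=datetime(2026, 1, 1, 12, 0, 0),
    )
    print(path.as_posix())
```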

**Evaluate**

```bash
bubench eval --agent {AGENT} --data {BENCHMARK} --model-id {MODEL_ID}

# Example
bubench eval --agent browser-use --data LexBench-Browser --model-id gpt-4.1
```

> `--split` is optional: the benchmark's `default_split` (from `data_info.json`) is used automatically. Pass `--split <name>` only to override the default.

> For the full parameter reference, see the [Quickstart docs](https://docs.bubench.lexmount.io/en/quickstart).
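
The split fallback described above can be sketched as follows. This assumes `data_info.json` is a JSON object with a top-level `default_split` key. The helper name `resolve_split` is hypothetical, and the file's actual location and full schema are not specified here.

```python
import json
from pathlib import Path


def resolve_split(data_info_path, split=None):
    """Return the explicit split if given, else default_split from data_info.json."""
    if split:
        return split
    info = json.loads(Path(data_info_path).read_text(encoding="utf-8"))
    # Assumed layout: {"default_split": "All", ...}
    return info["default_split"]
```

An explicit value always wins, so the file is only read when no split is passed.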

## Data Loading

Use `--data-source` to control where benchmark data is loaded from:

| Mode | Description | Example |
|------|-------------|---------|
| `local` (default) | Uses local files under `benchmarks/{benchmark}/data/`, errors if missing | `--data-source local` |
| `huggingface` | Downloads to HF cache (`~/.cache/huggingface`), does not write back to repo | `--data-source huggingface` |
| `huggingface` + `--force-download` | Forces re-download, refreshes HF cache | `--data-source huggingface --force-download` |

> **Speed up in China**: Set `HF_ENDPOINT=https://hf-mirror.com` in `.env`.

> **Private datasets**: Set `HF_TOKEN=hf_your_token_here` in `.env`.

Details: [Data Loading](https://docs.bubench.lexmount.io/en/benchmarks/data-loading).
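
The behaviour of the two modes can be approximated with a short sketch. It is an illustration under assumptions and not the code `bubench` runs: `resolve_data_dir` and its parameters are hypothetical, and the Hugging Face branch assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function.

```python
from pathlib import Path


def resolve_data_dir(benchmark, data_source="local", force_download=False,
                     repo_root=".", hf_repo_id=None):
    """Return a directory holding benchmark data for the chosen source."""
    if data_source == "local":
        data_dir = Path(repo_root) / "benchmarks" / benchmark / "data"
        if not data_dir.is_dir():
            # Mirrors the "errors if missing" behaviour of the local mode.
            raise FileNotFoundError(f"no local data at {data_dir}")
        return data_dir
    if data_source == "huggingface":
        # Imported lazily so the local mode works without huggingface_hub.
        from huggingface_hub import snapshot_download
        cached = snapshot_download(
            repo_id=hf_repo_id,
            repo_type="dataset",
            force_download=force_download,  # re-fetch even if already cached
        )
        return Path(cached)
    raise ValueError(f"unknown data source: {data_source!r}")
```

In this sketch the download lands in the Hugging Face cache and nothing is copied into the repository, matching the table above.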

> 📖 For complete guides, API reference, and more examples, see the [full documentation](https://docs.bubench.lexmount.io/).

## Leaderboard

We provide an interactive local leaderboard to compare agent performance across benchmarks.

Generate leaderboard HTML:

```bash
bubench leaderboard
```

Deploy leaderboard service (temporary process):

```bash
bubench server --host 0.0.0.0 --port 8012 &
```

Deploy leaderboard service (systemd):

```bash
sudo bubench service install
sudo bubench service start
```

See [Leaderboard Documentation](https://docs.bubench.lexmount.io/en/leaderboard/overview) for more details.

**Access URLs (default port `8012`):**

- Local leaderboard: [http://localhost:8012](http://localhost:8012)

- Local API docs: [http://localhost:8012/docs](http://localhost:8012/docs)

- Remote leaderboard: `http://<server-host>:8012/`

- Remote API docs: `http://<server-host>:8012/docs`

## Visualization

The interactive experiment explorer lets you browse agent trajectories, evaluation details, and per-task API logs. It complements the static leaderboard with task-level drill-down.

```bash
# Start server (auto-regenerates index when experiment files change)
bubench viz --watch
# Access at http://localhost:8080
```

**Options:**

```bash
bubench viz --port 8090              # custom port (default: 8080)
bubench viz --generate-only          # regenerate experiments.json and exit
bubench viz --watch-interval 5       # poll interval in seconds (default: 3)
```

For remote sharing with tmux and firewall configuration, see [Visualization Documentation](https://docs.bubench.lexmount.io/en/leaderboard/visualization#remote--intranet-sharing).
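
For readers curious how interval-based watching generally works, the following is a generic polling sketch. It is not taken from the `bubench viz` source: `snapshot`, `watch`, and `max_polls` are hypothetical names, and comparing file modification times is only one possible change-detection strategy.

```python
import time
from pathlib import Path


def snapshot(root):
    """Map every file under root to its modification time in nanoseconds."""
    return {str(p): p.stat().st_mtime_ns for p in Path(root).rglob("*") if p.is_file()}


def watch(root, on_change, interval=3.0, max_polls=None):
    """Poll root every `interval` seconds and call on_change() when files differ."""
    previous = snapshot(root)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        current = snapshot(root)
        if current != previous:  # a file was added, removed, or modified
            on_change()
            previous = current
        polls += 1
```

A longer interval lowers filesystem load at the cost of slower refreshes, which is the trade-off a poll-interval option exposes.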

## Acknowledgements

Some code in this project is adapted from [Online-Mind2Web](https://github.com/OSU-NLP-Group/Online-Mind2Web) and [simple-evals](https://github.com/openai/simple-evals).

## Citation

```bibtex
@misc{lexbench_browser_2026,
    title        = {LexBench-Browser: A Real-World Browser Agent Benchmark with Long-Tail and Multilingual Tasks},
    author       = {Lexmount Research and Collaborators},
    year         = {2026},
    howpublished = {\url{https://lexmount.github.io/browseruse-agent-bench/}},
    note         = {Open benchmark; v1.0 reference release},
}
```

## Contact

Questions, benchmark proposals, agent integrations, and result reproductions are welcome:

- Report bugs or request features in [GitHub Issues](https://github.com/lexmount/browseruse-agent-bench/issues).

- Ask questions and discuss results in [GitHub Discussions](https://github.com/lexmount/browseruse-agent-bench/discussions).

- Email official result, dataset, and collaboration questions to [lexbench@lexmount.com](mailto:lexbench@lexmount.com).

- Track upcoming releases in [Milestones](https://github.com/lexmount/browseruse-agent-bench/milestones).

- Use [Contributing](./CONTRIBUTING.md) when opening pull requests or adding a new agent/benchmark.

- See [Governance](./GOVERNANCE.md) and [Evaluation Protocol](./EVALUATION_PROTOCOL.md) for result review rules.

## Coming Soon

- 🔐 **Login-state preservation** — first-class support for reusing browser login across eval runs, so login-gated tasks can be benchmarked end-to-end without manual re-login. Stay tuned.

## Roadmap / Development Plan

Refer to our [Milestones](https://github.com/lexmount/browseruse-agent-bench/milestones) for upcoming versions and deadlines.
