{"id":37084551,"url":"https://github.com/servicenow/webarena-verified","last_synced_at":"2026-02-07T22:13:30.878Z","repository":{"id":327781851,"uuid":"1108591363","full_name":"ServiceNow/webarena-verified","owner":"ServiceNow","description":"A verified version of the WebArena Benchmark","archived":false,"fork":false,"pushed_at":"2025-12-05T20:52:22.000Z","size":1443,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-12-09T05:37:34.225Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"https://servicenow.github.io/webarena-verified/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ServiceNow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"contributing/tasks.py","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":".github/CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-12-02T16:47:23.000Z","updated_at":"2025-12-06T19:55:09.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/ServiceNow/webarena-verified","commit_stats":null,"previous_names":["servicenow/webarena-verified"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ServiceNow/webarena-verified","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2Fwebarena-verified","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2Fwebarena-verified/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2Fwebarena-verified/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2Fwebarena-verified/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ServiceNow","download_url":"https://codeload.github.com/ServiceNow/webarena-verified/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2Fwebarena-verified/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28416988,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-14T10:18:03.274Z","status":"ssl_error","status_checked_at":"2026-01-14T10:16:11.865Z","response_time":107,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-01-14T10:22:23.186Z","updated_at":"2026-02-02T21:53:45.798Z","avatar_url":"https://github.com/ServiceNow.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WebArena-Verified\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/webarena-verified/\"\u003e\u003cimg src=\"https://img.shields.io/pypi/v/webarena-verified.svg\" alt=\"PyPI version\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://hub.docker.com/r/am1n3e/webarena-verified\"\u003e\u003cimg src=\"https://img.shields.io/docker/pulls/am1n3e/webarena-verified.svg\" alt=\"Docker Hub\"\u003e\u003c/a\u003e\n  \u003ca href=\"pyproject.toml\"\u003e\u003cimg src=\"https://img.shields.io/badge/Python-3.11+-3776AB.svg\" alt=\"Python 3.11+\"\u003e\u003c/a\u003e\n  \u003ca href=\"tests\"\u003e\u003cimg src=\"https://img.shields.io/badge/Tests-Pytest-6B2F8.svg\" alt=\"Tests: Pytest\"\u003e\u003c/a\u003e\n  \u003ca href=\"https://servicenow.github.io/webarena-verified/\"\u003e\u003cimg src=\"https://img.shields.io/badge/Docs-MkDocs-0288D1.svg\" alt=\"Docs: MkDocs\"\u003e\u003c/a\u003e\n\u003c/p\u003e\n\nWebArena-Verified is the verified release of the WebArena benchmark. It distributes a curated, version-controlled dataset of web tasks together with deterministic evaluators that operate on agent responses and captured network traces. The project is designed for reproducible benchmarking of web agents and provides tooling for both single-task debugging and batch evaluation.\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://servicenow.github.io/webarena-verified/\"\u003e📖 Documentation\u003c/a\u003e\n\u003c/p\u003e\n\n## 📢 Announcements\n\n- **February 2, 2026**: Optimized Docker images for all WebArena environments are now available on [Docker Hub](https://hub.docker.com/u/am1n3e)! Images are up to 92% smaller than originals, include auto-login headers, plus a single container for Map (beta) (previously 5 separate containers). See the [Environments documentation](https://servicenow.github.io/webarena-verified/environments/).\n- **February 2, 2026**: WebArena-Verified is now available via Docker and uvx! Run `uvx webarena-verified --help` or `docker run am1n3e/webarena-verified:latest --help` to get started.\n- **January 7, 2026**: WebArena-Verified is now available on PyPI! Install it easily with `pip install webarena-verified`.\n- **December 2, 2025**: We are presenting WebArena-Verified at the [Scaling Environments for Agents (SEA) Workshop](https://sea-workshop.github.io/) at NeurIPS 2025 on December 7th in San Diego. Come see us!\n- **November 12, 2024**: Started initial release with collaborators to gather early feedback, catch any issues, and clarify the documentation. **Public release scheduled for December 4th, 2025.**\n\n## 🎯 Highlights\n\n- **Fully audited benchmark**: Every task, reference answer, and evaluator has been manually reviewed and corrected\n- **Offline evaluation**: Evaluate agent runs without requiring live web environments using network trace replay\n- **Deterministic scoring**: Removed LLM-as-a-judge evaluation and substring matching in favor of type-aware normalization and structural comparison\n- **WebArena-Verified Hard subset**: A difficulty-prioritized 258-task subset for cost-effective evaluation\n\n## 🚀 Quick Start\n\n### Using uvx (Recommended)\n\nThe fastest way to try WebArena-Verified without installing anything:\n\n```bash\nuvx webarena-verified --help\n```\n\nRun evaluation directly:\n\n```bash\nuvx webarena-verified eval-tasks \\\n  --task-ids 108 \\\n  --output-dir examples/agent_logs/demo\n```\n\n### Using Docker\n\nRun evaluation using the Docker image by mounting your output directory:\n\n```bash\ndocker run --rm \\\n  -v /path/to/output:/data \\\n  am1n3e/webarena-verified:latest \\\n  eval-tasks --output-dir /data\n```\n\nYour output directory should contain task subdirectories with `agent_response.json` and `network.har` files:\n```\noutput/\n├── 1/\n│   ├── agent_response.json\n│   └── network.har\n├── 2/\n│   └── ...\n```\n\n### Using pip\n\nInstall from PyPI:\n\n```bash\npip install webarena-verified\n```\n\nVerify the CLI is working:\n\n```bash\nwebarena-verified --help\n```\n\nFor development, clone and install from source:\n\n```bash\ngit clone https://github.com/ServiceNow/webarena-verified.git\ncd webarena-verified\nuv sync\n```\n\n## 🌐 Run WebArena Environments\n\n### Using the CLI (Recommended)\n\nStart and manage WebArena environments using the built-in CLI:\n\n```bash\n# Start a site (waits for services to be ready)\nwebarena-verified env start --site shopping\nwebarena-verified env start --site shopping_admin\nwebarena-verified env start --site reddit\nwebarena-verified env start --site gitlab\n\n# Check status\nwebarena-verified env status --site shopping\n\n# Stop a site\nwebarena-verified env stop --site shopping\n\n# Stop all running sites\nwebarena-verified env stop-all\n```\n\nFor sites requiring data setup (Wikipedia, Map):\n\n```bash\n# Wikipedia - download data first (~100GB)\nwebarena-verified env setup init --site wikipedia --data-dir ./downloads\nwebarena-verified env start --site wikipedia --data-dir ./downloads\n\n# Map - download data first (~60GB)\nwebarena-verified env setup init --site map --data-dir ./downloads\nwebarena-verified env start --site map\n```\n\n### Using Docker Directly\n\nYou can also run environments directly with Docker:\n\n```bash\n# Shopping (Magento)\ndocker run -d --name webarena-verified-shopping -p 7770:80 -p 7771:8877 am1n3e/webarena-verified-shopping\n\n# Shopping Admin\ndocker run -d --name webarena-verified-shopping_admin -p 7780:80 -p 7781:8877 am1n3e/webarena-verified-shopping_admin\n\n# Reddit (Postmill)\ndocker run -d --name webarena-verified-reddit -p 9999:80 -p 9998:8877 am1n3e/webarena-verified-reddit\n\n# GitLab\ndocker run -d --name webarena-verified-gitlab -p 8023:8023 -p 8024:8877 am1n3e/webarena-verified-gitlab\n```\n\nSee the [Environments documentation](https://servicenow.github.io/webarena-verified/environments/) for detailed setup instructions, credentials, and configuration options.\n\n## 🧪 Evaluate A Task\n\nEvaluate a task using the CLI or programmatically:\n\n**CLI:**\n```bash\nwebarena-verified eval-tasks \\\n  --task-ids 108 \\\n  --output-dir examples/agent_logs/demo \\\n  --config examples/configs/config.example.json\n```\n\n**Library:**\n\nStart by creating a `WebArenaVerified` instance with your environment configuration:\n\n```python\nfrom pathlib import Path\nfrom webarena_verified.api import WebArenaVerified\nfrom webarena_verified.types.config import WebArenaVerifiedConfig\n\n# Initialize with configuration\nconfig = WebArenaVerifiedConfig(\n    environments={\n        \"__GITLAB__\": {\n            \"urls\": [\"http://localhost:8012\"],\n            \"credentials\": {\"username\": \"root\", \"password\": \"demopass\"}\n        }\n    }\n)\nwa = WebArenaVerified(config=config)\n\n# Get a single task\ntask = wa.get_task(44)\nprint(f\"Task intent: {task.intent}\")\n```\n\nOnce you have your agent's output, evaluate it against the task definition:\n\n**With Files:**\n```python\n# Evaluate a task with file paths\nresult = wa.evaluate_task(\n    task_id=44,\n    agent_response=Path(\"output/44/agent_response_44.json\"),\n    network_trace=Path(\"output/44/network_44.har\")\n)\n\nprint(f\"Score: {result.score}, Status: {result.status}\")\n```\n\n**With Inline Response:**\n```python\n# Evaluate a task with inline response\nresult = wa.evaluate_task(\n    task_id=44,\n    agent_response={\n        \"task_type\": \"NAVIGATE\",\n        \"status\": \"SUCCESS\",\n        \"retrieved_data\": None\n    },\n    network_trace=Path(\"output/44/network_44.har\")\n)\n\nprint(f\"Score: {result.score}, Status: {result.status}\")\n```\n\nSee the [Quick Start Guide](https://servicenow.github.io/webarena-verified/) for a complete walkthrough using example task logs.\n\n## 📊 Dataset\n\n- WebArena Verified dataset is in `assets/dataset/webarena-verified.json`\n- The original WebArena dataset is in `assets/dataset/test.raw.json` (kept for reference)\n- The WebArena Verified Hard subset task IDs are in `assets/dataset/subsets/webarena-verified-hard.json`\n\nTo export the hard subset's task data:\n\n```bash\nwebarena-verified subset-export --name webarena-verified-hard --output webarena-verified-hard.json\n```\n\nSee the [documentation](https://servicenow.github.io/webarena-verified/) for more info.\n\n## 🤝 Contributing\n\nWe welcome improvements to both the dataset and the evaluation tooling. See the [Contributing Guide](CONTRIBUTING.md) for guidelines, local development tips, and dataset update workflows.\n\n## 📄 Citation\n\nIf you use WebArena-Verified in your research, please cite our paper:\n\n```bibtex\n@inproceedings{\nhattami2025webarena,\ntitle={WebArena Verified: Reliable Evaluation for Web Agents},\nauthor={Amine El hattami and Megh Thakkar and Nicolas Chapados and Christopher Pal},\nbooktitle={Workshop on Scaling Environments for Agents},\nyear={2025},\nurl={https://openreview.net/forum?id=94tlGxmqkN}\n}\n```\n\n## 🙏 Acknowledgements\n\nWe thank [Prof. Shuyan Zhou](https://scholars.duke.edu/person/shuyan.zhou) and [Prof. Graham Neubig](https://miis.cs.cmu.edu/people/222215657/graham-neubig) for their valuable guidance and feedback.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fservicenow%2Fwebarena-verified","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fservicenow%2Fwebarena-verified","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fservicenow%2Fwebarena-verified/lists"}