{"id":23587652,"url":"https://github.com/ServiceNow/AgentLab","last_synced_at":"2025-08-30T04:31:15.803Z","repository":{"id":245112677,"uuid":"803954003","full_name":"ServiceNow/AgentLab","owner":"ServiceNow","description":"AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.","archived":false,"fork":false,"pushed_at":"2024-12-18T16:56:55.000Z","size":2447,"stargazers_count":156,"open_issues_count":16,"forks_count":32,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-12-18T17:39:28.394Z","etag":null,"topics":["agents","benchmark","evaluation-framework","llm","llm-agents","prompting","web-agents"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ServiceNow.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-21T17:17:20.000Z","updated_at":"2024-12-18T11:46:29.000Z","dependencies_parsed_at":"2024-06-19T22:28:03.737Z","dependency_job_id":"16c26294-f4ef-41b8-8c4c-07b421d3ada5","html_url":"https://github.com/ServiceNow/AgentLab","commit_stats":null,"previous_names":["servicenow/agentlab"],"tags_count":19,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FAgentLab","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FAgentLab/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FAgentLab/releases","manifests_u
rl":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ServiceNow%2FAgentLab/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ServiceNow","download_url":"https://codeload.github.com/ServiceNow/AgentLab/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":231438987,"owners_count":18376834,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","benchmark","evaluation-framework","llm","llm-agents","prompting","web-agents"],"created_at":"2024-12-27T05:01:25.987Z","updated_at":"2025-08-30T04:31:15.786Z","avatar_url":"https://github.com/ServiceNow.png","language":"Python","readme":"\n\u003cdiv align=\"center\"\u003e\n    \n\n\n[![pypi](https://badge.fury.io/py/agentlab.svg)](https://pypi.org/project/agentlab/)\n[![PyPI - License](https://img.shields.io/pypi/l/agentlab?style=flat-square)](http://www.apache.org/licenses/LICENSE-2.0)\n[![PyPI - Downloads](https://img.shields.io/pypi/dm/agentlab?style=flat-square)](https://pypistats.org/packages/agentlab)\n[![GitHub star chart](https://img.shields.io/github/stars/ServiceNow/AgentLab?style=flat-square)](https://star-history.com/#ServiceNow/AgentLab)\n[![Code Format](https://github.com/ServiceNow/AgentLab/actions/workflows/code_format.yml/badge.svg)](https://github.com/ServiceNow/AgentLab/actions/workflows/code_format.yml)\n[![Tests](https://github.com/ServiceNow/AgentLab/actions/workflows/unit_tests.yml/badge.svg)](https://github.com/ServiceNow/AgentLab/actions/workflows/unit_tests.yml)\n\n\n\n[🛠️ 
Setup](#%EF%B8%8F-setup-agentlab) \u0026nbsp;|\u0026nbsp; \n[🤖 Assistant](#-ui-assistant) \u0026nbsp;|\u0026nbsp; \n[🚀 Launch Experiments](#-launch-experiments) \u0026nbsp;|\u0026nbsp;\n[🔍 Analyse Results](#-analyse-results) \u0026nbsp;|\u0026nbsp;\n\u003cbr\u003e\n[🏆 Leaderboard](#-leaderboard) \u0026nbsp;|\u0026nbsp; \n[🤖 Build Your Agent](#-implement-a-new-agent) \u0026nbsp;|\u0026nbsp;\n[↻ Reproducibility](#-reproducibility) \u0026nbsp;|\u0026nbsp;\n[💪 BrowserGym](https://github.com/ServiceNow/BrowserGym)\n\n\n\u003cimg src=\"https://github.com/user-attachments/assets/47a7c425-9763-46e5-be54-adac363be850\" alt=\"agentlab-diagram\" width=\"700\"/\u003e\n\n\n[Demo solving tasks:](https://github.com/ServiceNow/BrowserGym/assets/26232819/e0bfc788-cc8e-44f1-b8c3-0d1114108b85)\n\n\n\u003c/div\u003e\n\n\u003e [!WARNING]\n\u003e AgentLab is meant to provide an open, easy-to-use and extensible framework to accelerate the field of web agent research.\n\u003e It is not meant to be a consumer product. Use with caution!\n\nAgentLab is a framework for developing and evaluating agents on a variety of\n[benchmarks](#-supported-benchmarks) supported by\n[BrowserGym](https://github.com/ServiceNow/BrowserGym). 
It is presented in more detail in our [BrowserGym ecosystem paper](https://arxiv.org/abs/2412.05467)\n\nAgentLab Features:\n* Easy large-scale parallel [agent experiments](#-launch-experiments) using [ray](https://www.ray.io/)\n* Building blocks for making agents over BrowserGym\n* Unified LLM API for OpenRouter, OpenAI, Azure, or self-hosted using TGI.\n* The preferred way to run benchmarks like WebArena\n* Various [reproducibility features](#reproducibility-features)\n* Unified [LeaderBoard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard)\n\n## 🎯 Supported Benchmarks\n\n| Benchmark | Setup  \u003cbr\u003e Link | # Task \u003cbr\u003e Template | Seed  \u003cbr\u003e Diversity | Max  \u003cbr\u003e Step | Multi-tab | Hosted Method | BrowserGym \u003cbr\u003e Leaderboard |\n|-----------|------------|---------|----------------|-----------|-----------|---------------|----------------------|\n| [WebArena](https://webarena.dev/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/webarena/README.md) | 812 | None | 30 | yes | self hosted (docker) | soon |\n| [WorkArena](https://github.com/ServiceNow/WorkArena) L1 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 33 | High | 30 | no | demo instance | soon |\n| [WorkArena](https://github.com/ServiceNow/WorkArena) L2 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |\n| [WorkArena](https://github.com/ServiceNow/WorkArena) L3 | [setup](https://github.com/ServiceNow/WorkArena?tab=readme-ov-file#getting-started) | 341 | High | 50 | no | demo instance | soon |\n| [WebLinx](https://mcgill-nlp.github.io/weblinx/) | - | 31586 | None | 1 | no | self hosted (dataset) | soon |\n| [VisualWebArena](https://github.com/web-arena-x/visualwebarena) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/visualwebarena/README.md) | 910 | None | 30 | yes | self hosted 
(docker) | soon |\n| [AssistantBench](https://assistantbench.github.io/) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/assistantbench/README.md) | 214 | None | 30 | yes | live web | soon |\n| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard) (soon) | - | - | None | - | - | live web | soon |\n| [Mind2Web-live](https://huggingface.co/datasets/iMeanAI/Mind2Web-Live) (soon) | - | - | None | - | - | live web | soon |\n| [MiniWoB](https://miniwob.farama.org/index.html) | [setup](https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/miniwob/README.md) | 125 | Medium | 10 | no | self hosted (static files) | soon |\n| [OSWorld](https://os-world.github.io/) | [setup](https://github.com/ServiceNow/AgentLab/blob/main/src/agentlab/benchmarks/osworld.md) | 369 | None | - | - | self hosted  | soon |\n\n\n## 🛠️ Setup AgentLab\n\nAgentLab requires python 3.11 or 3.12.\n\n```bash\npip install agentlab\n```\n\nIf not done already, install Playwright:\n```bash\nplaywright install\n```\n\nMake sure to prepare the required benchmark according to the instructions provided in the [setup\ncolumn](#-supported-benchmarks).\n\n```bash\nexport AGENTLAB_EXP_ROOT=\u003croot directory of experiment results\u003e  # defaults to $HOME/agentlab_results\nexport OPENAI_API_KEY=\u003cyour openai api key\u003e # if openai models are used\n```\n\n\u003cdetails\u003e\n\u003csummary\u003eSetup OpenRouter API\u003c/summary\u003e\n\n```bash\nexport OPENROUTER_API_KEY=\u003cyour openrouter api key\u003e # if openrouter models are used\n```\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eSetup Azure API\u003c/summary\u003e\n\n```bash\nexport AZURE_OPENAI_API_KEY=\u003cyour azure api key\u003e # if using azure models\nexport AZURE_OPENAI_ENDPOINT=\u003cyour endpoint\u003e # if using azure models\n```\n\u003c/details\u003e\n\n## 🤖 UI-Assistant \n\nUse an assistant to work for you (at your own cost and risk).\n\n```bash\nagentlab-assistant 
--start_url https://www.google.com\n```\n\nTry your own agent: \n\n```bash\nagentlab-assistant --agent_config=\"module.path.to.your.AgentArgs\"\n```\n\n## 🚀 Launch experiments\n\n```python\n# Import your agent configuration extending bgym.AgentArgs class\n# Make sure this object is imported from a module accessible in PYTHONPATH to properly unpickle\nfrom agentlab.agents.generic_agent import AGENT_4o_MINI \n\nfrom agentlab.experiments.study import make_study\n\nstudy = make_study(\n    benchmark=\"miniwob\",  # or \"webarena\", \"workarena_l1\" ...\n    agent_args=[AGENT_4o_MINI],\n    comment=\"My first study\",\n)\n\nstudy.run(n_jobs=5)\n```\n\nRelaunching incomplete or errored tasks\n\n```python\nfrom agentlab.experiments.study import Study\nstudy = Study.load(\"/path/to/your/study/dir\")\nstudy.find_incomplete(include_errors=True)\nstudy.run()\n```\n\nSee [main.py](main.py) to launch experiments with a variety of options. This is like a lazy CLI that\nis actually more convenient. Just comment and uncomment the lines you need or modify at will (but\ndon't push to the repo).\n\n\n### Job Timeouts\n\nThe complexity of the wild web, Playwright, and asyncio can sometimes cause jobs to hang. This\ndisables workers until the study is terminated and relaunched. If you are running jobs sequentially\nor with a small number of workers, this could halt your entire study until you manually kill and\nrelaunch it. In the Ray parallel backend, we've implemented a system to automatically terminate jobs\nexceeding a specified timeout. This feature is particularly useful when task hanging limits your\nexperiments. \n\n### Debugging\n\nFor debugging, run experiments with `n_jobs=1` and use VSCode's debug mode. This allows you to pause\nexecution at breakpoints.\n\n### About Parallel Jobs\n\nRunning one agent on one task corresponds to a single job. Conducting ablation studies or random\nsearches across hundreds of tasks with multiple seeds can generate more than 10,000 jobs. 
Efficient\nparallel execution is therefore critical. Agents typically wait for responses from the LLM server or\nupdates from the web server. As a result, you can run 10–50 jobs in parallel on a single computer,\ndepending on available RAM.\n\n⚠️ **Note for (Visual)WebArena**: These benchmarks have task dependencies designed to minimize\n\"corrupting\" the instance between tasks. For example, an agent on task 323 could alter the instance\nstate, making task 201 impossible. To address this, the Ray backend accounts for task dependencies,\nenabling some degree of parallelism. On WebArena, you can disable dependencies to increase\nparallelism, but this might reduce performance by 1–2%.\n\n⚠️ **Instance Reset for (Visual)WebArena**: Before evaluating an agent, the instance is\nautomatically reset, a process that takes about 5 minutes. When evaluating multiple agents, the\n`make_study` function returns a `SequentialStudies` object to ensure proper sequential evaluation of\neach agent. AgentLab currently does not support evaluations across multiple instances, but you could\neither create a quick script to handle this or submit a PR to AgentLab. For a smoother parallel\nexperience, consider using benchmarks like WorkArena instead.\n\n## 🔍 Analyse Results\n\n### Loading Results\n\nThe class [`ExpResult`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L595) provides a lazy loader for all the information of a specific experiment. You can use [`yield_all_exp_results`](https://github.com/ServiceNow/BrowserGym/blob/da26a5849d99d9a3169d7b1fde79f909c55c9ba7/browsergym/experiments/src/browsergym/experiments/loop.py#L872) to recursively find all results in a directory. 
Finally, [`load_result_df`](https://github.com/ServiceNow/AgentLab/blob/be1998c5fad5bda47ba50497ec3899aae03e85ec/src/agentlab/analyze/inspect_results.py#L119C5-L119C19) gathers all the summary information in a single dataframe. See [`inspect_results.ipynb`](src/agentlab/analyze/inspect_results.ipynb) for example usage.\n\n```python\nimport bgym\n\nfrom agentlab.analyze import inspect_results\n\n# load the summary of all experiments of the study in a dataframe\nresult_df = inspect_results.load_result_df(\"path/to/your/study\")\n\n# load the detailed results of the first experiment\nexp_result = bgym.ExpResult(result_df[\"exp_dir\"][0])\nstep_0_screenshot = exp_result.screenshots[0]\nstep_0_action = exp_result.steps_info[0].action\n```\n\n\n### AgentXray\n\nhttps://github.com/user-attachments/assets/06c4dac0-b78f-45b7-9405-003da4af6b37\n\nIn a terminal, execute:\n```bash\nagentlab-xray\n```\n\nYou can load previous or ongoing experiments in the directory `AGENTLAB_EXP_ROOT` and visualize\nthe results in a Gradio interface.\n\nIn the following order, select:\n* The experiment you want to visualize\n* The agent, if there is more than one\n* The task\n* The seed\n\nOnce these are selected, you can see the trace of your agent on the given task. Click on the profiling\nimage to select a step and observe the action taken by the agent.\n\n\n**⚠️ Note**: Gradio is still under active development, and unexpected behavior has frequently been observed. Version 5.5 seems to work properly so far. If you're not sure that the correct information is displayed, refresh the page and select your experiment again.\n\n\n## 🏆 Leaderboard\n\nThe official unified [leaderboard](https://huggingface.co/spaces/ServiceNow/browsergym-leaderboard) covers all benchmarks.\n\nExperiments are underway to add more reference points using GenericAgent. 
We are also working on code to automatically push a study to the leaderboard.\n\n## 🤖 Implement a new Agent\n\nGet inspiration from the `MostBasicAgent` in\n[agentlab/agents/most_basic_agent/most_basic_agent.py](src/agentlab/agents/most_basic_agent/most_basic_agent.py).\nFor better integration with the tools, make sure to implement most functions in the\n[AgentArgs](src/agentlab/agents/agent_args.py#L5) API and the extended `bgym.AbstractAgentArgs`.\n\nIf you think your agent should be included directly in AgentLab, let us know, and it can be added in\nagentlab/agents/ with the name of your agent.\n\n## ↻ Reproducibility\nSeveral factors can influence the reproducibility of results when evaluating agents on\ndynamic benchmarks.\n\n### Factors affecting reproducibility\n* **Software version**: Different versions of Playwright or any package in the software stack could\n  influence the behavior of the benchmark or the agent.\n* **API-based LLMs silently changing**: Even for a fixed version, an LLM may be updated, e.g., to\n  incorporate the latest web knowledge.\n* **Live websites**:\n  * WorkArena: The demo instance is mostly fixed in time to a specific version, but ServiceNow\n    sometimes pushes minor modifications.\n  * AssistantBench and GAIA: These rely on the agent navigating the open web. The experience may\n    change depending on your country or region; some websites might be in different languages by\n    default.\n* **Stochastic Agents**: Setting the temperature of the LLM to 0 can reduce most stochasticity.\n* **Non-deterministic tasks**: For a fixed seed, the changes should be minimal.\n\n### Reproducibility Features\n* `Study` contains a dict of information about reproducibility, including benchmark version, package\n  version, and commit hash.\n* The `Study` class allows automatic upload of your results to\n  [`reproducibility_journal.csv`](reproducibility_journal.csv). This makes it easier to populate a\n  large number of reference points. 
For this feature, you need to `git clone` the repository and install via `pip install -e .`.\n* **Reproduced results in the leaderboard**. For agents that are reproducible, we encourage users\n  to try to reproduce the results and upload them to the leaderboard. There is a special column\n  containing information about all reproduced results of an agent on a benchmark.\n* **ReproducibilityAgent**: [You can run this agent](src/agentlab/agents/generic_agent/reproducibility_agent.py) on an existing study, and it will try to re-run\n  the same actions on the same task seeds. A visual diff of the two prompts will be displayed in the\n  AgentInfo HTML tab of AgentXray. For some tasks, you will be able to inspect what kind of changes\n  occurred between the two executions. **Note**: this is a beta feature and will need some adaptation for your\n  own agent.\n\n## Variables\nHere's a list of relevant environment variables used by AgentLab:\n- `OPENAI_API_KEY`, which is used by default for OpenAI LLMs.\n- `AZURE_OPENAI_API_KEY`, used by default for AzureOpenAI LLMs.\n- `AZURE_OPENAI_ENDPOINT` to specify your Azure endpoint.\n- `OPENAI_API_VERSION` for the Azure API.\n- `OPENROUTER_API_KEY` for the OpenRouter API.\n- `AGENTLAB_EXP_ROOT`, the desired path for your experiments to be stored; defaults to `~/agentlab-results`.\n- `AGENTXRAY_SHARE_GRADIO`, which prompts AgentXRay to open a public tunnel on launch.\n\n## Misc\n\nIf you want to download HF models more quickly:\n```bash\npip install hf-transfer\npip install torch\nexport HF_HUB_ENABLE_HF_TRANSFER=1\n```\n\n\n## 📝 Citing This Work\n\nPlease use the following two BibTeX entries if you wish to cite AgentLab:\n\n```tex\n@article{chezelles2025browsergym,\n    title={The BrowserGym Ecosystem for Web Agent Research},\n    author={Thibault Le Sellier de Chezelles and Maxime Gasse and Alexandre Lacoste and Massimo Caccia and Alexandre Drouin and L{\\'e}o Boisvert and Megh Thakkar and Tom Marty and Rim Assouel and Sahar Omidi Shayegan and 
Lawrence Keunho Jang and Xing Han L{\\`u} and Ori Yoran and Dehan Kong and Frank F. Xu and Siva Reddy and Graham Neubig and Quentin Cappart and Russ Salakhutdinov and Nicolas Chapados},\n    journal={Transactions on Machine Learning Research},\n    issn={2835-8856},\n    year={2025},\n    url={https://openreview.net/forum?id=5298fKGmv3},\n    note={Expert Certification}\n}\n\n@inproceedings{workarena2024,\n    title = {{W}ork{A}rena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?},\n    author = {Drouin, Alexandre and Gasse, Maxime and Caccia, Massimo and Laradji, Issam H. and Del Verme, Manuel and Marty, Tom and Vazquez, David and Chapados, Nicolas and Lacoste, Alexandre},\n    booktitle = {Proceedings of the 41st International Conference on Machine Learning},\n    pages = {11642--11662},\n    year = {2024},\n    editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},\n    volume = {235},\n    series = {Proceedings of Machine Learning Research},\n    month = {21--27 Jul},\n    publisher = {PMLR},\n    url = {https://proceedings.mlr.press/v235/drouin24a.html},\n}\n```\n\nHere is an example of how they can be used:\n\n```tex\nWe use the AgentLab framework to run and manage our experiments \\cite{workarena2024,chezelles2025browsergym}.\n```\n","funding_links":[],"categories":["Agent Observability","A01_文本生成_文本对话","Building","Agent Harnessing and Evaluation","Benchmark/Evaluator"],"sub_categories":["大语言对话模型及数据","Benchmarks","Benchmark Reality Check (real-world tool use)","Tools"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FServiceNow%2FAgentLab","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FServiceNow%2FAgentLab","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FServiceNow%2FAgentLab/lists"}