https://github.com/asappresearch/webagents-step
# SteP: Stacked LLM Policies for Web Actions
Paper link: [https://arxiv.org/abs/2310.03720](https://arxiv.org/abs/2310.03720)
## Installation
To set up the project, clone the repository and create a virtual environment:
```bash
git clone https://github.com/asappresearch/webagents-step.git
cd webagents-step
pyenv virtualenv webagents-step
pyenv activate webagents-step
```
Install the required packages:
```bash
pip install -r requirements.txt
```
## WebArena Evaluation
### WebArena Results
The table below breaks down success rates by website and links to the trajectory logs, which contain the observations, model predictions, and evaluator outputs for each task.
These are the latest runs, using the `gpt-4-turbo-2024-04-09` model and the WebArena code as of its last commit on May 29, 2024.
| Website | Number of tasks | Success Rate | Trajectory Logs |
|---------|--------------------|--------------|------------------|
| Gitlab | 180 | 31.7% | [logs](https://drive.google.com/drive/folders/1znkg8aQoEVLTvSyQ8iebb_bsOJL2DrKl?usp=share_link) |
| Reddit | 106 | 59.4% | [logs](https://drive.google.com/drive/folders/1Ek9cMz344tKXbEchakPyPXoTU14FYSlm?usp=share_link) |
| Shopping | 187 | 36.9% | [logs](https://drive.google.com/drive/folders/1ztCP7JH18XS_mGlPCIrP7cKc2eF6Yf8S?usp=share_link) |
| Shopping admin (CMS) | 182 | 24.2% | [logs](https://drive.google.com/drive/folders/1quti9851rBO49alYYL9C1NZNcpRI_Cg-?usp=share_link) |
| Map | 109 | 30.3% | [logs](https://drive.google.com/drive/folders/1V7c122QKNAIVdbskLFNwTJcwILGIf_kS?usp=share_link) |
| Multisite | 48 | 12.5% | [logs](https://drive.google.com/drive/folders/1JmvrY1Ys_bHHY8eQmJocnyZGiPeG7BpV?usp=share_link) |
| All | 812 | 33.5% | [logs](https://drive.google.com/drive/folders/1AKXlClGbFU4RQtfWN9f6jva7MbbGCbur?usp=share_link) |

### Installing WebArena
Install WebArena from the [WebArena GitHub repository](https://github.com/web-arena-x/webarena). This code uses WebArena at commit `4c741b4b20a3e183836e58f383f9be1785248160` (May 29, 2024).

Generate test data configs:
```bash
python scripts/generate_test_data.py
```
You will see `*.json` files generated in the `config_files/` folder. Copy these to a `tasks/webarena` directory in the `webagents-step/` root directory.

You will also need to set up authentication for all websites, following the instructions in the WebArena README (see *Obtain the auto-login cookies for all websites*). This generates a `.auth` folder. Copy it to the `webagents-step/` root directory.
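
The two copy steps above can also be scripted. The following is a minimal sketch, not part of either repository: the function name `copy_webarena_artifacts` and both path arguments are assumptions made for illustration, and it only relies on the `config_files/` and `.auth` folder names mentioned above.

```python
from pathlib import Path
import shutil


def copy_webarena_artifacts(webarena_root: Path, step_root: Path) -> int:
    """Copy generated task configs and the .auth folder into webagents-step.

    Hypothetical helper: both roots are supplied by the caller.
    Returns the number of task config files copied.
    """
    config_dir = webarena_root / "config_files"
    task_dir = step_root / "tasks" / "webarena"
    task_dir.mkdir(parents=True, exist_ok=True)

    copied = 0
    for config in sorted(config_dir.glob("*.json")):
        # copy2 keeps file metadata such as modification times
        shutil.copy2(config, task_dir / config.name)
        copied += 1

    auth_dir = webarena_root / ".auth"
    if auth_dir.is_dir():
        # dirs_exist_ok (Python 3.8+) allows re-running after refreshing cookies
        shutil.copytree(auth_dir, step_root / ".auth", dirs_exist_ok=True)
    return copied
```

For example, if the WebArena checkout sits next to this repository, calling `copy_webarena_artifacts(Path("../webarena"), Path("."))` from the `webagents-step/` root would perform both copies. Adjust the paths to your own layout.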
### Running Evaluation
To run WebArena evaluation:
```bash
python scripts/evaluate/eval_webarena.py --config configs/webarena/eval_openai_agent.yml
```
Important:
* Set up each website as a Docker container, as described in the [WebArena instructions](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md).
* Reset the website state before running an evaluation. This matters since the initial state of the website affects the success of the task.
* For Reddit tasks, there is a rate limit of 3 posts per hour. You need to add a sleep of 21 minutes (1260 seconds) before every new task. This can be done by adding `time.sleep(1260)` inside the for loop in `eval_webarena.py`.

## MiniWoB++ Evaluation
### Installing MiniWoB++
Install MiniWoB++ from [this repository](https://github.com/Farama-Foundation/miniwob-plusplus). Use commit `43bd1fe`.

### Running Evaluation
To run MiniWoB++ evaluation:
```bash
python scripts/evaluate/eval_miniwob.py --config configs/miniwob/eval_openai_agent.yml
```
## Contact
This project is still in active development. For any questions or issues, please contact us at [[email protected]](mailto:[email protected]).