https://github.com/asappresearch/webagents-step


Host: GitHub
URL: https://github.com/asappresearch/webagents-step
Owner: asappresearch
License: mit
Created: 2023-11-22T13:05:28.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-07-21T15:02:03.000Z (12 months ago)
Last Synced: 2025-04-05T11:07:35.418Z (3 months ago)
Language: Jupyter Notebook
Size: 2.98 MB
Stars: 39
Watchers: 1
Forks: 8
Open Issues: 4
Metadata Files:
- Readme: README.md
- License: LICENSE

# SteP: Stacked LLM Policies for Web Actions

Paper link: [https://arxiv.org/abs/2310.03720](https://arxiv.org/abs/2310.03720)

## Installation

To set up the project, clone the repository and create a virtual environment:

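```bash
# Clone the repository and enter it
git clone https://github.com/asappresearch/webagents-step.git
cd webagents-step
# Create and activate a pyenv virtualenv (based on the currently selected Python version)
pyenv virtualenv webagents-step
pyenv activate webagents-step
```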

Install the required packages:

```bash
pip install -r requirements.txt
```

## WebArena Evaluation

### WebArena Results

We break down the success rates across different websites and provide links to the trajectory logs below, containing the observations, model predictions, and evaluator outputs for each task.

The latest runs use the `gpt-4-turbo-2024-04-09` model and the WebArena code as of its last commit on May 29, 2024:

| Website | Number of tasks | Success rate | Trajectory logs |
|---------|-----------------|--------------|-----------------|
| Gitlab | 180 | 31.7% | [logs](https://drive.google.com/drive/folders/1znkg8aQoEVLTvSyQ8iebb_bsOJL2DrKl?usp=share_link) |
| Reddit | 106 | 59.4% | [logs](https://drive.google.com/drive/folders/1Ek9cMz344tKXbEchakPyPXoTU14FYSlm?usp=share_link) |
| Shopping | 187 | 36.9% | [logs](https://drive.google.com/drive/folders/1ztCP7JH18XS_mGlPCIrP7cKc2eF6Yf8S?usp=share_link) |
| Shopping admin (CMS) | 182 | 24.2% | [logs](https://drive.google.com/drive/folders/1quti9851rBO49alYYL9C1NZNcpRI_Cg-?usp=share_link) |
| Map | 109 | 30.3% | [logs](https://drive.google.com/drive/folders/1V7c122QKNAIVdbskLFNwTJcwILGIf_kS?usp=share_link) |
| Multisite | 48 | 12.5% | [logs](https://drive.google.com/drive/folders/1JmvrY1Ys_bHHY8eQmJocnyZGiPeG7BpV?usp=share_link) |
| All | 812 | 33.5% | [logs](https://drive.google.com/drive/folders/1AKXlClGbFU4RQtfWN9f6jva7MbbGCbur?usp=share_link) |
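As a quick consistency check, the overall 33.5% can be recomputed from the per-website rows by weighting each success rate by its task count. The sketch below is not part of the repository: the `results` dictionary restates the table, and the per-site solved counts are inferred by rounding, which is an assumption since only rounded percentages are reported.

```python
# Per-website (task count, success rate in %) restated from the results table.
results = {
    "Gitlab": (180, 31.7),
    "Reddit": (106, 59.4),
    "Shopping": (187, 36.9),
    "Shopping admin (CMS)": (182, 24.2),
    "Map": (109, 30.3),
    "Multisite": (48, 12.5),
}

total_tasks = sum(n for n, _ in results.values())

# Assumption: solved tasks per site = rate * count, rounded to the nearest
# integer (the table reports only rounded percentages, not raw counts).
solved = {site: round(n * rate / 100) for site, (n, rate) in results.items()}

overall_rate = 100 * sum(solved.values()) / total_tasks

print(f"tasks: {total_tasks}, solved (inferred): {sum(solved.values())}")
print(f"overall success rate: {overall_rate:.1f}%")
```

Under that rounding assumption, the task counts sum to 812 and the recomputed overall rate agrees with the "All" row.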

### Installing WebArena

Install WebArena from the [WebArena GitHub repository](https://github.com/web-arena-x/webarena). This code uses commit `4c741b4b20a3e183836e58f383f9be1785248160` (the last commit as of May 29, 2024).

Generate the test data configs (run this from the WebArena repository):

```bash
python scripts/generate_test_data.py
```

This generates `*.json` files in the `config_files/` folder. Copy them to a `tasks/webarena` directory under the `webagents-step/` root directory.

You will also need to set up authentication for all websites by following the WebArena README (see *Obtain the auto-login cookies for all websites*). This generates a `.auth` folder. Copy it to the `webagents-step/` root directory.
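The two copy steps above can be scripted. The following is a minimal sketch, not part of the repository: the function name `copy_webarena_artifacts` is hypothetical, and it assumes that `config_files/` and `.auth` both sit at the root of the WebArena checkout.

```python
import shutil
from pathlib import Path


def copy_webarena_artifacts(webarena_root: Path, step_root: Path) -> int:
    """Copy generated task configs and the .auth folder into webagents-step.

    Returns the number of task config files copied.
    """
    task_dir = step_root / "tasks" / "webarena"
    task_dir.mkdir(parents=True, exist_ok=True)

    # Task configs produced by scripts/generate_test_data.py in WebArena.
    copied = 0
    for config in sorted((webarena_root / "config_files").glob("*.json")):
        shutil.copy2(config, task_dir / config.name)
        copied += 1

    # Auto-login cookies produced by the WebArena authentication step.
    # Assumption: the folder lives at the WebArena repository root.
    auth_src = webarena_root / ".auth"
    if auth_src.is_dir():
        shutil.copytree(auth_src, step_root / ".auth", dirs_exist_ok=True)

    return copied
```

For example, with the two repositories checked out side by side (an assumed layout), calling `copy_webarena_artifacts(Path("../webarena"), Path("."))` from the `webagents-step/` root would populate `tasks/webarena/` and `.auth/`. The `dirs_exist_ok` argument requires Python 3.8 or later.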

### Running Evaluation

To run WebArena evaluation:

```bash
python scripts/evaluate/eval_webarena.py --config configs/webarena/eval_openai_agent.yml
```

Important:

* Set up each website as a Docker container, as described in the [WebArena instructions](https://github.com/web-arena-x/webarena/blob/main/environment_docker/README.md).
* Reset the website state before running an evaluation. The initial state of a website affects whether a task can succeed.
* Reddit limits accounts to 3 posts per hour, so for Reddit tasks you need to sleep for 21 minutes before every new task. To do this, add `time.sleep(1260)` inside the for loop in `eval_webarena.py`.
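The Reddit pause can be sketched as follows. Three posts per hour works out to one post every 20 minutes, so 21 minutes (1260 seconds) leaves a small margin, assuming each task makes at most one post. The README does not show the actual loop in `eval_webarena.py`, so `run_with_cooldown` and `run_task` below are hypothetical stand-ins that only illustrate where the sleep goes.

```python
import time
from typing import Callable, Dict, Iterable

# 21 minutes, matching the time.sleep(1260) suggested for Reddit tasks.
REDDIT_COOLDOWN_S = 21 * 60


def run_with_cooldown(
    task_files: Iterable[str],
    run_task: Callable[[str], float],
    cooldown_s: float = REDDIT_COOLDOWN_S,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, float]:
    """Run each task after a fixed pause and collect its score."""
    scores = {}
    for task_file in task_files:
        # Pause before every new task so the posting limit is not hit.
        sleep(cooldown_s)
        scores[task_file] = run_task(task_file)
    return scores
```

Passing `sleep` as a parameter is a design choice for this sketch only: it lets the loop be dry-run with a no-op sleep before committing to a full run, which at 21 minutes per task takes roughly 37 hours for the 106 Reddit tasks.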

## MiniWoB++ Evaluation

### Installing MiniWoB++

Install MiniWoB++ from [this repository](https://github.com/Farama-Foundation/miniwob-plusplus). Use commit 43bd1fe.

### Running Evaluation

To run MiniWoB++ evaluation:

```bash
python scripts/evaluate/eval_miniwob.py --config configs/miniwob/eval_openai_agent.yml
```

## Contact

This project is still in active development. For any questions or issues, please contact us at [[email protected]](mailto:[email protected]).
