Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shengyin1224/SafeAgentBench
Code for the paper "SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents"
- Host: GitHub
- URL: https://github.com/shengyin1224/SafeAgentBench
- Owner: shengyin1224
- Created: 2024-09-14T16:02:34.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-27T16:16:00.000Z (2 months ago)
- Last Synced: 2024-10-27T18:51:45.499Z (2 months ago)
- Language: Python
- Size: 1.79 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome_ai_agents - Safeagentbench - Code for the paper "SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents" (Building / Benchmarks)
README
# SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
With the integration of large language models (LLMs), embodied agents can execute complicated natural-language instructions, paving the way for the potential deployment of embodied robots. However, a foreseeable issue is that such agents can also flawlessly execute hazardous tasks, potentially causing damage in the real world. To study this issue, we present **SafeAgentBench**, a new benchmark for safety-aware task planning of embodied LLM agents. SafeAgentBench includes: (1) a new dataset with 750 tasks, covering 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that the best-performing baseline achieves a 69% success rate on safe tasks but only a 5% rejection rate on hazardous tasks, indicating significant safety risks.
For the latest updates, see: [**our website**](https://safeagentbench.github.io)
![](figure/safeagentbench_show.jpg)
## Quickstart
Clone repo:
```bash
$ git clone https://github.com/shengyin1224/SafeAgentBench.git
$ cd SafeAgentBench
```
Install requirements:
```bash
$ pip install -r requirements.txt
```
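SafeAgentEnv runs on top of AI2-THOR, so a quick smoke test after installing is to launch a scene and take one step. A minimal sketch, assuming `ai2thor` is pulled in by `requirements.txt` (the scene name is just an example):

```python
from ai2thor.controller import Controller

# Launch a kitchen scene and take one low-level step to confirm
# that AI2-THOR starts correctly on this machine.
controller = Controller(scene="FloorPlan1")
event = controller.step(action="MoveAhead")
print("AI2-THOR OK:", event.metadata["lastActionSuccess"])
controller.stop()
```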
## More Info
- [**Dataset**](dataset/): Safe detailed tasks (300 samples), unsafe detailed tasks (300 samples), abstract tasks (100 samples), and long-horizon tasks (50 samples).
- [**Evaluators**](evaluator/): Evaluation metrics for each task type, including success rate and rejection rate, among others.
- [**Low-level controller**](low_level_controller/): A low-level controller for SafeAgentEnv that takes in high-level actions and maps them to the low-level actions supported by AI2-THOR for the agent to execute. Both multi-agent and single-agent versions are available; a sketch of the mapping follows this list.
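To make the mapping concrete, here is an illustrative sketch. The `HIGH_TO_LOW` table and `execute` helper are hypothetical names for illustration, not the repository's actual interface; only the `controller.step` actions are standard AI2-THOR calls:

```python
from ai2thor.controller import Controller

# Hypothetical table grounding a few high-level actions into
# the AI2-THOR step calls that realize them.
HIGH_TO_LOW = {
    "pickup": lambda c, obj: c.step(action="PickupObject", objectId=obj),
    "open":   lambda c, obj: c.step(action="OpenObject", objectId=obj),
    "toggle": lambda c, obj: c.step(action="ToggleObjectOn", objectId=obj),
}

def execute(controller, plan):
    """Run a plan of (action, objectId) pairs, stopping on failure."""
    for action, obj in plan:
        event = HIGH_TO_LOW[action](controller, obj)
        if not event.metadata["lastActionSuccess"]:
            print(f"{action}({obj}) failed:", event.metadata["errorMessage"])
            break

controller = Controller(scene="FloorPlan1")
# objectId values come from event.metadata["objects"] in a real run.
execute(controller, [("pickup", "Knife|+01.29|+00.90|-01.50")])
```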
## SOTA Embodied LLM Agents
Because each agent has a different code structure, we cannot provide implementations for all of them. Please refer to these works' papers and code to implement your own agent.
**LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents**
Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, Minsu Jang
Paper, Code

**Building Cooperative Embodied Agents Modularly with Large Language Models**
Hongxin Zhang*, Weihua Du*, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan
Paper, Code

**ProgPrompt: Generating Situated Robot Task Plans using Large Language Models**
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg
Paper, Code

**MLDT: Multi-Level Decomposition for Complex Long-Horizon Robotic Task Planning with Open-Source Large Language Model**
Yike Wu, Jiatao Zhang, Nan Hu, LanLing Tang, Guilin Qi, Jun Shao, Jie Ren, Wei Song
Paper, Code

**PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain**
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang
Paper, Code

**ReAct: Synergizing Reasoning and Acting in Language Models**
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
Paper, Code

**LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models**
Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, Yu Su
Paper, Code

**Multi-agent Planning using Visual Language Models**
Michele Brienza, Francesco Argenziano, Vincenzo Suriani, Domenico D. Bloisi, Daniele Nardi
Paper, Code

## Hardware
The same as AI2-THOR.
## Citation
If you find the dataset or code useful, please cite:
```
TBD
```
## License
MIT License
## Contact
Questions or issues? Contact [email protected].