Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/shengyin1224/SafeAgentBench
Code for the paper "SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents"
- Host: GitHub
- URL: https://github.com/shengyin1224/SafeAgentBench
- Owner: shengyin1224
- Created: 2024-09-14T16:02:34.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2024-10-27T16:16:00.000Z (2 months ago)
- Last Synced: 2024-10-27T18:51:45.499Z (2 months ago)
- Language: Python
- Size: 1.79 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome_ai_agents - Safeagentbench - Code for the paper "SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents" (Building / Benchmarks)
README
# SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents
With the integration of large language models (LLMs), embodied agents can execute complicated natural-language instructions, paving the way for the potential deployment of embodied robots. However, a foreseeable issue is that such agents can also flawlessly execute hazardous tasks, potentially causing damage in the real world. To study this issue, we present **SafeAgentBench**, a new benchmark for safety-aware task planning of embodied LLM agents. SafeAgentBench includes: (1) a new dataset with 750 tasks, covering 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that the best-performing baseline achieves a 69% success rate on safe tasks but only a 5% rejection rate on hazardous tasks, indicating significant safety risks.
For the latest updates, see: [**our website**](https://safeagentbench.github.io)
![](figure/safeagentbench_show.jpg)
## Quickstart
Clone repo:
```bash
$ git clone https://github.com/shengyin1224/SafeAgentBench.git
$ cd SafeAgentBench
```
Install requirements:
```bash
$ pip install -r requirements.txt
```
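SafeAgentEnv runs on top of AI2-THOR, so a quick smoke test after installing is to launch a scene and take one step. A minimal sketch, assuming `ai2thor` is pulled in by `requirements.txt` (the scene name is just an example):

```python
from ai2thor.controller import Controller

# Launch a kitchen scene and take one low-level step to confirm
# that AI2-THOR starts correctly on this machine.
controller = Controller(scene="FloorPlan1")
event = controller.step(action="MoveAhead")
print("AI2-THOR OK:", event.metadata["lastActionSuccess"])
controller.stop()
```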
## More Info
- [**Dataset**](dataset/): Safe detailed tasks (300 samples), unsafe detailed tasks (300 samples), abstract tasks (100 samples), and long-horizon tasks (50 samples).
- [**Evaluators**](evaluator/): Evaluation metrics for each task type, including success rate and rejection rate, among others.
- [**Low-level controller**](low_level_controller/): A low-level controller for SafeAgentEnv that takes in high-level actions and maps them to the low-level actions supported by AI2-THOR for the agent to execute. Both multi-agent and single-agent versions are available; a sketch of the mapping follows this list.
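To make the mapping concrete, here is an illustrative sketch. The `HIGH_TO_LOW` table and `execute` helper are hypothetical names for illustration, not the repository's actual interface; only the `controller.step` actions are standard AI2-THOR calls:

```python
from ai2thor.controller import Controller

# Hypothetical table grounding a few high-level actions into
# the AI2-THOR step calls that realize them.
HIGH_TO_LOW = {
    "pickup": lambda c, obj: c.step(action="PickupObject", objectId=obj),
    "open":   lambda c, obj: c.step(action="OpenObject", objectId=obj),
    "toggle": lambda c, obj: c.step(action="ToggleObjectOn", objectId=obj),
}

def execute(controller, plan):
    """Run a plan of (action, objectId) pairs, stopping on failure."""
    for action, obj in plan:
        event = HIGH_TO_LOW[action](controller, obj)
        if not event.metadata["lastActionSuccess"]:
            print(f"{action}({obj}) failed:", event.metadata["errorMessage"])
            break

controller = Controller(scene="FloorPlan1")
# objectId values come from event.metadata["objects"] in a real run.
execute(controller, [("pickup", "Knife|+01.29|+00.90|-01.50")])
```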
## SOTA Embodied LLM Agents
Because each agent has a different code structure, we cannot provide implementations for all of them. Please refer to these works' papers and code to implement your own agent.
**LoTa-Bench: Benchmarking Language-oriented Task Planners for Embodied Agents**
Jae-Woo Choi, Youngwoo Yoon, Hyobin Ong, Jaehong Kim, Minsu Jang
Paper, Code

**Building Cooperative Embodied Agents Modularly with Large Language Models**
Hongxin Zhang*, Weihua Du*, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, Chuang Gan
Paper, Code

**ProgPrompt: Generating Situated Robot Task Plans using Large Language Models**
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg
Paper, Code

**MLDT: Multi-Level Decomposition for Complex Long-Horizon Robotic Task Planning with Open-Source Large Language Model**
Yike Wu, Jiatao Zhang, Nan Hu, LanLing Tang, Guilin Qi, Jun Shao, Jie Ren, Wei Song
Paper, Code

**PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain**
Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang
Paper, Code

**ReAct: Synergizing Reasoning and Acting in Language Models**
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, Yuan Cao
Paper, Code

**LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models**
Chan Hee Song, Jiaman Wu, Clayton Washington, Brian M. Sadler, Wei-Lun Chao, Yu Su
Paper, Code

**Multi-agent Planning using Visual Language Models**
Michele Brienza, Francesco Argenziano, Vincenzo Suriani, Domenico D. Bloisi, Daniele Nardi
Paper, Code

## Hardware
The same as AI2-THOR.
## Citation
If you find the dataset or code useful, please cite:
```
TBD
```
## License
MIT License
## Contact
Questions or issues? Contact [email protected].