## Multi-agent Collaboration Scenario Benchmarking

This repository contains benchmarking material from the AWS Bedrock Agents multi-agent collaboration technical report, "Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications". The technical report is available on arXiv: https://arxiv.org/abs/2412.05449.

### Data

Benchmarking data is in the `datasets` directory, which contains 30 hypothetical scenarios for three domains: travel planning, mortgage financing, and software development.

Each entry in the scenarios file contains the following fields (a sketch of one entry follows the list):
- `scenario`: The user's background and goals.
- `input_problem`: A description of the problem to be solved by the agent.
- `assertions`: A list of assertions that must hold when judging the interaction between the user and the agent.
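
For illustration, a single entry might look like the sketch below; the values are placeholders, not taken from the dataset:

```
{
  "scenario": "Background and goals of the simulated user.",
  "input_problem": "The problem the user brings to the agents.",
  "assertions": [
    "An assertion the judge checks against the conversation.",
    "..."
  ]
}
```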

In each dataset, there is also an `agents.json` file that contains each agent's name and description, as well as its corresponding tools. The scenarios were collected based on these agent profiles and tool schemas.
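
As a rough sketch of what such a profile could look like (the exact keys here are an assumption based on the description above; the files in `datasets` are authoritative):

```
{
  "agents": [
    {
      "name": "flight_booking_agent",
      "description": "Searches for and books flights on the user's behalf.",
      "tools": [
        {
          "name": "search_flights",
          "description": "Search flights by origin, destination, and date."
        }
      ]
    }
  ]
}
```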

### Pre-requisites

Create a Python 3.12 virtual environment and install the dependencies in `requirements.txt`, for example:
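
```
# One possible setup, assuming Python 3.12 is available as `python3.12`
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```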

Next, prepare the conversations that you want to benchmark. Each conversation should be in its own JSON file titled `conversation_0.json`, `conversation_1.json`, etc., where the index corresponds to the scenario index. Each `conversation_{i}.json` file should be formatted as follows:

```
{
  "trajectories": {
    "agent_id_1": [
      {
        "role": null,         # null, User, Action, Observation
        "source": "",         # agent_id of the agent who sent this message
        "destination": "",    # agent_id of the user who received this message
        "content": "",        # content of the message
        "actions": [],        # list of action objects executed by the agent
        "observation": null   # observation of the agent
      }
    ],
    "agent_id_2": [...],
    ...
  }
}
```

See the `sample_conversations` directory for examples.
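
As a quick sanity check, a minimal Python sketch along these lines can load one conversation file and list its per-agent trajectories (the path is illustrative; point it at your own files):

```
import json
from pathlib import Path

# Illustrative path: adjust to wherever your conversation_{i}.json files live.
path = Path("sample_conversations") / "conversation_0.json"

with path.open() as f:
    conversation = json.load(f)

# Each key under "trajectories" is an agent_id mapped to that agent's message list.
for agent_id, messages in conversation["trajectories"].items():
    print(f"{agent_id}: {len(messages)} messages")
```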

### How to use

First, export any environment variables needed for your LLM provider (Bedrock, OpenAI, Anthropic, etc.) so that the LLM judge can run. See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for provider-specific setup.
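
For example (variable names follow LiteLLM's conventions; set only the ones your provider requires):

```
# Bedrock via LiteLLM
export AWS_ACCESS_KEY_ID=<your access key>
export AWS_SECRET_ACCESS_KEY=<your secret key>
export AWS_REGION_NAME=us-east-1

# or OpenAI / Anthropic
export OPENAI_API_KEY=<your key>
export ANTHROPIC_API_KEY=<your key>
```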

Run the benchmarking script on a sample travel conversation:

```
{export env variables}

python -m src.benchmark
```

Customize the benchmarking parameters as needed:
```
python -m src.benchmark \
    --dataset_dir <path to the dataset directory> \
    --scenario_filename <name of the scenario JSON file> \
    --conversations_dir <path to the conversation JSON files> \
    --llm_judge_id <model ID of the LLM judge>
```
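
For instance, a run against the travel-planning scenarios might look like the following; the paths and model ID are placeholders rather than values shipped with the repository:

```
# All argument values below are illustrative.
python -m src.benchmark \
    --dataset_dir datasets/travel \
    --scenario_filename scenarios.json \
    --conversations_dir sample_conversations \
    --llm_judge_id anthropic.claude-3-5-sonnet-20240620-v1:0
```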

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

This library is licensed under the MIT-0 License. See the LICENSE file.

The dataset is licensed under the CC-BY-4.0 license.

## Citation

If you have found our work useful, please cite the technical report:

```
@misc{shu2024effectivegenaimultiagentcollaboration,
      title={Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications},
      author={Raphael Shu and Nilaksh Das and Michelle Yuan and Monica Sunkara and Yi Zhang},
      year={2024},
      eprint={2412.05449},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.05449},
}
```

## Core Contributors
* [Raphael Shu](https://github.com/zomux)
* [Nilaksh Das](https://github.com/nilakshdas)
* [Michelle Yuan](https://github.com/forest-snow)