Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/aws-samples/multiagent-collab-scenario-benchmark
Benchmarking data and script used for LLM multi-agent collaboration systems from AWS Bedrock Agents Science team.
- Host: GitHub
- URL: https://github.com/aws-samples/multiagent-collab-scenario-benchmark
- Owner: aws-samples
- License: MIT-0
- Created: 2024-11-06T19:17:30.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2024-12-10T15:59:34.000Z (29 days ago)
- Last Synced: 2024-12-10T17:30:53.659Z (29 days ago)
- Language: Python
- Size: 95.7 KB
- Stars: 2
- Watchers: 3
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
Awesome Lists containing this project
- awesome_ai_agents - Multiagent-Collab-Scenario-Benchmark - Benchmarking data and script used for LLM multi-agent collaboration systems from AWS Bedrock Agents Science team. (Building / Benchmarks)
README
## Multi-agent Collaboration Scenario Benchmarking
This repository contains benchmarking material from the AWS Bedrock Agents multi-agent collaboration technical report: "Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications". The report is available on arXiv: https://arxiv.org/abs/2412.05449.
### Data
Benchmarking data is in the `datasets` directory, which contains 30 hypothetical scenarios spanning three domains: travel planning, mortgage financing, and software development.
Each entry in the scenarios file contains:
- `scenario`: The user background and goals.
- `input_problem`: A description of the problem to be solved by the agent.
- `assertions`: A list of assertions that must hold true when judging the interaction between the user and the agent.

In each dataset, there is also an `agents.json` file that contains each agent's name and description, as well as their corresponding tools. The scenarios are collected based on these agent profiles and tool schemas.
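As an illustration, a scenario entry can be loaded with a few lines of Python. The `datasets/travel/scenarios.json` path below is an assumption, not a guarantee about the repository layout (the benchmark script takes the dataset directory and scenario filename as parameters):

```
import json

# Path and filename are assumptions; pass your actual dataset directory
# and scenario filename to the benchmark script (see "How to use").
with open("datasets/travel/scenarios.json") as f:
    scenarios = json.load(f)

entry = scenarios[0]
print(entry["scenario"])       # user background and goals
print(entry["input_problem"])  # problem for the agent to solve
print(entry["assertions"])     # assertions used to judge the interaction
```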
### Prerequisites
Create a Python 3.12 virtual environment and install requirements in `requirements.txt`.
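For example, on a Unix-like shell:

```
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```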
Next, prepare the conversations that you want to benchmark. Each conversation should be in its own JSON file named `conversation_0.json`, `conversation_1.json`, etc., where the index corresponds to the scenario index. Each `conversation_{i}.json` file should be formatted as follows:
```
{
    "trajectories": {
        "agent_id_1": [
            {
                "role": null,         # null, User, Action, Observation
                "source": "",         # agent_id of the agent who sent this message
                "destination": "",    # id of the user or agent who received this message
                "content": "",        # content of the message
                "actions": [],        # list of action objects executed by the agent
                "observation": null,  # observation of the agent
            }
        ],
        "agent_id_2": [...],
        ...
    }
}
```

See `sample_conversations` for examples.
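As a minimal sketch, a conversation file in this format can also be produced programmatically. The agent ID, message content, and single-turn structure below are invented for illustration:

```
import json

# Hypothetical single-turn trajectory following the schema above.
conversation = {
    "trajectories": {
        "travel_agent": [
            {
                "role": "User",
                "source": "user",
                "destination": "travel_agent",
                "content": "Plan a 3-day trip to Seattle in May.",
                "actions": [],
                "observation": None,  # written out as null
            }
        ]
    }
}

# The index in the filename pairs this conversation with scenario 0.
with open("conversation_0.json", "w") as f:
    json.dump(conversation, f, indent=2)
```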
### How to use
First, export any environment variables needed for LLM providers (Bedrock, OpenAI, Anthropic, etc.) to support the LLM judge. See [LiteLLM Providers](https://docs.litellm.ai/docs/providers) for details on setting up each provider.
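For example, when using Bedrock as the judge provider, the relevant variables might look like the following (placeholder values; check the LiteLLM documentation for the exact variables your provider requires):

```
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_REGION_NAME=us-east-1
```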
Run the benchmarking script on a sample travel conversation:
```
{export env variables}
python -m src.benchmark
```

Customize the benchmarking parameters as needed:
```
python -m src.benchmark \
    --dataset_dir {dataset directory} \
    --scenario_filename {scenario file name} \
    --conversations_dir {conversations directory} \
    --llm_judge_id {LLM judge model ID}
```
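For instance, to judge the sample travel conversations with a Bedrock-hosted judge, an invocation might look like this (the paths and model ID are illustrative assumptions, not fixed by the repository):

```
python -m src.benchmark \
    --dataset_dir datasets/travel \
    --scenario_filename scenarios.json \
    --conversations_dir sample_conversations \
    --llm_judge_id bedrock/anthropic.claude-3-5-sonnet-20240620-v1:0
```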
## Security
See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.
## License
This library is licensed under the MIT-0 License. See the LICENSE file.
The dataset is licensed under the CC-BY-4.0 license.
## Citation
If you have found our work useful, please cite the technical report:
```
@misc{shu2024effectivegenaimultiagentcollaboration,
  title={Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications},
  author={Raphael Shu and Nilaksh Das and Michelle Yuan and Monica Sunkara and Yi Zhang},
  year={2024},
  eprint={2412.05449},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.05449},
}
```
## Core Contributors
* [Raphael Shu](https://github.com/zomux)
* [Nilaksh Das](https://github.com/nilakshdas)
* [Michelle Yuan](https://github.com/forest-snow)