Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/microsoft/ToolTalk
Evaluating tool-augmented LLMs in conversation settings
- Host: GitHub
- URL: https://github.com/microsoft/ToolTalk
- Owner: microsoft
- License: MIT
- Created: 2023-10-10T01:15:30.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-05-14T21:57:57.000Z (6 months ago)
- Last Synced: 2024-05-15T17:40:46.689Z (6 months ago)
- Language: Python
- Size: 262 KB
- Stars: 48
- Watchers: 4
- Forks: 11
- Open Issues: 4
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Security: SECURITY.md
- Support: SUPPORT.md
# :wrench: ToolTalk :speech_balloon:
:page_facing_up: [Paper](https://arxiv.org/abs/2311.10775) | :mailbox: Contact

Introducing ToolTalk, a benchmark for evaluating tool-augmented LLMs in a conversational setting.
## Details
ToolTalk is designed to evaluate tool-augmented LLMs used as chatbots,
an increasingly popular paradigm for everyday users to harness the power of LLMs.
ToolTalk contains a handcrafted dataset of 28 easy conversations and 50 hard conversations,
annotated with ground-truth usage of 28 unique tools belonging to 7 themed "plugins".

Evaluation consists of prompting an LLM to predict the correct sequence of tools after every user utterance in a conversation.
Thus, evaluating a single conversation requires an LLM to correctly predict multiple sub-tasks.
Predictions are compared against the ground truth to determine success for the conversation as a whole; a minimal sketch of this loop follows.
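The snippet below is a rough Python sketch of that turn-by-turn loop, not the actual ToolTalk harness; the `Turn` dataclass, the `predict` callback, and the exact-match comparison on tool names are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Turn:
    """One user utterance plus the tool calls annotated as ground truth."""
    utterance: str
    ground_truth_tools: list[str] = field(default_factory=list)

def conversation_succeeds(
    turns: list[Turn],
    predict: Callable[[list[str]], list[str]],
) -> bool:
    """Replay a conversation, checking the model's tool predictions per turn."""
    history: list[str] = []
    for turn in turns:
        history.append(turn.utterance)
        predicted = predict(history)  # model predicts tools for this utterance
        if predicted != turn.ground_truth_tools:
            return False  # a single incorrect turn fails the whole conversation
    return True
```

The real harness is more involved (for example, it also considers tool arguments, not just tool names), but the turn-by-turn structure is the same.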

We evaluate two chatbots on ToolTalk, powered by gpt-3.5-turbo-0613 and gpt-4-0613 and implemented using OpenAI's chat completions API.

| Model | Subset | Success rate | Precision | Recall | Incorrect action rate |
|---------|----------|--------------|-----------|--------|-----------------------|
| GPT-3.5 | Easy | 85.7% | 42.4% | 89.3% | 5.0% |
| GPT-4 | Easy | 92.8% | 69.2% | 96.4% | 3.8% |
| GPT-3.5 | Hard | 26.0% | 54.6% | 69.7% | 23.9% |
| GPT-4 | Hard | 50.0% | 74.9% | 79.0% | 25.1% |

## Setup

ToolTalk can be set up using the following commands. Install the local package with dev dependencies to enable the unit tests.
```bash
pip install -r requirements.txt
pip install -e ".[dev]"
```

To verify that the installation was successful, run the unit tests.
```bash
pytest tests
```

## Reproducing the results
The results for GPT-3.5-turbo and GPT-4 can be reproduced using the following commands. This requires access to
OpenAI's API. Results are saved in the `results` folder. The scripts cache intermediate results, so they can be
re-run if interrupted for any reason.

```bash
export OPENAI_API_KEY=
bash evaluate_gpt35turbo.sh
bash evaluate_gpt4.sh
```

Your results should look something like the numbers above; expect some variance, since both models produce non-deterministic outputs.
## Generating scenarios
To generate new scenarios, you can use the following command.
```bash
python -m tooltalk.generation.scenario_generator --prompt src/prompts/scenario_template.md --output_dir output/scenarios
```

## Evaluating on new models
The easiest way to evaluate a new model is to create a `Predictor` class that inherits from `tooltalk.evaluation.tool_executor.BaseAPIPredictor`.
For examples of how to do this, see `tooltalk.evaluation.tool_executor.GPT3Predictor` and `tooltalk.evaluation.evaluate_openai.OpenAIPredictor`; a skeleton is sketched below.
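As a starting point, a new predictor might look like the following sketch. The import path is taken from this README, but the constructor and the `predict` signature are assumptions; mirror the abstract interface actually defined on `BaseAPIPredictor` and the two example classes.

```python
from tooltalk.evaluation.tool_executor import BaseAPIPredictor

class MyModelPredictor(BaseAPIPredictor):
    """Hypothetical predictor wrapping an arbitrary chat-completion client."""

    def __init__(self, client, model_name: str):
        # `client` is whatever SDK object talks to your model (an assumption).
        self.client = client
        self.model_name = model_name

    def predict(self, metadata, conversation_history):
        # NOTE: this signature is an assumption; match the abstract method
        # on BaseAPIPredictor in the actual source.
        # 1. Render the available tools and the dialogue so far into a prompt.
        # 2. Call self.client to get a completion from the model.
        # 3. Parse the completion into the tool call (or chat reply) format
        #    that the evaluation harness expects.
        raise NotImplementedError("model-specific prompting and parsing")
```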

## Citing

```bibtex
@article{farn2023tooltalk,
  title={ToolTalk: Evaluating Tool-Usage in a Conversation Setting},
  author={Nicholas Farn and Richard Shin},
  year={2023},
  journal={arXiv preprint arXiv:2311.10775},
}
```

## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

## Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos is subject to those third parties' policies.