https://github.com/gptscript-ai/function-calling-test-suite

Last synced: 11 months ago
JSON representation

Host: GitHub
URL: https://github.com/gptscript-ai/function-calling-test-suite
Owner: gptscript-ai
License: apache-2.0
Created: 2024-04-04T07:33:21.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-07-10T15:58:58.000Z (almost 2 years ago)
Last Synced: 2025-04-07T02:13:55.470Z (about 1 year ago)
Language: Python
Size: 203 KB
Stars: 12
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# function-calling-test-suite

`function-calling-test-suite` (`FCTS`) is a pragmatic test framework for assessing the function calling capabilities of large language models (LLMs).

## Test spec overview

Test specs files contain YAML streams that define the metadata, input, and expected output for a set of test cases.

e.g.

```yaml
---
categories:
- basic
description: >-
Asserts that the model can make a function call with a given argument and
conveys the result to the user
prompt: Call funcA with 1 and respond with the result of the call
available_functions:
- name: funcA
description: Performs funcA
parameters:
type: object
properties:
param1:
type: integer
description: Param 1
expected_function_calls:
- name: funcA
arguments:
param1: 1
result: This is the output of funcA(1)
final_answer_should: >-
The answer should indicate that the result of calling funcA with 1 is "This is
the output of funcA(1)"
---
# ...
```

### Spec anatomy

Every test spec has three primary components:

#### 1. Test metadata

Each test spec must include a `description` and `categories`. The `description` outlines the test's goal, while
`categories` tag the capabilities being tested. Categorizing test cases helps identify the model's strengths and weaknesses.

#### 2. Functions definitions and expected calls

When executed, the framework uses the `prompt` and `available_functions` to generate an initial request.
It compares the model’s response to `expected_function_calls`. If they match, the framework continues making requests
with the `result` field until all expected calls are completed or a call fails.

#### 3. Answer criteria

Even if a model completes all expected function calls, the final response still needs to be verified.
To this end, specs can optionally include a `final_answer_should` field to describe valid answers using natural language.

### Default test suite

The default suite of spec files can be found in the [specs](./specs) directory.

## Basic usage

Initialize test environment

```sh
poetry shell
poetry install
```

Configure FCTS to judge model responses with `gpt-4-turbo`

```sh
export OPENAI_API_KEY=''
```

Target a model to test

```sh
export FCTS_API_KEY=''
export FCTS_BASE_URL=''
export FCTS_MODEL=''
```

Run the [default test suite](./specs) with verbose output enabled:

```sh
poetry run pytest -vvv
```

### Run options

```sh
$ poetry run pytest -h
usage: pytest [options] [file_or_dir] [file_or_dir] [...]
...
Custom options:
--spec-run-count=SPEC_RUN_COUNT Number of times each test spec should be run
--spec-filter=SPEC_FILTER Filter which test specs are run by their generated test IDs
--spec-dir=SPEC_DIR Directory containing JSON test spec files
--stream=STREAM Enables streaming for all chat completion requests
--use-system-prompt=USE_SYSTEM_PROMPT Add a default system prompt to all chat completion requests
--aggregate-summary-file=AGGREGATE_SUMMARY_FILE Add results for the model to an aggregate CSV file
--request-delay=REQUEST_DELAY Delay in seconds between chat completion requests
...
```

## Testing models without chat completion API support

GPTScript's [alternative model provider shims](https://docs.gptscript.ai/alternative-model-providers) can be used to test models that don't support OpenAI's chat
completion API.

### Using the test script (Automated)

The `test.sh` script automates the configuration and deployment of provider shims for a few popular models:

- claude-3.5-sonnet
- gemini-1.5-pro
- mistral-large-latest

Before using `test.sh`, ensure the `OPENAI_API_KEY` environment variable is set and the following CLIs are installed on your system:

- [gptscript](https://docs.gptscript.ai/#getting-started)
- [gcloud](https://cloud.google.com/sdk/docs/install-sdk)

To run the test suite, just pass the name of the model to the script, followed by the desired pytest arguments.

e.g.

```shell
./test.sh gemini-1.5-pro --spec-run-count=10 --spec-filter='*basic.yaml-0*'
```

> **Note:** The script will prompt for auth tokens if necessary

You can also run the provider shims and configure the test suite manually. See the sections below for some examples.

### claude-3.5-sonnet (Manual)

Set an Anthopic key:

```shell
export ANTHROPIC_API_KEY=''
```

Clone the [claude3-anthropic-provider](https://github.com/gptscript-ai/claude3-anthropic-provider):

```sh
git clone https://github.com/gptscript-ai/claude3-anthropic-provider
```

Follow the `Development` instructions in the repo's `README.md`:

```sh
cd claude3-anthropic-provider
export GPTSCRIPT_DEBUG=true
python -m venv .venv
source ./.venv/bin/activate
pip install -r requirements.txt
```

Run the shim:

```sh
./run.sh
```

In another terminal, target the provider shim:

```sh
export FCTS_MODEL='claude-3-5-sonnet-20240620'
export FCTS_BASE_URL='http://127.0.0.1:8000/v1'
export FCTS_API_KEY='foo'
```

> **Note:** The API key can be set to any arbitrary value, but must be set

Run the tests:

```shell
poetry shell
poetry install
poetry run pytest --stream=true
```

> **Note: Streaming must be enabled because the `claude3-anthropic-provider` doesn't support non-streaming responses**

### gemini-1.5-pro (Manual)

Ensure the following requirements are met:

- [gcloud CLI](https://cloud.google.com/sdk/docs/install-sdk)
- [VertexAI](https://cloud.google.com/vertex-ai) access

Configure `gcloud` CLI to use your VertexAI project and account:

```sh
gcloud config set project
gcloud config set billing/quota_project
gcloud config set account
gcloud components update
```

Afterwords, your configuration should look something like this:

```sh
gcloud config list
[billing]
quota_project = acorn-io
[core]
account = nick@acorn.io
disable_usage_reporting = False
project = acorn-io

Your active configuration is: [default]
```

Authenticate with the `gcloud` CLI:

```sh
gcloud auth application-default login
```

Clone the [gemini-vertexai-provider repo](https://github.com/gptscript-ai/gemini-vertexai-provider):

```sh
git clone https://github.com/gptscript-ai/gemini-vertexai-provider
```

Follow the `Development` instructions in the repo's `README.md`:

```sh
cd gemini-vertexai-provider
export GPTSCRIPT_DEBUG=true
python -m venv .venv
source ./.venv/bin/activate
pip install -r requirements.txt
```

Run the shim:

```sh
./run.sh
```

In another terminal, target the provider shim:

```sh
export FCTS_MODEL='gemini-1.5-pro-preview-0409'
export FCTS_BASE_URL='http://127.0.0.1:8081/v1'
export FCTS_API_KEY='foo'
```

> **Note:** The API key can be set to any arbitrary value, but must be set

Run the tests:

```shell
poetry shell
poetry install
poetry run pytest --stream=true
```

> **Note:** Streaming must be enabled because the `gemini-1.5-pro-preview-0409` doesn't support non-streaming responses

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/gptscript-ai/function-calling-test-suite

Awesome Lists containing this project

README