https://github.com/dylibso/mcpx-eval
An open-ended eval framework for mcp.run tools
- Host: GitHub
- URL: https://github.com/dylibso/mcpx-eval
- Owner: dylibso
- License: bsd-3-clause
- Created: 2025-03-01T04:52:04.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-03-03T19:54:50.000Z (over 1 year ago)
- Last Synced: 2025-03-03T20:35:52.228Z (over 1 year ago)
- Language: Python
- Homepage:
- Size: 203 KB
- Stars: 1
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# mcpx-eval
A framework for evaluating open-ended tool use across various large language models.
`mcpx-eval` can be used to compare the output of different LLMs with the same prompt for a given task using [mcp.run](https://www.mcp.run) tools.
This means we care not only about the quality of the output, but also about how helpful each model is
when presented with real-world tools.
## Test configs
The [tests/](https://github.com/dylibso/mcpx-eval/tree/main/tests) directory contains pre-defined evals.
## Installation
```bash
uv tool install git+https://github.com/dylibso/mcpx-eval
```
## Usage
Run the `my-test` test for 10 iterations:
```bash
mcpx-eval test --model ... --model ... --config my-test.toml --iter 10
```
Generate an HTML scoreboard for all evals:
```bash
mcpx-eval gen --html results.html --show
```
### Test file
A test file is a TOML file containing the following fields:
- `name` - name of the test
- `prompt` - prompt to test; this is passed to the LLM under test
- `check` - prompt for the judge; this is used to determine the quality of the test output
- `expected-tools` - list of tool names that might be used
- `ignore-tools` - list of tools to ignore; they will not be available to the LLM
- `import` - includes fields from another test TOML file