https://github.com/dylibso/mcpx-eval

An open-ended eval framework for mcp.run tools

Host: GitHub
URL: https://github.com/dylibso/mcpx-eval
Owner: dylibso
License: bsd-3-clause
Created: 2025-03-01T04:52:04.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-03-03T19:54:50.000Z (over 1 year ago)
Last Synced: 2025-03-03T20:35:52.228Z (over 1 year ago)
Language: Python
Homepage:
Size: 203 KB
Stars: 1
Watchers: 4
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# mcpx-eval

A framework for evaluating open-ended tool use across various large language models.

`mcpx-eval` can be used to compare the output of different LLMs given the same prompt for a task that uses [mcp.run](https://www.mcp.run) tools.
This means we're interested not only in the quality of the output, but also in how helpful each model is
when presented with real-world tools.

## Test configs

The [tests/](https://github.com/dylibso/mcpx-eval/tree/main/tests) directory contains pre-defined evals.

## Installation

```bash
uv tool install git+https://github.com/dylibso/mcpx-eval
```

## Usage

Run the `my-test` test against one or more models (pass `--model` once per model) for 10 iterations:

```bash
mcpx-eval test --model ... --model ... --config my-test.toml --iter 10
```
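When scripting many runs, it can help to build the `mcpx-eval test` argument list programmatically. The sketch below is a hypothetical helper (`build_test_command` is not part of `mcpx-eval`) that uses only the flags shown in the command above, repeating `--model` once per model; the model names in the example are placeholders.

```python
import shlex
from collections.abc import Sequence


def build_test_command(models: Sequence[str], config: str, iterations: int) -> list[str]:
    """Hypothetical helper: build an `mcpx-eval test` argv from the README's flags.

    Each model gets its own `--model` flag, mirroring `--model ... --model ...`.
    """
    if not models:
        raise ValueError("at least one model is required")
    if iterations < 1:
        raise ValueError("iterations must be a positive integer")
    argv = ["mcpx-eval", "test"]
    for model in models:
        argv += ["--model", model]
    argv += ["--config", config, "--iter", str(iterations)]
    return argv


if __name__ == "__main__":
    # Placeholder model names; substitute the identifiers your setup accepts.
    cmd = build_test_command(["model-a", "model-b"], "my-test.toml", 10)
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Returning a list rather than a shell string avoids quoting problems when passing the command to `subprocess.run`; `shlex.join` is used only to display it.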

Generate an HTML scoreboard for all evals:

```bash
mcpx-eval gen --html results.html --show
```

### Test file

A test file is a TOML file containing the following fields:

- `name` - name of the test
- `prompt` - the prompt to test; this is passed to the LLM under test
- `check` - the prompt for the judge; this is used to assess the quality of the test output
- `expected-tools` - list of tool names that might be used
- `ignore-tools` - list of tools to ignore; they will not be available to the LLM
- `import` - includes fields from another test TOML file
