https://github.com/kylejeong2/mcpvals

An MCP Evaluation Library
https://github.com/kylejeong2/mcpvals
evals mcp
Last synced: 3 months ago
JSON representation
An MCP Evaluation Library
Host: GitHub
URL: https://github.com/kylejeong2/mcpvals
Owner: Kylejeong2
License: apache-2.0
Created: 2025-06-28T05:57:46.000Z (12 months ago)
Default Branch: main
Last Pushed: 2025-11-04T06:56:41.000Z (7 months ago)
Last Synced: 2025-11-04T08:30:26.669Z (7 months ago)
Topics: evals, mcp
Language: TypeScript
Homepage: https://www.npmjs.com/package/mcpvals?activeTab=readme
Size: 5.82 MB
Stars: 46
Watchers: 0
Forks: 2
Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
Awesome Lists containing this project

README

          # MCPVals

A comprehensive evaluation library for Model Context Protocol (MCP) servers. Test and validate your MCP servers with complete MCP specification coverage including Tools with deterministic metrics, security validation, and optional LLM-based evaluation.

> **Status**: MVP – API **stable**, minor breaking changes possible before 1.0.0

---

## 0. Quick Start

### 1. Installation

```bash

# Install – pick your favourite package manager

pnpm add -D mcpvals            # dev-dependency is typical

```

### 2. Create a config file

Create a config file (e.g., `mcp-eval.config.ts`):

```typescript

import type { Config } from "mcpvals";

export default {

  server: {

    transport: "stdio",

    command: "node",

    args: ["./example/simple-mcp-server.js"],

  },

  // Test individual tools directly

  toolHealthSuites: [

    {

      name: "Calculator Health Tests",

      tests: [

        {

          name: "add",

          args: { a: 5, b: 3 },

          expectedResult: 8,

          maxLatency: 500,

        },

        {

          name: "divide",

          args: { a: 10, b: 0 },

          expectedError: "division by zero",

        },

      ],

    },

  ],

  // Test multi-step, LLM-driven workflows

  workflows: [

    {

      name: "Multi-step Calculation",

      steps: [

        {

          user: "Calculate (5 + 3) * 2, then divide by 4",

          expectedState: "4",

        },

      ],

      expectTools: ["add", "multiply", "divide"],

    },

  ],

  // Optional LLM judge

  llmJudge: true,

  openaiKey: process.env.OPENAI_API_KEY,

  passThreshold: 0.8,

} satisfies Config;

```

### 3. Run Evaluation

```bash

# Required for workflow execution

export ANTHROPIC_API_KEY="sk-ant-..."

# Optional for LLM judge

export OPENAI_API_KEY="sk-..."

# Run everything

npx mcpvals eval mcp-eval.config.ts

# Run only tool health tests

npx mcpvals eval mcp-eval.config.ts --tool-health-only

# Run with LLM judge and save report

npx mcpvals eval mcp-eval.config.ts --llm-judge --reporter json > report.json

```

---

## 1. Core Concepts

MCPVals provides comprehensive testing for all MCP specification primitives:

1.  **Tool Health Testing**: Directly calls individual tools with specific arguments to verify their correctness, performance, and error handling. This is ideal for unit testing and regression checking.

2.  **Workflow Evaluation**: Uses a large language model (LLM) to interpret natural language prompts and execute a series of tool calls to achieve a goal. This tests the integration of your MCP primitives from an LLM's perspective.

---

## 2. Installation & Runtime Requirements

1.  **Node.js ≥ 18** – we rely on native `fetch`, `EventSource`, and `fs/promises`.

2.  **pnpm / npm / yarn** – whichever you prefer, MCPVals is published as an ESM‐only package.

3.  **MCP Server** – a local `stdio` binary **or** a remote Streaming-HTTP endpoint.

4.  **Anthropic API Key** – Required for workflow execution (uses Claude to drive tool calls). Set via `ANTHROPIC_API_KEY` environment variable.

5.  **(Optional) OpenAI key** – Only required if using the LLM judge feature. Set via `OPENAI_API_KEY`.

> **ESM-only**: You **cannot** `require("mcpvals")` from a CommonJS project. Either enable `"type": "module"` in your `package.json` or use dynamic `import()`.

---

## 3. CLI Reference

```

Usage: mcpvals 

Commands:

  eval    Evaluate MCP servers using workflows and/or tool health tests

  list    List workflows in a config file

  help [command]  Show help                                [default]

Evaluation options:

  -d, --debug              Verbose logging (child-process stdout/stderr is piped)

  -r, --reporter      console | json | junit (JUnit coming soon)

  --llm-judge              Enable LLM judge (requires llmJudge:true + key)

  --tool-health-only       Run only tool health tests, skip others

  --workflows-only         Run only workflows, skip other test types

```

### 3.1 `eval`

Runs tests specified in the config file. It will run all configured test types (`toolHealthSuites` and `workflows`) by default. Use flags to run only specific types. Exits **0** on success or **1** on any failure – perfect for CI.

### 3.2 `list`

Static inspection – prints workflows without starting the server. Handy when iterating on test coverage.

---

## 4. Configuration

MCPVals loads **either** a `.json` file **or** a `.ts/.js` module that `export default` an object. Any string value in the config supports **Bash-style environment variable interpolation** `${VAR}`.

### 4.1 `server`

Defines how to connect to your MCP server.

- `transport`: `stdio`, `shttp` (Streaming HTTP), or `sse` (Server-Sent Events).

- `command`/`args`: (for `stdio`) The command to execute your server.

- `env`: (for `stdio`) Environment variables to set for the child process.

- `url`/`headers`: (for `shttp` and `sse`) The endpoint and headers for a remote server.

- `reconnect`/`reconnectInterval`/`maxReconnectAttempts`: (for `sse`) Reconnection settings for SSE connections.

**Example `shttp` with Authentication:**

```json

{

  "server": {

    "transport": "shttp",

    "url": "https://api.example.com/mcp",

    "headers": {

      "Authorization": "Bearer ${API_TOKEN}",

      "X-API-Key": "${API_KEY}"

    }

  }

}

```

**Example `sse` with Reconnection:**

```json

{

  "server": {

    "transport": "sse",

    "url": "https://api.example.com/mcp/sse",

    "headers": {

      "Accept": "text/event-stream",

      "Cache-Control": "no-cache",

      "Authorization": "Bearer ${API_TOKEN}"

    },

    "reconnect": true,

    "reconnectInterval": 5000,

    "maxReconnectAttempts": 10

  }

}

```

### 4.2 `toolHealthSuites[]`

An array of suites for testing tools directly. Each suite contains:

- `name`: Identifier for the test suite.

- `tests`: An array of individual tool tests.

- `parallel`: (boolean) Whether to run tests in the suite in parallel (default: `false`).

- `timeout`: (number) Override the global timeout for this suite.

#### Tool Test Schema

| Field            | Type      | Description                                                            |

| ---------------- | --------- | ---------------------------------------------------------------------- |

| `name`           | `string`  | Tool name to test (must match an available MCP tool).                  |

| `description`    | `string`? | What this test validates.                                              |

| `args`           | `object`  | Arguments to pass to the tool.                                         |

| `expectedResult` | `any`?    | Expected result. Uses deep equality for objects, contains for strings. |

| `expectedError`  | `string`? | Expected error message if the tool should fail.                        |

| `maxLatency`     | `number`? | Maximum acceptable latency in milliseconds.                            |

| `retries`        | `number`? | Retries on failure (0-5, default: 0).                                  |

### 4.3 `workflows[]`

An array of LLM-driven test workflows. Each workflow contains:

- `name`: Identifier for the workflow.

- `steps`: An array of user interactions (usually just one for a high-level goal).

- `expectTools`: An array of tool names expected to be called during the workflow.

#### Workflow Step Schema

| Field           | Type      | Description                                                                         |

| --------------- | --------- | ----------------------------------------------------------------------------------- |

| `user`          | `string`  | High-level user intent. The LLM will plan how to accomplish this.                   |

| `expectedState` | `string`? | A sub-string the evaluator looks for in the final assistant message or tool result. |

#### Workflow Best Practices

1.  **Write natural prompts**: Instead of micro-managing tool calls, give the LLM a complete task (e.g., "Book a flight from SF to NY for next Tuesday and then find a hotel near the airport.").

2.  **Use workflow-level `expectTools`**: List all tools you expect to be used across the entire workflow to verify the LLM's plan.

### 4.4 Global Options

- `timeout`: (number) Global timeout in ms for server startup and individual tool calls. Default: `30000`.

- `llmJudge`: (boolean) Enables the LLM Judge feature. Default: `false`.

- `openaiKey`: (string) OpenAI API key for the LLM Judge.

- `judgeModel`: (string) The model to use for judging. Default: `"gpt-4o"`.

- `passThreshold`: (number) The minimum score (0-1) from the LLM Judge to pass. Default: `0.8`.

---

## 5. Evaluation & Metrics

### 5.1 Tool Health Metrics

When running tool health tests, the following is assessed for each test:

- **Result Correctness**: Does the output match `expectedResult`?

- **Error Correctness**: If `expectedError` is set, did the tool fail with a matching error?

- **Latency**: Did the tool respond within `maxLatency`?

- **Success**: Did the tool call complete without unexpected errors?

### 5.2 Workflow Metrics (Deterministic)

For each workflow, a trace of the LLM interaction is recorded and evaluated against 3 metrics:

| #   | Metric                | Pass Criteria                                                               |

| --- | --------------------- | --------------------------------------------------------------------------- |

| 1   | End-to-End Success    | `expectedState` is found in the final response.                             |

| 2   | Tool Invocation Order | The tools listed in `expectTools` were called in the exact order specified. |

| 3   | Tool Call Health      | All tool calls completed successfully (no errors, HTTP 2xx, etc.).          |

The overall score is an arithmetic mean. The **evaluation fails** if _any_ metric fails.

### 5.7 LLM Judge (Optional)

Add subjective grading when deterministic checks are not enough (e.g., checking tone, or conversational quality).

- Set `"llmJudge": true` in the config and provide an OpenAI key.

- Use the `--llm-judge` CLI flag.

The judge asks the specified `judgeModel` for a score and a reason. A 4th metric, _LLM Judge_, is added to the workflow results, which passes if `score >= passThreshold`.

---

## 6. Library API

You can run evaluations programmatically.

```ts

import { evaluate } from "mcpvals";

const report = await evaluate("./mcp-eval.config.ts", {

  debug: process.env.CI === undefined,

  reporter: "json",

  llmJudge: true,

});

if (!report.passed) {

  process.exit(1);

}

```

## 7. Vitest Integration

MCPVals provides a complete **Vitest integration** for writing MCP server tests using the popular Vitest testing framework. This integration offers both individual test utilities and comprehensive evaluation suites with built-in scoring and custom matchers.

### 7.1 Quick Start

```bash

# Install vitest alongside mcpvals

pnpm add -D mcpvals vitest

```

```typescript

// tests/calculator.test.ts

import { describe, it, expect, beforeAll, afterAll } from "vitest";

import {

  setupMCPServer,

  teardownMCPServer,

  mcpTest,

  describeEval,

  ToolCallScorer,

  LatencyScorer,

  ContentScorer,

} from "mcpvals/vitest";

describe("Calculator MCP Server", () => {

  beforeAll(async () => {

    await setupMCPServer({

      transport: "stdio",

      command: "node",

      args: ["./calculator-server.js"],

    });

  });

  afterAll(async () => {

    await teardownMCPServer();

  });

  // Individual test

  mcpTest("should add numbers", async (utils) => {

    const result = await utils.callTool("add", { a: 5, b: 3 });

    expect(result.content[0].text).toBe("8");

    // Custom matchers

    await expect(result).toCallTool("add");

    await expect(result).toHaveLatencyBelow(1000);

  });

});

```

### 7.2 Core Functions

#### **`setupMCPServer(config, options?)`**

Starts an MCP server and returns utilities for testing.

```typescript

const utils = await setupMCPServer(

  {

    transport: "stdio",

    command: "node",

    args: ["./server.js"],

  },

  {

    timeout: 30000, // Server startup timeout

    debug: false, // Enable debug logging

  },

);

// Returns utility functions:

utils.callTool(name, args); // Call MCP tools

utils.runWorkflow(steps); // Execute LLM workflows

```

#### **`teardownMCPServer()`**

Cleanly shuts down the MCP server (call in `afterAll`).

#### **`mcpTest(name, testFn, timeout?)`**

Convenient wrapper for individual MCP tests.

```typescript

mcpTest(

  "tool test",

  async (utils) => {

    const result = await utils.callTool("echo", { message: "hello" });

    expect(result).toBeDefined();

  },

  10000,

); // Optional timeout

```

#### **`describeEval(config)`**

Comprehensive evaluation suite with automated scoring.

```typescript

describeEval({

  name: "Calculator Evaluation",

  server: { transport: "stdio", command: "node", args: ["./calc.js"] },

  threshold: 0.8, // 80% score required to pass

  data: async () => [

    {

      input: { operation: "add", a: 5, b: 3 },

      expected: { result: "8", tools: ["add"] },

    },

  ],

  task: async (input, context) => {

    const result = await context.utils.callTool(input.operation, {

      a: input.a,

      b: input.b,

    });

    return {

      result: result.content[0].text,

      toolCalls: [{ name: input.operation }],

      latency: Date.now() - startTime,

    };

  },

  scorers: [

    new ToolCallScorer({ expectedOrder: true }),

    new LatencyScorer({ maxLatencyMs: 1000 }),

    new ContentScorer({ patterns: [/\d+/] }),

  ],

});

```

### 7.3 Built-in Scorers

Scorers automatically evaluate different aspects of MCP server behavior, returning scores from 0-1.

#### **`ToolCallScorer`** - Tool Usage Evaluation

```typescript

new ToolCallScorer({

  expectedTools: ["add", "multiply"], // Tools that should be called

  expectedOrder: true, // Whether order matters

  allowExtraTools: false, // Penalize unexpected tools

});

```

**Scoring Algorithm:**

- 70% for calling expected tools

- 20% for correct order (if enabled)

- 10% penalty for extra tools (if disabled)

#### **`LatencyScorer`** - Performance Evaluation

```typescript

new LatencyScorer({

  maxLatencyMs: 1000, // Maximum acceptable latency

  penaltyThreshold: 500, // Start penalizing after this

});

```

**Scoring Logic:**

- Perfect score (1.0) for latency ≤ threshold

- Linear penalty between threshold and max

- Severe penalty (0.1) for exceeding max

- Perfect score for 0ms latency

#### **`WorkflowScorer`** - Workflow Success Evaluation

```typescript

new WorkflowScorer({

  requireSuccess: true, // Must have success: true

  checkMessages: true, // Validate message structure

  minMessages: 2, // Minimum message count

});

```

#### **`ContentScorer`** - Output Quality Assessment

```typescript

new ContentScorer({

  exactMatch: false, // Exact content matching

  caseSensitive: false, // Case sensitivity

  patterns: [/\d+/, /success/], // RegExp patterns to match

  requiredKeywords: ["result"], // Must contain these

  forbiddenKeywords: ["error", "fail"], // Penalize these

});

```

**Multi-dimensional Scoring:**

- 40% pattern matching

- 40% required keywords

- -20% forbidden keywords penalty

- 20% content relevance

### 7.4 Custom Matchers

MCPVals extends Vitest with MCP-specific assertion matchers:

```typescript

// Tool call assertions

await expect(result).toCallTool("add");

await expect(result).toCallTools(["add", "multiply"]);

await expect(result).toHaveToolCallOrder(["add", "multiply"]);

// Workflow assertions

await expect(workflow).toHaveSuccessfulWorkflow();

// Performance assertions

await expect(result).toHaveLatencyBelow(1000);

// Content assertions

await expect(result).toContainKeywords(["success", "complete"]);

await expect(result).toMatchPattern(/result: \d+/);

```

**Smart Content Extraction**: Matchers automatically handle various output formats:

- MCP server responses (`content[0].text`)

- Custom result objects (`{ result, toolCalls, latency }`)

- String outputs

- Workflow results (`{ success, messages, toolCalls }`)

### 7.5 TypeScript Support

Complete type safety with concrete types for common use cases:

```typescript

import type {

  MCPTestConfig,

  MCPTestContext,

  ToolCallTestCase,

  MCPToolResult,

  MCPWorkflowResult,

  ToolCallScorerOptions,

  LatencyScorerOptions,

  ContentScorerOptions,

  WorkflowScorerOptions,

} from "mcpvals/vitest";

// Typed test case

const testCase: ToolCallTestCase = {

  input: { operation: "add", a: 5, b: 3 },

  expected: { result: "8", tools: ["add"] },

};

// Typed scorer options

const scorer = new ToolCallScorer({

  expectedOrder: true,

  allowExtraTools: false,

} satisfies ToolCallScorerOptions);

// Typed task function

task: async (input, context): Promise => {

  const testCase = input as ToolCallTestCase["input"];

  const result = await context.utils.callTool(testCase.operation, {

    a: testCase.a,

    b: testCase.b,

  });

  return {

    result: result.content[0].text,

    toolCalls: [{ name: testCase.operation }],

    success: true,

    latency: Date.now() - startTime,

  };

};

```

### 7.6 Advanced Usage

#### **Dynamic Test Generation**

```typescript

describeEval({

  name: "Dynamic Calculator Tests",

  data: async () => {

    const operations = ["add", "subtract", "multiply", "divide"];

    return operations.map((op) => ({

      name: `Test ${op}`,

      input: { operation: op, a: 10, b: 2 },

      expected: { tools: [op] },

    }));

  },

});

```

#### **Debug Mode**

```bash

# Enable detailed logging

VITEST_MCP_DEBUG=true vitest run

# Shows:

# - Individual test scores and explanations

# - Performance metrics

# - Pass/fail reasons

# - Server lifecycle events

```

### 7.7 Integration Patterns

#### **Unit Testing Individual Tools**

```typescript

describe("Individual Tool Tests", () => {

  beforeAll(() => setupMCPServer(config));

  afterAll(() => teardownMCPServer());

  mcpTest("calculator addition", async (utils) => {

    const result = await utils.callTool("add", { a: 2, b: 3 });

    expect(result.content[0].text).toBe("5");

  });

  mcpTest("error handling", async (utils) => {

    try {

      await utils.callTool("divide", { a: 10, b: 0 });

      throw new Error("Should have failed");

    } catch (error) {

      expect(error.message).toContain("division by zero");

    }

  });

});

```

#### **Integration Testing with Workflows**

```typescript

mcpTest("complex workflow", async (utils) => {

  const workflow = await utils.runWorkflow([

    {

      user: "Calculate 2+3 then multiply by 4",

      expectTools: ["add", "multiply"],

    },

  ]);

  await expect(workflow).toHaveSuccessfulWorkflow();

  await expect(workflow).toCallTools(["add", "multiply"]);

  expect(workflow.messages).toHaveLength(2);

});

```

#### **Performance Benchmarking**

```typescript

describeEval({

  name: "Performance Benchmarks",

  threshold: 0.9, // High threshold for performance tests

  scorers: [

    new LatencyScorer({

      maxLatencyMs: 100, // Strict latency requirement

      penaltyThreshold: 50,

    }),

    new ToolCallScorer({ allowExtraTools: false }), // No unnecessary calls

    new ContentScorer({ patterns: [/^\d+$/] }), // Validate output format

  ],

});

```

#### **Multi-Server Testing**

```typescript

describe("Multi-Server Comparison", () => {

  const servers = [

    { name: "Server A", command: "./server-a.js" },

    { name: "Server B", command: "./server-b.js" },

  ];

  servers.forEach((server) => {

    describe(server.name, () => {

      beforeAll(() =>

        setupMCPServer({

          transport: "stdio",

          command: "node",

          args: [server.command],

        }),

      );

      afterAll(() => teardownMCPServer());

      mcpTest("standard test", async (utils) => {

        const result = await utils.callTool("test", {});

        expect(result).toBeDefined();

      });

    });

  });

});

```

### 7.8 Best Practices

1. **Use `beforeAll`/`afterAll`**: Always properly setup and teardown MCP servers

2. **Leverage TypeScript**: Use concrete types for better development experience

3. **Test individual tools first**: Use `mcpTest` for unit testing, `describeEval` for integration

4. **Set appropriate thresholds**: Start with 0.8, adjust based on your quality requirements

5. **Combine scorers**: Use multiple scorers to evaluate different aspects (functionality, performance, content)

6. **Enable debug mode**: Use `VITEST_MCP_DEBUG=true` when troubleshooting

7. **Write realistic test data**: Create test cases that reflect real-world usage

8. **Use custom matchers**: Leverage MCP-specific matchers for readable assertions

### 7.9 Example: Complete Test Suite

```typescript

import { describe, it, expect, beforeAll, afterAll } from "vitest";

import {

  setupMCPServer,

  teardownMCPServer,

  mcpTest,

  describeEval,

  ToolCallScorer,

  WorkflowScorer,

  LatencyScorer,

  ContentScorer,

  type ToolCallTestCase,

  type MCPToolResult,

} from "mcpvals/vitest";

describe("Production Calculator Server", () => {

  beforeAll(async () => {

    await setupMCPServer(

      {

        transport: "stdio",

        command: "node",

        args: ["./dist/calculator-server.js"],

      },

      {

        timeout: 10000,

        debug: process.env.CI !== "true",

      },

    );

  });

  afterAll(async () => {

    await teardownMCPServer();

  });

  // Unit tests for individual operations

  mcpTest("addition works correctly", async (utils) => {

    const result = await utils.callTool("add", { a: 5, b: 3 });

    expect(result.content[0].text).toBe("8");

    await expect(result).toCallTool("add");

    await expect(result).toHaveLatencyBelow(100);

  });

  mcpTest("handles division by zero", async (utils) => {

    try {

      await utils.callTool("divide", { a: 10, b: 0 });

      throw new Error("Expected division by zero error");

    } catch (error) {

      expect(error.message).toContain("division by zero");

    }

  });

  // Comprehensive evaluation suite

  describeEval({

    name: "Calculator Performance Suite",

    server: {

      transport: "stdio",

      command: "node",

      args: ["./dist/calculator-server.js"],

    },

    threshold: 0.85,

    timeout: 30000,

    data: async (): Promise => [

      {

        name: "Basic Addition",

        input: { operation: "add", a: 10, b: 5 },

        expected: { result: "15", tools: ["add"] },

      },

      {

        name: "Complex Multiplication",

        input: { operation: "multiply", a: 7, b: 8 },

        expected: { result: "56", tools: ["multiply"] },

      },

      {

        name: "Subtraction Test",

        input: { operation: "subtract", a: 20, b: 8 },

        expected: { result: "12", tools: ["subtract"] },

      },

    ],

    task: async (input, context): Promise => {

      const testCase = input as ToolCallTestCase["input"];

      const startTime = Date.now();

      try {

        const result = await context.utils.callTool(testCase.operation, {

          a: testCase.a,

          b: testCase.b,

        });

        return {

          result: result.content[0].text,

          toolCalls: [{ name: testCase.operation }],

          success: true,

          latency: Date.now() - startTime,

          executionTime: Date.now() - startTime,

        };

      } catch (error) {

        return {

          result: null,

          toolCalls: [],

          success: false,

          error: error.message,

          latency: Date.now() - startTime,

          executionTime: Date.now() - startTime,

        };

      }

    },

    scorers: [

      new ToolCallScorer({

        expectedOrder: true,

        allowExtraTools: false,

      }),

      new WorkflowScorer({

        requireSuccess: true,

        checkMessages: false,

      }),

      new LatencyScorer({

        maxLatencyMs: 500,

        penaltyThreshold: 200,

      }),

      new ContentScorer({

        exactMatch: false,

        caseSensitive: false,

        patterns: [/^\d+$/], // Results should be numbers

      }),

    ],

  });

  // Integration test with workflows

  mcpTest("multi-step calculation workflow", async (utils) => {

    const workflow = await utils.runWorkflow([

      {

        user: "Calculate 5 plus 3, then multiply the result by 2",

        expectTools: ["add", "multiply"],

      },

    ]);

    await expect(workflow).toHaveSuccessfulWorkflow();

    await expect(workflow).toCallTools(["add", "multiply"]);

    await expect(workflow).toHaveToolCallOrder(["add", "multiply"]);

    // Verify final result

    const finalMessage = workflow.messages[workflow.messages.length - 1];

    expect(finalMessage.content).toContain("16");

  });

});

```

**Run the tests:**

```bash

# Run all tests

vitest run

# Run with debug output

VITEST_MCP_DEBUG=true vitest run

# Run in watch mode during development

vitest

# Generate coverage report

vitest run --coverage

```

This Vitest integration makes MCP server testing **accessible, automated, and reliable** - combining the speed and developer experience of Vitest with specialized tools for comprehensive MCP server evaluation.

---

## 8. Extensibility & Troubleshooting

- **Custom Reporters**: Import `ConsoleReporter` for reference and implement your own `.report()` method.

- **Server Hangs**: Increase the `timeout` value in your config. Ensure your server writes MCP messages to `stdout`.

- **LLM Judge Fails**: Use `--debug` to inspect the raw model output for malformed JSON.

---

## 9 Complete Example Configuration

Here's a comprehensive example showcasing all evaluation types:

```typescript

import type { Config } from "mcpvals";

export default {

  server: {

    transport: "stdio", // Also supports "shttp" and "sse"

    command: "node",

    args: ["./my-mcp-server.js"],

  },

  // Alternative SSE server configuration:

  // server: {

  //   transport: "sse",

  //   url: "https://api.example.com/mcp/sse",

  //   headers: {

  //     "Accept": "text/event-stream",

  //     "Cache-Control": "no-cache",

  //     "Authorization": "Bearer ${API_TOKEN}"

  //   },

  //   reconnect: true,

  //   reconnectInterval: 5000,

  //   maxReconnectAttempts: 10

  // },

  // Test tools

  toolHealthSuites: [

    {

      name: "Core Functions",

      tests: [

        { name: "add", args: { a: 5, b: 3 }, expectedResult: 8 },

        {

          name: "divide",

          args: { a: 10, b: 0 },

          expectedError: "division by zero",

        },

      ],

    },

  ],

  // Test workflows

  workflows: [

    {

      name: "Complete Workflow",

      steps: [{ user: "Process user data and generate a report" }],

      expectTools: ["fetch-data", "process", "generate-report"],

    },

  ],

  llmJudge: true,

  openaiKey: process.env.OPENAI_API_KEY,

  timeout: 30000,

} satisfies Config;

```

---

## 10. Acknowledgements

- [Model Context Protocol](https://modelcontextprotoco.lol) – for the SDK

- [Vercel AI SDK](https://sdk.vercel.ai) – for LLM integration

- [chalk](https://github.com/chalk/chalk) – for terminal colors

Enjoy testing your MCP servers – PRs, issues & feedback welcome! ✨
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/kylejeong2/mcpvals

Awesome Lists containing this project

README