{"id":30863392,"url":"https://github.com/kylejeong2/mcpvals","last_synced_at":"2026-03-05T16:03:47.359Z","repository":{"id":304541580,"uuid":"1009996543","full_name":"Kylejeong2/mcpvals","owner":"Kylejeong2","description":"An MCP Evaluation Library","archived":false,"fork":false,"pushed_at":"2025-11-04T06:56:41.000Z","size":6106,"stargazers_count":46,"open_issues_count":12,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-11-04T08:30:26.669Z","etag":null,"topics":["evals","mcp"],"latest_commit_sha":null,"homepage":"https://www.npmjs.com/package/mcpvals?activeTab=readme","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Kylejeong2.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-06-28T05:57:46.000Z","updated_at":"2025-11-04T07:09:07.000Z","dependencies_parsed_at":"2025-09-07T18:52:02.532Z","dependency_job_id":"6f8fb98a-0ad0-495a-870c-4fdd8ca3dfe0","html_url":"https://github.com/Kylejeong2/mcpvals","commit_stats":null,"previous_names":["kylejeong2/mcpvals"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Kylejeong2/mcpvals","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kylejeong2%2Fmcpvals","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kylejeong2%2Fmcpvals/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kylejeong2%2Fmcpvals/releases","manifest
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kylejeong2%2Fmcpvals/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Kylejeong2","download_url":"https://codeload.github.com/Kylejeong2/mcpvals/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Kylejeong2%2Fmcpvals/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30134580,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-05T15:35:27.018Z","status":"ssl_error","status_checked_at":"2026-03-05T15:35:23.768Z","response_time":93,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evals","mcp"],"created_at":"2025-09-07T18:51:35.264Z","updated_at":"2026-03-05T16:03:47.345Z","avatar_url":"https://github.com/Kylejeong2.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# MCPVals\n\nA comprehensive evaluation library for Model Context Protocol (MCP) servers. Test and validate your MCP servers with complete MCP specification coverage including Tools with deterministic metrics, security validation, and optional LLM-based evaluation.\n\n\u003e **Status**: MVP – API **stable**, minor breaking changes possible before 1.0.0\n\n---\n\n## 0. Quick Start\n\n### 1. 
Installation\n\n```bash\n# Install – pick your favourite package manager\npnpm add -D mcpvals            # dev-dependency is typical\n```\n\n### 2. Create a config file\n\nCreate a config file (e.g., `mcp-eval.config.ts`):\n\n```typescript\nimport type { Config } from \"mcpvals\";\n\nexport default {\n  server: {\n    transport: \"stdio\",\n    command: \"node\",\n    args: [\"./example/simple-mcp-server.js\"],\n  },\n\n  // Test individual tools directly\n  toolHealthSuites: [\n    {\n      name: \"Calculator Health Tests\",\n      tests: [\n        {\n          name: \"add\",\n          args: { a: 5, b: 3 },\n          expectedResult: 8,\n          maxLatency: 500,\n        },\n        {\n          name: \"divide\",\n          args: { a: 10, b: 0 },\n          expectedError: \"division by zero\",\n        },\n      ],\n    },\n  ],\n\n  // Test multi-step, LLM-driven workflows\n  workflows: [\n    {\n      name: \"Multi-step Calculation\",\n      steps: [\n        {\n          user: \"Calculate (5 + 3) * 2, then divide by 4\",\n          expectedState: \"4\",\n        },\n      ],\n      expectTools: [\"add\", \"multiply\", \"divide\"],\n    },\n  ],\n\n  // Optional LLM judge\n  llmJudge: true,\n  openaiKey: process.env.OPENAI_API_KEY,\n  passThreshold: 0.8,\n} satisfies Config;\n```\n\n### 3. Run Evaluation\n\n```bash\n# Required for workflow execution\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"\n\n# Optional for LLM judge\nexport OPENAI_API_KEY=\"sk-...\"\n\n# Run everything\nnpx mcpvals eval mcp-eval.config.ts\n\n# Run only tool health tests\nnpx mcpvals eval mcp-eval.config.ts --tool-health-only\n\n# Run with LLM judge and save report\nnpx mcpvals eval mcp-eval.config.ts --llm-judge --reporter json \u003e report.json\n```\n\n---\n\n## 1. Core Concepts\n\nMCPVals provides comprehensive testing for all MCP specification primitives:\n\n1.  
**Tool Health Testing**: Directly calls individual tools with specific arguments to verify their correctness, performance, and error handling. This is ideal for unit testing and regression checking.\n\n2.  **Workflow Evaluation**: Uses a large language model (LLM) to interpret natural language prompts and execute a series of tool calls to achieve a goal. This tests the integration of your MCP primitives from an LLM's perspective.\n\n---\n\n## 2. Installation \u0026 Runtime Requirements\n\n1.  **Node.js ≥ 18** – we rely on native `fetch`, `EventSource`, and `fs/promises`.\n2.  **pnpm / npm / yarn** – whichever you prefer, MCPVals is published as an ESM‐only package.\n3.  **MCP Server** – a local `stdio` binary **or** a remote Streaming-HTTP endpoint.\n4.  **Anthropic API Key** – Required for workflow execution (uses Claude to drive tool calls). Set via `ANTHROPIC_API_KEY` environment variable.\n5.  **(Optional) OpenAI key** – Only required if using the LLM judge feature. Set via `OPENAI_API_KEY`.\n\n\u003e **ESM-only**: You **cannot** `require(\"mcpvals\")` from a CommonJS project. Either enable `\"type\": \"module\"` in your `package.json` or use dynamic `import()`.\n\n---\n\n## 3. CLI Reference\n\n```\nUsage: mcpvals \u003ccommand\u003e\n\nCommands:\n  eval \u003cconfig\u003e   Evaluate MCP servers using workflows and/or tool health tests\n  list \u003cconfig\u003e   List workflows in a config file\n  help [command]  Show help                                [default]\n\nEvaluation options:\n  -d, --debug              Verbose logging (child-process stdout/stderr is piped)\n  -r, --reporter \u003cfmt\u003e     console | json | junit (JUnit coming soon)\n  --llm-judge              Enable LLM judge (requires llmJudge:true + key)\n  --tool-health-only       Run only tool health tests, skip others\n  --workflows-only         Run only workflows, skip other test types\n```\n\n### 3.1 `eval`\n\nRuns tests specified in the config file. 
It will run all configured test types (`toolHealthSuites` and `workflows`) by default. Use flags to run only specific types. Exits **0** on success or **1** on any failure – perfect for CI.\n\n### 3.2 `list`\n\nStatic inspection – prints workflows without starting the server. Handy when iterating on test coverage.\n\n---\n\n## 4. Configuration\n\nMCPVals loads **either** a `.json` file **or** a `.ts/.js` module that `export default` an object. Any string value in the config supports **Bash-style environment variable interpolation** `${VAR}`.\n\n### 4.1 `server`\n\nDefines how to connect to your MCP server.\n\n- `transport`: `stdio`, `shttp` (Streaming HTTP), or `sse` (Server-Sent Events).\n- `command`/`args`: (for `stdio`) The command to execute your server.\n- `env`: (for `stdio`) Environment variables to set for the child process.\n- `url`/`headers`: (for `shttp` and `sse`) The endpoint and headers for a remote server.\n- `reconnect`/`reconnectInterval`/`maxReconnectAttempts`: (for `sse`) Reconnection settings for SSE connections.\n\n**Example `shttp` with Authentication:**\n\n```json\n{\n  \"server\": {\n    \"transport\": \"shttp\",\n    \"url\": \"https://api.example.com/mcp\",\n    \"headers\": {\n      \"Authorization\": \"Bearer ${API_TOKEN}\",\n      \"X-API-Key\": \"${API_KEY}\"\n    }\n  }\n}\n```\n\n**Example `sse` with Reconnection:**\n\n```json\n{\n  \"server\": {\n    \"transport\": \"sse\",\n    \"url\": \"https://api.example.com/mcp/sse\",\n    \"headers\": {\n      \"Accept\": \"text/event-stream\",\n      \"Cache-Control\": \"no-cache\",\n      \"Authorization\": \"Bearer ${API_TOKEN}\"\n    },\n    \"reconnect\": true,\n    \"reconnectInterval\": 5000,\n    \"maxReconnectAttempts\": 10\n  }\n}\n```\n\n### 4.2 `toolHealthSuites[]`\n\nAn array of suites for testing tools directly. 
Each suite contains:\n\n- `name`: Identifier for the test suite.\n- `tests`: An array of individual tool tests.\n- `parallel`: (boolean) Whether to run tests in the suite in parallel (default: `false`).\n- `timeout`: (number) Override the global timeout for this suite.\n\n#### Tool Test Schema\n\n| Field            | Type      | Description                                                            |\n| ---------------- | --------- | ---------------------------------------------------------------------- |\n| `name`           | `string`  | Tool name to test (must match an available MCP tool).                  |\n| `description`    | `string`? | What this test validates.                                              |\n| `args`           | `object`  | Arguments to pass to the tool.                                         |\n| `expectedResult` | `any`?    | Expected result. Uses deep equality for objects, contains for strings. |\n| `expectedError`  | `string`? | Expected error message if the tool should fail.                        |\n| `maxLatency`     | `number`? | Maximum acceptable latency in milliseconds.                            |\n| `retries`        | `number`? | Retries on failure (0-5, default: 0).                                  |\n\n### 4.3 `workflows[]`\n\nAn array of LLM-driven test workflows. Each workflow contains:\n\n- `name`: Identifier for the workflow.\n- `steps`: An array of user interactions (usually just one for a high-level goal).\n- `expectTools`: An array of tool names expected to be called during the workflow.\n\n#### Workflow Step Schema\n\n| Field           | Type      | Description                                                                         |\n| --------------- | --------- | ----------------------------------------------------------------------------------- |\n| `user`          | `string`  | High-level user intent. The LLM will plan how to accomplish this.                   |\n| `expectedState` | `string`? 
| A sub-string the evaluator looks for in the final assistant message or tool result. |\n\n#### Workflow Best Practices\n\n1.  **Write natural prompts**: Instead of micro-managing tool calls, give the LLM a complete task (e.g., \"Book a flight from SF to NY for next Tuesday and then find a hotel near the airport.\").\n2.  **Use workflow-level `expectTools`**: List all tools you expect to be used across the entire workflow to verify the LLM's plan.\n\n### 4.4 Global Options\n\n- `timeout`: (number) Global timeout in ms for server startup and individual tool calls. Default: `30000`.\n- `llmJudge`: (boolean) Enables the LLM Judge feature. Default: `false`.\n- `openaiKey`: (string) OpenAI API key for the LLM Judge.\n- `judgeModel`: (string) The model to use for judging. Default: `\"gpt-4o\"`.\n- `passThreshold`: (number) The minimum score (0-1) from the LLM Judge to pass. Default: `0.8`.\n\n---\n\n## 5. Evaluation \u0026 Metrics\n\n### 5.1 Tool Health Metrics\n\nWhen running tool health tests, the following is assessed for each test:\n\n- **Result Correctness**: Does the output match `expectedResult`?\n- **Error Correctness**: If `expectedError` is set, did the tool fail with a matching error?\n- **Latency**: Did the tool respond within `maxLatency`?\n- **Success**: Did the tool call complete without unexpected errors?\n\n### 5.2 Workflow Metrics (Deterministic)\n\nFor each workflow, a trace of the LLM interaction is recorded and evaluated against 3 metrics:\n\n| #   | Metric                | Pass Criteria                                                               |\n| --- | --------------------- | --------------------------------------------------------------------------- |\n| 1   | End-to-End Success    | `expectedState` is found in the final response.                             |\n| 2   | Tool Invocation Order | The tools listed in `expectTools` were called in the exact order specified. 
|\n| 3   | Tool Call Health      | All tool calls completed successfully (no errors, HTTP 2xx, etc.).          |\n\nThe overall score is the arithmetic mean of the metric scores. The **evaluation fails** if _any_ metric fails.\n\n### 5.3 LLM Judge (Optional)\n\nAdd subjective grading when deterministic checks are not enough (e.g., checking tone or conversational quality).\n\n- Set `\"llmJudge\": true` in the config and provide an OpenAI key.\n- Use the `--llm-judge` CLI flag.\n\nThe judge asks the specified `judgeModel` for a score and a reason. A fourth metric, _LLM Judge_, is added to the workflow results; it passes if `score \u003e= passThreshold`.\n\n---\n\n## 6. Library API\n\nYou can run evaluations programmatically.\n\n```ts\nimport { evaluate } from \"mcpvals\";\n\nconst report = await evaluate(\"./mcp-eval.config.ts\", {\n  debug: process.env.CI === undefined,\n  reporter: \"json\",\n  llmJudge: true,\n});\n\nif (!report.passed) {\n  process.exit(1);\n}\n```\n\n## 7. Vitest Integration\n\nMCPVals provides a complete **Vitest integration** for writing MCP server tests using the popular Vitest testing framework. 
This integration offers both individual test utilities and comprehensive evaluation suites with built-in scoring and custom matchers.\n\n### 7.1 Quick Start\n\n```bash\n# Install vitest alongside mcpvals\npnpm add -D mcpvals vitest\n```\n\n```typescript\n// tests/calculator.test.ts\nimport { describe, it, expect, beforeAll, afterAll } from \"vitest\";\nimport {\n  setupMCPServer,\n  teardownMCPServer,\n  mcpTest,\n  describeEval,\n  ToolCallScorer,\n  LatencyScorer,\n  ContentScorer,\n} from \"mcpvals/vitest\";\n\ndescribe(\"Calculator MCP Server\", () =\u003e {\n  beforeAll(async () =\u003e {\n    await setupMCPServer({\n      transport: \"stdio\",\n      command: \"node\",\n      args: [\"./calculator-server.js\"],\n    });\n  });\n\n  afterAll(async () =\u003e {\n    await teardownMCPServer();\n  });\n\n  // Individual test\n  mcpTest(\"should add numbers\", async (utils) =\u003e {\n    const result = await utils.callTool(\"add\", { a: 5, b: 3 });\n    expect(result.content[0].text).toBe(\"8\");\n\n    // Custom matchers\n    await expect(result).toCallTool(\"add\");\n    await expect(result).toHaveLatencyBelow(1000);\n  });\n});\n```\n\n### 7.2 Core Functions\n\n#### **`setupMCPServer(config, options?)`**\n\nStarts an MCP server and returns utilities for testing.\n\n```typescript\nconst utils = await setupMCPServer(\n  {\n    transport: \"stdio\",\n    command: \"node\",\n    args: [\"./server.js\"],\n  },\n  {\n    timeout: 30000, // Server startup timeout\n    debug: false, // Enable debug logging\n  },\n);\n\n// Returns utility functions:\nutils.callTool(name, args); // Call MCP tools\nutils.runWorkflow(steps); // Execute LLM workflows\n```\n\n#### **`teardownMCPServer()`**\n\nCleanly shuts down the MCP server (call in `afterAll`).\n\n#### **`mcpTest(name, testFn, timeout?)`**\n\nConvenient wrapper for individual MCP tests.\n\n```typescript\nmcpTest(\n  \"tool test\",\n  async (utils) =\u003e {\n    const result = await utils.callTool(\"echo\", { message: 
\"hello\" });\n    expect(result).toBeDefined();\n  },\n  10000,\n); // Optional timeout\n```\n\n#### **`describeEval(config)`**\n\nComprehensive evaluation suite with automated scoring.\n\n```typescript\ndescribeEval({\n  name: \"Calculator Evaluation\",\n  server: { transport: \"stdio\", command: \"node\", args: [\"./calc.js\"] },\n  threshold: 0.8, // 80% score required to pass\n\n  data: async () =\u003e [\n    {\n      input: { operation: \"add\", a: 5, b: 3 },\n      expected: { result: \"8\", tools: [\"add\"] },\n    },\n  ],\n\n  task: async (input, context) =\u003e {\n    const startTime = Date.now(); // Start timing before the tool call\n    const result = await context.utils.callTool(input.operation, {\n      a: input.a,\n      b: input.b,\n    });\n    return {\n      result: result.content[0].text,\n      toolCalls: [{ name: input.operation }],\n      latency: Date.now() - startTime,\n    };\n  },\n\n  scorers: [\n    new ToolCallScorer({ expectedOrder: true }),\n    new LatencyScorer({ maxLatencyMs: 1000 }),\n    new ContentScorer({ patterns: [/\\d+/] }),\n  ],\n});\n```\n\n### 7.3 Built-in Scorers\n\nScorers automatically evaluate different aspects of MCP server behavior, returning scores from 0 to 1.\n\n#### **`ToolCallScorer`** - Tool Usage Evaluation\n\n```typescript\nnew ToolCallScorer({\n  expectedTools: [\"add\", \"multiply\"], // Tools that should be called\n  expectedOrder: true, // Whether order matters\n  allowExtraTools: false, // Penalize unexpected tools\n});\n```\n\n**Scoring Algorithm:**\n\n- 70% for calling expected tools\n- 20% for correct order (if `expectedOrder` is `true`)\n- 10% penalty for extra tools (if `allowExtraTools` is `false`)\n\n#### **`LatencyScorer`** - Performance Evaluation\n\n```typescript\nnew LatencyScorer({\n  maxLatencyMs: 1000, // Maximum acceptable latency\n  penaltyThreshold: 500, // Start penalizing after this\n});\n```\n\n**Scoring Logic:**\n\n- Perfect score (1.0) for latency ≤ `penaltyThreshold`\n- Linear penalty between `penaltyThreshold` and `maxLatencyMs`\n- Severe penalty (0.1) for exceeding `maxLatencyMs`\n- Perfect score for 0ms latency\n\n#### 
**`WorkflowScorer`** - Workflow Success Evaluation\n\n```typescript\nnew WorkflowScorer({\n  requireSuccess: true, // Must have success: true\n  checkMessages: true, // Validate message structure\n  minMessages: 2, // Minimum message count\n});\n```\n\n#### **`ContentScorer`** - Output Quality Assessment\n\n```typescript\nnew ContentScorer({\n  exactMatch: false, // Exact content matching\n  caseSensitive: false, // Case sensitivity\n  patterns: [/\\d+/, /success/], // RegExp patterns to match\n  requiredKeywords: [\"result\"], // Must contain these\n  forbiddenKeywords: [\"error\", \"fail\"], // Penalize these\n});\n```\n\n**Multi-dimensional Scoring:**\n\n- 40% pattern matching\n- 40% required keywords\n- -20% forbidden keywords penalty\n- 20% content relevance\n\n### 7.4 Custom Matchers\n\nMCPVals extends Vitest with MCP-specific assertion matchers:\n\n```typescript\n// Tool call assertions\nawait expect(result).toCallTool(\"add\");\nawait expect(result).toCallTools([\"add\", \"multiply\"]);\nawait expect(result).toHaveToolCallOrder([\"add\", \"multiply\"]);\n\n// Workflow assertions\nawait expect(workflow).toHaveSuccessfulWorkflow();\n\n// Performance assertions\nawait expect(result).toHaveLatencyBelow(1000);\n\n// Content assertions\nawait expect(result).toContainKeywords([\"success\", \"complete\"]);\nawait expect(result).toMatchPattern(/result: \\d+/);\n```\n\n**Smart Content Extraction**: Matchers automatically handle various output formats:\n\n- MCP server responses (`content[0].text`)\n- Custom result objects (`{ result, toolCalls, latency }`)\n- String outputs\n- Workflow results (`{ success, messages, toolCalls }`)\n\n### 7.5 TypeScript Support\n\nComplete type safety with concrete types for common use cases:\n\n```typescript\nimport type {\n  MCPTestConfig,\n  MCPTestContext,\n  ToolCallTestCase,\n  MCPToolResult,\n  MCPWorkflowResult,\n  ToolCallScorerOptions,\n  LatencyScorerOptions,\n  ContentScorerOptions,\n  WorkflowScorerOptions,\n} from 
\"mcpvals/vitest\";\n\n// Typed test case\nconst testCase: ToolCallTestCase = {\n  input: { operation: \"add\", a: 5, b: 3 },\n  expected: { result: \"8\", tools: [\"add\"] },\n};\n\n// Typed scorer options\nconst scorer = new ToolCallScorer({\n  expectedOrder: true,\n  allowExtraTools: false,\n} satisfies ToolCallScorerOptions);\n\n// Typed task function\ntask: async (input, context): Promise\u003cMCPToolResult\u003e =\u003e {\n  const testCase = input as ToolCallTestCase[\"input\"];\n  const result = await context.utils.callTool(testCase.operation, {\n    a: testCase.a,\n    b: testCase.b,\n  });\n  return {\n    result: result.content[0].text,\n    toolCalls: [{ name: testCase.operation }],\n    success: true,\n    latency: Date.now() - startTime,\n  };\n};\n```\n\n### 7.6 Advanced Usage\n\n#### **Dynamic Test Generation**\n\n```typescript\ndescribeEval({\n  name: \"Dynamic Calculator Tests\",\n  data: async () =\u003e {\n    const operations = [\"add\", \"subtract\", \"multiply\", \"divide\"];\n    return operations.map((op) =\u003e ({\n      name: `Test ${op}`,\n      input: { operation: op, a: 10, b: 2 },\n      expected: { tools: [op] },\n    }));\n  },\n});\n```\n\n#### **Debug Mode**\n\n```bash\n# Enable detailed logging\nVITEST_MCP_DEBUG=true vitest run\n\n# Shows:\n# - Individual test scores and explanations\n# - Performance metrics\n# - Pass/fail reasons\n# - Server lifecycle events\n```\n\n### 7.7 Integration Patterns\n\n#### **Unit Testing Individual Tools**\n\n```typescript\ndescribe(\"Individual Tool Tests\", () =\u003e {\n  beforeAll(() =\u003e setupMCPServer(config));\n  afterAll(() =\u003e teardownMCPServer());\n\n  mcpTest(\"calculator addition\", async (utils) =\u003e {\n    const result = await utils.callTool(\"add\", { a: 2, b: 3 });\n    expect(result.content[0].text).toBe(\"5\");\n  });\n\n  mcpTest(\"error handling\", async (utils) =\u003e {\n    try {\n      await utils.callTool(\"divide\", { a: 10, b: 0 });\n      throw new 
Error(\"Should have failed\");\n    } catch (error) {\n      expect(error.message).toContain(\"division by zero\");\n    }\n  });\n});\n```\n\n#### **Integration Testing with Workflows**\n\n```typescript\nmcpTest(\"complex workflow\", async (utils) =\u003e {\n  const workflow = await utils.runWorkflow([\n    {\n      user: \"Calculate 2+3 then multiply by 4\",\n      expectTools: [\"add\", \"multiply\"],\n    },\n  ]);\n\n  await expect(workflow).toHaveSuccessfulWorkflow();\n  await expect(workflow).toCallTools([\"add\", \"multiply\"]);\n  expect(workflow.messages).toHaveLength(2);\n});\n```\n\n#### **Performance Benchmarking**\n\n```typescript\ndescribeEval({\n  name: \"Performance Benchmarks\",\n  threshold: 0.9, // High threshold for performance tests\n  scorers: [\n    new LatencyScorer({\n      maxLatencyMs: 100, // Strict latency requirement\n      penaltyThreshold: 50,\n    }),\n    new ToolCallScorer({ allowExtraTools: false }), // No unnecessary calls\n    new ContentScorer({ patterns: [/^\\d+$/] }), // Validate output format\n  ],\n});\n```\n\n#### **Multi-Server Testing**\n\n```typescript\ndescribe(\"Multi-Server Comparison\", () =\u003e {\n  const servers = [\n    { name: \"Server A\", command: \"./server-a.js\" },\n    { name: \"Server B\", command: \"./server-b.js\" },\n  ];\n\n  servers.forEach((server) =\u003e {\n    describe(server.name, () =\u003e {\n      beforeAll(() =\u003e\n        setupMCPServer({\n          transport: \"stdio\",\n          command: \"node\",\n          args: [server.command],\n        }),\n      );\n      afterAll(() =\u003e teardownMCPServer());\n\n      mcpTest(\"standard test\", async (utils) =\u003e {\n        const result = await utils.callTool(\"test\", {});\n        expect(result).toBeDefined();\n      });\n    });\n  });\n});\n```\n\n### 7.8 Best Practices\n\n1. **Use `beforeAll`/`afterAll`**: Always properly setup and teardown MCP servers\n2. 
**Leverage TypeScript**: Use concrete types for better development experience\n3. **Test individual tools first**: Use `mcpTest` for unit testing, `describeEval` for integration\n4. **Set appropriate thresholds**: Start with 0.8, adjust based on your quality requirements\n5. **Combine scorers**: Use multiple scorers to evaluate different aspects (functionality, performance, content)\n6. **Enable debug mode**: Use `VITEST_MCP_DEBUG=true` when troubleshooting\n7. **Write realistic test data**: Create test cases that reflect real-world usage\n8. **Use custom matchers**: Leverage MCP-specific matchers for readable assertions\n\n### 7.9 Example: Complete Test Suite\n\n```typescript\nimport { describe, it, expect, beforeAll, afterAll } from \"vitest\";\nimport {\n  setupMCPServer,\n  teardownMCPServer,\n  mcpTest,\n  describeEval,\n  ToolCallScorer,\n  WorkflowScorer,\n  LatencyScorer,\n  ContentScorer,\n  type ToolCallTestCase,\n  type MCPToolResult,\n} from \"mcpvals/vitest\";\n\ndescribe(\"Production Calculator Server\", () =\u003e {\n  beforeAll(async () =\u003e {\n    await setupMCPServer(\n      {\n        transport: \"stdio\",\n        command: \"node\",\n        args: [\"./dist/calculator-server.js\"],\n      },\n      {\n        timeout: 10000,\n        debug: process.env.CI !== \"true\",\n      },\n    );\n  });\n\n  afterAll(async () =\u003e {\n    await teardownMCPServer();\n  });\n\n  // Unit tests for individual operations\n  mcpTest(\"addition works correctly\", async (utils) =\u003e {\n    const result = await utils.callTool(\"add\", { a: 5, b: 3 });\n    expect(result.content[0].text).toBe(\"8\");\n    await expect(result).toCallTool(\"add\");\n    await expect(result).toHaveLatencyBelow(100);\n  });\n\n  mcpTest(\"handles division by zero\", async (utils) =\u003e {\n    try {\n      await utils.callTool(\"divide\", { a: 10, b: 0 });\n      throw new Error(\"Expected division by zero error\");\n    } catch (error) {\n      
expect(error.message).toContain(\"division by zero\");\n    }\n  });\n\n  // Comprehensive evaluation suite\n  describeEval({\n    name: \"Calculator Performance Suite\",\n    server: {\n      transport: \"stdio\",\n      command: \"node\",\n      args: [\"./dist/calculator-server.js\"],\n    },\n    threshold: 0.85,\n    timeout: 30000,\n\n    data: async (): Promise\u003cToolCallTestCase[]\u003e =\u003e [\n      {\n        name: \"Basic Addition\",\n        input: { operation: \"add\", a: 10, b: 5 },\n        expected: { result: \"15\", tools: [\"add\"] },\n      },\n      {\n        name: \"Complex Multiplication\",\n        input: { operation: \"multiply\", a: 7, b: 8 },\n        expected: { result: \"56\", tools: [\"multiply\"] },\n      },\n      {\n        name: \"Subtraction Test\",\n        input: { operation: \"subtract\", a: 20, b: 8 },\n        expected: { result: \"12\", tools: [\"subtract\"] },\n      },\n    ],\n\n    task: async (input, context): Promise\u003cMCPToolResult\u003e =\u003e {\n      const testCase = input as ToolCallTestCase[\"input\"];\n      const startTime = Date.now();\n\n      try {\n        const result = await context.utils.callTool(testCase.operation, {\n          a: testCase.a,\n          b: testCase.b,\n        });\n\n        return {\n          result: result.content[0].text,\n          toolCalls: [{ name: testCase.operation }],\n          success: true,\n          latency: Date.now() - startTime,\n          executionTime: Date.now() - startTime,\n        };\n      } catch (error) {\n        return {\n          result: null,\n          toolCalls: [],\n          success: false,\n          error: error.message,\n          latency: Date.now() - startTime,\n          executionTime: Date.now() - startTime,\n        };\n      }\n    },\n\n    scorers: [\n      new ToolCallScorer({\n        expectedOrder: true,\n        allowExtraTools: false,\n      }),\n      new WorkflowScorer({\n        requireSuccess: true,\n        
checkMessages: false,\n      }),\n      new LatencyScorer({\n        maxLatencyMs: 500,\n        penaltyThreshold: 200,\n      }),\n      new ContentScorer({\n        exactMatch: false,\n        caseSensitive: false,\n        patterns: [/^\\d+$/], // Results should be numbers\n      }),\n    ],\n  });\n\n  // Integration test with workflows\n  mcpTest(\"multi-step calculation workflow\", async (utils) =\u003e {\n    const workflow = await utils.runWorkflow([\n      {\n        user: \"Calculate 5 plus 3, then multiply the result by 2\",\n        expectTools: [\"add\", \"multiply\"],\n      },\n    ]);\n\n    await expect(workflow).toHaveSuccessfulWorkflow();\n    await expect(workflow).toCallTools([\"add\", \"multiply\"]);\n    await expect(workflow).toHaveToolCallOrder([\"add\", \"multiply\"]);\n\n    // Verify final result\n    const finalMessage = workflow.messages[workflow.messages.length - 1];\n    expect(finalMessage.content).toContain(\"16\");\n  });\n});\n```\n\n**Run the tests:**\n\n```bash\n# Run all tests\nvitest run\n\n# Run with debug output\nVITEST_MCP_DEBUG=true vitest run\n\n# Run in watch mode during development\nvitest\n\n# Generate coverage report\nvitest run --coverage\n```\n\nThis Vitest integration makes MCP server testing **accessible, automated, and reliable** - combining the speed and developer experience of Vitest with specialized tools for comprehensive MCP server evaluation.\n\n---\n\n## 8. Extensibility \u0026 Troubleshooting\n\n- **Custom Reporters**: Import `ConsoleReporter` for reference and implement your own `.report()` method.\n- **Server Hangs**: Increase the `timeout` value in your config. 
Ensure your server writes MCP messages to `stdout`.\n- **LLM Judge Fails**: Use `--debug` to inspect the raw model output for malformed JSON.\n\n---\n\n## 9. Complete Example Configuration\n\nHere's a comprehensive example showcasing all evaluation types:\n\n```typescript\nimport type { Config } from \"mcpvals\";\n\nexport default {\n  server: {\n    transport: \"stdio\", // Also supports \"shttp\" and \"sse\"\n    command: \"node\",\n    args: [\"./my-mcp-server.js\"],\n  },\n\n  // Alternative SSE server configuration:\n  // server: {\n  //   transport: \"sse\",\n  //   url: \"https://api.example.com/mcp/sse\",\n  //   headers: {\n  //     \"Accept\": \"text/event-stream\",\n  //     \"Cache-Control\": \"no-cache\",\n  //     \"Authorization\": \"Bearer ${API_TOKEN}\"\n  //   },\n  //   reconnect: true,\n  //   reconnectInterval: 5000,\n  //   maxReconnectAttempts: 10\n  // },\n\n  // Test tools\n  toolHealthSuites: [\n    {\n      name: \"Core Functions\",\n      tests: [\n        { name: \"add\", args: { a: 5, b: 3 }, expectedResult: 8 },\n        {\n          name: \"divide\",\n          args: { a: 10, b: 0 },\n          expectedError: \"division by zero\",\n        },\n      ],\n    },\n  ],\n\n  // Test workflows\n  workflows: [\n    {\n      name: \"Complete Workflow\",\n      steps: [{ user: \"Process user data and generate a report\" }],\n      expectTools: [\"fetch-data\", \"process\", \"generate-report\"],\n    },\n  ],\n\n  llmJudge: true,\n  openaiKey: process.env.OPENAI_API_KEY,\n  timeout: 30000,\n} satisfies Config;\n```\n\n---\n\n## 10. Acknowledgements\n\n- [Model Context Protocol](https://modelcontextprotoco.lol) – for the SDK\n- [Vercel AI SDK](https://sdk.vercel.ai) – for LLM integration\n- [chalk](https://github.com/chalk/chalk) – for terminal colors\n\nEnjoy testing your MCP servers – PRs, issues \u0026 feedback welcome! 
✨\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkylejeong2%2Fmcpvals","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkylejeong2%2Fmcpvals","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkylejeong2%2Fmcpvals/lists"}