{"id":39712536,"url":"https://github.com/greynewell/mcpbr","last_synced_at":"2026-02-12T21:14:51.035Z","repository":{"id":333215412,"uuid":"1136086369","full_name":"greynewell/mcpbr","owner":"greynewell","description":"Evaluate MCP servers with Model Context Protocol Benchmark Runner","archived":false,"fork":false,"pushed_at":"2026-01-31T18:28:21.000Z","size":8455,"stargazers_count":20,"open_issues_count":201,"forks_count":9,"subscribers_count":3,"default_branch":"main","last_synced_at":"2026-02-01T05:18:28.426Z","etag":null,"topics":["ai-tools","anthropic","benchmarking","benchmarks","claude-code","claude-code-plugin","claude-code-skills","cli","cybergym","evaluations","llm-agents","mcp-server","swe-bench"],"latest_commit_sha":null,"homepage":"https://greynewell.github.io/mcpbr/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/greynewell.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":"AGENTS.md","dco":null,"cla":null}},"created_at":"2026-01-17T03:27:24.000Z","updated_at":"2026-01-31T18:46:56.000Z","dependencies_parsed_at":"2026-01-24T02:01:59.402Z","dependency_job_id":null,"html_url":"https://github.com/greynewell/mcpbr","commit_stats":null,"previous_names":["greynewell/mcpbr"],"tags_count":38,"template":false,"template_full_name":null,"purl":"pkg:github/greynewell/mcpbr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greynewell%2Fmcpbr","tags_url":"https://repos.ecosyste.ms/ap
i/v1/hosts/GitHub/repositories/greynewell%2Fmcpbr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greynewell%2Fmcpbr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greynewell%2Fmcpbr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/greynewell","download_url":"https://codeload.github.com/greynewell/mcpbr/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/greynewell%2Fmcpbr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28984814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-01T17:52:09.146Z","status":"ssl_error","status_checked_at":"2026-02-01T17:49:53.529Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-tools","anthropic","benchmarking","benchmarks","claude-code","claude-code-plugin","claude-code-skills","cli","cybergym","evaluations","llm-agents","mcp-server","swe-bench"],"created_at":"2026-01-18T10:39:41.906Z","updated_at":"2026-02-06T00:03:00.067Z","avatar_url":"https://github.com/greynewell.png","language":"Python","funding_links":[],"categories":["Tools","Testing Tools"],"sub_categories":["Community","Common Lisp"],"readme":"# mcpbr\n\n```bash\n# One-liner install (installs + runs quick test)\ncurl -sSL 
https://raw.githubusercontent.com/greynewell/mcpbr/main/install.sh | bash\n\n# Or install and run manually\npip install mcpbr \u0026\u0026 mcpbr run -n 1\n```\n\nBenchmark your MCP server against real GitHub issues. One command, hard numbers.\n\n---\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-logo.jpg\" alt=\"MCPBR Logo\" width=\"400\"\u003e\n\u003c/p\u003e\n\n**Model Context Protocol Benchmark Runner**\n\n[![PyPI version](https://badge.fury.io/py/mcpbr.svg)](https://pypi.org/project/mcpbr/)\n[![npm version](https://badge.fury.io/js/mcpbr-cli.svg)](https://www.npmjs.com/package/mcpbr-cli)\n[![Python 3.11+](https://img.shields.io/badge/python-3.11+-blue.svg)](https://www.python.org/downloads/)\n[![CI](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml/badge.svg)](https://github.com/greynewell/mcpbr/actions/workflows/ci.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n[![Documentation](https://img.shields.io/badge/docs-greynewell.github.io%2Fmcpbr-blue)](https://greynewell.github.io/mcpbr/)\n![CodeRabbit Pull Request Reviews](https://img.shields.io/coderabbit/prs/github/greynewell/mcpbr?utm_source=oss\u0026utm_medium=github\u0026utm_campaign=greynewell%2Fmcpbr\u0026labelColor=171717\u0026color=FF570A\u0026link=https%3A%2F%2Fcoderabbit.ai\u0026label=CodeRabbit+Reviews)\n\n[![good first issues](https://img.shields.io/github/issues/greynewell/mcpbr/good%20first%20issue?label=good%20first%20issues\u0026color=7057ff)](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)\n[![help wanted](https://img.shields.io/github/issues/greynewell/mcpbr/help%20wanted?label=help%20wanted\u0026color=008672)](https://github.com/greynewell/mcpbr/labels/help%20wanted)\n[![roadmap](https://img.shields.io/badge/roadmap-200%2B%20features-blue)](https://github.com/users/greynewell/projects/2)\n\n\u003e Stop guessing if your MCP 
server actually helps. Get hard numbers comparing tool-assisted vs. baseline agent performance on real GitHub issues.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-demo.gif\" alt=\"mcpbr in action\" width=\"700\"\u003e\n\u003c/p\u003e\n\n## What You Get\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://raw.githubusercontent.com/greynewell/mcpbr/main/assets/mcpbr-eval-results.png\" alt=\"MCPBR Evaluation Results\" width=\"600\"\u003e\n\u003c/p\u003e\n\nReal metrics showing whether your MCP server improves agent performance on SWE-bench tasks. No vibes, just data.\n\n## Why mcpbr?\n\nMCP servers promise to make LLMs better at coding tasks. But how do you *prove* it?\n\nmcpbr runs controlled experiments: same model, same tasks, same environment - the only variable is your MCP server. You get:\n\n- **Apples-to-apples comparison** against a baseline agent\n- **Real GitHub issues** from SWE-bench (not toy examples)\n- **Reproducible results** via Docker containers with pinned dependencies\n\n## Supported Benchmarks\n\nmcpbr supports 30+ benchmarks across 10 categories through a flexible abstraction layer:\n\n| Category | Benchmarks |\n|----------|-----------|\n| **Software Engineering** | [SWE-bench](https://greynewell.github.io/mcpbr/benchmarks/swe-bench/) (Verified/Lite/Full), [APPS](https://greynewell.github.io/mcpbr/benchmarks/apps/), [CodeContests](https://greynewell.github.io/mcpbr/benchmarks/codecontests/), [BigCodeBench](https://greynewell.github.io/mcpbr/benchmarks/bigcodebench/), [LeetCode](https://greynewell.github.io/mcpbr/benchmarks/leetcode/), [CoderEval](https://greynewell.github.io/mcpbr/benchmarks/codereval/), [Aider Polyglot](https://greynewell.github.io/mcpbr/benchmarks/aider-polyglot/) |\n| **Code Generation** | [HumanEval](https://greynewell.github.io/mcpbr/benchmarks/humaneval/), [MBPP](https://greynewell.github.io/mcpbr/benchmarks/mbpp/) |\n| **Math \u0026 
Reasoning** | [GSM8K](https://greynewell.github.io/mcpbr/benchmarks/gsm8k/), [MATH](https://greynewell.github.io/mcpbr/benchmarks/math/), [BigBench-Hard](https://greynewell.github.io/mcpbr/benchmarks/bigbench-hard/) |\n| **Knowledge \u0026 QA** | [TruthfulQA](https://greynewell.github.io/mcpbr/benchmarks/truthfulqa/), [HellaSwag](https://greynewell.github.io/mcpbr/benchmarks/hellaswag/), [ARC](https://greynewell.github.io/mcpbr/benchmarks/arc/), [GAIA](https://greynewell.github.io/mcpbr/benchmarks/gaia/) |\n| **Tool Use \u0026 Agents** | [MCPToolBench++](https://greynewell.github.io/mcpbr/benchmarks/mcptoolbench/), [ToolBench](https://greynewell.github.io/mcpbr/benchmarks/toolbench/), [AgentBench](https://greynewell.github.io/mcpbr/benchmarks/agentbench/), [WebArena](https://greynewell.github.io/mcpbr/benchmarks/webarena/), [TerminalBench](https://greynewell.github.io/mcpbr/benchmarks/terminalbench/), [InterCode](https://greynewell.github.io/mcpbr/benchmarks/intercode/) |\n| **ML Research** | [MLAgentBench](https://greynewell.github.io/mcpbr/benchmarks/mlagentbench/) |\n| **Code Understanding** | [RepoQA](https://greynewell.github.io/mcpbr/benchmarks/repoqa/) |\n| **Multimodal** | MMMU |\n| **Long Context** | LongBench |\n| **Safety \u0026 Adversarial** | Adversarial (HarmBench) |\n| **Security** | [CyberGym](https://greynewell.github.io/mcpbr/benchmarks/cybergym/) |\n| **Custom** | User-defined benchmarks via YAML |\n\n### Featured Benchmarks\n\n**SWE-bench** (Default) - Real GitHub issues requiring bug fixes. Three variants: Verified (500 manually validated), Lite (300 curated), and Full (2,294 complete). Pre-built Docker images available.\n\n**CyberGym** - Security vulnerabilities requiring PoC exploits. 4 difficulty levels controlling context. Uses AddressSanitizer for crash detection.\n\n**MCPToolBench++** - Large-scale MCP tool use evaluation across 45+ categories. 
Tests tool discovery, selection, invocation, and result interpretation.\n\n**GSM8K** - Grade-school math word problems testing chain-of-thought reasoning with numeric answer matching.\n\n```bash\n# Run SWE-bench Verified (default)\nmcpbr run -c config.yaml\n\n# Run any benchmark\nmcpbr run -c config.yaml --benchmark humaneval -n 20\nmcpbr run -c config.yaml --benchmark gsm8k -n 50\nmcpbr run -c config.yaml --benchmark cybergym --level 2\n\n# List all available benchmarks\nmcpbr benchmarks\n```\n\nSee the **[benchmarks guide](https://greynewell.github.io/mcpbr/benchmarks/)** for details on each benchmark and how to configure them.\n\n## Overview\n\nThis harness runs two parallel evaluations for each task:\n\n1. **MCP Agent**: LLM with access to tools from your MCP server\n2. **Baseline Agent**: LLM without tools (single-shot generation)\n\nBy comparing these, you can measure the effectiveness of your MCP server for different software engineering tasks. See the **[MCP integration guide](https://greynewell.github.io/mcpbr/mcp-integration/)** for tips on testing your server.\n\n## Regression Detection\n\nmcpbr includes built-in regression detection to catch performance degradations between MCP server versions:\n\n### Key Features\n\n- **Automatic Detection**: Compare current results against a baseline to identify regressions\n- **Detailed Reports**: See exactly which tasks regressed and which improved\n- **Threshold-Based Exit Codes**: Fail CI/CD pipelines when regression rate exceeds acceptable limits\n- **Multi-Channel Alerts**: Send notifications via Slack, Discord, or email\n\n### How It Works\n\nA regression is detected when a task that passed in the baseline now fails in the current run. 
This helps you catch issues before deploying new versions of your MCP server.\n\n```bash\n# First, run a baseline evaluation and save results\nmcpbr run -c config.yaml -o baseline.json\n\n# Later, compare a new version against the baseline\nmcpbr run -c config.yaml --baseline-results baseline.json --regression-threshold 0.1\n\n# With notifications\nmcpbr run -c config.yaml --baseline-results baseline.json \\\n  --regression-threshold 0.1 \\\n  --slack-webhook https://hooks.slack.com/services/YOUR/WEBHOOK/URL\n```\n\n### Use Cases\n\n- **CI/CD Integration**: Automatically detect regressions in pull requests\n- **Version Comparison**: Compare different versions of your MCP server\n- **Performance Monitoring**: Track MCP server performance over time\n- **Team Notifications**: Alert your team when regressions are detected\n\n### Example Output\n\n```\n======================================================================\nREGRESSION DETECTION REPORT\n======================================================================\n\nTotal tasks compared: 25\nRegressions detected: 2\nImprovements detected: 5\nRegression rate: 8.0%\n\nREGRESSIONS (previously passed, now failed):\n----------------------------------------------------------------------\n  - django__django-11099\n    Error: Timeout\n  - sympy__sympy-18087\n    Error: Test suite failed\n\nIMPROVEMENTS (previously failed, now passed):\n----------------------------------------------------------------------\n  - astropy__astropy-12907\n  - pytest-dev__pytest-7373\n  - scikit-learn__scikit-learn-25570\n  - matplotlib__matplotlib-23913\n  - requests__requests-3362\n\n======================================================================\n```\n\nFor CI/CD integration, use `--regression-threshold` to fail the build when regressions exceed an acceptable rate:\n\n```yaml\n# .github/workflows/test-mcp.yml\n- name: Run mcpbr with regression detection\n  run: |\n    mcpbr run -c config.yaml \\\n      --baseline-results 
baseline.json \\\n      --regression-threshold 0.1 \\\n      -o current.json\n```\n\nThis will exit with code 1 if the regression rate exceeds 10%, failing the CI job.\n\n## Installation\n\n\u003e **[Full installation guide](https://greynewell.github.io/mcpbr/installation/)** with detailed setup instructions.\n\n\u003cdetails\u003e\n\u003csummary\u003ePrerequisites\u003c/summary\u003e\n\n- Python 3.11+\n- Docker (running)\n- `ANTHROPIC_API_KEY` environment variable\n- Claude Code CLI (`claude`) installed\n- Network access (for pulling Docker images and API calls)\n\n**Supported Models (aliases or full names):**\n- Claude Opus 4.5: `opus` or `claude-opus-4-5-20251101`\n- Claude Sonnet 4.5: `sonnet` or `claude-sonnet-4-5-20250929`\n- Claude Haiku 4.5: `haiku` or `claude-haiku-4-5-20251001`\n\nRun `mcpbr models` to see the full list.\n\n\u003c/details\u003e\n\n### via npm\n\n[![npm package](https://img.shields.io/npm/v/mcpbr-cli.svg)](https://www.npmjs.com/package/mcpbr-cli)\n\n```bash\n# Run with npx (no installation)\nnpx mcpbr-cli run -c config.yaml\n\n# Or install globally\nnpm install -g mcpbr-cli\nmcpbr run -c config.yaml\n```\n\n\u003e **Package**: [`mcpbr-cli`](https://www.npmjs.com/package/mcpbr-cli) on npm\n\u003e\n\u003e **Note**: The npm package requires Python 3.11+ and the mcpbr Python package (`pip install mcpbr`)\n\n### via pip\n\n```bash\n# Install from PyPI\npip install mcpbr\n\n# Or install from source\ngit clone https://github.com/greynewell/mcpbr.git\ncd mcpbr\npip install -e .\n\n# Or with uv\nuv pip install -e .\n```\n\n\u003e **Note for Apple Silicon users**: The harness automatically uses x86_64 Docker images via emulation. 
This may be slower than native ARM64 images but ensures compatibility with all SWE-bench tasks.\n\n## Quick Start\n\n### Option 1: Use Example Configurations (Recommended)\n\nGet started in seconds with our example configurations:\n\n```bash\n# Set your API key\nexport ANTHROPIC_API_KEY=\"your-api-key\"\n\n# Run your first evaluation using an example config\nmcpbr run -c examples/quick-start/getting-started.yaml -v\n```\n\nThis runs 5 SWE-bench tasks with the filesystem server. Expected runtime: 15-30 minutes, cost: $2-5.\n\n**Explore 25+ example configurations** in the [`examples/`](examples/) directory:\n- **Quick Start**: Getting started, testing servers, comparing models\n- **Benchmarks**: SWE-bench Lite/Full, CyberGym basic/advanced\n- **MCP Servers**: Filesystem, GitHub, Brave Search, databases, custom servers\n- **Scenarios**: Cost-optimized, performance-optimized, CI/CD, regression detection\n\nSee the **[Examples README](examples/README.md)** for the complete guide.\n\n### Option 2: Generate Custom Configuration\n\n1. **Set your API key:**\n\n```bash\nexport ANTHROPIC_API_KEY=\"your-api-key\"\n```\n\n2. **Run mcpbr (config auto-created if missing):**\n\n```bash\n# Config is auto-created on first run\nmcpbr run -n 1\n\n# Or explicitly generate a config file first\nmcpbr init\n```\n\n3. **Edit the configuration** to point to your MCP server:\n\n```yaml\nmcp_server:\n  command: \"npx\"\n  args:\n    - \"-y\"\n    - \"@modelcontextprotocol/server-filesystem\"\n    - \"{workdir}\"\n  env: {}\n\nprovider: \"anthropic\"\nagent_harness: \"claude-code\"\n\nmodel: \"sonnet\"  # or full name: \"claude-sonnet-4-5-20250929\"\ndataset: \"SWE-bench/SWE-bench_Lite\"\nsample_size: 10\ntimeout_seconds: 300\nmax_concurrent: 4\n\n# Optional: disable default logging (logs are saved to output_dir/logs/ by default)\n# disable_logs: true\n```\n\n4. 
**Run the evaluation:**\n\n```bash\nmcpbr run --config config.yaml\n```\n\n## Infrastructure Modes\n\nmcpbr supports running evaluations on different infrastructure platforms, allowing you to scale evaluations or offload compute-intensive tasks to cloud VMs.\n\n### Local (Default)\n\nRun evaluations on your local machine:\n\n```yaml\ninfrastructure:\n  mode: local  # default\n```\n\nThis is the default mode - evaluations run directly on your machine using local Docker containers.\n\n### Azure VM\n\nRun evaluations on Azure Virtual Machines with automatic provisioning and cleanup:\n\n```yaml\ninfrastructure:\n  mode: azure\n  azure:\n    resource_group: mcpbr-benchmarks\n    location: eastus\n    cpu_cores: 10\n    memory_gb: 40\n```\n\n**Key features:**\n- Zero manual VM setup - provisioned automatically from config\n- Automatic Docker, Python, and mcpbr installation\n- Test task validation before full evaluation\n- Auto-cleanup after completion (configurable)\n- Cost-optimized with automatic VM deletion\n\n**Example usage:**\n```bash\n# Run evaluation on Azure VM\nmcpbr run -c azure-config.yaml\n\n# VM is automatically created, evaluation runs, results are downloaded, VM is deleted\n```\n\nSee [docs/infrastructure/azure.md](docs/infrastructure/azure.md) for full documentation including:\n- Prerequisites and authentication\n- VM sizing and cost estimation\n- Debugging with `preserve_on_error`\n- Troubleshooting guide\n\n## Side-by-Side Server Comparison\n\nCompare two MCP servers head-to-head in a single evaluation run to see which implementation performs better.\n\n### Quick Example\n\n```yaml\n# comparison-config.yaml\ncomparison_mode: true\n\nmcp_server_a:\n  name: \"Task Queries\"\n  command: node\n  args: [build/index.js]\n  cwd: /path/to/task-queries\n\nmcp_server_b:\n  name: \"Edge Identity\"\n  command: node\n  args: [build/index.js]\n  cwd: /path/to/edge-identity\n\nbenchmark: swe-bench-lite\nsample_size: 10\n```\n\n```bash\nmcpbr run -c 
comparison-config.yaml -o results.json\n```\n\n### Results Output\n\n```text\nSide-by-Side MCP Server Comparison\n\n┏━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━┓\n┃ Metric            ┃ Task Queries ┃ Edge Identity┃ Δ (A - B)┃\n┡━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━┩\n│ Resolved Tasks    │ 4/10         │ 2/10         │ +2       │\n│ Resolution Rate   │ 40.0%        │ 20.0%        │ +100.0%  │\n└───────────────────┴──────────────┴──────────────┴──────────┘\n\n✓ Task Queries unique wins: 2 tasks\n  - django__django-12286\n  - astropy__astropy-7606\n```\n\n**Use cases:**\n- **A/B testing**: Compare optimized vs. baseline implementations\n- **Tool evaluation**: Test different MCP tool sets\n- **Version comparison**: Benchmark v2.0 vs. v1.5\n\nSee [docs/comparison-mode.md](docs/comparison-mode.md) for complete documentation.\n\n## Claude Code Integration\n\n[![Claude Code Ready](https://img.shields.io/badge/Claude_Code-Ready-5865F2?style=flat\u0026logo=anthropic)](https://claude.ai/download)\n\nmcpbr includes a built-in Claude Code plugin that makes Claude an expert at running benchmarks correctly. 
The plugin provides specialized skills and knowledge about mcpbr configuration, execution, and troubleshooting.\n\n### Installation Options\n\nYou have three ways to enable the mcpbr plugin in Claude Code:\n\n#### Option 1: Clone Repository (Automatic Detection)\n\nWhen you clone this repository, Claude Code automatically detects and loads the plugin:\n\n```bash\ngit clone https://github.com/greynewell/mcpbr.git\ncd mcpbr\n\n# Plugin is now active - try asking Claude:\n# \"Run the SWE-bench Lite eval with 5 tasks\"\n```\n\n**Best for**: Contributors, developers testing changes, or users who want the latest unreleased features.\n\n#### Option 2: npm Global Install (Planned for v0.4.0)\n\nInstall the plugin globally via npm for use across any project:\n\n```bash\n# Planned for v0.4.0 (not yet released)\nnpm install -g @mcpbr/claude-code-plugin\n```\n\n\u003e **Note**: The npm package is not yet published. This installation method will be available in a future release. Track progress in [issue #265](https://github.com/greynewell/mcpbr/issues/265).\n\n**Best for**: Users who want plugin features available in any directory.\n\n#### Option 3: Claude Code Plugin Manager (Planned for v0.4.0)\n\nInstall via Claude Code's built-in plugin manager:\n\n1. Open Claude Code settings\n2. Navigate to Plugins \u003e Browse\n3. Search for \"mcpbr\"\n4. Click Install\n\n\u003e **Note**: Plugin manager installation is not yet available. This installation method will be available after plugin marketplace submission. 
Track progress in [issue #267](https://github.com/greynewell/mcpbr/issues/267).\n\n**Best for**: Users who prefer a GUI and want automatic updates.\n\n### Installation Comparison\n\n| Method | Availability | Auto-updates | Works Anywhere | Latest Features |\n|--------|-------------|--------------|----------------|-----------------|\n| Clone Repository | Available now | Manual (git pull) | No (repo only) | Yes (unreleased) |\n| npm Global Install | Planned (not yet released) | Via npm | Yes | Yes (published) |\n| Plugin Manager | Planned (not yet released) | Automatic | Yes | Yes (published) |\n\n### What You Get\n\nThe plugin includes three specialized skills that enhance Claude's ability to work with mcpbr:\n\n#### 1. run-benchmark\nExpert at running evaluations with proper validation and error handling.\n\n**Capabilities**:\n- Validates prerequisites (Docker running, API keys set, config files exist)\n- Constructs correct `mcpbr run` commands with appropriate flags\n- Handles errors gracefully with actionable troubleshooting steps\n- Monitors progress and provides meaningful status updates\n\n**Example interactions**:\n- \"Run the SWE-bench Lite benchmark with 10 tasks\"\n- \"Evaluate my MCP server using CyberGym level 2\"\n- \"Test my config with a single task\"\n\n#### 2. generate-config\nGenerates valid mcpbr configuration files with benchmark-specific templates.\n\n**Capabilities**:\n- Ensures required `{workdir}` placeholder is included in MCP server args\n- Validates MCP server command syntax\n- Provides templates for different benchmarks (SWE-bench, CyberGym, MCPToolBench++)\n- Suggests appropriate timeouts and concurrency settings\n\n**Example interactions**:\n- \"Generate a config for the filesystem MCP server\"\n- \"Create a config for testing my custom MCP server\"\n- \"Set up a CyberGym evaluation config\"\n\n#### 3. 
swe-bench-lite\nQuick-start command for running SWE-bench Lite evaluations.\n\n**Capabilities**:\n- Pre-configured for 5-task evaluation (fast testing)\n- Includes sensible defaults for output files and logging\n- Perfect for demonstrations and initial testing\n- Automatically sets up verbose output for debugging\n\n**Example interactions**:\n- \"Run a quick SWE-bench Lite test\"\n- \"Show me how mcpbr works\"\n- \"Test the filesystem server\"\n\n### Benefits\n\nWhen using Claude Code with the mcpbr plugin active, Claude will automatically:\n\n- Verify Docker is running before starting evaluations\n- Check for required API keys (`ANTHROPIC_API_KEY`)\n- Generate valid configurations with proper `{workdir}` placeholders\n- Use correct CLI flags and avoid deprecated options\n- Provide contextual troubleshooting when issues occur\n- Follow mcpbr best practices for optimal results\n\n### Troubleshooting\n\n**Plugin not detected in cloned repository**:\n- Ensure you're in the repository root directory\n- Verify the `claude-code.json` file exists in the repo\n- Try restarting Claude Code\n\n**Skills not appearing**:\n- Check Claude Code version (requires v2.0+)\n- Verify plugin is listed in Settings \u003e Plugins\n- Try running `/reload-plugins` in Claude Code\n\n**Commands failing**:\n- Ensure mcpbr is installed: `pip install mcpbr`\n- Verify Docker is running: `docker info`\n- Check API key is set: `echo $ANTHROPIC_API_KEY`\n\nFor more help, see the [troubleshooting guide](https://greynewell.github.io/mcpbr/troubleshooting/) or [open an issue](https://github.com/greynewell/mcpbr/issues).\n\n## Configuration\n\n\u003e **[Full configuration reference](https://greynewell.github.io/mcpbr/configuration/)** with all options and examples.\n\n### MCP Server Configuration\n\nThe `mcp_server` section defines how to start your MCP server:\n\n| Field | Description |\n|-------|-------------|\n| `command` | Executable to run (e.g., `npx`, `uvx`, `python`) |\n| `args` | Command 
arguments. Use `{workdir}` as placeholder for the task repository path |\n| `env` | Additional environment variables |\n\n### Example Configurations\n\n**Anthropic Filesystem Server:**\n\n```yaml\nmcp_server:\n  command: \"npx\"\n  args: [\"-y\", \"@modelcontextprotocol/server-filesystem\", \"{workdir}\"]\n```\n\n**Custom Python MCP Server:**\n\n```yaml\nmcp_server:\n  command: \"python\"\n  args: [\"-m\", \"my_mcp_server\", \"--workspace\", \"{workdir}\"]\n  env:\n    LOG_LEVEL: \"debug\"\n```\n\n**Supermodel Codebase Analysis Server:**\n\n```yaml\nmcp_server:\n  command: \"npx\"\n  args: [\"-y\", \"@supermodeltools/mcp-server\"]\n  env:\n    SUPERMODEL_API_KEY: \"${SUPERMODEL_API_KEY}\"\n```\n\n### MCP Timeout Configuration\n\nmcpbr supports configurable timeouts for MCP server operations to handle different server types and workloads.\n\n#### Configuration Fields\n\n| Field | Description | Default |\n|-------|-------------|---------|\n| `startup_timeout_ms` | Timeout in milliseconds for MCP server startup | 60000 (60s) |\n| `tool_timeout_ms` | Timeout in milliseconds for MCP tool execution | 900000 (15 min) |\n\nThese fields map to the `MCP_TIMEOUT` and `MCP_TOOL_TIMEOUT` environment variables used by Claude Code. 
See the [Claude Code settings documentation](https://code.claude.com/docs/en/settings.md) for more details.\n\n#### Example Configuration\n\n```yaml\nmcp_server:\n  command: \"npx\"\n  args: [\"-y\", \"@modelcontextprotocol/server-filesystem\", \"{workdir}\"]\n  startup_timeout_ms: 60000      # 60 seconds for server to start\n  tool_timeout_ms: 900000        # 15 minutes for long-running tools\n```\n\n#### Common Timeout Values\n\nDifferent server types require different timeout settings based on their operational characteristics:\n\n| Server Type | startup_timeout_ms | tool_timeout_ms | Notes |\n|-------------|-------------------|-----------------|-------|\n| Fast (filesystem, git) | 10000 (10s) | 30000 (30s) | Local operations with minimal overhead |\n| Medium (web search, APIs) | 30000 (30s) | 120000 (2m) | Network I/O with moderate latency |\n| Slow (code analysis, databases) | 60000 (60s) | 900000 (15m) | Complex processing or large datasets |\n\n**When to adjust timeouts:**\n\n- **Increase `startup_timeout_ms`** if your server takes longer to initialize (e.g., loading large models, establishing database connections)\n- **Increase `tool_timeout_ms`** if your tools perform long-running operations (e.g., codebase analysis, file processing, AI inference)\n- **Decrease timeouts** for fast servers to fail quickly on connection issues\n\n### Custom Agent Prompt\n\nYou can customize the prompt sent to the agent using the `agent_prompt` field:\n\n```yaml\nagent_prompt: |\n  Fix the following bug in this repository:\n\n  {problem_statement}\n\n  Make the minimal changes necessary to fix the issue.\n  Focus on the root cause, not symptoms.\n```\n\nUse `{problem_statement}` as a placeholder for the SWE-bench issue text. 
You can also override the prompt via CLI with `--prompt`.\n\n### Evaluation Parameters\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `provider` | `anthropic` | LLM provider |\n| `agent_harness` | `claude-code` | Agent backend |\n| `benchmark` | `swe-bench` | Benchmark to run (`swe-bench`, `cybergym`, or `mcptoolbench`) |\n| `agent_prompt` | `null` | Custom prompt template (use `{problem_statement}` placeholder) |\n| `model` | `sonnet` | Model alias or full ID |\n| `dataset` | `null` | HuggingFace dataset (optional, benchmark provides default) |\n| `cybergym_level` | `1` | CyberGym difficulty level (0-3, only for CyberGym benchmark) |\n| `sample_size` | `null` | Number of tasks (null = full dataset) |\n| `timeout_seconds` | `300` | Timeout per task |\n| `max_concurrent` | `4` | Parallel task limit |\n| `max_iterations` | `10` | Max agent iterations per task |\n\n## CLI Reference\n\n\u003e **[Full CLI documentation](https://greynewell.github.io/mcpbr/cli/)** with all commands and options.\n\nGet help for any command with `--help` or `-h`:\n\n```bash\nmcpbr --help\nmcpbr run --help\nmcpbr init --help\n```\n\n### Commands Overview\n\n| Command | Description |\n|---------|-------------|\n| `mcpbr run` | Run benchmark evaluation with configured MCP server |\n| `mcpbr init` | Generate an example configuration file |\n| `mcpbr models` | List supported models for evaluation |\n| `mcpbr providers` | List available model providers |\n| `mcpbr harnesses` | List available agent harnesses |\n| `mcpbr benchmarks` | List available benchmarks (SWE-bench, CyberGym, MCPToolBench++) |\n| `mcpbr cleanup` | Remove orphaned mcpbr Docker containers |\n\n### `mcpbr run`\n\nRun SWE-bench evaluation with the configured MCP server.\n\n\u003cdetails\u003e\n\u003csummary\u003eAll options\u003c/summary\u003e\n\n| Option | Short | Description |\n|--------|-------|-------------|\n| `--config PATH` | `-c` | Path to YAML configuration file (default: `mcpbr.yaml`, 
auto-created if missing) |\n| `--model TEXT` | `-m` | Override model from config |\n| `--benchmark TEXT` | `-b` | Override benchmark from config (`swe-bench`, `cybergym`, or `mcptoolbench`) |\n| `--level INTEGER` | | Override CyberGym difficulty level (0-3) |\n| `--sample INTEGER` | `-n` | Override sample size from config |\n| `--mcp-only` | `-M` | Run only MCP evaluation (skip baseline) |\n| `--baseline-only` | `-B` | Run only baseline evaluation (skip MCP) |\n| `--no-prebuilt` | | Disable pre-built SWE-bench images (build from scratch) |\n| `--output PATH` | `-o` | Path to save JSON results |\n| `--report PATH` | `-r` | Path to save Markdown report |\n| `--output-junit PATH` | | Path to save JUnit XML report (for CI/CD integration) |\n| `--verbose` | `-v` | Verbose output (`-v` summary, `-vv` detailed) |\n| `--log-file PATH` | `-l` | Path to write raw JSON log output (single file) |\n| `--log-dir PATH` | | Directory to write per-instance JSON log files (default: `output_dir/logs/`) |\n| `--disable-logs` | | Disable detailed execution logs (overrides default and config) |\n| `--task TEXT` | `-t` | Run specific task(s) by instance_id (repeatable) |\n| `--prompt TEXT` | | Override agent prompt (use `{problem_statement}` placeholder) |\n| `--baseline-results PATH` | | Path to baseline results JSON for regression detection |\n| `--regression-threshold FLOAT` | | Maximum acceptable regression rate (0-1). Exit with code 1 if exceeded. 
|\n| `--slack-webhook URL` | | Slack webhook URL for regression notifications |\n| `--discord-webhook URL` | | Discord webhook URL for regression notifications |\n| `--email-to EMAIL` | | Email address for regression notifications |\n| `--email-from EMAIL` | | Sender email address for notifications |\n| `--smtp-host HOST` | | SMTP server hostname for email notifications |\n| `--smtp-port PORT` | | SMTP server port (default: 587) |\n| `--smtp-user USER` | | SMTP username for authentication |\n| `--smtp-password PASS` | | SMTP password for authentication |\n| `--profile` | | Enable comprehensive performance profiling (tool latency, memory, overhead) |\n| `--help` | `-h` | Show help message |\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\u003csummary\u003eExamples\u003c/summary\u003e\n\n```bash\n# Full evaluation (MCP + baseline)\nmcpbr run -c config.yaml\n\n# Run only MCP evaluation\nmcpbr run -c config.yaml -M\n\n# Run only baseline evaluation\nmcpbr run -c config.yaml -B\n\n# Override model\nmcpbr run -c config.yaml -m claude-3-5-sonnet-20241022\n\n# Override sample size\nmcpbr run -c config.yaml -n 50\n\n# Save results and report\nmcpbr run -c config.yaml -o results.json -r report.md\n\n# Save JUnit XML for CI/CD\nmcpbr run -c config.yaml --output-junit junit.xml\n\n# Run specific tasks\nmcpbr run -c config.yaml -t astropy__astropy-12907 -t django__django-11099\n\n# Verbose output with per-instance logs\nmcpbr run -c config.yaml -v --log-dir logs/\n\n# Very verbose output\nmcpbr run -c config.yaml -vv\n\n# Run CyberGym benchmark\nmcpbr run -c config.yaml --benchmark cybergym --level 2\n\n# Run CyberGym with specific tasks\nmcpbr run -c config.yaml --benchmark cybergym --level 3 -n 5\n\n# Regression detection - compare against baseline\nmcpbr run -c config.yaml --baseline-results baseline.json\n\n# Regression detection with threshold (exit 1 if exceeded)\nmcpbr run -c config.yaml --baseline-results baseline.json --regression-threshold 0.1\n\n# Regression 
detection with Slack notifications\nmcpbr run -c config.yaml --baseline-results baseline.json --slack-webhook https://hooks.slack.com/...\n\n# Regression detection with Discord notifications\nmcpbr run -c config.yaml --baseline-results baseline.json --discord-webhook https://discord.com/api/webhooks/...\n\n# Regression detection with email notifications\nmcpbr run -c config.yaml --baseline-results baseline.json \\\n  --email-to team@example.com --email-from mcpbr@example.com \\\n  --smtp-host smtp.gmail.com --smtp-port 587 \\\n  --smtp-user user@gmail.com --smtp-password \"app-password\"\n```\n\n\u003c/details\u003e\n\n### `mcpbr init`\n\nGenerate an example configuration file.\n\n\u003cdetails\u003e\n\u003csummary\u003eOptions and examples\u003c/summary\u003e\n\n| Option | Short | Description |\n|--------|-------|-------------|\n| `--output PATH` | `-o` | Path to write example config (default: `mcpbr.yaml`) |\n| `--help` | `-h` | Show help message |\n\n```bash\nmcpbr init\nmcpbr init -o my-config.yaml\n```\n\n\u003c/details\u003e\n\n### `mcpbr models`\n\nList supported Anthropic models for evaluation.\n\n### `mcpbr cleanup`\n\nRemove orphaned mcpbr Docker containers that were not properly cleaned up.\n\n\u003cdetails\u003e\n\u003csummary\u003eOptions and examples\u003c/summary\u003e\n\n| Option | Short | Description |\n|--------|-------|-------------|\n| `--dry-run` | | Show containers that would be removed without removing them |\n| `--force` | `-f` | Skip confirmation prompt |\n| `--help` | `-h` | Show help message |\n\n```bash\n# Preview containers to remove\nmcpbr cleanup --dry-run\n\n# Remove containers with confirmation\nmcpbr cleanup\n\n# Remove containers without confirmation\nmcpbr cleanup -f\n```\n\n\u003c/details\u003e\n\n## Performance Profiling\n\nmcpbr includes comprehensive performance profiling to understand MCP server overhead and identify optimization opportunities.\n\n### Enable Profiling\n\n```bash\n# Via CLI flag\nmcpbr run -c config.yaml 
--profile\n\n# Or in config.yaml\nenable_profiling: true\n```\n\n### What Gets Measured\n\n- **Tool call latencies** with percentiles (p50, p95, p99)\n- **Memory usage** (peak and average RSS/VMS)\n- **Infrastructure overhead** (Docker and MCP server startup times)\n- **Tool discovery speed** (time to first tool use)\n- **Tool switching overhead** (time between tool calls)\n- **Automated insights** from profiling data\n\n### Example Profiling Output\n\n```json\n{\n  \"profiling\": {\n    \"task_duration_seconds\": 140.5,\n    \"tool_call_latencies\": {\n      \"Read\": {\"count\": 15, \"avg_seconds\": 0.8, \"p95_seconds\": 1.5},\n      \"Bash\": {\"avg_seconds\": 2.3, \"p95_seconds\": 5.1}\n    },\n    \"memory_profile\": {\"peak_rss_mb\": 512.3, \"avg_rss_mb\": 387.5},\n    \"docker_startup_seconds\": 2.1,\n    \"mcp_server_startup_seconds\": 1.8\n  }\n}\n```\n\n### Automated Insights\n\nThe profiler automatically identifies performance issues:\n\n```text\n- Bash is the slowest tool (avg: 2.3s, p95: 5.1s)\n- Docker startup adds 2.1s overhead per task\n- Fast tool discovery: first tool use in 8.3s\n```\n\nSee [docs/profiling.md](docs/profiling.md) for complete profiling documentation.\n\n## Example Run\n\nHere's what a typical evaluation looks like:\n\n```bash\n$ mcpbr run -c config.yaml -v -o results.json --log-dir my-logs\n\nmcpbr Evaluation\n  Config: config.yaml\n  Provider: anthropic\n  Model: sonnet\n  Agent Harness: claude-code\n  Dataset: SWE-bench/SWE-bench_Lite\n  Sample size: 10\n  Run MCP: True, Run Baseline: True\n  Pre-built images: True\n  Log dir: my-logs\n\nLoading dataset: SWE-bench/SWE-bench_Lite\nEvaluating 10 tasks\nProvider: anthropic, Harness: claude-code\n14:23:15 [MCP] Starting mcp run for astropy-12907:mcp\n14:23:22 astropy-12907:mcp    \u003e TodoWrite\n14:23:22 astropy-12907:mcp    \u003c Todos have been modified successfully...\n14:23:26 astropy-12907:mcp    \u003e Glob\n14:23:26 astropy-12907:mcp    \u003e Grep\n14:23:27 
astropy-12907:mcp    \u003c $WORKDIR/astropy/modeling/separable.py\n14:23:27 astropy-12907:mcp    \u003c Found 5 files: astropy/modeling/tests/test_separable.py...\n...\n14:27:43 astropy-12907:mcp    * done turns=31 tokens=115/6,542\n14:28:30 [BASELINE] Starting baseline run for astropy-12907:baseline\n...\n```\n\n## Output\n\n\u003e **[Understanding evaluation results](https://greynewell.github.io/mcpbr/evaluation-results/)** - detailed guide to interpreting output.\n\n### Console Output\n\nThe harness displays real-time progress with verbose mode (`-v`) and a final summary table:\n\n```text\nEvaluation Results\n\n                 Summary\n+-----------------+-----------+----------+\n| Metric          | MCP Agent | Baseline |\n+-----------------+-----------+----------+\n| Resolved        | 8/25      | 5/25     |\n| Resolution Rate | 32.0%     | 20.0%    |\n+-----------------+-----------+----------+\n\nImprovement: +60.0%\n\nPer-Task Results\n+------------------------+------+----------+-------+\n| Instance ID            | MCP  | Baseline | Error |\n+------------------------+------+----------+-------+\n| astropy__astropy-12907 | PASS |   PASS   |       |\n| django__django-11099   | PASS |   FAIL   |       |\n| sympy__sympy-18087     | FAIL |   FAIL   |       |\n+------------------------+------+----------+-------+\n\nResults saved to results.json\n```\n\n### JSON Output (`--output`)\n\n```json\n{\n  \"metadata\": {\n    \"timestamp\": \"2026-01-17T07:23:39.871437+00:00\",\n    \"config\": {\n      \"model\": \"sonnet\",\n      \"provider\": \"anthropic\",\n      \"agent_harness\": \"claude-code\",\n      \"dataset\": \"SWE-bench/SWE-bench_Lite\",\n      \"sample_size\": 25,\n      \"timeout_seconds\": 600,\n      \"max_iterations\": 30\n    },\n    \"mcp_server\": {\n      \"command\": \"npx\",\n      \"args\": [\"-y\", \"@modelcontextprotocol/server-filesystem\", \"{workdir}\"]\n    }\n  },\n  \"summary\": {\n    \"mcp\": {\"resolved\": 8, \"total\": 25, \"rate\": 
0.32},\n    \"baseline\": {\"resolved\": 5, \"total\": 25, \"rate\": 0.20},\n    \"improvement\": \"+60.0%\"\n  },\n  \"tasks\": [\n    {\n      \"instance_id\": \"astropy__astropy-12907\",\n      \"mcp\": {\n        \"patch_generated\": true,\n        \"tokens\": {\"input\": 115, \"output\": 6542},\n        \"iterations\": 30,\n        \"tool_calls\": 72,\n        \"tool_usage\": {\n          \"TodoWrite\": 4, \"Task\": 1, \"Glob\": 4,\n          \"Grep\": 11, \"Bash\": 27, \"Read\": 22,\n          \"Write\": 2, \"Edit\": 1\n        },\n        \"resolved\": true,\n        \"patch_applied\": true,\n        \"fail_to_pass\": {\"passed\": 2, \"total\": 2},\n        \"pass_to_pass\": {\"passed\": 10, \"total\": 10}\n      },\n      \"baseline\": {\n        \"patch_generated\": true,\n        \"tokens\": {\"input\": 63, \"output\": 7615},\n        \"iterations\": 30,\n        \"tool_calls\": 57,\n        \"tool_usage\": {\n          \"TodoWrite\": 4, \"Glob\": 3, \"Grep\": 4,\n          \"Read\": 14, \"Bash\": 26, \"Write\": 4, \"Edit\": 1\n        },\n        \"resolved\": true,\n        \"patch_applied\": true\n      }\n    }\n  ]\n}\n```\n\n### Output Directory Structure\n\nBy default, mcpbr consolidates all outputs into a single timestamped directory:\n\n```text\n.mcpbr_run_20260126_133000/\n├── config.yaml                # Copy of configuration used\n├── evaluation_state.json      # Task results and state\n├── logs/                      # Detailed MCP server logs\n│   ├── task_1_mcp.log\n│   ├── task_2_mcp.log\n│   └── ...\n└── README.txt                 # Auto-generated explanation\n```\n\nThis makes it easy to:\n- **Archive results**: `tar -czf results.tar.gz .mcpbr_run_*`\n- **Clean up**: `rm -rf .mcpbr_run_*`\n- **Share**: Just zip one directory\n\nYou can customize the output directory:\n\n```bash\n# Custom output directory\nmcpbr run -c config.yaml --output-dir ./my-results\n\n# Or in config.yaml\noutput_dir: \"./my-results\"\n```\n\n**Note:** The 
`--output-dir` CLI flag takes precedence over the `output_dir` config setting. This ensures that the README.txt file in the output directory reflects the final effective configuration values after all CLI overrides are applied.\n\n### Markdown Report (`--report`)\n\nGenerates a human-readable report with:\n- Summary statistics\n- Per-task results table\n- Analysis of which tasks each agent solved\n\n### Per-Instance Logs (`--log-dir`)\n\n**Logging is enabled by default** to prevent data loss. Detailed execution traces are automatically saved to `output_dir/logs/` unless disabled.\n\nTo disable logging:\n```bash\n# Via CLI flag\nmcpbr run -c config.yaml --disable-logs\n\n# Or in config file\ndisable_logs: true\n```\n\n`--log-dir` sets the directory where a detailed JSON log file is written for each task run. Filenames include timestamps to prevent overwrites:\n\n```text\nmy-logs/\n  astropy__astropy-12907_mcp_20260117_143052.json\n  astropy__astropy-12907_baseline_20260117_143156.json\n  django__django-11099_mcp_20260117_144023.json\n  django__django-11099_baseline_20260117_144512.json\n```\n\nEach log file contains the full stream of events from the agent CLI:\n\n```json\n{\n  \"instance_id\": \"astropy__astropy-12907\",\n  \"run_type\": \"mcp\",\n  \"events\": [\n    {\n      \"type\": \"system\",\n      \"subtype\": \"init\",\n      \"cwd\": \"/workspace\",\n      \"tools\": [\"Task\", \"Bash\", \"Glob\", \"Grep\", \"Read\", \"Edit\", \"Write\", \"TodoWrite\"],\n      \"model\": \"claude-sonnet-4-5-20250929\",\n      \"claude_code_version\": \"2.1.12\"\n    },\n    {\n      \"type\": \"assistant\",\n      \"message\": {\n        \"content\": [{\"type\": \"text\", \"text\": \"I'll help you fix this bug...\"}]\n      }\n    },\n    {\n      \"type\": \"assistant\",\n      \"message\": {\n        \"content\": [{\"type\": \"tool_use\", \"name\": \"Grep\", \"input\": {\"pattern\": \"separability\"}}]\n      }\n    },\n    {\n      \"type\": \"result\",\n      \"num_turns\": 31,\n      
\"usage\": {\"input_tokens\": 115, \"output_tokens\": 6542}\n    }\n  ]\n}\n```\n\nThis is useful for debugging failed runs or analyzing agent behavior in detail.\n\n### JUnit XML Output (`--output-junit`)\n\nThe harness can generate JUnit XML reports for integration with CI/CD systems like GitHub Actions, GitLab CI, and Jenkins. Each task is represented as a test case, with resolved/unresolved tasks mapped to pass/fail states.\n\n```bash\nmcpbr run -c config.yaml --output-junit junit.xml\n```\n\nThe JUnit XML report includes:\n\n- **Test Suites**: Separate suites for MCP and baseline evaluations\n- **Test Cases**: Each task is a test case with timing information\n- **Failures**: Unresolved tasks with detailed error messages\n- **Properties**: Metadata about model, provider, benchmark configuration\n- **System Output**: Token usage, tool calls, and test results per task\n\n#### CI/CD Integration Examples\n\n**GitHub Actions:**\n\n```yaml\nname: MCP Benchmark\n\non: [push, pull_request]\n\njobs:\n  benchmark:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions/checkout@v3\n\n      - name: Set up Python\n        uses: actions/setup-python@v4\n        with:\n          python-version: '3.11'\n\n      - name: Install mcpbr\n        run: pip install mcpbr\n\n      - name: Run benchmark\n        env:\n          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}\n        run: |\n          mcpbr run -c config.yaml --output-junit junit.xml\n\n      - name: Publish Test Results\n        uses: EnricoMi/publish-unit-test-result-action@v2\n        if: always()\n        with:\n          files: junit.xml\n```\n\n**GitLab CI:**\n\n```yaml\nbenchmark:\n  image: python:3.11\n  services:\n    - docker:dind\n  script:\n    - pip install mcpbr\n    - mcpbr run -c config.yaml --output-junit junit.xml\n  artifacts:\n    reports:\n      junit: junit.xml\n```\n\n**Jenkins:**\n\n```groovy\npipeline {\n    agent any\n    stages {\n        stage('Benchmark') {\n            steps 
{\n                sh 'pip install mcpbr'\n                sh 'mcpbr run -c config.yaml --output-junit junit.xml'\n            }\n        }\n    }\n    post {\n        always {\n            junit 'junit.xml'\n        }\n    }\n}\n```\n\nThe JUnit XML format enables native test result visualization in your CI/CD dashboard, making it easy to track benchmark performance over time and identify regressions.\n\n## How It Works\n\n\u003e **[Architecture deep dive](https://greynewell.github.io/mcpbr/architecture/)** - learn how mcpbr works internally.\n\n1. **Load Tasks**: Fetches tasks from the selected benchmark (SWE-bench, CyberGym, or MCPToolBench++) via HuggingFace\n2. **Create Environment**: For each task, creates an isolated Docker environment with the repository and dependencies\n3. **Run MCP Agent**: Invokes Claude Code CLI **inside the Docker container**, letting it explore and generate a solution (patch or PoC)\n4. **Run Baseline**: Same as MCP agent but without the MCP server\n5. **Evaluate**: Runs benchmark-specific evaluation (test suites for SWE-bench, crash detection for CyberGym, tool use accuracy for MCPToolBench++)\n6. **Report**: Aggregates results and calculates improvement\n\n### Pre-built Docker Images\n\nThe harness uses pre-built SWE-bench Docker images from [Epoch AI's registry](https://github.com/orgs/Epoch-Research/packages) when available. 
These images come with:\n\n- The repository checked out at the correct commit\n- All project dependencies pre-installed and validated\n- A consistent environment for reproducible evaluations\n\nThe agent (Claude Code CLI) runs **inside the container**, which means:\n- Python imports work correctly (e.g., `from astropy import ...`)\n- The agent can run tests and verify fixes\n- No dependency conflicts with the host machine\n\nIf a pre-built image is not available for a task, the harness falls back to cloning the repository and attempting to install dependencies (less reliable).\n\n## Architecture\n\n```\nmcpbr/\n├── src/mcpbr/\n│   ├── cli.py           # Command-line interface\n│   ├── config.py        # Configuration models\n│   ├── models.py        # Supported model registry\n│   ├── providers.py     # LLM provider abstractions (extensible)\n│   ├── harnesses.py     # Agent harness implementations (extensible)\n│   ├── benchmarks/      # Benchmark abstraction layer (25+ benchmarks)\n│   │   ├── __init__.py      # Registry and factory\n│   │   ├── base.py          # Benchmark protocol\n│   │   ├── swebench.py      # SWE-bench (Verified/Lite/Full)\n│   │   ├── cybergym.py      # CyberGym security\n│   │   ├── humaneval.py     # HumanEval code generation\n│   │   ├── gsm8k.py         # GSM8K math reasoning\n│   │   ├── mcptoolbench.py  # MCPToolBench++ tool use\n│   │   ├── apps.py          # APPS coding problems\n│   │   ├── mbpp.py          # MBPP Python problems\n│   │   ├── math_benchmark.py # MATH competition math\n│   │   └── ...              
# 15+ more benchmarks\n│   ├── harness.py       # Main orchestrator\n│   ├── agent.py         # Baseline agent implementation\n│   ├── docker_env.py    # Docker environment management + in-container execution\n│   ├── evaluation.py    # Patch application and testing\n│   ├── log_formatter.py # Log formatting and per-instance logging\n│   └── reporting.py     # Output formatting\n├── tests/\n│   ├── test_*.py        # Unit tests\n│   ├── test_benchmarks.py # Benchmark tests\n│   └── test_integration.py  # Integration tests\n├── Dockerfile           # Fallback image for task environments\n└── config/\n    └── example.yaml     # Example configuration\n```\n\nThe architecture uses Protocol-based abstractions for providers, harnesses, and **benchmarks**, making it easy to add support for additional LLM providers, agent backends, or software engineering benchmarks in the future. See the **[API reference](https://greynewell.github.io/mcpbr/api/)** and **[benchmarks guide](https://greynewell.github.io/mcpbr/benchmarks/)** for more details.\n\n### Execution Flow\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│                         Host Machine                            │\n│  ┌───────────────────────────────────────────────────────────┐  │\n│  │                    mcpbr Harness (Python)                 │  │\n│  │  - Loads SWE-bench tasks from HuggingFace                 │  │\n│  │  - Pulls pre-built Docker images                          │  │\n│  │  - Orchestrates agent runs                                │  │\n│  │  - Collects results and generates reports                 │  │\n│  └─────────────────────────┬─────────────────────────────────┘  │\n│                            │ docker exec                        │\n│  ┌─────────────────────────▼─────────────────────────────────┐  │\n│  │              Docker Container (per task)                  │  │\n│  │  ┌─────────────────────────────────────────────────────┐  │  │\n│  │  │  Pre-built 
SWE-bench Image                          │  │  │\n│  │  │  - Repository at correct commit                     │  │  │\n│  │  │  - All dependencies installed (astropy, django...)  │  │  │\n│  │  │  - Node.js + Claude CLI (installed at startup)      │  │  │\n│  │  └─────────────────────────────────────────────────────┘  │  │\n│  │                                                           │  │\n│  │  Agent (Claude Code CLI) runs HERE:                       │  │\n│  │  - Makes API calls to Anthropic                           │  │\n│  │  - Executes Bash commands (with working imports!)         │  │\n│  │  - Reads/writes files                                     │  │\n│  │  - Generates patches                                      │  │\n│  │                                                           │  │\n│  │  Evaluation runs HERE:                                    │  │\n│  │  - Applies patch via git                                  │  │\n│  │  - Runs pytest with task's test suite                     │  │\n│  └───────────────────────────────────────────────────────────┘  │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n## Troubleshooting\n\n\u003e **[FAQ](https://greynewell.github.io/mcpbr/FAQ/)** - Quick answers to common questions\n\u003e\n\u003e **[Full troubleshooting guide](https://greynewell.github.io/mcpbr/troubleshooting/)** - Detailed solutions to common issues\n\n### Docker Issues\n\nEnsure Docker is running:\n```bash\ndocker info\n```\n\n### Pre-built Image Not Found\n\nIf the harness can't pull a pre-built image for a task, it will fall back to building from scratch. You can also manually pull images:\n```bash\ndocker pull ghcr.io/epoch-research/swe-bench.eval.x86_64.astropy__astropy-12907\n```\n\n### Slow on Apple Silicon\n\nOn ARM64 Macs, x86_64 Docker images run via emulation which is slower. This is normal. 
If you're experiencing issues, ensure you have Rosetta 2 installed:\n```bash\nsoftwareupdate --install-rosetta\n```\n\n### MCP Server Not Starting\n\nTest your MCP server independently:\n```bash\nnpx -y @modelcontextprotocol/server-filesystem /tmp/test\n```\n\n### API Key Issues\n\nEnsure your Anthropic API key is set:\n\n```bash\nexport ANTHROPIC_API_KEY=\"sk-ant-...\"\n```\n\n### Timeout Issues\n\nIncrease the timeout in your config:\n```yaml\ntimeout_seconds: 600\n```\n\n### Claude CLI Not Found\n\nEnsure the Claude Code CLI is installed and in your PATH:\n```bash\nwhich claude  # Should return the path to the CLI\n```\n\n## Development\n\n```bash\n# Install dev dependencies\npip install -e \".[dev]\"\n\n# Run unit tests\npytest -m \"not integration\"\n\n# Run integration tests (requires API keys and Docker)\npytest -m integration\n\n# Run all tests\npytest\n\n# Lint\nruff check src/\n```\n\n### Creating Releases\n\nWe use an automated workflow for releases. See the **[Release Guide](docs/RELEASE.md)** for full details.\n\n**Quick start for maintainers:**\n```bash\n# Patch release (bug fixes) - most common\ngh workflow run release.yml -f version_bump=patch\n\n# Minor release (new features)\ngh workflow run release.yml -f version_bump=minor\n\n# Major release (breaking changes)\ngh workflow run release.yml -f version_bump=major\n```\n\n**For AI agents:** See the **[AI Agent Guide](docs/AI_AGENT_GUIDE.md)** for a quick reference.\n\nThe workflow automatically:\n- Bumps the version in `pyproject.toml`\n- Syncs the version to all package files\n- Creates a git tag and GitHub release\n- Triggers PyPI and npm publication\n\n## Roadmap\n\nWe're building the de facto standard for MCP server benchmarking! 
Our [v1.0 Roadmap](https://github.com/greynewell/mcpbr/projects/2) includes 200+ features across 11 strategic categories:\n\n🎯 **[Good First Issues](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)** | 🙋 **[Help Wanted](https://github.com/greynewell/mcpbr/labels/help%20wanted)** | 📋 **[View Roadmap](https://github.com/greynewell/mcpbr/projects/2)**\n\n[![good first issues](https://img.shields.io/github/issues/greynewell/mcpbr/good%20first%20issue?label=good%20first%20issues\u0026color=7057ff)](https://github.com/greynewell/mcpbr/labels/good%20first%20issue)\n[![help wanted](https://img.shields.io/github/issues/greynewell/mcpbr/help%20wanted?label=help%20wanted\u0026color=008672)](https://github.com/greynewell/mcpbr/labels/help%20wanted)\n[![roadmap progress](https://img.shields.io/github/issues-pr-closed/greynewell/mcpbr?label=roadmap%20progress)](https://github.com/greynewell/mcpbr/projects/2)\n\n### Roadmap Highlights\n\n**Phase 1: Foundation** (v0.3.0)\n- ✅ JUnit XML output format for CI/CD integration\n- CSV, YAML, XML output formats\n- Config validation and templates\n- Results persistence and recovery\n- Cost analysis in reports\n\n**Phase 2: Benchmarks** (v0.4.0)\n- ✅ 30+ benchmarks across 10 categories\n- ✅ Custom benchmark YAML support\n- ✅ Custom metrics, failure analysis, sampling strategies\n- ✅ Dataset versioning, latency metrics, GPU support, few-shot learning\n\n**Phase 3: Developer Experience** (v0.5.0)\n- Real-time dashboard\n- Interactive config wizard\n- Shell completion\n- Pre-flight checks\n\n**Phase 4: Platform Expansion** (v0.6.0)\n- NPM package\n- GitHub Action for CI/CD\n- Homebrew formula\n- Official Docker image\n\n**Phase 5: MCP Testing Suite** (v1.0.0)\n- Tool coverage analysis\n- Performance profiling\n- Error rate monitoring\n- Security scanning\n\n### Get Involved\n\nWe welcome contributions! 
Check out our **30+ good first issues** perfect for newcomers:\n\n- **Output Formats**: CSV/YAML/XML export\n- **Configuration**: Validation, templates, shell completion\n- **Platform**: Homebrew formula, Conda package\n- **Documentation**: Best practices, examples, guides\n\nSee the [contributing guide](https://greynewell.github.io/mcpbr/contributing/) to get started!\n\n## Best Practices\n\nNew to mcpbr or want to optimize your workflow? Check out the **[Best Practices Guide](https://greynewell.github.io/mcpbr/best-practices/)** for:\n\n- Benchmark selection guidelines\n- MCP server configuration tips\n- Performance optimization strategies\n- Cost management techniques\n- CI/CD integration patterns\n- Debugging workflows\n- Common pitfalls to avoid\n\n## Contributing\n\nPlease see [CONTRIBUTING.md](CONTRIBUTING.md) or the **[contributing guide](https://greynewell.github.io/mcpbr/contributing/)** for guidelines on how to contribute.\n\nAll contributors are expected to follow our [Community Guidelines](CODE_OF_CONDUCT.md).\n\n## License\n\nMIT - see [LICENSE](LICENSE) for details.\n\n\n---\n\nBuilt by [Grey Newell](https://greynewell.com)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreynewell%2Fmcpbr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgreynewell%2Fmcpbr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgreynewell%2Fmcpbr/lists"}