https://github.com/eval-sys/mcpmark
MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark designed to evaluate model and agent capabilities in real-world MCP use.
- Host: GitHub
- URL: https://github.com/eval-sys/mcpmark
- Owner: eval-sys
- License: apache-2.0
- Created: 2025-07-01T07:39:50.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-09-05T06:24:12.000Z (4 months ago)
- Last Synced: 2025-09-05T08:28:04.736Z (4 months ago)
- Topics: agentic, benchmark, eval-sys, mcp, mcp-servers, mcpmark, tool-use
- Language: Python
- Homepage: https://mcpmark.ai
- Size: 16.7 MB
- Stars: 117
- Watchers: 1
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing/make-contribution.md
- License: LICENSE
Awesome Lists containing this project
- awesome-mcp - mcpmark - ⭐ 347 (📚 Projects (1974 total) / MCP Servers)
README
# MCPMark: Stress-Testing Comprehensive MCP Use
[Website](https://mcpmark.ai) · [Discord](https://discord.gg/HrKkJAxDnA) · [Docs](https://mcpmark.ai/docs) · [Trajectory Logs (Hugging Face)](https://huggingface.co/datasets/Jakumetsu/mcpmark-trajectory-log)
An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
## What you can do with MCPMark
- **Evaluate real tool usage** across multiple MCP services: `Notion`, `GitHub`, `Filesystem`, `Postgres`, `Playwright`.
- **Use ready-to-run tasks** covering practical workflows, each with strict automated verification.
- **Reliable and reproducible**: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
- **Unified metrics and aggregation**: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
- **Flexible deployment**: local or Docker; fully validated on macOS and Linux.
---
## Quickstart (5 minutes)
### 1) Clone the repository
```bash
git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark
```
### 2) Set environment variables (create `.mcp_env` at repo root)
Only set what you need. Add service credentials when running tasks for that service.
```env
# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."
# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium" # chromium | firefox
PLAYWRIGHT_HEADLESS="True"
# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2" # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"
# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
```
See `docs/introduction.md` and the service guides below for more details.
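For reference, `.mcp_env` is a plain `KEY="value"` file. The snippet below is a minimal sketch of loading it with `python-dotenv`; this is an assumption about tooling for illustration only, not necessarily how MCPMark reads the file internally.
```python
# Minimal sketch: read credentials from .mcp_env at the repo root.
# Assumes python-dotenv is installed; MCPMark's own loader may differ.
import os
from dotenv import load_dotenv

load_dotenv(".mcp_env")  # parse KEY="value" lines into the process environment

api_key = os.environ.get("OPENAI_API_KEY")
base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set in .mcp_env")
```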
### 3) Install and run a minimal example
Local (Recommended)
```bash
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
```
Docker
```bash
./build-docker.sh
```
Run a filesystem task (no external accounts required):
```bash
# --k 1 runs each task once for a quick start; use gpt-5 or any model you configured
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification
```
Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...`).
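To inspect what a run produced, you can walk that results tree directly. The sketch below uses only the standard library and assumes the layout above; the exact file names inside each run directory are whatever the pipeline writes (JSON/CSV per task).
```python
# Minimal sketch: list per-run result artifacts for one experiment.
from pathlib import Path

results_root = Path("results") / "test-run"           # exp_name from the example above
for run_dir in sorted(results_root.glob("*__*/run-*")):
    artifacts = [p.name for p in run_dir.iterdir() if p.suffix in {".json", ".csv"}]
    print(run_dir, "->", artifacts)
```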
---
## Run your evaluations
### Single run (k=1)
```bash
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1
# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1
# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1
# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
```
### Multiple runs (k>1) for pass@k
```bash
# Run k=4 to compute stability metrics (--exp-name is required so results can be aggregated)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4
# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
```
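For intuition, here is what these metrics reduce to for a single task evaluated over k runs. This is a minimal sketch assuming boolean pass/fail outcomes per run; the estimators actually used by `src/aggregators/aggregate_results.py` may differ.
```python
# Minimal sketch of the multi-run metrics for one task with k boolean outcomes.
def task_metrics(outcomes: list[bool]) -> dict:
    k = len(outcomes)
    passed = sum(outcomes)
    return {
        "avg@k": passed / k,                   # mean single-run success rate (pass@1 averaged over runs)
        "pass@k": 1.0 if passed > 0 else 0.0,  # solved in at least one of the k runs
        "pass^k": 1.0 if passed == k else 0.0, # solved in every one of the k runs
    }

# Example: one task run 4 times, passing twice.
print(task_metrics([True, False, True, False]))
# {'avg@k': 0.5, 'pass@k': 1.0, 'pass^k': 0.0}
```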
### Run with Docker
```bash
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all
# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
```
See `docs/introduction.md` for the available choices of *MODEL*.
Tip: MCPMark supports **auto-resume**. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see [RETRYABLE_PATTERNS](src/errors.py)) are retried automatically. Models may emit different error strings—if you encounter a new resumable error, please open a PR or issue.
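Conceptually, retryable-failure detection is just pattern matching against the error message. The sketch below is illustrative only; the patterns shown are placeholders, and the real list lives in [`src/errors.py`](src/errors.py).
```python
# Illustrative only: the real patterns in src/errors.py differ from these placeholders.
import re

RETRYABLE_PATTERNS = [              # hypothetical examples of transient failures
    r"rate.?limit",
    r"timed? ?out",
    r"connection (reset|refused)",
]

def is_retryable(error_message: str) -> bool:
    return any(re.search(p, error_message, re.IGNORECASE) for p in RETRYABLE_PATTERNS)

print(is_retryable("Request timed out after 60s"))  # True
```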
---
## Service setup and authentication
| Service | Setup summary | Docs |
|-------------|-----------------------------------------------------------------------------------------------------------------|---------------------------------------|
| Notion | Environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification. | [Guide](docs/mcp/notion.md) |
| GitHub | Multi-account token pooling recommended; import pre-exported repo state if needed. | [Guide](docs/mcp/github.md) |
| Postgres | Start via Docker and import sample databases. | [Setup](docs/mcp/postgres.md) |
| Playwright | Install browsers before first run; defaults to `chromium`. | [Setup](docs/mcp/playwright.md) |
| Filesystem | Zero-configuration, run directly. | [Config](docs/mcp/filesystem.md) |
You can also follow [Quickstart](docs/quickstart.md) for the shortest end-to-end path.
---
## Results and metrics
- Results are organized under `./results/{exp_name}/{model}__{mcp}/run-*/` (JSON + CSV per task).
- Generate a summary with:
```bash
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp
# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
```
- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
---
## Model and Tasks
- **Model support**: MCPMark calls models via LiteLLM (see the [LiteLLM docs](https://docs.litellm.ai/docs/)); a minimal call sketch follows this list. For Anthropic (Claude) extended thinking mode (enabled via `--reasoning-effort`), we use Anthropic's native API.
- See `docs/introduction.md` for details and configuration of supported models in MCPMark.
- To add a new model, edit `src/model_config.py`. Before adding, check that LiteLLM supports the model/provider in the [LiteLLM docs](https://docs.litellm.ai/docs/).
- Task design principles are documented in `docs/datasets/task.md`. Each task ships with an automated `verify.py` for objective, reproducible evaluation; see `docs/task.md` for details.
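As a point of reference, the LiteLLM path above boils down to calls like the following. This is a minimal sketch, not MCPMark's internal agent loop; `gpt-4o` is a placeholder model name, and the API key is assumed to be present in the environment (e.g., loaded from `.mcp_env`).
```python
# Minimal sketch of a LiteLLM chat completion; MCPMark's agent loop adds
# MCP tool calls, retries, and result logging on top of calls like this.
import litellm

response = litellm.completion(
    model="gpt-4o",  # placeholder; use a model configured in src/model_config.py
    messages=[{"role": "user", "content": "List the files in the sandbox."}],
)
print(response.choices[0].message.content)
```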
---
## Contributing
Contributions are welcome:
1. Add a new task under the `tasks/` directory, mirroring the existing service/task layout, with `meta.json`, `description.md`, and `verify.py` (a hypothetical example follows this list).
2. Ensure local checks pass and open a PR.
3. See `docs/contributing/make-contribution.md`.
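To make the shape of a task concrete, here is a hypothetical `verify.py` for an imaginary filesystem task. The real verification interface (how the sandbox location is passed and what the pipeline expects back) is defined in `docs/task.md` and the existing tasks, so treat this purely as an illustration.
```python
# Hypothetical verify.py for an imaginary filesystem task.
# Assumption: the sandbox directory arrives as a command-line argument and a
# non-zero exit code signals failure; check existing tasks for the real contract.
import sys
from pathlib import Path

def verify(sandbox: Path) -> bool:
    report = sandbox / "report.md"  # artifact the imaginary task asks the agent to create
    return report.exists() and "## Summary" in report.read_text()

if __name__ == "__main__":
    sandbox = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    sys.exit(0 if verify(sandbox) else 1)
```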
---
## Citation
If you find our work useful for your research, please consider citing:
```bibtex
@misc{mcpmark_2025,
title = {MCPMark: Stress-Testing Comprehensive MCP Use},
author = {The MCPMark Team},
howpublished = {\url{https://github.com/eval-sys/mcpmark}},
year = {2025}
}
```
## License
This project is licensed under the Apache License 2.0 — see `LICENSE`.