https://github.com/eval-sys/mcpmark
MCP Servers are shaping the future of software. MCPMark is a comprehensive, stress-testing benchmark designed to evaluate model and agent capabilities in real-world MCP use.
- Host: GitHub
- URL: https://github.com/eval-sys/mcpmark
- Owner: eval-sys
- License: apache-2.0
- Created: 2025-07-01T07:39:50.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-09-05T06:24:12.000Z (4 months ago)
- Last Synced: 2025-09-05T08:28:04.736Z (4 months ago)
- Topics: agentic, benchmark, eval-sys, mcp, mcp-servers, mcpmark, tool-use
- Language: Python
- Homepage: https://mcpmark.ai
- Size: 16.7 MB
- Stars: 117
- Watchers: 1
- Forks: 4
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: docs/contributing/make-contribution.md
- License: LICENSE
Awesome Lists containing this project
- awesome-mcp - mcpmark - ⭐ 347 (📚 Projects (1974 total) / MCP Servers)
README
# MCPMark: Stress-Testing Comprehensive MCP Use
[Website](https://mcpmark.ai) · [Discord](https://discord.gg/HrKkJAxDnA) · [Docs](https://mcpmark.ai/docs) · [Trajectory Logs (Hugging Face)](https://huggingface.co/datasets/Jakumetsu/mcpmark-trajectory-log)
An evaluation suite for agentic models in real MCP tool environments (Notion / GitHub / Filesystem / Postgres / Playwright).
MCPMark provides a reproducible, extensible benchmark for researchers and engineers: one-command tasks, isolated sandboxes, auto-resume for failures, unified metrics, and aggregated reports.
## What you can do with MCPMark
- **Evaluate real tool usage** across multiple MCP services: `Notion`, `GitHub`, `Filesystem`, `Postgres`, `Playwright`.
- **Use ready-to-run tasks** covering practical workflows, each with strict automated verification.
- **Reliable and reproducible**: isolated environments that do not pollute your accounts/data; failed tasks auto-retry and resume.
- **Unified metrics and aggregation**: single/multi-run (pass@k, avg@k, etc.) with automated results aggregation.
- **Flexible deployment**: local or Docker; fully validated on macOS and Linux.
---
## Quickstart (5 minutes)
### 1) Clone the repository
```bash
git clone https://github.com/eval-sys/mcpmark.git
cd mcpmark
```
### 2) Set environment variables (create `.mcp_env` at repo root)
Only set what you need. Add service credentials when running tasks for that service.
```env
# Example: OpenAI
OPENAI_BASE_URL="https://api.openai.com/v1"
OPENAI_API_KEY="sk-..."
# Optional: Notion (only for Notion tasks)
SOURCE_NOTION_API_KEY="your-source-notion-api-key"
EVAL_NOTION_API_KEY="your-eval-notion-api-key"
EVAL_PARENT_PAGE_TITLE="MCPMark Eval Hub"
PLAYWRIGHT_BROWSER="chromium" # chromium | firefox
PLAYWRIGHT_HEADLESS="True"
# Optional: GitHub (only for GitHub tasks)
GITHUB_TOKENS="token1,token2" # token pooling for rate limits
GITHUB_EVAL_ORG="your-eval-org"
# Optional: Postgres (only for Postgres tasks)
POSTGRES_HOST="localhost"
POSTGRES_PORT="5432"
POSTGRES_USERNAME="postgres"
POSTGRES_PASSWORD="password"
```
See `docs/introduction.md` and the service guides below for more details.
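For reference, `.mcp_env` is a plain `KEY="value"` file. The snippet below is a minimal sketch of loading it with `python-dotenv`; this is an assumption about tooling for illustration only, not necessarily how MCPMark reads the file internally.
```python
# Minimal sketch: read credentials from .mcp_env at the repo root.
# Assumes python-dotenv is installed; MCPMark's own loader may differ.
import os
from dotenv import load_dotenv

load_dotenv(".mcp_env")  # parse KEY="value" lines into the process environment

api_key = os.environ.get("OPENAI_API_KEY")
base_url = os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set in .mcp_env")
```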
### 3) Install and run a minimal example
Local (Recommended)
```bash
pip install -e .
# If you'll use browser-based tasks, install Playwright browsers first
playwright install
```
Docker
```bash
./build-docker.sh
```
Run a filesystem task (no external accounts required):
```bash
# --k 1 runs each task once for a quick start; use gpt-5 or any model you configured
python -m pipeline \
  --mcp filesystem \
  --k 1 \
  --models gpt-5 \
  --tasks file_property/size_classification
```
Results are saved to `./results/{exp_name}/{model}__{mcp}/run-*/...` (e.g., `./results/test-run/gpt-5__filesystem/run-1/...`).
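To inspect what a run produced, you can walk that results tree directly. The sketch below uses only the standard library and assumes the layout above; the exact file names inside each run directory are whatever the pipeline writes (JSON/CSV per task).
```python
# Minimal sketch: list per-run result artifacts for one experiment.
from pathlib import Path

results_root = Path("results") / "test-run"           # exp_name from the example above
for run_dir in sorted(results_root.glob("*__*/run-*")):
    artifacts = [p.name for p in run_dir.iterdir() if p.suffix in {".json", ".csv"}]
    print(run_dir, "->", artifacts)
```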
---
## Run your evaluations
### Single run (k=1)
```bash
# Run ALL tasks for a service
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 1
# Run a task group
python -m pipeline --exp-name exp --mcp notion --tasks online_resume --models MODEL --k 1
# Run a specific task
python -m pipeline --exp-name exp --mcp notion --tasks online_resume/daily_itinerary_overview --models MODEL --k 1
# Evaluate multiple models
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL1,MODEL2,MODEL3 --k 1
```
### Multiple runs (k>1) for pass@k
```bash
# Run k=4 to compute stability metrics (--exp-name is required so results can be aggregated)
python -m pipeline --exp-name exp --mcp notion --tasks all --models MODEL --k 4
# Aggregate results (pass@1 / pass@k / pass^k / avg@k)
python -m src.aggregators.aggregate_results --exp-name exp
```
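For intuition, here is what these metrics reduce to for a single task evaluated over k runs. This is a minimal sketch assuming boolean pass/fail outcomes per run; the estimators actually used by `src/aggregators/aggregate_results.py` may differ.
```python
# Minimal sketch of the multi-run metrics for one task with k boolean outcomes.
def task_metrics(outcomes: list[bool]) -> dict:
    k = len(outcomes)
    passed = sum(outcomes)
    return {
        "avg@k": passed / k,                   # mean single-run success rate (pass@1 averaged over runs)
        "pass@k": 1.0 if passed > 0 else 0.0,  # solved in at least one of the k runs
        "pass^k": 1.0 if passed == k else 0.0, # solved in every one of the k runs
    }

# Example: one task run 4 times, passing twice.
print(task_metrics([True, False, True, False]))
# {'avg@k': 0.5, 'pass@k': 1.0, 'pass^k': 0.0}
```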
### Run with Docker
```bash
# Run all tasks for a service
./run-task.sh --mcp notion --models MODEL --exp-name exp --tasks all
# Cross-service benchmark
./run-benchmark.sh --models MODEL --exp-name exp --docker
```
See `docs/introduction.md` for the available choices of *MODEL*.
Tip: MCPMark supports **auto-resume**. When re-running, only unfinished tasks will execute. Failures matching our retryable patterns (see [RETRYABLE_PATTERNS](src/errors.py)) are retried automatically. Models may emit different error strings—if you encounter a new resumable error, please open a PR or issue.
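Conceptually, retryable-failure detection is just pattern matching against the error message. The sketch below is illustrative only; the patterns shown are placeholders, and the real list lives in [`src/errors.py`](src/errors.py).
```python
# Illustrative only: the real patterns in src/errors.py differ from these placeholders.
import re

RETRYABLE_PATTERNS = [              # hypothetical examples of transient failures
    r"rate.?limit",
    r"timed? ?out",
    r"connection (reset|refused)",
]

def is_retryable(error_message: str) -> bool:
    return any(re.search(p, error_message, re.IGNORECASE) for p in RETRYABLE_PATTERNS)

print(is_retryable("Request timed out after 60s"))  # True
```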
---
## Service setup and authentication
| Service | Setup summary | Docs |
|-------------|-----------------------------------------------------------------------------------------------------------------|---------------------------------------|
| Notion | Environment isolation (Source Hub / Eval Hub), integration creation and grants, browser login verification. | [Guide](docs/mcp/notion.md) |
| GitHub | Multi-account token pooling recommended; import pre-exported repo state if needed. | [Guide](docs/mcp/github.md) |
| Postgres | Start via Docker and import sample databases. | [Setup](docs/mcp/postgres.md) |
| Playwright | Install browsers before first run; defaults to `chromium`. | [Setup](docs/mcp/playwright.md) |
| Filesystem | Zero-configuration, run directly. | [Config](docs/mcp/filesystem.md) |
You can also follow [Quickstart](docs/quickstart.md) for the shortest end-to-end path.
---
## Results and metrics
- Results are organized under `./results/{exp_name}/{model}__{mcp}/run-*/` (JSON + CSV per task).
- Generate a summary with:
```bash
# Basic usage
python -m src.aggregators.aggregate_results --exp-name exp
# For k-run experiments with single-run models
python -m src.aggregators.aggregate_results --exp-name exp --k 4 --single-run-models claude-opus-4-1
```
- Only models with complete results across all tasks and runs are included in the final summary.
- Includes multi-run metrics (pass@k, pass^k) for stability comparisons when k > 1.
---
## Model and Tasks
- **Model support**: MCPMark calls models via LiteLLM (see the [LiteLLM docs](https://docs.litellm.ai/docs/)); a minimal call sketch follows this list. For Anthropic (Claude) extended thinking mode (enabled via `--reasoning-effort`), we use Anthropic's native API.
- See `docs/introduction.md` for details and configuration of supported models in MCPMark.
- To add a new model, edit `src/model_config.py`. Before adding, check that LiteLLM supports the model/provider in the [LiteLLM docs](https://docs.litellm.ai/docs/).
- Task design principles are documented in `docs/datasets/task.md`. Each task ships with an automated `verify.py` for objective, reproducible evaluation; see `docs/task.md` for details.
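As a point of reference, the LiteLLM path above boils down to calls like the following. This is a minimal sketch, not MCPMark's internal agent loop; `gpt-4o` is a placeholder model name, and the API key is assumed to be present in the environment (e.g., loaded from `.mcp_env`).
```python
# Minimal sketch of a LiteLLM chat completion; MCPMark's agent loop adds
# MCP tool calls, retries, and result logging on top of calls like this.
import litellm

response = litellm.completion(
    model="gpt-4o",  # placeholder; use a model configured in src/model_config.py
    messages=[{"role": "user", "content": "List the files in the sandbox."}],
)
print(response.choices[0].message.content)
```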
---
## Contributing
Contributions are welcome:
1. Add a new task under the `tasks/` directory, mirroring the existing service/task layout, with `meta.json`, `description.md`, and `verify.py` (a hypothetical example follows this list).
2. Ensure local checks pass and open a PR.
3. See `docs/contributing/make-contribution.md`.
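To make the shape of a task concrete, here is a hypothetical `verify.py` for an imaginary filesystem task. The real verification interface (how the sandbox location is passed and what the pipeline expects back) is defined in `docs/task.md` and the existing tasks, so treat this purely as an illustration.
```python
# Hypothetical verify.py for an imaginary filesystem task.
# Assumption: the sandbox directory arrives as a command-line argument and a
# non-zero exit code signals failure; check existing tasks for the real contract.
import sys
from pathlib import Path

def verify(sandbox: Path) -> bool:
    report = sandbox / "report.md"  # artifact the imaginary task asks the agent to create
    return report.exists() and "## Summary" in report.read_text()

if __name__ == "__main__":
    sandbox = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
    sys.exit(0 if verify(sandbox) else 1)
```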
---
## Citation
If you find our work useful for your research, please consider citing:
```bibtex
@misc{mcpmark_2025,
title = {MCPMark: Stress-Testing Comprehensive MCP Use},
author = {The MCPMark Team},
howpublished = {\url{https://github.com/eval-sys/mcpmark}},
year = {2025}
}
```
## License
This project is licensed under the Apache License 2.0 — see `LICENSE`.