{"id":28702275,"url":"https://github.com/modelscope/mcpbench","last_synced_at":"2025-06-14T12:32:20.606Z","repository":{"id":287806842,"uuid":"965403736","full_name":"modelscope/MCPBench","owner":"modelscope","description":"The evaluation benchmark on MCP servers","archived":false,"fork":false,"pushed_at":"2025-04-30T03:52:42.000Z","size":2499,"stargazers_count":93,"open_issues_count":1,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-30T04:31:26.616Z","etag":null,"topics":["benchmark","database","mcp","mcp-server","websearch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/modelscope.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-04-13T04:51:27.000Z","updated_at":"2025-04-30T03:52:48.000Z","dependencies_parsed_at":"2025-04-30T04:35:49.021Z","dependency_job_id":null,"html_url":"https://github.com/modelscope/MCPBench","commit_stats":null,"previous_names":["modelscope/mcpbench"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/modelscope/MCPBench","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FMCPBench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FMCPBench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FMCPBench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FMCPBench/manifests","owner_url":"https://repos.e
cosyste.ms/api/v1/hosts/GitHub/owners/modelscope","download_url":"https://codeload.github.com/modelscope/MCPBench/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2FMCPBench/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259816213,"owners_count":22915837,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","database","mcp","mcp-server","websearch"],"created_at":"2025-06-14T12:31:01.519Z","updated_at":"2025-06-14T12:32:20.588Z","avatar_url":"https://github.com/modelscope.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ch1 align=\"center\"\u003e\n\t🦊 MCPBench: A Benchmark for Evaluating MCP Servers\n\u003c/h1\u003e\n\n\n\n\u003cdiv align=\"center\"\u003e\n\n[![Documentation][docs-image]][docs-url]\n[![Package License][package-license-image]][package-license-url]\n\n\u003c/div\u003e\n\n\u003cdiv align=\"center\"\u003e\n\u003ch4 align=\"center\"\u003e\n\n[中文](https://github.com/modelscope/MCPBench/blob/main/README_zh.md) |\n[English](https://github.com/modelscope/MCPBench/blob/main/README.md)\n\n\u003c/h4\u003e\n\u003c/div\u003e\n\nMCPBench is an evaluation framework for MCP Servers. It supports the evaluation of three types of servers: Web Search, Database Query and GAIA, and is compatible with both local and remote MCP Servers. The framework primarily evaluates different MCP Servers (such as Brave Search, DuckDuckGo, etc.) 
in terms of task completion accuracy, latency, and token consumption under the same LLM and Agent configurations. Here is the [evaluation report](https://arxiv.org/abs/2504.11094).

<img src="assets/figure1.png" alt="MCPBench Overview" width="600"/>

> The implementation is based on [LangProBe: a Language Programs Benchmark](https://arxiv.org/abs/2502.20315).\
> Big thanks to Qingxu Fu for the initial implementation!

<hr>

# 📋 Table of Contents

- [🔥 News](#news)
- [🛠️ Installation](#installation)
- [🚀 Quick Start](#quick-start)
  - [Launch MCP Server](#launch-mcp-server)
  - [Launch Evaluation](#launch-evaluation)
- [🧂 Datasets and Experimental Results](#datasets-and-experimental-results)
- [🚰 Cite](#cite)

# 🔥 News
+ `Apr. 29, 2025` 🌟 Updated the code for evaluating MCP server packages within GAIA.
+ `Apr. 14, 2025` 🌟 We are proud to announce that MCPBench is now open-sourced.

# 🛠️ Installation
The framework requires Python >= 3.11, Node.js, and jq.

```bash
conda create -n mcpbench python=3.11 -y
conda activate mcpbench
pip install -r requirements.txt
```
# 🚀 Quick Start
First determine the type of MCP server you want to evaluate:
- If it is remotely hosted (accessed via **SSE**, e.g. on [ModelScope](https://modelscope.cn/mcp), [Smithery](https://smithery.ai), or localhost), you can proceed directly to the [evaluation](#launch-evaluation).
- If it runs locally (launched via npx over **STDIO**), you first need to launch it as an SSE service.

## Launch MCP Server
This step is only needed for locally run (STDIO) servers. First, write the following configuration:
```json
{
    "mcp_pool": [
        {
            "name": "firecrawl",
            "run_config": [
                {
                    "command": "npx -y firecrawl-mcp",
                    "args": "FIRECRAWL_API_KEY=xxx",
                    "port": 8005
                }
            ]
        }
    ]
}
```
Save this config file in the
`configs` folder and launch it using:

```bash
sh launch_mcps_as_sse.sh YOUR_CONFIG_FILE
```

For example, save the above configuration as `configs/firecrawl.json` and launch it with:

```bash
sh launch_mcps_as_sse.sh firecrawl.json
```

## Launch Evaluation
To evaluate an MCP Server's performance, you need to provide the server's connection information. The framework automatically detects the tools and parameters exposed by the server, so you don't need to configure them manually. For example:
```json
{
    "mcp_pool": [
        {
            "name": "Remote MCP example",
            "url": "url from https://modelscope.cn/mcp or https://smithery.ai"
        },
        {
            "name": "firecrawl (Local run example)",
            "run_config": [
                {
                    "command": "npx -y firecrawl-mcp",
                    "args": "FIRECRAWL_API_KEY=xxx",
                    "port": 8005
                }
            ]
        }
    ]
}
```

To evaluate the MCP Server's performance on WebSearch tasks:
```bash
sh evaluation_websearch.sh YOUR_CONFIG_FILE
```

To evaluate the MCP Server's performance on Database Query tasks:
```bash
sh evaluation_db.sh YOUR_CONFIG_FILE
```

To evaluate the MCP Server's performance on GAIA tasks:
```bash
sh evaluation_gaia.sh YOUR_CONFIG_FILE
```

For example, save the above configuration as `configs/firecrawl.json` and launch the evaluation with:

```bash
sh evaluation_websearch.sh firecrawl.json
```

# 🧂 Datasets and Experimental Results
Our framework provides two datasets for evaluation. For the WebSearch task, the dataset is located at `MCPBench/langProBe/WebSearch/data/websearch_600.jsonl`, containing 200 QA pairs each from [Frames](https://arxiv.org/abs/2409.12941), news, and technology domains.
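The datasets are JSON Lines files (one JSON object per line). A minimal Python sketch for appending and reading records in the `unique_id`/`Prompt`/`Answer` format used for custom datasets; the file name and example values here are illustrative, not part of MCPBench:

```python
import json


def append_record(path, unique_id, prompt, answer):
    """Append one QA record to a JSONL dataset as a single JSON line."""
    record = {"unique_id": unique_id, "Prompt": prompt, "Answer": answer}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")


def load_records(path):
    """Parse every non-empty line of a JSONL file back into a dict."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative usage with a hypothetical dataset file.
append_record("my_dataset.jsonl", "q-001", "How many cars were sold in 2023?", "42")
records = load_records("my_dataset.jsonl")
```

Each line must be a complete JSON object on its own, which is why records are serialized with `json.dumps` rather than written as a pretty-printed array.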
Our framework for automatically constructing evaluation datasets will be open-sourced later.

For the Database Query task, the dataset is located at `MCPBench/langProBe/DB/data/car_bi.jsonl`. You can add your own dataset in the following format:

```json
{
  "unique_id": "",
  "Prompt": "",
  "Answer": ""
}
```

We have evaluated mainstream MCP Servers on both tasks. For detailed experimental results, please refer to the [evaluation report](https://arxiv.org/abs/2504.11094).

# 🚰 Cite
If you find this work useful, please consider citing our project or giving us a 🌟:

```bibtex
@misc{mcpbench,
  title={MCPBench: A Benchmark for Evaluating MCP Servers},
  author={Zhiling Luo and Xiaorong Shi and Xuanrui Lin and Jinyang Gao},
  howpublished={\url{https://github.com/modelscope/MCPBench}},
  year={2025}
}
```

Alternatively, you may cite our evaluation report:
```bibtex
@article{mcpbench_report,
  title={Evaluation Report on MCP Servers},
  author={Zhiling Luo and Xiaorong Shi and Xuanrui Lin and Jinyang Gao},
  year={2025},
  journal={arXiv preprint arXiv:2504.11094},
  url={https://arxiv.org/abs/2504.11094},
  primaryClass={cs.AI}
}
```

[docs-image]: https://img.shields.io/badge/Documentation-EB3ECC
[docs-url]: https://arxiv.org/abs/2504.11094
[package-license-image]: https://img.shields.io/badge/License-Apache_2.0-blue.svg
[package-license-url]: https://github.com/modelscope/MCPBench/blob/main/LICENSE