{"id":14081239,"url":"https://github.com/modelscope/evalscope","last_synced_at":"2026-01-05T09:15:23.403Z","repository":{"id":212316127,"uuid":"728528910","full_name":"modelscope/evalscope","owner":"modelscope","description":"A streamlined and customizable framework for efficient large model evaluation and performance benchmarking","archived":false,"fork":false,"pushed_at":"2025-05-13T12:21:29.000Z","size":61008,"stargazers_count":941,"open_issues_count":55,"forks_count":103,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-05-13T13:27:41.273Z","etag":null,"topics":["evaluation","llm","performance","rag","vlm"],"latest_commit_sha":null,"homepage":"https://evalscope.readthedocs.io/en/latest/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/modelscope.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-07T06:10:49.000Z","updated_at":"2025-05-13T10:17:26.000Z","dependencies_parsed_at":"2024-02-01T07:30:41.449Z","dependency_job_id":"1c0cb949-861a-485b-9f29-f22d0f39a7e9","html_url":"https://github.com/modelscope/evalscope","commit_stats":{"total_commits":176,"total_committers":10,"mean_commits":17.6,"dds":0.3579545454545454,"last_synced_commit":"365aa5966bb4f58762f5647d97094200fe4940b3"},"previous_names":["modelscope/llmuses","modelscope/eval-scope"],"tags_count":27,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2Fevalscope","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/modelscope%2Fevalscope/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2Fevalscope/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/modelscope%2Fevalscope/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/modelscope","download_url":"https://codeload.github.com/modelscope/evalscope/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254101558,"owners_count":22014908,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["evaluation","llm","performance","rag","vlm"],"created_at":"2024-08-13T13:00:35.094Z","updated_at":"2026-01-05T09:15:23.396Z","avatar_url":"https://github.com/modelscope.png","language":"Python","funding_links":[],"categories":["A01_文本生成_文本对话","Python","Evaluation and Monitoring","Tools","5. 数据集","评估 Evaluation","9. 
Evaluation, Benchmarks \u0026 Datasets"],"sub_categories":["大语言对话模型及数据","LLM Evaluations and Benchmarks","5.1 评测基准"],"readme":"\u003cp align=\"center\"\u003e\n    \u003cbr\u003e\n    \u003cimg src=\"docs/en/_static/images/evalscope_logo.png\"/\u003e\n    \u003cbr\u003e\n\u003cp\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"README_zh.md\"\u003e中文\u003c/a\u003e \u0026nbsp ｜ \u0026nbsp English \u0026nbsp\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"https://img.shields.io/badge/python-%E2%89%A53.10-5be.svg\"\u003e\n\u003ca href=\"https://badge.fury.io/py/evalscope\"\u003e\u003cimg src=\"https://badge.fury.io/py/evalscope.svg\" alt=\"PyPI version\" height=\"18\"\u003e\u003c/a\u003e\n\u003ca href=\"https://pypi.org/project/evalscope\"\u003e\u003cimg alt=\"PyPI - Downloads\" src=\"https://static.pepy.tech/badge/evalscope\"\u003e\u003c/a\u003e\n\u003ca href=\"https://github.com/modelscope/evalscope/pulls\"\u003e\u003cimg src=\"https://img.shields.io/badge/PR-welcome-55EB99.svg\"\u003e\u003c/a\u003e\n\u003ca href='https://evalscope.readthedocs.io/en/latest/?badge=latest'\u003e\u003cimg src='https://readthedocs.org/projects/evalscope/badge/?version=latest' alt='Documentation Status' /\u003e\u003c/a\u003e\n\u003cp\u003e\n\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://evalscope.readthedocs.io/zh-cn/latest/\"\u003e 📖  Chinese Documentation\u003c/a\u003e \u0026nbsp ｜ \u0026nbsp \u003ca href=\"https://evalscope.readthedocs.io/en/latest/\"\u003e 📖  English Documentation\u003c/a\u003e\n\u003cp\u003e\n\n\n\u003e ⭐ If you like this project, please click the \"Star\" button in the upper right corner to support us. 
Your support is our motivation to move forward!\n\n## 📝 Introduction\n\nEvalScope is a powerful and easily extensible model evaluation framework created by the [ModelScope Community](https://modelscope.cn/), aiming to provide a one-stop evaluation solution for large model developers.\n\nWhether you want to evaluate the general capabilities of models, conduct multi-model performance comparisons, or need to stress test models, EvalScope can meet your needs.\n\n## ✨ Key Features\n\n- **📚 Comprehensive Evaluation Benchmarks**: Built-in multiple industry-recognized evaluation benchmarks including MMLU, C-Eval, GSM8K, and more.\n- **🧩 Multi-modal and Multi-domain Support**: Supports evaluation of various model types including Large Language Models (LLM), Vision Language Models (VLM), Embedding, Reranker, AIGC, and more.\n- **🚀 Multi-backend Integration**: Seamlessly integrates multiple evaluation backends including OpenCompass, VLMEvalKit, RAGEval to meet different evaluation needs.\n- **⚡ Inference Performance Testing**: Provides powerful model service stress testing tools, supporting multiple performance metrics such as TTFT, TPOT.\n- **📊 Interactive Reports**: Provides WebUI visualization interface, supporting multi-dimensional model comparison, report overview and detailed inspection.\n- **⚔️ Arena Mode**: Supports multi-model battles (Pairwise Battle), intuitively ranking and evaluating models.\n- **🔧 Highly Extensible**: Developers can easily add custom datasets, models and evaluation metrics.\n\n\u003cdetails\u003e\u003csummary\u003e🏛️ Overall Architecture\u003c/summary\u003e\n\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"https://sail-moe.oss-cn-hangzhou.aliyuncs.com/yunlin/images/evalscope/doc/EvalScope%E6%9E%B6%E6%9E%84%E5%9B%BE.png\" style=\"width: 70%;\"\u003e\n    \u003cbr\u003eEvalScope Overall Architecture.\n\u003c/p\u003e\n\n1.  
**Input Layer**\n    - **Model Sources**: API models (OpenAI API), Local models (ModelScope)\n    - **Datasets**: Standard evaluation benchmarks (MMLU/GSM8K etc.), Custom data (MCQ/QA)\n\n2.  **Core Functions**\n    - **Multi-backend Evaluation**: Native backend, OpenCompass, MTEB, VLMEvalKit, RAGAS\n    - **Performance Monitoring**: Supports multiple model service APIs and data formats, tracking TTFT/TPOT and other metrics\n    - **Tool Extensions**: Integrates Tool-Bench, Needle-in-a-Haystack, etc.\n\n3.  **Output Layer**\n    - **Structured Reports**: Supports JSON, Table, Logs\n    - **Visualization Platform**: Supports Gradio, Wandb, SwanLab\n\n\u003c/details\u003e\n\n## 🎉 What's New\n\n\u003e [!IMPORTANT]\n\u003e **Version 1.0 Refactoring**\n\u003e\n\u003e Version 1.0 introduces a major overhaul of the evaluation framework, establishing a new, more modular and extensible API layer under `evalscope/api`. Key improvements include standardized data models for benchmarks, samples, and results; a registry-based design for components such as benchmarks and metrics; and a rewritten core evaluator that orchestrates the new architecture. Existing benchmark adapters have been migrated to this API, resulting in cleaner, more consistent, and easier-to-maintain implementations.\n\n- 🔥 **[2025.12.26]** Added support for Terminal-Bench-2.0, which evaluates AI Agent performance on 89 real-world multi-step terminal tasks. Refer to the [usage documentation](https://evalscope.readthedocs.io/en/latest/third_party/terminal_bench.html).\n- 🔥 **[2025.12.18]** Added support for SLA auto-tuning of model API services, which automatically tests the maximum concurrency a model service can sustain under specified latency, TTFT, and throughput conditions. 
Refer to the [usage documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/sla_auto_tune.html).\n- 🔥 **[2025.12.16]** Added support for audio evaluation benchmarks such as Fleurs, LibriSpeech; added support for multilingual code evaluation benchmarks such as MultiplE, MBPP.\n- 🔥 **[2025.12.02]** Added support for custom multimodal VQA evaluation; refer to the [usage documentation](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/vlm.html). Added support for visualizing model service stress testing in ClearML; refer to the [usage documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#clearml).\n- 🔥 **[2025.11.26]** Added support for OpenAI-MRCR, GSM8K-V, MGSM, MicroVQA, IFBench, SciCode benchmarks.\n- 🔥 **[2025.11.18]** Added support for custom Function-Call (tool invocation) datasets to test whether models can timely and correctly call tools. Refer to the [usage documentation](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#function-calling-format-fc).\n- 🔥 **[2025.11.14]** Added support for SWE-bench_Verified, SWE-bench_Lite, SWE-bench_Verified_mini code evaluation benchmarks. Refer to the [usage documentation](https://evalscope.readthedocs.io/en/latest/third_party/swe_bench.html).\n- 🔥 **[2025.11.12]** Added `pass@k`, `vote@k`, `pass^k` and other metric aggregation methods; added support for multimodal evaluation benchmarks such as A_OKVQA, CMMU, ScienceQA, V*Bench.\n- 🔥 **[2025.11.07]** Added support for τ²-bench, an extended and enhanced version of τ-bench that includes a series of code fixes and adds telecom domain troubleshooting scenarios. Refer to the [usage documentation](https://evalscope.readthedocs.io/en/latest/third_party/tau2_bench.html).\n- 🔥 **[2025.10.30]** Added support for BFCL-v4, enabling evaluation of agent capabilities including web search and long-term memory. 
See the [usage documentation](https://evalscope.readthedocs.io/en/latest/third_party/bfcl_v4.html).\n- 🔥 **[2025.10.27]** Added support for LogiQA, HaluEval, MathQA, MRI-QA, PIQA, QASC, CommonsenseQA and other evaluation benchmarks. Thanks to @[penguinwang96825](https://github.com/penguinwang96825) for the code implementation.\n- 🔥 **[2025.10.26]** Added support for Conll-2003, CrossNER, Copious, GeniaNER, HarveyNER, MIT-Movie-Trivia, MIT-Restaurant, OntoNotes5, WNUT2017 and other Named Entity Recognition evaluation benchmarks. Thanks to @[penguinwang96825](https://github.com/penguinwang96825) for the code implementation.\n- 🔥 **[2025.10.21]** Optimized sandbox environment usage in code evaluation, supporting both local and remote operation modes. For details, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/sandbox.html).\n- 🔥 **[2025.10.20]** Added support for evaluation benchmarks including PolyMath, SimpleVQA, MathVerse, MathVision, AA-LCR; optimized evalscope perf performance to align with vLLM Bench. For details, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/vs_vllm_bench.html).\n- 🔥 **[2025.10.14]** Added support for OCRBench, OCRBench-v2, DocVQA, InfoVQA, ChartQA, and BLINK multimodal image-text evaluation benchmarks.\n- 🔥 **[2025.09.22]** Code evaluation benchmarks (HumanEval, LiveCodeBench) now support running in a sandbox environment. To use this feature, please install [ms-enclave](https://github.com/modelscope/ms-enclave) first.\n- 🔥 **[2025.09.19]** Added support for multimodal image-text evaluation benchmarks including RealWorldQA, AI2D, MMStar, MMBench, and OmniBench, as well as pure text evaluation benchmarks such as Multi-IF, HealthBench, and AMC.\n- 🔥 **[2025.09.05]** Added support for vision-language multimodal model evaluation tasks, such as MathVista and MMMU. 
For more supported datasets, please [refer to the documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/vlm.html).\n- 🔥 **[2025.09.04]** Added support for image editing task evaluation, including the [GEdit-Bench](https://modelscope.cn/datasets/stepfun-ai/GEdit-Bench) benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/image_edit.html).\n- 🔥 **[2025.08.22]** Version 1.0 Refactoring. This release contains breaking changes; please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#switching-to-version-v1-0).\n\u003cdetails\u003e\u003csummary\u003eMore\u003c/summary\u003e\n\n- 🔥 **[2025.07.18]** Model stress testing now supports randomly generating image-text data for multimodal model evaluation. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#id4).\n- 🔥 **[2025.07.16]** Added support for [τ-bench](https://github.com/sierra-research/tau-bench), enabling the evaluation of AI Agent performance and reliability in real-world scenarios involving dynamic user and tool interactions. For usage instructions, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#bench).\n- 🔥 **[2025.07.14]** Added support for \"Humanity's Last Exam\" ([Humanity's-Last-Exam](https://modelscope.cn/datasets/cais/hle)), a highly challenging evaluation benchmark. For usage instructions, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/llm.html#humanity-s-last-exam).\n- 🔥 **[2025.07.03]** Refactored Arena Mode: now supports custom model battles, outputs a model leaderboard, and provides battle result visualization. 
See [reference](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html) for details.\n- 🔥 **[2025.06.28]** Optimized custom dataset evaluation: now supports evaluation without reference answers. Enhanced LLM judge usage, with built-in modes for \"scoring directly without reference answers\" and \"checking answer consistency with reference answers\". See [reference](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/llm.html#qa) for details.\n- 🔥 **[2025.06.19]** Added support for the [BFCL-v3](https://modelscope.cn/datasets/AI-ModelScope/bfcl_v3) benchmark, designed to evaluate model function-calling capabilities across various scenarios. For more information, refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/bfcl_v3.html).\n- 🔥 **[2025.06.02]** Added support for the Needle-in-a-Haystack test. Simply specify `needle_haystack` to conduct the test, and a corresponding heatmap will be generated in the `outputs/reports` folder, providing a visual representation of the model's performance. Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/needle_haystack.html) for more details.\n- 🔥 **[2025.05.29]** Added support for two long document evaluation benchmarks: [DocMath](https://modelscope.cn/datasets/yale-nlp/DocMath-Eval/summary) and [FRAMES](https://modelscope.cn/datasets/iic/frames/summary). For usage guidelines, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/index.html).\n- 🔥 **[2025.05.16]** Model service performance stress testing now supports setting various levels of concurrency and outputs a performance test report. [Reference example](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/quick_start.html#id3).\n- 🔥 **[2025.05.13]** Added support for the [ToolBench-Static](https://modelscope.cn/datasets/AI-ModelScope/ToolBench-Static) dataset to evaluate model's tool-calling capabilities. 
Refer to the [documentation](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html) for usage instructions. Also added support for the [DROP](https://modelscope.cn/datasets/AI-ModelScope/DROP/dataPeview) and [Winogrande](https://modelscope.cn/datasets/AI-ModelScope/winogrande_val) benchmarks to assess the reasoning capabilities of models.\n- 🔥 **[2025.04.29]** Added Qwen3 Evaluation Best Practices, [welcome to read 📖](https://evalscope.readthedocs.io/en/latest/best_practice/qwen3.html)\n- 🔥 **[2025.04.27]** Support for text-to-image evaluation: Supports 8 metrics including MPS, HPSv2.1Score, etc., and evaluation benchmarks such as EvalMuse, GenAI-Bench. Refer to the [user documentation](https://evalscope.readthedocs.io/en/latest/user_guides/aigc/t2i.html) for more details.\n- 🔥 **[2025.04.10]** Model service stress testing tool now supports the `/v1/completions` endpoint (the default endpoint for vLLM benchmarking)\n- 🔥 **[2025.04.08]** Support for evaluating embedding model services compatible with the OpenAI API has been added. For more details, check the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html#configure-evaluation-parameters).\n- 🔥 **[2025.03.27]** Added support for [AlpacaEval](https://www.modelscope.cn/datasets/AI-ModelScope/alpaca_eval/dataPeview) and [ArenaHard](https://modelscope.cn/datasets/AI-ModelScope/arena-hard-auto-v0.1/summary) evaluation benchmarks. For usage notes, please refer to the [documentation](https://evalscope.readthedocs.io/en/latest/get_started/supported_dataset/index.html)\n- 🔥 **[2025.03.20]** The model inference service stress testing now supports generating prompts of specified length using random values. 
Refer to the [user guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/examples.html#using-the-random-dataset) for more details.\n- 🔥 **[2025.03.13]** Added support for the [LiveCodeBench](https://www.modelscope.cn/datasets/AI-ModelScope/code_generation_lite/summary) code evaluation benchmark, which can be used by specifying `live_code_bench`. Supports evaluating QwQ-32B on LiveCodeBench, refer to the [best practices](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html).\n- 🔥 **[2025.03.11]** Added support for the [SimpleQA](https://modelscope.cn/datasets/AI-ModelScope/SimpleQA/summary) and [Chinese SimpleQA](https://modelscope.cn/datasets/AI-ModelScope/Chinese-SimpleQA/summary) evaluation benchmarks. These are used to assess the factual accuracy of models, and you can specify `simple_qa` and `chinese_simpleqa` for use. Support for specifying a judge model is also available. For more details, refer to the [relevant parameter documentation](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html).\n- 🔥 **[2025.03.07]** Added support for the [QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B/summary) model, evaluate the model's reasoning ability and reasoning efficiency, refer to [📖 Best Practices for QwQ-32B Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/eval_qwq.html) for more details.\n- 🔥 **[2025.03.04]** Added support for the [SuperGPQA](https://modelscope.cn/datasets/m-a-p/SuperGPQA/summary) dataset, which covers 13 categories, 72 first-level disciplines, and 285 second-level disciplines, totaling 26,529 questions. You can use it by specifying `super_gpqa`.\n- 🔥 **[2025.03.03]** Added support for evaluating the IQ and EQ of models. Refer to [📖 Best Practices for IQ and EQ Evaluation](https://evalscope.readthedocs.io/en/latest/best_practice/iquiz.html) to find out how smart your AI is!\n- 🔥 **[2025.02.27]** Added support for evaluating the reasoning efficiency of models. 
Refer to [📖 Best Practices for Evaluating Thinking Efficiency](https://evalscope.readthedocs.io/en/latest/best_practice/think_eval.html). This implementation is inspired by the works [Overthinking](https://doi.org/10.48550/arXiv.2412.21187) and [Underthinking](https://doi.org/10.48550/arXiv.2501.18585).\n- 🔥 **[2025.02.25]** Added support for two reasoning-related evaluation benchmarks: [MuSR](https://modelscope.cn/datasets/AI-ModelScope/MuSR) and [ProcessBench](https://www.modelscope.cn/datasets/Qwen/ProcessBench/summary). To use them, simply specify `musr` and `process_bench` respectively in the datasets parameter.\n- 🔥 **[2025.02.18]** Added support for the AIME25 dataset, which contains 15 questions (Grok3 scored 93 on this dataset).\n- 🔥 **[2025.02.13]** Added support for evaluating DeepSeek distilled models, including the AIME24, MATH-500, and GPQA-Diamond datasets; refer to the [best practice](https://evalscope.readthedocs.io/en/latest/best_practice/deepseek_r1_distill.html). Added support for specifying the `eval_batch_size` parameter to accelerate model evaluation.\n- 🔥 **[2025.01.20]** Added support for visualizing evaluation results, including single-model evaluation results and multi-model comparison; refer to [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html) for more details. Added an [`iquiz`](https://modelscope.cn/datasets/AI-ModelScope/IQuiz/summary) evaluation example, evaluating the IQ and EQ of the model.\n- 🔥 **[2025.01.07]** Native backend: Support for model API evaluation is now available. Refer to the [📖 Model API Evaluation Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html#api) for more details. 
Additionally, support for the `ifeval` evaluation benchmark has been added.\n- 🔥🔥 **[2024.12.31]** Support for adding benchmark evaluations, refer to the [📖 Benchmark Evaluation Addition Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html); support for custom mixed dataset evaluations, allowing for more comprehensive model evaluations with less data, refer to the [📖 Mixed Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/collection/index.html).\n- 🔥 **[2024.12.13]** Model evaluation optimization: no need to pass the `--template-type` parameter anymore; supports starting evaluation with `evalscope eval --args`. Refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/get_started/basic_usage.html) for more details.\n- 🔥 **[2024.11.26]** The model inference service performance evaluator has been completely refactored: it now supports local inference service startup and Speed Benchmark; asynchronous call error handling has been optimized. 
For more details, refer to the [📖 User Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).\n- 🔥 **[2024.10.31]** The best practice for evaluating Multimodal-RAG has been updated, please check the [📖 Blog](https://evalscope.readthedocs.io/zh-cn/latest/blog/RAG/multimodal_RAG.html#multimodal-rag) for more details.\n- 🔥 **[2024.10.23]** Supports multimodal RAG evaluation, including the assessment of image-text retrieval using [CLIP_Benchmark](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/clip_benchmark.html), and extends [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html) to support end-to-end multimodal metrics evaluation.\n- 🔥 **[2024.10.8]** Support for RAG evaluation, including independent evaluation of embedding models and rerankers using [MTEB/CMTEB](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/mteb.html), as well as end-to-end evaluation using [RAGAS](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/ragas.html).\n- 🔥 **[2024.09.18]** Our documentation has been updated to include a blog module, featuring some technical research and discussions related to evaluations. We invite you to [📖 read it](https://evalscope.readthedocs.io/en/refact_readme/blog/index.html).\n- 🔥 **[2024.09.12]** Support for LongWriter evaluation, which supports 10,000+ word generation. You can use the benchmark [LongBench-Write](evalscope/third_party/longbench_write/README.md) to measure the long output quality as well as the output length.\n- 🔥 **[2024.08.30]** Support for custom dataset evaluations, including text datasets and multimodal image-text datasets.\n- 🔥 **[2024.08.20]** Updated the official documentation, including getting started guides, best practices, and FAQs. 
Feel free to [📖read it here](https://evalscope.readthedocs.io/en/latest/)!\n- 🔥 **[2024.08.09]** Simplified the installation process, allowing for pypi installation of vlmeval dependencies; optimized the multimodal model evaluation experience, achieving up to 10x acceleration based on the OpenAI API evaluation chain.\n- 🔥 **[2024.07.31]** Important change: The package name `llmuses` has been changed to `evalscope`. Please update your code accordingly.\n- 🔥 **[2024.07.26]** Support for **VLMEvalKit** as a third-party evaluation framework to initiate multimodal model evaluation tasks.\n- 🔥 **[2024.06.29]** Support for **OpenCompass** as a third-party evaluation framework, which we have encapsulated at a higher level, supporting pip installation and simplifying evaluation task configuration.\n- 🔥 **[2024.06.13]** EvalScope seamlessly integrates with the fine-tuning framework SWIFT, providing full-chain support from LLM training to evaluation.\n- 🔥 **[2024.06.13]** Integrated the Agent evaluation dataset ToolBench.\n\n\u003c/details\u003e\n\n## ❤️ Community \u0026 Support\n\nWelcome to join our community to communicate with other developers and get help.\n\n[Discord Group](https://discord.com/invite/D27yfEFVz5)              |  WeChat Group | DingTalk Group\n:-------------------------:|:-------------------------:|:-------------------------:\n\u003cimg src=\"docs/asset/discord_qr.jpg\" width=\"160\" height=\"160\"\u003e  |  \u003cimg src=\"docs/asset/wechat.png\" width=\"160\" height=\"160\"\u003e | \u003cimg src=\"docs/asset/dingding.png\" width=\"160\" height=\"160\"\u003e\n\n\n\n## 🛠️ Environment Setup\n\nWe recommend using `conda` to create a virtual environment and install with `pip`.\n\n1.  **Create and Activate Conda Environment** (Python 3.10 recommended)\n    ```shell\n    conda create -n evalscope python=3.10\n    conda activate evalscope\n    ```\n\n2.  
**Install EvalScope**\n\n    - **Method 1: Install via PyPI (Recommended)**\n      ```shell\n      pip install evalscope\n      ```\n\n    - **Method 2: Install from Source (For Development)**\n      ```shell\n      git clone https://github.com/modelscope/evalscope.git\n      cd evalscope\n      pip install -e .\n      ```\n\n3.  **Install Additional Dependencies** (Optional)\n    Install corresponding feature extensions according to your needs:\n    ```shell\n    # Performance testing\n    pip install 'evalscope[perf]'\n\n    # Visualization App\n    pip install 'evalscope[app]'\n\n    # Other evaluation backends\n    pip install 'evalscope[opencompass]'\n    pip install 'evalscope[vlmeval]'\n    pip install 'evalscope[rag]'\n\n    # Install all dependencies\n    pip install 'evalscope[all]'\n    ```\n    \u003e If you installed from source, please replace `evalscope` with `.`, for example `pip install '.[perf]'`.\n\n\u003e [!NOTE]\n\u003e This project was formerly known as `llmuses`. If you need to use `v0.4.3` or earlier versions, please run `pip install llmuses\u003c=0.4.3` and use `from llmuses import ...` for imports.\n\n\n## 🚀 Quick Start\n\nYou can start evaluation tasks in two ways: **command line** or **Python code**.\n\n### Method 1. Using Command Line\n\nExecute the `evalscope eval` command in any path to start evaluation. The following command will evaluate the `Qwen/Qwen2.5-0.5B-Instruct` model on `gsm8k` and `arc` datasets, taking only 5 samples from each dataset.\n\n```bash\nevalscope eval \\\n --model Qwen/Qwen2.5-0.5B-Instruct \\\n --datasets gsm8k arc \\\n --limit 5\n```\n\n### Method 2. 
Using Python Code\n\nUse the `run_task` function and `TaskConfig` object to configure and start evaluation tasks.\n\n```python\nfrom evalscope import run_task, TaskConfig\n\n# Configure evaluation task\ntask_cfg = TaskConfig(\n    model='Qwen/Qwen2.5-0.5B-Instruct',\n    datasets=['gsm8k', 'arc'],\n    limit=5\n)\n\n# Start evaluation\nrun_task(task_cfg)\n```\n\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003e💡 Tip:\u003c/b\u003e `run_task` also supports dictionaries, YAML or JSON files as configuration.\u003c/summary\u003e\n\n**Using Python Dictionary**\n\n```python\nfrom evalscope.run import run_task\n\ntask_cfg = {\n    'model': 'Qwen/Qwen2.5-0.5B-Instruct',\n    'datasets': ['gsm8k', 'arc'],\n    'limit': 5\n}\nrun_task(task_cfg=task_cfg)\n```\n\n**Using YAML File** (`config.yaml`)\n```yaml\nmodel: Qwen/Qwen2.5-0.5B-Instruct\ndatasets:\n  - gsm8k\n  - arc\nlimit: 5\n```\n```python\nfrom evalscope.run import run_task\n\nrun_task(task_cfg=\"config.yaml\")\n```\n\u003c/details\u003e\n\n### Output Results\nAfter evaluation completion, you will see a report in the terminal in the following format:\n```text\n+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+\n| Model Name            | Dataset Name   | Metric Name     | Category Name   | Subset Name   |   Num |   Score |\n+=======================+================+=================+=================+===============+=======+=========+\n| Qwen2.5-0.5B-Instruct | gsm8k          | AverageAccuracy | default         | main          |     5 |     0.4 |\n+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+\n| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default         | ARC-Easy      |     5 |     0.8 |\n+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+\n| Qwen2.5-0.5B-Instruct | ai2_arc        | AverageAccuracy | default       
  | ARC-Challenge |     5 |     0.4 |\n+-----------------------+----------------+-----------------+-----------------+---------------+-------+---------+\n```\n\n## 📈 Advanced Usage\n\n### Custom Evaluation Parameters\n\nYou can fine-tune model loading, inference, and dataset configuration through command line parameters.\n\n```shell\nevalscope eval \\\n --model Qwen/Qwen3-0.6B \\\n --model-args '{\"revision\": \"master\", \"precision\": \"torch.float16\", \"device_map\": \"auto\"}' \\\n --generation-config '{\"do_sample\":true,\"temperature\":0.6,\"max_tokens\":512}' \\\n --dataset-args '{\"gsm8k\": {\"few_shot_num\": 0, \"few_shot_random\": false}}' \\\n --datasets gsm8k \\\n --limit 10\n```\n\n- `--model-args`: Model loading parameters such as `revision`, `precision`, etc.\n- `--generation-config`: Model generation parameters such as `temperature`, `max_tokens`, etc.\n- `--dataset-args`: Dataset configuration parameters such as `few_shot_num`, etc.\n\nFor details, please refer to [📖 Complete Parameter Guide](https://evalscope.readthedocs.io/en/latest/get_started/parameters.html).\n\n### Evaluating Online Model APIs\n\nEvalScope supports evaluating model services deployed via APIs (such as services deployed with vLLM). Simply specify the service address and API Key.\n\n1.  **Start Model Service** (using vLLM as example)\n    ```shell\n    export VLLM_USE_MODELSCOPE=True\n    python -m vllm.entrypoints.openai.api_server \\\n      --model Qwen/Qwen2.5-0.5B-Instruct \\\n      --served-model-name qwen2.5 \\\n      --port 8801\n    ```\n\n2.  
**Run Evaluation**\n    ```shell\n    evalscope eval \\\n     --model qwen2.5 \\\n     --eval-type openai_api \\\n     --api-url http://127.0.0.1:8801/v1 \\\n     --api-key EMPTY \\\n     --datasets gsm8k \\\n     --limit 10\n    ```\n\n### ⚔️ Arena Mode\n\nArena mode evaluates model performance through pairwise battles between models and reports win rates and rankings, which makes it well suited to side-by-side comparison of multiple models.\n\n```text\n# Example evaluation results\nModel           WinRate (%)  CI (%)\n------------  -------------  ---------------\nqwen2.5-72b            69.3  (-13.3 / +12.2)\nqwen2.5-7b             50    (+0.0 / +0.0)\nqwen2.5-0.5b            4.7  (-2.5 / +4.4)\n```\nFor details, please refer to [📖 Arena Mode Usage Guide](https://evalscope.readthedocs.io/en/latest/user_guides/arena.html).\n\n### 🖊️ Custom Dataset Evaluation\n\nEvalScope lets you add and evaluate your own datasets. For details, please refer to [📖 Custom Dataset Evaluation Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/custom_dataset/index.html).\n\n\n## 🧪 Other Evaluation Backends\nEvalScope supports launching evaluation tasks through third-party evaluation frameworks (we call them \"backends\") to meet diverse evaluation needs.\n\n- **Native**: EvalScope's default evaluation framework with comprehensive functionality.\n- **OpenCompass**: Focuses on text-only evaluation. [📖 Usage Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/opencompass_backend.html)\n- **VLMEvalKit**: Focuses on multi-modal evaluation. [📖 Usage Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/vlmevalkit_backend.html)\n- **RAGEval**: Focuses on RAG evaluation, supporting Embedding and Reranker models. 
[📖 Usage Guide](https://evalscope.readthedocs.io/en/latest/user_guides/backend/rageval_backend/index.html)\n- **Third-party Evaluation Tools**: Supports evaluation tasks like [ToolBench](https://evalscope.readthedocs.io/en/latest/third_party/toolbench.html).\n\n## ⚡ Inference Performance Evaluation Tool\nEvalScope provides a powerful stress testing tool for evaluating the performance of large language model services.\n\n- **Key Metrics**: Supports throughput (Tokens/s), first token latency (TTFT), token generation latency (TPOT), etc.\n- **Result Recording**: Supports recording results to `wandb` and `swanlab`.\n- **Speed Benchmarks**: Can generate speed benchmark results similar to official reports.\n\nFor details, please refer to [📖 Performance Testing Usage Guide](https://evalscope.readthedocs.io/en/latest/user_guides/stress_test/index.html).\n\nExample output is shown below:\n\u003cp align=\"center\"\u003e\n    \u003cimg src=\"docs/en/user_guides/stress_test/images/multi_perf.png\" style=\"width: 80%;\"\u003e\n\u003c/p\u003e\n\n\n## 📊 Visualizing Evaluation Results\n\nEvalScope provides a Gradio-based WebUI for interactive analysis and comparison of evaluation results.\n\n1.  **Install Dependencies**\n    ```bash\n    pip install 'evalscope[app]'\n    ```\n\n2.  
**Start Service**\n    ```bash\n    evalscope app\n    ```\n    Visit `http://127.0.0.1:7861` to open the visualization interface.\n\n\u003ctable\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\n      \u003cimg src=\"docs/en/get_started/images/setting.png\" alt=\"Setting\" style=\"width: 85%;\" /\u003e\n      \u003cp\u003eSettings Interface\u003c/p\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\n      \u003cimg src=\"docs/en/get_started/images/model_compare.png\" alt=\"Model Compare\" style=\"width: 100%;\" /\u003e\n      \u003cp\u003eModel Comparison\u003c/p\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\n      \u003cimg src=\"docs/en/get_started/images/report_overview.png\" alt=\"Report Overview\" style=\"width: 100%;\" /\u003e\n      \u003cp\u003eReport Overview\u003c/p\u003e\n    \u003c/td\u003e\n    \u003ctd style=\"text-align: center;\"\u003e\n      \u003cimg src=\"docs/en/get_started/images/report_details.png\" alt=\"Report Details\" style=\"width: 85%;\" /\u003e\n      \u003cp\u003eReport Details\u003c/p\u003e\n    \u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\nFor details, please refer to [📖 Visualizing Evaluation Results](https://evalscope.readthedocs.io/en/latest/get_started/visualization.html).\n\n## 👷‍♂️ Contributing\n\nWe welcome any contributions from the community! 
If you want to add new evaluation benchmarks, models, or features, please refer to our [Contributing Guide](https://evalscope.readthedocs.io/en/latest/advanced_guides/add_benchmark.html).\n\nThanks to all developers who have contributed to EvalScope!\n\n\u003ca href=\"https://github.com/modelscope/evalscope/graphs/contributors\" target=\"_blank\"\u003e\n  \u003ctable\u003e\n    \u003ctr\u003e\n      \u003cth colspan=\"2\"\u003e\n        \u003cbr\u003e\u003cimg src=\"https://contrib.rocks/image?repo=modelscope/evalscope\"\u003e\u003cbr\u003e\u003cbr\u003e\n      \u003c/th\u003e\n    \u003c/tr\u003e\n  \u003c/table\u003e\n\u003c/a\u003e\n\n\n## 📚 Citation\n\nIf you use EvalScope in your research, please cite our work:\n```bibtex\n@misc{evalscope_2024,\n    title={{EvalScope}: Evaluation Framework for Large Models},\n    author={ModelScope Team},\n    year={2024},\n    url={https://github.com/modelscope/evalscope}\n}\n```\n\n\n## ⭐ Star History\n\n[![Star History Chart](https://api.star-history.com/svg?repos=modelscope/evalscope\u0026type=Date)](https://star-history.com/#modelscope/evalscope\u0026Date)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodelscope%2Fevalscope","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmodelscope%2Fevalscope","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmodelscope%2Fevalscope/lists"}