{"id":13652661,"url":"https://github.com/OpenBMB/UltraEval","last_synced_at":"2025-04-23T03:31:17.585Z","repository":{"id":208819854,"uuid":"719091904","full_name":"OpenBMB/UltraEval","owner":"OpenBMB","description":"[ACL 2024 Demo] Official GitHub repo for UltraEval: An open source framework for evaluating foundation models.","archived":false,"fork":false,"pushed_at":"2024-09-17T15:37:35.000Z","size":147548,"stargazers_count":208,"open_issues_count":5,"forks_count":19,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-09-17T19:39:13.060Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenBMB.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-11-15T12:43:28.000Z","updated_at":"2024-09-17T15:37:42.000Z","dependencies_parsed_at":"2024-09-17T19:14:11.319Z","dependency_job_id":"ce7a2891-2c5d-49ac-85ca-c5967566598b","html_url":"https://github.com/OpenBMB/UltraEval","commit_stats":null,"previous_names":["openbmb/ultraeval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FUltraEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FUltraEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FUltraEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenBMB%2FUltraEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/ho
sts/GitHub/owners/OpenBMB","download_url":"https://codeload.github.com/OpenBMB/UltraEval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223909940,"owners_count":17223585,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-02T02:01:01.427Z","updated_at":"2024-11-10T03:31:17.670Z","avatar_url":"https://github.com/OpenBMB.png","language":"Python","readme":"\u003cdiv align=\"center\"\u003e\n  \u003cimg src=\"docs/pics/ultraeval_logo_white.jpg\" width=\"500px\"/\u003e\n  \u003cbr /\u003e\n  \u003cbr /\u003e\n\u003cp align=\"center\"\u003e\n \u003ca href=\"https://arxiv.org/abs/2404.07584\"\u003e📖Paper\u003c/a\u003e •\n \u003ca href=\"https://ultraeval.openbmb.cn/home\"\u003e 🌐Website\u003c/a\u003e •\n \u003ca href=\"#Overview\"\u003e📖Overview\u003c/a\u003e •\n \u003ca href=\"#Quick start\"\u003e🔧Quick start\u003c/a\u003e •\n \u003ca href=\"docs/tutorials/en/ultraeval.md\"\u003e🛠️Tutorials\u003c/a\u003e •\n \u003ca href=\"README_zh.md\"\u003e中文\u003c/a\u003e \n\u003c/p\u003e\n\u003c/div\u003e\n\n# Colab\n[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1hNXtaR3V-VgmG59QJMW7YPaAYgwtmyuH?usp=sharing)\n\nWe provide a Colab notebook to help you get started with UltraEval.\n\n# News!\n\n- \\[2024.6.4\\] **UltraEval** was accepted by ACL 2024 System Demonstration Track (SDT).🔥🔥🔥\n- \\[2024.4.11\\] We published the [UltraEval paper](https://arxiv.org/abs/2404.07584)🔥🔥🔥, and we welcome discussions and exchanges on this topic.\n- 
\\[2024.2.1\\] [MiniCPM](https://github.com/OpenBMB/MiniCPM) has been released🔥🔥🔥, using UltraEval as its evaluation framework!\n- \\[2023.11.23\\] We open-sourced the UltraEval evaluation framework and published the first version of the leaderboard.🔥🔥🔥\n\n# Overview\nUltraEval is an open-source framework for evaluating the capabilities of foundation models, providing a suite of lightweight, easy-to-use evaluation systems that support the performance assessment of mainstream LLMs.\n\nUltraEval's overall workflow is as follows:\n\u003cdiv align=\"center\"\u003e\n\u003cp align=\"center\"\u003e\n\u003cimg src=\"docs/pics/ultraeval_pipeline_white.png\" width=\"800px\"\u003e\n\u003c/p\u003e\n\u003c/div\u003e\n\nIts main features are as follows:\n1. **Lightweight and Easy-to-use Evaluation Framework:** Designed with an intuitive interface, minimal dependencies, effortless deployment, and excellent scalability, making it adaptable to diverse evaluation scenarios.\n\n2. **Flexible and Diverse Evaluation Methods:** Supports a unified prompt template with an extensive array of evaluation metrics, allowing for personalized customization to suit specific needs.\n\n3. **Efficient and Swift Inference Deployment:** Facilitates multiple model deployment strategies such as torch and vLLM, enabling multi-instance deployment for swift evaluation processes.\n\n4. **Publicly Transparent Open-Source Leaderboard:** Maintains an open, traceable, and reproducible evaluation leaderboard, driven by community updates to ensure transparency and credibility.\n\n5. **Official and Authoritative Evaluation Data:** Utilizes widely recognized official evaluation sets to ensure fairness and standardization in evaluations, ensuring results are comparable and reproducible.\n\n6. 
**Comprehensive and Extensive Model Support:** Offers support for a wide spectrum of models, including those from the Hugging Face open-source repository and personally trained models, ensuring comprehensive coverage.\n\n\n# Quick start\nWelcome to UltraEval, your assistant for evaluating the capabilities of large models. Get started in just a few simple steps:\n\n## 1. Install UltraEval\n\n```shell\ngit clone https://github.com/OpenBMB/UltraEval.git\ncd UltraEval\npip install .\n```\n## 2. Model evaluation\nEnter the UltraEval root directory; all of the following commands are executed from the root directory.\n\n### 2.1 Generate the evaluation task file\n\nDownload the datasets:\n```shell\nwget -O RawData.zip \"https://cloud.tsinghua.edu.cn/f/11d562a53e40411fb385/?dl=1\"\n```\nThe Google Drive link is [here](https://drive.google.com/file/d/1S0Ze9ToFC9V6CNzsI_e18ILMAk-l0Ckb/view?usp=drive_link).\n\nUnzip the evaluation datasets:\n```shell\nunzip RawData.zip\n```\nPreprocess the data:\n```shell\npython data_process.py\n```\nExecute the following command to display the supported datasets and their corresponding tasks:\n\n```shell\npython configs/show_datasets.py\n```\n\nSpecify the tasks to be evaluated with the following command:\n```shell\npython configs/make_config.py --datasets ALL\n```\nThe parameters are described below:\n* ``datasets``: Select the datasets; defaults to ALL (all datasets). Separate multiple datasets with commas. 
For example, --datasets MMLU,Ceval\n* ``tasks``: Select the task; the default value is empty.\n* ``method``: Select the generation method; the default value is gen.\n* ``save``: Select the filename for the generated evaluation file, which defaults to eval_config.json.\n\nNote ⚠️: When 'tasks' is specified, 'datasets' must contain exactly one dataset, indicating that certain tasks under that dataset are to be executed; 'save' should be a filename ending in .json, and there is no need to input a path, as it defaults to the 'configs' directory. Executing the above command will generate an evaluation file named 'eval_config.json' in the 'configs' directory.\n\nThe \"RawData.zip\" contains data collected from the official websites. To expedite the unzipping process, the 'Math' and 'race' data have been preprocessed (the zip file includes the code, facilitating replication by users).\n\n### 2.2 Deploy a model locally\nAs an example, deploy meta-llama/Llama-2-7b-hf using vLLM:\n```shell\npython URLs/vllm_url.py \\\n    --model_name meta-llama/Llama-2-7b-hf \\\n    --gpuid 0 \\\n    --port 5002\n```\nBelow is a description of the specific parameters:\n* ``model_name``: Model name; when using vLLM, it must match the official Hugging Face model name.\n* ``gpuid``: The GPU id(s) to deploy the model on; default is 0. Separate multiple ids with commas.\n* ``port``: The port number of the deployment URL; default 5002.\n\nExecuting the above code will generate a URL. 
For instance, the URL is `http://127.0.0.1:5002/infer`, where `5002` is the port number, and `/infer` is the URL path specified by the `@app.route(\"/infer\", methods=[\"POST\"])` decorator in the `URLs/vllm_url.py` file.\n\nFor evaluating individually trained models and for multi-GPU batch evaluation, see [Tutorial.md](./docs/tutorials/en/ultraeval.md).\n\n### 2.3 Run the evaluation and get the results\nCreate a bash script that executes the main.py program to run the evaluation:\n```shell\npython main.py \\\n    --model general \\\n    --model_args url=$URL,concurrency=1 \\\n    --config_path configs/eval_config.json \\\n    --output_base_path logs \\\n    --batch_size 1 \\\n    --postprocess general_torch \\\n    --params models/model_params/vllm_sample.json \\\n    --write_out\n```\nBelow is a description of the specific parameters:\n* ``model``: Specifies the model. Currently, the general, gpt-3.5-turbo, and gpt-4 models are supported.\n* ``model_args``: Specify the URL generated in 2.2 and the number of concurrent threads used to initialize the model parameters. Separate with commas, and connect parameter names and values with an equals sign. For example: url=$URL,concurrency=1.\n* ``config_path``: Specify the evaluation file path from 2.1, which by default is configs/eval_config.json.\n* ``output_base_path``: Specifies the path to save the results. The default is logs.\n* ``batch_size``: Specifies the batch size. The default is 1.\n* ``num_fewshot``: Specifies the number of few-shot samples.\n* ``postprocess``: Specifies the post-processing method, which defaults to general_torch.\n* ``params``: Specifies parameters for model inference. The default is models/model_params/vllm_sample.json.\n* ``write_out``: Whether to save the data for each instance. 
The default value is False.\n* ``limit``: Evaluate only a limited number of instances per task; defaults to None.\n\nEvaluation results are saved in the following directory structure:\n```shell\noutput_base_path    # Output path\n--timestamp # timestamp\n----task1   # Evaluation task\n--------config.json # Records the configuration of the evaluation task's parameters\n--------final_metrics.json  # The final result of the task\n--------instance.jsonl  # Detailed results for each instance of the task\n----....    # Other task directories\n----_all_results.json   # Aggregated results from all evaluation tasks\n```\n### 2.4 Additional evaluation features\nMore evaluation methods and features (custom evaluation sets, batch evaluation, multi-GPU acceleration) can be found in [Tutorials.md](./docs/tutorials/en/ultraeval.md).\n\n# Evaluation set support\n\nUltraEval currently supports 59 evaluation datasets and comprehensively measures large model capabilities through the following capability categories:\n\n\u003ctable border=\"1\"\u003e\n  \u003ctr\u003e\n    \u003cth\u003eFirst-level\u003c/th\u003e\n    \u003cth\u003eSecond-level\u003c/th\u003e\n    \u003cth\u003eDataset list\u003c/th\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd rowspan=\"2\"\u003e\u003cstrong\u003eKnowledge\u003c/strong\u003e\u003c/td\u003e\n    \u003ctd\u003eDisciplinary knowledge\u003c/td\u003e\n    \u003ctd\u003eCMMLU, MMLU, CEval, AGI-Eval, JEC-QA, MEDMCQA, MEDQA-MCMLE, MEDQA-USMLE, GAOKAO-Bench\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eWorld knowledge\u003c/td\u003e\n    \u003ctd\u003eNQ-open, TriviaQA, TruthfulQA\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cstrong\u003eMath\u003c/strong\u003e\u003c/td\u003e\n    \u003ctd\u003eMath\u003c/td\u003e\n    \u003ctd\u003eGSM8K, MATH\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003e\u003cstrong\u003eCode\u003c/strong\u003e\u003c/td\u003e\n    \u003ctd\u003eCode\u003c/td\u003e\n    
\u003ctd\u003eHumanEval, MBPP\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd rowspan=\"3\"\u003e\u003cstrong\u003eReason\u003c/strong\u003e\u003c/td\u003e\n    \u003ctd\u003eLogical reasoning\u003c/td\u003e\n    \u003ctd\u003eBBH\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eImplicative relation\u003c/td\u003e\n    \u003ctd\u003eAX-B, AX-G, CB, CMNLI, OCNLI, OCNLI-FC, RTE\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eCommon sense reasoning\u003c/td\u003e\n    \u003ctd\u003eHellaSwag, OpenBookQA, ARC-c, ARC-e, CommonsenseQA, COPA, PIQA, SIQA, WinoGrande, Story Cloze, StrategyQA, TheoremQA\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd rowspan=\"6\"\u003e\u003cstrong\u003eLanguage\u003c/strong\u003e\u003c/td\u003e\n    \u003ctd\u003eReading comprehension\u003c/td\u003e\n    \u003ctd\u003eboolq, C3, ChiD, DRCD, LAMBADA, MultiRC, QuAC, RACE, RECORD, SQuAD, TyDi QA, SummEdits\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eTranslation\u003c/td\u003e\n    \u003ctd\u003eFLORES, wmt20-en-zh\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSemantic similarity\u003c/td\u003e\n    \u003ctd\u003eAFQMC, BUSTM\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eWord sense disambiguation\u003c/td\u003e\n    \u003ctd\u003eCLUEWSC, WIC, Winogender, WSC\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eSentiment analysis\u003c/td\u003e\n    \u003ctd\u003eEPRSTMT\u003c/td\u003e\n  \u003c/tr\u003e\n  \u003ctr\u003e\n    \u003ctd\u003eNews classification\u003c/td\u003e\n    \u003ctd\u003eTNEWS\u003c/td\u003e\n  \u003c/tr\u003e\n\u003c/table\u003e\n\n\n# Leaderboard\nPlease visit the UltraEval [Leaderboard](https://ultraeval.openbmb.cn/rank) to learn about the latest models and their detailed results in each dimension.\n\n# Acknowledgement\n- [HuggingFace](https://huggingface.co)\n- 
[vLLM](https://github.com/vllm-project/vllm/blob/main)\n- [Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/master)\n- [OpenCompass](https://github.com/open-compass/opencompass)\n# Contact us\nIf you have questions, suggestions, or feature requests regarding UltraEval, please submit GitHub Issues to jointly build an open and transparent UltraEval evaluation community.\n# License\nThis project follows the Apache-2.0 license.\n\n# Citation \nPlease cite our paper if you use UltraEval.\n\n**BibTeX:**\n```bibtex\n@misc{he2024ultraeval,\n      title={UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs}, \n      author={Chaoqun He and Renjie Luo and Shengding Hu and Yuanqian Zhao and Jie Zhou and Hanghao Wu and Jiajie Zhang and Xu Han and Zhiyuan Liu and Maosong Sun},\n      year={2024},\n      eprint={2404.07584},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```","funding_links":[],"categories":["A01_文本生成_文本对话","Anthropomorphic-Taxonomy"],"sub_categories":["大语言对话模型及数据","Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenBMB%2FUltraEval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenBMB%2FUltraEval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenBMB%2FUltraEval/lists"}