{"id":13437509,"url":"https://github.com/OpenLMLab/GAOKAO-Bench","last_synced_at":"2025-03-19T06:31:17.443Z","repository":{"id":157305131,"uuid":"628471937","full_name":"OpenLMLab/GAOKAO-Bench","owner":"OpenLMLab","description":"GAOKAO-Bench is an evaluation framework that utilizes GAOKAO questions as a dataset to evaluate large language models.","archived":false,"fork":false,"pushed_at":"2025-01-07T02:59:39.000Z","size":13443,"stargazers_count":617,"open_issues_count":4,"forks_count":44,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-03-17T15:08:12.442Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/OpenLMLab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-04-16T03:36:14.000Z","updated_at":"2025-03-16T14:41:37.000Z","dependencies_parsed_at":null,"dependency_job_id":"597ca48f-1d90-4ba3-8a57-7bee3afd2f91","html_url":"https://github.com/OpenLMLab/GAOKAO-Bench","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FGAOKAO-Bench","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FGAOKAO-Bench/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FGAOKAO-Bench/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/OpenLMLab%2FGAOKAO-Bench/manifests","owner_url":"https://repos.ecosyste.ms/api/
v1/hosts/GitHub/owners/OpenLMLab","download_url":"https://codeload.github.com/OpenLMLab/GAOKAO-Bench/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244371142,"owners_count":20442335,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-07-31T03:00:57.840Z","updated_at":"2025-03-19T06:31:17.437Z","avatar_url":"https://github.com/OpenLMLab.png","language":"Python","funding_links":[],"categories":["Statistics","Datasets-or-Benchmark","NLP语料和数据集","Evaluation and Monitoring","Python"],"sub_categories":["通用","大语言对话模型及数据"],"readme":"# GAOKAO-Bench\n\nGAOKAO-Bench is an evaluation framework that uses questions from the Chinese college entrance examination (GAOKAO) as a dataset to assess the language understanding and logical reasoning abilities of large language models. [[Read In English]](./README_EN.md) [[paper]](https://arxiv.org/abs/2305.12474)\n\n## Updates\n\n[[GAOKAO-MM]](https://github.com/OpenMOSS/GAOKAO-MM): a multimodal dataset based on GAOKAO questions that evaluates the perception, understanding, knowledge, and reasoning abilities of multimodal models.\n\n[[GAOKAO-Bench-Updates]](https://github.com/OpenLMLab/GAOKAO-Bench-Updates): a supplement to GAOKAO-Bench that uses multiple-choice questions from the GAOKAO of 2023 and later as its dataset.\n\n## Introduction\n\nWe aim to build a standardized, comprehensive evaluation framework that assesses large language models thoroughly and accurately. In China, the GAOKAO is among the most standardized, most comprehensive, and most widely recognized examinations, so we use its questions to evaluate the abilities of large models. To that end, we collected questions from the 2010-2022 national GAOKAO papers, comprising 1781 objective questions and 1030 subjective questions, to form the data portion of GAOKAO-Bench.\n\n## Dataset\n\n| Question type     | Number of questions | Percentage |\n| ----------------- | ------------- | ------------- |\n| Objective questions | 1781     | 63.36%   |\n| Subjective questions | 1030     | 36.64%   |\n| **Total** | **2811** | **100%** |\n\nA sample data entry is shown below:\n\n- **Year**\n\n\u003e 2022\n\n- **Category**\n\n\u003e 全国甲卷 (National Paper A)\n\u003e\n\n- **Score**\n\n\u003e 5\n\n- **Question**\n\n\u003e If $z=-1+\\sqrt{3} \\mathrm{i}$, then $\\frac{z}{z \\bar{z}-1}=()$\n\u003e\n\u003e A. 
$-1+\\sqrt{3} \\mathrm{i}$\t\n\u003e\n\u003e B. $-1-\\sqrt{3} i$\t\n\u003e\n\u003e C. $-\\frac{1}{3}+\\frac{\\sqrt{3}}{3} \\mathrm{i}$\n\u003e\n\u003e D. $-\\frac{1}{3}-\\frac{\\sqrt{3}}{3} i$\n\u003e\n\n- **Analysis**\n\n\u003e [Detailed solution]\n\u003e\n\u003e $\\bar{z}=-1-\\sqrt{3} i, z \\bar{z}=(-1+\\sqrt{3} i)(-1-\\sqrt{3} i)=1+3=4$.\n\u003e\n\u003e $\\frac{z}{z \\bar{z}-1}=\\frac{-1+\\sqrt{3} \\mathrm{i}}{3}=-\\frac{1}{3}+\\frac{\\sqrt{3}}{3} \\mathrm{i}$\n\u003e\n\u003e Therefore, the answer is C.\n\u003e\n\n* **Standard Answer**\n\n\u003e C\n\n## Results\n\n### GAOKAO Total Score\n\nWe test all models in a zero-shot setting, using rule-based answer extraction for objective questions and human grading for subjective questions, and obtain converted GAOKAO total scores for GPT-4, GPT-3.5, and other models. The results show that GPT-4 achieves the highest converted total score, with 485 on the humanities track and 447 on the science track. In addition, every model scores higher on the humanities track than on the science track.\n\n\u003cimg src=\"./Graphs/histogram.png\" alt=\"histogram\" style=\"zoom:25%;\" /\u003e\n\n\u003cimg src=\"./Graphs/radar_obj_sub.png\" alt=\"radar_obj_sub\" style=\"zoom:40%;\" /\u003e\n\n\n\n### Objective Question Scoring Rate\n\n| **Models**                                    | **Overall** | **Chinese** | **Eng.**  | **Sci. Math** | **Hum. 
Math** | **Phys.** | **Chem.** | **Biol.** | **Poli.** | **Hist.** | **Geog.** |\n| --------------------------------------------- | ----------- | ----------- | --------- | ------------- | ------------- | --------- | --------- | --------- | --------- | --------- | --------- |\n| **GPT-4-0314**                                | **72.2%**   | **53.9%**   | 93.1%     | 53.7%         | 63.3%         | **55.5%** | 44.4%     | 80.7%     | 75.9%     | 75.6%     | 80.0%     |\n| **GPT-4-0613**                                | 71.6%       | 52.1%       | **93.2%** | **54.5%**     | **64.0%**     | 50.8%     | 43.6%     | **83.0%** | 72.5%     | 74.2%     | **81.1%** |\n| **Gemini-Pro**                                | 57.9%       | 46.7%       | 69.9%     | 40.7%         | 47.7%         | 32.0%     | 40.3%     | 70.7%     | 64.7%     | 64.5%     | 68.4%     |\n| **ERNIE-Bot-0615**                            | 56.6%       | 46.7%       | 31.0%     | 38.3%         | 49.1%         | 35.9%     | **66.1%** | 79.3%     | **86.9%** | **79.1%** | 68.4%     |\n| **GPT-3.5-turbo-0301**                        | 53.2%       | 34.7%       | 76.6%     | 38.8%         | 47.8%         | 41.1%     | 38.7%     | 56.9%     | 45.3%     | 53.9%     | 54.0%     |\n| **ERNIE-Bot-turbo-0725**                      | 45.6%       | 35.3%       | 26.6%     | 34.1%         | 36.2%         | 32.0%     | 51.6%     | 64.0%     | 72.2%     | 63.4%     | 44.2%     |\n| **Baichuan2-13b-Chat**                        | 43.9%       | 26.9%       | 34.7%     | 23.8%         | 31.7%         | 25.0%     | 40.3%     | 53.3%     | 75.3%     | 59.9%     | 61.1%     |\n| **ChatGLM2-6b**                               | 42.7%       | 31.1%       | 30.6%     | 29.0%         | 35.8%         | 24.2%     | 46.0%     | 71.3%     | 55.0%     | 59.2%     | 41.1%     |\n| **Baichuan2-7b-Chat**                         | 40.5%       | 31.7%       | 33.0%     | 26.6%         | 28.4%         | 18.0%     | 26.6%     | 48.0%     | 
69.7%     | 57.8%     | 49.5%     |\n| **ChatGLM-6b**                                | 30.8%       | 18.6%       | 17.0%     | 25.2%         | 25.7%         | 12.5%     | 30.6%     | 24.7%     | 54.1%     | 59.9%     | 25.3%     |\n| **Baichuan2-7b-Base**                         | 27.2%       | 16.2%       | 21.2%     | 24.8%         | 24.8%         | 0.0%      | 23.4%     | 24.0%     | 55.3%     | 32.1%     | 24.2%     |\n| **LLaMA-7b**                                  | 21.1%       | 16.2%       | 20.5%     | 24.3%         | 26.1%         | 0.0%      | 22.6%     | 22.7%     | 22.2%     | 19.2%     | 24.2%     |\n| **Vicuna-7b**                                 | 21.0%       | 12.0%       | 19.6%     | 23.8%         | 23.4%         | 7.0%      | 27.4%     | 20.0%     | 20.9%     | 23.0%     | 23.2%     |\n\n### 主观题得分率\n\n| **Models**                                    | **Overall** | **Chinese** | **Eng.**  | **Sci. Math** | **Hum. Math** | **Phys.** | **Chem.** | **Biol.** | **Poli.** | **Hist.** | **Geog.** |\n| --------------------------------------------- | ----------- | ----------- | --------- | ------------- | ------------- | --------- | --------- | --------- | --------- | --------- | --------- |\n| **GPT-4-0314**                                | **51.9%**   | 51.5%       | **88.3%** | 24.1%         | **27.9%**     | **56.7%** | **35.0%** | **85.6%** | 50.0%     | **63.1%** | 70.0%     |\n| **GPT-4-0613**                                | 50.8%       | 50.3%       | 87.6%     | **24.6%**     | 27.5%         | 47.1%     | 28.5%     | **85.6%** | 49.9%     | 59.9%     | 71.5%     |\n| **ERNIE-Bot-0615**                            | 48.4%       | **57.1%**   | 45.0%     | 17.0%         | 25.6%         | 33.5%     | 30.8%     | 84.9%     | **53.0%** | 60.0%     | **72.7%** |\n| **ERNIE-Bot-turbo-0725**                      | 39.2%       | 42.5%       | 28.8%     | 14.6%         | 15.6%         | 23.2%     | 25.0%     | 85.1%     | 45.3%     | 47.0%     | 61.8%     
|\n| **GPT-3.5-turbo-0301**                        | 35.8%       | 33.9%       | 75.4%     | 15.2%         | 15.9%         | 16.9%     | 21.4%     | 36.3%     | 42.3%     | 58.4%     | 62.1%     |\n\n## Quick Start\n\n#### OpenAI API\n\n1. Get the GPT-4 model outputs\n\n   ```\n   cd ./Bench\n   \n   ## Get the Output of Objective Questions\n   python objective_bench.py --openai_api_key=\"your openai api key\"\n   \n   ## Get the Output of Subjective Questions\n   python subjective_bench.py --openai_api_key=\"your openai api key\"\n   ```\n\n2. Compute the objective question scoring rate of GPT-4\n\n   * Place the JSON files output by GPT-4 in the `./Results/gpt_4_obj` folder.\n\n   * Run the following command to obtain the objective question scoring rate; the result is saved to `./Results/gpt_4_obj/result/correction_score.json`.\n   \n   ```\n   python OBJ_score_evaluation.py --obj_output_dir=../Results/gpt_4_obj\n   ```\n\n3. Compute the subjective question scoring rate of GPT-4\n\n   Because human grading is expensive, we provide an LLM-as-a-Judge script that uses GPT-4-turbo to grade the model's answers to subjective questions.\n\n   * Place the JSON files output by GPT-4 in the `./Results/gpt_4_sub` folder.\n\n   * Run the following command to have GPT-4 grade the subjective questions; the results are saved under `./Results/gpt_4_sub/gpt-4-1106-preview_correction_wo_marking_criterion`.\n\n   ```\n   python subjective_grade.py --openai_api_key=\"your openai api key\"\n   ```\n\n   * Run the following command to obtain the subjective question scoring rate; the result is saved to `./Results/gpt_4_sub/gpt-4-1106-preview_correction_wo_marking_criterion/result/model_score.json`.\n\n     ```\n     python SUB_score_evaluation.py --sub_output_dir=../Results/gpt_4_sub/gpt-4-1106-preview_correction_wo_marking_criterion --mode=model\n     ```\n\n4. Compute the GAOKAO total score of GPT-4\n\n   Run the following command to obtain the converted GAOKAO total score of GPT-4; the result is saved to `./Results/merge_score.json`.\n\n   ```\n   python merge_OBJ_SUB_score.py\n   ```\n\n   \n\n#### Other Models\n\nWrap your model's API and place it in the `./Models` directory; see `./Models/openai_gpt4.py` for an example wrapper.\n\n## Citation\n\n```\n@inproceedings{Zhang2023EvaluatingTP,\n  title={Evaluating the Performance of Large Language Models on GAOKAO Benchmark},\n  author={Xiaotian Zhang and Chunyang Li and Yi Zong and Zhengyu Ying and Liang He and Xipeng Qiu},\n  year={2023}\n}\n```\n\n## Acknowledgements\n\nWe are very grateful to the teachers of Shanghai Caoyang No. 2 High School, who graded the subjective questions of GAOKAO-Bench.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenLMLab%2FGAOKAO-Bench","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FOpenLMLab%2FGAOKAO-Bench","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FOpenLMLab%2FGAOKAO-Bench/lists"}