{"id":13526133,"url":"https://github.com/ruixiangcui/AGIEval","last_synced_at":"2025-04-01T06:31:15.213Z","repository":{"id":152982049,"uuid":"619121820","full_name":"ruixiangcui/AGIEval","owner":"ruixiangcui","description":null,"archived":false,"fork":false,"pushed_at":"2024-06-13T14:20:51.000Z","size":7204,"stargazers_count":743,"open_issues_count":7,"forks_count":50,"subscribers_count":8,"default_branch":"main","last_synced_at":"2025-03-29T15:34:32.407Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ruixiangcui.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-26T10:37:46.000Z","updated_at":"2025-03-29T05:52:30.000Z","dependencies_parsed_at":"2024-04-09T15:54:07.513Z","dependency_job_id":"152b2cfa-6b74-470e-91e9-ca67aa13cd32","html_url":"https://github.com/ruixiangcui/AGIEval","commit_stats":null,"previous_names":["ruixiangcui/agieval"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruixiangcui%2FAGIEval","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruixiangcui%2FAGIEval/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruixiangcui%2FAGIEval/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ruixiangcui%2FAGIEval/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ruixiangcui","download_url":"https://codeload.gith
ub.com/ruixiangcui/AGIEval/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246596817,"owners_count":20802899,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T06:01:25.644Z","updated_at":"2025-04-01T06:31:13.100Z","avatar_url":"https://github.com/ruixiangcui.png","language":"Python","funding_links":[],"categories":["📏 评测基准","Python","Anthropomorphic-Taxonomy","Benchmarks \u0026 Datasets","Reasoning \u0026 Math","Benchmarks"],"sub_categories":["🧩 领域模型","Typical Intelligence Quotient (IQ)-General Intelligence evaluation benchmarks","General Language Understanding","General"],"readme":"# AGIEval\nThis repository contains information about AGIEval, data, code and output of baseline systems for the benchmark.\n\n# Introduction\nAGIEval is a human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving. \nThis benchmark is derived from 20 official, public, and high-standard admission and qualification exams intended for general human test-takers, such as general college admission tests (e.g., Chinese College Entrance Exam (Gaokao) and American SAT), law school admission tests, math competitions, lawyer qualification tests, and national civil service exams. 
\nFor a full description of the benchmark, please refer to our paper: [AGIEval: A Human-Centric Benchmark for\nEvaluating Foundation Models](https://arxiv.org/pdf/2304.06364.pdf).\n\n# Tasks and Data\n[We have updated the dataset to version 1.1.](data/v1_1) The new version updates the Chinese Gaokao (chemistry, biology, physics) datasets with questions from 2023 and addresses annotation issues. To facilitate evaluation, all multi-choice question (MCQ) tasks now have exactly one answer (Gaokao-Physics and JEC-QA used to have multi-label answers). AGIEval-en datasets remain the same as in version 1.0. The new version's statistics are as follows:\n\nAGIEval v1.1 contains 20 tasks, including 18 MCQ tasks and two cloze tasks (Gaokao-Math-Cloze and MATH). You can find the full list of tasks in the table below.\n![The datasets used in AGIEval](AGIEval_tasks.png)\n\nYou can download all post-processed data in the [data/v1_1](data/v1_1) folder. All usage of the data should follow the license of the original datasets. \n\nThe data format for all datasets is as follows:\n```\n{\n    \"passage\": null,\n    \"question\": \"设集合 $A=\\\\{x \\\\mid x \\\\geq 1\\\\}, B=\\\\{x \\\\mid-1\u003cx\u003c2\\\\}$, 则 $A \\\\cap B=$ ($\\\\quad$)\\\\\\\\\\n\",\n    \"options\": [\"(A)$\\\\{x \\\\mid x\u003e-1\\\\}$\", \n        \"(B)$\\\\{x \\\\mid x \\\\geq 1\\\\}$\", \n        \"(C)$\\\\{x \\\\mid-1\u003cx\u003c1\\\\}$\", \n        \"(D)$\\\\{x \\\\mid 1 \\\\leq x\u003c2\\\\}$\"\n        ],\n    \"label\": \"D\",\n    \"answer\": null\n}\n```\nThe `passage` field is available for gaokao-chinese, gaokao-english, both LogiQA tasks, all LSAT tasks, and SAT. The answer for multi-choice tasks is saved in the `label` field. The answer for cloze tasks is saved in the `answer` field. 
\n\nWe provide the prompts for few-shot learning in the [data/few_shot_prompts](data/few_shot_prompts.csv) file.\n# Baseline Systems\nWe evaluate the performance of the baseline systems (GPT-3.5-Turbo and GPT-4o) on AGIEval v1.1.\nThe results are as follows:\n\n![Baseline results on AGIEval](gpt_results.png)\n\nYou can replicate the results by following the steps below:\n1. Add your OpenAI API key to the [openai_api.py](openai_api.py) file.\n2. Run the [run_prediction.py](run_prediction.py) script to get the results.\n\n# Evaluation\nYou can run the [post_process_and_evaluation.py](post_process_and_evaluation.py) file to get the evaluation results.\n# Leaderboard\nWe report the leaderboard on AGIEval v1.1. The leaderboard covers two subsets, AGIEval-en and AGIEval-zh, both of which contain only MCQ tasks. The leaderboard is as follows:\n\n## AGIEval-en few-shot\n\n| Model            | Source                                                   | Average |\n|------------------|----------------------------------------------------------|---------|\n| GPT-4o           | [Link](https://github.com/ruixiangcui/AGIEval)           | 71.4    |\n| Llama 3 400B+    | [Link](https://ai.meta.com/blog/meta-llama-3/)           | 69.9    |\n| Llama 3 70B      | [Link](https://ai.meta.com/blog/meta-llama-3/)           | 63      |\n| Mixtral 8x22B    | [Link](https://ai.meta.com/blog/meta-llama-3/)           | 61.2    |\n| GPT-3.5-Turbo    | [Link](https://github.com/ruixiangcui/AGIEval)           | 52.7    |\n| Llama 3 8B       | [Link](https://ai.meta.com/blog/meta-llama-3/)           | 45.9    |\n| Gemma 7B         | [Link](https://ai.meta.com/blog/meta-llama-3/)           | 44.9    |\n| Mistral 7B       | [Link](https://ai.meta.com/blog/meta-llama-3/)           | 44      |\n\n## AGIEval-zh few-shot\n\n| Model            | Source                                                   | Average 
|\n|------------------|----------------------------------------------------------|---------|\n| GPT-4o           | [Link](https://github.com/ruixiangcui/AGIEval)           | 71.9    |\n| GPT-3.5-Turbo    | [Link](https://github.com/ruixiangcui/AGIEval)           | 49.5    |\n\n## AGIEval-all few-shot\n\n| Model            | Source                                                   | Average |\n|------------------|----------------------------------------------------------|---------|\n| GPT-4o           | [Link](https://github.com/ruixiangcui/AGIEval)           | 69.0    |\n| GPT-3.5-Turbo    | [Link](https://github.com/ruixiangcui/AGIEval)           | 47.2    |\n\n## AGIEval-en zero-shot\n\n| Model            | Source                                                   | Average |\n|------------------|----------------------------------------------------------|---------|\n| GPT-4o           | [Link](https://github.com/ruixiangcui/AGIEval)           | 65.2    |\n| GPT-3.5-Turbo    | [Link](https://github.com/ruixiangcui/AGIEval)           | 54.1    |\n\n## AGIEval-zh zero-shot\n\n| Model            | Source                                                   | Average |\n|------------------|----------------------------------------------------------|---------|\n| GPT-4o           | [Link](https://github.com/ruixiangcui/AGIEval)           | 63.3    |\n| GPT-3.5-Turbo    | [Link](https://github.com/ruixiangcui/AGIEval)           | 45.0    |\n\n\n## AGIEval-all zero-shot\n(Asterisk sign indicates results reported for AGIEval v1.0.)\n\n| Model            | Source                                                   | Average |\n|------------------|----------------------------------------------------------|---------|\n| GPT-4o           | [Link](https://github.com/ruixiangcui/AGIEval)           | 62.3    |\n| InternLM2-20B*   | [Link](https://arxiv.org/pdf/2403.17297)                 | 53.0    |\n| Qwen-14B*        | [Link](https://arxiv.org/pdf/2403.17297)                 | 52.0   
 |\n| Phi-3-medium 14b*| [Link](https://arxiv.org/pdf/2404.14219)                 | 50.2    |\n| InternLM2-Chat-7B-SFT*| [Link](https://arxiv.org/pdf/2403.17297)            | 49.0    |\n| GPT-3.5-Turbo    | [Link](https://github.com/ruixiangcui/AGIEval)           | 46.0    |\n| Qwen-7B*         | [Link](https://arxiv.org/pdf/2403.17297)                 | 45.6    |\n| Mixtral 8x7b*    | [Link](https://arxiv.org/pdf/2404.14219)                 | 45.2    |\n| Phi-3-small 7b*  | [Link](https://arxiv.org/pdf/2404.14219)                 | 45.1    |\n| Gemma 7b*        | [Link](https://arxiv.org/pdf/2404.14219)                 | 42.1    |\n| Llama-3-In*      | [Link](https://arxiv.org/pdf/2404.14219)                 | 42.0    |\n| Phi-3-mini 3.8b* | [Link](https://arxiv.org/pdf/2404.14219)                 | 37.5    |\n| Mistral 7b*      | [Link](https://arxiv.org/pdf/2404.14219)                 | 35.1    |\n| Phi-2 2.7b*      | [Link](https://arxiv.org/pdf/2404.14219)                 | 29.8    |\n\n# Citation\nIf you use the AGIEval benchmark or its code in your research, please cite our paper:\n```\n@misc{zhong2023agieval,\n      title={AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models}, \n      author={Wanjun Zhong and Ruixiang Cui and Yiduo Guo and Yaobo Liang and Shuai Lu and Yanlin Wang and Amin Saied and Weizhu Chen and Nan Duan},\n      year={2023},\n      eprint={2304.06364},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL}\n}\n```\n\n# Contributing\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). 
Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).\nFor more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n# Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark \u0026 Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruixiangcui%2FAGIEval","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fruixiangcui%2FAGIEval","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fruixiangcui%2FAGIEval/lists"}