Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/MikeGu721/XiezhiBenchmark
- Host: GitHub
- URL: https://github.com/MikeGu721/XiezhiBenchmark
- Owner: MikeGu721
- Created: 2023-06-07T18:55:03.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2023-12-05T11:26:48.000Z (about 1 year ago)
- Last Synced: 2024-11-02T10:34:12.357Z (2 months ago)
- Language: Python
- Size: 39.1 MB
- Stars: 91
- Watchers: 1
- Forks: 4
- Open Issues: 6
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
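- awesome-llm-eval - XieZhi - choice questions spanning 516 diverse disciplines and four difficulty levels. A new comprehensive benchmark for evaluating domain knowledge: Xiezhi. For multiple-choice questions, Xiezhi covers 220,000 unique questions across 516 different disciplines, spanning 13 subject categories. The authors also propose Xiezhi-Specialty and Xiezhi-Interdiscipline, each containing 15k questions. The performance of 47 advanced LLMs was evaluated using the Xiezhi benchmark. (Datasets-or-Benchmark / General)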
README
Xiezhi (獬豸) is a comprehensive evaluation suite for Language Models (LMs). It consists of 249,587 multiple-choice questions spanning 516 diverse disciplines and four difficulty levels, as shown below. Please check our [paper](https://arxiv.org/abs/2306.05783) for more details; our **website** will open later.
We hope Xiezhi helps developers track progress and analyze the key strengths and shortcomings of their LMs.
## Table of Contents
- [Leaderboard](#leaderboard)
- [Experiment Setting](#experiment-setting)
- [Data](#data)
- [How To Run Your Own Test](#how-to-run-your-own-test)
- [TODO](#todo)
- [Licenses](#licenses)
- [Citation](#citation)
## Leaderboard
Below is the ranking of models under 0-shot learning in our experiment setting.
The metric we use is the MRR score.
For details of the experiment setting, please refer to [Experiment Setting](#experiment-setting).
✓ denotes that human performance exceeds the state-of-the-art LLMs, whereas ✗ signifies that LLMs have surpassed human performance.
## Experiment Setting
### Options Setting
All tested LLMs need to choose the best-fit answer from 50 options for each question.
Each question is set up with 3 confusing options in addition to the correct answer, and another 46 options are randomly sampled from all options in all questions in Xiezhi.
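
The construction described above (one correct answer, 3 confusing options, and 46 randomly sampled fillers) can be sketched as follows. This is a minimal illustration, not the repository's code: the function name `build_options`, its parameters, the de-duplication of the pool, and the final shuffle are all assumptions made for the example.

```python
import random


def build_options(answer, confusing, option_pool, total=50, seed=None):
    """Assemble `total` options for one question (hypothetical helper).

    answer      -- the correct option
    confusing   -- the hand-picked confusing options (3 in Xiezhi's setting)
    option_pool -- options collected from all questions in the benchmark
    """
    rng = random.Random(seed)
    fixed = [answer] + list(confusing)
    # Assumption: fillers are distinct and never repeat the fixed options.
    candidates = sorted(set(option_pool) - set(fixed))
    n_fill = total - len(fixed)
    if len(candidates) < n_fill:
        raise ValueError("option pool is too small to fill the option list")
    options = fixed + rng.sample(candidates, n_fill)
    # Assumption: shuffle so the answer's position carries no signal.
    rng.shuffle(options)
    return options


pool = [f"opt{i}" for i in range(200)]
opts = build_options("answer", ["c1", "c2", "c3"], pool, seed=0)
print(len(opts), "answer" in opts)  # 50 True
```
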
Note that more confusing options could be generated with WordNet, open-source synonym databases, or other word-construction methods.
However, our experiments show that the performance of all LLMs declined dramatically as the number of options increased, even though most of the added options are non-confusing.
This achieves our goal of widening the performance gap between LLMs through a new experimental setting.
### Metric
This section presents two main experimental results: the overall performance of all LLMs across various benchmarks,
and the ranking of the top eight 0-shot LLMs in 12 non-sensitive domain categories of the Xiezhi benchmark, alongside the scores of top and average practitioners.
For the 45 open-source models assessed in our evaluation,
we used generative probabilities to calculate the probability of each model choosing each option, and then ranked all options by these probabilities.
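
A rough sketch of this ranking step is shown below. It assumes the per-token log-probabilities of each option have already been obtained from a model. Scoring an option by the sum of its token log-probabilities (rather than, say, a length-normalized score) is an assumption of this example, and `rank_options` is a hypothetical name, not a function from the repository.

```python
def rank_options(option_logprobs):
    """Rank options from most to least likely (hypothetical helper).

    option_logprobs maps each option text to the list of log-probabilities
    the model assigned to that option's tokens.
    """
    # Assumption: an option's score is the sum of its token log-probs.
    scores = {opt: sum(lps) for opt, lps in option_logprobs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy log-probabilities, invented for illustration only.
toy = {
    "Paris": [-0.2, -0.1],
    "London": [-1.5],
    "Rome": [-0.9, -0.8],
}
print(rank_options(toy))  # ['Paris', 'London', 'Rome']
```
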
Due to legal considerations, we only display the results of two publicly recognized API-based LLMs: ChatGPT and GPT-4,
and we asked them to rank all given options through instructions.
To summarize the ranking outcomes, we employed the Mean Reciprocal Rank (MRR) as the metric,
which averages the reciprocal rank of the correct answer over all questions.
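
As a small worked example of the metric, the sketch below computes MRR from ranked option lists. The function name and the input layout (one ranked list and one correct answer per question) are assumptions made for illustration.

```python
def mean_reciprocal_rank(rankings, answers):
    """Average of 1/rank of the correct answer over all questions.

    rankings -- one ranked list of options per question, best first
    answers  -- the correct option for each question
    """
    if not rankings or len(rankings) != len(answers):
        raise ValueError("rankings and answers must be non-empty and aligned")
    total = 0.0
    for ranked, answer in zip(rankings, answers):
        rank = ranked.index(answer) + 1  # ranks are 1-based
        total += 1.0 / rank
    return total / len(rankings)


# The correct answer "a" is ranked 1st, 2nd, and 3rd respectively.
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
answers = ["a", "a", "a"]
print(round(mean_reciprocal_rank(rankings, answers), 4))  # 0.6111
```
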
An MRR closer to 1 indicates that the model tends to place the correct answer near the top of the ranking,
while an MRR closer to 0 suggests that it tends to place the correct answer near the bottom.
## Data
Example of a question in Xiezhi-Specialty:
Example of a question in Xiezhi-Interdiscipline:
Example of our few-shot learning setting:
## How To Run Your Own Test
- Testing can be performed on a set of benchmarks including C-Eval, M3KE, MMLU, Xiezhi-Inter, and Xiezhi-Spec, all of which are handled in the `./Tester/model_test.py` file.
- Run `./Tester/test.sh` to perform the evaluation.
- For your own data, you need to rewrite the `_get_data` function in `./Tester/model_test.py`.
## TODO
- [ ] Add results for the traditional 4-option experiment setting
- [ ] Add results for more API-based models
## Licenses
[![MIT license](https://img.shields.io/badge/License-MIT-blue.svg)](https://lbesson.mit-license.org/)
This work is licensed under an [MIT License](https://lbesson.mit-license.org/).
[![CC BY-NC-SA 4.0](https://img.shields.io/badge/License-CC%20BY--NC--SA%204.0-lightgrey.svg)](http://creativecommons.org/licenses/by-nc-sa/4.0/)
The Xiezhi dataset is licensed under a
[Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-nc-sa/4.0/).
## Citation
Please cite our paper if you use our dataset.
```
@article{gu2023xiezhi,
title={Xiezhi: An Ever-Updating Benchmark for Holistic Domain Knowledge Evaluation},
author={Zhouhong, Gu and Xiaoxuan, Zhu and Haoning, Ye and Lin, Zhang and Jianchen, Wang and Sihang, Jiang and Zhuozhi, Xiong and Zihan, Li and Qianyu, He and Rui, Xu and Wenhao, Huang and Weiguo, Zheng and Hongwei, Feng and Yanghua, Xiao},
journal={arXiv:2306.05783},
year={2023}
}
```