https://github.com/OpenMOSS/HalluQA

Dataset and evaluation script for "Evaluating Hallucinations in Chinese Large Language Models"

Host: GitHub
URL: https://github.com/OpenMOSS/HalluQA
Owner: OpenMOSS
Created: 2023-10-04T03:01:40.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-02-28T12:02:29.000Z (4 months ago)
Last Synced: 2024-05-22T08:07:49.906Z (about 1 month ago)
Language: Python
Homepage: https://arxiv.org/pdf/2310.03368.pdf
Size: 6.05 MB
Stars: 101
Watchers: 5
Forks: 3
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Lists

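awesome-llm-eval - HalluQA - The hard part has 69 questions and the knowledge part has 206 questions; each question has on average 2.8 annotated correct and wrong answers. To make HalluQA easier to use, the authors designed an evaluation method with GPT-4 as the evaluator. Specifically, the hallucination criteria and the reference correct answers are given to GPT-4 as instructions, and GPT-4 judges whether the model's response contains hallucinations (2023-11-08) | (Datasets-or-Benchmark / General)
StarryDivineSky - OpenMOSS/HalluQA - 130B to generate answers and collect adversarial questions. In step 3, multiple correct and wrong answers are written for each adversarial question, and supporting evidence is added. In step 4, all annotated question-answer pairs are checked and low-quality samples are removed. (Text generation, text dialogue / ChatGPT-like large language dialogue models and data)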

README

# Evaluating Hallucinations in Chinese Large Language Models

This repository contains the data and evaluation scripts for the HalluQA (Chinese Hallucination Question-Answering) benchmark.

The full data of HalluQA is in **HalluQA.json**.
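Before running anything, it can help to inspect the dataset's structure. The sketch below assumes **HalluQA.json** is a JSON array of objects. This README does not document the field names, so the code discovers them rather than assuming any. `summarize_records` is a hypothetical helper, not part of the repository.

```python
import json
from pathlib import Path


def summarize_records(records):
    """Return the record count and the sorted union of keys across dict records."""
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record.keys())
    return len(records), sorted(keys)


# Assumption: the script runs from the repository root, next to HalluQA.json.
data_path = Path("HalluQA.json")
if data_path.exists():
    records = json.loads(data_path.read_text(encoding="utf-8"))
    count, keys = summarize_records(records)
    print(f"{count} records; fields: {keys}")
```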

The paper, which introduces HalluQA and reports detailed experimental results for many Chinese large language models, is available [here](https://arxiv.org/pdf/2310.03368.pdf).

## Update

**2024.2.28**: We added a multiple-choice task for HalluQA.

The test data for the multiple-choice task is in HalluQA_mc.json.

The multiple-choice QA prompt is in prompts/Chinese_QA_prompt_mc.txt.

## Data Collection Pipeline

![](imgs/pipeline.png)

HalluQA contains 450 meticulously designed adversarial questions that span multiple domains and take into account Chinese historical culture, customs, and social phenomena. The data collection pipeline is shown above. In step 1, we write questions that we think may induce model hallucinations. In step 2, we use ChatGPT3.5/Puyu/GLM-130B to generate answers and collect adversarial questions. In step 3, we write multiple correct and wrong answers for each adversarial question and add supporting evidence. In step 4, we check all annotated question-answer pairs and remove low-quality samples.

## Data Examples

We show some data examples of HalluQA here.

![](imgs/examples.png)

## Metric & Evaluation Method

We use the non-hallucination rate as the metric for HalluQA: the percentage of model-generated answers that do not exhibit hallucinations.

For automated evaluation, we use GPT-4 as the evaluator. GPT-4 judges whether a generated answer exhibits hallucinations based on the given criteria and the reference correct answers.

The prompt for GPT-4-based evaluation is in **calculate_metrics.py**.
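As an illustration of the metric only (not the code in calculate_metrics.py), the non-hallucination rate can be computed from a list of per-answer verdicts. The boolean encoding of the verdicts and the function name below are assumptions made for this sketch.

```python
def non_hallucination_rate(has_hallucination):
    """Percentage of answers judged free of hallucination.

    `has_hallucination` holds one boolean per generated answer:
    True if the judge flagged a hallucination, False otherwise.
    """
    if not has_hallucination:
        raise ValueError("at least one judgement is required")
    clean = sum(1 for flagged in has_hallucination if not flagged)
    return 100.0 * clean / len(has_hallucination)


# Hypothetical verdicts for four answers: one was flagged.
verdicts = [False, True, False, False]
print(f"{non_hallucination_rate(verdicts):.2f}")  # 75.00
```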

### Run evaluation for your models

1. Install requirements

```
pip install openai
```

2. Run evaluation using our script.

```shell
# Replace gpt-4-0613_responses.json with your own responses file.
python calculate_metrics.py --response_file_name gpt-4-0613_responses.json --api_key "your openai api key" --organization "organization of your openai account"
```

3. The results and the metric are saved in results.json and non_hallucination_rate.txt, respectively.

### Multiple-choice task

We also provide a multiple-choice task for HalluQA.

First, generate an answer for each question with the model under test, using our [multiple-choice prompt](./prompts/Chinese_QA_prompt_mc.txt). Then calculate the accuracy on the multiple-choice task with the following script.

```shell
python calculate_metrics_mc.py --response_file_name
```
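For intuition, multiple-choice accuracy is the share of questions where the model's chosen option matches the reference option. The sketch below is not the logic of calculate_metrics_mc.py. It assumes each prediction and reference is a single option letter, and the function name is hypothetical.

```python
def multiple_choice_accuracy(predictions, references):
    """Percentage of predictions whose option letter matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one question is required")
    # Compare case-insensitively and ignore surrounding whitespace.
    correct = sum(
        pred.strip().upper() == ref.strip().upper()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


# Hypothetical option letters for four questions: the last one is wrong.
predicted = ["A", "c", " B", "D"]
gold = ["A", "C", "B", "A"]
print(multiple_choice_accuracy(predicted, gold))  # 75.0
```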

## Results 
### Leaderboard 
**Non-hallucination rate of each model for different types of questions**:

| **Model** | **Misleading** | **Misleading-hard** | **Knowledge** | **Total** |
|--------------------|----------------|---------------------|---------------|-----------|
| ***Retrieval-Augmented Chat Model*** | | | | |
| ERNIE-Bot | 70.86 | 46.38 | 75.73 | 69.33 |
| Baichuan2-53B | 59.43 | 43.48 | 83.98 | 68.22 |
| ChatGLM-Pro | 64.00 | 34.78 | 67.96 | 61.33 |
| SparkDesk | 59.43 | 27.54 | 71.36 | 60.00 |
| ***Chat Model*** | | | | |
| abab5.5-chat | 60.57 | 39.13 | 57.77 | 56.00 |
| gpt-4-0613 | 76.00 | 57.97 | 32.04 | 53.11 |
| Qwen-14B-chat | 75.43 | 23.19 | 30.58 | 46.89 |
| Baichuan2-13B-chat | 61.71 | 24.64 | 32.04 | 42.44 |
| Baichuan2-7B-chat | 54.86 | 28.99 | 32.52 | 40.67 |
| gpt-3.5-turbo-0613 | 66.29 | 30.43 | 19.42 | 39.33 |
| Xverse-13B-chat | 65.14 | 23.19 | 22.33 | 39.11 |
| Xverse-7B-chat | 64.00 | 13.04 | 21.84 | 36.89 |
| ChatGLM2-6B | 55.43 | 23.19 | 21.36 | 34.89 |
| Qwen-7B-chat | 55.43 | 14.49 | 17.48 | 31.78 |
| Baichuan-13B-chat | 49.71 | 8.70 | 23.30 | 31.33 |
| ChatGLM-6b | 52.57 | 20.29 | 15.05 | 30.44 |
| ***Pre-Trained Model*** | | | | |
| Qwen-14B | 54.86 | 23.19 | 24.76 | 36.22 |
| Baichuan2-13B-base | 23.43 | 24.64 | 45.63 | 33.78 |
| Qwen-7B | 48.57 | 20.29 | 16.99 | 29.78 |
| Xverse-13B | 18.86 | 24.64 | 32.52 | 27.33 |
| Baichuan-13B-base | 9.71 | 18.84 | 40.78 | 25.33 |
| Baichuan2-7B-base | 8.00 | 21.74 | 41.26 | 25.33 |
| Baichuan-7B-base | 6.86 | 15.94 | 37.38 | 22.22 |
| Xverse-7B | 12.00 | 13.04 | 29.61 | 20.22 |
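The Total column appears to be the question-weighted average of the three category rates. The sketch below checks this for two leaderboard rows. It assumes 175 misleading, 69 misleading-hard, and 206 knowledge questions: the 69 and 206 come from the awesome-llm-eval list entry, and 175 is inferred as 450 - 69 - 206, not stated in the README. The function and the count table are illustrative, not part of the repository.

```python
# Assumed number of questions per category (175 is inferred, see above).
CATEGORY_COUNTS = {"misleading": 175, "misleading_hard": 69, "knowledge": 206}


def total_rate(rates):
    """Combine per-category non-hallucination rates (in percent) into a total."""
    # Recover the integer number of non-hallucinated answers in each category.
    clean = sum(
        round(rates[name] / 100 * count)
        for name, count in CATEGORY_COUNTS.items()
    )
    return round(100 * clean / sum(CATEGORY_COUNTS.values()), 2)


ernie_bot = {"misleading": 70.86, "misleading_hard": 46.38, "knowledge": 75.73}
gpt_4_0613 = {"misleading": 76.00, "misleading_hard": 57.97, "knowledge": 32.04}
print(total_rate(ernie_bot))   # 69.33
print(total_rate(gpt_4_0613))  # 53.11
```

Both values match the Total column, which is consistent with the assumed category sizes.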

### Detailed results

Each model's generated answers and the corresponding GPT-4 judgements are in **Chinese_LLMs_outputs/**.

### Multiple-choice task results

Here we report the accuracy on the multiple-choice task for seven representative models.

![](./imgs/mc_acc.png)

## Acknowledgements

- We sincerely thank the annotators and staff from Shanghai AI Lab who were involved in this work.

- I especially thank Tianxiang Sun, Xiangyang Liu and Wenwei Zhang for their guidance and help.

- I am also grateful to Xinyang Pu for her help and patience.

## Citation

```bibtex
@article{DBLP:journals/corr/abs-2310-03368,
  author       = {Qinyuan Cheng and
                  Tianxiang Sun and
                  Wenwei Zhang and
                  Siyin Wang and
                  Xiangyang Liu and
                  Mozhi Zhang and
                  Junliang He and
                  Mianqiu Huang and
                  Zhangyue Yin and
                  Kai Chen and
                  Xipeng Qiu},
  title        = {Evaluating Hallucinations in Chinese Large Language Models},
  journal      = {CoRR},
  volume       = {abs/2310.03368},
  year         = {2023},
  url          = {https://doi.org/10.48550/arXiv.2310.03368},
  doi          = {10.48550/arXiv.2310.03368},
  eprinttype   = {arXiv},
  eprint       = {2310.03368},
  timestamp    = {Thu, 19 Oct 2023 13:12:52 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2310-03368.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
```