# AlignBench: Benchmarking Chinese Alignment of Large Language Models

Read the [Chinese](README.md) version.

This repository contains information, data and code of AlignBench: a comprehensive multi-dimensional benchmark for evaluating LLMs’ alignment in Chinese.

## 🔥 Updates

[2023.12.12] The AlignBench [Website](https://llmbench.ai/align) is now officially online; everyone is welcome to visit! You can use the *Submit* function on the website to evaluate a model with `CritiqueLLM` on AlignBench (results are returned in about 5 minutes).

## 📍 Introduction

Alignment has become the critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, effectively evaluating the alignment of emerging Chinese LLMs remains a significant challenge, calling for evaluation that is diverse, open-ended, challenging, automatic, and tailored to alignment. To address this, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. Equipped with a human-in-the-loop data curation pipeline, our benchmark employs a multi-dimensional, rule-calibrated LLM-as-Judge with Chain-of-Thought to generate an explanation and a final rating, ensuring high reliability and interpretability. Furthermore, we developed a dedicated companion evaluator LLM, CritiqueLLM, which recovers 95% of GPT-4's evaluation ability and will be provided via accessible APIs to researchers for convenient evaluation of Chinese alignment.

![Overall](assets/Overall.png)

The overall framework of AlignBench is shown in the image above, including the data curation pipeline, the task taxonomy, and the multi-dimensional, rule-calibrated LLM-as-Judge evaluation method.

For a full description of AlignBench, please refer to the paper: [AlignBench](https://arxiv.org/abs/2311.18743)

For a full description of CritiqueLLM, please refer to the paper: [CritiqueLLM](https://arxiv.org/abs/2311.18702)

---

## 📦 Dataset

To enable systematic evaluation, we built a comprehensive taxonomy of LLM abilities based on real-user instructions. We inspected and summarized user queries into 8 main categories: Fundamental Language Ability, Advanced Chinese Understanding, Open-ended Questions, Writing Ability, Logical Reasoning, Mathematics, Task-oriented Role Play, and Professional Knowledge. The taxonomy and distribution of AlignBench are as follows.

| Category | Chinese Name | #Samples |
| :----------------------------: | :------: | :------: |
| Fundamental Language Ability | 基本任务 | 68 |
| Advanced Chinese Understanding | 中文理解 | 58 |
| Open-ended Questions | 综合问答 | 38 |
| Writing Ability | 文本写作 | 75 |
| Logical Reasoning | 逻辑推理 | 92 |
| Mathematics | 数学计算 | 112 |
| Task-oriented Role Play | 角色扮演 | 116 |
| Professional Knowledge | 专业能力 | 124 |

AlignBench contains 683 high-quality samples in total. Each sample consists of a task-oriented query, a high-quality reference answer, and the corresponding category in our taxonomy. The data is stored in `data/data_release.jsonl`, where each line is one sample in `json` format.

The data format is as follows.

- `question_id` (integer): A unique identifier for the question.
- `category` (string): The primary category under which the question falls.
- `subcategory` (string): The secondary category for further classification.
- `question` (string): The actual user query.
- `reference` (string): This provides a reference or standard answer to the question.

Here is an example from the Mathematics (数学计算) category.

```json
{
    "question_id": 1,
    "category": "数学计算",
    "subcategory": "初等数学",
    "question": "有一串彩珠,按“2红3绿4黄”的顺序依次排列。第600颗是什么颜色?",
    "reference": "一组\"2红3绿4黄\"共有9颗珠子。600除以9的商是66,余数是6。因此,第600颗珠子是在第67组的第6颗,即\"2红3绿4黄\"中的第6颗,也就是黄色。所以,第600颗珠子是黄色。"
}
```
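The file can be read line by line with the standard `json` module. The snippet below is a minimal sketch for loading the samples and counting them per category; it assumes you run it from the repository root.

```python
# Minimal sketch: load the AlignBench samples and check the category distribution.
import json
from collections import Counter

samples = []
with open("data/data_release.jsonl", encoding="utf-8") as f:
    for line in f:
        samples.append(json.loads(line))

print(len(samples))                             # expected: 683
print(Counter(s["category"] for s in samples))  # samples per category
print(samples[0]["question"])                   # query of the first sample
```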

---

## ⚙️ Evaluation Pipeline

In order to effectively evaluate the quality of responses, AlignBench currently employs GPT-4-0613 to analyze and subsequently grade them. During evaluation, the input consists of the user query, the model's response, and a high-quality reference answer, and the output is a multi-dimensional analytical explanation and a final rating ranging from 1 to 10. To ensure reliability and interpretability, we implement the following methods. Here is an example.

![Case](assets/Case.png)

* **Point-wise Grading.** For each model answer, the evaluation method gives a final rating ranging from 1 to 10.

* **Chain-of-Thought.** Because grading involves complex reasoning, we adopt the Chain-of-Thought method to improve both reliability and interpretability. Specifically, the evaluator LLM is instructed to generate explanations from multiple dimensions before providing a final rating.

* **Rule-calibrated Referencing.** For each question, we provide a high-quality reference answer. To guide the evaluator to compare the answer with the reference and produce more controllable scores, we provide detailed grading rules that spell out how score intervals relate to the answer's quality relative to the reference. These rules are included in the judge prompt.

* **Multi-dimensional Analysis.** Because tasks differ in nature and characteristics, applying the same evaluation criteria to all of them would be unfair. We therefore use a multi-dimensional scoring approach that tailors the evaluation to the task at hand: different evaluation dimensions are defined for different question types, and the GPT-4 evaluator is instructed to analyze the model answer along the specified dimensions and provide dimensional scores. The dimensions and their definitions are placed in `config`. A hypothetical sketch of how such a judge prompt can be assembled follows this list.
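To make these pieces concrete, here is a minimal, hypothetical sketch of how a rule-calibrated, multi-dimensional Chain-of-Thought judge prompt could be assembled. The rule wording, the dimension names in `MATH_DIMENSIONS`, and the `build_judge_prompt` helper are illustrative only; the benchmark's actual templates live in the `config` directory.

```python
# Hypothetical sketch of a rule-calibrated, multi-dimensional judge prompt.
# The rule text and dimension names are illustrative, not the shipped templates.

GRADING_RULES = (
    "Compare the model answer against the reference answer and rate it from 1 to 10:\n"
    "1-2: far worse than the reference, with major errors;\n"
    "3-4: clearly worse than the reference;\n"
    "5-6: roughly comparable to the reference;\n"
    "7-8: as good as or slightly better than the reference;\n"
    "9-10: clearly better than the reference.\n"
)

MATH_DIMENSIONS = ["correctness", "logical coherence", "completeness", "clarity"]


def build_judge_prompt(question: str, reference: str, answer: str,
                       dimensions: list[str]) -> str:
    """Assemble a Chain-of-Thought judging prompt for one sample."""
    dim_text = ", ".join(dimensions)
    return (
        "You are grading a model's answer to a user question.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Reference answer]\n{reference}\n\n"
        f"[Model answer]\n{answer}\n\n"
        f"First analyse the model answer along these dimensions: {dim_text}. "
        "Give a score for each dimension, then apply the rules below and end "
        "with a single overall rating from 1 to 10.\n\n"
        + GRADING_RULES
    )
```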

---

## 🚀 Evaluation

The whole evaluation process consists of three steps: inference, LLM judging, and results display. The corresponding scripts are kept in `scripts`.

1. **Step I**: run inference on the target LLM and collect its answers

First, deploy your target LLM (this part is not covered by this repository).

Second, implement your own API-calling class in `inference/api_models`; the `do_nothing` class serves as an example. (Note that the API class name must match the file name.) A hedged sketch of such a class is shown below.
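As an illustration only, the sketch below shows the general shape such a class might take, assuming it wraps an HTTP endpoint and exposes a method that maps a question string to a reply string. The file/class name `my_model`, the `generate` method, the endpoint URL, and the payload format are all hypothetical; mirror the actual interface of `do_nothing.py` when implementing yours.

```python
# inference/api_models/my_model.py  (hypothetical example; mirror do_nothing.py)
import requests


class my_model:
    """Illustrative API-calling class: the class name must match the file name."""

    def __init__(self, endpoint: str = "http://localhost:8000/v1/chat"):
        self.endpoint = endpoint  # assumed self-hosted inference endpoint

    def generate(self, question: str) -> str:
        """Send one AlignBench question to the deployed LLM and return its answer.

        The method name and payload format are assumptions; check how
        do_nothing.py is invoked by get_answers.py for the real contract.
        """
        resp = requests.post(self.endpoint, json={"prompt": question}, timeout=120)
        resp.raise_for_status()
        return resp.json()["text"]
```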

Third, modify and run the following script to get the answers of the target LLM.

```bash
MODEL=do_nothing  # TODO: set to your model name (must match your API-calling class)

python get_answers.py \
    --model "$MODEL" \
    --workers 2 \
    --question-file data/data_release.jsonl \
    --save-dir data/model_answer
```

The answers will be saved in `data/model_answer`, ready for the LLM-judging step.
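Before moving on to judging, it can help to spot-check that the answer files were actually written. The generic sketch below just lists the files under `data/model_answer` and prints the first line of each; the record schema itself is determined by `get_answers.py`.

```python
# Generic spot-check: confirm answer files exist and peek at their first line.
# (The record schema is defined by get_answers.py, so we only print raw text.)
from pathlib import Path

for path in sorted(Path("data/model_answer").glob("*")):
    if path.is_file():
        with path.open(encoding="utf-8") as f:
            print(f"{path.name}: {f.readline().strip()[:120]}")
```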

2. **Step II**: obtain the GPT-4 judgments

First, fill in your GPT-4 API key in `config/multi-dimension.json`.

Then, modify and run the following script to get the judgments of the target LLM.

```bash
MODEL=do_nothing  # TODO: set to your model name (must match your API-calling class)

python judge.py \
    --config-path config/multi-dimension.json \
    --model-name "$MODEL" \
    --parallel 2
```

The judgments will be stored in `data/judgment`.

3. **Step III**: display the results

Run the following script to get the results of all the LLM judgments saved in `data/judgment`.

```bash
python show_result.py \
    --input-dir data/judgment \
    --ques-file data/data_release.jsonl \
    --save-file data/results/results.xlsx
```

The calculated results will be stored in `data/results` in `xlsx` format.
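If you prefer to process the scores programmatically, the spreadsheet can be read back with pandas. The snippet below is a small sketch that assumes the default output path used above; the exact sheet layout and column names depend on `show_result.py`, so inspect the DataFrame before relying on them.

```python
# Sketch: load the exported scores for further analysis.
# Reading .xlsx files with pandas requires the openpyxl engine to be installed.
import pandas as pd

df = pd.read_excel("data/results/results.xlsx")
print(df.head())             # first few rows of the score table
print(df.columns.tolist())   # available columns (defined by show_result.py)
```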

---

## 📂 Leaderboard

We report evaluation results for 17 Chinese-capable LLMs on AlignBench, judged by `gpt-4-0613` and `CritiqueLLM`.

`gpt-4-0613` judged results:


| Model | Overall 总分 | Reasoning Avg. 推理总分 | Math. 数学计算 | Logi. 逻辑推理 | Language Avg. 语言总分 | Fund. 基本任务 | Chi. 中文理解 | Open. 综合问答 | Writ. 文本写作 | Role. 角色扮演 | Pro. 专业能力 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| gpt-4-1106-preview | 8.01 | 7.73 | 7.8 | 7.66 | 8.29 | 7.99 | 7.33 | 8.61 | 8.67 | 8.47 | 8.65 |
| gpt-4-0613 | 7.53 | 7.47 | 7.56 | 7.37 | 7.59 | 7.81 | 6.93 | 7.42 | 7.93 | 7.51 | 7.94 |
| chatglm-turbo(智谱清言) | 6.24 | 5 | 4.74 | 5.26 | 7.49 | 6.82 | 7.17 | 8.16 | 7.77 | 7.76 | 7.24 |
| erniebot-3.0(文心一言) | 6.14 | 5.15 | 5.03 | 5.27 | 7.13 | 6.62 | 7.6 | 7.26 | 7.56 | 6.83 | 6.9 |
| gpt-3.5-turbo-0613 | 6.08 | 5.35 | 5.68 | 5.02 | 6.82 | 6.71 | 5.81 | 7.29 | 7.03 | 7.28 | 6.77 |
| chatglm-pro(智谱清言) | 5.83 | 4.65 | 4.54 | 4.75 | 7.01 | 6.51 | 6.76 | 7.47 | 7.07 | 7.34 | 6.89 |
| spark_desk_v2(讯飞星火) | 5.74 | 4.73 | 4.71 | 4.74 | 6.76 | 5.84 | 6.97 | 7.29 | 7.18 | 6.92 | 6.34 |
| qwen-14b-chat | 5.72 | 4.81 | 4.91 | 4.71 | 6.63 | 6.9 | 6.36 | 6.74 | 6.64 | 6.59 | 6.56 |
| baichuan2-13b-chat | 5.25 | 3.92 | 3.76 | 4.07 | 6.59 | 6.22 | 6.05 | 7.11 | 6.97 | 6.75 | 6.43 |
| chatglm3-6b | 4.97 | 3.85 | 3.55 | 4.14 | 6.1 | 5.75 | 5.29 | 6.71 | 6.83 | 6.28 | 5.73 |
| baichuan2-7b-chat | 4.97 | 3.66 | 3.56 | 3.75 | 6.28 | 5.81 | 5.5 | 7.13 | 6.84 | 6.53 | 5.84 |
| internlm-20b | 4.96 | 3.66 | 3.39 | 3.92 | 6.26 | 5.96 | 5.5 | 7.18 | 6.19 | 6.49 | 6.22 |
| qwen-7b-chat | 4.91 | 3.73 | 3.62 | 3.83 | 6.09 | 6.4 | 5.74 | 6.26 | 6.31 | 6.19 | 5.66 |
| chatglm2-6b | 4.48 | 3.39 | 3.16 | 3.61 | 5.58 | 4.91 | 4.52 | 6.66 | 6.25 | 6.08 | 5.08 |
| internlm-chat-7b | 3.65 | 2.56 | 2.45 | 2.66 | 4.75 | 4.34 | 4.09 | 5.82 | 4.89 | 5.32 | 4.06 |
| Chinese-llama-2-7b-chat | 3.57 | 2.68 | 2.29 | 3.07 | 4.46 | 4.31 | 4.26 | 4.5 | 4.63 | 4.91 | 4.13 |
| llama-2-13b-Chinese-chat | 3.35 | 2.47 | 2.21 | 2.73 | 4.23 | 4.13 | 3.31 | 4.79 | 3.93 | 4.53 | 4.71 |

`CritiqueLLM` judged results:


| Model | Overall 总分 | Reasoning Avg. 推理总分 | Math. 数学计算 | Logi. 逻辑推理 | Language Avg. 语言总分 | Fund. 基本任务 | Chi. 中文理解 | Open. 综合问答 | Writ. 文本写作 | Role. 角色扮演 | Pro. 专业能力 |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| gpt-4-1106-preview | 7.58 | 7.11 | 7.39 | 6.83 | 8.05 | 7.69 | 7.07 | 8.66 | 8.23 | 8.08 | 8.55 |
| gpt-4-0613 | 6.83 | 6.41 | 6.49 | 6.33 | 7.26 | 7.16 | 6.76 | 7.26 | 7.31 | 7.48 | 7.56 |
| chatglm-turbo(智谱清言) | 6.36 | 4.99 | 4.88 | 5.09 | 7.73 | 7.5 | 7.03 | 8.45 | 8.05 | 7.67 | 7.7 |
| erniebot-3.0(文心一言) | 5.91 | 4.75 | 4.34 | 5.15 | 7.07 | 6.46 | 7.21 | 7.29 | 7.73 | 7.03 | 6.72 |
| chatglm-pro(智谱清言) | 5.73 | 4.49 | 4.55 | 4.43 | 6.96 | 6.47 | 6.81 | 7.26 | 7.25 | 7.29 | 6.7 |
| gpt-3.5-turbo-0613 | 5.68 | 4.85 | 4.90 | 4.79 | 6.52 | 6.01 | 5.6 | 6.97 | 7.27 | 6.98 | 6.29 |
| spark_desk_v2(讯飞星火) | 5.51 | 4.58 | 4.53 | 4.62 | 6.44 | 5.76 | 6.29 | 6.37 | 7.25 | 7.03 | 5.96 |
| qwen-14b-chat | 5.41 | 4.52 | 4.54 | 4.50 | 6.31 | 6.46 | 5.84 | 6.71 | 6.47 | 6.38 | 5.98 |
| baichuan2-13b-chat | 5.26 | 3.96 | 3.83 | 4.08 | 6.56 | 5.74 | 6.19 | 7.03 | 7.21 | 6.72 | 6.49 |
| baichuan2-7b-chat | 5.05 | 3.68 | 3.23 | 4.13 | 6.42 | 5.72 | 5.71 | 7.08 | 7.41 | 6.86 | 5.73 |
| chatglm3-6b | 5.01 | 3.70 | 3.44 | 3.95 | 6.33 | 6.13 | 5.72 | 6.92 | 7.11 | 6.31 | 5.77 |
| internlm-20b | 4.97 | 3.67 | 3.46 | 3.87 | 6.27 | 5.65 | 5.52 | 6.71 | 6.77 | 6.35 | 6.61 |
| qwen-7b-chat | 4.74 | 3.66 | 3.51 | 3.80 | 5.83 | 6.01 | 5.52 | 5.89 | 6.28 | 6.16 | 5.12 |
| chatglm2-6b | 4.57 | 3.32 | 3.28 | 3.35 | 5.83 | 5.24 | 5.12 | 6.68 | 6.83 | 5.95 | 5.15 |
| Chinese-llama-2-7b-chat | 3.44 | 2.42 | 2.13 | 2.70 | 4.46 | 4.59 | 4.29 | 4.39 | 4.64 | 4.91 | 3.94 |
| internlm-chat-7b | 3.24 | 2.10 | 2.34 | 1.85 | 4.39 | 3.43 | 3.76 | 5.37 | 4.63 | 5.01 | 4.15 |
| llama-2-13b-Chinese-chat | 3.14 | 2.35 | 2.12 | 2.58 | 3.93 | 4.31 | 2.9 | 4.34 | 3.52 | 4.04 | 4.47 |

## 👏 Citation

```bibtex
@misc{liu2023alignbench,
    title={AlignBench: Benchmarking Chinese Alignment of Large Language Models},
    author={Xiao Liu and Xuanyu Lei and Shengyuan Wang and Yue Huang and Zhuoer Feng and Bosi Wen and Jiale Cheng and Pei Ke and Yifan Xu and Weng Lam Tam and Xiaohan Zhang and Lichao Sun and Hongning Wang and Jing Zhang and Minlie Huang and Yuxiao Dong and Jie Tang},
    year={2023},
    eprint={2311.18743},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```