https://github.com/tal-tech/chinese-k12-evaluation
https://github.com/tal-tech/chinese-k12-evaluation
Last synced: about 1 year ago
JSON representation
- Host: GitHub
- URL: https://github.com/tal-tech/chinese-k12-evaluation
- Owner: tal-tech
- Created: 2023-08-07T02:28:56.000Z (almost 3 years ago)
- Default Branch: main
- Last Pushed: 2024-03-21T10:42:34.000Z (about 2 years ago)
- Last Synced: 2025-04-12T10:06:05.936Z (about 1 year ago)
- Language: Python
- Size: 32.9 MB
- Stars: 25
- Watchers: 12
- Forks: 2
- Open Issues: 1
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
## CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation
## Overview
我们利用了k12教育领域知识图谱构建了一个LLM的评估基准。
该评测集包括500+个一级知识点和1900+个二级知识点,基本涵盖了K12教育领域的全面知识。
本次评估的主要目的是全面评估LLM在中文语境下的高级理解能力和推理能力。
我们的评估包括九个学科、五个不同的问题类型,共有41k个问题。
我们使用了四种不同的测试模式来测试当前主流的LLM(zero-shot、few-shot、 zero-shot-cot、 few-shot-cot)。并对数学的cot结果进行了推理步骤的评测。

## News
* **[2023.12.09]** CK12 has been accepted to AAAI 2024 🎉🎉🎉
## Leaderboard
我们的测试集在四种测试模式下,不同模型的表现
single,multi, filling, juding, sorting分别代表单选,多选,填空,判断,排序;single*代表选项被打乱,single**代表增加了10个干扰选项

### 数学推理子集步骤准确性评估

## data
https://drive.google.com/drive/folders/1pvvhI_y1o5olcxHGGWFqt3SDv3PRX0rd?usp=drive_link
## run testing
### 环境依赖:
torch 2.0.1
transformers 4.33.3
vllm 0.1.3
### huggingface inference
原始数据是 data/original_data
打乱选项的数据是 data/shuffle_options
增加干扰选项的数据是 data/adding_10_distractor_options
`bash code/release_code_aaai_ck12/inference_hf.sh data/original_data sava_name`
### vllm inference
`bash code/release_code_aaai_ck12/inference_vllm.sh data/original_data sava_name`
### get metric
`python code/release_code_aaai_ck12/metric.py`
## 例子
文科和理科的题目例子
### 政治

### 生物

## Citation
Please cite our paper if you use our dataset.
```
@article{you2023ck12,
title={CK12: A Rounded K12 Knowledge Graph Based Benchmark for Chinese Holistic Cognition Evaluation},
author={Weihao You, Pengcheng Wang, Changlong Li, Zhilong Ji and Jinfeng Bai}
year={2023}
}
```