# CHARM✨ Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations
[![arXiv](https://img.shields.io/badge/arXiv-2403.14112-b31b1b.svg)](https://arxiv.org/abs/2403.14112)
[![license](https://img.shields.io/github/license/InternLM/opencompass.svg)](./LICENSE)

πŸ“ƒ[Paper](https://arxiv.org/abs/2403.14112)
🏰[Project Page](https://opendatalab.github.io/CHARM/)
πŸ†[Leaderboard](https://opendatalab.github.io/CHARM/leaderboard.html)
✨[Findings](https://opendatalab.github.io/CHARM/findings.html)


πŸ“– δΈ­ζ–‡ | English

## Construction of CHARM


## Comparison of commonsense reasoning benchmarks




| Benchmarks | CN-Lang | CSR | CN-Specifics | Dual-Domain | Rea-Mem |
| --- | :---: | :---: | :---: | :---: | :---: |
| Most benchmarks in Davis (2023) | ✘ | βœ” | ✘ | ✘ | ✘ |
| XNLI, XCOPA, XStoryCloze | βœ” | βœ” | ✘ | ✘ | ✘ |
| LogiQA, CLUE, CMMLU | βœ” | ✘ | βœ” | ✘ | ✘ |
| CORECODE | βœ” | βœ” | ✘ | ✘ | ✘ |
| CHARM (ours) | βœ” | βœ” | βœ” | βœ” | βœ” |

"CN-Lang" indicates the benchmark is presented in Chinese language. "CSR" means the benchmark is designed to focus on CommonSense Reasoning. "CN-specific" indicates the benchmark includes elements that are unique to Chinese culture, language, regional characteristics, history, etc. "Dual-Domain" indicates the benchmark encompasses both Chinese-specific and global domain tasks, with questions presented in the similar style and format. "Rea-Mem" indicates the benchmark includes closely-interconnected reasoning and memorization tasks.

## πŸš€ What's New
- **[2024.7.26]** Inference and evaluation for all of CHARM are now supported by [OpenCompass](https://github.com/open-compass/opencompass). πŸ”₯πŸ”₯πŸ”₯
- **[2024.6.06]** Leaderboard updated! LLaMA-3, GPT-4o, Gemini-1.5, Yi1.5, Qwen1.5, etc. are evaluated.
- **[2024.5.24]** CHARM has been open-sourced!!! πŸ”₯πŸ”₯πŸ”₯
- **[2024.5.15]** CHARM has been accepted to the main conference of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)!!! πŸ”₯πŸ”₯πŸ”₯
- **[2024.3.21]** Paper available on [ArXiv](https://arxiv.org/abs/2403.14112).

## πŸ› οΈ Inference and Evaluation on Opencompass
Below are the steps for quickly downloading CHARM and using OpenCompass for evaluation.

### 1. OpenCompass Environment Setup
Refer to the installation steps for [OpenCompass](https://github.com/open-compass/OpenCompass/?tab=readme-ov-file#%EF%B8%8F-installation).
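
If OpenCompass is not yet installed, the setup typically looks like the sketch below. This is a minimal outline based on the OpenCompass README at the time of writing; the linked installation guide is authoritative and may change.

```bash
# Minimal sketch of an OpenCompass setup; follow the official installation guide above if these steps differ.
conda create --name opencompass python=3.10 -y
conda activate opencompass

git clone https://github.com/open-compass/opencompass ${path_to_opencompass}
cd ${path_to_opencompass}
pip install -e .
```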

### 2. Download CHARM
```bash
git clone https://github.com/opendatalab/CHARM ${path_to_CHARM_repo}

cd ${path_to_opencompass}
mkdir data
ln -snf ${path_to_CHARM_repo}/data/CHARM ./data/CHARM
```
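An optional sanity check: after linking, the CHARM data should be visible from inside the OpenCompass `data` directory.

```bash
# Optional: verify the symlink resolves to the CHARM data in the cloned repo
ls -l ${path_to_opencompass}/data/CHARM
```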
### 3. Run Inference and Evaluation
```bash
cd ${path_to_opencompass}

# modify config file `configs/eval_charm_rea.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_rea.py -r --dump-eval-details

# modify config file `configs/eval_charm_mem.py`: uncomment or add models you want to evaluate
python run.py configs/eval_charm_mem.py -r --dump-eval-details
```
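If a run is interrupted, re-issuing the same command with `-r` lets OpenCompass reuse the results it has already produced instead of recomputing them. For a first smoke test of the environment, OpenCompass also offers a `--debug` mode (see its CLI documentation), which runs tasks sequentially and prints errors directly to the console:

```bash
# Optional smoke test before a full run; --debug executes tasks sequentially
# and surfaces errors on screen (flag per the OpenCompass CLI documentation).
python run.py configs/eval_charm_rea.py --debug
```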
The inference and evaluation results will be under `${path_to_opencompass}/outputs`, organized like this:
```bash
outputs
β”œβ”€β”€ CHARM_mem
β”‚   └── chat
β”‚       └── 20240605_151442
β”‚           β”œβ”€β”€ predictions
β”‚           β”‚   β”œβ”€β”€ internlm2-chat-1.8b-turbomind
β”‚           β”‚   β”œβ”€β”€ llama-3-8b-instruct-lmdeploy
β”‚           β”‚   └── qwen1.5-1.8b-chat-hf
β”‚           β”œβ”€β”€ results
β”‚           β”‚   β”œβ”€β”€ internlm2-chat-1.8b-turbomind_judged-by--GPT-3.5-turbo-0125
β”‚           β”‚   β”œβ”€β”€ llama-3-8b-instruct-lmdeploy_judged-by--GPT-3.5-turbo-0125
β”‚           β”‚   └── qwen1.5-1.8b-chat-hf_judged-by--GPT-3.5-turbo-0125
β”‚           └── summary
β”‚               └── 20240605_205020  # MEMORY_SUMMARY_DIR
β”‚                   β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Anachronisms_Judgment
β”‚                   β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Movie_and_Music_Recommendation
β”‚                   β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Sport_Understanding
β”‚                   β”œβ”€β”€ judged-by--GPT-3.5-turbo-0125-charm-memory-Chinese_Time_Understanding
β”‚                   └── judged-by--GPT-3.5-turbo-0125.csv  # MEMORY_SUMMARY_CSV
└── CHARM_rea
    └── chat
        └── 20240605_152359
            β”œβ”€β”€ predictions
            β”‚   β”œβ”€β”€ internlm2-chat-1.8b-turbomind
            β”‚   β”œβ”€β”€ llama-3-8b-instruct-lmdeploy
            β”‚   └── qwen1.5-1.8b-chat-hf
            β”œβ”€β”€ results  # REASON_RESULTS_DIR
            β”‚   β”œβ”€β”€ internlm2-chat-1.8b-turbomind
            β”‚   β”œβ”€β”€ llama-3-8b-instruct-lmdeploy
            β”‚   └── qwen1.5-1.8b-chat-hf
            └── summary
                β”œβ”€β”€ summary_20240605_205328.csv  # REASON_SUMMARY_CSV
                └── summary_20240605_205328.txt
```
### 4. Generate Analysis Results
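The commented names in the output tree above (`REASON_RESULTS_DIR`, `REASON_SUMMARY_CSV`, `MEMORY_SUMMARY_DIR`, `MEMORY_SUMMARY_CSV`) are the placeholders used in the commands below. For the example run shown above they could be set like this (the timestamped directories come from that run; substitute your own):

```bash
# Example only: the timestamped paths below come from the sample output tree above.
OUTPUTS=${path_to_opencompass}/outputs
REASON_RESULTS_DIR=${OUTPUTS}/CHARM_rea/chat/20240605_152359/results
REASON_SUMMARY_CSV=${OUTPUTS}/CHARM_rea/chat/20240605_152359/summary/summary_20240605_205328.csv
MEMORY_SUMMARY_DIR=${OUTPUTS}/CHARM_mem/chat/20240605_151442/summary/20240605_205020
MEMORY_SUMMARY_CSV=${MEMORY_SUMMARY_DIR}/judged-by--GPT-3.5-turbo-0125.csv
```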
```bash
cd ${path_to_CHARM_repo}

# generate Table 5, Table 6, Table 9, and Table 10 of https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_reasoning.py ${REASON_SUMMARY_CSV}

# generate Figure 3 and Figure 9 of https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/summarize_mem_rea.py ${REASON_SUMMARY_CSV} ${MEMORY_SUMMARY_CSV}

# generate Table 7, Table 12, Table 13, and Figure 11 of https://arxiv.org/abs/2403.14112
PYTHONPATH=. python tools/analyze_mem_indep_rea.py data/CHARM ${REASON_RESULTS_DIR} ${MEMORY_SUMMARY_DIR} ${MEMORY_SUMMARY_CSV}
```

## πŸ–ŠοΈ Citation
```bibtex
@misc{sun2024benchmarking,
  title={Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations},
  author={Jiaxing Sun and Weiquan Huang and Jiang Wu and Chenya Gu and Wei Li and Songyang Zhang and Hang Yan and Conghui He},
  year={2024},
  eprint={2403.14112},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

## πŸ’³ License

This project is released under the Apache 2.0 [license](./LICENSE).