# PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts

[arXiv](https://arxiv.org/abs/2504.18428) · [Hugging Face Dataset](https://huggingface.co/datasets/Qwen/PolyMath) · Leaderboard · Apache-2.0 License

This is the official repository for the paper **"PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts"**.

## 📖 Introduction

**PolyMath** is a multilingual mathematical reasoning benchmark covering 18 languages and four difficulty levels spanning easy to hard, with 9,000 high-quality problem samples in total. The benchmark combines comprehensive difficulty coverage, language diversity, and high-quality translation, making it a highly discriminative multilingual mathematical benchmark in the era of reasoning LLMs.

## ✨ Features

- 📈 **Broad Difficulty Range:** PolyMath defines and partitions **mathematical difficulty across four levels** using two core dimensions: *Thought Depth* and *Knowledge Breadth*, ranging from K-12 to Olympiad and advanced frontier mathematics, with **125 problems per language at each level**.



- 🌍 **Language Diversity:** Each problem in PolyMath is available in **18 parallel language versions**, encompassing over 75% of the world's native speakers and major language families, ensuring diversity across both high-resource and low-resource languages.



- 🧑‍🏫 **High-Quality Annotation:** Each problem translation is **calibrated by language experts**, avoiding direct use of LLM-generated output and ensuring precise terminology and logical clarity.



## 🛠️ Data Usage

The PolyMath dataset is publicly available and can be accessed at [![Hugging Face](https://img.shields.io/badge/Dataset-HuggingFace-yellow?logo=huggingface)](https://huggingface.co/datasets/Qwen/PolyMath), with the following layout:

```
PolyMath/
├── ar/
│   ├── low.parquet
│   ├── medium.parquet
│   ├── high.parquet
│   └── top.parquet
├── bn/
├── ...
└── zh/
```
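
For quick experimentation, a single language/level split can be read straight from its parquet file with the 🤗 `datasets` library. This is a minimal sketch, assuming the directory layout above; the `hf://` path pattern and the `train` split name are our assumptions, not a loader documented by this repo.

```python
from datasets import load_dataset

# Load the Arabic "top" difficulty split directly from its parquet file.
# NOTE: the hf:// path mirrors the layout shown above and is an assumption.
ds = load_dataset(
    "parquet",
    data_files="hf://datasets/Qwen/PolyMath/ar/top.parquet",
    split="train",
)
print(len(ds))  # expected: 125 problems per language per level
print(ds[0])    # inspect one sample's fields
```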

* Additionally, all prompts used in the inference process are provided in `instruction.py`.

## 🧪 Evaluation

### Environment Preparation

```shell
conda create -n polymath python=3.10
conda activate polymath
pip install -r requirements.txt
```

### Output Process

Since different inference engines produce outputs in different formats, please standardize your results into the format specified below:

```shell
mkdir output
cd output
```

1. Use `/{model_name}` as the primary directory tier and `/{difficulty_level}` as the secondary tier.

2. For each language, generate a `{lang_name}.jsonl` file within `/{difficulty_level}` containing 125 output samples. Each sample should adhere to the following format:

```json
{"idx: 0, ...}
...
{
"idx": 114, ### unique sample id
"question": "ๅ‡่ฎพๅœจๅนณ้ขไธŠ็š„ไธ€ไธช็ดง้›† $C$ ๆปก่ถณไปฅไธ‹ๆกไปถ๏ผšๅฏนๆฏไธ€ไธชๆ–นๅ‘๏ผŒ้ƒฝๅญ˜ๅœจไธ€ๆก่ฏฅๆ–นๅ‘ไธŠ็š„็›ด็บฟ $l$๏ผŒไฝฟๅพ— $l \\cap C$ ็š„็ปดๆ•ฐ่‡ณๅฐ‘ไธบ $\\frac{1}{2}$ใ€‚้‚ฃไนˆ๏ผŒ$C$ ็š„ๆœ€ๅฐๅฏ่ƒฝ็ปดๆ•ฐๆ˜ฏๅคšๅฐ‘๏ผŸ", ### question in corresponding language version
"answer": "$\\frac{5}{4}$", ### ground truth
"thinking_pred": "ๅ—ฏ๏ผŒ่ฟ™ไธช้—ฎ้ข˜็œ‹่ตทๆฅๆœ‰็‚นๆŒ‘ๆˆ˜ๆ€ง๏ผŒไธ่ฟ‡่ฎฉๆˆ‘ๆ…ขๆ…ขๆƒณๆƒณใ€‚้ข˜็›ฎๆ˜ฏ่ฏด๏ผŒๅœจๅนณ้ขไธŠๆœ‰ไธ€ไธช็ดง้›†C...", ### Note: Model's thinking content. Note: If it is a non-reasoning model, leave this field blank.
"answer_pred": "้ข˜็›ฎ่ฆๆฑ‚ๅœจๅนณ้ขไธŠ็š„ไธ€ไธช็ดง้›† \\( C \\)๏ผŒๆปก่ถณๅฏนไบŽๆฏไธ€ไธชๆ–นๅ‘๏ผŒ...", ### Note: Model's answer content.
}
...
{"idx: 124, ...}
```
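
Outputs can be written into this layout with a few lines of Python. The helper below is our own sketch (its name and signature are not part of this repo); it assumes each sample is already a dict carrying the keys shown above.

```python
import json
from pathlib import Path

def write_outputs(model_name, level, lang, samples):
    """Hypothetical helper: dump one language's samples (dicts with the
    keys shown above) to output/{model_name}/{level}/{lang}.jsonl."""
    out_dir = Path("output") / model_name / level
    out_dir.mkdir(parents=True, exist_ok=True)
    with open(out_dir / f"{lang}.jsonl", "w", encoding="utf-8") as f:
        for sample in sorted(samples, key=lambda s: s["idx"]):
            # ensure_ascii=False keeps non-Latin questions human-readable
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```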

The complete file structure is as follows:

```shell
PolyMath/output
├── qwq-32b
│   ├── low
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── medium
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── high
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   └── top
│       ├── ar.jsonl
│       ├── bn.jsonl
│       └── ...
├── deepseek-v3
│   ├── low
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── medium
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   ├── high
│   │   ├── ar.jsonl
│   │   ├── bn.jsonl
│   │   └── ...
│   └── top
│       ├── ar.jsonl
│       ├── bn.jsonl
│       └── ...
└── ... (other models)
```
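
Before scoring, it is worth verifying that every file contains exactly 125 samples, as required above. The checker below is our own sketch (not part of the repo), assuming the `output/` layout shown in the tree:

```python
from pathlib import Path

LEVELS = ["low", "medium", "high", "top"]

def check_outputs(model_name, root="output"):
    """Report any {level}/{lang}.jsonl file that does not hold exactly 125 samples."""
    for level in LEVELS:
        for path in sorted((Path(root) / model_name / level).glob("*.jsonl")):
            with open(path, encoding="utf-8") as f:
                n = sum(1 for line in f if line.strip())
            if n != 125:
                print(f"{path}: expected 125 samples, found {n}")

check_outputs("qwq-32b")
```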

### Score Computation

`/eval/run_eval.py` provides the evaluation code for **accuracy** and **language consistency**. Run `run_eval.sh` to iterate over your processed output files:

```shell
cd ../eval
bash run_eval.sh
```

`run_eval.sh`

```shell
model_list=(qwq-32b deepseek-v3)
language_list=(en zh ar bn de es fr id it ja ko ms pt ru sw te th vi)
level_list=(low medium high top)

for model in "${model_list[@]}"; do
  for language in "${language_list[@]}"; do
    for level in "${level_list[@]}"; do
      python run_eval.py --model "$model" --language "$language" --level "$level"
    done
  done
done
```

You can customize `model_list`, `language_list`, and `level_list`. Once the evaluations for all levels of a given model in a given language have completed, computation of the benchmark score is triggered automatically.

**During evaluation, a score file will be automatically generated at `/eval/output/{model_name}/score.json`, and all scores will be saved.**
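
To inspect the saved scores programmatically, something like the following works; the exact fields inside `score.json` are determined by `run_eval.py`, so treat this as an assumption-laden sketch:

```python
import json

# Path pattern taken from the note above; the field names inside the
# file depend on run_eval.py and are not assumed here.
with open("eval/output/qwq-32b/score.json", encoding="utf-8") as f:
    scores = json.load(f)
print(json.dumps(scores, indent=2, ensure_ascii=False))
```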

## 📄 Citation

If you use **PolyMath** in your research or find our work useful, please cite us:

```bibtex
@article{wang2025polymath,
  title={PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts},
  author={Yiming Wang and Pei Zhang and Jialong Tang and Haoran Wei and Baosong Yang and Rui Wang and Chenshu Sun and Feitong Sun and Jiran Zhang and Junxuan Wu and Qiqian Cang and Yichang Zhang and Fei Huang and Junyang Lin and Fei Huang and Jingren Zhou},
  journal={arXiv preprint arXiv:2504.18428},
  year={2025},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2504.18428},
}
```