https://github.com/opendatalab/rest
- Host: GitHub
- URL: https://github.com/opendatalab/rest
- Owner: opendatalab
- License: apache-2.0
- Created: 2025-07-09T07:42:20.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-07-15T07:02:58.000Z (6 months ago)
- Last Synced: 2025-07-15T09:59:22.379Z (6 months ago)
- Language: Python
- Size: 10.4 MB
- Stars: 14
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
## README
## 📝 Key Takeaways
📉 **Even SOTA models like DeepSeek-R1 exhibit substantial performance degradation under stress testing.**
> Table: Performance of DeepSeek-R1 under traditional single-question testing (single) and multi-question stress testing (stress).
| Mode | GSM8K | MATH500 | AMC23 | AIME24 | AIME25 | GPQA Diamond | LiveCodeBench(v5) |
| ----- |------| ---- |------| ------- | ----- | ----- | ----- |
| Single | 96.20 | 97.00 | 93.75 | 81.66 | 68.75 | 70.20 | 63.44 |
| Stress | 96.16 | 92.09 | 81.80 | 52.49 | 37.17 | 64.63 | 40.83 |
📊 **REST enhances the discriminative power of existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations.**
> Table: Performance of different LRMs on MATH500 under multi-question stress testing (stress).
| Mode | DS-R1-1.5B | L1-Qwen-1.5B-Max | DS-R1-7B | AReaL-boba-RL-7B | OpenR1-Qwen-7B | Nemotron-Nano-8B | DS-R1-32B | DeepSeek-R1 |
| ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- |
| Single | 83.40 | 83.40 | 93.00 | 95.00 | 92.20 | 94.40 | 94.60 | 97.00 |
| Stress | 42.47 | 73.23 | 66.75 | 60.77 | 81.64 | 86.04 | 88.97 | 92.09 |
💡 **"Overthinking" is a ritical factor contributing to the performance degradation and "Long2short" technique can help.**
> Figure: The effect of Long2Short training. Long2Short training mitigates the performance degradation under high stress levels (number of questions per input).
> (Panels: 1.5B Models on MATH500 · 7B Models on MATH500 · 7B Models on AMC23)
✅ **Under stress testing, capable LRMs employ concise reasoning for earlier questions.**
> Figure: The reasoning token count for questions at different positions on AIME24 under stress testing.
> (Panels: DS-R1-Distill-Qwen-7B · Nemotron-nano-7B · DeepSeek-R1)
## 🚀 Quick Start
REST mainly requires the following three packages. To install them, simply run `bash sh/install.sh`:
- [OpenCompass](https://github.com/open-compass/opencompass)
- [LMDeploy](https://github.com/InternLM/lmdeploy)
- [math-verify](https://github.com/huggingface/Math-Verify)
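If `sh/install.sh` does not match your environment, a rough manual equivalent (an assumption; the actual script may pin specific versions or install OpenCompass from source) is:

```bash
# Rough manual equivalent of sh/install.sh; the actual script may differ.
pip install opencompass lmdeploy math-verify
```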
After installation, run the following scripts to reproduce our evaluation results. To evaluate API-based models, please specify `OPENAI_API_BASE` and `OPENAI_API_KEY` in these scripts.
```bash
bash sh/eval_math.sh
# Code data will be automatically downloaded from OpenCompass
bash sh/eval_code.sh
```
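For API-based models, one way to provide the credentials (assuming the scripts also read them from the environment rather than only from values edited into the scripts) is:

```bash
# Example only: substitute your provider's endpoint and key.
export OPENAI_API_BASE="https://api.openai.com/v1"
export OPENAI_API_KEY="sk-..."
bash sh/eval_math.sh
```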
To evaluate GPQA, we use [gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it) to extract the answer for each question, because LRMs often fail to put each answer within `\boxed{}`. We use SGLang to deploy gemma-3-27b-it; you can install it in another environment.
```bash
# Install sglang==0.4.4.post3 in another environment.
# A Python interpreter is needed in the new env for pip; 3.10 is one reasonable choice.
conda create -n sglang044 python=3.10 -y
conda activate sglang044
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]==0.4.4.post3"
```
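`sh/serve_gemma3.sh` is expected to launch an OpenAI-compatible SGLang server for gemma-3-27b-it; a minimal sketch of such a launch (the port and tensor-parallel size below are assumptions, not the script's actual values) is:

```bash
# Sketch of an SGLang launch; sh/serve_gemma3.sh may use different flags.
python -m sglang.launch_server \
  --model-path google/gemma-3-27b-it \
  --tp 4 \
  --port 30000
```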
Set "VERIFYER_MODEL_NAME", "VERIFYER_API_BASE", "VERIFYER_API_KEY" in "sh/eval_gpqa.sh" and run inference and evaluation separately.
```bash
bash sh/eval_gpqa.sh infer   # run model inference on GPQA
bash sh/serve_gemma3.sh &    # launch the gemma-3-27b-it answer-extraction server in the background
bash sh/eval_gpqa.sh eval    # extract answers with the verifier and compute scores
```
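For illustration, if the gemma-3-27b-it server runs locally on port 30000 (an assumption; match whatever `sh/serve_gemma3.sh` actually uses), the verifier settings in `sh/eval_gpqa.sh` would look roughly like:

```bash
# Hypothetical values; adjust to your deployment.
VERIFYER_MODEL_NAME="google/gemma-3-27b-it"
VERIFYER_API_BASE="http://127.0.0.1:30000/v1"
VERIFYER_API_KEY="EMPTY"   # OpenAI-compatible servers often accept a placeholder key
```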
To evaluate your own model, set `MODEL_NAME` (a valid Hugging Face model name), `TP_SIZE`, and `TEMPERATURE` in `eval_custom_model.sh`.
```bash
bash sh/eval_huggingface_model.sh
```
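For example (placeholder values, not defaults from the repo), the variables described above might be set as:

```bash
# Hypothetical example values; adjust to your model and hardware.
MODEL_NAME="Qwen/Qwen2.5-7B-Instruct"   # any valid Hugging Face model name
TP_SIZE=1                               # tensor-parallel size, assumed to control model-serving parallelism
TEMPERATURE=0.6                         # sampling temperature
```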
Thanks to [OpenCompass](https://github.com/open-compass/opencompass) for their open-source code.
## Citation
Please cite our paper if you refer to our code, results, or paper.
```bibtex
@misc{pan2025REST,
      title={REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once},
      author={Zhuoshi Pan and Qizhi Pei and Yu Li and Qiyao Sun and Zinan Tang and H. Vicky Zhao and Conghui He and Lijun Wu},
      year={2025},
      eprint={2507.10541},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.10541},
}
```
