https://github.com/sparkle-reasoning/sparkle
[NeurIPS'25] Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning
- Host: GitHub
- URL: https://github.com/sparkle-reasoning/sparkle
- Owner: sparkle-reasoning
- License: mit
- Created: 2025-06-13T06:25:27.000Z (11 months ago)
- Default Branch: master
- Last Pushed: 2025-12-11T05:26:53.000Z (5 months ago)
- Last Synced: 2025-12-12T00:15:40.832Z (5 months ago)
- Topics: data-efficient, grpo, interpretability, large-language-models, machine-learning, mathematical-reasoning, qwen, reasoning-language-models, reinforcement-learning, rlhf, scaling
- Language: Python
- Homepage: https://arxiv.org/abs/2506.04723
- Size: 5.24 MB
- Stars: 14
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# SPARKLE: Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning

**SPARKLE** is a fine-grained framework for evaluating LLM reasoning improvements under reinforcement learning (RL), analyzing models along three key axes: **plan-following and execution**, **knowledge utilization**, and **subproblem decomposition**. We also study problem difficulty and find that hard problems remain valuable for RL training when appropriately structured with partial solution steps.
## 🔥 Key Insights
### 💡 **Hard Problems Are Still Valuable**
Contrary to common belief, hard problems can be effective for RL training when **augmented with partial solution steps**. Our curriculum-style approach shows that continuing training on the hardest problems, augmented with partial solutions, leads to the best performance.
### 💡 **RL Enhances Internal Strategy Formation**
RL-tuned models don't just execute external plans better; they **formulate and follow internal strategies** better suited to their reasoning processes. Providing explicit step-by-step plans surprisingly degrades performance on challenging benchmarks, but RL models show greater robustness.
### 💡 **Better Knowledge Integration**
RL significantly enhances the model's capacity to **integrate provided knowledge** into its reasoning process, leading to consistent performance improvements across diverse mathematical tasks and difficulty levels.
## Results
| Model | AIME | AMC | MATH500 | GSM8K | OlympiadBench | Avg. |
| --------------------------- | ------------------ | ------------------ | ------------------ | ------------------ | ------------------ | --------- |
| Qwen-2.5-Math-7B-Base | 16.67 | 42.50 | 44.03 | 42.53 | 28.65 | 35.23 |
| SparkleRL-Stage 1           | 46.67 (↑30.00)     | 67.50 (↑25.00)     | 80.00 (↑35.97)     | 91.77 (↑49.24)     | 39.11 (↑10.46)     | 65.01     |
| **SparkleRL-Stage 2 (Aug)** | **50.42** (↑33.75) | **71.25** (↑28.75) | **81.00** (↑36.97) | 92.38 (↑49.85)     | **40.11** (↑11.46) | **67.03** |
*Table: Avg@8 performance across benchmarks. Stage 2 (Aug) uses our curriculum-style training with augmented hard problems.*
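The Avg@8 metric in the table can be illustrated with a minimal sketch. It assumes Avg@k means the accuracy averaged over k sampled responses per problem and then over problems; the function name `avg_at_k` and the toy data are hypothetical and are not taken from the repository's evaluation scripts.

```python
from statistics import mean


def avg_at_k(correct: list[list[bool]]) -> float:
    """Return Avg@k as a percentage.

    correct[i][j] is True when the j-th sampled response to problem i is right.
    Assumption: every problem has the same number of samples (k).
    """
    # Fraction of correct samples for each problem.
    per_problem = [mean(1.0 if c else 0.0 for c in samples) for samples in correct]
    # Average over problems, reported on a 0-100 scale like the table above.
    return 100.0 * mean(per_problem)


# Toy data: three problems, eight samples each (k = 8).
toy = [[True] * 8, [False] * 8, [True] * 4 + [False] * 4]
score = avg_at_k(toy)
print(f"Avg@8 = {score:.2f}")  # prints "Avg@8 = 50.00"
```

Averaging over several samples reduces the variance that a single sampled response would introduce, which matters on small benchmarks such as AIME.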
## Quick Start
### Installation
```bash
# Create and activate conda environment
conda create -n sparkle python==3.12
conda activate sparkle
# Install PyTorch and Flash Attention
pip3 install torch==2.4.0
pip install psutil numpy
pip3 install flash-attn --no-build-isolation
# Install VERL and dependencies
cd verl
pip3 install -e .
pip install wandb IPython matplotlib
pip install vertexai latex2sympy2
pip3 install -U antlr4-python3-runtime==4.9.3
```
### Prepare Datasets
```bash
# Generate parquet files in data/*.parquet
python scripts/data/prepare_stage_one_data.py
python scripts/data/prepare_stage_two_data_aug.py --aug_version all # Recommended based on our ablation studies
```
### Training
```bash
# Set XFormers backend to avoid CUDA errors
export VLLM_ATTENTION_BACKEND=XFORMERS
# Stage 1: Foundation RL training on full dataset
export PATH_TO_BASE_MODEL="Qwen/Qwen2.5-Math-7B"
./scripts/train/stage_one.sh --model $PATH_TO_BASE_MODEL
# Stage 2: Curriculum-style training with augmented hard problems (recommended)
export PATH_TO_STAGE_ONE_MODEL="/path/to/your/stage1/checkpoint"
./scripts/train/stage_two_aug.sh --model $PATH_TO_STAGE_ONE_MODEL
```
> **Note**: Stage 2 training uses the `spk_h_aug` reward type, which handles augmented responses with partial format. This is crucial for the curriculum-style training approach.
### Evaluation
```bash
# Step 1: Convert FSDP checkpoint to HuggingFace format (if using your own checkpoints)
python eval/fsdp2hf.py \
    --fsdp_path /path/to/checkpoint/actor \
    --base_model Qwen/Qwen2.5-Math-7B \
    --output_path /path/to/output
# Step 2: Set up evaluation environment
cd eval/lm-evaluation-harness
pip install -e .
# Step 3: Run comprehensive evaluation across all benchmarks
export PATH_TO_STAGE_ONE_MODEL="/path/to/stage1/model"
export PATH_TO_STAGE_TWO_MODEL="/path/to/stage2/model"
./scripts/eval/eval_all_vllm.sh
```
> **Tip**: You can also directly use our pre-trained checkpoints from HuggingFace instead of converting your own FSDP checkpoints.
## 🤗 Model Checkpoints
We release our checkpoints on [HuggingFace](https://huggingface.co/sparkle-reasoning/models):
- [`sparkle-reasoning/SparkleRL-7B-Stage1`](https://huggingface.co/sparkle-reasoning/SparkleRL-7B-Stage1) - Foundation RL-tuned model trained with the large-scale full dataset
- [`sparkle-reasoning/SparkleRL-7B-Stage2-aug`](https://huggingface.co/sparkle-reasoning/SparkleRL-7B-Stage2-aug) - **Recommended**: Curriculum-style training with a small amount of augmented hard problems
- [`sparkle-reasoning/SparkleRL-7B-Stage2-hard`](https://huggingface.co/sparkle-reasoning/SparkleRL-7B-Stage2-hard) - Training on hard problems only
- [`sparkle-reasoning/SparkleRL-7B-Stage2-mix`](https://huggingface.co/sparkle-reasoning/SparkleRL-7B-Stage2-mix) - Mixed difficulty training
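As a rough sketch, a released checkpoint could be loaded with the Hugging Face `transformers` library as shown below. This assumes the checkpoints are standard causal-LM repositories and that a plain-text prompt is acceptable; the helper name `generate_solution` is hypothetical, and the prompt template used by the repository's own evaluation scripts may differ. `device_map="auto"` additionally requires the `accelerate` package.

```python
MODEL_ID = "sparkle-reasoning/SparkleRL-7B-Stage2-aug"


def generate_solution(problem: str, max_new_tokens: int = 1024) -> str:
    """Generate a response for one problem (hypothetical helper)."""
    # Imported inside the function so that defining the helper stays cheap;
    # the model weights are downloaded only when the function is called.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        device_map="auto",  # assumption: accelerate is installed
    )
    # Assumption: the raw problem text is used as the prompt.
    inputs = tokenizer(problem, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens, dropping the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `generate_solution("What is 12 * 13?")` downloads the 7B checkpoint on first use, so a GPU with sufficient memory is advisable.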
## Datasets
Our curated datasets are available on [HuggingFace](https://huggingface.co/sparkle-reasoning/datasets):
### Training Data
- [`sparkle-reasoning/dsr40k`](https://huggingface.co/datasets/sparkle-reasoning/dsr40k) - Large-scale training data (40.3k problems) used for stage one foundation training
- [`sparkle-reasoning/hardmath`](https://huggingface.co/datasets/sparkle-reasoning/hardmath) - Challenging mathematical problems (6.5k problems) used for stage two curriculum training: questions that the stage one model cannot answer, with rigorously cleaned labels
### Evaluation Benchmarks
- AIME 2024, AMC 2023, MATH500, GSM8K, OlympiadBench - Standard mathematical reasoning evaluation sets
## 🏗️ Framework Overview
The SPARKLE framework evaluates mathematical reasoning along three dimensions:
1. **Plan-Following and Execution**: How well models follow and execute reasoning plans
2. **Knowledge Utilization**: Ability to integrate external knowledge into reasoning
3. **Subproblem Decomposition**: Capacity to solve decomposed subproblems
### Curriculum-Style Training
Our key innovation is a two-stage curriculum approach:
1. **Stage 1**: Train on the full dataset to build a strong foundation
2. **Stage 2**: Continue training on the hardest problems augmented with partial solution steps
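The Stage 2 augmentation can be pictured with a minimal sketch: the first few steps of a reference solution are appended to the problem statement before it is given to the model. The function name `augment_with_partial_solution` and its parameters are hypothetical; the actual logic in `scripts/data/prepare_stage_two_data_aug.py` may differ.

```python
def augment_with_partial_solution(
    problem: str, solution_steps: list[str], num_steps: int = 1
) -> str:
    """Append the first `num_steps` solution steps to the problem text."""
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    # Only a prefix of the solution is revealed, so the model still has
    # to complete the remaining reasoning on its own.
    hint = "\n".join(solution_steps[:num_steps])
    return f"{problem}\n\n{hint}" if hint else problem


# Toy problem and steps, made up for illustration.
steps = [
    "Subtract 3 from both sides to get 2x = 8.",
    "Divide both sides by 2 to get x = 4.",
]
augmented = augment_with_partial_solution("Solve 2x + 3 = 11.", steps, num_steps=1)
print(augmented)
```

Revealing only a prefix keeps the task non-trivial while giving the policy a starting point on problems it could not otherwise solve.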
#### Example: Augmented Hard Problem
**🔵 Original Problem:**
> One of Euler's conjectures was disproved in the 1960s by three American mathematicians when they showed there was a positive integer such that: 133⁵ + 110⁵ + 84⁵ + 27⁵ = n⁵. Find the value of n.
**🎯 Augmented with Partial Solution:**
> One of Euler's conjectures was disproved in the 1960s by three American mathematicians when they showed there was a positive integer such that: 133⁵ + 110⁵ + 84⁵ + 27⁵ = n⁵. Find the value of n.
>
> Taking the given equation modulo 2, 3, and 5, respectively, we have:
> n⁵ ≡ 0 (mod 2), n⁵ ≡ 0 (mod 3), n⁵ ≡ 4 (mod 5)
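The congruences in the partial solution can be checked numerically. The short script below is an illustrative check, not part of the repository: it recomputes the three residues and then searches for n among the candidates they allow. The search bound 176 follows from n⁵ < 4 · 133⁵, which gives n < 133 · 4^(1/5) ≈ 175.5.

```python
terms = [133, 110, 84, 27]
total = sum(t**5 for t in terms)

# Residues of the left-hand side, which must equal n^5 modulo 2, 3 and 5.
residues = {m: total % m for m in (2, 3, 5)}
print(residues)  # prints "{2: 0, 3: 0, 5: 4}"

# n^5 = 0 (mod 2) and (mod 3) force n to be divisible by 6, and since
# n^5 = n (mod 5) by Fermat's little theorem, n = 4 (mod 5).
candidates = [
    k for k in range(134, 176) if k % 6 == 0 and k % 5 == 4
]
print(candidates)  # prints "[144, 174]"

# Only one candidate satisfies the original equation.
n = next(k for k in candidates if k**5 == total)
print(n)  # prints "144"
```

The modular hints narrow 42 possible values down to two, which illustrates how a single partial step can make a hard problem tractable.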
## 🚧 TODOs
- [ ] Release test sets - ETA by July 13, 2025
- [ ] Provide additional evaluation scripts for fine-grained analysis
## Issues & Support
If you encounter any problems, have questions, or would like to contribute to the project, please feel free to:
- **Open an issue** on our GitHub repository
- **Contact us directly** at [milawang@cs.wisc.edu](mailto:milawang@cs.wisc.edu)
We welcome contributions, bug reports, and feature requests from the community!
## Citation
If you find this work useful, please consider citing:
```bibtex
@misc{wang2025accuracydissectingmathematicalreasoning,
      title={Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning},
      author={Jiayu Wang and Yifei Ming and Zixuan Ke and Caiming Xiong and Shafiq Joty and Aws Albarghouthi and Frederic Sala},
      year={2025},
      eprint={2506.04723},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2506.04723},
}
```
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Links
- **Paper**: [Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning](https://arxiv.org/abs/2506.04723)
- **Project Page**: [https://sparkle-reasoning.github.io/](https://sparkle-reasoning.github.io/)
- 🤗 **Models**: [https://huggingface.co/sparkle-reasoning/models](https://huggingface.co/sparkle-reasoning/models)
- 🤗 **Datasets**: [https://huggingface.co/sparkle-reasoning/datasets](https://huggingface.co/sparkle-reasoning/datasets)