# Laser
:hugs: HF Repo
:page_with_curl: Paper
:bird: Twitter

This repo contains the resources (**Code**, **Data**, **Models**) for the paper "Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping".
Laser (**L**ength-b**A**sed **S**t**E**p **R**eward shaping) and its adaptive versions Laser-D and Laser-DE (**Dynamic** and **D**ifficulty-aware **L**ength-b**A**sed **S**t**E**p **R**eward shaping) are three novel methods that improve both the effectiveness and efficiency of reasoning. Laser-D and Laser-DE achieve a **6.1**-point improvement on AIME2024 while reducing token usage by **63%**.
## Table of Contents
- [Laser](#laser)
- [Table of Contents](#table-of-contents)
- [News](#news)
- [Introduction](#introduction)
- [Unified Framework for Length-based Reward Shaping](#unified-framework-for-length-based-reward-shaping)
- [Performance](#performance)
- [Efficacy-Efficiency Trade-off](#efficacy-efficiency-trade-off)
- [:rocket: Resources](#rocket-resources)
- [Datasets](#datasets)
- [Models](#models)
- [1.5B Models (Based on DeepSeek-R1-Distill-Qwen-1.5B)](#15b-models-based-on-deepseek-r1-distill-qwen-15b)
- [Laser Models](#laser-models)
- [Laser-D Models](#laser-d-models)
- [Laser-DE Models](#laser-de-models)
- [7B Models (Based on DeepSeek-R1-Distill-Qwen-7B)](#7b-models-based-on-deepseek-r1-distill-qwen-7b)
- [How to Start :running:?](#how-to-start-running)
- [Installation](#installation)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation](#citation)

## News
- :fire: [05/2025] We are excited to release the resources for the paper "Laser: Learn to Reason Efficiently with Adaptive Length-based Reward Shaping".
## Introduction
In this work, we propose a **unified view** of length-based reward shaping that covers various existing reward-shaping and truncation methods. Building on this view, we propose a novel **L**ength-b**A**sed **S**t**E**p **R**eward shaping method (**Laser**), which employs a step reward function based on a target length. We further propose adaptive versions of Laser, **Laser-D** and **Laser-DE**, based on two key intuitions:
1. The reasoning behavior of the model evolves dynamically during training, necessitating reward specifications that are also adaptive and dynamic;
2. Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more heavily for easy queries.
This approach facilitates a combination of fast and slow thinking, leading to a better overall tradeoff. Unlike methods that improve token efficiency at the expense of accuracy, our proposed approaches deliver substantial gains in both dimensions—even on the challenging AIME2024 benchmark.
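The sketch below illustrates these two intuitions in Python. It is only an illustration: the function names, the bonus value, and the linear scaling rule are assumptions made for readability, not the exact reward or adaptive schedule defined in the paper.

```python
# Illustrative sketch of the Laser intuitions, NOT the paper's exact reward.

def step_reward(correct: bool, num_tokens: int, target_len: int) -> float:
    """Laser-style step reward: a correct answer always earns the base
    correctness reward, and earns an extra step bonus only if its CoT
    stays within the target length (the bonus value is an assumption)."""
    correctness = 1.0 if correct else 0.0
    bonus = 1.0 if correct and num_tokens <= target_len else 0.0
    return correctness + bonus


def adaptive_target(base_target: int, empirical_accuracy: float) -> int:
    """Laser-D/DE intuition: give easy queries (high accuracy among sampled
    rollouts) a tighter length budget and hard queries a looser one.
    The linear rule is a placeholder, not the paper's schedule."""
    return int(base_target * (2.0 - empirical_accuracy))


if __name__ == "__main__":
    # An easy query (accuracy 0.9) gets a tighter budget than a hard one (0.2).
    print(adaptive_target(4096, 0.9))  # 4505
    print(adaptive_target(4096, 0.2))  # 7372
    # Within budget -> correctness + bonus; over budget -> correctness only.
    print(step_reward(True, 3000, 4096), step_reward(True, 6000, 4096))  # 2.0 1.0
```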
### Unified Framework for Length-based Reward Shaping
This unified framework casts various reward-shaping and truncation methods as special cases of length-based reward shaping. More details can be found in Section 4 of our [paper](https://arxiv.org/abs/2505.15612).
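As a rough, assumed illustration of what such a unified view can look like (the actual formulation is given in Section 4 of the paper), the sketch below treats truncation and different shaping schemes as interchangeable length-control functions plugged into one reward:

```python
# Assumed illustration of a unified view: the final reward is a correctness
# reward modulated by a pluggable length-control function lam(num_tokens).
# The concrete functions below are examples, not the paper's definitions.
from typing import Callable

LengthControl = Callable[[int], float]

def shaped_reward(correct: bool, num_tokens: int, lam: LengthControl) -> float:
    return (1.0 if correct else 0.0) * lam(num_tokens)

def truncation(budget: int) -> LengthControl:
    # Hard truncation: responses longer than the budget receive no reward.
    return lambda n: 1.0 if n <= budget else 0.0

def linear_penalty(budget: int) -> LengthControl:
    # Smooth length penalty: reward decays linearly with length.
    return lambda n: max(0.0, 1.0 - n / (2 * budget))

def step_control(target_len: int) -> LengthControl:
    # Laser-like step function: full reward within the target length,
    # a reduced but non-zero reward beyond it.
    return lambda n: 1.0 if n <= target_len else 0.5
```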
## Performance
### Efficacy-Efficiency Trade-off
Efficacy (accuracy) and efficiency (token efficiency) are two inherently conflicting goals; the goal of RL-based CoT compression should be to strike a better balance between the two and improve both.

Each point in the figures below represents an independent experiment, obtained from a different training run with a different parameter configuration. The benchmarks are MATH500, AIME2024, AMC2023, and OlympiadBench.
## :rocket: Resources
### Datasets
| Dataset Name | Description | Link |
|:------------:|:------------|:----:|
| **Laser-Deepscaler-Dataset** | Training dataset | [🤗 HuggingFace](https://huggingface.co/datasets/hkust-nlp/Laser-Deepscaler-Dataset) |

### Models
#### 1.5B Models (Based on [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B))
##### Laser Models
| Model Name | Adaptive Target Length (L) | Size | Link |
|:----------:|:--------------------------:|:----:|:----:|
| **Laser-L2048** | 2048 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-L2048-1.5B) |
| **Laser-L4096** | 4096 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-L4096-1.5B) |
| **Laser-L8192** | 8192 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-L8192-1.5B) |

##### Laser-D Models
| Model Name | Adaptive Target Length (L) | Size | Link |
|:----------:|:--------------------------:|:----:|:----:|
| **Laser-D-L1024** | 1024 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-D-L1024-1.5B) |
| **Laser-D-L2048** | 2048 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-D-L2048-1.5B) |
| **Laser-D-L4096** | 4096 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-D-L4096-1.5B) |

##### Laser-DE Models
| Model Name | Adaptive Target Length (L) | Size | Link |
|:----------:|:--------------------------:|:----:|:----:|
| **Laser-DE-L1024** | 1024 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-DE-L1024-1.5B) |
| **Laser-DE-L2048** | 2048 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-DE-L2048-1.5B) |
| **Laser-DE-L4096** | 4096 | 1.5B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-DE-L4096-1.5B) |

#### 7B Models (Based on [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B))
| Model Name | Adaptive Target Length (L) | Size | Link |
|:----------:|:--------------------------:|:----:|:----:|
| **Laser-D-L4096** | 4096 | 7B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-D-L4096-7B) |
| **Laser-DE-L4096** | 4096 | 7B | [🤗 HuggingFace](https://huggingface.co/hkust-nlp/Laser-DE-L4096-7B) |

> **Note**: A smaller value of $L$ leads to more rapid compression during training, resulting in more concise Chains of Thought (CoTs) at inference time.
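The released checkpoints are fine-tuned from DeepSeek-R1-Distill-Qwen models, so they should load with the standard `transformers` auto classes; the snippet below is a minimal inference sketch, and the sampling settings are illustrative rather than recommendations from the paper.

```python
# Minimal inference sketch for a released checkpoint (sampling settings
# are illustrative; requires `pip install transformers accelerate`).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hkust-nlp/Laser-DE-L4096-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=4096, do_sample=True, temperature=0.6)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```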
## How to Start :running:?
### Installation

```bash
conda create -n laser python=3.10
conda activate laser
git clone https://github.com/hkust-nlp/Laser.git
cd Laser
pip install -r requirement.txt
pip install flash-attn==2.6.3 --no-build-isolation
pip install -e . --no-dependencies
```

### Data Preparation
```bash
python scripts/pull_from_hub.py --repo_id hkust-nlp/Laser-Deepscaler-Dataset --local_path ./data/deepscaler --repo_type dataset --ignore_patterns "global_step*"
```

Alternatively, you can download the dataset from [🤗 HuggingFace](https://huggingface.co/datasets/hkust-nlp/Laser-Deepscaler-Dataset) and put it in the `data/deepscaler` folder.
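If you prefer a programmatic download instead of the helper script, the same dataset can be pulled with `huggingface_hub` (assuming it is installed; the target directory matches the one used above):

```python
# Pull the training dataset directly with huggingface_hub
# (assumes `pip install huggingface_hub`).
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hkust-nlp/Laser-Deepscaler-Dataset",
    repo_type="dataset",
    local_dir="./data/deepscaler",
)
```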
### Training
If you use Slurm to run the training with Ray, you can use the following command:
```bash
bash scripts/example/ray_start_slurm.sh $SCRIPT
# e.g. bash scripts/example/ray_start_slurm.sh scripts/training/laser-de-1.5b/laser-de-1.5B-l4096.sh
```

Otherwise, you can use the following command to run the training with Ray:
```bash
bash scripts/example/ray_start_sh.sh $SCRIPT
```

`$SCRIPT` is the training script you want to run, for example `scripts/training/laser-de-1.5b/laser-de-1.5B-l4096.sh`. The available training scripts are:
```bash
# Laser
scripts/training/laser-1.5b/laser-1.5b-l2048.sh
scripts/training/laser-1.5b/laser-1.5b-l4096.sh
scripts/training/laser-1.5b/laser-1.5b-l8192.sh

# Laser-D
scripts/training/laser-d-1.5b/laser-d-1.5b-l1024.sh
scripts/training/laser-d-1.5b/laser-d-1.5b-l2048.sh
scripts/training/laser-d-1.5b/laser-d-1.5b-l4096.sh

# Laser-DE
scripts/training/laser-de-1.5b/laser-de-1.5b-l1024.sh
scripts/training/laser-de-1.5b/laser-de-1.5b-l2048.sh
scripts/training/laser-de-1.5b/laser-de-1.5b-l4096.sh
```

### Evaluation
```bash
RUNNAME=""
INIT_MODEL_PATH="" # path to the init model, or any hf model path
TPSIZE=1
STEPS="" # if empty, init model will be evaluatedbash Qwen2.5-Math/evaluation/sh/nodes/run_eval.sh $RUNNAME $INIT_MODEL_PATH $TPSIZE $STEPS
```

## Citation
If you find the content of this project helpful, please cite our paper as follows:

```
@misc{liu2025learnreasonefficientlyadaptive,
title={Learn to Reason Efficiently with Adaptive Length-based Reward Shaping},
author={Wei Liu and Ruochen Zhou and Yiyun Deng and Yuzhen Huang and Junteng Liu and Yuntian Deng and Yizhe Zhang and Junxian He},
year={2025},
eprint={2505.15612},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.15612},
}
```

## Acknowledgements
- As a sister project of [SimpleRL](https://github.com/hkust-nlp/simpleRL-reason), we thank its authors for their great work.
- Our code is built on the great work of [verl](https://github.com/volcengine/verl).