
Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

Published 21 May 2025 in cs.CL, cs.AI, and cs.LG | (2505.15612v1)

Abstract: Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with less redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.

Summary

  • The paper introduces LASER-D, an adaptive reward-shaping approach that adjusts the target length of reasoning traces to task difficulty, improving efficiency.
  • It unifies efficient reasoning methods under a length-based reward-shaping framework, achieving a +6.1 point accuracy gain on AIME2024 while cutting token usage by 63%.
  • The dynamic adaptation mechanism offers practical implications for deploying large reasoning models in resource-constrained environments.


The paper "Learn to Reason Efficiently with Adaptive Length-based Reward Shaping" investigates methods to improve reasoning efficiency in large reasoning models (LRMs) via reinforcement learning (RL) techniques, particularly focusing on adaptive length-based reward shaping strategies.

Introduction

The motivation for this research stems from the observation that while LRMs can produce extended chains of thought to enhance problem-solving, they often become inefficient through redundant token generation. The paper introduces LASER-D (Dynamic and Difficulty-aware Length-bAsed StEp Reward shaping), which adapts the target length of reasoning traces to question difficulty during training. Unlike fixed-length penalty methods, this adaptive scheme balances performance against computational cost more effectively.

Methodology

The research proposes a length-based reward shaping strategy, with several key innovations:

  1. Unified Framework: The authors formulate a range of efficient reasoning methods through the lens of length-based reward shaping, combining a correctness term with a length-based term.
  2. Length-bAsed StEp Reward (LASER): This method refines the truncation baseline by replacing hard truncation with a step reward function controlled by a target length, encouraging responses that are both concise and correct.
  3. Dynamic Adaptation via LASER-D: Unlike static reward schedules, LASER-D adjusts target lengths automatically based on dynamic assessments of question difficulty, allocating more of the length budget to harder problems.

    Figure 1: Dynamics of adaptive target lengths during the training of LASER-D and LASER-DE. The figure shows how the adaptive target length L_A changes over training iterations for problems of different difficulty levels.
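Conceptually, the step-shaped reward and the difficulty-aware target budget described above can be sketched as follows. This is our own minimal illustration, not the authors' implementation; the reward values, pass-rate threshold, and length budgets are hypothetical placeholders.

```python
def laser_reward(is_correct: bool, length: int, target_length: int) -> float:
    """Step-shaped length reward (illustrative values, not the paper's).

    Correct answers within the target length receive full reward;
    correct-but-long answers receive a reduced reward; incorrect
    answers receive none. No hard truncation is applied.
    """
    if not is_correct:
        return 0.0
    return 1.0 if length <= target_length else 0.5


def difficulty_aware_target(pass_rate: float,
                            short_target: int = 2048,
                            long_target: int = 8192) -> int:
    """Pick a length budget from the question's estimated difficulty.

    Easy questions (high pass rate under the current policy) get a
    tight budget, so lengthy CoTs are penalized more; hard questions
    keep a generous budget. Thresholds and budgets are hypothetical.
    """
    return short_target if pass_rate >= 0.5 else long_target
```

Recomputing the target from rollout pass rates during training gives the reward the dynamic, difficulty-aware character the method calls for: as a question becomes easier for the model, its length budget tightens.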

Experimental Results

Experiments across model sizes and benchmarks show that LASER-D achieves a strong balance between performance and token usage:

  • Performance Metrics: On the challenging AIME2024 benchmark, LASER-D achieves a +6.1 point accuracy improvement while reducing token usage by 63%.
  • Comparative Analysis: Compared with alternatives such as truncation and group-based rewards, LASER-D achieves the most favorable accuracy-efficiency trade-offs across benchmarks including MATH500, AMC2023, and OlympiadBench.

Figure 2: Left: Accuracy and response length on AIME2024, showing the efficiency of LASER-D's target-length adaptation.

Theoretical and Practical Implications

The work positions itself within the broader context of optimizing RL for computational efficiency in LRMs, suggesting that dynamic, difficulty-aware learning paradigms can significantly reduce computing overhead while maintaining competitive performance levels. This approach has implications for deployment in resource-constrained environments where efficiency is paramount.

Conclusion

In summary, the paper makes significant strides toward efficient reasoning via RL with adaptive reward shaping. By adjusting the length budget to problem difficulty, LASER-D links resource allocation to task complexity and pushes the performance-efficiency frontier of state-of-the-art LRMs. Future work could extend the method to domains such as code generation and agentic tasks, validating its efficacy beyond mathematical reasoning.
