Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL

Published 16 May 2025 in cs.CL and cs.AI | (2505.10832v3)

Abstract: Large reasoning models (LRMs) are proficient at generating explicit, step-by-step reasoning sequences before producing final answers. However, such detailed reasoning can introduce substantial computational overhead and latency, particularly for simple problems. To address this over-thinking problem, we explore how to equip LRMs with adaptive thinking capabilities: enabling them to dynamically decide whether or not to engage in explicit reasoning based on problem complexity. Building on R1-style distilled models, we observe that inserting a simple ellipsis ("...") into the prompt can stochastically trigger either a thinking or no-thinking mode, revealing a latent controllability in the reasoning behavior. Leveraging this property, we propose AutoThink, a multi-stage reinforcement learning (RL) framework that progressively optimizes reasoning policies via stage-wise reward shaping. AutoThink learns to invoke explicit reasoning only when necessary, while defaulting to succinct responses for simpler tasks. Experiments on five mainstream mathematical benchmarks demonstrate that AutoThink achieves favorable accuracy-efficiency trade-offs compared to recent prompting and RL-based pruning methods. It can be seamlessly integrated into any R1-style model, including both distilled and further fine-tuned variants. Notably, AutoThink improves relative accuracy by 6.4 percent while reducing token usage by 52 percent on DeepSeek-R1-Distill-Qwen-1.5B, establishing a scalable and adaptive reasoning paradigm for LRMs. Project Page: https://github.com/ScienceOne-AI/AutoThink.

Summary

The paper introduces AutoThink, a multi-stage RL framework that enables LLMs to decide dynamically when to engage in detailed reasoning.
It combines an ellipsis prompt with stage-wise reward shaping, improving relative accuracy by 6.4% while reducing token usage by 52% on DeepSeek-R1-Distill-Qwen-1.5B across five mathematical benchmarks.
The method tailors reasoning depth to task complexity, reducing computational overhead and offering a scalable solution for resource-constrained environments.

The paper "Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL" focuses on enhancing the reasoning efficiency of LLMs by equipping them with the ability to dynamically decide whether to engage in detailed reasoning based on the complexity of the problem. This is achieved through a multi-stage reinforcement learning (RL) framework named AutoThink, which optimizes the reasoning policies via structured reward shaping.

Introduction to Adaptive Reasoning

Recent advancements in large reasoning models (LRMs) have been primarily focused on generating explicit, step-by-step reasoning sequences to improve accuracy in solving complex tasks. However, this approach can result in significant computational overhead due to unnecessary detailed reasoning on simpler problems. To mitigate this, the authors propose a mechanism for adaptive reasoning within LLMs, enabling them to balance accuracy and computational efficiency dynamically.

Figure 1: Overview of AutoThink Compared to Prior Reasoning Paradigms.

Method: Ellipsis Prompt and Multi-Stage RL Framework

The core of the proposed method involves a minimal prompting scheme using an ellipsis ("...") to trigger stochastic switching between thinking and no-thinking modes in R1-style models. This allows the models to decide autonomously whether explicit reasoning is necessary. The multi-stage RL framework refines this behavior over three stages:

Stage 1: Stabilizes the dual-mode behavior by balancing rewards for thinking and no-thinking responses to prevent mode collapse.
Stage 2: Reinforces reliable behavior within each mode, letting the policy evolve freely under rewards focused on accuracy.
Stage 3: Introduces length-aware pruning to encourage concise reasoning, ensuring tokens are used efficiently.
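
The three stages above can be sketched as three reward functions. This is a minimal illustration of the idea rather than the paper's actual reward definitions: the `Rollout` type, the function names, the 0.5 target ratio, the penalty and length weights, and the 8192-token cap are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rollout:
    thinking: bool   # True if the response contains explicit reasoning
    correct: bool    # True if the final answer was verified as correct
    num_tokens: int  # Response length in tokens


def stage1_rewards(batch: List[Rollout],
                   target_think_ratio: float = 0.5,  # hypothetical target
                   penalty_weight: float = 1.0) -> List[float]:
    """Stage 1 sketch: penalize whichever mode dominates the batch."""
    think_ratio = sum(r.thinking for r in batch) / len(batch)
    rewards = []
    for r in batch:
        base = 1.0 if r.correct else 0.0
        mode_share = think_ratio if r.thinking else 1.0 - think_ratio
        target = target_think_ratio if r.thinking else 1.0 - target_think_ratio
        # Only the over-represented mode is penalized, nudging the batch back
        # toward the target mix and away from mode collapse.
        excess = max(0.0, mode_share - target)
        rewards.append(base - penalty_weight * excess)
    return rewards


def stage2_reward(r: Rollout) -> float:
    """Stage 2 sketch: reward correctness only, regardless of mode."""
    return 1.0 if r.correct else 0.0


def stage3_reward(r: Rollout,
                  max_tokens: int = 8192,       # hypothetical length cap
                  length_weight: float = 0.5) -> float:
    """Stage 3 sketch: prefer short correct answers, tolerate long wrong ones."""
    frac = min(r.num_tokens, max_tokens) / max_tokens
    if r.correct:
        return 1.0 - length_weight * frac   # shorter correct answers score higher
    return -1.0 + length_weight * frac      # longer attempts are penalized less


batch = [Rollout(True, True, 100)] * 3 + [Rollout(False, True, 10)]
balanced = stage1_rewards(batch)  # thinking rollouts are damped, no-thinking is not
```

In this sketch a correct response always outscores an incorrect one in Stage 3, so the length term only reorders responses within each correctness group.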

Figure 2: Accuracy and token usage with standard and ellipsis prompts.
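
To make the two prompt variants concrete, the sketch below builds a standard prompt and an ellipsis prompt, then labels a completion as thinking or no-thinking. The chat template, the function names, and the rule "no text before the closing tag means no-thinking" are assumptions for illustration; real R1-style models define their own templates and special tokens.

```python
def build_prompt(question: str, use_ellipsis: bool = True) -> str:
    """Return a prompt that ends inside an opened <think> block."""
    # Hypothetical chat template, not the one shipped with any specific model.
    prompt = f"User: {question}\nAssistant: <think>\n"
    if use_ellipsis:
        # The ellipsis is the only difference from the standard prompt.
        prompt += "...\n"
    return prompt


def is_thinking(completion: str) -> bool:
    """Label a completion as 'thinking' if any text precedes </think>."""
    reasoning, closing_tag, _ = completion.partition("</think>")
    if not closing_tag:
        # No closing tag found: treat the output as unfinished reasoning.
        return True
    return bool(reasoning.strip())


standard = build_prompt("What is 2 + 2?", use_ellipsis=False)
ellipsis = build_prompt("What is 2 + 2?")
```

Sampling several completions per question from the ellipsis prompt and averaging `is_thinking` over them gives a rough estimate of how often the model chooses to reason.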

Results and Analysis

Experiments demonstrated that AutoThink achieves better accuracy-efficiency trade-offs compared to existing methods. For instance, on the DeepSeek-R1-Distill-Qwen-1.5B model, AutoThink improved relative accuracy by 6.4% while reducing token usage by 52%. These improvements were consistent across different mathematical benchmarks.
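
Both headline figures are relative changes, not percentage-point differences. The short sketch below shows the arithmetic; the baseline and post-training values are hypothetical numbers chosen only to reproduce the reported percentages and are not taken from the paper.

```python
def relative_change(before: float, after: float) -> float:
    """Return the change from `before` to `after` as a percentage of `before`."""
    return (after - before) / before * 100.0


# Hypothetical values for illustration only (not results from the paper).
accuracy_gain = relative_change(before=50.0, after=53.2)       # about +6.4
token_change = relative_change(before=10_000.0, after=4_800.0)  # about -52
```

Under these made-up numbers, a 6.4% relative gain corresponds to only 3.2 percentage points of accuracy, which is why the distinction matters when comparing methods.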

Figure 3: Distribution of Reasoning Behaviors Across Models and Reasoning Modes.

The results indicate that the ellipsis prompt exposes both the thinking and no-thinking modes, while the RL framework aligns reasoning depth with task difficulty.

Conclusion

The paper presents an approach to efficient reasoning in LLMs that couples a minimal ellipsis prompt with a multi-stage RL training paradigm. The reported results show substantial reductions in token usage without sacrificing accuracy, making the method attractive for deploying reasoning models in resource-constrained environments.

Overall, AutoThink offers a scalable and versatile mechanism for reasoning adaptation, capable of integrating seamlessly into existing R1-style models to enhance their utility in various practical applications. Future work could explore budget-aware CoT generation to further refine the control over reasoning depth and adaptively manage computational resources.

Explain it Like I'm 14

Plain-language summary of “Learning When to Think: Shaping Adaptive Reasoning in R1-Style Models via Multi-Stage RL”

What is this paper about?

This paper is about teaching AI language models to be smart about when to “think out loud.” Some problems need detailed step-by-step reasoning, but many don’t. Writing out lots of steps makes the AI slower and more expensive to run. The authors show how to help these models decide, on their own, whether to do detailed thinking or give a quick answer.

What questions are the researchers trying to answer?

Can we stop AI models from “overthinking” (writing extra steps that aren’t needed) without hurting accuracy?
Can a model learn to think deeply only on hard problems and answer briefly on easy ones?
Is there a simple way to control this behavior that works across different models?

How did they do it? (Methods explained simply)

First, a bit of background:

Many modern reasoning AIs (called “R1-style models”) are trained to show their work using a special format: they write reasoning inside a “<think> … </think>” section and then give the final answer.

- This improves accuracy on tough questions but can waste time on easy ones.

The authors discovered a neat trick:

- If you put just three dots “...” after “<think>” (like writing “<think> …”), the model sometimes decides to think a lot, and sometimes decides to think very little or not at all. It’s like giving the model a pause and letting it choose how much to explain.

But a tiny prompt change isn’t enough to make the model difficulty-aware (it didn’t reliably think more on hard problems and less on easy ones). So they trained the model with reinforcement learning (RL), which is like giving points for good behavior:

- Think of RL as a game: the model tries different behaviors and gets rewards or penalties based on how helpful they are.

They trained in three stages:

- Stage 1: Keep both “modes” alive. Prevent the model from always thinking or never thinking. If too many examples use one mode, the rewards nudge it back toward balance.
- Stage 2: Get good at both modes. Reward correct answers more, whether the model is thinking or not. This builds accuracy while keeping mode flexibility.
- Stage 3: Trim the fluff. Add rewards that prefer shorter responses when the model is already correct, and allow longer ones when it’s struggling. In short: be brief when right, be thorough when needed.

Analogy: Imagine a student who learns:

1. not to always show all steps or always skip them,
2. to get the answer right whichever approach they choose,
3. to write less on easy questions and show more work on hard ones.

What did they find, and why does it matter?

- The method, called AutoThink, helped the AI automatically choose when to think deeply.
- On five math benchmarks, it gave a better balance between accuracy and speed than other methods that either force short answers or cut steps uniformly.
- A standout result: on a popular 1.5B-parameter model (DeepSeek-R1-Distill-Qwen-1.5B), AutoThink increased relative accuracy by about 6.4% while cutting token usage (the amount it writes) by 52%. That means better answers with about half the typing, which is faster and cheaper.
- After training, the model actually thought more on harder problems and less on easier ones, which is exactly what we want.
- It works as a drop-in upgrade to many R1-style models (both smaller and larger ones).

Why is this important?

- Faster and cheaper: Less unnecessary text means lower cost and quicker responses.
- Smarter behavior: The model thinks like a good student, showing steps only when needed.
- Scalable: It can be added to different models without redesigning everything.
- Practical: Helpful for math, coding, and other tasks where sometimes a quick answer is enough.

What are the possible impacts and next steps?

- This approach could make reasoning AIs more efficient for real-world apps (tutors, assistants, search tools), saving time and money.
- It may reduce energy use by cutting extra computation, which is better for the environment.
- Future improvements could:
  - Enforce a strict “budget” (for example, “never write more than X steps”).
  - Stop rare cases where the model hides thinking after the tag.
  - Use smarter data selection to train even better.

Overall, AutoThink shows a simple, effective way to teach AI when to think hard and when to keep it short—leading to better results with less effort.
