
Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models

Published 4 Jun 2025 in cs.AI and cs.CL | (2506.04210v2)

Abstract: Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek R1) have led to a popular belief that extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: Does thinking more at test-time truly lead to better reasoning? To answer this question, we perform a detailed empirical study across models and benchmarks, which reveals a consistent pattern of initial performance improvements from additional thinking followed by a decline, due to "overthinking". To understand this non-monotonic trend, we consider a simple probabilistic model, which reveals that additional thinking increases output variance, creating an illusion of improved reasoning while ultimately undermining precision. Thus, observed gains from "more thinking" are not true indicators of improved reasoning, but artifacts stemming from the connection between model uncertainty and evaluation metric. This suggests that test-time scaling through extended thinking is not an effective way to utilize the inference thinking budget. Recognizing these limitations, we introduce an alternative test-time scaling approach, parallel thinking, inspired by Best-of-N sampling. Our method generates multiple independent reasoning paths within the same inference budget and selects the most consistent response via majority vote, achieving up to 20% higher accuracy compared to extended thinking. This provides a simple yet effective mechanism for test-time scaling of reasoning models.

Summary

  • The paper demonstrates that extended sequential thinking leads to overthinking and a non-monotonic accuracy degradation in reasoning tasks.
  • It employs rigorous experiments on GSM-8K, MATH-500, and AIME 2024 to show how increasing thinking tokens initially boost accuracy before causing performance collapse.
  • The study proposes parallel thinking using Best-of-N sampling, which outperforms sequential scaling by achieving up to 22% higher accuracy under fixed compute budgets.

Test-Time Scaling in Reasoning Models: The Limits of Extended Thinking and the Efficacy of Parallel Inference

Introduction

The paper "Does Thinking More always Help? Understanding Test-Time Scaling in Reasoning Models" (2506.04210) presents a comprehensive empirical and theoretical analysis of test-time scaling strategies in LLMs for reasoning tasks. The authors challenge the prevailing assumption that extending the inference-time "thinking" trace—by prompting models to "Wait" or "Think more"—monotonically improves reasoning performance. Through systematic experiments and a probabilistic framework, the study demonstrates that extended sequential thinking exhibits a non-monotonic effect: initial gains are followed by a degradation in accuracy, a phenomenon termed "overthinking." The paper further introduces "parallel thinking," a Best-of-N sampling approach, as a superior alternative for utilizing inference compute budgets.

Empirical Analysis of Test-Time Scaling

Experimental Setup

The study evaluates three open-source, RL-trained reasoning models (DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, DeepSeek-R1-Distill-Llama-8B) on GSM-8K, MATH-500, and AIME 2024 mathematical reasoning benchmarks. Test-time scaling is operationalized via two main strategies:

  • Sequential Extension: The model is prompted to continue its reasoning trace by suppressing the end-of-thinking delimiter and appending tokens such as "Wait" or "Think more," thereby increasing the number of "thinking" tokens before producing a final answer.
  • Budget Control: The number of thinking tokens is either unconstrained (up to the model's maximum context) or explicitly fixed to a set value.
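The sequential-extension strategy above can be sketched as a simple decoding loop. This is an illustrative sketch, not the paper's implementation: `sample_step` stands in for one bounded decoding call to a real model, and the `</think>` delimiter is assumed from DeepSeek-R1-style chat templates.

```python
# Hedged sketch of sequential extension: whenever the model emits the
# end-of-thinking delimiter, it is replaced with a continuation cue
# ("Wait") until the thinking-token budget is exhausted.

END_THINK = "</think>"  # delimiter assumed for R1-style reasoning models


def extend_thinking(sample_step, prompt, budget, cue=" Wait"):
    """Keep the model 'thinking' until roughly `budget` tokens are used."""
    trace, used = prompt, 0
    while used < budget:
        chunk = sample_step(trace)
        used += len(chunk.split())  # crude whitespace token count
        if END_THINK in chunk:
            # Suppress the delimiter and force further reasoning.
            chunk = chunk.replace(END_THINK, cue)
        trace += chunk
    return trace + END_THINK  # finally allow the model to answer
```

In practice the same effect is achieved at the logits level by masking the delimiter token, but the control flow is the same: the model is never allowed to stop thinking until the budget is spent.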

Key Observations

Across all models and datasets, the authors observe a consistent non-monotonic relationship between the number of thinking tokens and accuracy. For example, on GSM-8K, accuracy increases from 82.2% to 87.3% as the average thinking tokens rise from 385 to 1100, but then drops to 70.3% as the token count increases further to 15,980. This pattern is robust to different budget control schemes (unconstrained, exact, or minimum thinking tokens).

The degradation phase—overthinking—contradicts the widely held belief that more computation at inference always yields better reasoning. The absence of a reliable stopping criterion and the inefficiency of allocating all compute to a single, extended reasoning trace are highlighted as practical limitations.

Theoretical Framework: Variance-Driven Mirage

To explain the empirical findings, the authors introduce a probabilistic model in which the policy distribution $\pi(y \mid x)$ and the reward function $r(x, y)$ are both Gaussian. The expected reward is shown to depend on the variance of the policy distribution:

$$\mathbb{E}_{y \sim \pi(\cdot \mid x)} \left[ r(x,y) \right] = \frac{1}{\sqrt{2\pi\left(\sigma_r^2 + \sigma_\pi^2\right)}} \exp\left( -\frac{(\mu_r - \mu_\pi)^2}{2\left(\sigma_r^2 + \sigma_\pi^2\right)} \right)$$

Increasing the variance $\sigma_\pi^2$ initially increases the overlap between the policy and reward distributions, improving expected reward. However, beyond a critical point, further increases in variance dilute the probability mass, reducing expected reward. This trade-off—coverage versus dilution—mirrors the empirical non-monotonicity observed in LLMs.
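The coverage-versus-dilution trade-off can be checked numerically from the closed-form expression above. The means and variances below are illustrative values chosen to make the non-monotonicity visible, not numbers from the paper.

```python
import math


def expected_reward(mu_r, mu_pi, sigma_r, sigma_pi):
    """Closed-form E_{y~pi}[r(x,y)] for Gaussian policy and reward."""
    s = sigma_r**2 + sigma_pi**2
    return math.exp(-((mu_r - mu_pi) ** 2) / (2 * s)) / math.sqrt(2 * math.pi * s)


# Illustrative setting: reward peaked at 3, policy centered at 0.
vals = [expected_reward(3.0, 0.0, 1.0, sp) for sp in (0.5, 2.0, 3.0, 6.0, 10.0)]

# Expected reward first rises with policy variance (coverage) ...
assert vals[1] > vals[0]
# ... then falls once the probability mass is spread too thin (dilution).
assert vals[-1] < vals[2]
```

Differentiating with respect to $s = \sigma_r^2 + \sigma_\pi^2$ shows the maximum occurs at $s = (\mu_r - \mu_\pi)^2$, so the non-monotone regime appears exactly when the policy mean is offset from the reward mean by more than $\sigma_r$.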

Empirically, the entropy of the model's output distribution increases with the length of the thinking trace. Initial entropy growth correlates with accuracy gains, but excessive entropy leads to performance collapse. The authors demonstrate, for instance, a 12x increase in entropy (from 0.23 to 2.79) as thinking tokens increase from 385 to 6136, with accuracy peaking and then declining.
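The entropy diagnostic above is just Shannon entropy over the model's answer (or next-token) distribution; a minimal version, with toy distributions that are not taken from the paper, looks like this:

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete output distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy illustration: a sharply peaked answer distribution (short trace)
# versus a diffuse one (after much longer "thinking").
peaked = entropy([0.97, 0.01, 0.01, 0.01])
diffuse = entropy([0.25, 0.25, 0.25, 0.25])
assert peaked < diffuse
```

Tracking this quantity over the course of a long thinking trace is one way to operationalize the paper's observation that entropy growth precedes the accuracy collapse.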

Parallel Thinking: Best-of-N as an Effective Scaling Strategy

Recognizing the inefficiency of sequential overthinking, the paper proposes "parallel thinking," inspired by Best-of-N sampling. The approach is as follows:

  1. Budget Allocation: Given a total thinking token budget $B$, generate $N$ independent reasoning traces, each with at most $B/N$ tokens.
  2. Majority Voting: For each trace, sample a final answer. The most frequent answer across traces is selected as the output.
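The two steps above can be sketched in a few lines. This is a minimal illustration of the budget-split-plus-majority-vote scheme: `sample_trace` is a hypothetical stand-in for one bounded generation call that returns a final answer string.

```python
from collections import Counter


def parallel_thinking(sample_trace, prompt, budget, n):
    """Split a total thinking budget across N independent traces,
    then pick the most frequent final answer (majority vote)."""
    per_trace = budget // n
    answers = [sample_trace(prompt, max_tokens=per_trace) for _ in range(n)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```

For example, with four traces answering "42", "7", "42", "41", the vote selects "42". Ties are broken by first occurrence here; a production system might instead fall back to confidence-weighted selection.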

This method leverages the same total compute as sequential extension but avoids the entropy explosion associated with overthinking. Empirical results show that, under a 16K token budget, parallel thinking achieves up to 22% higher accuracy than sequential scaling and up to 47% higher than exact thinking token control.

Implementation Considerations

  • Sampling: Each reasoning trace is generated independently, allowing for parallelization across hardware resources.
  • Voting Mechanism: Majority voting is used for answer selection, but more sophisticated aggregation (e.g., self-consistency, confidence-weighted voting) could be explored.
  • Scalability: The approach is trivially parallelizable and can be adapted to distributed inference settings.
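Because the traces share no state, the dispatch itself is embarrassingly parallel. A minimal sketch using Python's standard thread pool, where `sample_trace` again stands in for one bounded request to an inference server:

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_traces(sample_trace, prompt, n, per_trace_tokens):
    """Dispatch N independent bounded generations concurrently and
    collect their final answers; results preserve submission order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [
            pool.submit(sample_trace, prompt, per_trace_tokens)
            for _ in range(n)
        ]
        return [f.result() for f in futures]
```

Threads suffice here because the work is I/O-bound (waiting on a server); for local multi-GPU decoding, batching the N traces in one forward pass is the more typical realization.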

Implications and Future Directions

Practical Implications

  • Inference Budgeting: Allocating compute to multiple independent traces is more effective than extending a single trace, especially under fixed resource constraints.
  • Model Uncertainty: Monitoring entropy or variance of the output distribution can serve as a diagnostic for overthinking and may inform adaptive inference strategies.
  • Deployment: Parallel thinking is compatible with existing LLM inference pipelines and can be implemented with minimal architectural changes.

Theoretical Implications

  • Illusion of Improvement: The observed gains from extended thinking are not due to genuine reasoning improvements but are artifacts of increased output variance.
  • Optimal Scaling: The results motivate further research into theoretically grounded, compute-optimal inference strategies, including adaptive trace length and dynamic budget allocation.

Open Questions

  • Scaling to Larger Models: The study is limited to mid-sized models; it remains to be seen whether larger models (e.g., 32B, 70B) exhibit similar overthinking dynamics.
  • Task Generality: While the analysis focuses on mathematical reasoning, the generality of these findings to other domains (e.g., code generation, commonsense reasoning) warrants investigation.
  • Theoretical Guarantees: Formalizing the relationship between entropy, variance, and reasoning performance in high-dimensional, non-Gaussian settings is an open theoretical challenge.

Conclusion

This work provides a rigorous empirical and theoretical examination of test-time scaling in reasoning LLMs, demonstrating that extended sequential thinking leads to overthinking and performance degradation due to increased output variance. The proposed parallel thinking strategy, based on Best-of-N sampling, offers a simple and effective alternative for inference-time scaling, yielding substantial accuracy improvements under fixed compute budgets. These findings have significant implications for the design and deployment of reasoning LLMs and motivate further research into principled, resource-efficient inference strategies.
