The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Published 7 Jun 2025 in cs.AI, cs.CL, and cs.LG | (2506.06941v2)

Abstract: Recent generations of LLMs have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established math and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from contamination and does not provide insights into the reasoning traces. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs think. Through extensive experiments, we show that LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having remaining token budget. By comparing LRMs with their standard LLM counterparts under the same inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models outperform LRMs, (2) medium-complexity tasks where LRMs demonstrate an advantage, and (3) high-complexity tasks where both models face complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across scales. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths and limitations and raising questions about their reasoning capabilities.


Summary

The paper shows that large reasoning models collapse in performance when task complexity exceeds a critical threshold.
The methodology involves systematic testing on puzzles such as Tower of Hanoi and Checker Jumping, with detailed analysis of intermediate reasoning traces.
Findings indicate that even with explicit algorithm guidance, models struggle with exact computation and robust symbolic manipulation.

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models

This paper (2506.06941) investigates the capabilities and limitations of Large Reasoning Models (LRMs) by examining their performance on systematically varied puzzle environments. The study moves beyond traditional benchmarks, which often suffer from data contamination, to analyze both the final answers and the intermediate reasoning traces of LRMs. Through controlled experiments, the authors reveal critical insights into how these models "think," uncovering performance collapses at higher complexities and limitations in exact computation and algorithmic reasoning.

Experimental Setup and Findings

The authors evaluate LRMs, including o3-mini, DeepSeek-R1, and Claude-3.7-Sonnet-Thinking, across four puzzle environments: Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World (Figure 1). These puzzles allow for precise control over compositional complexity while maintaining consistent logical structures. The models' performance is assessed by measuring accuracy and analyzing the reasoning traces. This reveals a counterintuitive scaling limit: reasoning effort, measured in thinking tokens, rises with problem complexity up to a point and then declines even though token budget remains.

Figure 1: Illustration of the four puzzle environments, showing the progression from initial through intermediate to target states for Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.
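To make "precise control over complexity" concrete, here is a minimal sketch of a Tower of Hanoi environment in Python. It is not the authors' simulator; the names `min_moves` and `is_valid_solution` and the peg-indexing convention are assumptions made for illustration. The only complexity knob is the number of disks, and the shortest solution for n disks has 2^n - 1 moves, so the required solution length grows exponentially while the rules stay the same.

```python
from typing import List, Tuple

# Hypothetical move encoding: (source peg, destination peg), pegs indexed 0..2.
Move = Tuple[int, int]


def min_moves(n_disks: int) -> int:
    """Length of the shortest Tower of Hanoi solution for n_disks disks."""
    return 2 ** n_disks - 1


def is_valid_solution(n_disks: int, moves: List[Move]) -> bool:
    """Replay moves from the start state.

    Returns True only if every move is legal and all disks end on peg 2.
    """
    # Each peg is a stack; larger numbers are larger disks, top is the list end.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    goal = list(range(n_disks, 0, -1))
    for src, dst in moves:
        if src not in (0, 1, 2) or dst not in (0, 1, 2) or src == dst:
            return False  # malformed move
        if not pegs[src]:
            return False  # nothing to move from an empty peg
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            return False  # larger disk placed on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == goal


if __name__ == "__main__":
    for n in (3, 10, 20):
        print(n, "disks ->", min_moves(n), "moves minimum")
    print(is_valid_solution(2, [(0, 1), (0, 2), (1, 2)]))
```

Because a checker like this replays every move, it can score a full move sequence rather than only a final answer, which is the kind of evaluation the puzzle setup is meant to enable.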

The study identifies three performance regimes based on problem complexity (Figure 2): (1) low-complexity tasks where standard LLMs outperform LRMs due to greater token efficiency, (2) medium-complexity tasks where the additional thinking in LRMs demonstrates an advantage, and (3) high-complexity tasks where both models experience a complete collapse in performance. A key finding is that LRMs often fail to generalize problem-solving capabilities, with accuracy collapsing to near-zero beyond a certain complexity threshold.

Figure 2: Comparative analysis of model thinking mechanisms, token efficiency, and inefficiencies in reasoning, highlighting the limitations of LRMs.

Analysis of Reasoning Traces

The authors conduct an in-depth analysis of intermediate reasoning traces, revealing complexity-dependent patterns. In simpler problems, LRMs tend to identify correct solutions early but then continue to explore incorrect alternatives, an "overthinking" phenomenon. At moderate complexity, correct solutions emerge only after extensive exploration of incorrect paths. Beyond a certain threshold, models fail to find correct solutions at all and fixate on early incorrect attempts, wasting the remaining inference token budget. Together these patterns point to limited self-correction and a clear scaling limit (Figure 3).

Figure 3: Accuracy and thinking tokens versus problem complexity for reasoning models, demonstrating the point at which reasoning collapses and effort decreases.

The paper further highlights LRMs' limitations in exact computation. Even when provided with explicit algorithms, the models fail to improve performance, and the observed collapse still occurs at roughly the same point (Figure 4). This is particularly evident in the Tower of Hanoi and Checker Jumping puzzles, suggesting that LRMs struggle with following logical procedures and may lack robust symbolic manipulation capabilities.

Figure 4: Performance comparison between default problem-solving and algorithm-guided execution across Tower of Hanoi and Checker Jumping puzzles, demonstrating the models' limitations in following logical procedures.
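For context on what "following an explicit algorithm" involves, the sketch below shows the textbook recursive procedure for Tower of Hanoi in Python. The exact algorithm text given to the models is not reproduced in this summary, so this should be read as an illustrative stand-in rather than the authors' prompt; the function name `solve_hanoi` and the peg numbering are assumptions.

```python
from typing import List, Tuple

# Hypothetical move encoding: (source peg, destination peg), pegs indexed 0..2.
Move = Tuple[int, int]


def solve_hanoi(n_disks: int, src: int = 0, dst: int = 2, aux: int = 1) -> List[Move]:
    """Return the shortest move sequence that shifts n_disks from src to dst."""
    if n_disks == 0:
        return []
    # 1. Park the n-1 smaller disks on the auxiliary peg.
    # 2. Move the largest disk straight to the destination.
    # 3. Bring the n-1 smaller disks from the auxiliary peg onto it.
    return (
        solve_hanoi(n_disks - 1, src, aux, dst)
        + [(src, dst)]
        + solve_hanoi(n_disks - 1, aux, dst, src)
    )


if __name__ == "__main__":
    print(solve_hanoi(2))
    print(len(solve_hanoi(10)))
```

The procedure involves no search: each step is fully determined by the recursion. That is why an unchanged collapse point under algorithm guidance suggests the difficulty lies in carrying out a long exact sequence of steps, not only in discovering a strategy.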

Implications and Open Questions

The findings of this study raise significant questions about the nature of reasoning in current LRM systems. The performance collapses, limitations in algorithmic execution, and inconsistent reasoning across scales highlight fundamental barriers to generalizable reasoning. The authors also observe non-monotonic failure behavior with respect to problem complexity: models can fail earlier in the solution sequence on a higher-complexity instance than on a lower-complexity one.

Figure 5: Accuracy versus compositional depth (number of moves required) for three LRMs across four puzzle environments, illustrating the relationship between problem complexity and model performance.
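One way to quantify "failing earlier in the solution sequence" is to record the position of the first illegal move in a model's proposed solution. The Python sketch below does this for Tower of Hanoi. The precise metric used in the paper is not stated in this summary, so `first_failure_index` and its conventions are assumptions for illustration only.

```python
from typing import List, Optional, Tuple

# Hypothetical move encoding: (source peg, destination peg), pegs indexed 0..2.
Move = Tuple[int, int]


def first_failure_index(n_disks: int, moves: List[Move]) -> Optional[int]:
    """Return the 0-based index of the first illegal move, or None if all are legal.

    Note: this flags rule violations only; a legal but unproductive move
    is not counted as a failure here.
    """
    # Each peg is a stack; larger numbers are larger disks, top is the list end.
    pegs = [list(range(n_disks, 0, -1)), [], []]
    for i, (src, dst) in enumerate(moves):
        legal = (
            src in (0, 1, 2)
            and dst in (0, 1, 2)
            and src != dst
            and bool(pegs[src])
            and (not pegs[dst] or pegs[dst][-1] > pegs[src][-1])
        )
        if not legal:
            return i
        pegs[dst].append(pegs[src].pop())
    return None


if __name__ == "__main__":
    # The second move tries to put disk 2 on top of disk 1.
    print(first_failure_index(2, [(0, 1), (0, 1), (1, 2)]))
```

Comparing this index across disk counts would show whether the first error moves later, stays put, or moves earlier as complexity grows; the non-monotonic behavior described above corresponds to the last case.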

These observations suggest that LRMs may not truly understand the underlying logic of the problems but rather rely on learned solution distributions from their training data. This work opens avenues for future research into the symbolic manipulation capabilities of LRMs [marcus2003algebraic, gsmsymbolic] and the development of more robust reasoning mechanisms.

Conclusion

This paper (2506.06941) provides a systematic examination of LRMs through the lens of problem complexity, revealing both strengths and limitations in current models. The study highlights the importance of moving beyond final accuracy measurements and analyzing intermediate reasoning traces to understand how these models approach complex tasks. The findings challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning. The identified scaling limits, limitations in algorithmic execution, and inconsistent reasoning patterns pave the way for future investigations into the reasoning capabilities of these systems.
