ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning

Published 30 Nov 2025 in cs.LG | (2512.00831v2)

Abstract: Large Reasoning Models (LRMs) are LLMs explicitly trained to generate long-form Chain-of-Thoughts (CoTs), achieving impressive success on challenging tasks like math and programming. However, their underlying reasoning "algorithms" remain poorly understood. To investigate this, we propose ReJump, which represents a reasoning trace as a visitation order over nodes in a tree of intermediate problem-solving steps. Transitions between nodes, which we term jumps, include adjacent moves that capture behaviors such as calculation, and non-adjacent moves that capture behaviors such as backtracking and verification. ReJump enables analyzing LLM reasoning with diverse metrics that quantify exploration, exploitation, overthinking, forgetting, and verification. Using our proposed LLM agent to extract reasoning traces into ReJump format, we evaluate state-of-the-art LRMs on two tasks and find that models with similar accuracy can exhibit distinct reasoning behaviors, while different tasks favor different reasoning styles (e.g., varying balance between exploration and exploitation). To further understand how learning strategies shape reasoning, we use ReJump to compare distilled LRMs with their teachers, CoT-prompted LLMs with LRMs, and to examine how the number of reasoning examples and reinforcement learning affect reasoning behavior. Finally, we show that ReJump can improve reasoning quality at test time through strategies such as ReJump-guided Best-of-N selection and prompt selection. Our code is publicly available at https://github.com/UW-Madison-Lee-Lab/ReJump.

Summary

The paper introduces ReJump, a representation capturing LLM reasoning as tree structures, enabling nuanced analysis of multi-step problem-solving.
It defines specific metrics like exploration-exploitation balance to quantify reasoning processes and compare model performance.
Experimental evaluations on tasks like MATH-500 and Game of 24 reveal that models with similar accuracy can exhibit distinct reasoning strategies.

Introduction

The paper introduces ReJump, a representation for analyzing reasoning in Large Reasoning Models (LRMs), which are LLMs explicitly trained to generate long-form Chain-of-Thought (CoT) reasoning. ReJump represents a reasoning trace as a visitation order over the nodes of a tree of intermediate problem-solving steps, making the structure of multi-step problem solving explicit for tasks such as math and programming.

Figure 1: ReJump representations of reasoning traces generated by Claude 3.7 Sonnet, Grok 3 Mini Beta, and DeepSeek-R1 on a Game of 24 problem.

Methodology

ReJump Representation: ReJump encodes a reasoning trace as a tree together with the order in which the model visits its nodes. Nodes represent intermediate problem-solving steps, and transitions between nodes (termed 'jumps') are either adjacent moves, which capture behaviors such as calculation, or non-adjacent moves, which capture behaviors such as backtracking and verification. Unlike a linear reading of a CoT, this structure exposes the alternative solution paths a model tries and how it moves between them.
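The distinction between adjacent and non-adjacent jumps can be illustrated with a minimal sketch. It is not the authors' implementation: the parent-map encoding, the toy tree, and the names `PARENT`, `is_adjacent`, and `classify_jumps` are assumptions made for illustration, and "adjacent" is taken here to mean a single parent-child edge.

```python
from typing import Dict, List, Optional

# Hypothetical toy tree (an assumption, not data from the paper):
# each node id maps to its parent id, and the root maps to None.
PARENT: Dict[int, Optional[int]] = {0: None, 1: 0, 2: 0, 3: 1, 4: 2}


def is_adjacent(parent: Dict[int, Optional[int]], a: int, b: int) -> bool:
    """Return True when a and b are joined by one parent-child edge."""
    return parent[a] == b or parent[b] == a


def classify_jumps(parent: Dict[int, Optional[int]], visits: List[int]) -> List[str]:
    """Label each transition in the visitation order as adjacent or non-adjacent."""
    return [
        "adjacent" if is_adjacent(parent, a, b) else "non-adjacent"
        for a, b in zip(visits, visits[1:])
    ]


# Visit the left branch (0 -> 1 -> 3), then jump across to the right branch (2 -> 4).
VISITS = [0, 1, 3, 2, 4]
print(classify_jumps(PARENT, VISITS))
# ['adjacent', 'adjacent', 'non-adjacent', 'adjacent']
```

In this toy trace the move from node 3 to node 2 is the only non-adjacent jump, which is the kind of transition the text associates with backtracking.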

Figure 2: Illustration of how $d_{\text{jump}}$ quantifies the exploration-exploitation trade-off in model reasoning.

Metrics in ReJump: The study introduces several quantitative metrics for evaluating reasoning, including exploration-exploitation balance ($d_{\text{jump}}$), success rate, verification rate, and overthinking rate; the abstract also lists forgetting among the quantified behaviors. These metrics evaluate not only the final answer but also the reasoning path each model takes to reach it.
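This summary does not give the formula for $d_{\text{jump}}$. As a rough sketch under a stated assumption, the code below takes it to be the mean tree distance (number of edges) between consecutively visited nodes, so that small values correspond to staying on one path and large values to hopping between branches. The parent-map encoding and the function names are hypothetical.

```python
from typing import Dict, List, Optional


def path_to_root(parent: Dict[int, Optional[int]], node: int) -> List[int]:
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    current: Optional[int] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def tree_distance(parent: Dict[int, Optional[int]], a: int, b: int) -> int:
    """Count the edges on the path between a and b via their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(parent, a))}
    for steps_from_b, n in enumerate(path_to_root(parent, b)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes do not share a root")


def mean_jump_distance(parent: Dict[int, Optional[int]], visits: List[int]) -> float:
    """ASSUMED definition: average tree distance over consecutive visits."""
    pairs = list(zip(visits, visits[1:]))
    if not pairs:
        return 0.0
    return sum(tree_distance(parent, a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy tree: node id -> parent id, root -> None.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 2}
print(mean_jump_distance(parent, [0, 1, 3]))        # 1.0, a single path
print(mean_jump_distance(parent, [0, 1, 3, 2, 4]))  # 1.5, one cross-branch jump
```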

Experimental Evaluation

The paper evaluates several state-of-the-art LRMs, including DeepSeek-R1, Grok 3 Mini Beta, and Claude 3.7 Sonnet, on two tasks: MATH-500 and the Game of 24. The analysis shows that models with similar accuracy can follow distinct reasoning processes, and that different tasks favor different reasoning styles, such as a different balance between exploration and exploitation. For instance, some models may rely more on verification than on exploratory steps, which affects their efficiency and accuracy differently across tasks.

Figure 3: Illustration of how reasoning traces are converted into the ReJump representation for a math word problem.

Comparison to Existing Models

Compared with evaluating models by final-answer accuracy alone, ReJump distinguishes reasoning patterns that accuracy hides. It gives a more detailed picture of reasoning behavior, highlighting differences in how models balance exploration against exploitation and in how they use verification.

Figure 4: Reasoning performance of various models on MATH-500 and Game of 24. Bar plots show final accuracy, while radar plots depict reasoning metrics.

Implications and Future Directions

By dissecting reasoning strategies, ReJump clarifies how LLMs behave and can guide model development, particularly for tasks that require strategic problem-solving such as advanced mathematics. The paper already reports improved reasoning quality at test time through ReJump-guided Best-of-N selection and prompt selection. The authors suggest that future work might integrate ReJump-derived insights into training, for example to refine reinforcement learning strategies or to guide prompt design for task-specific reasoning.
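The abstract mentions ReJump-guided Best-of-N selection but this summary does not describe how candidates are scored. The sketch below shows only the general shape of such a selector under an invented rule: each candidate carries a precomputed ReJump-style metric, and the one closest to a task-specific target value is kept. The `Candidate` type, the `jump` field, and the target value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Candidate:
    answer: str   # final answer extracted from one sampled reasoning trace
    jump: float   # hypothetical ReJump-style metric for that trace


def best_of_n(candidates: Sequence[Candidate], target: float) -> Candidate:
    """ASSUMED rule: keep the candidate whose metric is closest to `target`."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: abs(c.jump - target))


samples = [
    Candidate(answer="A", jump=1.0),
    Candidate(answer="B", jump=2.5),
    Candidate(answer="C", jump=1.6),
]
print(best_of_n(samples, target=1.5).answer)  # C
```

Choosing the target per task reflects the finding that different tasks favor different balances of exploration and exploitation; the actual selection criterion used in the paper may differ.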

Figure 5: Detailed reasoning performance plots for DeepSeek-R1, Grok 3 Mini Beta, and Claude 3.7 Sonnet, emphasizing the variation in underlying processes despite similar performance metrics.

Conclusion

ReJump offers a structured way to analyze and improve the reasoning processes of contemporary LLMs. By making reasoning paths explicit, it supports the evaluation of current models and may inform the development of models that handle complex reasoning tasks better. The framework is relevant both to understanding how LRMs reason and to applying them in domains that demand multi-step reasoning.
