Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM

Published 20 May 2025 in cs.CL | (2505.13890v1)

Abstract: Recent advances in test-time scaling have enabled LLMs to display sophisticated reasoning abilities via extended Chain-of-Thought (CoT) generation. Despite their potential, these Reasoning LLMs (RLMs) often demonstrate counterintuitive and unstable behaviors, such as performance degradation under few-shot prompting, that challenge our current understanding of RLMs. In this work, we introduce a unified graph-based analytical framework for better modeling the reasoning processes of RLMs. Our method first clusters long, verbose CoT outputs into semantically coherent reasoning steps, then constructs directed reasoning graphs to capture contextual and logical dependencies among these steps. Through comprehensive analysis across models and prompting regimes, we reveal that structural properties, such as exploration density, branching, and convergence ratios, strongly correlate with reasoning accuracy. Our findings demonstrate how prompting strategies substantially reshape the internal reasoning structure of RLMs, directly affecting task outcomes. The proposed framework not only enables quantitative evaluation of reasoning quality beyond conventional metrics but also provides practical insights for prompt engineering and the cognitive analysis of LLMs. Code and resources will be released to facilitate future research in this direction.

Abstract PDF Upgrade to Chat

Summary

The paper introduces a graph-based framework to map the chain-of-thought in RLMs, providing a new perspective on reasoning processes.
It quantifies reasoning behavior using metrics like exploration density, branching, and convergence ratios to correlate graph structure with performance.
The study reveals that zero-shot prompting leads to richer, more flexible graph structures compared to few-shot approaches.

Graph-Based Analysis of Reasoning in LLMs

Introduction

The paper "Mapping the Minds of LLMs: A Graph-Based Analysis of Reasoning LLM" (2505.13890) explores the reasoning capabilities of Reasoning LLMs (RLMs) by introducing a novel graph-based framework. This research is primarily motivated by the need to understand the counterintuitive and unstable behaviors exhibited by RLMs, particularly under various prompting regimes such as zero-shot and few-shot settings. The authors propose using graph structures to quantify and analyze the reasoning processes of these models, providing insights that extend beyond traditional accuracy metrics.

Framework Overview

The proposed framework constructs reasoning graphs from the chain-of-thought (CoT) outputs of RLMs. This approach involves clustering verbose CoT sequences into semantically coherent reasoning steps, which are then used as nodes in a directed graph that reveals contextual and logical dependencies among steps. The graph-based representation enhances human readability, systematic interpretability, and allows the calculation of quantifiable graph metrics for analysis (Figure 1).

Figure 1: A conceptual overview of our framework for modeling the long reasoning CoT with a graph structure.

Graph Construction from CoT Tokens

The methodology for constructing reasoning graphs involves three main steps: clustering raw tokens into reasoning steps, detecting semantic relationships between these steps to form edges, and finally constructing the reasoning graph structure. Initially, token sequences are divided into reasoning units using delimiters, followed by a logical clustering process that organizes these units into semantically meaningful steps. Semantic relationships between steps are then detected to construct a graph that represents the reasoning process (Figure 2).

Figure 2: Our pipeline for building the graph structure from reasoning LLMs' output.

Graph-Based Metrics and Cognitive Analysis

The graph-based analysis introduces several metrics to quantify reasoning processes: exploration density, branching ratio, convergence ratio, and linearity. These metrics provide a detailed understanding of the model's reasoning behavior, highlighting how structures differ across different prompting styles. The study reveals that zero-shot prompting leads to richer graph structures compared to few-shot settings, which often impose linear constraints and reduce the model's effective reasoning capacity.

Impact of Prompting Regimes

The paper's analysis of prompting regimes demonstrates that few-shot prompting degrades performance, a counterintuitive outcome given its success with conventional LLMs. The inclusion of examples in prompts, especially verbose ones, can shift the reasoning structure towards linearity, limiting the model's capacity for adaptive reasoning. This is evidenced by significant drops in accuracy with increasing few-shot examples, as demonstrated in Figures 5 and 6.

Figure 3: Few-shot prompting accuracy on GPQA-Diamond^

dataset using reasoning Qwen-7B.*

Figure 4: Average number of tokens under different numbers of shots with explanatary few-shot style.

Structural Insights and Performance Correlations

The paper presents an analysis correlating graph structures with reasoning performance across different models and prompting paradigms. It finds consistent patterns where complex reasoning graphs—with higher exploration and convergence—are associated with better performance. This underscores the importance of flexible reasoning structures for effective problem-solving, as showcased in Figure 5.

Figure 5: Distribution of reasoning step counts under zero-shot and one-shot prompting.

Implications and Future Research

The findings imply that graph-based approaches provide a deeper, structural-level understanding of RLM behavior, potentially guiding future prompt engineering strategies. By illuminating how different prompting methods influence reasoning structures, this work opens avenues for optimizing prompt design to enhance reasoning performance in RLMs. Future research could extend this framework to other RLM applications, exploring multi-modal and open-domain reasoning tasks.

Conclusion

The study offers an innovative perspective on RLM analysis through graph-based reasoning structures, presenting a framework that could significantly influence how prompt engineering and model behavior are understood in AI research. It marks a step forward in quantifying reasoning quality and enhancing interpretability beyond traditional performance metrics.