
Understanding Reasoning in Thinking Language Models via Steering Vectors

Published 22 Jun 2025 in cs.LG and cs.AI | (2506.18167v3)

Abstract: Recent advances in LLMs have led to the development of thinking LLMs that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model's activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model's reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.

Summary

  • The paper introduces a technique to extract steering vectors that enable control over specific reasoning behaviors in LLMs.
  • It employs the Difference of Means method to identify and manipulate reasoning patterns such as backtracking, uncertainty, and example testing.
  • Experimental results on DeepSeek-R1-Distill models confirm the causal impact of steering vectors in modulating model reasoning.

Steering Reasoning in Thinking LLMs

This paper (2506.18167) introduces a method for controlling the reasoning processes of thinking LLMs by identifying and manipulating specific reasoning behaviors using steering vectors. The approach focuses on DeepSeek-R1-Distill models and demonstrates the ability to modulate behaviors such as expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. By extracting and applying steering vectors, the authors provide a practical method for steering reasoning processes in a controlled and interpretable manner.

Methodology

The methodology involves several key steps, including identifying reasoning behaviors, extracting steering vectors, and evaluating their causal impact.

Identifying Reasoning Behaviors

The authors first identified a set of reasoning behaviors exhibited by thinking models, including:

  • Initialization: Rephrasing the task and articulating initial thoughts.
  • Deduction: Deriving conclusions based on current approach and assumptions.
  • Knowledge Augmentation: Incorporating external knowledge to refine reasoning.
  • Example Testing: Generating examples or scenarios to validate hypotheses.
  • Uncertainty Estimation: Explicitly stating confidence or uncertainty.
  • Backtracking: Abandoning the current approach and exploring alternatives.
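To make the categories concrete, a reasoning chain can be annotated sentence-by-sentence with behavior labels. The paper used an automated annotation process; the cue-phrase tagger below is only an illustrative stand-in, and the cue phrases are assumptions, not the authors' criteria.

```python
# Illustrative only: the paper used automated annotation; these cue
# phrases are hypothetical stand-ins, not the authors' actual criteria.
BEHAVIOR_CUES = {
    "uncertainty-estimation": ["i'm not sure", "it might be", "possibly"],
    "example-testing": ["for example", "let's test", "consider the case"],
    "backtracking": ["wait,", "let me try another approach", "that doesn't work"],
}

def tag_sentence(sentence: str) -> list[str]:
    """Return the behavior labels whose cue phrases appear in a sentence."""
    s = sentence.lower()
    return [label for label, cues in BEHAVIOR_CUES.items()
            if any(cue in s for cue in cues)]
```

Sentences tagged this way for a given behavior (versus sentences without it) would form the contrastive datasets used in the next step.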

These behaviors were identified through an examination of reasoning chains generated by DeepSeek-R1 and GPT-4o.

Figure 1: Comparison of behavioral patterns between DeepSeek-R1-Distill models and baseline models, showing differences in the frequency of reasoning behaviors.

Extracting Steering Vectors

Steering vectors were extracted using the Difference of Means method. Contrastive datasets were constructed to represent the presence or absence of a specific reasoning behavior. The Difference of Means vector was then computed as the difference in mean activations between these datasets:

\mathbf{u} = \frac{1}{|D_+|} \sum_{p_i \in D_+} \mathbf{a}(p_i) - \frac{1}{|D_-|} \sum_{p_j \in D_-} \mathbf{a}(p_j)

where $D_+$ and $D_-$ denote the datasets with and without the behavior, respectively, and $\mathbf{a}(p_i)$ denotes the model's activations on prompt $p_i$.
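As a minimal sketch of the Difference of Means computation, assuming the residual-stream activations have already been collected into arrays (e.g. via forward hooks), the vector is a single mean subtraction:

```python
import numpy as np

def difference_of_means(acts_pos: np.ndarray, acts_neg: np.ndarray) -> np.ndarray:
    """Difference-of-Means steering vector.

    acts_pos: (n_pos, d_model) activations at positions where the behavior
              is present (D_+); acts_neg: (n_neg, d_model) activations where
              it is absent (D_-). Returns u of shape (d_model,).
    """
    return acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
```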

Locating Causally Relevant Activations

To extract robust steering vectors, the authors identified the activations where these vectors are linearly represented within the model, focusing on the residual stream activations. This process involved two steps: identifying relevant token positions and determining causally relevant layers. Attribution patching was employed to evaluate which layers contribute causally to the behavior in question.

The patching effect is approximated as:

\Delta L \approx (\mathbf{u}_\ell^c)^T \cdot \frac{\partial}{\partial \mathbf{a}_\ell} L(\mathbf{x}_\text{clean} \mid \text{do}(\mathbf{a}_\ell = \mathbf{a}_\text{clean})),

where $\mathbf{u}_\ell^c = \mathbf{a}_\ell^{\text{patched}} - \mathbf{a}_\ell^{\text{clean}}$.

Figure 2: Causal impact of candidate steering vectors across model layers, measured by KL-divergence.
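This first-order estimate can be sketched with the model-dependent gradient abstracted into a callable; in practice the gradient of the loss with respect to the layer's activations comes from one backward pass on the clean run, and the function names here are assumptions for illustration:

```python
import numpy as np

def attribution_patching_effect(a_clean, a_patched, grad_loss):
    """First-order (attribution patching) estimate of the effect of
    patching one layer's activations:

        delta_L ~= (a_patched - a_clean)^T . grad_a L,

    with the gradient evaluated on the clean run. grad_loss(a) stands in
    for a backward pass through the model from this layer onward.
    """
    u = np.asarray(a_patched) - np.asarray(a_clean)
    return float(u @ grad_loss(np.asarray(a_clean)))
```

For example, with the toy loss $L(\mathbf{a}) = \tfrac{1}{2}\|\mathbf{a}\|^2$ (so the gradient is $\mathbf{a}$ itself), patching $[1, 2] \to [2, 2]$ gives an estimated effect of $1$.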

Evaluating Steering Vectors

The effectiveness of the extracted steering vectors was evaluated by applying them at the selected layers and observing their influence on the model's reasoning process. Steering was implemented at inference time by adding the extracted vectors to, or subtracting them from, the residual-stream activations. This intervention allowed behaviors such as backtracking, uncertainty estimation, and example testing to be modulated.
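The intervention itself is a one-line edit of the residual stream. A sketch follows; the unit-normalization and the broadcasting over all token positions are assumptions for illustration, since implementations differ in how they scale the vector and where they apply it:

```python
import numpy as np

def apply_steering(resid: np.ndarray, u: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled steering vector to residual-stream activations.

    resid: (seq_len, d_model) activations at the chosen layer;
    u:     (d_model,) steering vector;
    alpha: positive to amplify the behavior, negative to suppress it.
    """
    u_hat = u / np.linalg.norm(u)   # steer along the unit direction (assumed)
    return resid + alpha * u_hat    # broadcasts over all token positions
```

In a real model this would run inside a forward hook at the chosen layer, modifying activations on the fly during generation.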

Results

The results demonstrate that the extracted vectors effectively control the model's reasoning patterns. Positive steering increases behaviors such as backtracking and uncertainty estimation, while negative steering reduces them. These effects are consistent across different DeepSeek-R1-Distill models, reinforcing the hypothesis that thinking LLMs encode these reasoning mechanisms as linear directions in their activation space.

Figure 3: Steering on DeepSeek-R1's backtracking feature vector changes the model's behavior.

Figure 4: Effect of applying the steering vector for each reasoning behavior across different distill models, showing changes in the fraction of tokens exhibiting each behavior.

Related Work

The paper situates its approach within prior work on steering and interpreting LLMs by identifying meaningful directions or features in their internal representation spaces, including methods for extracting latent steering vectors, activation engineering, and contrastive activation addition. The authors also discuss related work on leveraging internal representations for reasoning and chain-of-thought.

Conclusion

The paper concludes by highlighting the effectiveness of the proposed steering approach for controlling reasoning behaviors in thinking LLMs. The ability to adjust specific aspects of the reasoning process through steering vectors opens new possibilities for adapting these models to different tasks and requirements. The authors note limitations of their work, including potential inaccuracies in the automated annotation process and the need to generalize findings to other models. Future research directions include developing more robust annotation methods and extending the research to a broader range of models.
