- The paper introduces a Generative Causal Mediation framework that causally localizes distributed behaviors in long-form language model outputs.
- Experimental results show that activation patching and attribution patching yield precise localization: intervening on as little as 5% of attention heads achieves over 80% steering success on some tasks.
- The study demonstrates that causal mediation enhances interpretability and control in language models while mitigating off-target effects compared to global steering methods.
Problem Context and Motivation
The paper addresses the challenge of activation steering in autoregressive LMs for concepts diffused across multiple tokens in long-form generated responses. Existing approaches to activation steering have largely focused on localization at token- or phrase-level granularity, which is inadequate for behaviors such as style transfer, refusal, or sycophancy that are inherently distributed throughout the output sequence. The motivation is to establish principled methods for identifying internal components—specifically attention heads—that causally mediate such distributed behaviors, enabling targeted and interpretable steering.
Dataset Construction
Generative Causal Mediation (GCM) is operationalized by constructing datasets of contrastive input-output pairs where the target concept is either present or absent. For example, style prompts ("Respond in verse" vs. "Respond in prose"), refusal prompts ("harmless" vs. "harmful" instructions), and sycophancy prompts ("I love this..." vs. "I hate this...") are designed to yield contrasting responses. Responses are deterministically generated to minimize sampling variance.
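The pairing scheme above can be sketched as follows. This is a minimal illustration; the function name and prompt templates are assumptions, not the paper's actual code.

```python
def make_contrastive_pairs(instructions, with_concept, without_concept):
    """Pair each base instruction with a concept-present and a
    concept-absent variant (hypothetical helper), e.g.
    "Respond in verse" vs. "Respond in prose"."""
    pairs = []
    for inst in instructions:
        pairs.append({
            "original": f"{without_concept} {inst}",  # concept absent
            "contrast": f"{with_concept} {inst}",     # concept present
        })
    return pairs

pairs = make_contrastive_pairs(
    ["Describe the ocean."],
    with_concept="Respond in verse.",
    without_concept="Respond in prose.",
)
```

With deterministic (greedy) decoding, each pair then yields one concept-present and one concept-absent response with minimal sampling noise.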
Indirect Effect Measurement
GCM quantifies the indirect effect (IE) of each attention head by patching the activation from the contrasting input onto the original input and measuring the log probability difference between generating the target response versus the original response. This provides a causal estimate of each head’s mediation of the target concept in generative outputs. Attention heads are then ranked by their IE scores, and a sparse subset (top k%) is selected for intervention.
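The scoring and selection steps can be sketched as below. The function names and signatures are illustrative assumptions; the log probabilities would come from the LM after patching one head's activation from the contrasting run.

```python
import numpy as np

def indirect_effect(logp_target_patched, logp_original_patched):
    """IE of one head: after patching its activation from the contrasting
    run, how much more likely is the target (concept-present) response
    than the original response? (Hypothetical signature.)"""
    return logp_target_patched - logp_original_patched

def select_top_heads(ie_scores, k_frac):
    """Rank attention heads by IE score and keep the top k fraction
    as the sparse intervention set."""
    n_keep = max(1, int(len(ie_scores) * k_frac))
    order = np.argsort(ie_scores)[::-1]  # descending by IE
    return order[:n_keep]

scores = np.array([0.1, 2.3, -0.5, 1.7])
top = select_top_heads(scores, k_frac=0.5)  # the two highest-IE heads
```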
Model Component Selection
Three GCM variants for localization are evaluated:
- Activation Patching: Directly swaps activations between contrasting runs.
- Attribution Patching: First-order Taylor approximation of activation patching, reducing computational overhead.
- Attention Head Knockouts: Zeroes out head activations (agnostic to contrasting input).
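The attribution-patching variant replaces a forward pass per head with a single backward pass: the effect of swapping in the contrasting activation is approximated to first order by the dot product of the activation difference with the metric's gradient. A minimal sketch, with hypothetical names and toy vectors standing in for real head activations:

```python
import numpy as np

def attribution_patch_estimate(a_orig, a_contrast, grad_at_orig):
    """First-order Taylor approximation of activation patching:
    IE ~= (a_contrast - a_orig) . dL/da, where grad_at_orig is the
    gradient of the patching metric at the original activation."""
    return np.dot(a_contrast - a_orig, grad_at_orig)

est = attribution_patch_estimate(
    a_orig=np.array([1.0, 0.0]),
    a_contrast=np.array([2.0, 1.0]),
    grad_at_orig=np.array([0.5, 1.0]),
)
```

Because the gradients for all heads are obtained in one backward pass, this scales to ranking every head at roughly the cost of one patched run.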
These variants are compared against probe-based and random baselines. Effectiveness is assessed by steering success rate (concept present post-intervention), fluency, and relevance, all evaluated by a strong LM judge calibrated to human annotations.
Steering Methods and Hyperparameter Search
GCM-localized attention heads are subjected to three distinct steering methods:
- Mean Steering: Replaces head activation with the mean contrasting activation.
- Difference-in-Means Steering: Adds the scaled difference between contrasting and original means.
- Representation Fine-Tuning (ReFT): Trains an adapter mapping original to contrastive representations with low-rank orthonormal matrices.
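The first two steering rules are simple activation edits and can be sketched directly; the names, and the use of toy vectors in place of real head activations, are assumptions for illustration.

```python
import numpy as np

def mean_steer(activation, mu_contrast):
    """Mean steering: replace the head activation outright with the
    mean contrasting activation."""
    return mu_contrast

def diff_in_means_steer(activation, mu_contrast, mu_orig, alpha):
    """Difference-in-means steering: shift the activation along the
    direction separating the contrasting and original means, scaled
    by alpha."""
    return activation + alpha * (mu_contrast - mu_orig)

steered = diff_in_means_steer(
    activation=np.array([0.0, 0.0]),
    mu_contrast=np.array([1.0, 1.0]),
    mu_orig=np.array([0.0, 1.0]),
    alpha=2.0,
)
```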
A comprehensive grid search over scaling factor α and fraction k of intervened heads is performed to optimize steering efficacy, resulting in a total of 16,200 experiments spanning three LMs (Qwen-14B-Chat, SOLAR-10.7B-Instruct, OLMo-13B-DPO), three tasks, and three steering strategies.
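The hyperparameter sweep amounts to an exhaustive search over (α, k) configurations; a minimal sketch, where `evaluate` is a stand-in for the full steering-plus-LM-judge pipeline and the candidate grids are invented for illustration:

```python
import itertools

alphas = [0.5, 1.0, 2.0]    # steering scale candidates (illustrative)
k_fracs = [0.01, 0.05, 0.10]  # fraction of heads to intervene on

def run_grid(evaluate):
    """Return the (alpha, k) pair maximizing the evaluation score
    over the full Cartesian product of candidates."""
    return max(itertools.product(alphas, k_fracs),
               key=lambda cfg: evaluate(*cfg))

# Toy objective standing in for the judged steering-success metric.
best = run_grid(lambda alpha, k: alpha * k)
```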
Experimental Results
Localization Efficacy
GCM consistently outperforms both random and probe-based baselines for steering binary concepts in long-form settings. Activation patching and attribution patching show statistically significant improvements (p < 10^-43) across models and tasks. Attention head knockouts do not yield significant gains over baselines.
GCM-enabled steering achieves high success rates with low intervention budgets; steering with only 5% of attention heads achieves >80% concept induction. On held-out datasets, transfer rates are task and model dependent; refusal induction transfers at 40–80%, verse style transfer at 20–80%, and sycophancy at 10–30%.
Mean and difference-in-means steering benefit most from GCM localization; supervised ReFT's effectiveness is less contingent on localization due to its supervision. Notably, sycophancy reduction is trivially steerable, even with random head selection, while style transfer requires precise localization because its mediating heads are sparse.
Global vs. Local Steering
Global (all-head) steering delivers comparable control for simple binary concepts but induces greater risks of off-target effects and degradation of fluency/relevance. For more granular or multiplex concepts, surgical localization via GCM is theoretically preferable.
Implications and Theoretical Significance
GCM establishes a framework for causal localization in distributed concept settings—bridging the gap between token-level activation editing and global post-training methods. It demonstrates that causal mediation analysis outperforms correlational approaches for localizing steerable components underlying nuanced, distributed behaviors. The results reinforce the hypothesis that many LM concepts are linearly represented and accessible via affine subspace edits but do not assume linearity during localization.
Practically, GCM contributes to transparency and interpretability in LM control, enabling reliable and targeted interventions without retraining. The approach is robust across models varying in architecture and scale, suggesting generalizability. However, surgical localization may be unnecessary for trivial or strongly represented concepts and essential for ambitious objectives (multiplex steering, fine-grained control).
Future Directions
- Multi-concept Steering: Extension to simultaneous control of multiple, interacting concepts, requiring deeper combinatorial causal analysis.
- Fine-grained Objectives: Granular behavioral control, e.g., controlling semantics or style at sub-sentence granularity.
- Robustness: Evaluation of off-target effects and mitigation strategies in global steering.
- Theory: Formalization of the relationships between linear representation hypotheses, causal mediation, and distributed abstraction.
Conclusion
Generative Causal Mediation provides a principled, causal framework for surgical steering of concepts diffused across long-form LM outputs. Across refusal, sycophancy, and style transfer, GCM-based localization yields superior steering performance compared to probe and random baselines. Lean approximations via attribution patching achieve near-optimal results. The findings highlight the importance of causal localization in practical LM control and invite further exploration into scaling and multiplexing steerability objectives (2602.16080).