
Steering MoE LLMs via Expert (De)Activation

Published 11 Sep 2025 in cs.CL and cs.LG | (2509.09660v1)

Abstract: Mixture-of-Experts (MoE) in LLMs routes each token through a subset of specialized Feed-Forward Networks (FFN), known as experts. We present SteerMoE, a framework for steering MoE models by detecting and controlling behavior-linked experts. Our detection method identifies experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors. By selectively (de)activating such experts during inference, we control behaviors like faithfulness and safety without retraining or modifying weights. Across 11 benchmarks and 6 LLMs, our steering raises safety by up to +20% and faithfulness by +27%. In adversarial attack mode, it drops safety by -41% alone, and -100% when combined with existing jailbreak methods, bypassing all safety guardrails and exposing a new dimension of alignment faking hidden within experts.

Summary

  • The paper demonstrates a method to adjust expert activations that steers LLM behavior, achieving up to a +27% improvement in faithfulness.
  • The approach uses paired-example routing-difference detection to compute risk scores, enabling activation/deactivation of experts without retraining.
  • The results reveal enhanced interpretability and safety, while also exposing alignment vulnerabilities that could be exploited adversarially.

Steering MoE LLMs via Expert (De)Activation

Introduction

The paper "Steering MoE LLMs via Expert (De)Activation" presents a method for controlling the behavior of Mixture-of-Experts (MoE) models in LLMs by adjusting the activation of specific experts during inference. This approach allows users to modify behaviors like faithfulness and safety without retraining the models or changing their weights. The research focuses on leveraging the routing pathways in MoE models to achieve behavioral steering through lightweight modifications.

Methodology

Expert Activation Patterns

MoE models, characterized by sparse routing of tokens through a subset of specialized feed-forward networks (FFNs), or experts, handle vast parameter counts without a proportional increase in computational cost. The proposed method detects experts with distinct activation patterns across paired inputs exhibiting contrasting behaviors, such as safe versus unsafe, and adjusts these patterns to influence the model's outputs.
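To make the sparse-routing setup concrete, here is a minimal sketch of top-k MoE routing for a single token. This is an illustrative simplification (a generic top-k router, not the routing code of any specific model in the paper):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Sketch of sparse top-k MoE routing for one token.

    x        : (d,) token hidden state
    router_w : (n_experts, d) router projection matrix
    experts  : list of callables, each mapping (d,) -> (d,)
    """
    logits = router_w @ x                 # one router score per expert
    topk = np.argsort(logits)[-k:]        # indices of the k highest-scoring experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()              # softmax over the selected experts only
    # Only the k selected experts run; all others are skipped entirely.
    return sum(w * experts[i](x) for w, i in zip(weights, topk))
```

Because only `k` of the experts execute per token, steering which experts land in `topk` changes the computation path without touching any weights.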

Paired-Example Routing-Difference Detection

The framework detects behavior-linked experts by analyzing the difference in expert activation frequencies between paired examples reflecting contrasting behaviors. It computes a risk difference score for each expert, quantifying its association with the behavior of interest. This score guides which experts to promote or suppress during inference.
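The detection step above can be sketched as follows. The exact scoring formula is an assumption here; this sketch takes the risk difference to be the gap in per-expert activation frequency between the behavior-positive and behavior-negative sides of the paired examples:

```python
import numpy as np

def risk_difference(acts_pos, acts_neg):
    """Per-expert risk-difference scores (hypothetical sketch of the
    paired-example detection; the paper's exact formula may differ).

    acts_pos / acts_neg : boolean arrays of shape (n_tokens, n_experts),
    True where an expert was activated, collected from the two
    contrasting sides of the paired inputs.
    """
    freq_pos = acts_pos.mean(axis=0)  # activation frequency per expert, positive side
    freq_neg = acts_neg.mean(axis=0)  # activation frequency per expert, negative side
    return freq_pos - freq_neg        # >0: linked to the behavior; <0: linked to its absence
```

Experts with large positive scores are candidates to promote, and those with large negative scores are candidates to suppress.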

Steering Mechanism

At inference, expert scores are adjusted based on their risk difference scores. Experts promoting the desired behavior are activated, while those inducing undesired behavior are deactivated through adjustments to the router logits. This mechanism allows the model behavior to be steered effectively while maintaining the original model weights intact.
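One simple way to realize this logit adjustment, offered here as an assumption rather than the paper's exact mechanism, is to add a large constant to the logits of the most behavior-promoting experts (forcing them into the top-k selection) and subtract it from the most behavior-suppressing ones (forcing them out):

```python
import numpy as np

def steer_router_logits(logits, scores, top_m=2, boost=1e9):
    """Hypothetical sketch of inference-time steering via router logits.

    logits : (n_experts,) raw router logits for one token
    scores : (n_experts,) risk-difference scores (positive = desired behavior)
    top_m  : number of experts to force on / force off
    """
    steered = logits.copy()
    order = np.argsort(scores)          # experts sorted by score, ascending
    steered[order[-top_m:]] += boost    # activate the behavior-promoting experts
    steered[order[:top_m]] -= boost     # deactivate the behavior-suppressing experts
    return steered
```

The model weights are untouched; only the routing decision for each token changes, which is what makes the intervention lightweight and fully reversible.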

Experimental Results

Faithfulness

Using retrieval-augmented generation (RAG) benchmarks built on datasets such as SQuAD, the research demonstrates that activating experts linked with document-grounded answers improves faithfulness to the retrieved context. This steering yielded up to a +27% improvement in faithfulness accuracy across several evaluation datasets (Figure 1).

Figure 1: Comparison of the off-the-shelf and steered models on faithfulness benchmarks.

Safety

For safety benchmarks including the TDC2023 Red Teaming Track, the methodology adjusts expert activation to control the model's compliance with unsafe prompts. Activating safety-linked experts raised safe response rates by up to +20%, while deactivating them degraded safety, exposing a pathway for adversarial manipulation (Figure 2).

Figure 2: Comparison of off-the-shelf and steered models on safety benchmarks.

Interpretability and Implications

The study highlights the interpretability of expert activations: behavior-linked experts often cluster within the middle layers of the model (Figure 3). This clustering suggests that experts encode more than just domain-specific abilities, highlighting both opportunities for modular behavioral alignment and risks should adversaries exploit these pathways.


Figure 3: Number of important experts (top 20%) in each layer of the models.

Conclusion

The proposed framework allows for the test-time control of MoE models by leveraging expert activations as a modular and interpretable signal for behavioral alignment. This method not only presents opportunities for improving model outputs without retraining but also exposes alignment vulnerabilities that must be addressed to prevent adversarial exploits. Exploring dynamic expert manipulation and expanding to other behavioral dimensions represent future directions to enhance model safety and reliability comprehensively.
