MS-Neurons: Monolingual Safety in LLMs
- MS-Neurons are specialized neurons in LLMs that enforce language-specific safety alignment by triggering robust refusal behavior on harmful prompts.
- They are identified through contrastive ranking of neuron activations on harmful versus benign prompts and isolated via targeted ablation methods.
- Manipulation of MS-Neurons, through techniques like SafeTuning, enables precise control of safety outputs, markedly reducing the attack success rate on unsafe requests.
Monolingual Safety Neurons (MS-Neurons) are a distinct and causally validated subpopulation in LLMs that mechanistically underlie language-specific safety alignment, particularly the model’s refusal to comply with harmful or jailbreak prompts. Identified through direct analysis of neuron-level activation patterns and representational impact, MS-Neurons mediate robust refusal behavior in response to unethical or unsafe requests, while remaining inert during benign interactions. The precise identification, manipulation, and targeted fine-tuning of these neurons have established both interpretability and effective control of safety in transformer-based models within a single linguistic domain (Zhang et al., 1 Feb 2026, Zhao et al., 1 Sep 2025).
1. Formal Definition and Mathematical Basis
MS-Neurons are formally defined in the context of a transformer LLM $f_\theta$, where each neuron $N$ in the attention projections (query, key, and value matrices $W_Q$, $W_K$, $W_V$) or the output projection $W_O$ is evaluated for its causal impact on the output representation. For a given input $x$, the representational shift induced by ablating $N$ is

$$\Delta_N(x) = \left\lVert f_\theta(x) - f_{\theta \setminus N}(x) \right\rVert^2.$$

Here, $f_{\theta \setminus N}$ denotes the forward pass with neuron $N$ deactivated (zeroed out). Aggregated over a dataset $D$, the importance score is

$$I_N(D) = \frac{1}{\lvert D \rvert} \sum_{x \in D} \left\lVert f_\theta(x) - f_{\theta \setminus N}(x) \right\rVert^2.$$
Given two datasets, $D_{\text{jail}}$ (approx. 800 jailbreak prompts paired with safe refusals) and $D_{\text{norm}}$ (1,000 benign prompts with standard responses), the monolingual safety neurons in layer $l$ are determined by

$$\text{MS}^{(l)} = S^{(l)}(D_{\text{jail}}) \setminus S^{(l)}(D_{\text{norm}}),$$

where $S^{(l)}(D)$ denotes the set of top-$p\%$-importance neurons in layer $l$ under dataset $D$ (the percentage $p$ is a small, fixed hyperparameter). The global MS-Neuron set is the union over all layers, $\text{MS} = \bigcup_{l} \text{MS}^{(l)}$. These neurons are thus the ones most causally engaged by harmful prompts and not by benign prompts (Zhang et al., 1 Feb 2026).
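The ablation-based importance score can be sketched on a toy stand-in model; the one-layer ReLU network, its dimensions, and the random data below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy single-layer network standing in for a transformer block.
W_in = rng.normal(size=(8, 16))   # input -> 16 hidden "neurons"
W_out = rng.normal(size=(16, 8))  # hidden -> output representation

def forward(x, ablate=None):
    """Forward pass; `ablate` is a set of hidden-neuron indices zeroed out."""
    h = np.maximum(x @ W_in, 0.0)      # ReLU activations
    if ablate:
        h = h.copy()
        h[list(ablate)] = 0.0          # deactivate neuron(s) N
    return h @ W_out

def importance(neuron, D):
    """I_N(D): mean squared representational shift when neuron N is zeroed."""
    diffs = [np.sum((forward(x) - forward(x, {neuron})) ** 2) for x in D]
    return float(np.mean(diffs))

D = [rng.normal(size=8) for _ in range(32)]
scores = {n: importance(n, D) for n in range(16)}
```

Ranking `scores` per layer and keeping the top $p\%$ yields the candidate sets that the contrastive step then compares.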
2. Identification and Isolation Methodology
MS-Neuron identification proceeds via layer-wise contrastive ranking:
- Activation Probing: Compute $I_N(D_{\text{jail}})$ and $I_N(D_{\text{norm}})$ for each neuron $N$ in each layer. Rank neurons by score and select the top $p\%$ in each case, yielding $S^{(l)}(D_{\text{jail}})$ and $S^{(l)}(D_{\text{norm}})$.
- Contrastive Subtraction: Define the MS-Neurons of layer $l$ as the neurons in the top $p\%$ for $D_{\text{jail}}$ but not for $D_{\text{norm}}$. Empirical results show substantial overlap between the top neurons for benign and harmful prompts, which motivates the subtraction: shared general-capability neurons are removed, leaving only those specific to unsafe inputs.
- Aggregation: The full MS-Neuron set is the union across all layers, which isolates neurons whose importance is specific to refusal behavior in the target language (Zhang et al., 1 Feb 2026).
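The three steps above can be sketched as follows; `top_p_percent` and the toy per-layer score vectors are illustrative assumptions:

```python
import numpy as np

def top_p_percent(scores, p=5.0):
    """Indices of the top-p% neurons by importance score."""
    k = max(1, int(len(scores) * p / 100))
    order = np.argsort(scores)[::-1]  # descending by importance
    return set(order[:k].tolist())

def ms_neurons_per_layer(I_jail, I_norm, p=5.0):
    """Contrastive subtraction: top-p% on jailbreak data minus top-p% on benign data."""
    return top_p_percent(I_jail, p) - top_p_percent(I_norm, p)

# Toy scores for a 10-neuron layer: neuron 0 is important only on harmful
# prompts, neuron 2 on both, neuron 4 only on benign prompts.
I_jail = np.array([9.0, 1.0, 8.0, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1])
I_norm = np.array([0.1, 1.0, 8.0, 0.2, 9.0, 0.1, 0.1, 0.1, 0.1, 0.1])
print(ms_neurons_per_layer(I_jail, I_norm, p=20.0))  # → {0}
```

Neuron 2 is dropped despite ranking highly on harmful prompts, because it is equally important on benign ones; only neuron 0 survives the subtraction.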
In the MLP-based framework (Zhao et al., 1 Sep 2025), a neuron $N$ (a row of the down-projection matrix in MLP layer $l$) is assigned a contribution score derived from its activation $a_N(x)$ on input $x$. The top $k\%$ of neurons by contribution on harmful prompts, after excluding those also important on benign data, constitute the safety-specific subset.
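A minimal sketch of the MLP contribution score, under the common assumption that a neuron's contribution scales with its activation magnitude times the norm of its down-projection row (the paper's exact score may differ; shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
W_up = rng.normal(size=(8, 16))    # MLP up-projection
W_down = rng.normal(size=(16, 8))  # MLP down-projection; row n is neuron n's write vector

def neuron_contributions(x):
    """Per-neuron contribution proxy: |a_N(x)| * ||w_N||, where w_N is the
    neuron's down-projection row. This is one common formulation, assumed here."""
    a = np.maximum(x @ W_up, 0.0)                      # a_N(x), ReLU activations
    return np.abs(a) * np.linalg.norm(W_down, axis=1)

x = rng.normal(size=8)
c = neuron_contributions(x)
# Top-k% by contribution (k chosen so that 2 of 16 neurons survive here).
safety_candidates = set(np.argsort(c)[::-1][:2].tolist())
```

As in the attention-based variant, candidates that also score highly on benign data would then be excluded.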
3. Causal Validation and Empirical Impact
Causality is validated via targeted ablation and measurement of Attack Success Rate (ASR):
- Ablation Protocol: Deactivate all MS-Neurons at inference (set their contributions to zero). As control, mask an equal number of randomly selected neurons (M-R).
- ASR Metric: ASR is the empirical fraction of jailbreak prompts where the model fails to refuse and produces unsafe content, as auto-judged by a strong external model (e.g., GPT-4o).
Empirical findings:
| Model | Default ASR | MS-Neuron Mask ASR | Random Mask ΔASR | MS Mask ΔASR |
|---|---|---|---|---|
| Llama3.1-8B-it | 30.22% | 66.98% | +1.32% | +36.76% |
| Qwen3-8B (AdvBench-x) | 15.60% | 41.46% | - | +25.85% |
| Llama3.1-8B-it | 31.47% | 67.48% | - | +36.01% |
Masking random neurons has negligible effect, while masking MS-Neurons causes a catastrophic loss in refusal, confirming their necessity for safety behavior (Zhang et al., 1 Feb 2026). Similar results under neuron gating and calibration are reported for other architectures (Zhao et al., 1 Sep 2025).
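The ASR metric can be sketched with a stand-in judge; the `judge` lambda below is a toy keyword proxy for the external judge model (e.g., GPT-4o), and the responses are illustrative:

```python
def attack_success_rate(responses, is_unsafe):
    """ASR: fraction of jailbreak responses judged unsafe (i.e., no refusal).
    `is_unsafe` stands in for an external auto-judge such as GPT-4o."""
    flags = [is_unsafe(r) for r in responses]
    return sum(flags) / len(flags)

# Toy judge: treat any response lacking a refusal phrase as unsafe.
judge = lambda r: "cannot" not in r.lower()
responses = ["I cannot help with that.", "Sure, here is how...", "I cannot assist."]
print(round(attack_success_rate(responses, judge), 2))  # → 0.33
```

Running the same prompts twice, once with MS-Neurons zeroed and once with random neurons zeroed, and comparing the two ASR values reproduces the ablation protocol above.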
4. Interpretation and Theoretical Role
MS-Neurons are disproportionately localized in attention projections, suggesting that safety alignment is implemented through dynamic information routing rather than static factual storage (Zhang et al., 1 Feb 2026). The isolation of MS-Neurons by contrastive importance ensures they encode features unique to malicious or unsafe instruction classes, triggering refusals without affecting the core linguistic capabilities.
In the vocabulary-projection framework, aggregating safety neuron activations and projecting into the model’s output space recovers clear semantic axes of "conformity" (e.g., willingness to answer) and "rejection" (e.g., refusal tokens like “cannot,” “impossible,” etc.) (Zhao et al., 1 Sep 2025). This aligns the mechanistic function of MS-Neurons with the model's overt safety behavior.
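A toy logit-lens style sketch of the vocabulary-projection step; the identity unembedding matrix and five-token vocabulary are illustrative assumptions, chosen so the refusal direction is exact:

```python
import numpy as np

vocab = ["cannot", "impossible", "sure", "here", "the"]
W_U = np.eye(len(vocab))  # toy unembedding: one residual direction per token

# Hypothetical aggregated write vectors of two safety neurons, each aligned
# with a refusal token's unembedding direction.
w_safety = W_U[:, vocab.index("cannot")] + W_U[:, vocab.index("impossible")]

def top_tokens(direction, k=2):
    """Project a residual-stream direction onto the vocabulary (logit-lens
    style) and return the k tokens it most strongly promotes."""
    logits = direction @ W_U
    return {vocab[i] for i in np.argsort(logits)[::-1][:k]}

refusal_axis = top_tokens(w_safety)
```

With real model weights, the same projection applied to aggregated MS-Neuron write vectors is what recovers the "rejection" token cluster described above.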
5. Control and Alignment Strategies
Direct control of MS-Neurons enables both attack and defense mechanisms:
- Inference-Time Manipulation: Scaling or clamping MS-Neuron activations calibrates the model’s willingness to refuse or comply with harmful prompts, enabling precise modulation of safety-relevant output probabilities.
- SafeTuning: Fine-tuning is performed exclusively on the parameters corresponding to safety neurons and their projections. The loss combines cross-entropy on explicit refusal samples with activation regularization:

$$\mathcal{L}_{\text{SafeTune}} = \mathbb{E}_{(x,\, y^{\text{ref}}) \sim D_{\text{safe}}}\left[\mathcal{L}_{\text{CE}}\big(f_\theta(x),\, y^{\text{ref}}\big)\right] + \lambda\, \mathcal{L}_{\text{act}},$$

where $\mathcal{L}_{\text{act}}$ is the activation-regularization term on the identified safety neurons and $\lambda$ a weighting coefficient.
Here, $D_{\text{safe}}$ is a set of monolingual harmful-prompt/refusal pairs. Optimizing only the associated projections (e.g., with AdamW at a low learning rate, all other weights frozen) enhances refusal consistency while minimally impacting benign capability (Zhao et al., 1 Sep 2025). SafeTuning has been shown to reduce ASR to near zero on strong jailbreaks, versus a substantially higher ASR without the defense, while preserving Win Rate on benign queries and outperforming prompt-based and decoding-based baselines.
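The parameter-selective update at the heart of SafeTuning can be sketched as follows; the shapes, neuron indices, and gradient below are illustrative, and the hand-rolled masked step stands in for an optimizer (e.g., AdamW) with non-safety parameters frozen:

```python
import numpy as np

rng = np.random.default_rng(3)
W_down = rng.normal(size=(16, 8))  # MLP down-projection; rows are neuron write vectors
ms_neurons = {2, 7, 11}            # previously identified MS-Neurons (hypothetical)

def safetune_step(W, grad, lr=1e-2):
    """One SafeTuning-style update: apply the gradient only to the rows
    belonging to safety neurons, leaving all other parameters frozen."""
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[list(ms_neurons)] = True
    W_new = W.copy()
    W_new[mask] -= lr * grad[mask]  # masked gradient-descent step
    return W_new

grad = rng.normal(size=W_down.shape)  # stand-in for the SafeTuning loss gradient
W_after = safetune_step(W_down, grad)
frozen = [n for n in range(16) if n not in ms_neurons]
```

Only rows 2, 7, and 11 change after the step; every other row is bit-identical, which is what keeps benign capability intact.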
6. Practical Implementation: Pseudocode and Adaptation
MS-Neurons are identified and controlled using contrastive ranking and masking, as summarized below:
```
# Identification: layer-wise contrastive ranking
MS_neurons = set()
for l in layers:
    for N in layer[l]:
        I_jail[N] = avg(||f_θ(x) - f_{θ\N}(x)||^2 for x in D_jail)
        I_norm[N] = avg(||f_θ(x) - f_{θ\N}(x)||^2 for x in D_norm)
    S_jail = top_p_percent_by(I_jail)
    S_norm = top_p_percent_by(I_norm)
    MS_layer = S_jail - S_norm
    MS_neurons.update(MS_layer)
return MS_neurons

# Causal ablation at inference
response = f_θ_masked(x, zero_out=MS_neurons)
```
Adapting to a monolingual setting requires restricting all collected harmful and benign corpora to the target language and adjusting the vocabulary projection step accordingly. No changes are needed to the model architecture or the overall algorithm (Zhao et al., 1 Sep 2025). This suggests that monolingual safety intervention can be independently and efficiently realized in any linguistic domain.
7. Relation to Broader Safety Alignment Research
MS-Neurons are the monolingual instantiation of broader safety-related neuron approaches, connecting to the concept of "shared safety neurons" (SS-Neurons) in multilingual settings, which mediate cross-lingual safety transfer (Zhang et al., 1 Feb 2026). The interpretability and targeted control of MS-Neurons enable the empirical verification of safety circuits and support neuron-oriented safety fine-tuning strategies.
A plausible implication is that further refinement of MS-Neuron discovery, and their interplay with cross-lingual subpopulations, will provide scalable mechanisms for robust and language-consistent safety alignment in next-generation LLMs. MS-Neurons thus constitute a foundational component for both mechanistic interpretability and practical safety intervention in monolingual language modeling.