
MS-Neurons: Monolingual Safety in LLMs

Updated 8 February 2026
  • MS-Neurons are specialized neurons in LLMs that enforce language-specific safety alignment by triggering robust refusal behavior on harmful prompts.
  • They are identified through contrastive ranking of neuron activations on harmful versus benign prompts and isolated via targeted ablation methods.
  • Manipulation of MS-Neurons, through techniques like SafeTuning, enables precise control of safety outputs, markedly reducing the attack success rate on unsafe requests.

Monolingual Safety Neurons (MS-Neurons) are a distinct, causally validated subpopulation of neurons in LLMs that mechanistically underlies language-specific safety alignment, particularly the model’s refusal to comply with harmful or jailbreak prompts. Identified through direct analysis of neuron-level activation patterns and representational impact, MS-Neurons mediate robust refusal behavior in response to unethical or unsafe requests, while remaining inert during benign interactions. The precise identification, manipulation, and targeted fine-tuning of these neurons have established both interpretability and effective control of safety in transformer-based models within a single linguistic domain (Zhang et al., 1 Feb 2026; Zhao et al., 1 Sep 2025).

1. Formal Definition and Mathematical Basis

MS-Neurons are formally defined in the context of a transformer LLM $f_\theta$, where each neuron $N$ in the attention projections ($W_Q^{(l)}, W_K^{(l)}, W_V^{(l)}$) or the output projection $W_O^{(l)}$ is evaluated for its causal impact on the output representation. For a given input $x$:

$$\Delta_{\mathsf{LLM}}(x, N) = \left\| f_\theta(x) - f_{\theta \setminus N}(x) \right\|_2^2$$

Here, $f_{\theta \setminus N}$ denotes the forward pass with neuron $N$ deactivated (zeroed out). Aggregated over a dataset $\mathcal{D}$, the importance score is

$$I(N, \mathcal{D}) = \mathbb{E}_{x \sim \mathcal{D}}\left[\Delta_{\mathsf{LLM}}(x, N)\right]$$

Given two datasets, $\mathcal{D}_{\mathrm{jail}}$ (approx. 800 jailbreak prompts with safe refusals) and $\mathcal{D}_{\mathrm{norm}}$ (1,000 benign prompts with standard responses), the monolingual safety neurons in layer $l$ are determined by

$$\mathrm{MS}^{(l)} = \mathcal{S}^{(l)}_p(\mathcal{D}_{\mathrm{jail}}) \setminus \mathcal{S}^{(l)}_p(\mathcal{D}_{\mathrm{norm}})$$

where $\mathcal{S}^{(l)}_p(\cdot)$ denotes the set of top-$p\%$ neurons per layer by importance (typically $p = 3\%$). The global MS-Neuron set is the union over all layers. These neurons are thus the ones most causally engaged by harmful prompts and not by benign prompts (Zhang et al., 1 Feb 2026).
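
The contrastive selection above can be sketched in a few lines. This is a minimal illustration with NumPy, using synthetic importance scores in place of the real $\Delta_{\mathsf{LLM}}$ measurements; `top_p_set` and the toy arrays are assumptions for demonstration only:

```python
import numpy as np

def top_p_set(importance, p=0.03):
    """Indices of the top p-fraction of neurons, ranked by importance score."""
    k = max(1, int(len(importance) * p))
    return set(np.argsort(importance)[-k:].tolist())

# Toy per-layer importance scores standing in for I(N, D_norm) and I(N, D_jail)
I_norm = np.linspace(0.0, 1.0, 100)   # benign-prompt importances
I_jail = I_norm.copy()
I_jail[[7, 42]] += 10.0               # two neurons matter far more on jailbreaks

# MS^(l) = S_p(D_jail) \ S_p(D_norm): contrastive subtraction per layer
ms_layer = top_p_set(I_jail) - top_p_set(I_norm)
print(sorted(ms_layer))               # → [7, 42]
```

Neurons that rank highly on both datasets (here, index 99) are removed by the set subtraction, leaving only the jailbreak-specific ones.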

2. Identification and Isolation Methodology

MS-Neuron identification proceeds via layer-wise contrastive ranking:

  • Activation Probing: Compute $I(N, \mathcal{D}_{\mathrm{jail}})$ and $I(N, \mathcal{D}_{\mathrm{norm}})$ for each neuron $N$ in each layer, then rank and select the top $p\%$ as $\mathcal{S}^{(l)}_p(\cdot)$ for each dataset.
  • Contrastive Subtraction: Define MS-Neurons per layer as the neurons in the top $p\%$ for $\mathcal{D}_{\mathrm{jail}}$ but not for $\mathcal{D}_{\mathrm{norm}}$. Empirically, the top neurons for benign and harmful prompts overlap substantially ($>90\%$), which is precisely why the subtraction step is needed: it strips away generically important neurons and leaves only the small harm-specific remainder.
  • Aggregation: The full MS-Neuron set is the union across all layers, isolating neurons whose importance is specific to refusal behavior in the target language (Zhang et al., 1 Feb 2026).

In the MLP-based framework (Zhao et al., 1 Sep 2025), a neuron (row $i$ of $W_{\ell 2}$ in MLP layer $\ell$) is assigned contribution $C_{\ell i} = a_{\ell i} \cdot \|N_{\ell i}\|$, with $a_{\ell i}$ being the neuron’s activation for input $x$ and $N_{\ell i}$ its row vector. The top-$k\%$ neurons by $C_{\ell i}$ on harmful prompts, after excluding those also important on benign data, constitute the safety-specific subset.
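
The MLP contribution score can be illustrated directly; a small NumPy sketch with hypothetical dimensions and random weights (the shapes, `k`, and the ranking cut are illustrative, not taken from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_mlp, d_model = 8, 4
W2 = rng.standard_normal((d_mlp, d_model))  # rows N_i write into the residual stream
a = rng.standard_normal(d_mlp)              # neuron activations a_i for one input x

# C_i = a_i * ||N_i||: how strongly neuron i contributes to this layer's output
C = a * np.linalg.norm(W2, axis=1)

k = 2  # stand-in for a top-k% cut on harmful prompts
top_on_harmful = set(np.argsort(C)[-k:].tolist())
print(top_on_harmful)
```

In practice this score is computed over the harmful corpus, and neurons that also score highly on benign data are excluded, mirroring the contrastive subtraction used in the attention-based framework.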

3. Causal Validation and Empirical Impact

Causality is validated via targeted ablation and measurement of Attack Success Rate (ASR):

  • Ablation Protocol: Deactivate all MS-Neurons at inference (set their contributions to zero). As a control, mask an equal number of randomly selected neurons.
  • ASR Metric: ASR is the empirical fraction of jailbreak prompts where the model fails to refuse and produces unsafe content, as auto-judged by a strong external model (e.g., GPT-4o).
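
The ASR metric reduces to a simple fraction over judge verdicts; a minimal sketch (the judge itself is external, e.g. GPT-4o, so verdicts are modeled here as toy booleans):

```python
def attack_success_rate(unsafe_flags):
    """ASR: percentage of jailbreak prompts whose response was judged unsafe,
    i.e. the model failed to refuse. Verdicts come from an external judge."""
    return 100.0 * sum(unsafe_flags) / len(unsafe_flags)

# Toy verdicts over 8 jailbreak prompts (True = unsafe completion)
default_run = [False, True, False, False, True, False, False, False]
ms_masked   = [True, True, False, True, True, False, True, False]

asr_default = attack_success_rate(default_run)  # 25.0
asr_masked = attack_success_rate(ms_masked)     # 62.5
print(f"+{asr_masked - asr_default:.2f}%")      # → +37.50%
```

The reported "ASR Increase" figures are exactly this delta between the masked and default runs.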

Empirical findings:

| Model | Default ASR | MS-Neuron Mask ASR | Random Mask ASR (Δ) | ASR Increase |
| --- | --- | --- | --- | --- |
| Llama3.1-8B-it | 30.22% | 66.98% | +1.32% | +36.76% |
| Qwen3-8B (AdvBench-x) | 15.60% | 41.46% | – | +25.85% |
| Llama3.1-8B-it | 31.47% | 67.48% | – | +36.01% |

Masking random neurons has negligible effect, while masking MS-Neurons causes a catastrophic loss in refusal, confirming their necessity for safety behavior (Zhang et al., 1 Feb 2026). Similar results under neuron gating and calibration are reported for other architectures (Zhao et al., 1 Sep 2025).

4. Interpretation and Theoretical Role

MS-Neurons are disproportionately localized in attention projections, suggesting that safety alignment is implemented through dynamic information routing rather than static factual storage (Zhang et al., 1 Feb 2026). The isolation of MS-Neurons by contrastive importance ensures they encode features unique to malicious or unsafe instruction classes, triggering refusals without affecting the core linguistic capabilities.

In the vocabulary-projection framework, aggregating safety neuron activations and projecting into the model’s output space recovers clear semantic axes of "conformity" (e.g., willingness to answer) and "rejection" (e.g., refusal tokens like “cannot,” “impossible,” etc.) (Zhao et al., 1 Sep 2025). This aligns the mechanistic function of MS-Neurons with the model's overt safety behavior.
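
The vocabulary-projection check amounts to aggregating the safety neurons' write-direction and multiplying by the unembedding matrix. A toy NumPy sketch, in which the tiny vocabulary, the unembedding matrix `W_U`, and the aggregated direction are all fabricated for illustration:

```python
import numpy as np

vocab = ["yes", "sure", "cannot", "impossible", "the"]
# Hypothetical unembedding matrix W_U: d_model x |vocab|
W_U = np.array([[0.1, 0.2, 1.5, 1.2, 0.0],
                [0.0, 0.1, 0.9, 1.1, 0.1],
                [0.2, 0.0, 0.3, 0.2, 0.0]])

# Aggregated write-direction of the safety neurons in the residual stream
safety_dir = np.array([1.0, 0.8, 0.1])

logits = safety_dir @ W_U
top = [vocab[i] for i in np.argsort(logits)[::-1][:2]]
print(top)   # → ['cannot', 'impossible']
```

When the projection is dominated by refusal tokens, the mechanistic role of the neurons matches the model's overt refusal behavior.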

5. Control and Alignment Strategies

Direct control of MS-Neurons enables both attack and defense mechanisms:

  • Inference-Time Manipulation: Scaling or clamping MS-Neuron activations calibrates the model’s willingness to refuse or comply with harmful prompts, enabling precise modulation of safety-relevant output probabilities.
  • SafeTuning: Fine-tuning is performed exclusively on the parameters corresponding to safety neurons and their projections. The loss function combines cross-entropy on explicit refusal samples with activation regularization:

$$L_{\mathrm{safety}}(\theta) = -\frac{1}{|S|} \sum_{i} \log P_\theta\left(y_{\mathrm{refuse}}^i \mid x_{\mathrm{harm}}^i\right) + \lambda \|\Delta a_S\|^2$$

Here, $S$ is a set of monolingual harmful/refusal pairs. Optimizing only the associated projections (e.g., with AdamW at a low learning rate, all other weights frozen) enhances refusal consistency while minimally impacting benign capability (Zhao et al., 1 Sep 2025). SafeTuning has been shown to reduce ASR to 0–13% on strong jailbreaks (vs. $>97\%$ without defense) while preserving Win Rate on benign queries, outperforming prompt-based and decoding-based baselines.
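
The SafeTuning objective itself is easy to evaluate numerically; a NumPy sketch of the loss on toy values ($\lambda$, the refusal probabilities, and the activation shifts below are illustrative, not from the papers):

```python
import numpy as np

def safetuning_loss(p_refuse, delta_a, lam=0.1):
    """L_safety = -(1/|S|) sum_i log P(y_refuse^i | x_harm^i) + lam * ||delta_a||^2"""
    ce = -np.mean(np.log(p_refuse))      # cross-entropy on explicit refusal samples
    reg = lam * np.sum(delta_a ** 2)     # activation-shift regularizer on safety neurons
    return ce + reg

p_refuse = np.array([0.90, 0.80, 0.95])  # model prob. of refusal on harmful prompts
delta_a = np.array([0.20, -0.10, 0.05])  # shift in safety-neuron activations
print(safetuning_loss(p_refuse, delta_a))
```

The regularizer discourages the fine-tune from drifting the safety neurons' activations far from their pre-trained values, which is what keeps benign capability intact.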

6. Practical Implementation: Pseudocode and Adaptation

MS-Neurons are identified and controlled using contrastive ranking and masking, as summarized below:

def find_ms_neurons(model, layers, D_jail, D_norm, p=0.03):
    ms_neurons = set()
    for l in layers:
        I_jail, I_norm = {}, {}
        for N in neurons_in_layer(model, l):
            # Importance: mean squared output change when N is zeroed out
            I_jail[N] = mean(delta(model, x, N) for x in D_jail)
            I_norm[N] = mean(delta(model, x, N) for x in D_norm)
        S_jail = top_p_percent(I_jail, p)
        S_norm = top_p_percent(I_norm, p)
        ms_neurons |= (S_jail - S_norm)   # contrastive subtraction per layer
    return ms_neurons

# Causal check: run inference with all MS-Neurons ablated
response = forward_masked(model, x, zero_out=ms_neurons)

Adapting to a monolingual setting requires restricting all collected harmful and benign corpora to the target language and adjusting the vocabulary projection step accordingly. No changes are needed to the model architecture or the overall algorithm (Zhao et al., 1 Sep 2025). This suggests that monolingual safety intervention can be independently and efficiently realized in any linguistic domain.

7. Relation to Broader Safety Alignment Research

MS-Neurons are the monolingual instantiation of broader safety-related neuron approaches, connecting to the concept of "shared safety neurons" (SS-Neurons) in multilingual settings, which mediate cross-lingual safety transfer (Zhang et al., 1 Feb 2026). The interpretability and targeted control of MS-Neurons enable the empirical verification of safety circuits and support neuron-oriented safety fine-tuning strategies.

A plausible implication is that further refinement of MS-Neuron discovery, and their interplay with cross-lingual subpopulations, will provide scalable mechanisms for robust and language-consistent safety alignment in next-generation LLMs. MS-Neurons thus constitute a foundational component for both mechanistic interpretability and practical safety intervention in monolingual language modeling.
