Monosemanticity Score (MS) in Model Interpretability
- The Monosemanticity Score (MS) is a metric that classifies a feature as monosemantic when it responds exclusively to a single, human-interpretable concept.
- Concrete instantiations such as the Feature Monosemanticity Score (FMS) and the Modality Dominance Score (MDS) capture local and global disentanglement, showing improvements in feature isolation across modalities.
- Empirical findings indicate that higher monosemanticity supports direct feature attribution, precise interventions, and safer, more controllable model behavior.
A Monosemanticity Score (MS) quantifies the degree to which model features (usually neurons, units, or latent dimensions) correspond to single, human-interpretable concepts rather than representing multiple, entangled attributes. Monosemanticity is fundamental in mechanistic interpretability, allowing robust attribution of model behavior to interpretable units, and directly impacts feature disentanglement, sparsity, and model controllability. Several lines of recent research have introduced rigorous metrics and proxies for monosemanticity in linguistic, visual, and multimodal models, tying increased monosemanticity to both improved interpretability and, in some contexts, model capacity or alignment performance.
1. Formal Definitions of Monosemanticity
The foundational definition treats a neuron (or latent feature) as monosemantic if it activates only in response to a single, interpretable concept or feature set. Given a partition of the input space $\mathcal{X} = C_1 \cup \cdots \cup C_K$ (with $C_i \cap C_j = \emptyset$ for $i \neq j$), a neuron with activation $a(x)$ is monosemantic for concept $C_k$ if

$$a(x) \neq 0 \implies x \in C_k.$$
This is the strongest form of disentanglement at the unit level. Imperfect, “polysemantic” units encode superpositions—mixtures of distinct concepts—which hinders interpretability and direct feature-level intervention (Yan et al., 2024).
2. Metric Formulations and Proxies
2.1 Feature Monosemanticity Score (FMS)
FMS, introduced for latent representations of LLMs, combines measures of feature-level capacity with local and global disentanglement (Härle et al., 24 Jun 2025):

Let $\mathcal{C}$ be a concept set of $|\mathcal{C}|$ target concepts. For each $c \in \mathcal{C}$:
- $\mathrm{Acc}_1(c)$ — best single-feature classification accuracy.
- $\mathrm{Acc}_d(c)$ — accuracy of a small decision tree of depth $d$ over all features (cumulative capacity).
- $\mathrm{Acc}_{\setminus k}(c)$ — accuracy after removing the $k$ most informative features.
Key subcomponents:
- Local disentanglement: $\mathrm{FMS}_{\mathrm{loc}}(c) = \mathrm{Acc}_1(c) / \mathrm{Acc}_d(c)$ — the share of cumulative capacity captured by a single feature.
- Global disentanglement: $\mathrm{FMS}_{\mathrm{glob}}(c) = 1 - \mathrm{Acc}_{\setminus k}(c) / \mathrm{Acc}_d(c)$ — the accuracy collapse after ablating the top-$k$ features.
- Aggregate FMS over all concepts: $\mathrm{FMS} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \tfrac{1}{2}\bigl(\mathrm{FMS}_{\mathrm{loc}}(c) + \mathrm{FMS}_{\mathrm{glob}}(c)\bigr)$.
All components are normalized to $[0, 1]$, with higher values indicating greater monosemanticity (Härle et al., 24 Jun 2025).
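The single-feature term FMS@1 can be sketched with a simple threshold ("stump") classifier per feature. The helper names and synthetic data below are illustrative, not the reference implementation from the paper:

```python
import numpy as np

def stump_acc(x, y):
    """Best accuracy of a one-feature threshold rule (a depth-1 stump),
    trying both polarities of the decision."""
    best = max(np.mean(y == 0), np.mean(y == 1))  # constant-rule baseline
    for t in np.unique(x):
        pred = (x > t).astype(int)
        best = max(best, np.mean(pred == y), np.mean(pred != y))
    return best

def fms_at_1(Z, y):
    """Z: (N, d) latent features, y: (N,) binary concept labels.
    Returns the best single-feature accuracy and the feature achieving it."""
    accs = [stump_acc(Z[:, j], y) for j in range(Z.shape[1])]
    j = int(np.argmax(accs))
    return accs[j], j

# Synthetic check: the concept is encoded almost entirely in feature 2.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
Z = rng.normal(size=(200, 5))
Z[:, 2] = y + 0.05 * rng.normal(size=200)  # near-monosemantic feature
acc, j = fms_at_1(Z, y)
print(j, acc)  # feature 2, accuracy near 1.0
```

The same machinery extends to the other components: fitting a depth-$d$ tree gives the cumulative capacity term, and rerunning after zeroing the top features gives the removal-based global term.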
2.2 Modality Dominance Score (MDS) as Monosemanticity Score
In multimodal architectures, monosemanticity is operationalized via the Modality Dominance Score (MDS), $\mathrm{MDS}_k \in [0, 1]$, for feature $k$ (Yan et al., 16 Feb 2025):

$$\mathrm{MDS}_k = \frac{\sum_{i=1}^{N} \bigl|a^{\mathrm{img}}_k(x_i)\bigr|}{\sum_{i=1}^{N} \bigl|a^{\mathrm{img}}_k(x_i)\bigr| + \sum_{i=1}^{N} \bigl|a^{\mathrm{txt}}_k(t_i)\bigr|},$$

where $a^{\mathrm{img}}_k$ and $a^{\mathrm{txt}}_k$ are the $k$-th feature activations for image and text modalities, over $N$ paired samples $(x_i, t_i)$. $\mathrm{MDS}_k$ near $1$ indicates image-specificity, near $0$ text-specificity, and near $0.5$ cross-modal entanglement.
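Over paired activations, the score reduces to a per-feature ratio of modality activation mass. A minimal sketch (array names illustrative):

```python
import numpy as np

def mds(img_acts, txt_acts, eps=1e-9):
    """img_acts, txt_acts: (N, d) activations of the same d features on N
    paired image/text inputs. Returns (d,) dominance scores in [0, 1]:
    ~1 image-specific, ~0 text-specific, ~0.5 cross-modal."""
    img_mass = np.abs(img_acts).mean(axis=0)
    txt_mass = np.abs(txt_acts).mean(axis=0)
    return img_mass / (img_mass + txt_mass + eps)

# Three features: image-only, text-only, and shared equally.
img = np.array([[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]])
txt = np.array([[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]])
scores = mds(img, txt)
print(scores)  # approximately [1.0, 0.0, 0.5]
```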
2.3 Proxies via Feature Correlation and Activation-Variance
Monosemanticity can be tracked via proxies such as:
- Superposition decomposition: decompose feature directions against a concept basis; the more a unit's weight vector concentrates on a single basis vector, the more monosemantic that unit is.
- Activation variance: high variance of a dimension's activation across samples implies sparse, concept-specific firing (mostly silent, occasionally strongly active).
- Feature decorrelation ($\bar{\rho}$): lower mean pairwise correlation between feature activations across samples implies greater monosemanticity (Yan et al., 2024).
3. Measurement and Experimental Protocols
Measurement depends on context:
- SAE/G-SAE Models / Latent Representations: Compute FMS using a labeled dataset and decision-tree classifiers. Remove features iteratively, record accuracy drops, and aggregate via the FMS formula (Härle et al., 24 Jun 2025).
- Multimodal CLIP-like Models: Compute MDS per feature over paired image-text activations; threshold feature groups using the empirical mean and standard deviation of (Yan et al., 16 Feb 2025).
- Correlation/Sparsity Proxies: For MLP activations $H \in \mathbb{R}^{N \times d}$ over $N$ samples, compute pairwise Pearson correlations $\rho_{ij} = \mathrm{corr}(H_{:,i}, H_{:,j})$ and summarize global monosemanticity as the mean absolute off-diagonal correlation $\bar{\rho} = \frac{2}{d(d-1)} \sum_{i<j} |\rho_{ij}|$, with lower $\bar{\rho}$ indicating greater monosemanticity (Yan et al., 2024).
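The correlation proxy in the last bullet amounts to averaging the absolute off-diagonal entries of the feature correlation matrix. A minimal numpy sketch (it assumes no constant activation columns, which would make Pearson correlation undefined):

```python
import numpy as np

def decorrelation_proxy(H):
    """H: (N, d) MLP activations over N samples. Returns the mean absolute
    off-diagonal Pearson correlation; lower values suggest more
    monosemantic (less entangled) features."""
    R = np.corrcoef(H, rowvar=False)              # (d, d) feature-feature matrix
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    return np.abs(R[off_diag]).mean()

# Duplicating a feature (maximal entanglement) raises the proxy:
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 4))
H_entangled = np.column_stack([H[:, 0], H[:, 0], H[:, 1], H[:, 2]])
print(decorrelation_proxy(H) < decorrelation_proxy(H_entangled))  # True
```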
Supervised or contrastive training can enhance monosemanticity, as shown by Guided SAEs (G-SAE) or feature decorrelation regularizers.
4. Empirical Findings and Comparative Analyses
FMS and MDS expose substantial differences in monosemanticity across models and training regimes:
- G-SAE vs. Vanilla SAE: G-SAE nearly doubles FMS@1 scores across tasks (0.52 vs. 0.27), indicating more precise isolation of concepts in single latent dimensions (Härle et al., 24 Jun 2025).
- CLIP Variants: Pure CLIP is skewed toward image-dominant neurons, while monosemanticity-enhancing objectives yield a more balanced distribution between modalities (Yan et al., 16 Feb 2025).
- Feature Decorrelation: Preference alignment (Direct Preference Optimization, DPO) alone increases monosemanticity proxies; augmenting with a feature decorrelation regularizer (DecPO) further enhances sparsity, diversity, and alignment performance (Yan et al., 2024).
| Model/Setting | Monosemanticity Metric | Key Empirical Result |
|---|---|---|
| Vanilla SAE | FMS@1 | 0.27 (avg), low single-feature purity |
| G-SAE | FMS@1 | 0.52 (avg), strong concept–dimension mapping |
| CLIP | MDS | Most features are image-dominant |
| CLIP+SAE / CLIP+NCL | MDS | More balanced mix of text-dominant, image-dominant, and cross-modal features |
| DPO/DecPO (Llama-2) | Corr/Sparsity proxies | DecPO yields +10–13 alignment points, lower feature correlation $\bar{\rho}$ |
The higher FMS/MDS in these settings supports more precise and controllable feature interventions.
5. Applications and Interpretability Impact
High monosemanticity, as indexed by MS or FMS, enables:
- Mechanistic interpretability: Direct attribution of model decisions to latent features or neurons.
- Fine-grained control: Behavioral steering via single-feature interventions without cross-concept leakage.
- Improved detection: Cleaner detection of privacy attributes, toxicity, or style.
- Reliable multimodal attribution: Disentanglement of image, text, and cross-modal concepts, supporting targeted adversarial robustness and controllable generation tasks (Härle et al., 24 Jun 2025, Yan et al., 16 Feb 2025).
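For intuition on why the single-feature interventions above depend on monosemanticity: with a linear, SAE-style decoder, editing one latent coordinate shifts the model state along exactly one decoder direction. A minimal sketch in which the decoder `W` and all shapes are hypothetical:

```python
import numpy as np

def steer(z, feature_idx, alpha):
    """Clamp a single latent feature to value alpha. If the feature is
    monosemantic, this edits exactly one concept; a polysemantic feature
    would leak the edit into every concept it entangles."""
    z = z.copy()
    z[feature_idx] = alpha
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))  # hypothetical linear decoder: 8 latents -> 16-dim state
z = np.zeros(8)
delta = steer(z, 3, 5.0) @ W - z @ W
# The change to the model state is exactly 5.0 * W[3], the decoder
# direction of the edited feature.
print(np.allclose(delta, 5.0 * W[3]))  # True
```

The guarantee that `delta` lies along a single decoder row is a property of the linear decoder; whether that row corresponds to a single human concept is exactly what MS/FMS measures.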
Monosemanticity is thus key for auditability in safety-critical applications and in scientific analysis of representation learning.
6. Limitations and Research Directions
Classic disentanglement metrics (e.g., $\beta$-VAE score, Mutual Information Gap) do not assess whether individual units encode a single concept, and vision-specific metrics fail to generalize to language or multimodal settings (Härle et al., 24 Jun 2025). MS/FMS provides a scalar quantification but may be confounded by hierarchical/subconcept structure (revealed by FMS@$k$ for $k > 1$), or by the lack of perfect ground-truth concepts (Yan et al., 16 Feb 2025, Yan et al., 2024).
A plausible implication is that further formalization is required to make MS reflect not only local capacity but also true semantic disentanglement, especially in deep or highly overparameterized architectures.
Ongoing research aims to:
- Validate MS/FMS/MDS against human-annotated concept datasets.
- Tie monosemanticity more tightly to generalization and safety properties.
- Develop scalable computation techniques for very large (8B-parameter and beyond) model architectures (Yan et al., 2024).
7. Practical Recommendations
- FMS/MDS should be used as an audit tool prior to deploying single-feature interventions or interpretability claims.
- Low FMS indicates concept leakage: single-vector steering or ablation is then likely to have unpredictable side effects.
- Practical improvement methods: Guided feature conditioning, decorrelation regularization, and contrastive training on labeled concepts have been empirically shown to increase monosemanticity and downstream performance.
- Thresholding MDS scores provides automatic grouping of features into modality-specific or cross-modal sets, guiding selective interventions in multimodal networks (Yan et al., 16 Feb 2025).
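The MDS thresholding in the last bullet can be sketched as a mean-plus/minus-std cut over the score distribution; the exact cutoff convention (one standard deviation) is an assumption here:

```python
import numpy as np

def group_by_mds(scores, n_std=1.0):
    """Split features into text-dominant / cross-modal / image-dominant
    groups using the empirical mean and std of the MDS distribution."""
    mu, sigma = scores.mean(), scores.std()
    groups = np.full(scores.shape, "cross-modal", dtype=object)
    groups[scores > mu + n_std * sigma] = "image-dominant"
    groups[scores < mu - n_std * sigma] = "text-dominant"
    return groups

scores = np.array([0.5, 0.5, 0.5, 0.95, 0.05])
g = group_by_mds(scores)
print(g)
# ['cross-modal' 'cross-modal' 'cross-modal' 'image-dominant' 'text-dominant']
```

The resulting groups directly determine which features are safe targets for modality-specific interventions.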
By delivering quantifiable, interpretable measures of feature purity, monosemanticity scores such as FMS and MDS constitute a core methodology in contemporary mechanistic interpretability and model alignment research (Härle et al., 24 Jun 2025, Yan et al., 2024, Yan et al., 16 Feb 2025).