Monosemanticity Score (MS) in Model Interpretability
- The Monosemanticity Score (MS) is a metric that classifies a feature as monosemantic when it responds exclusively to a single, human-interpretable concept.
- Concrete instantiations such as the Feature Monosemanticity Score (FMS) and the Modality Dominance Score (MDS) capture local and global disentanglement, showing improvements in feature isolation across modalities.
- Empirical findings indicate that higher monosemanticity supports direct feature attribution, precise interventions, and safer, more controllable model behavior.
A Monosemanticity Score (MS) quantifies the degree to which model features (usually neurons, units, or latent dimensions) correspond to single, human-interpretable concepts rather than representing multiple, entangled attributes. Monosemanticity is fundamental in mechanistic interpretability, allowing robust attribution of model behavior to interpretable units, and directly impacts feature disentanglement, sparsity, and model controllability. Several lines of recent research have introduced rigorous metrics and proxies for monosemanticity in linguistic, visual, and multimodal models, tying increased monosemanticity to both improved interpretability and, in some contexts, model capacity or alignment performance.
1. Formal Definitions of Monosemanticity
The foundational definition treats a neuron (or latent feature) as monosemantic if it activates only in response to a single, interpretable concept or feature set. Given a partition of the input space $\mathcal{X} = C_1 \cup \cdots \cup C_K$ (with $C_i \cap C_j = \emptyset$ for $i \neq j$), a neuron with activation $a(x)$ is monosemantic for concept $C_k$ if

$$a(x) \neq 0 \implies x \in C_k.$$
This is the strongest form of disentanglement at the unit level. Imperfect, “polysemantic” units encode superpositions—mixtures of distinct concepts—which hinders interpretability and direct feature-level intervention (Yan et al., 2024).
2. Metric Formulations and Proxies
2.1 Feature Monosemanticity Score (FMS)
FMS, introduced for latent representations of LLMs, combines measures of feature-level capacity with local and global disentanglement (Härle et al., 24 Jun 2025):

Let $\mathcal{C}$ be a concept set of $|\mathcal{C}|$ target concepts. For each $c \in \mathcal{C}$:
- $\mathrm{Acc}_1(c)$ — best single-feature classification accuracy.
- $\mathrm{Acc}_d(c)$ — accuracy of a small decision tree of depth $d$ over all features (cumulative capacity).
- $\mathrm{Acc}_{\setminus k}(c)$ — accuracy after removing the $k$ most informative features.
Key subcomponents:
- Local disentanglement: $\mathrm{FMS}_{\mathrm{loc}}(c) = \mathrm{Acc}_1(c) / \mathrm{Acc}_d(c)$ — the share of cumulative capacity captured by a single feature.
- Global disentanglement: $\mathrm{FMS}_{\mathrm{glob}}(c) = 1 - \mathrm{Acc}_{\setminus k}(c) / \mathrm{Acc}_d(c)$ — the accuracy collapse after ablating the top-$k$ features.
- Aggregate FMS over all concepts: $\mathrm{FMS} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \tfrac{1}{2}\bigl(\mathrm{FMS}_{\mathrm{loc}}(c) + \mathrm{FMS}_{\mathrm{glob}}(c)\bigr)$.
All components are normalized to $[0, 1]$, with higher values indicating greater monosemanticity (Härle et al., 24 Jun 2025).
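The single-feature term FMS@1 can be sketched with a simple threshold ("stump") classifier per feature. The helper names and synthetic data below are illustrative, not the reference implementation from the paper:

```python
import numpy as np

def stump_acc(x, y):
    """Best accuracy of a one-feature threshold rule (a depth-1 stump),
    trying both polarities of the decision."""
    best = max(np.mean(y == 0), np.mean(y == 1))  # constant-rule baseline
    for t in np.unique(x):
        pred = (x > t).astype(int)
        best = max(best, np.mean(pred == y), np.mean(pred != y))
    return best

def fms_at_1(Z, y):
    """Z: (N, d) latent features, y: (N,) binary concept labels.
    Returns the best single-feature accuracy and the feature achieving it."""
    accs = [stump_acc(Z[:, j], y) for j in range(Z.shape[1])]
    j = int(np.argmax(accs))
    return accs[j], j

# Synthetic check: the concept is encoded almost entirely in feature 2.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
Z = rng.normal(size=(200, 5))
Z[:, 2] = y + 0.05 * rng.normal(size=200)  # near-monosemantic feature
acc, j = fms_at_1(Z, y)
print(j, acc)  # feature 2, accuracy near 1.0
```

The same machinery extends to the other components: fitting a depth-$d$ tree gives the cumulative capacity term, and rerunning after zeroing the top features gives the removal-based global term.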
2.2 Modality Dominance Score (MDS) as Monosemanticity Score
In multimodal architectures, monosemanticity is operationalized via the Modality Dominance Score (MDS), $\mathrm{MDS}_k \in [0, 1]$, for feature $k$ (Yan et al., 16 Feb 2025):

$$\mathrm{MDS}_k = \frac{\sum_{i=1}^{N} \bigl|a^{\mathrm{img}}_k(x_i)\bigr|}{\sum_{i=1}^{N} \bigl|a^{\mathrm{img}}_k(x_i)\bigr| + \sum_{i=1}^{N} \bigl|a^{\mathrm{txt}}_k(t_i)\bigr|},$$

where $a^{\mathrm{img}}_k$ and $a^{\mathrm{txt}}_k$ are the $k$-th feature activations for image and text modalities, over $N$ paired samples $(x_i, t_i)$. $\mathrm{MDS}_k$ near $1$ indicates image-specificity, near $0$ text-specificity, and near $0.5$ cross-modal entanglement.
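Over paired activations, the score reduces to a per-feature ratio of modality activation mass. A minimal sketch (array names illustrative):

```python
import numpy as np

def mds(img_acts, txt_acts, eps=1e-9):
    """img_acts, txt_acts: (N, d) activations of the same d features on N
    paired image/text inputs. Returns (d,) dominance scores in [0, 1]:
    ~1 image-specific, ~0 text-specific, ~0.5 cross-modal."""
    img_mass = np.abs(img_acts).mean(axis=0)
    txt_mass = np.abs(txt_acts).mean(axis=0)
    return img_mass / (img_mass + txt_mass + eps)

# Three features: image-only, text-only, and shared equally.
img = np.array([[2.0, 0.0, 1.0], [2.0, 0.0, 1.0]])
txt = np.array([[0.0, 2.0, 1.0], [0.0, 2.0, 1.0]])
scores = mds(img, txt)
print(scores)  # approximately [1.0, 0.0, 0.5]
```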
2.3 Proxies via Feature Correlation and Activation-Variance
Monosemanticity can be tracked via proxies such as:
- Superposition decomposition: decompose feature directions against a concept basis; the more a unit's weight vector concentrates on a single basis vector, the more monosemantic that unit is.
- Activation variance: high variance of a dimension's activation across samples implies sparse, concept-specific firing (mostly silent, occasionally strongly active).
- Feature decorrelation ($\bar{\rho}$): lower mean pairwise correlation between feature activations across samples implies greater monosemanticity (Yan et al., 2024).
3. Measurement and Experimental Protocols
Measurement depends on context:
- SAE/G-SAE Models / Latent Representations: Compute FMS using a labeled dataset and decision-tree classifiers. Remove features iteratively, record accuracy drops, and aggregate via the FMS formula (Härle et al., 24 Jun 2025).
- Multimodal CLIP-like Models: Compute MDS per feature over paired image-text activations; threshold feature groups using the empirical mean and standard deviation of (Yan et al., 16 Feb 2025).
- Correlation/Sparsity Proxies: For MLP activations $H \in \mathbb{R}^{N \times d}$ over $N$ samples, compute pairwise Pearson correlations $\rho_{ij} = \mathrm{corr}(H_{:,i}, H_{:,j})$ and summarize global monosemanticity as the mean absolute off-diagonal correlation $\bar{\rho} = \frac{2}{d(d-1)} \sum_{i<j} |\rho_{ij}|$, with lower $\bar{\rho}$ indicating greater monosemanticity (Yan et al., 2024).
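The correlation proxy in the last bullet amounts to averaging the absolute off-diagonal entries of the feature correlation matrix. A minimal numpy sketch (it assumes no constant activation columns, which would make Pearson correlation undefined):

```python
import numpy as np

def decorrelation_proxy(H):
    """H: (N, d) MLP activations over N samples. Returns the mean absolute
    off-diagonal Pearson correlation; lower values suggest more
    monosemantic (less entangled) features."""
    R = np.corrcoef(H, rowvar=False)              # (d, d) feature-feature matrix
    off_diag = ~np.eye(R.shape[0], dtype=bool)
    return np.abs(R[off_diag]).mean()

# Duplicating a feature (maximal entanglement) raises the proxy:
rng = np.random.default_rng(0)
H = rng.normal(size=(500, 4))
H_entangled = np.column_stack([H[:, 0], H[:, 0], H[:, 1], H[:, 2]])
print(decorrelation_proxy(H) < decorrelation_proxy(H_entangled))  # True
```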
Supervised or contrastive training can enhance monosemanticity, as shown by Guided SAEs (G-SAE) or feature decorrelation regularizers.
4. Empirical Findings and Comparative Analyses
FMS and MDS expose substantial differences in monosemanticity across models and training regimes:
- G-SAE vs. Vanilla SAE: G-SAE nearly doubles FMS@1 scores across tasks (0.52 vs. 0.27), indicating more precise isolation of concepts in single latent dimensions (Härle et al., 24 Jun 2025).
- CLIP Variants: Pure CLIP is skewed toward image-dominant neurons, while monosemanticity-enhancing objectives yield a more balanced distribution between modalities (Yan et al., 16 Feb 2025).
- Feature Decorrelation: Preference alignment (Direct Preference Optimization, DPO) alone increases monosemanticity proxies; augmenting with a feature decorrelation regularizer (DecPO) further enhances sparsity, diversity, and alignment performance (Yan et al., 2024).
| Model/Setting | Monosemanticity Metric | Key Empirical Result |
|---|---|---|
| Vanilla SAE | FMS@1 | 0.27 (avg), low single-feature purity |
| G-SAE | FMS@1 | 0.52 (avg), strong concept–dimension mapping |
| CLIP | MDS | Most features are image-dominant |
| CLIP+SAE / CLIP+NCL | MDS | More balanced mix of text-dominant, image-dominant, and cross-modal features |
| DPO/DecPO (Llama-2) | Corr/Sparsity proxies | DecPO yields +10–13 alignment points, lower feature correlation $\bar{\rho}$ |
The higher FMS/MDS in these settings supports more precise and controllable feature interventions.
5. Applications and Interpretability Impact
High monosemanticity, as indexed by MS or FMS, enables:
- Mechanistic interpretability: Direct attribution of model decisions to latent features or neurons.
- Fine-grained control: Behavioral steering via single-feature interventions without cross-concept leakage.
- Improved detection: Cleaner detection of privacy attributes, toxicity, or style.
- Reliable multimodal attribution: Disentanglement of image, text, and cross-modal concepts, supporting targeted adversarial robustness and controllable generation tasks (Härle et al., 24 Jun 2025, Yan et al., 16 Feb 2025).
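For intuition on why the single-feature interventions above depend on monosemanticity: with a linear, SAE-style decoder, editing one latent coordinate shifts the model state along exactly one decoder direction. A minimal sketch in which the decoder `W` and all shapes are hypothetical:

```python
import numpy as np

def steer(z, feature_idx, alpha):
    """Clamp a single latent feature to value alpha. If the feature is
    monosemantic, this edits exactly one concept; a polysemantic feature
    would leak the edit into every concept it entangles."""
    z = z.copy()
    z[feature_idx] = alpha
    return z

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))  # hypothetical linear decoder: 8 latents -> 16-dim state
z = np.zeros(8)
delta = steer(z, 3, 5.0) @ W - z @ W
# The change to the model state is exactly 5.0 * W[3], the decoder
# direction of the edited feature.
print(np.allclose(delta, 5.0 * W[3]))  # True
```

The guarantee that `delta` lies along a single decoder row is a property of the linear decoder; whether that row corresponds to a single human concept is exactly what MS/FMS measures.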
Monosemanticity is thus key for auditability in safety-critical applications and in scientific analysis of representation learning.
6. Limitations and Research Directions
Classic disentanglement metrics (e.g., $\beta$-VAE score, Mutual Information Gap) do not assess whether individual units encode a single concept, and vision-specific metrics fail to generalize to language or multimodal settings (Härle et al., 24 Jun 2025). MS/FMS provides a scalar quantification but may be confounded by hierarchical/subconcept structure (revealed by FMS@$k$ for $k > 1$), or by the lack of perfect ground-truth concepts (Yan et al., 16 Feb 2025, Yan et al., 2024).
A plausible implication is that further formalization is required to make MS reflect not only local capacity but also true semantic disentanglement, especially in deep or highly overparameterized architectures.
Ongoing research aims to:
- Validate MS/FMS/MDS against human-annotated concept datasets.
- Tie monosemanticity more tightly to generalization and safety properties.
- Develop scalable computation techniques for very large (8B-parameter and beyond) model architectures (Yan et al., 2024).
7. Practical Recommendations
- FMS/MDS should be used as an audit tool prior to deploying single-feature interventions or interpretability claims.
- Low FMS indicates concept leakage: single-vector steering or ablation is then likely to have unpredictable side effects.
- Practical improvement methods: Guided feature conditioning, decorrelation regularization, and contrastive training on labeled concepts have been empirically shown to increase monosemanticity and downstream performance.
- Thresholding MDS scores provides automatic grouping of features into modality-specific or cross-modal sets, guiding selective interventions in multimodal networks (Yan et al., 16 Feb 2025).
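The MDS thresholding in the last bullet can be sketched as a mean-plus/minus-std cut over the score distribution; the exact cutoff convention (one standard deviation) is an assumption here:

```python
import numpy as np

def group_by_mds(scores, n_std=1.0):
    """Split features into text-dominant / cross-modal / image-dominant
    groups using the empirical mean and std of the MDS distribution."""
    mu, sigma = scores.mean(), scores.std()
    groups = np.full(scores.shape, "cross-modal", dtype=object)
    groups[scores > mu + n_std * sigma] = "image-dominant"
    groups[scores < mu - n_std * sigma] = "text-dominant"
    return groups

scores = np.array([0.5, 0.5, 0.5, 0.95, 0.05])
g = group_by_mds(scores)
print(g)
# ['cross-modal' 'cross-modal' 'cross-modal' 'image-dominant' 'text-dominant']
```

The resulting groups directly determine which features are safe targets for modality-specific interventions.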
By delivering quantifiable, interpretable measures of feature purity, monosemanticity scores such as FMS and MDS constitute a core methodology in contemporary mechanistic interpretability and model alignment research (Härle et al., 24 Jun 2025, Yan et al., 2024, Yan et al., 16 Feb 2025).