
Multi-Faceted Multimodal Monosemanticity

Published 16 Feb 2025 in cs.CV and cs.AI | (2502.14888v3)

Abstract: Humans experience the world through multiple modalities, such as vision, language, and speech, making it natural to explore the commonality and distinctions among them. In this work, we take a data-driven approach to address this question by analyzing interpretable, monosemantic features extracted from deep multimodal models. Specifically, we investigate CLIP, a prominent visual-language representation model trained on massive image-text pairs. Building on prior research in single-modal interpretability, we develop a set of multi-modal interpretability tools and measures designed to disentangle and analyze features learned from CLIP. Specifically, we introduce the Modality Dominance Score (MDS) to attribute each CLIP feature to a specific modality. We then map CLIP features into a more interpretable space, enabling us to categorize them into three distinct classes: vision features (single-modal), language features (single-modal), and visual-language features (cross-modal). Interestingly, this data-driven categorization closely aligns with human intuitive understandings of different modalities. We further show that this modality decomposition can benefit multiple downstream tasks, including reducing bias in gender detection, generating cross-modal adversarial examples, and enabling modal-specific feature control in text-to-image generation. These results indicate that large-scale multimodal models, when equipped with task-agnostic interpretability tools, can offer valuable insights into the relationships between different data modalities.

Summary

  • The paper proposes a modality dominance score (MDS) to quantify and isolate monosemantic features in vision-language models.
  • It introduces adaptations of Sparse Autoencoders and Non-negative Contrastive Learning to enhance feature sparsity and align modality-specific representations.
  • The study demonstrates applications in bias analysis, adversarial attack defense, and controlled text-to-image generation using modality-specific interventions.


Introduction to Multimodal Monosemanticity

The paper investigates the extraction and evaluation of monosemantic features within Vision-Language Models (VLMs) such as CLIP. By introducing a Modality Dominance Score (MDS), the authors quantify the contribution of each modality (vision and language) to the final-layer features, which are then categorized into three groups: vision-dominant (ImgD), language-dominant (TextD), and cross-modal (CrossD). This categorization aids in understanding the model's behavior and its alignment with cognitive patterns seen in humans (Figure 1).

Figure 1: Modality Dominance Score (MDS) distributions of three feature categories for different VLMs.

Decoding Multimodal Monosemanticity

Monosemantic Feature Extraction

The authors adapt Sparse Autoencoders (SAEs) and Non-negative Contrastive Learning (NCL) models to improve the monosemanticity of features in VLMs. SAE models enhance feature interpretability by enforcing sparsity, so that only a small set of features activates for any given input, while NCL constrains representations to be non-negative within a contrastive learning framework, encouraging individual dimensions to correspond to interpretable concepts across modalities.
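
As a rough illustration of this pipeline, the sketch below trains a sparse autoencoder on frozen CLIP embeddings; the layer sizes, sparsity penalty, and class names are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Sparse autoencoder over frozen CLIP embeddings (illustrative sketch).

    Hidden units serve as candidate monosemantic features; an L1 penalty on
    the ReLU activations encourages each unit to fire on a narrow set of
    inputs. Dimensions are assumptions, not the paper's settings.
    """
    def __init__(self, embed_dim: int = 512, n_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, n_features)
        self.decoder = nn.Linear(n_features, embed_dim)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # non-negative, sparse feature activations
        x_hat = self.decoder(z)           # reconstruction of the CLIP embedding
        return x_hat, z

def sae_loss(x, x_hat, z, l1_weight: float = 1e-3):
    recon = ((x - x_hat) ** 2).mean()     # reconstruction error
    sparsity = z.abs().mean()             # L1 sparsity penalty on activations
    return recon + l1_weight * sparsity
```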

Modality Specificity with MDS

MDS is introduced to assess which modality predominantly drives each of the model's features. Analyzing VLMs with MDS, the paper reveals that features are biased toward the image modality and that the additional self-supervision in DeCLIP enhances modality separation.
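
One plausible way to realize such a score is to compare how strongly each extracted feature activates on image inputs versus text inputs. The sketch below computes a per-feature score in [-1, 1] and splits features into ImgD, TextD, and CrossD using a fixed threshold; the exact formula and threshold here are assumptions for illustration, not the paper's definition.

```python
import torch

def modality_dominance_score(img_acts: torch.Tensor, txt_acts: torch.Tensor) -> torch.Tensor:
    """Per-feature dominance score in [-1, 1] (assumed formulation).

    img_acts: [n_images, n_features] feature activations on image inputs.
    txt_acts: [n_texts, n_features] feature activations on text inputs.
    +1 means a feature fires only on images, -1 only on text.
    """
    img_mean = img_acts.mean(dim=0)
    txt_mean = txt_acts.mean(dim=0)
    return (img_mean - txt_mean) / (img_mean + txt_mean + 1e-8)

def categorize(mds: torch.Tensor, thresh: float = 0.5):
    """Split features into ImgD / TextD / CrossD with a hypothetical threshold."""
    img_d = (mds > thresh).nonzero(as_tuple=True)[0]
    txt_d = (mds < -thresh).nonzero(as_tuple=True)[0]
    cross_d = (mds.abs() <= thresh).nonzero(as_tuple=True)[0]
    return img_d, txt_d, cross_d
```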

Understanding Multimodal Monosemanticity

Quantitative and Qualitative Analysis

Monosemanticity is assessed using embedding similarity (EmbSim) and WinRate metrics. The results indicate enhanced interpretability for models employing SAE and NCL paradigms, showcasing clear modality distinctions.

Figures 3 & 4: Demonstrations of how ImgD and TextD features capture dominant visual and semantic concepts, respectively.
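
The exact metric definitions are not reproduced in this summary; as a hedged sketch, EmbSim is taken here to be the average pairwise cosine similarity among the embeddings of a feature's top-activating samples (a more monosemantic feature should have mutually similar top examples), and WinRate the fraction of matched features for which one model's score beats another's. Both definitions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def embsim(feature_acts: torch.Tensor, sample_embeds: torch.Tensor, top_k: int = 16) -> float:
    """Mean pairwise cosine similarity among a feature's top-k activating samples.

    feature_acts: [n_samples] activations of one feature.
    sample_embeds: [n_samples, d] embeddings of the same samples.
    This definition is an assumption used for illustration.
    """
    top_idx = feature_acts.topk(top_k).indices
    e = F.normalize(sample_embeds[top_idx], dim=-1)
    sim = e @ e.T
    off_diag = sim[~torch.eye(top_k, dtype=torch.bool)]   # drop self-similarities
    return off_diag.mean().item()

def win_rate(scores_a: list[float], scores_b: list[float]) -> float:
    """Fraction of matched feature pairs where model A's score exceeds model B's."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    return wins / len(scores_a)
```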

Practical Implications

Gender Pattern Analysis

By selectively disabling ImgD and TextD features, the study shows their respective roles in gender identification tasks, uncovering modality-specific biases and stereotypes.
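
A minimal sketch of this kind of intervention, assuming features are zeroed in the interpretable space and the embedding is reconstructed before downstream classification; the helper names and decoder interface are hypothetical.

```python
import torch

@torch.no_grad()
def ablate_features(embeds: torch.Tensor, sae, feature_idx: torch.Tensor) -> torch.Tensor:
    """Disable a group of features (e.g., ImgD or TextD) and reconstruct embeddings.

    embeds: [n, d] CLIP embeddings; sae: a trained SparseAutoencoder as sketched above;
    feature_idx: indices of the features to zero out (hypothetical grouping).
    """
    _, z = sae(embeds)          # project into the interpretable feature space
    z[:, feature_idx] = 0.0     # switch off the chosen modality-dominant group
    return sae.decoder(z)       # embeddings with that modality's features removed

# Usage sketch: compare a gender classifier's predictions on original vs. ablated
# embeddings to see which modality's features drive the decision.
# preds_full  = gender_clf(embeds)
# preds_noimg = gender_clf(ablate_features(embeds, sae, img_d))
```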

Adversarial Attack Defense

A key application explored is defending multimodal systems against adversarial text injections. Aligning adversarial targets with modality-specific features enhances robustness, with TextD features proving especially effective because of their rich semantic content.
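
One plausible reading of this procedure, sketched below under stated assumptions: a PGD-style perturbation is optimized so that only a chosen group of features (e.g., the TextD indices) of the image embedding moves toward a target pattern. The attack formulation, interfaces, and hyperparameters are assumptions, not the paper's exact method.

```python
import torch

def modality_targeted_attack(image, clip_model, sae, target_feats, feature_idx,
                             steps: int = 40, eps: float = 8 / 255, lr: float = 1e-2):
    """PGD-style sketch: perturb an image so that only the selected feature group
    of its embedding matches a target pattern (assumed formulation).
    clip_model is assumed frozen; sae is the sparse autoencoder sketched above.
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        emb = clip_model.encode_image(image + delta)
        _, z = sae(emb)
        loss = ((z[:, feature_idx] - target_feats) ** 2).mean()
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()   # step toward the target features
            delta.clamp_(-eps, eps)           # keep the perturbation small
            delta.grad.zero_()
    return (image + delta).detach()
```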

Text-to-Image Generation Control

The study further applies feature control in text-to-image generation, illustrating how TextD and ImgD features influence high-level semantic coherence and low-level visual detail, respectively, in generated images (Figure 2).

Figure 2: New images generated with varying interventions in modality-specific features.
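
A hedged sketch of such an intervention, assuming a generator that decodes images from CLIP-space embeddings (e.g., an unCLIP-style decoder, which is an assumption here): ImgD and TextD feature groups are rescaled in the interpretable space before the embedding is decoded.

```python
import torch

@torch.no_grad()
def controlled_generation(prompt_embed, sae, decoder, img_d, txt_d,
                          img_scale: float = 1.0, txt_scale: float = 1.0):
    """Scale ImgD/TextD feature groups of a prompt embedding before decoding.

    prompt_embed: [1, d] CLIP text embedding of the prompt.
    decoder: a hypothetical generator mapping CLIP-space embeddings to images.
    """
    _, z = sae(prompt_embed)        # project into the interpretable feature space
    z[:, img_d] *= img_scale        # amplify or suppress vision-dominant features
    z[:, txt_d] *= txt_scale        # amplify or suppress language-dominant features
    steered_embed = sae.decoder(z)  # back to CLIP embedding space
    return decoder(steered_embed)   # generate from the steered embedding
```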

Conclusion

This research advances the understanding of feature monosemanticity in multimodal neural networks, proposing robust methods for extracting features that align with human cognitive interpretation. The work has significant implications for bias reduction, adversarial robustness, and controllable image generation through modality-specific feature interventions. Future work may extend these methodologies to diverse architectures to further bridge the gap between human cognition and AI interpretation.
