
Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Published 9 Sep 2021 in cs.CL and cs.CV | (2109.04448v1)

Abstract: Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities. We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information. This method involves ablating inputs from one modality, either entirely or selectively based on cross-modal grounding alignments, and evaluating the model prediction performance on the other modality. Model performance is measured by modality-specific tasks that mirror the model pretraining objectives (e.g. masked language modelling for text). Models that have learned to construct cross-modal representations using both modalities are expected to perform worse when inputs are missing from a modality. We find that recently proposed models have much greater relative difficulty predicting text when visual information is ablated, compared to predicting visual object categories when text is ablated, indicating that these models are not symmetrically cross-modal.

Citations (80)

Summary

  • The paper introduces a cross-modal input ablation method to diagnose how much pretrained vision-and-language BERTs rely on each modality when predicting the other.
  • The study finds that ablating visual inputs degrades text prediction more than ablating language inputs, indicating stronger vision-for-language dependency.
  • The research highlights the impact of noisy silver object annotations on cross-modal balance and calls for improved label quality in multimodal models.

Analysis of Cross-Modal Influence in Multimodal Transformers

The study presented in the paper, "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers," provides a diagnostic analysis of vision-and-language BERT models, focusing on how these models integrate cross-modal information during pretraining. The research is premised on evaluating whether these multimodal transformers effectively leverage visual context in language tasks and linguistic context in vision tasks.

Objective and Methodology

The authors propose a novel diagnostic method, cross-modal input ablation, to assess the extent of cross-modal integration in multimodal models. This involves selectively ablating inputs from one modality and assessing the model's ability to predict masked data from the other modality. Performance metrics are aligned with the model's pretraining objectives, such as masked language modeling (MLM) for text and masked region classification (MRC-KL) for visual data. The study hypothesizes that models effectively utilizing cross-modal inputs should demonstrate degraded performance when inputs from a modality are missing.
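The ablation procedure can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the scoring function `score_mlm` (returning, say, masked-token accuracy given text tokens and visual regions) and the region/alignment data structures are assumptions introduced here for clarity.

```python
def ablate_regions(regions, aligned_ids=None):
    """Remove visual regions: all of them (full ablation), or only
    those grounded to the masked phrase (selective ablation)."""
    if aligned_ids is None:
        return []  # full visual ablation
    return [r for r in regions if r["id"] not in aligned_ids]

def cross_modal_ablation_gap(score_mlm, tokens, regions, aligned_ids):
    """Drop in masked-language-modelling score when grounded regions
    are removed. A large drop means the model actually uses the
    visual context to predict the masked text."""
    score_full = score_mlm(tokens, regions)
    score_ablated = score_mlm(tokens, ablate_regions(regions, aligned_ids))
    return score_full - score_ablated
```

The same scheme runs in the other direction for the vision side: mask region features, ablate the text (entirely or only the grounded phrase), and score with the masked-region objective (MRC-KL) instead of MLM.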

Key Findings

The experimental results reveal a significant asymmetry in cross-modal representation within these models. The models exhibit greater difficulty with text prediction tasks when visual data is ablated compared to visual property prediction when language data is ablated. This suggests a stronger vision-for-language integration compared to language-for-vision integration. These findings challenge assumptions of balanced cross-modal interactions in existing multimodal transformers.
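One plausible way to quantify this asymmetry is to compare the relative performance drop in each direction; the exact metric and the numbers below are illustrative assumptions, not figures from the paper.

```python
def relative_drop(score_full, score_ablated):
    """Relative degradation when the other modality is ablated;
    larger values indicate stronger cross-modal dependence."""
    return (score_full - score_ablated) / score_full

# Made-up scores for illustration: text prediction suffers far more
# from visual ablation than object prediction does from text ablation.
text_dep_on_vision = relative_drop(0.75, 0.45)  # large relative drop
vision_dep_on_text = relative_drop(0.60, 0.57)  # small relative drop
```

An asymmetric model in the paper's sense is one where these two quantities differ substantially, as in the example values above.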

The paper also explores potential reasons behind this asymmetry, probing variations in pretraining architecture, loss function, initialization strategy, and co-masking. None of these interventions significantly increased the models' use of linguistic context for visual prediction.

A critical insight was the identification of noise in the silver object annotations produced by an object detector, which likely discouraged models from integrating linguistic context in visual tasks. Evaluating on a subset of data with correctly matched labels did not change the outcome, suggesting that the noisy labels shape cross-modal dependencies during training rather than merely distorting evaluation.

Implications and Future Directions

This paper's findings underscore the need to re-evaluate the design and training paradigms of multimodal BERTs, particularly when symmetry in cross-modal interactions is essential. The diagnostic tool introduced can serve as a useful strategy for model developers to test and ensure balanced cross-modal influences in future architectures.

Practically, ensuring high-quality training labels—especially for the visual tasks—might enhance the model's ability to integrate language context effectively. From a theoretical standpoint, this study prompts a reconsideration of the model pretraining objectives and dataset compositions to foster balanced development of cross-modal representations.

Looking forward, exploring more language-for-vision tasks and integrating robust, human-generated visual annotations could drive advancements in multimodal AI, allowing these models to be more bidirectionally integrative. This approach might not only improve existing applications but could also unlock new domains where understanding and generating multimodal content is crucial.

In summary, this work provides compelling evidence of the directional bias in multimodal transformers and introduces a new method to diagnose and potentially rectify such biases, making it an important contribution to the field of AI research.
