
Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Published 3 Mar 2019 in cs.CV and cs.CL | (arXiv:1903.00839v2)

Abstract: Referring expression grounding aims at locating certain objects or persons in an image with a referring expression, where the key challenge is to comprehend and align various types of information from visual and textual domain, such as visual attributes, location and interactions with surrounding regions. Although the attention mechanism has been successfully applied for cross-modal alignments, previous attention models focus on only the most dominant features of both modalities, and neglect the fact that there could be multiple comprehensive textual-visual correspondences between images and referring expressions. To tackle this issue, we design a novel cross-modal attention-guided erasing approach, where we discard the most dominant information from either textual or visual domains to generate difficult training samples online, and to drive the model to discover complementary textual-visual correspondences. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on three referring expression grounding datasets.

Citations (171)

Summary

  • The paper introduces cross-modal attention-guided erasing as a regularization technique to improve feature alignment in referring expression grounding.
  • It employs three targeted erasing mechanisms that force the model to discover complementary textual and visual cues beyond dominant features.
  • Extensive experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate state-of-the-art performance with no added cost at inference time.

Referring expression grounding is a complex task within the domain of vision and language that requires accurate cross-modal alignments to locate objects in an image based on natural language descriptions. This paper introduces a novel approach to enhance the grounding process by employing an innovative cross-modal attention-guided erasing method. Unlike previous models that predominantly focus on capturing the most salient feature alignments between text and visual elements, this approach attempts to uncover and learn from the complementary correspondences that are often overlooked.

The proposed method stands out by leveraging an erasing mechanism where dominant textual or visual features, indicated by attention weights, are deliberately discarded during the training phase. This process yields harder training samples that push the model to seek additional evidence across both modalities, ultimately encouraging the learning of richer dual-modal correspondences.
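The core idea can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: the function name, the plain-list feature representation, and the attention values are all assumptions for the sake of the example.

```python
# Attention-guided erasing: zero out the single most-attended feature
# to produce a harder training sample, forcing the model to rely on
# complementary evidence. Illustrative sketch only.

def erase_most_attended(features, attention):
    """Return a copy of `features` with the highest-attention entry zeroed.

    features  -- list of feature vectors (one per region or word)
    attention -- list of attention weights, same length as `features`
    """
    target = max(range(len(attention)), key=lambda i: attention[i])
    erased = [vec[:] for vec in features]          # copy, leave input intact
    erased[target] = [0.0] * len(erased[target])   # drop the dominant cue
    return erased

feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
attn = [0.1, 0.7, 0.2]
hard_sample = erase_most_attended(feats, attn)
# hard_sample[1] is zeroed; the model must ground the expression
# using the remaining features
```

In the paper this erasing happens online during training, so each mini-batch mixes original samples with their erased counterparts.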

The methodology is structured around three distinct erasing mechanisms:

  1. Image-aware Query Sentence Erasing: Here, the model determines word importance in the sentence based on visual context and attention levels. Key words are replaced with an "unknown" token, thereby maintaining sentence structure while eliminating the influence of those words, encouraging the model to explore alternative alignments.
  2. Sentence-aware Subject Region Erasing: This mechanism erases critical regions on the subject module's feature map, as determined by sentence-aware spatial attention. It forces the model to discover complementary regions rather than focusing only on the most discriminative areas.
  3. Sentence-aware Context Object Erasing: Essential for the location and relationship modules, this approach discards prominent context objects as determined by sentence-aware attention, encouraging the model to leverage other regions or modules.
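The first mechanism, erasing the most-attended word while preserving sentence structure, can be sketched as follows. The `"<unk>"` token, the example sentence, and the attention values are illustrative assumptions, not taken from the paper's code:

```python
def erase_top_word(words, attention, unk="<unk>"):
    """Replace the most-attended word with an unknown token, keeping
    sentence length and structure intact (sketch of image-aware query
    sentence erasing; not the authors' implementation)."""
    top = max(range(len(attention)), key=lambda i: attention[i])
    return [unk if i == top else w for i, w in enumerate(words)]

sentence = ["the", "man", "holding", "a", "red", "umbrella"]
attn = [0.02, 0.40, 0.08, 0.02, 0.30, 0.18]  # "man" dominates here
print(erase_top_word(sentence, attn))
# ['the', '<unk>', 'holding', 'a', 'red', 'umbrella']
```

With the dominant word masked, the model can no longer rely on "man" alone and must ground the expression through the remaining cues such as "red umbrella".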

The paper underscores the superiority of attention-guided erasing over alternatives such as random erasing or adversarial region selection. Because attention mechanisms naturally emphasize the most salient features, they can suppress back-propagation signals to less prominent but still relevant features. By removing attention from the dominant pairs, the method acts as a structured regularizer that directs learning toward a broader set of cross-modal interactions.

To validate the proposed approach, extensive experiments were conducted on the RefCOCO, RefCOCO+, and RefCOCOg datasets. Across these datasets, the method achieved state-of-the-art performance, surpassing previous models by effectively capturing a wider array of textual and visual features required for accurate grounding.

The theoretical implications of this research are significant, suggesting that a shift from general attention strategies to targeted erasing can enhance understanding in tasks combining vision and language. Practically, this model does not increase complexity during inference, making it scalable and efficient for real-world applications.

Future directions may explore automated refinement of erasing methods, potentially incorporating dynamic attention weights that evolve during the training process. Additionally, expanding upon the types of textual-visual relationships that can be learned through erasing could enhance model adaptability across a broader spectrum of multimodal tasks. This paper represents a substantial contribution to referring expression grounding by bridging gaps in understanding diverse cross-modal interactions through systematic feature erasing.
