Multimodal Token Fusion for Vision Transformers

Published 19 Apr 2022 in cs.CV (arXiv:2204.08721v2)

Abstract: Many adaptations of transformers have emerged to address single-modal vision tasks, where self-attention modules are stacked to handle input sources like images. Intuitively, feeding multiple modalities of data to vision transformers could improve the performance, yet the inner-modal attentive weights may also be diluted, which could thus undermine the final performance. In this paper, we propose a multimodal token fusion method (TokenFusion), tailored for transformer-based vision tasks. To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features. Residual positional alignment is also adopted to enable explicit utilization of the inter-modal alignments after fusion. The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact. Extensive experiments are conducted on a variety of homogeneous and heterogeneous modalities and demonstrate that TokenFusion surpasses state-of-the-art methods in three typical vision tasks: multimodal image-to-image translation, RGB-depth semantic segmentation, and 3D object detection with point cloud and images. Our code is available at https://github.com/yikaiw/TokenFusion.

Citations (128)

Summary

  • The paper introduces TokenFusion, a dynamic fusion method that replaces less informative tokens with aggregated inter-modal features to improve vision transformer performance.
  • It employs a residual positional alignment strategy to preserve positional embeddings and effectively integrate multimodal data in tasks like image translation and semantic segmentation.
  • Experimental results demonstrate significant gains over state-of-the-art methods, with lower FID and KID scores in multimodal image translation, higher accuracy in RGB-depth segmentation, and higher mAP in 3D object detection.

The paper "Multimodal Token Fusion for Vision Transformers" presents a novel method, TokenFusion, designed to enhance the capability of transformer models in handling multimodal vision tasks. This research addresses a critical challenge in applying vision transformers to multimodal data: naively fusing information from diverse modalities can dilute each modality's attentive weights and degrade overall performance.

Methodology

TokenFusion introduces a dynamic fusion technique that identifies less informative tokens within a transformer and replaces them with projected, aggregated inter-modal features. This minimizes interference with the single-modal design while retaining the most informative content of each modality. The method also employs a residual positional alignment strategy that preserves each modality's positional embeddings after substitution, enabling more effective integration of multimodal data.
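The core substitution step can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear scoring head, the fixed threshold, and the names `token_fusion`, `score_w`, and `proj_w` are all illustrative stand-ins (the paper learns the scoring jointly with the network, encouraging sparsity in the scores).

```python
import numpy as np

rng = np.random.default_rng(0)

def token_fusion(tokens_a, tokens_b, score_w, proj_w, thresh=0.5):
    """Replace uninformative tokens of modality A with projected modality-B tokens.

    tokens_a, tokens_b: (num_tokens, dim) aligned token sequences.
    score_w: (dim,) weights of an illustrative linear scoring head.
    proj_w:  (dim, dim) projection mapping B's features into A's token space.
    """
    # Per-token importance scores squashed into (0, 1).
    scores = 1.0 / (1.0 + np.exp(-(tokens_a @ score_w)))
    mask = scores < thresh                  # True = uninformative, to be replaced
    projected_b = tokens_b @ proj_w         # inter-modal feature projection
    fused = np.where(mask[:, None], projected_b, tokens_a)
    return fused, mask

dim, n = 8, 16
tokens_a = rng.standard_normal((n, dim))
tokens_b = rng.standard_normal((n, dim))
fused, mask = token_fusion(tokens_a, tokens_b,
                           rng.standard_normal(dim),
                           rng.standard_normal((dim, dim)))
# Informative tokens pass through unchanged; masked ones now carry B's features.
assert np.allclose(fused[~mask], tokens_a[~mask])
```

Because only low-scoring tokens are overwritten, the surviving computation is exactly the single-modal transformer's, which is what lets TokenFusion reuse pre-trained single-modal architectures.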

The technique is applied to vision tasks involving homogeneous modalities, such as multimodal image-to-image translation and RGB-depth semantic segmentation, as well as heterogeneous modalities, such as 3D object detection from point clouds and images. A significant advantage of TokenFusion is its dynamic, adaptive nature, which keeps it compatible with pre-trained single-modal models and their existing architectures.
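The residual positional alignment mentioned above can be sketched as re-injecting each modality's original positional embedding before every layer, so that a token substituted with inter-modal features still carries its original positional identity. The sketch below is a toy reading of that idea, with linear maps standing in for transformer blocks; all names are illustrative.

```python
import numpy as np

def forward_with_residual_pos(fused_tokens, pos_embed, layers):
    """Re-add the modality's original positional embedding at every depth,
    so substituted tokens keep their spatial alignment after fusion."""
    x = fused_tokens
    for layer in layers:
        x = layer(x + pos_embed)   # residual positional re-injection
    return x

rng = np.random.default_rng(1)
dim, n = 8, 16
pos = rng.standard_normal((n, dim)) * 0.01
# Toy "layers": each is a linear map standing in for a full transformer block.
layers = [lambda x, w=rng.standard_normal((dim, dim)) / dim: x @ w
          for _ in range(2)]
out = forward_with_residual_pos(rng.standard_normal((n, dim)), pos, layers)
assert out.shape == (n, dim)
```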

Experimental Results

The experiments on several vision tasks demonstrate that TokenFusion surpasses existing state-of-the-art methods. For multimodal image-to-image translation, TokenFusion outperforms previous methods with lower FID and KID scores, indicating generated images closer to the target distribution. For RGB-depth semantic segmentation, TokenFusion achieves high accuracy, outperforming prominent models such as SSMA and CEN.

In 3D object detection scenarios involving both 3D point clouds and 2D images, TokenFusion achieves remarkable improvements in mAP metrics on datasets like SUN RGB-D and ScanNetV2. The fusion strategy significantly boosts detection accuracy by strategically aligning 2D and 3D information, indicating the proposed method's robustness and adaptability across diverse modalities.

Implications and Future Work

The implications of this research are significant in both theoretical and practical domains. From a theoretical perspective, TokenFusion offers a structured way to apply transformer mechanisms, originally developed for language, to vision tasks involving diverse data modalities, positioning it as a potential framework for future multimodal transformer architectures. Practically, the consistent performance gains across tasks suggest that transformer-based models, when appropriately configured for multimodal inputs, can reach a new level of efficacy and applicability in commercial and research settings.

The paper also opens avenues for future work, particularly in improving the adaptability of TokenFusion to more demanding scenarios such as real-time detection and integration-heavy applications like AR/VR. Extending the approach beyond vision, for example to audio-visual fusion or more general multimodal learning, presents an exciting frontier for researchers.

TokenFusion is positioned as a versatile, high-performance solution for multimodal vision tasks, offering clear guidance on leveraging transformer architectures for multimodal fusion with substantial empirical backing. Future research could refine such methods further by addressing computational efficiency and exploring integrations with other state-of-the-art advances in machine learning.
