
Zorro: the masked multimodal transformer

Published 23 Jan 2023 in cs.CV (arXiv:2301.09595v2)

Abstract: Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network - thus requiring very little fusion engineering. The resulting representations are however fully entangled throughout the network, which may not always be desirable: in learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; in inference, evaluation of audio-visual models should be possible on benchmarks having just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.

Citations (18)

Summary

  • The paper introduces a novel masking method for unified multimodal processing that preserves modality-specific representations.
  • It demonstrates robust improvements through contrastive pre-training on ViT, Swin, and HiP architectures with top benchmark scores on AudioSet and VGGSound.
  • Its versatile approach enables effective cross-modal training and reliable unimodal inference on tasks like Kinetics-400 and ESC-50.

Overview of Zorro

Zorro is a recently introduced technique for multimodal learning that addresses a key limitation of earlier methods: when inputs from several modalities are concatenated and fed to one backbone, the resulting representations become fully entangled throughout the network. Zorro lets a single backbone Transformer process multiple sensory modalities, such as audio and video, while supporting both unimodal and multimodal processing.

Methodology

Zorro employs a masking strategy inside the Transformer: modality-specific portions of the representation attend only within their own modality and therefore stay modality-pure, while a dedicated fusion portion is allowed to attend to all modalities. The paper evaluates Zorro by applying it to three prominent Transformer-based architectures, namely ViT, Swin, and HiP. Because the model produces both multimodal and modality-specific outputs, it naturally supports contrastive audio-visual pre-training, which requires independent audio and visual features to avoid collapse.
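The routing described above can be expressed as a binary attention mask over the concatenated token sequence. The following is a minimal NumPy sketch of that idea, not the paper's implementation: the token ordering (audio, then video, then fusion), the single-head attention, and the function names are illustrative assumptions.

```python
import numpy as np

def zorro_attention_mask(n_audio, n_video, n_fusion):
    """Build a boolean mask (True = query may attend to key).

    Audio tokens attend only to audio tokens, video tokens only to
    video tokens, and fusion tokens attend to everything -- so the
    audio and video streams remain modality-pure while the fusion
    stream aggregates both. (Sketch of the idea, not the paper's code.)
    """
    n = n_audio + n_video + n_fusion
    audio = slice(0, n_audio)
    video = slice(n_audio, n_audio + n_video)
    fusion = slice(n_audio + n_video, n)

    mask = np.zeros((n, n), dtype=bool)
    mask[audio, audio] = True   # audio -> audio only
    mask[video, video] = True   # video -> video only
    mask[fusion, :] = True      # fusion -> all tokens
    return mask

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -1e9)   # block disallowed routes
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With this mask, the outputs for audio tokens are unaffected by the video tokens' values (and vice versa), which is exactly the property that keeps part of the representation modality-pure for unimodal inference and contrastive learning.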

Results

With contrastive pre-training, Zorro achieves state-of-the-art results across several multimodal benchmarks, including AudioSet and VGGSound. Furthermore, Zorro can perform unimodal inference on video and audio benchmarks such as Kinetics-400 and ESC-50, a testament to the model's versatility.

Contributions and Implications

The paper makes four key contributions:

  • It introduces novel multimodal Transformer architectures (based on ViT, Swin, and HiP) suited to both supervised and self-supervised training.
  • It demonstrates that Zorro-modified architectures outperform their vanilla counterparts.
  • It shows efficient pre-training on large-scale audio-visual datasets.
  • It reports strong benchmark performance together with unimodal inference capability.

Together, these results position Zorro as a powerful tool for multimodal AI systems, able to integrate different types of sensory data with minimal fusion engineering.
