Synchformer: Efficient Synchronization from Sparse Cues
Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and a training scheme that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
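The segment-level contrastive pre-training mentioned in the abstract pairs temporally aligned audio and visual segments from the same clip and contrasts them against all other segments in the batch. Below is a minimal sketch of such an InfoNCE-style objective; the function name, tensor shapes, and temperature are illustrative assumptions rather than the paper's exact formulation or released code.

```python
# Minimal sketch of a segment-level audio-visual contrastive objective
# (hypothetical names and shapes; not the authors' implementation).
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """InfoNCE-style loss over temporally aligned audio/visual segments.

    audio_feats, visual_feats: (B, S, D) tensors with one D-dim embedding
    per segment. Segment s of the audio stream is the positive for segment
    s of the visual stream from the same clip; every other pairing in the
    batch serves as a negative.
    """
    B, S, D = audio_feats.shape
    a = F.normalize(audio_feats.reshape(B * S, D), dim=-1)
    v = F.normalize(visual_feats.reshape(B * S, D), dim=-1)
    logits = a @ v.t() / temperature              # (B*S, B*S) similarities
    targets = torch.arange(B * S, device=a.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random features standing in for encoder outputs.
audio = torch.randn(4, 8, 256)    # 4 clips, 8 segments each, 256-dim
visual = torch.randn(4, 8, 256)
loss = segment_contrastive_loss(audio, visual)
```

Because the loss only requires segment-level correspondence, the audio and visual feature extractors can be pre-trained this way and then frozen, leaving the lighter synchronization module to be trained separately on top of the extracted features.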