Synchformer: Efficient Synchronization from Sparse Cues
Abstract: Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and a training scheme that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
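The segment-level contrastive pre-training mentioned in the abstract pairs temporally aligned audio and visual segments from the same clip and contrasts them against all other segments in the batch. Below is a minimal sketch of such an InfoNCE-style objective; the function name, tensor shapes, and temperature are illustrative assumptions rather than the paper's exact formulation or released code.

```python
# Minimal sketch of a segment-level audio-visual contrastive objective
# (hypothetical names and shapes; not the authors' implementation).
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.07):
    """InfoNCE-style loss over temporally aligned audio/visual segments.

    audio_feats, visual_feats: (B, S, D) tensors with one D-dim embedding
    per segment. Segment s of the audio stream is the positive for segment
    s of the visual stream from the same clip; every other pairing in the
    batch serves as a negative.
    """
    B, S, D = audio_feats.shape
    a = F.normalize(audio_feats.reshape(B * S, D), dim=-1)
    v = F.normalize(visual_feats.reshape(B * S, D), dim=-1)
    logits = a @ v.t() / temperature              # (B*S, B*S) similarities
    targets = torch.arange(B * S, device=a.device)
    # Symmetric cross-entropy: audio-to-visual and visual-to-audio retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example usage with random features standing in for encoder outputs.
audio = torch.randn(4, 8, 256)    # 4 clips, 8 segments each, 256-dim
visual = torch.randn(4, 8, 256)
loss = segment_contrastive_loss(audio, visual)
```

Because the loss only requires segment-level correspondence, the audio and visual feature extractors can be pre-trained this way and then frozen, leaving the lighter synchronization module to be trained separately on top of the extracted features.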