Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Abstract: We present Integrated Multimodal Perception (IMP), a simple and scalable multimodal multi-task training and modeling approach. IMP integrates multimodal inputs including image, video, text, and audio into a single Transformer encoder with minimal modality-specific components. IMP makes use of a novel design that combines Alternating Gradient Descent (AGD) and Mixture-of-Experts (MoE) for efficient model and task scaling. We conduct extensive empirical studies and reveal the following key insights: 1) Performing gradient descent updates by alternating on diverse modalities, loss functions, and tasks, with varying input resolutions, efficiently improves the model. 2) Sparsification with MoE on a single modality-agnostic encoder substantially improves performance, outperforming dense models that use modality-specific encoders or additional fusion layers, and greatly mitigating the conflicts between modalities. IMP achieves competitive performance on a wide range of downstream tasks including video classification, image classification, image-text retrieval, and video-text retrieval. Most notably, we train a sparse IMP-MoE-L variant focusing on video tasks that achieves a new state of the art in zero-shot video classification: 77.0% on Kinetics-400, 76.8% on Kinetics-600, and 68.3% on Kinetics-700, improving over the previous state of the art by +5%, +6.7%, and +5.8%, respectively, while using only 15% of its total training computational cost.
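The AGD idea sketched in insight 1) can be illustrated with a minimal toy loop: rather than summing the losses of every task into one gradient step, each step samples a single (modality, objective) pair and applies one update for that task alone. The task names, loss functions, and sampling scheme below are illustrative assumptions, not the paper's actual implementation.

```python
import random

def agd_train(params, tasks, steps, lr=0.1, seed=0):
    """Alternating Gradient Descent sketch: one single-task update per step."""
    rng = random.Random(seed)
    history = []
    for _ in range(steps):
        name, grad_fn = rng.choice(tasks)  # pick one task for this step
        grad = grad_fn(params)             # gradient of that task's loss only
        params = params - lr * grad        # plain SGD update on shared params
        history.append(name)
    return params, history

# Toy stand-ins for heterogeneous objectives (e.g. image-text contrastive
# vs. video classification): quadratic losses pulling toward different optima.
tasks = [
    ("image_text", lambda p: 2.0 * (p - 1.0)),  # d/dp of (p - 1)^2
    ("video_cls",  lambda p: 2.0 * (p + 1.0)),  # d/dp of (p + 1)^2
]
params, history = agd_train(0.0, tasks, steps=200)
```

Because each step optimizes only one objective, the shared parameters settle into a compromise between the tasks' individual optima (here, somewhere between -1 and 1), which is the behavior AGD relies on when cycling over modalities, losses, and resolutions.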