- The paper introduces EMMeTT, a joint multimodal training framework that integrates text and speech translation using balanced sampling, 2D bucketing, and an OOMptimizer.
- The paper demonstrates that EMMeTT maintains text translation accuracy while enhancing automatic speech translation, achieving improved BLEU scores on FLORES and FLEURS benchmarks.
- The paper extends T5 and GPT architectures for multimodal tasks, providing a scalable approach that can be applied to other complex multimodal applications.
Overview of "EMMeTT: Efficient Multimodal Machine Translation Training"
The paper "EMMeTT: Efficient Multimodal Machine Translation Training" presents a novel framework that addresses the efficiency and effectiveness challenges of training foundation models for multimodal machine translation, focusing specifically on integrating the text and speech modalities. The authors propose a joint multimodal training framework named EMMeTT that extends foundation models to handle automatic speech translation (AST) and neural machine translation (NMT) concurrently. The work applies EMMeTT to two architectures, the encoder-decoder T5 and the decoder-only GPT, with both models extended to accept speech input via Canary-1B's speech encoder.
Contributions
- Multimodal Training Framework: The primary contribution is the EMMeTT framework, which enables efficient training by balancing data sampling across languages, datasets, and modalities. The framework employs techniques such as:
  - Balanced sampling to ensure a stationary data distribution.
  - Efficient sequential data iteration.
  - A novel 2D bucketing technique paired with a batch size optimizer (termed OOMptimizer).
- Model Architectures: Two distinct model architectures are investigated:
  - SALM-T5: An encoder-decoder model built on T5 for NMT tasks.
  - BESTOW-GPT: A decoder-only model based on the TinyLlama GPT architecture, using cross-attention to integrate speech and text.
- Empirical Results: The authors demonstrate that multimodal training not only retains the original text translation capabilities but also significantly enhances AST performance. This dual training paradigm shows improved BLEU scores across multiple language pairs in the FLORES and FLEURS benchmarks.
- Efficiency Techniques:
  - 2D Bucketing: Reduces padding waste by stratifying sampling on both input and output sequence lengths.
  - OOMptimizer: An automatic batch size optimizer that calibrates per-bucket batch sizes to maximize GPU memory utilization without triggering out-of-memory errors.
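The interplay between 2D bucketing and OOMptimizer-style batch-size calibration can be sketched in a few lines of Python. This is a minimal illustration only: the `Example` type, `bucket_key`, `make_batches`, `oomptimize`, and the bucket edges are all hypothetical names, not the paper's or NeMo's actual implementation.

```python
# Toy sketch of 2D bucketing plus OOMptimizer-style batch-size search.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Example:
    input_len: int   # e.g. speech frames or source tokens
    output_len: int  # target tokens

def bucket_key(ex, in_edges=(128, 256, 512), out_edges=(32, 64, 128)):
    """Map an example to an (input-bucket, output-bucket) pair so that
    a mini-batch only mixes similarly shaped sequences."""
    i = sum(ex.input_len > e for e in in_edges)
    j = sum(ex.output_len > e for e in out_edges)
    return (i, j)

def make_batches(examples, batch_size_for):
    """Group examples into 2D buckets, then cut each bucket into
    batches whose size depends on the bucket: longer sequences get
    smaller batches, keeping padding waste and memory use low."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[bucket_key(ex)].append(ex)
    batches = []
    for key, items in buckets.items():
        bs = batch_size_for(key)
        for k in range(0, len(items), bs):
            batches.append(items[k:k + bs])
    return batches

def oomptimize(try_step, start=1024):
    """Toy analogue of the OOMptimizer: halve the candidate batch size
    until a trial step fits in memory, returning the largest size that
    succeeds. The real optimizer calibrates this per bucket on GPU."""
    bs = start
    while bs >= 1:
        try:
            try_step(bs)
            return bs
        except MemoryError:
            bs //= 2
    raise RuntimeError("even batch size 1 does not fit")
```

The key design point is that the batch size is a function of the bucket, so a batch of short utterances can be much larger than a batch of long ones while occupying similar GPU memory.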
Experimental Findings
The experiments involved training on extensive datasets using large-scale computational resources. The models were evaluated on subsets of the FLEURS and FLORES benchmarks, focusing on English, French, German, and Spanish. Key findings include:
- Speech Translation: Both architectures (BESTOW-GPT and SALM-T5) benefited from multimodal training, with SALM-T5 showing a notable BLEU increase from 33.9 to 34.4.
- Text Translation: The T5-based model retained its text translation proficiency after multimodal training, showing that the framework prevents catastrophic forgetting.
Methodological Insights
The EMMeTT framework employs several methodological advancements:
- Stochastic Weighted Multiplexer (MUX): This technique ensures balanced and stationary data distribution for training across different modalities.
- Dynamic Bucketing and 2D Bucketing: By stratifying data on both input and output lengths, padding inefficiencies are minimized, which shortens training time.
- Round Robin and Zip Samplers: These methods effectively combine separately sampled mini-batches from each modality, maintaining a balanced training update across modalities.
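The sampling components above can be illustrated with a short Python sketch of a stochastic weighted multiplexer and a zip sampler. The function names and interfaces here are hypothetical illustrations of the technique, not the paper's actual code.

```python
import random

def weighted_mux(streams, weights, rng=random.Random(0)):
    """Stochastic weighted multiplexer (MUX): at every step, draw the
    next example from one of the per-dataset iterators with a fixed
    probability, so the mixture of languages/modalities stays
    stationary regardless of dataset sizes. Streams are assumed to be
    (effectively) infinite iterators, e.g. cycled datasets."""
    names = list(streams)
    while True:
        name = rng.choices(names, weights=weights, k=1)[0]
        yield next(streams[name])

def zip_sampler(text_batches, speech_batches):
    """Zip sampler: pair one mini-batch from each modality so that
    every optimizer step applies a balanced multimodal update."""
    for text_b, speech_b in zip(text_batches, speech_batches):
        yield {"text": text_b, "speech": speech_b}
```

Because the multiplexer's draw probabilities are fixed rather than proportional to dataset size, the expected composition of the training stream does not drift as smaller datasets are exhausted and recycled.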
Implications and Future Directions
This research has significant implications for the training of multimodal foundation models. The joint training strategies proposed could generalize beyond machine translation, potentially benefiting other multimodal applications such as speech recognition and image-text modeling. The efficiency gains from techniques such as 2D bucketing and the OOMptimizer could accelerate the training of larger, more comprehensive models.
Future developments might explore scaling these methodologies to even larger datasets and models, integrating more modalities, and refining the balance between different types of data to achieve optimal training outcomes. The open-source release of the training code and OOMptimizer within the NVIDIA NeMo toolkit should encourage further experiments and rapid advancements in this domain.
In summary, "EMMeTT: Efficient Multimodal Machine Translation Training" offers a substantial contribution to multimodal training methodologies, demonstrating both enhanced performance and efficiency in integrating text and speech translation capabilities within foundation models.