Accelerating Vision-Language Pretraining with Free Language Modeling

Published 24 Mar 2023 in cs.CV (arXiv:2303.14038v1)

Abstract: The state of the arts in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs resulting from slow convergence and long training time, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling (MLM), that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from prediction loss. To accelerate the convergence of VLP, we propose a new pretraining task, namely, free language modeling (FLM), that enables a 100% prediction rate with arbitrary corruption rates. FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted. FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly. Extensive experiments show FLM could achieve an impressive 2.5x pretraining time reduction in comparison to the MLM-based methods, while keeping competitive performance on both vision-language understanding and generation tasks. Code will be public at https://github.com/TencentARC/FLM.

Citations (9)

Summary

  • The paper introduces Free Language Modeling (FLM) that decouples prediction and corruption rates, enabling full token utilization during training.
  • The paper presents an encode-corrupt-predict framework that reduces pretraining time by 2.5x while maintaining competitive results on vision-language tasks.
  • The paper demonstrates FLM's scalability across large datasets, paving the way for more efficient and flexible vision-language model training.

Overview of "Accelerating Vision-Language Pretraining with Free Language Modeling"

The paper "Accelerating Vision-Language Pretraining with Free Language Modeling" introduces a novel approach aimed at enhancing the efficiency of vision-language pretraining (VLP) processes. The authors identify and address a significant bottleneck in existing masked language modeling (MLM) techniques, namely the dependency between prediction and corruption rates. This dependency often results in suboptimal utilization of output tokens, leading to slower convergence and extended training periods, particularly when handling large-scale web datasets.

Free Language Modeling (FLM)

The key contribution of this paper is the introduction of Free Language Modeling (FLM), which decouples the prediction rate from the corruption rate, allowing full (100%) prediction utilization regardless of how much of the input is corrupted. This paradigm shift enables more flexible exploitation of context during training:

  • Unlinked Corruption and Prediction Rates: FLM allows for setting an independent corruption rate while still employing a 100% prediction rate. This eliminates the typical constraint where increasing the corruption rate necessarily limits the proportion of tokens subjected to prediction tasks.
  • Customizable Corruption Spans: The technique allows for flexible corrupted spans tailored for each token prediction, thereby enabling the model to better leverage bidirectional contexts and enhance learning speed.
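The decoupling described above can be sketched in miniature. The toy Python below (an illustration, not the paper's implementation) contrasts MLM, where only the corrupted tokens receive a prediction loss, with an FLM-style scheme, where every token is predicted and each target gets its own per-token corruption mask over its context:

```python
import random

def mlm_loss_mask(seq_len, corruption_rate, rng):
    # MLM: corrupt a random subset of positions; only those corrupted
    # positions are predicted, so prediction rate == corruption rate.
    corrupted = {i for i in range(seq_len) if rng.random() < corruption_rate}
    return corrupted, corrupted  # (corrupted positions, predicted positions)

def flm_visibility(seq_len, corruption_rate, rng):
    # FLM (sketch): every position is predicted (100% prediction rate).
    # Each target i gets its own corrupted context: a boolean row marking
    # which other tokens it is allowed to attend to.
    visible = [[(j != i) and (rng.random() >= corruption_rate)
                for j in range(seq_len)]
               for i in range(seq_len)]
    predicted = set(range(seq_len))
    return visible, predicted

rng = random.Random(0)
_, mlm_pred = mlm_loss_mask(16, 0.4, rng)
_, flm_pred = flm_visibility(16, 0.4, rng)
print(len(mlm_pred), len(flm_pred))  # FLM predicts all 16 tokens
```

Note that in the MLM case a higher corruption rate removes more tokens from the input but is also the only way to increase the number of loss terms, whereas in the FLM sketch the two knobs move independently.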

Framework and Implementation

The proposed framework follows an encode-corrupt-predict strategy: the input is encoded once with full bidirectional context, after which multiple corrupted views of the representation are generated and predicted in parallel. Applying corruption after encoding, rather than to the input before encoding as in MLM, amortizes the encoder's cost across all prediction targets, makes efficient use of the GPU, and yields substantial reductions in pretraining time.
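A minimal numerical sketch of the encode-corrupt-predict idea follows, with a random feature matrix standing in for the bidirectional encoder and masked mean-pooling standing in for the prediction head (all names, shapes, and operations here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

def encode_once(tokens, d=8, rng=None):
    # Stand-in for a bidirectional encoder: one forward pass over the
    # uncorrupted sequence yields one feature vector per token.
    rng = rng or np.random.default_rng(0)
    return rng.standard_normal((len(tokens), d))

def corrupt_and_predict(features, corruption_rate, rng):
    # Corruption happens AFTER encoding: for each target position, mask
    # out a random subset of the cached features (its "corrupted span").
    n, d = features.shape
    visible = rng.random((n, n)) >= corruption_rate
    np.fill_diagonal(visible, False)  # a token never sees itself
    # Masked mean-pool over each target's visible context features.
    counts = np.maximum(visible.sum(axis=1, keepdims=True), 1)
    context = (visible[:, :, None] * features[None]).sum(axis=1) / counts
    return context  # shape (n, d): one prediction per token, in parallel

rng = np.random.default_rng(0)
feats = encode_once(list("a toy sequence"), rng=rng)
preds = corrupt_and_predict(feats, corruption_rate=0.4, rng=rng)
print(preds.shape)
```

The point of the sketch is structural: `encode_once` runs a single time, while `corrupt_and_predict` can generate arbitrarily many corrupted views over the cached features without re-running the encoder.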

In particular, FLM achieves a 2.5-fold reduction in pretraining time relative to MLM-based methods without degrading performance on vision-language understanding and generation, as evidenced by competitive results on VQA, NLVR2, image captioning, and image-text retrieval across widely recognized benchmarks.

Experimental Validation and Outcomes

The authors validate FLM across multiple configurations and observe notable improvements compared to previous methodologies:

  • Efficient Learning: FLM yields significant performance retention with improved training efficiency. Comparisons with MLM, AR, and PrefixLM, particularly in settings with varied prediction and corruption rates, highlight the versatility of FLM in adapting to different training demands while retaining high efficacy.
  • Broader Applicability: The methodology scales well with increases in data size and model complexity, maintaining competitive performance across different architectures and datasets while demonstrating faster convergence.
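The objectives compared above occupy different points in the corruption-rate/prediction-rate space; this can be summarized schematically (the entries are illustrative characterizations of each objective's general regime, not exact settings from the paper):

```python
# Schematic comparison of pretraining objectives by how they couple
# corruption (context removed) and prediction (tokens receiving loss).
objectives = {
    "MLM":      {"corruption": "tunable mask ratio",   "prediction": "masked tokens only"},
    "AR":       {"corruption": "causal (left-to-right)", "prediction": "100%"},
    "PrefixLM": {"corruption": "fixed prefix split",   "prediction": "suffix tokens only"},
    "FLM":      {"corruption": "arbitrary, per token", "prediction": "100%"},
}

for name, v in objectives.items():
    print(f"{name:9s} corruption={v['corruption']}; prediction={v['prediction']}")
```

Only FLM combines a freely chosen corruption regime with loss on every output token, which is the source of the efficiency gains reported above.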

Implications and Future Directions

The implications of this research are twofold. Practically, it reduces the computational resources required to train vision-language models, facilitating their broader deployment. Theoretically, it challenges existing paradigms by demonstrating the benefits of decoupling the prediction rate from the corruption rate, opening avenues for future research into more flexible learning objectives in the vision-language domain.

In conclusion, "Accelerating Vision-Language Pretraining with Free Language Modeling" delivers a substantive advance in the training efficiency of vision-language models, with empirical support from extensive experiments. Future work may explore further optimizations or extensions of the approach, particularly its integration with other learning paradigms or its application across a wider range of AI tasks.
