- The paper introduces Adaptive Feature Transfer (AFT), which transfers key features instead of weights to enhance the performance of small downstream models.
- It applies a novel feature-based regularization and kernel-based formulation to align transferred features while reducing computational complexity.
- Experimental results across vision, language, and multi-modal tasks demonstrate significant performance gains and efficiency improvements.
Transferring Knowledge from Large Foundation Models to Small Downstream Models
The paper introduces Adaptive Feature Transfer (AFT), a novel approach for transferring knowledge from large foundation models to smaller, task-specific downstream models. Unlike conventional transfer learning methods, which predominantly use pre-trained weights as an initialization, AFT transfers features, allowing for more efficient and targeted knowledge transfer. This methodology decouples the choice of pre-trained model from the downstream architecture, opening a new pathway for combining multiple pre-trained models that encapsulate complementary information, with minimal computational overhead.
Methodology
AFT diverges significantly from standard transfer learning techniques by prioritizing features over weights, providing a more flexible and efficient mechanism to adapt pre-trained knowledge to downstream tasks. The core innovation lies in effectively filtering and transferring task-relevant features from pre-trained models, thereby steering the downstream model to utilize a subset of pre-trained features optimal for the specific task.
The primary components of this methodology include:
- Feature-Based Regularization: AFT employs a novel regularization term that aims to minimize the mutual information between the downstream model's features and the raw input, conditioned on the pre-trained features. This encourages the downstream model to focus on features relevant for the task and disregard irrelevant information.
- Kernel-Based Reformulation: To reduce optimization complexity and enhance generalization, AFT leverages a kernel-based formulation. This involves aligning the kernels of downstream features with those derived from pre-trained models, ensuring that the downstream model captures the most informative aspects of the pre-trained representations.
- Adaptive Weighting: The method dynamically learns weights for the pre-trained features, prioritizing those that provide the most utility for the downstream task. This adaptive mechanism keeps the transfer robust even when some pre-trained features are noisy or irrelevant to the task at hand.
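The kernel-based regularizer and adaptive weighting described above can be sketched as follows. This is a minimal NumPy illustration of the idea, not the paper's exact implementation: the function name `aft_style_regularizer` and its signature are hypothetical, and in practice the weights would be learned jointly with the downstream model by gradient descent in a deep learning framework.

```python
import numpy as np

def feature_kernel(z):
    """Cosine-similarity kernel over a batch of features: (B, D) -> (B, B)."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return z @ z.T

def aft_style_regularizer(down_feats, pretrained_feats, logits):
    """Penalize misalignment between the downstream feature kernel and an
    adaptively weighted combination of pre-trained feature kernels.

    down_feats:       (B, D) features from the small downstream model.
    pretrained_feats: list of (B, D_i) frozen pre-trained feature batches.
    logits:           shape (len(pretrained_feats),); their softmax gives
                      the adaptive weights (learnable in practice).
    """
    w = np.exp(logits - logits.max())
    w = w / w.sum()  # adaptive weights over pre-trained models, sum to 1
    k_down = feature_kernel(down_feats)
    k_pre = sum(wi * feature_kernel(z) for wi, z in zip(w, pretrained_feats))
    # Mean squared distance between the two kernels: zero when the
    # downstream features induce the same pairwise similarities.
    return np.mean((k_down - k_pre) ** 2)
```

During training, the total objective would be the task loss plus a coefficient times this regularizer; pre-trained models whose feature kernels better match the downstream task would receive larger adaptive weights.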
Experimental Evaluation
AFT’s efficacy is rigorously tested across various datasets encompassing vision, language, and multi-modal applications. The experiments highlight the substantial performance gains provided by AFT in several key scenarios:
- Vision Tasks: Using models like ViT-S/16, MLP-Mixer-B, and ResNet-50 as downstream architectures, AFT consistently outperformed standard transfer learning, knowledge distillation, and B-Tuning methods. When transferring from models such as DINOv2 ViT-G/14, AFT achieved a marked reduction in error rates.
- Language Tasks: For NLP tasks, AFT demonstrated a notable advantage when transferring from large models like Flan-T5 and LLaMA to smaller models such as BERT Small and DistilBERT. AFT's robustness was particularly evident in its capacity to efficiently translate improvements in large pre-trained models into downstream performance gains.
- Multi-Modal Tasks: The adaptability of AFT shone in multi-modal scenarios as well. For instance, in the SNLI-VE visual entailment task, the paper showed how combining features from DINOv2 ViT-G/14 and LLaMA 13B using AFT enhanced performance beyond using a single pre-trained model.
Implications and Future Directions
The results presented in the paper have profound implications for both theoretical and practical aspects of machine learning:
- Efficiency: Because AFT decouples the pre-trained model from the downstream model, the large foundation model is needed only during training; inference runs entirely on the small downstream model, yielding substantial computational savings and a significant reduction in inference time.
- Scalability: The method scales effectively with the quality of pre-trained models, translating improvements in foundation models into corresponding gains in downstream performance. This ensures AFT remains relevant as foundation models continue to grow in size and capability.
- Versatility: By efficiently transferring complementary information from multiple models, AFT demonstrates considerable versatility across different domains and tasks, paving the way for more robust and adaptive deployment of machine learning models.
Future research building on AFT could explore extending the transfer to intermediate layers and enhancing the adaptive feature selection mechanisms. Evaluating AFT in other domains such as reinforcement learning and large-scale multi-task learning could reveal additional strengths and limitations. The continuous advancements in foundation models will further illuminate the potential of techniques like AFT, ensuring the adaptability and efficiency of knowledge transfer in AI systems.
Overall, AFT represents a significant step forward in the refinement and optimization of transfer learning methodologies, offering a practical and potent tool for improving the deployment of machine learning models across diverse applications.