- The paper introduces an attentive multilayer fusion mechanism that dynamically integrates intermediate- and final-layer representations to improve task-specific performance.
- It applies cross-attention over both CLS and average-pooled tokens from every layer, capturing everything from low-level structural cues to high-level semantic features.
- The method achieves an average accuracy gain of 5.54 percentage points over linear probing across 20 datasets while remaining computationally efficient.
Introduction
The paper "Beyond the final layer: Attentive multilayer fusion for vision transformers" (2601.09322) addresses the challenge of efficiently adapting large-scale foundation models, particularly Vision Transformers (ViTs), to diverse downstream tasks. Traditional linear probing relies on the final-layer CLS token, which often constrains performance due to the assumption that it captures all relevant information. The authors propose a sophisticated attention-based mechanism to dynamically fuse representations across all layers of ViTs, thereby leveraging both low-level structural cues and high-level semantic abstractions distributed throughout the model's hierarchy.
Methodology
The study begins by showing that task-relevant information is dispersed across the layers of a ViT rather than concentrated in the final layer. Unlike the standard practice of probing only the last-layer CLS token, the authors highlight the utility of intermediate layers, which retain structural and textural information that is particularly valuable when a downstream task diverges from the pre-training domain.
Attentive Multilayer Fusion
To capture the full spectrum of task-relevant data, the authors propose an attentive probing mechanism. This technique integrates data from all layers within a ViT, applying a cross-attention mechanism to both CLS and average-pooled (AP) tokens. By treating the fusion as a dynamic selection process, the method automatically prioritizes the most informative layers, adapting to various downstream requirements.
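The fusion step can be sketched as a single-query cross-attention over the per-layer tokens. The snippet below is a minimal NumPy illustration of that idea, not the authors' implementation; the single learned query, the projection shapes, and all variable names are assumptions made for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_fusion(layer_tokens, W_q, W_k, W_v, query):
    """Cross-attention of one learned query over per-layer CLS/AP tokens.

    layer_tokens: (2L, d) -- a CLS and an average-pooled token from each
                             of the L backbone layers
    query:        (d,)    -- a single learned query vector (assumption)
    Returns the fused feature and the attention weights, which act as a
    soft, task-adaptive selection over layers.
    """
    d = layer_tokens.shape[1]
    q = query @ W_q                       # (d,)
    k = layer_tokens @ W_k                # (2L, d)
    v = layer_tokens @ W_v                # (2L, d)
    attn = softmax(k @ q / np.sqrt(d))    # (2L,) layer-importance weights
    return attn @ v, attn                 # fused feature for the classifier

L, d = 12, 64                             # toy sizes
tokens = rng.standard_normal((2 * L, d))  # stand-in for frozen ViT tokens
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
query = rng.standard_normal(d)

fused, weights = attentive_fusion(tokens, W_q, W_k, W_v, query)
print(fused.shape, round(weights.sum(), 6))  # (64,) and weights summing to 1
```

Inspecting `weights` is also what makes the attention heatmaps in the paper's analysis possible: the weights directly indicate which layers the probe found informative for a given task.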
Implementation and Efficiency
The proposed algorithm remains computationally efficient: only the attention-fusion module and the classifier are trained, while the backbone stays frozen. This task-aware adaptation yields consistent performance gains without significantly inflating the parameter count, which is independent of the backbone's depth because tokens from every layer share the same fusion module.
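The depth-independence can be made concrete with a back-of-the-envelope count. The sketch assumes a hypothetical head consisting of Q/K/V projections, one learned query, and a linear classifier with bias; this is an illustrative parameterization, not the paper's exact module.

```python
def fusion_head_params(d, num_classes):
    """Trainable parameters of a hypothetical fusion head:
    three d x d projections (Q, K, V), one learned d-dim query,
    and a linear classifier with bias (assumed parameterization)."""
    return 3 * d * d + d + d * num_classes + num_classes

# The count depends only on the embedding width and the number of
# classes, not on how many layers contribute tokens: 12-layer and
# 24-layer backbones of the same width get an identically sized head.
print(fusion_head_params(768, 1000))  # -> 2539240, about 2.5M parameters
```

Because the projections are shared across all layer tokens, adding more layers only lengthens the attention's key/value sequence; it adds no new weights.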
Experimental Results
The method was validated across 20 datasets and multiple pretrained ViT models, demonstrating a notable average accuracy increase of 5.54 percentage points over traditional linear probing. These gains affirm the value of incorporating intermediate representations, which prove particularly advantageous for tasks that depart from the model's pre-training objectives. The authors provide a granular analysis of attention heatmaps, illustrating task-specific variations in layer relevance. Interestingly, layers contribute differently depending on task characteristics: natural-image domains lean on high-level features, while specialized datasets benefit more from mid-layer representations, which are rich in structural information.
Implications and Future Directions
Theoretical and Practical Insights
This research underscores the importance of multi-layer information fusion, challenging the long-held belief that the final layers of deep networks encapsulate most task-relevant information. By demonstrating the potential of intermediate layers, this work aligns with findings from related domains like natural language processing, where similar trends have been observed.
Future Research Directions
Given the success of this adaptive probing in vision tasks, there is fertile ground for extending similar techniques to other domains, including language and biological sequence modeling. As foundation models continue to expand in complexity and application scope, leveraging their entire representational hierarchy will likely become a critical area for exploration.
Conclusion
The paper presents a compelling case for re-evaluating how foundational models are adapted for downstream tasks. Through attentive multilayer fusion, the study reveals substantial performance enhancements by tapping into the underutilized potential of intermediate layers in Vision Transformers. This work not only offers an immediate performance boost in computationally efficient ways but also sets the stage for future explorations into transformative model adaptation strategies across diverse AI applications.