
ACT-ViT: Vision Transformer Probes

Updated 1 February 2026
  • The paper demonstrates that fusing intermediate transformer layer representations via multi-head cross-attention significantly improves adaptation performance over conventional linear probes.
  • The methodology extracts CLS and average-pooled patch tokens from each layer to form a fused task representation, enabling rapid transfer learning with minimal computational cost.
  • Empirical evaluations reveal consistent gains with a mean +5.54 pp balanced accuracy and up to a 37-point improvement in zero-shot settings for LLM hallucination detection.

Vision Transformer-style Probes (ACT-ViT) are attention-based fusion mechanisms designed to leverage hierarchical representations from all intermediate layers of deep transformer architectures for efficient and accurate adaptation to downstream tasks. Initially developed for vision classification with pretrained Vision Transformers (ViTs), the ACT-ViT methodology generalizes to structured activation tensors from LLMs and supports practical transfer learning scenarios, outperforming conventional linear probes in diverse settings (Ciernik et al., 14 Jan 2026, Bar-Shalom et al., 30 Sep 2025).

1. Principles and Architectural Foundations

ACT-ViT operates under the premise that task-relevant information is distributed across the model's hierarchy rather than concentrated in last-layer features. For vision tasks, the probe receives output embeddings from every transformer block: from each layer $\ell$ it extracts two summary tokens, the [CLS] token $h^{(\ell)}_{\mathrm{CLS}}$ and the average-pooled patch token $h^{(\ell)}_{\mathrm{AP}}$, which are stacked as $H_C \in \mathbb{R}^{2L \times d}$, where $L$ is the number of layers and $d$ the embedding dimensionality.
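As a concrete illustration, the per-layer summary extraction can be sketched in NumPy (the dimensions and the random per-layer outputs below are hypothetical stand-ins; a real implementation would read actual ViT block outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: L transformer layers, P patch tokens, embedding dim d.
L, P, d = 12, 196, 768

# Assume each layer emits a (1 + P, d) matrix: row 0 is the [CLS] token,
# the remaining P rows are patch tokens (illustrative random data here).
layer_outputs = [rng.standard_normal((1 + P, d)) for _ in range(L)]

def build_summary_matrix(layer_outputs):
    """Stack the [CLS] token and average-pooled patch token of every layer."""
    rows = []
    for h in layer_outputs:
        h_cls = h[0]            # [CLS] summary token of this layer
        h_ap = h[1:].mean(0)    # average pool over the P patch tokens
        rows.extend([h_cls, h_ap])
    return np.stack(rows)       # shape (2L, d)

H_C = build_summary_matrix(layer_outputs)
print(H_C.shape)  # (24, 768)
```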

The central probe head employs multi-head cross-attention. A learnable task query $Q \in \mathbb{R}^{1 \times d}$ attends across the set $H_C$, producing per-head outputs via standard scaled dot-product attention,
$$A^{(m)} = \mathrm{softmax}\bigl(Q^{(m)} (K^{(m)})^\top / \sqrt{d_h}\bigr),$$
where $K^{(m)}, V^{(m)}$ are the key/value projections of $H_C$ for head $m$, $M$ is the number of attention heads, and $d_h$ is the per-head dimensionality. The fused vector is projected through a linear layer and fed to a classifier head that supports both regression and categorical outputs (Ciernik et al., 14 Jan 2026).

In the LLM context, ACT-ViT generalizes this approach by treating activation tensors as 2D grids of size $(L_p, N_p)$, derived via max-pooling along the layer and sequence dimensions, producing patch sequences analogous to images for ViT ingestion. A per-LLM Linear Adapter and patch-embedding layers align domains and models without retraining the backbone (Bar-Shalom et al., 30 Sep 2025).
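A minimal NumPy sketch of this pooling step, assuming for simplicity that the window sizes divide the tensor dimensions evenly (the grid sizes and tensor shapes below are illustrative, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical LLM activation tensor: L layers x N sequence positions x d dims.
L, N, d = 32, 128, 512
acts = rng.standard_normal((L, N, d))

def pool_to_grid(acts, L_p, N_p):
    """Max-pool the (L, N, d) activation tensor to an (L_p, N_p, d) grid."""
    L, N, d = acts.shape
    assert L % L_p == 0 and N % N_p == 0, "illustrative: assume divisibility"
    g = acts.reshape(L_p, L // L_p, N_p, N // N_p, d)
    return g.max(axis=(1, 3))   # max over each (layer, sequence) pooling window

grid = pool_to_grid(acts, L_p=8, N_p=16)
# Flatten grid cells into a ViT-style patch sequence of length L_p * N_p.
patches = grid.reshape(-1, d)
print(grid.shape, patches.shape)  # (8, 16, 512) (128, 512)
```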

2. Mathematical Formalism of Layerwise Fusion

Each attention head learns soft selection weights $\alpha^{(m)}_k$ over the $2L$ summary tokens,
$$\alpha^{(m)} = \mathrm{softmax}\bigl(Q^{(m)} (K^{(m)})^\top / \sqrt{d_h}\bigr),$$
yielding head outputs as weighted sums of the values,
$$h^{(m)}_{\mathrm{head}} = \sum_{k=1}^{2L} \alpha^{(m)}_k\, V^{(m)}_k.$$
The final task representation concatenates the heads and applies an output projection:
$$h_{\mathrm{fused}} = \bigl[\, h^{(1)}_{\mathrm{head}} \,\|\, \dots \,\|\, h^{(M)}_{\mathrm{head}} \,\bigr] W_{\mathrm{out}} + b_{\mathrm{out}}.$$
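The fusion above can be sketched end to end in NumPy; all weight matrices are randomly initialized stand-ins for learned parameters, and the layer/head counts are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

L, d, M = 12, 64, 4          # layers, embedding dim, heads (illustrative)
d_h = d // M                 # per-head dimensionality
H_C = rng.standard_normal((2 * L, d))   # stacked CLS/AP summary tokens

# Learnable parameters (random stand-ins): task query and per-head projections.
Q = rng.standard_normal((1, d))
W_q = rng.standard_normal((M, d, d_h)) / np.sqrt(d)
W_k = rng.standard_normal((M, d, d_h)) / np.sqrt(d)
W_v = rng.standard_normal((M, d, d_h)) / np.sqrt(d)
W_out = rng.standard_normal((M * d_h, d)) / np.sqrt(d)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

heads = []
for m in range(M):
    q = Q @ W_q[m]                            # (1, d_h) per-head query
    K = H_C @ W_k[m]                          # (2L, d_h) keys over all tokens
    V = H_C @ W_v[m]                          # (2L, d_h) values
    alpha = softmax(q @ K.T / np.sqrt(d_h))   # (1, 2L) soft selection weights
    heads.append(alpha @ V)                   # (1, d_h) weighted sum of values

h_fused = np.concatenate(heads, axis=1) @ W_out   # (1, d) fused representation
print(h_fused.shape)  # (1, 64)
```

The bias term $b_{\mathrm{out}}$ is omitted here for brevity; each head's $\alpha^{(m)}$ sums to one across the $2L$ summary tokens, which is what makes the weights interpretable as a soft layer/token selection.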

For LLM activation tensor probing, the flattened patches with positional encoding are projected to embeddings and processed via ViT blocks, with final decision relying on the transformer’s [CLS] token. This general fusion mechanism allows the model to attend selectively over both depth (layers) and spatial/sequence locations (tokens), adapting dynamically to the task signal location.

3. Training Protocols and Optimization

The foundation-model backbone remains frozen throughout training; only the probe head (attention and fusion layers) and classifier are updated. For vision tasks, a weighted cross-entropy loss counteracts class imbalance, with class weights $w_i = N / (K n_i)$ for $K$ classes and $n_i$ samples per class:
$$\mathcal{L}(\hat y, y) = -\sum_{i=1}^{K} w_i\, y_i \log \hat y_i.$$
Training uses the AdamW optimizer with cosine-annealing schedules, batch sizes up to 2,048 (vision) or 128 (LLM), attention dropout, and regularization via Gaussian jittering and gradient-norm clipping (Ciernik et al., 14 Jan 2026, Bar-Shalom et al., 30 Sep 2025).
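A small sketch of the class-weighted loss, using made-up class counts to show how $w_i = N/(K n_i)$ upweights rare classes:

```python
import numpy as np

# Class counts for an imbalanced 3-class problem (illustrative numbers).
counts = np.array([700, 200, 100])
N, K = counts.sum(), len(counts)
w = N / (K * counts)                 # w_i = N / (K * n_i), as in the text

def weighted_cross_entropy(probs, y, w):
    """Weighted CE: -sum_i w_i y_i log p_i for one-hot labels y, averaged."""
    return -(w * y * np.log(probs)).sum(axis=-1).mean()

probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.6, 0.2]])
y = np.array([[1, 0, 0],
              [0, 0, 1]])            # second sample is the rare class 2
loss = weighted_cross_entropy(probs, y, w)
print(round(float(loss), 4))  # 2.7355 -- the rare-class error dominates
```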

For cross-LLM scenarios, adaptation to novel models/tasks is achieved by fine-tuning only the Linear Adapter per model, holding the shared ViT probe backbone fixed. This facilitates transfer in both few-shot and zero-shot settings and supports multi-LLM training without extensive retraining.

4. Empirical Findings and Quantitative Evaluation

In vision classification, ACT-ViT achieves a mean absolute gain of +5.54 pp balanced accuracy over last-layer linear probes across 20 datasets and 9 model variants, with maximal improvements (>30 pp) on domain-shift and fine-grained classification tasks. Median and mean rank metrics consistently favor ACT-ViT, and statistical tests show a significant benefit over multi-layer linear probes ($p \leq 0.013$, Wilcoxon, FDR-corrected). Training is notably efficient, roughly 36× faster than full backbone fine-tuning (Ciernik et al., 14 Jan 2026).

In LLM hallucination detection, ACT-ViT with multi-LLM joint training surpasses token-level linear probes (Probe[*]) and MLP baselines by an average of 2.7 ROC-AUC points and robustly achieves superior zero-shot performance. For instance, on Mis-7B·IMDB, ACT-ViT attains a +37 pt zero-shot gain, and few-shot Linear Adapter fine-tuning can surpass full-data probes using only 10% of the data. Efficiency benchmarks show per-instance inference times in the microseconds and a model memory footprint of roughly 0.5–0.8M parameters (Bar-Shalom et al., 30 Sep 2025).

5. Qualitative Behavior, Layerwise Analysis, and Probe Sensitivities

Attention heatmaps averaged across samples reveal distinct patterns: early-layer CLS tokens are consistently ignored, while intermediate-layer AP tokens receive significant weight, especially for tasks differing from the pre-training domain. For domain-matched tasks, last-layer CLS+AP tokens dominate attention. Layerwise representational similarity analysis (CKA) shows that intermediate layers can rival final-layer probes, indicating the presence of complementary information (Ciernik et al., 14 Jan 2026).
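Linear CKA, one common instantiation of such representational similarity analysis, can be computed as follows (a generic sketch on random data, not the paper's exact protocol):

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two representation matrices (n samples x dims)."""
    X = X - X.mean(0)                              # center per feature
    Y = Y - Y.mean(0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den                               # in [0, 1]

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 64))   # e.g. features from one layer
Z = rng.standard_normal((100, 64))   # an unrelated representation

print(round(linear_cka(X, X), 6))        # 1.0: identical representations
print(round(linear_cka(X, 2 * X + 5), 6))  # 1.0: invariant to scale and shift
print(round(linear_cka(X, Z), 3))          # low: unrelated representations
```

The scale/shift invariance is what makes CKA suitable for comparing layers whose activations differ in magnitude.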

In LLMs, full-tensor attention enables dynamic localization of hallucination signals within the activation grid, eliminating the need for static probe design and manual location search (Bar-Shalom et al., 30 Sep 2025). Pooling granularity $(L_p, N_p)$ controls the trade-off between computational cost and accuracy, with diminishing returns for extreme grid refinement.

Key ablation findings for vision tasks are summarized as follows:

| Ablation Aspect | Observed Effect | Quantitative Note |
|---|---|---|
| # of layers fused | Median gain increases monotonically with more layers | $p \leq 0.04$ (FDR-corrected); all-layers is optimal |
| # of attention heads | Best with $M = 2L$ (one per token type); degrades for $M \ll 2L$ | — |
| Token selection | CLS+AP fusion best; AP-only variably effective | AP-only useful for spatial tasks, detrimental in fine-grained |
| Probe variants | AIM, V-JEPA, Efficient Probe yield similar gains | Suggests main benefit from fusion, not head design |
| MAE backbone | AAT outperforms ACT-ViT due to missing CLS; ACT-ViT still strong | Depth fusion improves even with weak summary tokens |

6. Transferrability, Scalability, and Practical Implications

ACT-ViT is designed for maximal transferability and rapid adaptation. Linear Adapters permit alignment of hidden-state dimensionalities across models, supporting joint and zero-shot learning for novel architectures. Deployment is efficient: real-time inference (<10 μs/sample), minimal footprint, rapid adapter fine-tuning (<3 hours for 15 LLMs/datasets on a single GPU). The holistic activation tensor view allows ACT-ViT to outperform or match static probes in new domains and LLMs, with robust performance under scarce training data (Bar-Shalom et al., 30 Sep 2025).

A plausible implication is that the fusion of hierarchical representations via attention furnishes both accuracy and generality for model adaptation problems where information structure and localization vary across tasks and domains.

ACT-ViT builds on and surpasses standard linear probes, token-wise logistic regression, LOS-Net, and prior attention-based single-layer probes. The approach is compatible with diverse probe architectures (AIM, V-JEPA) and can recover similar gains by focusing on multi-layer fusion rather than a specific attention mechanism.

In vision applications, the fusion scheme enables adaptation of frozen foundation models to tasks outside their pretraining domain and fine-grained challenges, demonstrating strong empirical and computational superiority over last-layer and static aggregation paradigms (Ciernik et al., 14 Jan 2026).

In LLM settings, ACT-ViT's methodology introduces an inductive bias aligning activation tensor structure with transformer self-attention machinery, facilitating scalable, cross-model hallucination detection and practical deployment capabilities (Bar-Shalom et al., 30 Sep 2025).
