On-Policy Attention Distillation
- The paper introduces On-Policy Attention Distillation (OPAD), a framework that aligns student and teacher attention maps during self-generated rollouts.
- It leverages dense, trajectory-dependent supervision to mitigate exposure bias by guiding internal attention allocation across multimodal inputs.
- Empirical results demonstrate significant accuracy gains on image and video VQA tasks, highlighting OPAD’s scalability with larger input sequences.
On-Policy Attention Distillation (OPAD) is a principled framework for transferring inductive biases related to “where to attend” within deep neural policies, particularly in high-dimensional, sequential, or multimodal input spaces. Unlike token-level or action-level distillation, which typically match only output sequences or marginal policy statistics, OPAD focuses on aligning internal attention distributions under the student’s own rollouts, yielding denser and more behaviorally grounded supervision relevant for complex reasoning and perceptual alignment tasks.
1. Motivation and Conceptual Distinction
Traditional knowledge distillation (KD) for language and vision models supervises the student to match the teacher’s output logits (e.g., via Kullback–Leibler (KL) divergence on predicted token or action distributions). This offers a sparse, end-point signal, providing no direct guidance on how the model should allocate its internal computation across multi-modal contexts, such as determining evidence regions in a visual input conditioned on a question.
OPAD treats the attention scores within Transformer architectures as a latent policy, optimizing the student to mimic not just “what” the teacher predicts but “where” the teacher focuses its attention. The student is trained on trajectories produced under its own policy (on-policy), and its attention maps at each step are forced to match a fixed, generally stronger teacher model’s attention maps computed on the same partial context. This design directly addresses exposure bias (resulting from distributional shifts in classical KD) and allows for dense, trajectory-dependent supervision at the sub-symbolic, evidence-gathering level (Li et al., 4 Feb 2026).
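For contrast with the attention-level objective, the classical output-level KD signal described above can be sketched as follows (a minimal illustration assuming plain NumPy; the temperature parameter `tau` is an illustrative addition, not from the paper):

```python
# Classical token-level KD: KL(teacher || student) on output distributions.
# One scalar per position, with no signal about internal attention allocation.
import numpy as np

def softmax(logits, tau=1.0):
    z = np.exp((logits - logits.max()) / tau)  # shift for numerical stability
    return z / z.sum()

def token_kd_loss(student_logits, teacher_logits, tau=1.0, eps=1e-12):
    """Sparse, end-point supervision: match predicted token distributions."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

This loss is zero only when the output distributions coincide, regardless of how differently the two models allocated attention internally to reach them.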
2. Mathematical Formulation
Let θ denote the student parameters and φ the fixed teacher parameters. Given a generated trajectory (rollout) y = (y_1, …, y_T), where the state at step t contains the current context (e.g., image patches and prior tokens y_{<t}), for each generation timestep t > P (with P the prompt length), OPAD aligns the student's and teacher's pre-softmax attention logits a_t^θ and a_t^φ over the previous positions j < t.
Define the attention distributions p_t = softmax(a_t^θ) and q_t = softmax(a_t^φ). The core distillation loss computes the Jensen–Shannon divergence JSD(p_t ∥ q_t) = ½ KL(p_t ∥ m_t) + ½ KL(q_t ∥ m_t), with mixture m_t = ½ (p_t + q_t). Averaged over trajectories sampled from the student's own policy, the mean attention distillation loss is
L_attn(θ) = E_{y ∼ π_θ} [ (1 / (T − P)) Σ_{t=P+1}^{T} JSD(p_t ∥ q_t) ].
Typically, this term is combined with on-policy token-level knowledge distillation and (optionally) a reinforcement learning surrogate, yielding the total training loss
L_total = L_RL + λ_KD · L_KD + λ_attn · L_attn,
where λ_KD and λ_attn are tunable weights (Li et al., 4 Feb 2026).
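The attention-alignment term can be written out directly from these definitions. The sketch below assumes plain NumPy, a single attention head, and one logit vector per generation timestep; the paper's actual tensor layout (heads, layers, batching) may differ.

```python
# JSD-based attention distillation loss: mean JSD between student and
# teacher attention distributions over all generation timesteps.
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())  # shift for numerical stability
    return z / z.sum()

def kl(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def opad_attention_loss(student_logits, teacher_logits):
    """Mean per-token JSD; row t holds pre-softmax attention scores
    over the positions preceding generation step t."""
    losses = [jsd(softmax(s), softmax(t))
              for s, t in zip(student_logits, teacher_logits)]
    return sum(losses) / len(losses)
```

Using JSD rather than a one-sided KL keeps the objective symmetric and finite even where one distribution assigns near-zero mass, which matters when student and teacher attend to disjoint regions early in training.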
3. Algorithmic Implementation
A typical training iteration for OPAD consists of:
- Sampling a batch of prompts from the dataset.
- Rolling out trajectories per prompt under the current student policy to generate sequences and gather states.
- For each trajectory and timestep:
  - Recording token-level RL and knowledge distillation losses.
  - For each token beyond the prompt, extracting both the student and teacher attention logits on the identical partial trajectory.
  - Computing softmax attention distributions and the per-token JSD.
- Summing the total loss over each rollout and taking a gradient step with respect to the student parameters.
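The per-iteration loss described by the steps above can be sketched as a single function. The interfaces here are assumptions for illustration: `student_attn(traj, t)` and `teacher_attn(traj, t)` stand in for model hooks returning pre-softmax attention logits over positions before `t` on the same partial trajectory, and `token_kd_loss` / `rl_loss` stand in for the token-level objectives; none of these names come from the paper's released code.

```python
# Skeleton of one OPAD training iteration's scalar loss:
# L_total = L_RL + lam_kd * L_KD + lam_attn * L_attn.
import numpy as np

def _softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def _jsd(p, q, eps=1e-12):
    kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def opad_iteration_loss(trajectories, prompt_len,
                        student_attn, teacher_attn,
                        token_kd_loss, rl_loss,
                        lam_kd=1.0, lam_attn=1.0):
    attn_terms = []
    for traj in trajectories:
        # Attention alignment only for tokens generated beyond the prompt,
        # on the student's own (on-policy) partial trajectory.
        for t in range(prompt_len, len(traj)):
            p = _softmax(student_attn(traj, t))
            q = _softmax(teacher_attn(traj, t))
            attn_terms.append(_jsd(p, q))
    l_attn = sum(attn_terms) / max(len(attn_terms), 1)
    return (rl_loss(trajectories)
            + lam_kd * token_kd_loss(trajectories)
            + lam_attn * l_attn)
```

In a real setup the returned scalar would be a differentiable tensor (e.g., in PyTorch), with the gradient step taken only through the student's parameters while the teacher stays frozen.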
This procedure yields supervision at every generation step on the student’s actual (on-policy) action sequence, rather than on teacher-forced (off-policy) traces, mitigating compounding error from exposure bias and making the signal highly dense and temporally aligned.
4. Empirical Performance and Comparative Analysis
On 8 image and 7 video visual question answering (VQA) datasets, OPAD demonstrates consistent and additive gains over both RL-only post-training and token-level on-policy distillation. For a 7B student model distilled from a 32B teacher:
- In image VQA, RL improves baseline by +1.7, token-level KD adds +1.4, and OPAD yields an additional +2.6 (total +5.7 absolute accuracy improvement).
- On the V* dataset: Baseline (70.7), GRPO RL (68.6), KD (70.9), OPAD (72.3).
- In video VQA, analogous increments (+0.8 for RL, +1.2 for KD, +2.4 for OPAD).
- On NExTQA: Baseline (73.7), RL (70.7), KD (75.3), OPAD (79.7).
These results indicate that aligning student attention with teacher guidance enables superior grounding—especially in extensive or ambiguous visual contexts where output imitation alone is insufficient. Gains scale with the visual sequence length (e.g., more image tokens or video frames), suggesting robustness to complex perceptual scenarios (Li et al., 4 Feb 2026).
5. Ablations and Hyperparameter Analysis
Comprehensive sweeps were performed:
- Attention-KD weight (λ_attn) in the total loss: performance is most robust across moderate values, with consistent drops when the weight is set too low.
- Attention-RL coefficient: the sweep identified a clear best-performing setting.
- Scaling context (image tokens from 512 to 2048, video frames from 32 to 128): Larger gains for larger contexts, reflecting the scalability of attention-level supervision.
- “Zero” variant: even without generating rationales (i.e., attending only), attention-focused RL and distillation outperform vanilla RL on the majority of tested video and image tasks. This suggests that internal attention policies encode substantial cross-modal grounding even absent explicit reasoning tokens (Li et al., 4 Feb 2026).
6. Relation to Broader On-Policy Distillation and Attention Mechanisms
Recent literature explores alternative on-policy distillation paradigms with attention components, but frequently in non-multimodal or non-Transformer settings. In mutual distillation frameworks such as Online Policy Distillation with Decision-Attention (OPD-DA), attention is used to weight peer policies' contributions to distilled targets rather than to directly match Transformer attention maps (Yu et al., 2024). In block-wise video diffusion settings, on-policy distillation is implemented via distribution-matching losses with at most implicit attention transfer: LiveTalk's improved on-policy distillation, for example, handles attention through architectural adaptation and teacher score alignment rather than a separate attention-map loss (Chern et al., 29 Dec 2025).
A plausible implication is that direct attention distribution alignment—unique to OPAD—yields especially significant benefits when model evidence selection is itself a key determinant of downstream performance, as in multimodal reasoning or perception-heavy tasks.
7. Applications and Significance
OPAD is primarily used in post-training for Multimodal LLMs (MLLMs) to improve cross-modal grounding, visual reasoning, and perception, with state-of-the-art results on image and video VQA. Its principle—treating attention as a learnable policy and supervising alignment at every on-policy step—extends naturally to any sequential model where internal evidence gathering substantially determines utility.
The key significance lies in providing a route to bypass the bottleneck of sparse, token-level supervision, especially under challenging input distributions, by leveraging trajectories under the student’s own policy and enforcing dense, exposure-bias–resistant attention alignment. This positions OPAD as a general toolkit for grounded, robust distillation in deep sequence models (Li et al., 4 Feb 2026).