Perception Gap in Distillation
- Perception gap in distillation is defined as the measurable mismatch in information accessibility and representational fidelity between teacher and student models.
- The gap arises from structural, capacity, and modality mismatches, often quantified using metrics like MSE and KL divergence, impacting both single- and cross-modal settings.
- Effective strategies to bridge the gap include selective distillation, feature denoising, entropy correction, and progressive transfer to maintain high performance.
A perception gap in distillation denotes measurable mismatches in information accessibility, representational fidelity, or decision capability between models—and sometimes between models and humans—when transferring knowledge from a teacher network to a student. This phenomenon, which appears both in single- and cross-modal settings, is central to the study of knowledge distillation, as it limits the student’s ability to replicate teacher performance. Theoretical, operational, and empirical characterizations of the perception gap reveal key bottlenecks, often linked to representation capacity, noise, or modality-aligned priors. The closing—or effective reduction—of this gap is a principal aim in contemporary distillation research across classification, detection, generative modeling, and cross-modal reasoning.
1. Conceptualization and Measurement of the Perception Gap
The perception gap is typically quantified as the difference in task-relevant accuracy, representation similarity, or perceptual quality between a teacher and student model. In feature-based distillation, this is formalized as

$$\mathcal{L}_{\text{feat}} = \frac{1}{N} \sum_{i=1}^{N} \left\| F^{T}_{i} - F^{S}_{i} \right\|_2^2,$$

where $F^{T}$ and $F^{S}$ are teacher and student feature maps. For probabilistic outputs, a corresponding Kullback–Leibler divergence $D_{\text{KL}}\left(p^{T} \,\|\, p^{S}\right)$ applies.
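A minimal numerical sketch of these two gap measures (NumPy, illustrative only; the temperature parameter `T` and tensor shapes are assumptions):

```python
import numpy as np

def feature_gap_mse(f_teacher, f_student):
    """Mean-squared error between teacher and student feature maps."""
    return float(np.mean((f_teacher - f_student) ** 2))

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_gap(teacher_logits, student_logits, T=1.0):
    """KL(p_T || p_S) between temperature-softened output distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)), axis=-1).mean())

# Identical models show zero perception gap under both measures.
f = np.random.default_rng(0).normal(size=(4, 16))
assert feature_gap_mse(f, f) == 0.0
assert abs(kl_gap(f, f)) < 1e-12
```

Either measure grows as the student's features or output distribution drift from the teacher's, which is what makes them usable as distillation objectives.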
Beyond feature alignment, domain-specific manifestations are salient:
- Modality gap: In cross-modal LLMs, the performance difference between, e.g., audio-only and visual-only models on object recognition is substantial: Qwen2-VL (visual) achieves 88.5% versus Qwen2-Audio (audio) at 71.9% on a 286-class visible sound benchmark, a 16.6 percentage point gap, significantly greater than the human ear/eye gap (2.2 pp) (Jiang et al., 11 May 2025).
- Multi-view fusion: In 3D detection, moving from multi-view fusion (AP=0.836) to single-view (AP=0.573) with the same backbone reflects a gap (ΔAP=0.263) in geometric object understanding (He et al., 2023).
- Viewpoint or domain shift: Ground-to-aerial observation models experience a pronounced loss in mean IoU or accuracy if directly transferred without progressive adaptation (Hu et al., 2022).
- Visual perception in super-resolution: Faster, distilled one-step SR models have traditionally shown a gap in perceptual metrics (e.g., CLIPIQA, MUSIQ) compared to multi-step diffusion teachers, reducible with carefully designed perception-oriented losses (Wu et al., 3 Jun 2025).
In summary, the perception gap may span the spectrum from logit- and feature-space misalignment to more abstract cross-modal or perceptual falloff, necessitating targeted, context-sensitive remedies.
2. Mechanisms and Causes: Information, Capacity, and Modality
The primary causes of the perception gap derive from structural, capacity, and informational mismatches:
- Channel capacity and representational mismatch: In vision transformers (ViTs), the large teacher encodes information across a distributed, high-frequency channel basis at late layers, but a student with a much smaller channel count can only encode low-frequency mixtures. These "U-shaped" information dynamics (a compression phase followed by an expansion phase) produce a perception gap when late-layer features are naively aligned (Tian et al., 10 Nov 2025).
- Noise in student features: In CNNs, smaller student models introduce more background variance and less salient activations. DiffKD explicitly models the student's feature as a noisy version of the teacher's ($F^{S} \approx F^{T} + \text{noise}$), arguing for explicit denoising before feature alignment (Huang et al., 2023).
- Entropy disparity: From a decision-space perspective, DynamicKD shows that the entropy of the student's softmax distribution heavily influences the distillation loss. Mismatched output entropy biases the alignment objective, creating a persistent gap that can be corrected dynamically (Zhu et al., 2023).
- Modality-specific priors and alignment bottlenecks: In cross-modal distillation (e.g., visual LLM teacher, audio LLM student), differences in sensory priors and modality-dependent class discriminability ground the perception gap. This is reinforced by classwise discrepancies—visual LLMs outperform audio LLMs in 226 out of 286 classes, echoing analogous (but much smaller) differences in humans (Jiang et al., 11 May 2025).
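The entropy-disparity point above can be illustrated in a few lines (a hedged sketch, not DynamicKD's actual controller; the logits are made up):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def output_entropy(logits):
    """Average Shannon entropy (nats) of the softmax output distribution."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

# Scaling logits down flattens the distribution and raises entropy;
# a student whose output entropy differs from the teacher's biases
# any KL-based alignment objective between the two.
teacher_logits = np.array([[4.0, 1.0, 0.5, 0.2]])
assert output_entropy(0.25 * teacher_logits) > output_entropy(teacher_logits)
```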
Thus, the perception gap is fundamentally linked to the student’s inability to structurally, informationally, or perceptually match the teacher, particularly across architecture, input domain, or modality boundaries.
3. Distillation Methodologies Addressing the Perception Gap
A range of distillation techniques have been developed to specifically bridge the perception gap, typically combining principled loss functions, curriculum strategies, or architectural constraints:
Selective, Capacity-Aware, and Cross-Modal Distillation
- Cross-modal teacher selection with switches: In "Bridging Ears and Eyes," a learned binary switch (PANN classifier) identifies for each example whether visual or audio supervision is more beneficial, triggering distillation only when the teacher outperforms the student. The supervised loss combines cross-entropy, KL divergence with temperature (T=2), and an anti-forgetting self-supervision term. This selective knowledge transfer closes >90% of the audio–visual gap and raises student accuracy from 71.9% to 92.6% (Jiang et al., 11 May 2025).
- Feature selection by layer or frequency: In ViT distillation, only early-to-middle layers (compression phase) are feature-aligned; late-layer alignment is suppressed or filtered in the Fourier domain. Alternatively, teacher spectra are projected into the student’s supported subspace before alignment (Tian et al., 10 Nov 2025).
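The selective-transfer idea can be sketched as follows (illustrative only: the paper's switch is a learned PANN-based classifier, whereas here the per-example decision is passed in as a boolean, and the weighting `lam` is an assumption):

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def selective_kd_loss(student_logits, teacher_logits, label, distill,
                      T=2.0, lam=0.5):
    """Distill only when the switch judges the teacher more reliable.

    Combines cross-entropy with the hard label and a temperature-softened
    KL term that is gated per example by the `distill` switch decision.
    """
    ps = softmax(student_logits)
    ce = -np.log(ps[label] + 1e-12)       # supervised cross-entropy
    if not distill:                       # switch: teacher unreliable here
        return float(ce)
    pt = softmax(teacher_logits, T)       # softened teacher distribution
    qs = softmax(student_logits, T)
    kl = float(np.sum(pt * (np.log(pt + 1e-12) - np.log(qs + 1e-12))))
    return float(ce + lam * (T ** 2) * kl)

s = np.array([2.0, 0.5, -1.0])
t = np.array([3.0, 0.2, -0.7])
# Gating off the KL term reduces the loss to plain supervised training.
assert selective_kd_loss(s, t, 0, True) > selective_kd_loss(s, t, 0, False)
```

The `T ** 2` factor is the standard Hinton-style rescaling that keeps the gradient magnitudes of the softened KL term comparable across temperatures.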
Denoising, Entropy Correction, and Progressive Self-Distillation
- Diffusion-based denoising: DiffKD introduces a lightweight diffusion model trained on teacher features, used to denoise student latent codes prior to their alignment. An adaptive noise-matching module controls the effective starting point for reverse diffusion, and a linear autoencoder reduces computational cost (Huang et al., 2023).
- Entropy correction controller: DynamicKD augments the student’s logits with a learnable scaling (α), dynamically optimizing the student’s output entropy to minimize distillation loss, providing a direct handle on score concentration/softness (Zhu et al., 2023).
- Progressive transfer: In the cross-viewpoint setting, as in "Progressive Self-Distillation for Ground-to-Aerial Perception Knowledge Transfer," models are adapted in small, incremental steps across viewpoint domains, creating accurate pseudo-labels by nearest-neighbor transfer and using MixView augmentation to enforce invariance (Hu et al., 2022).
- Perceptual and semantic losses: Visual Perception Distillation for Super-Resolution (VPD-SR) introduces explicit CLIP-based semantic supervision and high-frequency perception (HFP) loss within a diffusion framework, addressing the deficiency in high-frequency and semantic fidelity of one-step student models relative to their multi-step diffusion teachers (Wu et al., 3 Jun 2025).
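The entropy-correction idea above can be sketched as a one-parameter controller (illustrative: DynamicKD learns the scaling during training, whereas this toy version solves for it by bisection on a fixed target entropy):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def entropy(logits):
    """Shannon entropy (nats) of the softmax distribution."""
    p = softmax(logits)
    return float(-(p * np.log(p + 1e-12)).sum())

def fit_entropy_scale(logits, target_entropy, lo=1e-3, hi=100.0, iters=60):
    """Find a scalar alpha such that entropy(alpha * logits) ~= target.

    Entropy decreases monotonically as alpha grows (the distribution
    sharpens), so bisection suffices for this one-parameter controller.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(mid * logits) > target_entropy:
            lo = mid   # still too flat: sharpen further
        else:
            hi = mid   # too sharp: back off
    return 0.5 * (lo + hi)

logits = np.array([3.0, 1.0, 0.2, -0.5])
alpha = fit_entropy_scale(logits, target_entropy=1.0)
assert abs(entropy(alpha * logits) - 1.0) < 1e-3
```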
The following table summarizes key strategies:
| Method | Gap Closed | Main Mechanism |
|---|---|---|
| Cross-modal switch (Jiang et al., 11 May 2025) | >90% (audio–visual LLM) | Selective teacher assignment; hybrid loss |
| DiffKD (Huang et al., 2023) | 0.2–3.5% (various tasks) | Diffusion denoising before alignment |
| DynamicKD (Zhu et al., 2023) | 0.5–2.6% (classification) | Dynamic entropy correction |
| Progressive SD (Hu et al., 2022) | 16.9–23.8% (viewpoint shift) | Incremental adaptation; MixView |
| VPD-SR (Wu et al., 3 Jun 2025) | 11–15% (perceptual metrics) | Semantic (CLIP) + high-frequency loss |
4. Empirical Results and Benchmark Analyses
Empirical studies across models, modalities, and domains demonstrate the effectiveness and limitations of gap-closing strategies:
- Audio–visual LLMs: Cross-modal distillation (audio ← visual) raises Qwen2-Audio accuracy from 71.9% to 92.6% (analysis set), outperforming the teacher Qwen2-VL (88.5%) and matching multimodal Qwen2.5-Omni (92.6%). The reciprocal direction (visual ← audio) yields gains from 88.5% to 91.4%. Gains are especially pronounced in the most challenging classes (e.g., +40% absolute in the 10 hardest audio classes) (Jiang et al., 11 May 2025).
- ViT distillation: Late-layer feature alignment reduces accuracy (e.g., 76.99% to 76.83% with SpectralKD-Last) due to representational mismatch, while phase-selective methods mitigate negative transfer (Tian et al., 10 Nov 2025).
- Multi-view 3D detection: AHD distillation closes ~60% of the ΔAP gap between multi-view and single-view settings (e.g., from 0.573 to 0.770 AP@IoU=0.7), on par with or exceeding other SOTA methods (He et al., 2023).
- Ground-to-aerial knowledge transfer: Progressive distillation maintains mean IoU nearly constant across maximum height intervals, with a +16.9% to +23.8% gain over baselines, whereas naive transfer yields dramatic performance drop (Hu et al., 2022).
- SR perceptual quality: One-step VPD-SR elevates CLIPIQA from 0.592 (teacher) to 0.683 (student), MUSIQ score from 53.66 to 59.57, outperforming alternative models and ablation baselines in both perceptual and semantic quality (Wu et al., 3 Jun 2025).
- Feature- and entropy-based KD: DynamicKD improves over classic KD and CRD in classification accuracy by up to 2.64 percentage points, confirming that entropy correction effectively reduces alignment errors (Zhu et al., 2023).
5. Practical Recommendations and Open Directions
Comprehensive ablation studies yield actionable guidance for reducing the perception gap:
- Teacher selection and hybridization: Dynamically identifying the optimal supervision source, whether by cross-modal “switches” or classwise heuristics, is critical. Selective distillation outperforms uniform transfer (Jiang et al., 11 May 2025).
- Layer/frequency selection: Align features only where the student can faithfully represent the teacher, whether selecting early transformer layers or low-frequency components (Tian et al., 10 Nov 2025).
- Noise and entropy calibration: Model and calibrate student-side noise or entropy, via diffusion denoising or entropy control, rather than expecting student and teacher outputs to be directly commensurate (Huang et al., 2023, Zhu et al., 2023).
- Curriculum and progressive adaptation: Break large domain shifts into incremental substeps, transferring knowledge progressively to maintain semantic consistency (Hu et al., 2022).
- Perceptual and semantic supervision: Integrate external metrics and supervision (e.g., CLIP, HFP) into the distillation objective to close gaps in perceptual realism and semantic alignment (Wu et al., 3 Jun 2025).
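The layer/frequency-selection recommendation can be made concrete with a toy 1-D example (NumPy sketch; real methods operate on 2-D feature maps and select the cutoff adaptively, so `keep_frac` here is an assumption):

```python
import numpy as np

def lowpass_project(features, keep_frac=0.25):
    """Keep only the lowest-frequency fraction of a 1-D feature signal.

    A stand-in for projecting teacher spectra into the subspace a
    low-capacity student can represent before computing alignment loss.
    """
    spec = np.fft.rfft(features, axis=-1)
    cutoff = max(1, int(keep_frac * spec.shape[-1]))
    spec[..., cutoff:] = 0.0                      # discard high frequencies
    return np.fft.irfft(spec, n=features.shape[-1], axis=-1)

def filtered_alignment_loss(f_teacher, f_student, keep_frac=0.25):
    """MSE between teacher and student restricted to low frequencies."""
    return float(np.mean((lowpass_project(f_teacher, keep_frac)
                          - lowpass_project(f_student, keep_frac)) ** 2))

rng = np.random.default_rng(0)
base = rng.normal(size=(2, 64))
hf_noise = np.sin(np.arange(64) * 2.9)  # near-Nyquist disagreement
# Filtering removes most of the penalty for high-frequency detail
# that the student cannot represent anyway.
assert filtered_alignment_loss(base, base + hf_noise) \
    < 0.1 * float(np.mean(hf_noise ** 2))
```

The unfiltered MSE between `base` and `base + hf_noise` equals the mean power of the noise, so the comparison shows how spectral projection prevents unrepresentable detail from dominating the alignment objective.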
Open questions remain concerning the theoretical analysis of gap closure (no formal convergence proofs), integration with other distillation variants, and extending these strategies to detection, segmentation, and transformer-based pipelines (Tian et al., 10 Nov 2025, Zhu et al., 2023, Huang et al., 2023). More sophisticated per-layer or per-sample controllers, adaptive frequency selection, or cross-domain architectures may further narrow the gap.
6. Significance Across Modalities and Research Frontiers
The perception gap is not limited to a particular architecture or modality but recurs as a central bottleneck in neural compression, domain adaptation, distributed perception, and generative modeling. Its characterization and the growing canon of remedies—ranging from principled frequency filtering in ViTs, to cross-modal switch-guided distillation for LLMs, to diffusion-based explicit denoising for CNNs—offer generalized templates. A plausible implication is that perception gap frameworks will increasingly underpin distillation strategies not just for teacher–student pairs within a modality or architecture, but for heterogeneous, cross-modality, and cross-viewpoint transfer scenarios.
As neural models become more multimodal and knowledge reuse becomes a dominant concern, scientifically principled approaches to quantifying and closing the perception gap will remain at the forefront of efficiency- and capability-driven model design.