Capturing complex conditional noise in distillation-based SSL without auxiliary conditioning

Determine how distillation-based self-supervised learning methods, such as BYOL, can capture complex noise structures in the conditional distribution p(x^+ | x), including heteroscedasticity and multimodality, without conditioning on additional information.

Background

The paper notes that distillation-based methods use asymmetric encoders and predictors, which empirically avoid collapse, but their mechanisms are not fully understood.

Specifically, the authors highlight uncertainty in how such predictors could capture complex conditional noise (heteroscedastic or multimodal) between paired data without additional conditioning signals, motivating their introduction of a latent variable to reduce uncertainty.
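To make the setup concrete, here is a minimal numpy sketch of the distillation-based (BYOL-style) objective the question concerns: an online encoder whose output passes through a predictor, regressed onto the embedding of a positive view produced by a target encoder. The linear encoders, the identity predictor, and all shapes are illustrative assumptions, not the paper's architecture; in real BYOL the target branch is an EMA copy under a stop-gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # L2-normalize embeddings along the feature axis
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical linear encoders and predictor, for illustration only
W_online = rng.normal(size=(8, 4))   # online encoder weights
W_target = W_online.copy()           # target encoder (EMA copy in real BYOL)
W_pred = np.eye(4)                   # predictor head (identity here)

x1 = rng.normal(size=(16, 8))                 # augmented view x
x2 = x1 + 0.1 * rng.normal(size=(16, 8))      # positive view x^+ (noisy pair)

z_online = normalize(x1 @ W_online)
z_target = normalize(x2 @ W_target)  # stop-gradient side in real BYOL
p = normalize(z_online @ W_pred)     # predictor output

# BYOL regresses the predictor output onto the target embedding
loss = np.mean(np.sum((p - z_target) ** 2, axis=-1))
```

The predictor here outputs a single point per input, which is the crux of the question: a deterministic map can at best track a conditional mean of the target embedding.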

References

Intuitively, the predictor accounts for cases where E[x^+ | x] ≠ x; but it remains unclear how it can capture complex noise structures in p(x^+ | x)—which may be heteroscedastic or even multimodal—without conditioning on additional information.

Self-Supervised Learning from Structural Invariance  (2602.02381 - Zhang et al., 2 Feb 2026) in Section 2.3 (Preliminaries: distillation-based SSL)
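The difficulty raised in the quoted passage can be seen in a toy example. The bimodal conditional below is my own illustrative construction, not from the paper: given x, the positive x^+ sits at x+1 or x-1 with equal probability, so a deterministic predictor trained with squared error converges to the conditional mean E[x^+ | x] = x, which lies between the two modes and matches neither.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy multimodal conditional: x^+ = x + 1 or x - 1, each with probability 1/2
x = rng.normal(size=1000)
signs = rng.choice([-1.0, 1.0], size=1000)
x_pos = x + signs

# The MSE-optimal deterministic prediction is the conditional mean E[x^+ | x] = x
pred_mean = x
mse_mean = np.mean((pred_mean - x_pos) ** 2)   # distance 1 to either mode

# Committing to one mode (x + 1) is exactly right half the time, off by 2 otherwise
pred_mode = x + 1.0
mse_mode = np.mean((pred_mode - x_pos) ** 2)
```

The conditional mean achieves the lower MSE yet never predicts an actual sample from p(x^+ | x), which is why capturing such structure seems to require extra conditioning information, such as the latent variable the authors introduce.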