Practical equivalence of hidden interpolation to true out-of-distribution generalization

Ascertain the extent to which performance gains achieved via hidden interpolation across a model's training corpus (i.e., solving evaluation tasks by interpolating among training examples that implicitly cover the test distribution) are practically equivalent to genuine out-of-distribution (OOD) generalization in large language models.

Background

The authors discuss a perspective that prioritizes bringing tasks in-distribution rather than solely targeting out-of-distribution generalization. In this context, they raise a specific uncertainty: whether even perfect interpolation across an expanded training corpus would yield practical performance comparable to genuine OOD generalization.

This uncertainty matters for evaluating AI progress: if hidden interpolation can fully substitute for OOD generalization in practice, benchmark gains might still translate into real-world capability. If not, reported improvements could overstate generalization beyond the training distribution. Clarifying this relationship would help calibrate interpretations of benchmark results and model capabilities.
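The interpolation-versus-OOD distinction can be made concrete with a toy regression sketch (an illustrative analogy, not from the paper): a flexible model fit on a bounded training support tends to perform well on held-out points inside that support (interpolation) and poorly on points outside it (extrapolation). The function, polynomial degree, and test points below are arbitrary choices for illustration.

```python
# Toy sketch: interpolation within the training support vs. extrapolation
# beyond it, as an analogy for "hidden interpolation" vs. true OOD
# generalization. All specifics here are illustrative assumptions.
import numpy as np

# Training data densely covers the support [0, 2*pi].
x_train = np.linspace(0.0, 2 * np.pi, 200)
y_train = np.sin(x_train)

# Fit a degree-7 polynomial: a pure interpolator over the training support.
coeffs = np.polyfit(x_train, y_train, deg=7)
model = np.poly1d(coeffs)

# "In-distribution" test points lie inside the training support;
# "OOD" test points lie outside it.
x_in = np.array([1.0, 3.0, 5.0])    # inside [0, 2*pi]
x_ood = np.array([8.0, 9.0, 10.0])  # outside [0, 2*pi]

err_in = np.max(np.abs(model(x_in) - np.sin(x_in)))
err_ood = np.max(np.abs(model(x_ood) - np.sin(x_ood)))

print(f"max error inside training support:  {err_in:.4f}")
print(f"max error outside training support: {err_ood:.4f}")
```

The in-support error stays small while the out-of-support error blows up, even though the model never "generalized" in any deep sense inside the support; it interpolated. The open question in the task statement is whether, for LLMs, sufficiently broad coverage makes that distinction practically irrelevant.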

References

This is a valid perspective, but 1) then the deviation from the assumptions of empirical risk minimization should be explicitly noted, 2) it's unclear to what extent even perfect hidden interpolation would be practically equivalent to true OOD generalization.

Soft Contamination Means Benchmarks Test Shallow Generalization (2602.12413 - Spiesberger et al., 12 Feb 2026) in Section: Limitations and Future Work