Reliability of Dataset-Based OOD Evaluation

Determine how reliably out-of-distribution (OOD) detection performance measured on individual benchmark datasets indicates a model’s general ability to detect OOD examples across the broader space of plausible, untested inputs.

Background

AP-OOD is evaluated on a set of summarization and translation benchmarks and shows strong empirical performance. However, it is not guaranteed that results on specific evaluation datasets generalize to the wide variety of OOD inputs encountered in deployment.

The authors explicitly note uncertainty about the external validity of dataset-based OOD evaluation, highlighting a need to understand the extent to which benchmark outcomes predict overall OOD detection capability across untested distributions.
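The concern can be made concrete with a toy experiment: a single detector, scored on several OOD test sets, can yield very different per-dataset AUROC values depending on how far each set lies from the in-distribution data. The sketch below is illustrative only and is not the AP-OOD method; the benchmark names, score distributions, and sample sizes are all hypothetical assumptions chosen to show how one benchmark's AUROC can overstate (or understate) performance on untested distributions.

```python
import random

def auroc(id_scores, ood_scores):
    """Rank-based AUROC: probability that a random OOD example receives a
    higher anomaly score than a random in-distribution (ID) example.
    Ties are not specially handled (scores here are continuous)."""
    pairs = [(s, 1) for s in ood_scores] + [(s, 0) for s in id_scores]
    pairs.sort(key=lambda p: p[0])
    # Sum of 1-indexed ranks of OOD examples (Mann-Whitney U statistic).
    rank_sum = sum(i + 1 for i, (_, label) in enumerate(pairs) if label == 1)
    n_ood, n_id = len(ood_scores), len(id_scores)
    return (rank_sum - n_ood * (n_ood + 1) / 2) / (n_ood * n_id)

random.seed(0)
# Hypothetical anomaly scores from one fixed detector on ID data.
id_scores = [random.gauss(0.0, 1.0) for _ in range(500)]

# Three hypothetical OOD "benchmarks" overlapping the ID scores to
# different degrees; the detector itself never changes.
benchmarks = {
    "near_ood": [random.gauss(0.5, 1.0) for _ in range(500)],
    "mid_ood":  [random.gauss(1.5, 1.0) for _ in range(500)],
    "far_ood":  [random.gauss(3.0, 1.0) for _ in range(500)],
}

results = {name: auroc(id_scores, ood) for name, ood in benchmarks.items()}
for name, score in results.items():
    print(f"{name}: AUROC = {score:.3f}")
```

The same detector looks near-perfect on the far-OOD benchmark while remaining close to chance on the near-OOD one, so a strong score on any single benchmark gives limited evidence about the untested remainder of the OOD input space.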

References

Second, it remains unclear how reliably the OOD detection performance on specific data sets can indicate the general ability to detect OOD examples, as a large portion of plausible OOD inputs remains untested.

AP-OOD: Attention Pooling for Out-of-Distribution Detection  (2602.06031 - Hofmann et al., 5 Feb 2026) in Section: Limitations and Future Work