- The paper demonstrates that standard ERM models can outperform specialized DG methods due to benchmark misspecification.
- It defines well-specified DG benchmarks and establishes conditions where spurious correlations harm out-of-distribution accuracy.
- Empirical evidence across diverse datasets shows strong ID-OOD accuracy correlations, urging a reevaluation of DG evaluation practices.
This paper investigates a counterintuitive observation in domain generalization (DG) benchmarking: standard empirical risk minimization (ERM) models often perform as well as or better than specialized DG algorithms designed to mitigate spurious correlations. The authors propose that this phenomenon, along with the frequently observed "accuracy on the line" (a strong positive correlation between in-distribution (ID) and out-of-distribution (OOD) accuracy across different models), stems from the misspecification of many popular DG benchmarks.
The core argument is that these benchmarks fail to incorporate distribution shifts in spurious correlations that would meaningfully degrade the performance of models relying on them. Consequently, they are unsuitable for evaluating the benefits of explicitly removing such correlations.
The paper formally defines domain-general features ($Z_c$), whose relationship with the label ($Y$) is stable across domains, and spurious features ($Z_e$), whose relationship with $Y$ can change. It distinguishes between an optimal domain-general predictor ($f_c^\mathcal{E}$), which uses only $Z_c$ and is optimal over a set of distributions $\mathcal{E}$, and an optimal domain-specific predictor ($f_X^P$), which uses all features and is optimal for a specific distribution $P$. Under the assumptions that both feature types are informative and non-redundant (Assumptions 3.3 and 3.4), the optimal domain-specific predictor for a given training distribution $P_{ID}$ ($f_X^{P_{ID}}$) leverages spurious correlations and achieves lower ID loss than any predictor that uses only domain-general features.
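The contrast between the two predictors can be illustrated with a toy linear-Gaussian simulation (not from the paper; the feature alignments and weights below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Toy model: labels y in {-1, +1}; Z_c is domain-general (stable alignment
# 1.0 with y), Z_e is spurious with a hypothetical ID alignment of 1.5.
y = rng.integers(0, 2, n) * 2 - 1
z_c = 1.0 * y + rng.normal(0, 1, n)   # domain-general feature
z_e = 1.5 * y + rng.normal(0, 1, n)   # spurious feature (ID alignment)

# Linear scores standing in for the two predictors:
score_specific = 1.0 * z_c + 1.5 * z_e   # f_X^{P_ID}: uses both features
score_general = 1.0 * z_c                # f_c^E: uses only Z_c

acc_specific = np.mean(np.sign(score_specific) == y)
acc_general = np.mean(np.sign(score_general) == y)
print(f"ID accuracy: domain-specific={acc_specific:.3f}, "
      f"domain-general={acc_general:.3f}")
```

Because $Z_e$ is informative in-distribution, the domain-specific predictor attains strictly higher ID accuracy, matching the consequence of Assumptions 3.3 and 3.4.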
A key contribution is the definition of a well-specified domain generalization benchmark. An ID/OOD split ($P_{ID}, P_{OOD} \in \mathcal{E}$) is well-specified if the optimal domain-specific predictor trained on $P_{ID}$ performs worse on $P_{OOD}$ than the optimal domain-general predictor: $\text{acc}_{P_{OOD}}(f_X^{P_{ID}}) < \text{acc}_{P_{OOD}}(f_c^\mathcal{E})$.
The authors derive sufficient conditions for a DG split to be well-specified (Theorem 3.5). These conditions concern the nature of the shift in the spurious correlation between $P_{ID}$ and $P_{OOD}$: they require a sufficient "misalignment" or "reversal" of the correlation, such that the contribution of the spurious features to the $P_{ID}$-optimal predictor yields incorrect predictions on $P_{OOD}$ with high probability. For instance, if $Z_e$ is sub-Gaussian, a condition involving the mean and variance of $Z_e$ before and after the shift is provided.
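A hedged sketch of this well-specifiedness test: sweeping a hypothetical OOD spurious alignment `rho_e` shows the split becoming well-specified once the correlation reverses enough. The alignments and weights are illustrative stand-ins, not the paper's sub-Gaussian constants:

```python
import numpy as np

rng = np.random.default_rng(1)

def accuracy(w, rho_e, n=50_000):
    """Accuracy of a linear predictor w = [w_c, w_e] in a domain where
    the spurious feature Z_e has alignment rho_e with the label."""
    y = rng.integers(0, 2, n) * 2 - 1
    z_c = y + rng.normal(0, 1, n)          # domain-general feature
    z_e = rho_e * y + rng.normal(0, 1, n)  # spurious feature
    return np.mean(np.sign(w[0] * z_c + w[1] * z_e) == y)

w_specific = [1.0, 1.5]   # ID-optimal: relies on the ID alignment rho_e = 1.5
w_general = [1.0, 0.0]    # domain-general: ignores Z_e

# The split is well-specified once the domain-specific predictor drops
# below the domain-general one on the OOD domain.
for rho in [1.5, 0.0, -0.5, -1.5]:
    spec, gen = accuracy(w_specific, rho), accuracy(w_general, rho)
    print(f"rho_e={rho:+.1f}  specific={spec:.3f}  general={gen:.3f}  "
          f"well-specified={spec < gen}")
```

At the ID alignment the split is trivially not well-specified; as `rho_e` moves away from its ID value, the spurious term increasingly misleads the domain-specific predictor, mirroring the "misalignment" condition of Theorem 3.5.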
Crucially, the paper links well-specified benchmarks to the "accuracy on the line" phenomenon. Theorem 3.8 proves that well-specification and strong positive accuracy on the line ($\epsilon \approx 0$ in Definition 3.7, where $\epsilon$ measures the tightness of the correlation) are fundamentally at odds: the set of shifts that are both well-specified and exhibit perfect accuracy on the line has Lebesgue measure zero, and the measure of well-specified shifts shrinks as the accuracy-on-the-line correlation tightens. This theoretical result implies that accuracy on the line serves as a test for benchmark misspecification: if a benchmark shows strong positive accuracy on the line, it likely does not effectively assess robustness to spurious correlations that should hurt OOD performance.
The paper empirically evaluates the "accuracy on the line" property across a variety of state-of-the-art DG benchmarks, including datasets from DomainBed (PACS, TerraIncognita, ColoredMNIST), WILDS (Camelyon17, FMoW, CivilComments, Waterbirds), Spawrious, and Covid-CXR. They generate a diverse set of models for each benchmark and measure the Pearson correlation between ID and OOD accuracy on probit-transformed scales.
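The correlation measurement itself is straightforward to sketch. The accuracy pairs below are made-up placeholders, and the probit transform uses the Python stdlib inverse normal CDF rather than any particular library the authors used:

```python
import numpy as np
from statistics import NormalDist

# Probit transform: inverse Gaussian CDF applied elementwise.
probit = np.vectorize(NormalDist().inv_cdf)

# Hypothetical (ID, OOD) accuracies for a pool of models on one benchmark.
id_acc = np.array([0.85, 0.88, 0.90, 0.92, 0.95])
ood_acc = np.array([0.60, 0.64, 0.67, 0.70, 0.75])

# Pearson correlation on probit-transformed scales, as in the
# "accuracy on the line" evaluation.
r = np.corrcoef(probit(id_acc), probit(ood_acc))[0, 1]
print(f"probit-scale Pearson r = {r:.3f}")
```

An `r` near +1 on a benchmark would, by the paper's argument, flag likely misspecification; a weak or negative `r` indicates a split better suited to testing robustness to spurious correlations.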
The empirical findings show that:
- Many popular benchmarks (e.g., PACS, TerraIncognita, WILDS FMoW, most Spawrious settings, and the average-accuracy evaluations of Waterbirds and CivilComments) exhibit strong positive accuracy on the line, suggesting they are misspecified for evaluating robustness to spurious correlations.
- Semi-synthetic benchmarks designed to manipulate spurious correlations (ColoredMNIST, Spawrious Env 0, Waterbirds evaluated on worst-group accuracy) and some natural datasets (certain splits of Covid-CXR and CivilComments) show weaker, zero, or even negative correlations ("accuracy on the inverse line"). These benchmarks are identified as being better suited for the DG task of removing spurious correlations, as they demonstrate that improving ID performance does not necessarily lead to improved OOD performance, or can even hurt it.
The authors discuss several practical implications:
- Benchmark users should be aware of the intended scope of datasets (e.g., natural vs. worst-case shifts).
- Benchmarking evaluations should prioritize datasets and splits exhibiting weak or negative accuracy on the line.
- Future benchmark design should aim to avoid strong positive accuracy on the line, for example through explicit manipulation of spurious correlations (semi-synthetic construction) or by identifying natural experiments.
- Benchmarking practices like averaging results across multiple splits or datasets may be misleading if some are misspecified. Evaluating worst-group accuracy aligns better with assessing spurious correlation robustness.
- The findings have implications for benchmarking related tasks like Causal Representation Learning and Algorithmic Fairness, where distinguishing stable from spurious associations is also critical.
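To make the worst-group point concrete, here is a minimal sketch; the group names and toy predictions are invented for illustration and only mimic the Waterbirds-style label-by-background grouping:

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, groups):
    """Minimum per-group accuracy, with groups defined e.g. by
    (label, spurious attribute) combinations."""
    accs = {}
    for g in np.unique(groups):
        mask = groups == g
        accs[g] = np.mean(y_pred[mask] == y_true[mask])
    return min(accs.values()), accs

# Hypothetical predictions: a high average accuracy can hide groups where
# the spurious cue misleads the model (e.g. waterbirds on land backgrounds).
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1])
groups = np.array(["wb_water"] * 4 + ["wb_land"]
                  + ["lb_land"] * 4 + ["lb_water"])

wga, per_group = worst_group_accuracy(y_true, y_pred, groups)
print(f"average acc = {np.mean(y_pred == y_true):.2f}, worst-group = {wga:.2f}")
```

Here the average accuracy looks healthy while the minority groups fail entirely, which is why the authors argue worst-group evaluation aligns better with assessing spurious correlation robustness.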
In summary, the paper provides theoretical grounding and empirical evidence that many existing DG benchmarks may be misspecified, leading to misleading conclusions about the effectiveness of DG algorithms. It proposes accuracy on the line as a practical test for benchmark validity and identifies specific datasets and evaluation strategies that are better aligned with the goal of developing models truly robust to spurious correlations.